Modeling local repeats on genomic sequences

Abstract: This paper deals with the specification and search of repeats of biological interest, i.e. repeats that may play a role in genomic structures or functions. Although some particular repeats, such as tandem repeats, have been well formalized, the models developed so far have limited expressivity with respect to the known forms of repeats in biological sequences. This paper introduces new, general and realistic concepts characterizing potentially useful repeats in a sequence: Locality and several refinements of the Maximality concept. Locality relates to the distribution of occurrences of repeated elements and characterizes the way occurrences are clustered in this distribution. The associated notion of neighborhood makes it possible to indirectly exhibit words whose distribution of occurrences is correlated with a given distribution. Maximality relates to the contextual delimitation of the repeated units. We extend the usual notion of maximality by working on the inclusion relation between repeats and by taking larger contexts into account. In particular, we introduce a new repeat concept, largest maximal repeats, which requires the existence of a subset of maximal occurrences of a repeated word rather than a global maximization. We propose algorithms that check for local and refined maximal repeats, using a suffix tree data structure at the conceptual level. Experiments on natural and artificial data further illustrate various aspects of this new setting. All programs are available on the genouest platform, at
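
For readers unfamiliar with the usual notion of maximality that the refinements above start from, the following minimal sketch may help. It is not the suffix-tree algorithm of this paper and does not implement largest maximal repeats or locality; it is a brute-force illustration, assuming the classical definition in which a repeated word is maximal when its occurrences show at least two distinct left contexts and at least two distinct right contexts. The function name `maximal_repeats` and the parameter `min_len` are hypothetical, introduced only for this example.

```python
from collections import defaultdict


def maximal_repeats(seq, min_len=1):
    """Brute-force search for classical maximal repeats in seq.

    Illustrative only: this enumerates all substrings, whereas a
    suffix tree would be used for sequences of realistic size.
    Returns a dict mapping each maximal repeat to its start positions.
    """
    n = len(seq)

    # Record every start position of every substring of length >= min_len.
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            occurrences[seq[i:j]].append(i)

    result = {}
    for word, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat
        # Characters immediately left and right of each occurrence;
        # None stands for a sequence boundary (assumed to act as a
        # context distinct from any real character).
        lefts = {seq[i - 1] if i > 0 else None for i in starts}
        rights = {
            seq[i + len(word)] if i + len(word) < n else None
            for i in starts
        }
        # Maximal: the word cannot be extended on either side
        # without losing at least one of its occurrences.
        if len(lefts) > 1 and len(rights) > 1:
            result[word] = starts
    return result


if __name__ == "__main__":
    print(maximal_repeats("xabcyabcz"))
```

In this toy sequence, the words "ab" and "bc" are repeated but can always be extended to "abc" without losing an occurrence, so they are not reported as maximal.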
