Modeling local repeats on genomic sequences

Abstract: This paper deals with the specification and search of repeats of biological interest, i.e. repeats that may play a role in genomic structures or functions. Although some particular repeats, such as tandem repeats, have been well formalized, the models developed so far have limited expressivity with respect to the known forms of repeats in biological sequences. This paper introduces new general and realistic concepts that characterize potentially useful repeats in a sequence: Locality, and several refinements of the Maximality concept. Locality is related to the distribution of occurrences of repeated elements and characterizes the way occurrences are clustered in this distribution. The associated notion of neighborhood makes it possible to indirectly exhibit words whose distribution of occurrences is correlated with a given distribution. Maximality is related to the contextual delimitation of the repeated units. We extend the usual notion of maximality by working on the inclusion relation between repeats and by taking larger contexts into account. In particular, we introduce a new repeat concept, largest maximal repeats, which looks for the existence of a subset of maximal occurrences of a repeated word instead of a global maximization. We propose algorithms that check for local and refined maximal repeats and that rely, at the conceptual level, on a suffix tree data structure. Experiments on natural and artificial data further illustrate various aspects of this new setting. All programs are available on the genouest platform, at http://genouest.org/modulome.
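
As a point of reference for the "usual notion of maximality" mentioned in the abstract, the sketch below enumerates classical maximal repeats by brute force: words that occur in at least one pair of positions where the two copies cannot be extended to the left or to the right without ceasing to match. It is only an illustration under stated assumptions. It is not the report's algorithm, it does not use a suffix tree, and it does not implement locality or largest maximal repeats. The function names `occurrences` and `maximal_repeats` and the `min_len` parameter are hypothetical.

```python
def occurrences(seq, word):
    """Return all start positions of word in seq, overlaps included."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def maximal_repeats(seq, min_len=2):
    """Map each classical maximal repeat of seq to its start positions.

    Hypothetical helper: quadratic-or-worse brute force, meant for short
    illustrative strings only.
    """
    n = len(seq)
    found = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximal pair: the preceding characters differ,
            # or the first copy starts at the beginning of seq.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right as long as the two copies agree,
            # which makes the pair right-maximal as well.
            k = 0
            while j + k < n and seq[i + k] == seq[j + k]:
                k += 1
            if k >= min_len:
                found.add(seq[i:i + k])
    return {word: occurrences(seq, word) for word in found}


if __name__ == "__main__":
    print(maximal_repeats("xabcyabcz"))
```

On the toy sequence `xabcyabcz`, the word `abc` is reported, while `ab` and `bc` are not, because every pair of their occurrences can be extended into `abc`.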

https://hal.inria.fr/inria-00353690
Contributor: Pierre Peterlongo
Submitted on: Friday, 16 January 2009 - 10:38:34
Last modified on: Wednesday, 16 May 2018 - 11:23:05
Document(s) archived on: Tuesday, 8 June 2010 - 18:22:44

File

RR-6802.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00353690, version 1

Citation

Jacques Nicolas, Christine Rousseau, Anne Siegel, Pierre Peterlongo, François Coste, et al. Modeling local repeats on genomic sequences. [Research Report] RR-6802, INRIA. 2008, pp.43. 〈inria-00353690〉
