Lossless filter for multiple repeats with bounded edit distance

Pierre Peterlongo 1, * Gustavo Sacomoto 2 Alair Do Lago 3 Nadia Pisanti 4 Marie-France Sagot 5, 6, *
* Auteur correspondant
1 SYMBIOSE - Biological systems and models, bioinformatics and sequences
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
6 BAMBOO - An algorithmic view on genomes, cells, and environments
Inria Grenoble - Rhône-Alpes, LBBE - Laboratoire de Biométrie et Biologie Evolutive
Abstract : Background
Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods can usually not be applied in practice.
The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment or repeats inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists in the verification of several strong necessary conditions that can be checked in a fast way. We implemented three versions of the filter. The first is simply a straightforward extension to the case of multiple sequences of an application of conditions already existing in the literature. The second uses a stronger condition which, as our results show, enable to filter sensibly more with negligible (if any) additional time. The third version uses an additional condition and pushes the sensibility of the filter even further with a non negligible additional time in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing of a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment.
To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeats length.
Type de document :
Article dans une revue
Algorithms for Molecular Biology, BioMed Central, 2009, 4 (1), pp.3. 〈10.1186/1748-7188-4-3〉
Liste complète des métadonnées

Littérature citée [16 références]  Voir  Masquer  Télécharger

Contributeur : Ed. Bmc <>
Soumis le : lundi 4 février 2013 - 13:12:17
Dernière modification le : jeudi 7 février 2019 - 17:05:03
Document(s) archivé(s) le : dimanche 5 mai 2013 - 07:30:14


Fichiers éditeurs autorisés sur une archive ouverte



Pierre Peterlongo, Gustavo Sacomoto, Alair Do Lago, Nadia Pisanti, Marie-France Sagot. Lossless filter for multiple repeats with bounded edit distance. Algorithms for Molecular Biology, BioMed Central, 2009, 4 (1), pp.3. 〈10.1186/1748-7188-4-3〉. 〈hal-00784457〉



Consultations de la notice


Téléchargements de fichiers