Optimal neighborhood indexing for protein similarity search

Similarity inference, one of the main tasks in bioinformatics, must cope with the exponential growth of biological data. A classical approach to handling this data flow relies on heuristics with large seed indexes. To speed up this technique, the index can be enhanced by storing additional information that limits the number of random memory accesses. However, this improvement leads to a larger index that may itself become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet. The paper presents two main contributions. First, we show that an optimal neighborhood indexing, combining an alphabet reduction with a longer neighborhood, reduces the memory involved in the process by 35% without sacrificing the quality of results or the computation time. Second, our approach led us to develop a new kind of substitution score matrices and their associated E-value parameters. In contrast to usual matrices, these matrices are rectangular, since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and provide some typical examples that can be used in such comparisons. We propose a practical reduction of the index size of the neighborhood data that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.
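As a rough illustration of what a rectangular substitution score matrix might look like, the sketch below scores groups of a reduced alphabet (rows) against letters of the full alphabet (columns). It assumes a standard log-odds construction in which pair frequencies are summed over each group; the abstract does not spell out the computation, so this is not necessarily the method used in the paper. The function name `rectangular_scores`, the `scale` parameter, and the three-letter toy frequencies are all hypothetical and chosen only for illustration.

```python
import math

def rectangular_scores(joint, row_groups, col_groups, scale=2.0):
    """Log-odds scores between groups of two (possibly different) alphabets.

    joint[(a, b)] is an assumed target frequency of the aligned pair (a, b);
    the values must sum to 1 and every group pair must have nonzero mass.
    """
    # Background frequency of each letter, taken as a marginal of `joint`.
    p_row, p_col = {}, {}
    for (a, b), q in joint.items():
        p_row[a] = p_row.get(a, 0.0) + q
        p_col[b] = p_col.get(b, 0.0) + q

    scores = {}
    for g in row_groups:
        for h in col_groups:
            # Pair frequency and background frequencies summed over the groups.
            q_gh = sum(joint.get((a, b), 0.0) for a in g for b in h)
            p_g = sum(p_row.get(a, 0.0) for a in g)
            p_h = sum(p_col.get(b, 0.0) for b in h)
            scores[(g, h)] = round(scale * math.log2(q_gh / (p_g * p_h)))
    return scores

# Toy pair frequencies over a three-letter alphabet (illustrative values only).
joint = {
    ("L", "L"): 0.2, ("I", "I"): 0.2, ("K", "K"): 0.3,
    ("L", "I"): 0.1, ("I", "L"): 0.1,
    ("L", "K"): 0.025, ("K", "L"): 0.025,
    ("I", "K"): 0.025, ("K", "I"): 0.025,
}

# Rows: reduced alphabet {LI, K}. Columns: full alphabet {L, I, K}.
scores = rectangular_scores(joint, ["LI", "K"], ["L", "I", "K"])
for g in ["LI", "K"]:
    print(g, [scores[(g, h)] for h in ["L", "I", "K"]])
```

The resulting matrix has two rows and three columns. Under these assumptions, merging L and I into one group still yields a positive score against either letter and a negative score against K.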

Domains

Data Structures and Algorithms [cs.DS]; Bioinformatics [q-bio.QM]; Bioinformatics, Systems Biology [q-bio.QM]

Main file

journal2.pdf (372.28 KB)

Origin: Files produced by the author(s)

Contributor: Pierre Peterlongo

https://inria.hal.science/inria-00340510

Submitted on: Friday, November 21, 2008, 09:39:44

Last modified on: Friday, March 24, 2023, 14:52:51

Long-term archiving on: Thursday, October 11, 2012, 11:50:47

Dates and versions

inria-00340510, version 1 (21-11-2008)

Identifiers

HAL Id: inria-00340510, version 1
DOI: 10.1186/1471-2105-9-534

Cite

Pierre Peterlongo, Laurent Noé, Dominique Lavenier, van Hoa Nguyen, Gregory Kucherov, et al. Optimal neighborhood indexing for protein similarity search. BMC Bioinformatics, 2008, 9 (534), ⟨10.1186/1471-2105-9-534⟩. ⟨inria-00340510⟩


Collections

EC-PARIS UNIV-RENNES1 UNIV-LILLE3 CNRS INRIA INSA-RENNES IRISA LIFL IRISA-D7 CRISTAL INRIA2 CRISTAL-BONSAI UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UR1-MATH-NUM

423 views

148 downloads