A Scalable Indexing Solution to Mine Huge Genomic Sequence Collections

With High-Throughput Sequencing (HTS) technologies, biology is experiencing a sequence data deluge. A single sequencing experiment currently yields 100 million short sequences, or reads, whose analysis demands efficient and scalable algorithms. Many applications repeatedly need to query the sequence collection for the occurrence positions of a subword. Time can be saved by building an index of all subwords present in the sequences before performing huge numbers of queries. However, both the scalability and the memory requirement of the chosen data structure must suit the data volume. Here, we introduce a novel indexing data structure, called Gk arrays, and related algorithms that improve on classical indexes and state-of-the-art hash tables.
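
To make the querying problem concrete, the following minimal sketch indexes every subword of a fixed length k (a k-mer) in a small read collection and then looks up occurrence positions. It uses a plain Python dictionary, that is, the hash-table style baseline the abstract compares against; it is not the Gk arrays structure, whose layout is not described here. The function names `build_kmer_index` and `occurrences`, the toy reads, and the choice k = 3 are assumptions made for illustration only.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the list of (read_id, offset) pairs where it occurs.

    Hypothetical helper: a simple hash-table index, not Gk arrays.
    """
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        # Slide a window of length k over the read.
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def occurrences(index, kmer):
    """Return all (read_id, offset) positions of kmer, or [] if absent."""
    # .get avoids creating an empty entry for k-mers that were never seen.
    return index.get(kmer, [])


# Toy collection standing in for millions of reads (illustrative data only).
reads = ["ACGTAC", "GTACGT", "TTACGA"]
index = build_kmer_index(reads, k=3)

print(occurrences(index, "ACG"))  # [(0, 0), (1, 2), (2, 2)]
print(occurrences(index, "GGG"))  # []
```

The index is built once and each query then costs a single dictionary lookup, which is the time saving the abstract refers to. The memory cost of storing one entry per k-mer occurrence is what becomes critical at the scale of 100 million reads.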

Keywords

Algorithms; index; data structure; scalability; DNA; RNA; Next Generation Sequencing; Computational Biology

Domains

Bioinformatics [q-bio.QM]; Bioinformatics, Systems Biology [q-bio.QM]

Main file

Rivals-etal-ERCIM-News-89-5p.pdf (1.05 MB)

Origin: Publisher files allowed on an open archive

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00712653

Submitted on: Wednesday, June 27, 2012, 16:13:51

Last modified on: Wednesday, April 17, 2024, 15:30:03

Long-term archiving on: Friday, September 28, 2012, 02:41:41

Dates and versions

lirmm-00712653, version 1 (27-06-2012)

Identifiers

HAL Id: lirmm-00712653, version 1

Cite

Eric Rivals, Nicolas Philippe, Mikael Salson, Martine Léonard, Thérèse Commes, et al. A Scalable Indexing Solution to Mine Huge Genomic Sequence Collections. ERCIM News, 2012, 2012 (89), pp.20-21. ⟨lirmm-00712653⟩

Collections

UNIV-LILLE3 CNRS INRIA INSA-ROUEN CRBM LIFL LITIS ERCIM ERCIM-NEWS MAB LIRMM COMUE-NORMANDIE CRISTAL INRIA2 CRISTAL-BONSAI MIPS BS UNIV-MONTPELLIER UNIROUEN UNILEHAVRE INSA-GROUPE
