Bloocoo, a memory efficient read corrector

Gaëtan Benoit; Dominique Lavenier; Claire Lemaitre; Guillaume Rizk

Poster Année : 2014

Bloocoo, a memory efficient read corrector

(1) , (1) , (1) , (1)

Gaëtan Benoit

Fonction : Auteur
PersonId : 772810
IdRef : 224561936

Scalable, Optimized and Parallel Algorithms for Genomics

Dominique Lavenier

Fonction : Auteur
PersonId : 1401
IdHAL : dominique-lavenier
ORCID : 0000-0003-2557-680X

Scalable, Optimized and Parallel Algorithms for Genomics

Claire Lemaitre

Fonction : Auteur
PersonId : 914
IdHAL : claire-lemaitre
ORCID : 0000-0001-8675-170X
IdRef : 132940418

Scalable, Optimized and Parallel Algorithms for Genomics

Guillaume Rizk

Fonction : Auteur

Scalable, Optimized and Parallel Algorithms for Genomics

Résumé

Next generation sequencing technologies generate a high amount of short DNA sequences, but may contain imperfections. Some applications, such as assembly, yield better results with high quality data, triggering the need for short-read correction software. Most methods proposed in the past do not scale well when dealing with a large dataset, often requiring a very large amount of memory. Bloocoo is a k-mer-spectrum based read error corrector. It relies on k-mer frequency to discriminate between correct solid k-mer and k-mer containing sequencing errors. This is a traditional approach used by many read correctors. Bloocoo distinguishes itself by requiring an order of magnitude less memory than other state-of-the art correctors, while still providing equivalent correction. This is achieved thanks to a constant-memory k-mer counting algorithm, and a bloom filter to store solid k-mers. The read correction workflows consists in 3 main steps as follows: 1) k-mer counting. 2) Insertion of solid k-mer in a bloom filter. 3) Multi-stage read correction.The first stage is the constant-memory k-mer counting step provided by the DSK counter by Rizk et al. It outputs on disk the list of solid k-mer, i.e. k-mers with a high enough abundance. The second step is the insertion of all solid k-mers in a bloom filter. This probabilistic data structure allows to know if a given k-mer is solid, using only 11 bits per solid k-mer inserted, at the cost of introducing some false positives.The correction step is a multi-step approach largely similar to the Musket correction algorithm by Liu et al. All k-mers of a read are queried in the bloom filter and classified as solid or non- solid k-mers. Errors are then corrected through multiple iterations of two-sided conservative, one-sided aggressive and voting-based correction algorithm. To neutralize the effect of false positive solid k-mers due to the bloom filter, errors are corrected only if several solid k-mers cover the corrected nucleotide, greatly reducing the risk that all of them are false-positives and induce false correction.Bloocoo was also designed for fast correction. First, dispatching blocks of reads to several threads is an easy way to parallelize the correction procedure. Secondly, the bloom filter used is a blocked bloom filter, which greatly increases performance thanks to higher cache coherence. Tests were conducted on simulated datasets with different error rates and genomes against state-of the art competitors. For example, with reads from human chromosome 1 with 1% error rate and 70x coverage, Bloocoo completed correction in 1390 sec using 740 MB of memory, while Musket required 5330 sec and 12190 MB of memory. Quality of correction was 90.28 % recall / 96.93 % precision for Bloocoo and 90.92 % recall / 97.86 % precision for Musket. Bloocoo correction is roughly the same as Musket, while taking 16x less memory and 3.8x less time.

Domaines

Bio-informatique [q-bio.QM]

Fichier principal

bloocoo_poster_ECCB.pdf (545.22 Ko)

bloocoo_abstract_ECCB.pdf (68.62 Ko)

Origine : Fichiers produits par l'(les) auteur(s)

Claire Lemaitre : Connectez-vous pour contacter le contributeur

https://inria.hal.science/hal-01092960

Soumis le : mardi 9 décembre 2014-17:50:39

Dernière modification le : vendredi 24 mars 2023-14:52:59

Archivage à long terme le : samedi 15 avril 2017-07:12:13

Dates et versions

hal-01092960 , version 1 (09-12-2014)

Identifiants

HAL Id : hal-01092960 , version 1

Citer

Gaëtan Benoit, Dominique Lavenier, Claire Lemaitre, Guillaume Rizk. Bloocoo, a memory efficient read corrector. European Conference on Computational Biology (ECCB), Sep 2014, Strasbourg, France. ECCB 2014, 2014. ⟨hal-01092960⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

INSTITUT-TELECOM EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-D7 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

624 Consultations

304 Téléchargements

Bloocoo, a memory efficient read corrector

Résumé

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager