kmtricks: Efficient construction of Bloom filters for large sequencing data collections

Téo Lemane; Paul Medvedev; Rayan Chikhi; Pierre Peterlongo

doi:10.1093/bioadv/vbac029

Journal article, Bioinformatics Advances, Year: 2022

Téo Lemane

Role: Author
PersonId: 806245
ORCID: 0000-0002-7210-3178
IdRef: 271480785

Scalable, Optimized and Parallel Algorithms for Genomics

Paul Medvedev

Role: Author

Pennsylvania State University

Rayan Chikhi

Role: Author
PersonId: 14839
IdHAL: rayan-chikhi
ORCID: 0000-0003-1099-8735
IdRef: 16546769X

Département de Biologie Computationnelle - Department of Computational Biology

Pierre Peterlongo

Role: Corresponding author
PersonId: 171998
IdHAL: pierre-peterlongo
ORCID: 0000-0003-0776-6407
IdRef: 12482062X

Scalable, Optimized and Parallel Algorithms for Genomics

Abstract

When indexing large collections of short-read sequencing data, a common operation, now implemented in several tools (Sequence Bloom Trees and variants, BIGSI, ...), is to construct a collection of Bloom filters, one per sample. Each Bloom filter represents a set of k-mers that approximates the desired set of all non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially for metagenomics data: erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundance k-mers are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (1) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction that directly counts, partitions and sorts hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; and (2) a novel technique that takes advantage of joint counting to preserve low-abundance k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8x more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset.
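The two ideas in the abstract (counting hashes rather than k-mers, and using joint counts to keep low-abundance k-mers shared by several samples) can be illustrated with a minimal Python sketch. This is not the kmtricks implementation or its interface. The function names, the thresholds `solid_min`, `rescue_min` and `share_min`, the exact rescue rule, the use of a single BLAKE2b-based hash function, and the omission of partitioning and canonical k-mers are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of seq (no canonical form, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def kmer_hash(kmer, num_bits):
    """Map a k-mer to one bit position. A single hash is an assumption here."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_bits

def count_hashes(reads, k, num_bits):
    """Count hash values instead of k-mers; colliding k-mers share a count."""
    return Counter(kmer_hash(km, num_bits) for r in reads for km in kmers(r, k))

def joint_filter(counts_per_sample, solid_min, rescue_min, share_min):
    """Return, for each sample, the sorted hash values to keep.

    Hypothetical rule: a hash is kept in a sample if its count there is at
    least solid_min, or if its count is at least rescue_min and it reaches
    solid_min in at least share_min other samples.
    """
    solid_in = Counter()  # number of samples in which each hash is solid
    for counts in counts_per_sample:
        solid_in.update(h for h, c in counts.items() if c >= solid_min)
    kept = []
    for counts in counts_per_sample:
        keep = []
        for h, c in counts.items():
            others = solid_in[h] - (1 if c >= solid_min else 0)
            if c >= solid_min or (c >= rescue_min and others >= share_min):
                keep.append(h)
        kept.append(sorted(keep))
    return kept

def build_bloom(sorted_hashes, num_bits):
    """Set one bit per kept hash value; sorted input gives sequential writes."""
    bits = bytearray((num_bits + 7) // 8)
    for h in sorted_hashes:
        bits[h >> 3] |= 1 << (h & 7)
    return bits

def query(bits, kmer, num_bits):
    """Return True if the k-mer's bit is set (false positives are possible)."""
    h = kmer_hash(kmer, num_bits)
    return bool(bits[h >> 3] & (1 << (h & 7)))

# Toy data: three samples, k = 5.
samples = [["ACGTA"] * 3, ["ACGTA"], ["TTTTT"]]
NUM_BITS = 1 << 20
counts = [count_hashes(reads, 5, NUM_BITS) for reads in samples]
kept = joint_filter(counts, solid_min=2, rescue_min=1, share_min=1)
blooms = [build_bloom(hs, NUM_BITS) for hs in kept]
```

In this toy input, the k-mer seen once in the second sample is abundant in the first, so the hypothetical rule keeps it, whereas a per-sample threshold of 2 alone would discard it. The k-mer seen once in only the third sample has no support elsewhere and is dropped.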

Keywords

indexing, kmers, bloom filters, sequencing data, genomics

Domains

Bioinformatics [q-bio.QM]; Data Structures and Algorithms [cs.DS]

Main file

kmtricks.pdf (1.01 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03166007

Submitted on: Thursday, 11 March 2021, 09:16:18

Last modified on: Tuesday, 16 January 2024, 16:29:55

Long-term archiving on: Saturday, 12 June 2021, 18:12:37

Dates and versions

hal-03166007, version 1 (11 March 2021)

Identifiers

HAL Id: hal-03166007, version 1
DOI: 10.1093/bioadv/vbac029

Cite

Téo Lemane, Paul Medvedev, Rayan Chikhi, Pierre Peterlongo. kmtricks: Efficient construction of Bloom filters for large sequencing data collections. Bioinformatics Advances, 2022, ⟨10.1093/bioadv/vbac029⟩. ⟨hal-03166007⟩

Collections

PASTEUR UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA CENTRALESUPELEC INRIA2 GENCI UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES ANR PRAIRIE-IA UR1-MATH-NUM

261 views

377 downloads
