Journal articles

kmtricks: Efficient construction of Bloom filters for large sequencing data collections

Abstract: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI, etc.) is to construct a collection of Bloom filters, one per sample. Each Bloom filter represents a set of k-mers that approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data: erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundance k-mers are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (1) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction that directly counts, partitions and sorts hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; and (2) a novel technique that takes advantage of joint counting to preserve low-abundance k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that, in a large metagenomics dataset, this technique preserves around 8x more k-mers than the usual yet crude filtering of low-abundance k-mers.
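To make the second contribution concrete, the following is a minimal Python sketch of the general idea of rescuing low-abundance k-mers through joint counting. It is not the kmtricks implementation: the function names, the parameters `solid_min` and `min_samples`, and the exact rescue rule (keep a k-mer that is below the abundance threshold if it occurs in at least `min_samples` samples) are assumptions made for illustration, and plain Python sets stand in for Bloom filters.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count k-mer occurrences over all reads of one sample."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_with_rescue(sample_counts, solid_min, min_samples):
    """Return one set of retained k-mers per sample.

    Hypothetical rule, for illustration only: a k-mer is kept in a
    sample if its count there reaches solid_min, or if it is present
    in that sample and occurs in at least min_samples samples overall.
    """
    # Joint view: in how many samples does each k-mer occur at all?
    presence = Counter()
    for counts in sample_counts:
        presence.update(counts.keys())

    kept = []
    for counts in sample_counts:
        kept.append({
            kmer
            for kmer, c in counts.items()
            if c >= solid_min or presence[kmer] >= min_samples
        })
    return kept


# Toy data: three samples, each a list of reads.
samples = [["ACGTA"], ["ACGTT", "ACGTT"], ["ACGTC"]]
per_sample = [count_kmers(reads, 4) for reads in samples]
retained = filter_with_rescue(per_sample, solid_min=2, min_samples=3)
for i, kept_set in enumerate(retained):
    print(i, sorted(kept_set))
```

In this toy input, `ACGT` occurs only once in the first and third samples, so a per-sample abundance cutoff of 2 would discard it there. The joint view shows that it is shared by all three samples, so the sketch retains it, while singletons seen in only one sample are still dropped.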

Contributor: Pierre Peterlongo
Submitted on: Thursday, March 11, 2021 - 9:16:18 AM
Last modification on: Wednesday, May 18, 2022 - 10:43:12 AM
Long-term archiving on: Saturday, June 12, 2021 - 6:12:37 PM


Files produced by the author(s)



Téo Lemane, Paul Medvedev, Rayan Chikhi, Pierre Peterlongo. kmtricks: Efficient construction of Bloom filters for large sequencing data collections. Bioinformatics Advances, Oxford academic, 2022, ⟨10.1093/bioadv/vbac029⟩. ⟨hal-03166007⟩


