Preprints, Working Papers, ...

kmtricks: Efficient construction of Bloom filters for large sequencing data collections

Abstract: When indexing large collections of sequencing data, a common operation, now implemented in several tools (Sequence Bloom Trees and variants, BIGSI, etc.), is to construct a collection of Bloom filters, one per sample. Each Bloom filter represents a set of k-mers that approximates the desired set of all non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially for metagenomic data: erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundance k-mers are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (1) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction that directly counts hashes instead of k-mers; and (2) a novel technique that takes advantage of joint counting to preserve low-abundance k-mers present in several samples, improving the recovery of non-erroneous k-mers. In addition, our experimental results highlight that the usual yet crude filtering of low-abundance k-mers is inappropriate for complex data such as metagenomes.
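The two contributions named in the abstract can be illustrated with a small sketch. It is not the kmtricks implementation: the function names, the thresholds (`solid_min`, `rescue_min`, `share_min`), the single BLAKE2b-based hash, and the exact rescue rule are all assumptions made for illustration, since the abstract does not specify them. The sketch counts hash values rather than k-mer strings, then sets a bit in a sample's filter when the hash is abundant in that sample, or when it is low-abundance there but abundant in enough other samples.

```python
from collections import Counter
import hashlib


def kmers(seq, k):
    """Yield every substring of length k from seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_hash(kmer, m):
    """Map a k-mer to one position in a filter of m bits (single hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % m


def count_hashes(reads, k, m):
    """Count hash values directly, so k-mer strings are never stored."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            counts[kmer_hash(km, m)] += 1
    return counts


def build_filters(samples, k, m, solid_min=2, rescue_min=1, share_min=1):
    """Build one filter per sample from jointly inspected hash counts.

    Hypothetical rule: a hash is kept in a sample if its count is at least
    solid_min, or if its count lies in [rescue_min, solid_min) and the hash
    reaches solid_min in at least share_min other samples.
    """
    per_sample = [count_hashes(reads, k, m) for reads in samples]

    # Number of samples in which each hash reaches the solid threshold.
    solid_in = Counter()
    for counts in per_sample:
        for h, c in counts.items():
            if c >= solid_min:
                solid_in[h] += 1

    filters = []
    for counts in per_sample:
        bits = bytearray(m)  # one byte per bit position, for readability
        for h, c in counts.items():
            solid = c >= solid_min
            # A rescued hash is not solid here, so solid_in counts other samples only.
            rescued = rescue_min <= c < solid_min and solid_in[h] >= share_min
            if solid or rescued:
                bits[h] = 1
        filters.append(bits)
    return filters
```

For example, with `samples = [["ACGTACGT", "ACGTACGT"], ["ACGTA"]]`, `k = 4` and `m = 1 << 20`, the k-mer ACGT occurs only once in the second sample but is kept in that sample's filter because it is abundant in the first. A production tool would pack bits, partition the hash space, and stream counts from disk; none of that is attempted here.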

https://hal.inria.fr/hal-03166007
Contributor: Pierre Peterlongo
Submitted on: Thursday, March 11, 2021 - 9:16:18 AM
Last modification on: Saturday, March 13, 2021 - 3:32:24 AM

File

kmtricks.pdf
Files produced by the author(s)

Citation

Téo Lemane, Paul Medvedev, Rayan Chikhi, Pierre Peterlongo. kmtricks: Efficient construction of Bloom filters for large sequencing data collections. 2021. ⟨hal-03166007⟩
