kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets - Inria - Institut national de recherche en sciences et technologies du numérique
Preprints, Working Papers, ... (Preprint) Year: 2023

Abstract

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized because one cannot efficiently search them for any sequence(s) of interest. We present kmindex, an innovative approach that can index thousands of highly complex metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. We demonstrate the scalability of kmindex by successfully indexing 1,393 complex marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server “Ocean Read Atlas” (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/, which enables real-time queries on the Tara Oceans dataset. The open-source kmindex software is available at https://github.com/tlemane/kmindex.

Dates and versions

cea-04321497, version 1 (04-12-2023)

Licence

Attribution

Cite

Téo Lemane, Nolan Lezzoche, Julien Lecubin, Eric Pelletier, Magali Lescot, et al. kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets. 2023. ⟨cea-04321497⟩