Métagénomique comparative de novo à grande échelle

Gaëtan Benoit 1, 2
1 GenScale - Scalable, Optimized and Parallel Algorithms for Genomics
Inria Rennes – Bretagne Atlantique , IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE
Abstract : Metagenomics studies the genomic content of a sample extracted from a natural environment. Among the available analyses, comparative metagenomics aims to estimate the similarity between two or more environmental samples at the genomic level. The traditional approach compares samples based on their content of known, identified species. However, this method is biased by the incompleteness of reference databases. By contrast, de novo comparative metagenomics does not rely on a priori knowledge: sample similarity is estimated by counting the number of similar DNA sequences shared between datasets. A metagenomic project typically generates hundreds of datasets, each containing tens of millions of short DNA sequences (called reads) of 100 to 150 base pairs. Comparing such an amount of data with usual methods would require years. This thesis presents novel de novo approaches to quickly compute the similarity between numerous datasets. The main idea underlying our work is to use the k-mer (a word of size k) as the comparison unit between metagenomes. The main method developed during this thesis, called Simka, computes several similarity measures by replacing species counts with k-mer counts (k > 21). Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Experiments on data from the Human Microbiome Project and Tara Oceans show that the similarities computed by Simka are well correlated with reference-based and OTU-based similarities. Simka processed these projects (more than 30 billion reads distributed across hundreds of datasets) in a few hours. It is currently the only tool able to scale up to such projects while providing precise and extensive comparison results.
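To make the idea of replacing species counts with k-mer counts concrete, here is a minimal Python sketch. It is not Simka's implementation. The function names `count_kmers` and `bray_curtis` are hypothetical, and the choice of the Bray-Curtis dissimilarity is an assumption, since the abstract only says "several similarity measures". The toy reads and the small k are also assumptions chosen for readability. The sketch counts k-mers in memory for two tiny samples and ignores reverse complements, whereas real projects need the parallel, multi-dataset counting strategy the abstract describes.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) over a list of reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two k-mer count profiles.

    Returns 0.0 for identical profiles and 1.0 when no k-mer is shared.
    """
    # Counter returns 0 for a missing key, so absent k-mers contribute nothing.
    shared = sum(min(c, counts_b[kmer]) for kmer, c in counts_a.items())
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * shared / total


# Toy samples (assumed data); k = 4 only for illustration, the thesis uses k > 21.
sample_a = ["ACGTAC", "GTACGT"]
sample_b = ["ACGTAC", "TTTTTT"]
distance = bray_curtis(count_kmers(sample_a, 4), count_kmers(sample_b, 4))
print(distance)
```

In this sketch the k-mer plays the role that a species plays in a classical ecological distance, so no reference database is consulted at any point.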
Document type :
Theses
Cited literature : 163 references

https://hal.inria.fr/tel-01659395
Contributor : Abes Star
Submitted on : Wednesday, February 28, 2018 - 11:45:18 AM
Last modification on : Friday, September 13, 2019 - 9:49:21 AM

File

BENOIT_Gaetan.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01659395, version 2

Citation

Gaëtan Benoit. Métagénomique comparative de novo à grande échelle. Bioinformatics [q-bio.QM]. Université Rennes 1, 2017. French. ⟨NNT : 2017REN1S088⟩. ⟨tel-01659395v2⟩

Metrics

Record views : 498
File downloads : 760