Skip to Main content Skip to Navigation

Combining machine learning and evolution for the annotation of metagenomics data

Abstract : Metagenomics is used to study microbial communities by the analyze of DNA extracted directly from environmental samples. It allows to establish a catalog very extended of genes present in the microbial communities. This catalog must be compared against the genes already referenced in the databases in order to find similar sequences and thus determine their function. In the course of this thesis, we have developed MetaCLADE, a new methodology that improves the detection of protein domains already referenced for metagenomic and metatranscriptomic sequences. For the development of MetaCLADE, we modified an annotation system of protein domains that has been developed within the Laboratory of Computational and Quantitative Biology clade called (closer sequences for Annotations Directed by Evolution) [17]. In general, the methods for the annotation of protein domains characterize protein domains with probabilistic models. These probabilistic models, called sequence consensus models (SCMs) are built from the alignment of homolog sequences belonging to different phylogenetic clades and they represent the consensus at each position of the alignment. However, when the sequences that form the homolog set are very divergent, the signals of the SCMs become too weak to be identified and therefore the annotation fails. In order to solve this problem of annotation of very divergent domains, we used an approach based on the observation that many of the functional and structural constraints in a protein are not broadly conserved among all species, but they can be found locally in the clades. The approach is therefore to expand the catalog of probabilistic models by creating new models that focus on the specific characteristics of each clade. MetaCLADE, a tool designed with the objective of annotate with precision sequences coming from metagenomics and metatranscriptomics studies uses this library in order to find matches between the models and a database of metagenomic or metatranscriptomic sequences. Then, it uses a pre-computed step for the filtering of the sequences which determine the probability that a prediction is a true hit. This pre-calculated step is a learning process that takes into account the fragmentation of metagenomic sequences to classify them. We have shown that the approach multi source in combination with a strategy of meta-learning taking into account the fragmentation outperforms current methods.
Document type :
Complete list of metadata
Contributor : ABES STAR :  Contact
Submitted on : Monday, December 18, 2017 - 1:11:09 AM
Last modification on : Thursday, April 28, 2022 - 3:59:13 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01666046, version 1


Ari Ugarte. Combining machine learning and evolution for the annotation of metagenomics data. Biotechnology. Université Pierre et Marie Curie - Paris VI, 2016. English. ⟨NNT : 2016PA066732⟩. ⟨tel-01666046⟩



Record views


Files downloads