High-dimensional compositional microbiota data: state-of-the-art of methods and software implementations

Abstract : Compositional data (CoDa) consist of nonnegative measurements that sum to a constant value, typically proportions that sum to 1. Because the total is known, any one component can be determined from the sum of the others, so the parts that make up a composition are mathematically and statistically dependent. This structure complicates analysis and rules out standard statistical methods. Aitchison (JRSS-B, 1982) and Egozcue and colleagues (Math. Geol., 2003), among others, provided a framework for analyzing CoDa by mapping data from the constrained simplex to Euclidean space using nonlinear transforms such as the log-odds or the isometric log-ratio transforms. The improving quality and falling cost of high-throughput sequencing technology, in particular 16S rRNA gene sequencing of the bacterial component of the human microbial community (microbiota), have enabled researchers to investigate human diseases. Subsequently, the microbiota has been associated with numerous diseases, including inflammatory bowel disease, diabetes, cancer and cystic fibrosis. Because microbiota sequencing generates data that are both compositional and high-dimensional, specific statistical analysis methods and computational tools have been developed in parallel. Microbiota are usually measured as relative abundances of species and analyzed as CoDa. The objectives of this work are the following:
- First, to review the theory and usage of CoDa analysis in the microbiota setting, with particular emphasis on recent proposals adapted to high-dimensional problems (e.g. supervised: constrained Lasso, hierarchical Lasso, kernel methods, sPLS; unsupervised: PCoA, PCA, sparse inverse covariance estimation).
- Second, to investigate the current state-of-the-art software implementations (mainly R packages: compositions, vegan, ALDEx2, PERMANOVA, MiRKAT, MixMC...).
- Third, using toy examples and publicly available data (the 16S data from the study by Koren and colleagues published in PNAS in March 2011, available in the MixMC R package), to implement and evaluate those methods with publicly available code. Evaluation criteria are mainly based on computational and practical aspects.
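As a rough illustration of the log-ratio idea described in the abstract, the sketch below closes a vector of counts to proportions and applies two common transforms: the centered log-ratio (clr) and the additive log-ratio (alr, which generalizes the log-odds). It is not taken from the authors' material. The counts, the pseudocount of 0.5 used to avoid taking the log of zero, and all function names are assumptions made for the example.

```python
import math


def closure(parts):
    """Rescale nonnegative parts so that they sum to 1."""
    total = sum(parts)
    if total <= 0:
        raise ValueError("composition must have a positive total")
    return [p / total for p in parts]


def clr(composition):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    if any(p <= 0 for p in composition):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def alr(composition):
    """Additive log-ratio: log of each part relative to the last part."""
    if any(p <= 0 for p in composition):
        raise ValueError("alr requires strictly positive parts")
    ref = math.log(composition[-1])
    return [math.log(p) - ref for p in composition[:-1]]


# Hypothetical read counts for four taxa in one sample (made-up numbers).
counts = [120, 30, 45, 5]
pseudocount = 0.5  # assumed value; one simple way to handle zero counts

props = closure([c + pseudocount for c in counts])
clr_values = clr(props)
alr_values = alr(props)

print("proportions:", props)
print("clr:", clr_values)
print("alr:", alr_values)
```

The clr coordinates sum to zero by construction, and both transforms depend only on ratios between parts, so rescaling the input by a constant leaves them unchanged. In R, the packages listed in the abstract provide their own implementations of these transforms.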

https://hal.inria.fr/hal-01667295
Contributor : Marta Avalos
Submitted on : Tuesday, December 19, 2017 - 11:52:32 AM
Last modification on : Tuesday, May 14, 2019 - 6:50:14 PM

File

SORET_GdR_Stat&Santé_2017.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01667295, version 1

Citation

Perrine Soret, Marta Avalos, Soon Cheng, Rodolphe Thiebaut. High-dimensional compositional microbiota data: state-of-the-art of methods and software implementations. 2017 - GDR « Statistiques et santé », Oct 2017, Bordeaux, France. ⟨hal-01667295⟩
