Inria - Institut national de recherche en sciences et technologies du numérique
Preprint, Working Paper. Year: 2023

PhylteR: efficient identification of outlier sequences in phylogenomic datasets

Abstract

In phylogenomics, incongruences between gene trees, arising from both artifactual and biological causes, are known to decrease the signal-to-noise ratio and complicate species tree inference. The amount of data handled today in classical phylogenomic analyses precludes manual error detection and removal, yet a simple and efficient way to automate the identification of outlier sequences is still missing. Here, we present PhylteR, a method for the rapid and accurate detection of outlier sequences in phylogenomic datasets, i.e. species in individual gene trees that do not follow the general trend. PhylteR relies on DISTATIS, a three-way extension of multidimensional scaling that compares multiple distance matrices at once. In PhylteR, the distance matrices, either computed directly from multiple sequence alignments or extracted from individual gene phylogenies, represent the evolutionary distances between species according to each gene. On simulated datasets, we show that PhylteR identifies outliers with more sensitivity and precision than a comparable existing method. On a biological dataset of 14,463 genes for 53 species previously assembled for Carnivora phylogenomics, we show (i) that PhylteR identifies as outliers sequences that can be considered as such by other means, and (ii) that the removal of these sequences improves the concordance between the gene trees and the species tree. Because it generates numerous graphical outputs, PhylteR also allows rapid and easy visual characterisation of the dataset at hand, which helps to pinpoint errors. PhylteR is distributed as an R package on CRAN and as containerized versions (Docker and Singularity).
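The general idea described in the abstract (one species-by-species distance matrix per gene, compared against a common trend) can be illustrated with a small sketch. This is not PhylteR's implementation and does not use its R API: the DISTATIS compromise is replaced here by a plain element-wise median, the outlier rule is a Tukey-style fence picked for illustration, and every function name and the toy data are hypothetical.

```python
from statistics import median, quantiles

def normalise(matrix):
    """Divide by the median off-diagonal distance so that genes evolving
    at different rates become comparable (an assumption of this sketch)."""
    n = len(matrix)
    scale = median(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return [[d / scale for d in row] for row in matrix]

def consensus(matrices):
    """Element-wise median across genes: a crude stand-in for the
    DISTATIS compromise, NOT the DISTATIS computation itself."""
    n = len(matrices[0])
    return [[median(m[i][j] for m in matrices) for j in range(n)]
            for i in range(n)]

def deviation_scores(matrices):
    """scores[g][s] = median absolute deviation of species s in gene g
    from the consensus, taken over its distances to all other species."""
    norm = [normalise(m) for m in matrices]
    cons = consensus(norm)
    n = len(cons)
    return [[median(abs(m[s][j] - cons[s][j]) for j in range(n) if j != s)
             for s in range(n)]
            for m in norm]

def flag_outliers(scores, species, k=3.0):
    """Flag (gene index, species) pairs whose score exceeds Q3 + k * IQR
    of all scores. The fence and k=3 are illustrative choices."""
    flat = [v for row in scores for v in row]
    q1, _, q3 = quantiles(flat, n=4)
    threshold = q3 + k * (q3 - q1)
    return [(g, species[s])
            for g, row in enumerate(scores)
            for s, v in enumerate(row) if v > threshold]

# Toy data: five species, five genes sharing one topology at different rates.
species = ["A", "B", "C", "D", "E"]
BASE = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 2, 3],
    [2, 2, 0, 1, 3],
    [2, 2, 1, 0, 3],
    [3, 3, 3, 3, 0],
]

def scaled(matrix, rate):
    """Return a copy of the matrix with every distance multiplied by rate."""
    return [[d * rate for d in row] for row in matrix]

# Gene 3 carries an artificially long branch for species E.
outlier_gene = [row[:] for row in BASE]
for i in range(4):
    outlier_gene[i][4] = outlier_gene[4][i] = 6

genes = [scaled(BASE, 1.0), scaled(BASE, 2.0), scaled(BASE, 0.5),
         outlier_gene, scaled(BASE, 1.5)]

scores = deviation_scores(genes)
print(flag_outliers(scores, species))  # [(3, 'E')]
```

Scoring each species by the median of its deviations, rather than the mean, keeps the other species of gene 3 from being flagged merely because their distance to E is off. Real data are noisy, so the fence would then sit above zero instead of exactly at zero as in this toy case.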
Main file
2023.02.02.526888v1.full.pdf (2.47 MB)
Origin: Files produced by the author(s)
License: CC BY NC ND - Attribution - NonCommercial - NoDerivatives

Dates and versions

hal-03995366, version 1 (12-06-2023)
hal-03995366, version 2 (06-11-2023)
hal-03995366, version 3 (20-12-2023)

Cite

Aurore Comte, Théo Tricou, Eric Tannier, Julien Joseph, Aurélie Siberchicot, et al. PhylteR: efficient identification of outlier sequences in phylogenomic datasets. 2023. ⟨hal-03995366v1⟩

Collections

INSERM