Book sections

Accurate alignment of (meta)barcoding data sets using MACSE

Abstract : Twenty years of standardized DNA barcoding practice have resulted in millions of sequences being produced for a handful of molecular markers in a wide range of fungal, animal, and plant species. Despite some basic quality controls, reference barcoding data sets deposited in the Barcode of Life Data System (BOLD) database are not immune to sequencing errors and undetected pseudogenes. Such database inaccuracies can significantly bias subsequent species delimitation and biodiversity estimation based on DNA barcoding. These potential problems are amplified in metabarcoding studies containing thousands of sequences produced using high-throughput sequencing technologies. Here, we propose a pipeline based on MACSE v2, an extended version of our codon-aware multiple sequence alignment software accounting for frameshifts and stop codons. The MACSE_BARCODE pipeline allows the accurate alignment of hundreds of thousands of protein-coding barcode sequences. Re-analyses of published data sets confirm that MACSE v2 is able to automatically detect most sequencing errors previously identified manually. The proposed alignment strategy hence alleviates the risk of incorrect species delimitation due to the incorporation of sequencing errors or undetected pseudogenes. By applying the MACSE_BARCODE pipeline to mammal, ant, and flowering plant barcode sequences available in BOLD, we highlight several cases of database errors and provide curated reference alignments for the main protein-coding barcode genes. We anticipate our approach to be particularly useful for metabarcoding studies in which thousands of new sequences need to be compared to a reference database for subsequent taxonomic assignment. This might prove particularly helpful for diet characterization studies and large-scale biodiversity assessments through environmental DNA metabarcoding.
The new MACSE_BARCODE pipeline is distributed as Nextflow workflows that are available from the MACSE project webpage (https://bioweb.supagro.inra.fr/macse/).

Cited literature [60 references]

https://hal.archives-ouvertes.fr/hal-02541199
Contributor : Frederic Delsuc
Submitted on : Monday, April 13, 2020 - 11:35:42 AM
Last modification on : Sunday, November 15, 2020 - 10:44:01 AM

File

Delsuc&Ranwez-PhyloBook-2020.p...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02541199, version 1

Citation

Frédéric Delsuc, Vincent Ranwez. Accurate alignment of (meta)barcoding data sets using MACSE. Scornavacca, Celine; Delsuc, Frédéric; Galtier, Nicolas. Phylogenetics in the Genomic Era, No commercial publisher | Authors open access book, pp. 2.3:1–2.3:31, 2020. ⟨hal-02541199⟩
