DiNAMO: Exact method for degenerate IUPAC motifs discovery, characterization of sequence-specific errors

Abstract: Next-generation sequencing technologies are still associated with relatively high error rates, about 1%, which corresponds to thousands of errors at the scale of a complete genome. Each region therefore needs to be sequenced several times, and variants are usually filtered based on depth criteria. The significant number of artifacts that remain in spite of those filters shows the limits of conventional approaches and indicates that some sequencing artifacts are recurrent. This recurrence underlines that sequencing errors can depend on the upstream nucleotide sequence context. Our goal is to search for overrepresented motifs that tend to induce sequencing errors. Previous studies showed that some motifs, such as GGT [1,2], induce sequencing errors in Illumina technologies. However, these studies were restricted to exact motifs and did not take approximate motifs into account, which limits their statistical power. On the other hand, some tools, such as FIRE [3], DREME [4] and Discrover [5], were developed to search for degenerate motifs over the 15-letter IUPAC alphabet in the context of ChIP-seq studies. However, these tools use greedy algorithms, which implies a lack of sensitivity. We therefore developed an exact algorithm that searches for degenerate motifs by enumerating all possible IUPAC motifs. This algorithm is based on mutual information and uses hash tables with a graph data structure to store the motifs. It is independent of the sequencing technology. Experimental results on real data show that there are many overrepresented motifs upstream of sequencing artifacts. These artifacts are identified through the strand bias between forward and reverse reads. The length-3 homopolymer CCC seems to be sufficient to induce errors on IonTorrent. On Illumina, motifs are mainly composed of GGC followed by GGT (for example, TGGCNGGT) or homopolymers. We also noticed a drop in base quality after the detected motifs.
Our exact algorithm requires less than one minute (Intel® Core™ i5-4570 CPU, 3.20 GHz) and less than 2 GB of RAM to search for fully degenerate motifs of length 6 on a dataset of approximately 24,000 sequences extracted from 11 exomes sequenced on IonTorrent Proton.
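
To make the two ingredients named in the abstract more concrete (exhaustive enumeration over the 15-letter IUPAC alphabet and scoring by mutual information), here is a minimal brute-force sketch in Python. It is not the DiNAMO implementation: the hash table and graph structures mentioned above are not reproduced, and the function names, the presence/absence scoring and the positive/control split are all assumptions made for illustration. The sketch assumes the mutual information is taken between "motif occurs in the sequence" and "sequence belongs to the error-associated set".

```python
import itertools
import math

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def all_motifs(k):
    """Yield every IUPAC motif of length k (15**k motifs in total)."""
    for letters in itertools.product(sorted(IUPAC), repeat=k):
        yield "".join(letters)


def occurs(motif, seq):
    """Return True if the degenerate motif matches seq at any position."""
    k = len(motif)
    return any(
        all(seq[pos + i] in IUPAC[code] for i, code in enumerate(motif))
        for pos in range(len(seq) - k + 1)
    )


def mutual_information(motif, positives, negatives):
    """Mutual information (in bits) between motif presence and class label.

    Assumption for illustration: `positives` holds sequences taken upstream
    of suspected artifacts and `negatives` holds control sequences.
    """
    pos_hits = sum(occurs(motif, s) for s in positives)
    neg_hits = sum(occurs(motif, s) for s in negatives)
    # 2x2 contingency table: rows = motif present/absent, columns = class.
    table = [
        [pos_hits, neg_hits],
        [len(positives) - pos_hits, len(negatives) - neg_hits],
    ]
    total = len(positives) + len(negatives)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if table[i][j] == 0:
                continue  # an empty cell contributes nothing to the sum
            p_xy = table[i][j] / total
            p_x = row_sums[i] / total
            p_y = col_sums[j] / total
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi
```

With this sketch, ranking candidates amounts to calling `mutual_information` on each motif yielded by `all_motifs(k)` and sorting by score. For length 6 that means 15**6 = 11,390,625 motifs, which is why a naive loop like this one would be far slower than the timings reported above and why a dedicated storage structure matters.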
Document type:
Poster
JOBIM 2017 - Journées Ouvertes en Biologie, Informatique et Mathématiques, Jul 2017, Lille, France. 2017


https://hal.inria.fr/hal-01574630
Contributor: Helene Touzet
Submitted on: Tuesday, 10 October 2017 - 13:41:00
Last modified on: Tuesday, 10 October 2017 - 14:44:04

File

JOBIM2017_paper_80.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01574630, version 1

Citation

Chadi Saad, Laurent Noé, Hugues Richard, Julie Leclerc, Marie-Pierre Buisine, et al. DiNAMO: Exact method for degenerate IUPAC motifs discovery, characterization of sequence-specific errors. JOBIM 2017 - Journées Ouvertes en Biologie, Informatique et Mathématiques, Jul 2017, Lille, France. 2017. 〈hal-01574630〉
