Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds

Laurent Noé 1
1 BONSAI - Bioinformatics and Sequence Analysis
Université de Lille, Sciences et Technologies, Inria Lille - Nord Europe, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189, CNRS - Centre National de la Recherche Scientifique
Abstract: Background: Spaced seeds, also named gapped q-grams, gapped k-mers, or spaced q-grams, have been proven to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequence analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free related methods. Unfortunately, spaced seeds must first be designed. This task is known to be time-consuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrarily chosen parameters of the probabilistic alignment models used. In this general context, dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity.
Results: We expand the scope of the work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), and demonstrate that the same dominance definition can be applied and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semi-ring to quickly compute the large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of either single seeds to thoroughly analyse or multiple seeds to store.
Moreover, at http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance, we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.
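As an illustration only, the idea of a parameter-free sensitivity on the Bernoulli model can be sketched by brute force. The sketch below assumes an alignment is a binary string (`1` for a match, `0` for a mismatch) and that a seed hits when all of its `1` positions face matches at some offset. It counts, for each number of matches m, how many alignments of length n are hit; these counts do not depend on the match probability p, which is only plugged in afterwards. The function names `seed_hits`, `hit_counts` and `sensitivity` are hypothetical, and the exhaustive enumeration stands in for the counting semi-ring computation described in the paper, so it is only practical for short alignments.

```python
from itertools import product


def seed_hits(seed: str, alignment: str) -> bool:
    """Return True if the seed hits the alignment at some offset.

    A hit requires a match ('1') in the alignment under every '1' of the
    seed; positions under a '0' of the seed are "don't care" positions.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    last_start = len(alignment) - len(seed)
    return any(
        all(alignment[start + i] == "1" for i in required)
        for start in range(last_start + 1)
    )


def hit_counts(seed: str, n: int) -> list:
    """counts[m] = number of length-n alignments with m matches that are hit.

    Brute-force enumeration of all 2**n alignments (illustrative only).
    """
    counts = [0] * (n + 1)
    for bits in product("01", repeat=n):
        alignment = "".join(bits)
        if seed_hits(seed, alignment):
            counts[alignment.count("1")] += 1
    return counts


def sensitivity(counts: list, p: float) -> float:
    """Bernoulli sensitivity: sum over m of counts[m] * p**m * (1-p)**(n-m)."""
    n = len(counts) - 1
    return sum(c * p**m * (1 - p) ** (n - m) for m, c in enumerate(counts))


if __name__ == "__main__":
    counts = hit_counts("11110110111", 16)  # computed once, independent of p
    for p in (0.7, 0.8, 0.9):
        print(p, sensitivity(counts, p))
```

Because `hit_counts` is computed once and reused for every value of p, two seeds can be compared through their count vectors alone, without fixing p in advance.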
Document type:
Journal article
Algorithms for Molecular Biology, BioMed Central, 2017, 12 (1), 〈10.1186/s13015-017-0092-1〉
https://hal.inria.fr/hal-01467970
Contributor: Laurent Noé
Submitted on: Tuesday, 14 February 2017 - 22:56:50
Last modified on: Thursday, 11 January 2018 - 06:27:32

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Citation

Laurent Noé. Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds. Algorithms for Molecular Biology, BioMed Central, 2017, 12 (1), 〈10.1186/s13015-017-0092-1〉. 〈hal-01467970〉
