Searching for Smallest Grammars on Large Sequences and Application to DNA

Rafael Carrascosa 1 François Coste 2, 3 Matthias Gallé 2, 3, * Gabriel Infante-Lopez 1
* Auteur correspondant
2 SYMBIOSE - Biological systems and models, bioinformatics and sequences
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
3 Dyliss - Dynamics, Logics and Inference for biological Systems and Sequences
IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE, Inria Rennes – Bretagne Atlantique
Abstract : Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable on large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10\%) in approximately the same amount of time than their competitors on a classical set of genomic sequences and on whole genomes of model organisms.
Liste complète des métadonnées

Littérature citée [26 références]  Voir  Masquer  Télécharger

https://hal.inria.fr/inria-00536633
Contributeur : François Coste <>
Soumis le : mardi 9 octobre 2012 - 14:34:48
Dernière modification le : mercredi 4 octobre 2017 - 16:08:21
Document(s) archivé(s) le : mardi 13 décembre 2016 - 18:18:38

Fichier

preprint_jda.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

Citation

Rafael Carrascosa, François Coste, Matthias Gallé, Gabriel Infante-Lopez. Searching for Smallest Grammars on Large Sequences and Application to DNA. Journal of Discrete Algorithms, Elsevier, 2012, Special issue on Stringology, Bioinformatics and Algorithms, 11, pp.62-72. 〈http://www.sciencedirect.com/science/article/pii/S1570866711000517〉. 〈10.1016/j.jda.2011.04.006〉. 〈inria-00536633〉

Partager

Métriques

Consultations de
la notice

410

Téléchargements du document

134