Searching for Smallest Grammars on Large Sequences and Application to DNA

Motivated by the inference of the structure of genomic sequences, we address the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two distinct optimization problems: choosing which words will be the constituents of the final grammar, and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we reduce the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10% smaller) in approximately the same amount of time as their competitors, both on a classical set of genomic sequences and on whole genomes of model organisms.
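To make the second subproblem concrete, the following is a minimal sketch of a minimal parsing step. It is not the authors' implementation. It assumes a simplified setting in which a single sequence is parsed with a fixed set of constituents, and each token is either one constituent or one literal character. The function name `minimal_parsing` and its parameters are hypothetical, chosen for illustration only. The sketch uses a shortest-path style dynamic program over positions in the sequence.

```python
def minimal_parsing(sequence, constituents):
    """Parse `sequence` into the fewest tokens possible.

    Hypothetical illustration: a token is either a single character or a
    word from `constituents`. A full grammar would also parse the
    constituents themselves; this sketch only parses one sequence.
    """
    n = len(sequence)
    # best[i] holds the minimal number of tokens needed for sequence[:i].
    best = [0] * (n + 1)
    # back[i] holds the start position of the last token in that parse.
    back = [0] * (n + 1)

    for i in range(1, n + 1):
        # A single literal character is always a valid last token.
        best[i] = best[i - 1] + 1
        back[i] = i - 1
        # Try every constituent that ends exactly at position i.
        for word in constituents:
            j = i - len(word)
            if j >= 0 and sequence.startswith(word, j) and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j

    # Walk the back-pointers to recover the tokens in order.
    tokens = []
    i = n
    while i > 0:
        j = back[i]
        tokens.append(sequence[j:i])
        i = j
    tokens.reverse()
    return tokens
```

For example, `minimal_parsing("abcabcab", {"abc", "ab"})` covers the sequence with three tokens. The loop over all constituents at every position is quadratic in the worst case, which is acceptable for a sketch but not tuned for whole genomes.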

Domains

Bioinformatics, Systems Biology [q-bio.QM]; Bioinformatics [q-bio.QM]; Machine Learning [cs.LG]; Information Theory [cs.IT]; Information Theory and Coding [math.IT]

Main file

preprint_jda.pdf (324.46 KB)

Origin: Files produced by the author(s)

Contributor: François Coste

https://inria.hal.science/inria-00536633

Submitted on: Tuesday, October 9, 2012, 14:34:48

Last modified on: Saturday, April 27, 2024, 03:11:20

Long-term archiving on: Tuesday, December 13, 2016, 18:18:38

Dates and versions

inria-00536633, version 1 (9 October 2012)

Identifiers

HAL Id: inria-00536633, version 1
DOI: 10.1016/j.jda.2011.04.006

Cite

Rafael Carrascosa, François Coste, Matthias Gallé, Gabriel Infante-Lopez. Searching for Smallest Grammars on Large Sequences and Application to DNA. Journal of Discrete Algorithms, 2012, Special issue on Stringology, Bioinformatics and Algorithms, 11, pp.62-72. ⟨10.1016/j.jda.2011.04.006⟩. ⟨inria-00536633⟩

Collections

INSTITUT-TELECOM ENPC EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA INSMI PARISTECH LIGM IRISA-D7 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UR1-MATH-NUM UNIV-EIFFEL JSE2024
