Learning Automata on Protein Sequences

François Coste; Goulven Kerbellec

Conference paper, Year: 2006

François Coste

Role: Author
PersonId : 9592
IdHAL : francois-coste
ORCID : 0000-0001-9134-6557
IdRef : 133160203

Biological systems and models, bioinformatics and sequences

Goulven Kerbellec

Role: Author
PersonId : 830324

Biological systems and models, bioinformatics and sequences

Abstract

Pattern discovery is limited to position-specific characterizations such as Prosite patterns or profile HMMs, which cannot handle, for instance, dependencies between amino acids that are distant in a protein's sequence but close in its three-dimensional structure. To overcome these limitations, we propose to learn automata on proteins. Inspired by grammatical inference and multiple alignment techniques, we introduce a sequence-driven approach based on merging ordered partial local multiple alignments (PLMA) under preservation or consistency constraints, and on identifying informative positions with respect to physico-chemical properties. The quality of the characterization is assessed experimentally on two difficult sets of proteins by comparison with (semi-)manually designed Prosite patterns and with state-of-the-art pattern discovery algorithms. Further leave-one-out experiments show that learning more precise automata improves accuracy by increasing the classification margins.
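To make the idea of an automaton on protein sequences concrete, here is a minimal Python sketch. It is not the learning method of the paper: the `Automaton` class, the `HYDROPHOBIC` group and the example transitions are all assumptions chosen for illustration. It only shows how transitions labeled with sets of amino acids (for example a physico-chemical group) and alternative paths can describe sequences that a single position-specific pattern would not capture.

```python
# Illustrative sketch only: the residue groups and the automaton below are
# assumptions made for this example, not taken from the paper.
HYDROPHOBIC = frozenset("AVLIMFW")
ANY = frozenset("ACDEFGHIKLMNPQRSTVWY")


class Automaton:
    """Nondeterministic automaton whose transitions are labeled by residue sets."""

    def __init__(self, start, accepting):
        self.start = start
        self.accepting = set(accepting)
        self.transitions = {}  # state -> list of (allowed residues, next state)

    def add_transition(self, src, residues, dst):
        self.transitions.setdefault(src, []).append((frozenset(residues), dst))

    def accepts(self, sequence):
        # Track every state reachable after reading each residue.
        current = {self.start}
        for residue in sequence:
            current = {
                dst
                for state in current
                for residues, dst in self.transitions.get(state, [])
                if residue in residues
            }
            if not current:
                return False
        return bool(current & self.accepting)


# Two alternative paths lead to the same accepting state 3:
#   0 -C-> 1 -any-> 2 -hydrophobic-> 3
#   0 -H-> 4 -hydrophobic-> 3
automaton = Automaton(start=0, accepting=[3])
automaton.add_transition(0, "C", 1)
automaton.add_transition(1, ANY, 2)
automaton.add_transition(2, HYDROPHOBIC, 3)
automaton.add_transition(0, "H", 4)
automaton.add_transition(4, HYDROPHOBIC, 3)

print(automaton.accepts("CGL"))  # follows the first path
print(automaton.accepts("HV"))   # follows the second path
print(automaton.accepts("CGD"))  # D is not in the hydrophobic group
```

In this toy example the two branches have different lengths, which a fixed-length, position-by-position pattern cannot express directly.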

Keywords

Grammatical Inference; Automata; Proteins; Pattern Discovery

Domains

Machine Learning [cs.LG]; Bioinformatics, Systems Biology [q-bio.QM]

Main file

coste_kerbellec_jobim06.pdf (403.58 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/inria-00180429

Submitted on: Friday, October 19, 2007, 10:25:08

Last modified on: Friday, March 24, 2023, 14:52:49

Long-term archiving on: Sunday, April 11, 2010, 23:18:24

Dates and versions

inria-00180429 , version 1 (19-10-2007)

Identifiers

HAL Id : inria-00180429 , version 1

Cite

François Coste, Goulven Kerbellec. Learning Automata on Protein Sequences. JOBIM, Jul 2006, Bordeaux, France. pp.199--210. ⟨inria-00180429⟩

Collections

EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-D7 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UR1-MATH-NUM
