Learning Automata on Protein Sequences

François Coste; Goulven Kerbellec

Conference Papers, Year : 2006

François Coste

Function : Author
PersonId : 9592
IdHAL : francois-coste
ORCID : 0000-0001-9134-6557
IdRef : 133160203

Biological systems and models, bioinformatics and sequences

Goulven Kerbellec

Function : Author
PersonId : 830324

Biological systems and models, bioinformatics and sequences

Abstract

Pattern discovery is limited to position-specific characterizations such as Prosite patterns or profile HMMs, which cannot handle, for instance, dependencies between amino acids that are distant in the sequence of a protein but close in its three-dimensional structure. To overcome these limitations, we propose to learn automata on proteins. Inspired by grammatical inference and multiple alignment techniques, we introduce a sequence-driven approach based on merging ordered partial local multiple alignments (PLMA) under preservation or consistency constraints, and on identifying informative positions with respect to physico-chemical properties. The quality of the characterization is assessed experimentally on two difficult sets of proteins by comparison with the (semi-)manually designed patterns of Prosite and with state-of-the-art pattern discovery algorithms. Further leave-one-out experiments show that learning more precise automata improves accuracy by increasing the classification margins.

Keywords

Grammatical Inference, Automata, Proteins, Pattern Discovery

Domains

Machine Learning [cs.LG], Quantitative Methods [q-bio.QM]

Main file

coste_kerbellec_jobim06.pdf (403.58 KB)

Origin : Files produced by the author(s)

https://inria.hal.science/inria-00180429

Submitted on : Friday, October 19, 2007, 10:25:08 AM

Last modification on : Friday, March 24, 2023, 2:52:49 PM

Long-term archiving on : Sunday, April 11, 2010, 11:18:24 PM

Dates and versions

inria-00180429 , version 1 (19-10-2007)

Identifiers

HAL Id : inria-00180429 , version 1

Cite

François Coste, Goulven Kerbellec. Learning Automata on Protein Sequences. JOBIM, Jul 2006, Bordeaux, France. pp. 199-210. ⟨inria-00180429⟩

Collections

EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-D7 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UR1-MATH-NUM
