Learning Automata on Protein Sequences

François Coste 1 Goulven Kerbellec 1
1 SYMBIOSE - Biological systems and models, bioinformatics and sequences
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Pattern discovery is limited to position-specific characterizations like Prosite's patterns or profile-HMMs which are unable to handle, for instance, dependencies between amino acids distant in the sequence of a protein, but close in its three-dimensional structure. To overcome these limitations, we propose to learn automata on proteins. Inspired by grammatical inference and multiple alignment techniques, we introduce a sequence-driven approach based on the idea of merging ordered partial local multiple alignments (PLMA) under preservation or consistency constraints and on an identification of informative positions with respect to physico-chemical properties . The quality of the characterization is asserted experimentally on two difficult sets of proteins by a comparison with (semi)-manually designed patterns of Prosite and with state-of-the-art pattern discovery algorithms. Further leave-one-out experimentations show that learning more precise automata allows to gain in accuracy by increasing the classification margins.
Complete list of metadatas

Cited literature [35 references]  Display  Hide  Download

https://hal.inria.fr/inria-00180429
Contributor : François Coste <>
Submitted on : Friday, October 19, 2007 - 10:25:08 AM
Last modification on : Friday, November 16, 2018 - 1:24:20 AM
Long-term archiving on : Sunday, April 11, 2010 - 11:18:24 PM

Files

coste_kerbellec_jobim06.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00180429, version 1

Citation

François Coste, Goulven Kerbellec. Learning Automata on Protein Sequences. JOBIM, Jul 2006, Bordeaux, France. pp.199--210. ⟨inria-00180429⟩

Share

Metrics

Record views

389

Files downloads

276