Motif Statistics

Pierre Nicodème 1 Bruno Salvy 1 Philippe Flajolet 1
1 ALGO - Algorithms
Inria Paris-Rocquencourt
Abstract : We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text. This covers «motifs» widely used in computational biology. Our approach is based on: (i) a constructive approach to classical results in theoretical computer science (automata and formal language theory), in particular, the rationality of generating functions of regular languages; (ii) analytic combinatorics that is used for deriving asymptotic properties from generating functions; (iii) computer algebra for determining generating functions explicitly, analysing generating functions and extracting coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces multivariate generating functions for the statistics under study. A fast computation of Taylor coefficients of the generating functions then yields exact values of the moments with typical application to random texts of size 30,000 while precise symptotic formulæ\ allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7 megabytes amino acid database PRODOM. We handled more than 88% of the standard collection of PROSITE motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
Type de document :
[Research Report] RR-3606, INRIA. 1999
Liste complète des métadonnées
Contributeur : Rapport de Recherche Inria <>
Soumis le : mercredi 24 mai 2006 - 11:47:16
Dernière modification le : samedi 17 septembre 2016 - 01:35:38
Document(s) archivé(s) le : jeudi 24 mars 2011 - 12:27:55



  • HAL Id : inria-00073074, version 1



Pierre Nicodème, Bruno Salvy, Philippe Flajolet. Motif Statistics. [Research Report] RR-3606, INRIA. 1999. 〈inria-00073074〉



Consultations de la notice


Téléchargements de fichiers