HAL will be down for maintenance from Friday, June 10 at 4pm through Monday, June 13 at 9am. More information
Skip to Main content Skip to Navigation

Motif Statistics

Pierre Nicodème 1 Bruno Salvy 1 Philippe Flajolet 1
1 ALGO - Algorithms
Inria Paris-Rocquencourt
Abstract : We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text. This covers «motifs» widely used in computational biology. Our approach is based on: (i) a constructive approach to classical results in theoretical computer science (automata and formal language theory), in particular, the rationality of generating functions of regular languages; (ii) analytic combinatorics that is used for deriving asymptotic properties from generating functions; (iii) computer algebra for determining generating functions explicitly, analysing generating functions and extracting coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces multivariate generating functions for the statistics under study. A fast computation of Taylor coefficients of the generating functions then yields exact values of the moments with typical application to random texts of size 30,000 while precise symptotic formulæ\ allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7 megabytes amino acid database PRODOM. We handled more than 88% of the standard collection of PROSITE motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
Document type :
Complete list of metadata

Contributor : Rapport de Recherche Inria Connect in order to contact the contributor
Submitted on : Wednesday, May 24, 2006 - 11:47:16 AM
Last modification on : Thursday, February 3, 2022 - 11:14:35 AM
Long-term archiving on: : Thursday, March 24, 2011 - 12:27:55 PM


  • HAL Id : inria-00073074, version 1



Pierre Nicodème, Bruno Salvy, Philippe Flajolet. Motif Statistics. [Research Report] RR-3606, INRIA. 1999. ⟨inria-00073074⟩



Record views


Files downloads