Motif Statistics

Pierre Nicodème 1 Bruno Salvy 1 Philippe Flajolet 1
1 ALGO - Algorithms
Inria Paris-Rocquencourt
Abstract : We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text. This covers «motifs» widely used in computational biology. Our approach is based on: (i) a constructive approach to classical results in theoretical computer science (automata and formal language theory), in particular, the rationality of generating functions of regular languages; (ii) analytic combinatorics that is used for deriving asymptotic properties from generating functions; (iii) computer algebra for determining generating functions explicitly, analysing generating functions and extracting coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces multivariate generating functions for the statistics under study. A fast computation of Taylor coefficients of the generating functions then yields exact values of the moments with typical application to random texts of size 30,000 while precise symptotic formulæ\ allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7 megabytes amino acid database PRODOM. We handled more than 88% of the standard collection of PROSITE motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
Document type :
Reports
Complete list of metadatas

https://hal.inria.fr/inria-00073074
Contributor : Rapport de Recherche Inria <>
Submitted on : Wednesday, May 24, 2006 - 11:47:16 AM
Last modification on : Friday, May 25, 2018 - 12:02:02 PM
Long-term archiving on : Thursday, March 24, 2011 - 12:27:55 PM

Identifiers

  • HAL Id : inria-00073074, version 1

Collections

Citation

Pierre Nicodème, Bruno Salvy, Philippe Flajolet. Motif Statistics. [Research Report] RR-3606, INRIA. 1999. ⟨inria-00073074⟩

Share

Metrics

Record views

154

Files downloads

283