Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules

Valentina Boeva (valeyo@yandex.ru), Julien Clément (Julien.Clement@info.unicaen.fr), Mireille Régnier (Mireille.Regnier@inria.fr), Mikhail A Roytberg (mroytberg@impb.psn.ru), Vsevolod J Makeev (makeev@genetika.ru)

Institute of Genetics and Selection of Industrial Microorganisms, GosNIIGenetika, 117545 Moscow, Russia

MIGEC, INRIA Rocquencourt, 78153 Le Chesnay, France

GREYC, CNRS UMR 6072, Laboratoire d'informatique, 14032 Caen, France

Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Puschino, Moscow Region, Russia

Puschino State University, Puschino, Moscow Region, Russia

Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia

Algorithms for Molecular Biology 2007, 2(1):13. ISSN 1748-7188. doi:10.1186/1748-7188-2-13. PMID: 17927813. http://www.almob.org/content/2/1/13. Received 13 July 2007; accepted 10 October 2007; published 10 October 2007. © 2007 Boeva et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap.

Results

We developed and implemented an algorithm computing the p-value that s different motifs occur respectively k₁, ..., k_s or more times, possibly overlapping, in a random text. Motifs can be represented with most of the popular motif models, but in all cases without indels. Zero- or first-order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA.

Method

The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs in O(n|Σ|(m|ℋ| + K|Σ|^K) ∏_i k_i) time, where n is the length of the text, |Σ| is the alphabet size, m is the maximal motif length, |ℋ| is the total number of words in motifs, K is the order of the Markov model, and k_i is the number of occurrences of the ith motif.

Conclusion

The primary objective of the program is to assess the likelihood that a given DNA segment is a CRM regulated by a known set of regulatory factors. In addition, the program can be used to select the appropriate threshold for PWM scanning. Another application is assessing the similarity of different motifs.

Availability

Project web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/AhoPro/

Background

During the past few years, a number of computational tools have been designed [1-3] for locating potential transcription factor binding sites (TFBSs) in nucleotide sequences, e.g., in compilations of sequences upstream of putative co-regulated genes. In parallel, experimental approaches were developed [4], which allowed identification of binding motifs for many different transcription factors. Experimental [5] and bioinformatical [6] studies demonstrated that sequences of regulatory DNA that bind transcription factors can exhibit many different types of architecture. In eukaryotes, TFBSs found in DNA sequences often form rather dense clusters: this was demonstrated both by experimental [5,7] and computational [8,9] methods. Such clusters can contain sites binding the same factor or several different factors [10]. The cis-regulatory module (CRM) in this case contains, respectively, homotypic or heterotypic clusters of motifs specifically recognized by binding proteins [11].

The particular arrangement of motifs in a homotypic or heterotypic cluster is not random, and it is commonly accepted that the motif arrangement within a CRM is important for its functionality [12-20]. Bioinformatics studies indicate that antagonistic factors often bind to overlapping sites [21], whereas synergistic factors are often positioned within a fixed distance [20], often close to a multiple of 10.5 bp, the DNA double-helix pitch value [21].

Non-random arrangements of TFBSs within regulatory segments of DNA sequences are exploited in several TFBS identification tools, and it was observed that cooperativity-based discrimination of TFBSs surpasses the performance of models for individual TFBSs [22].

On observing a cluster of TFBSs in some genome segment, one can calculate the probability of observing similar site arrangements in a random sequence. This idea of evaluating the statistical significance of heterotypic clusters of sites was implemented in many programs, including ClusterDraw [23], ModuleSearcher [24], MCAST [25], eCIS-ANALYST [26], Cister [27], Cluster-Buster [28] and TargetExplorer [29]. At the moment, such programs use empirical procedures, such as motif counting in biological and simulated sequences, to assess the significance of observed site clustering. It is, however, highly desirable to have a good statistical measure of site clustering, and we believe that the best measure is the p-value of obtaining the observed cluster by chance in a random sequence of a Markov or Bernoulli (the common name for a Markov chain of order 0) type. In the case of heterotypic clusters, one needs to take into account possible overlapping occurrences of different motifs, a problem that was considered difficult until now [30]. In the case of homotypic clusters, an approximate statistical scoring function was constructed [8,31]; this approach has been implemented in algorithms such as FLYENHANCER [32], SCORE [33], and CLUSTER [34]. However, this approximation performs poorly for highly overlapping TFBSs. One cannot ignore site overlapping if the motifs are fuzzy (highly degenerate), which is often the case for so-called "shadow sites" [31]. In the case of heterotypic clusters, competing factors can bind even to very well determined motifs that overlap.

Representation of protein binding motifs in nucleotide sequences

Experimental methods on protein binding to DNA usually locate some DNA segment, or word in DNA text, as a probable binding target. Proteins can bind to similar DNA words [4], the whole assembly of which can be called a motif. The simplest motif representation is the enumeration of sequences that can be bound by a transcription factor (TF) [35]. Sometimes, information about binding sites can be found in SELEX [36,37] or Protein Binding Microarray (PBM) experiments [38]. However, it is possible that such experiments do not give the exhaustive list of sequences of binding sites, so one needs to expand the list of putative binding sites using an appropriate criterion, which brings about the problem of the generalization of several known examples.

For instance, several words aligned with mismatches can be generalized to an IUPAC string (like RSTGACTNMNW for AP-1 binding sites [39]) by disregarding correlated substitutions in different motif positions [40]. Another example of generalization is the set of words that deviate from a consensus word by fewer than a given number of mismatches.
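Both generalizations can be turned into explicit word lists. The following Python sketch is an illustration only: the function names are hypothetical, the IUPAC table is the standard nucleotide code, and the mismatch neighbourhood is parameterized by a maximum number of allowed mismatches, which is an assumption about how the bound is stated.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_iupac(pattern):
    """Return the sorted list of DNA words matched by an IUPAC string."""
    choices = (IUPAC[code] for code in pattern)
    return sorted("".join(word) for word in product(*choices))

def mismatch_neighbourhood(consensus, max_mismatches, alphabet="ACGT"):
    """Return all words of the consensus length with at most max_mismatches mismatches."""
    words = []
    for letters in product(alphabet, repeat=len(consensus)):
        mismatches = sum(a != b for a, b in zip(letters, consensus))
        if mismatches <= max_mismatches:
            words.append("".join(letters))
    return words

ap1_words = expand_iupac("RSTGACTNMNW")
print(len(ap1_words))                           # 256 words
print(len(mismatch_neighbourhood("TGACT", 1)))  # 1 + 5 * 3 = 16 words
```

The IUPAC string for AP-1 expands to 2 · 2 · 4 · 2 · 4 · 2 = 256 words, which shows how quickly a degenerate motif turns into a large word set.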

The most popular way to represent binding sites is a Position Weight Matrix (PWM), which is also called a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM) [41]. For a motif of length D over an alphabet Σ with |Σ| symbols, a PWM is a |Σ| × D matrix, with each row corresponding to a symbol of the alphabet Σ and each column to a position in the motif. For DNA texts, one has Σ = {A, C, G, T}. The PWM score of a D-substring is defined as ∑_{i=1}^{D} m_{ω(i), i}, where i represents a position in the D-substring, ω(i) the symbol at position i in the substring, and m_{α, i} the score in row α, column i of the matrix. So, given a cutoff value, one gets a list of D-sequences that score higher than this cutoff; this list represents the possible DNA binding sites for the protein.
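The conversion of a PWM and a cutoff into a word list can be sketched as follows. This is a minimal Python illustration by exhaustive enumeration, practical only for short motifs; the toy matrix, the cutoff, and the function names are made up for the example and do not describe any real transcription factor.

```python
from itertools import product

def pwm_score(word, pwm):
    """Sum the matrix entries m[ω(i)][i] over the positions i of the word."""
    return sum(pwm[symbol][i] for i, symbol in enumerate(word))

def words_above_cutoff(pwm, cutoff, alphabet="ACGT"):
    """List every word of the motif length whose PWM score exceeds the cutoff."""
    length = len(pwm[alphabet[0]])
    return ["".join(letters) for letters in product(alphabet, repeat=length)
            if pwm_score(letters, pwm) > cutoff]

# Hypothetical 4 x 3 matrix: one row per symbol, one column per motif position.
toy_pwm = {
    "A": [1.0, -1.0, 0.5],
    "C": [-1.0, 1.0, -0.5],
    "G": [-1.0, -1.0, 0.5],
    "T": [0.5, -1.0, -0.5],
}

motif = words_above_cutoff(toy_pwm, 2.0)
print(motif)  # ['ACA', 'ACG'], both with score 2.5
```

Lowering the cutoff enlarges the word set, which is why the choice of threshold matters for any later p-value calculation.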

Any of the three motif representations above can be converted to a list of words. The same is true for many other representations of motifs. In this study, we consider only the motifs that can be represented as a set of words.

P-value for clusters of motif occurrences, problem formulation

The objective of this work is to develop a statistical criterion to assess clustering of TFBSs. Intuitively, a TFBS cluster is a DNA segment simultaneously containing "too many" TFBSs for given factor proteins; such a segment can often operate as a CRM regulated by these TFs. From a formal point of view, the problem we address here is as follows. Let s sets of words ℋ₁, ..., ℋ_s be given. Typically, each set ℋ_i is associated with a TF motif. Given an s-tuple of integers (k₁, ..., k_s), we compute the corresponding p-value, that is, the probability of finding at least k_i occurrences of words from each set ℋ_i in a random text of size n. We assume that the texts where motifs are searched are randomly generated by a Bernoulli process or a Markov model of order K.
If (k₁, ..., k_s) occurrences of motifs ℋ₁, ..., ℋ_s are found in a DNA segment, the p-value can be used to infer whether such numbers of occurrences could be found by chance.
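For very small n the p-value defined above can be obtained by enumerating every text, which gives a reference value for testing faster methods. The Python sketch below does this under a Bernoulli model; the function names and toy motifs are illustrative assumptions, and the running time grows as |Σ|ⁿ, so it is not a practical algorithm.

```python
from fractions import Fraction
from itertools import product

def count_occurrences(text, words):
    """Count possibly overlapping occurrences of any word of the set in text."""
    return sum(text.startswith(w, i) for i in range(len(text)) for w in words)

def brute_force_p_value(motifs, ks, n, probs):
    """P(every motif i occurs at least ks[i] times) in a Bernoulli text of length n."""
    total = Fraction(0)
    for letters in product(probs, repeat=n):
        text = "".join(letters)
        if all(count_occurrences(text, m) >= k for m, k in zip(motifs, ks)):
            weight = Fraction(1)
            for letter in text:
                weight *= probs[letter]
            total += weight
    return total

uniform = {letter: Fraction(1, 4) for letter in "ACGT"}

# One motif {AA}, at least one occurrence in a text of length 3.
print(brute_force_p_value([["AA"]], [1], 3, uniform))                # 7/64
# Two motifs {AA} and {AC}, each at least once: only the text AAC qualifies.
print(brute_force_p_value([["AA"], ["AC"]], [1, 1], 3, uniform))     # 1/64
```

In the second call the occurrences of the two motifs overlap in AAC, which is exactly the situation the problem statement requires to be counted.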

Related work

Most previous works address counting problems for one set of several words ℋ. In contrast, in this paper we deal with separate counting for several sets of several words ℋ₁, ..., ℋ_s, where each set ℋ_j represents one TFBS motif.

All methods for calculating p-values for multiple occurrences of words from a set ℋ study some basic languages. Let L_n(ℋ; k) be the set of texts of length n containing at least k occurrences of ℋ. The desired p-value is therefore the probability P(L_n(ℋ; k)).
Let ℛ_ℋ^k be the set of texts of all lengths that contain exactly k occurrences of words of ℋ, the last one occurring as a suffix [42]. For any H_j in ℋ, let ℛ_{H_j}^k be the subset of ℛ_ℋ^k where H_j is a suffix.
One observes that a text contains at least k occurrences if and only if it admits a prefix in ℛ_ℋ^k = ∪_{H_j ∈ ℋ} ℛ_{H_j}^k. One defines r_j^k(p) as the probability that a text of size p is in the set ℛ_{H_j}^k.
If no word in ℋ is a subword of another word in ℋ, the probability P(L_n(ℋ; k)) of finding at least k occurrences of words from ℋ in a random text of length n satisfies

\[ P(L_n(\mathcal{H}; k)) = \sum_{p \le n} \sum_{H_j \in \mathcal{H}} r_j^k(p) \]

Therefore, one tries to compute the sequence of values (r_j^k(p)).
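The decomposition can be checked numerically on a toy case. The Python sketch below is an illustration by exhaustive enumeration under a Bernoulli model, not one of the cited methods: it computes the sum over j of r_j^k(p) directly from the definition of ℛ_ℋ^k and compares the total over p ≤ n with the probability of at least k occurrences. The function names and the example motif {AA, ACA} are assumptions chosen so that no word is a subword of another.

```python
from fractions import Fraction
from itertools import product

def occurrences(text, words):
    """Count possibly overlapping occurrences of the words in text."""
    return sum(text.startswith(w, i) for i in range(len(text)) for w in words)

def text_probability(text, probs):
    """Probability of text under a Bernoulli model with letter probabilities probs."""
    weight = Fraction(1)
    for letter in text:
        weight *= probs[letter]
    return weight

def r_k(words, k, p, probs):
    """Probability that a text of length p has exactly k occurrences, the last one a suffix."""
    total = Fraction(0)
    for letters in product(probs, repeat=p):
        text = "".join(letters)
        ends_with_word = any(text.endswith(w) for w in words)
        if ends_with_word and occurrences(text, words) == k:
            total += text_probability(text, probs)
    return total

def p_at_least(words, k, n, probs):
    """Probability that a text of length n contains at least k occurrences."""
    total = Fraction(0)
    for letters in product(probs, repeat=n):
        text = "".join(letters)
        if occurrences(text, words) >= k:
            total += text_probability(text, probs)
    return total

uniform = {letter: Fraction(1, 4) for letter in "ACGT"}
words = ["AA", "ACA"]
direct = p_at_least(words, 2, 5, uniform)
summed = sum(r_k(words, 2, p, uniform) for p in range(1, 6))
print(direct == summed)  # True
```

The equality relies on the subword condition: at most one word can end at any text position, so the prefix ending at the k-th occurrence is unique.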

Linear induction

In the first class of methods [43-46], one computes, implicitly or explicitly, the probabilities P(L_n(ℋ; k)) up to a given text length n. Such methods are intrinsically linear in n. The works [43-46] rely on a recurrence relation on r_j^k(n) that extends the one originally given in [47]. Typically, one step costs O(|ℋ|m), where ℋ is a set of words of length m and |ℋ| is its cardinality.
Time complexity is O(n|ℋ|m) and, relying on a combinatorial property, [44] achieves optimal space complexity O(|ℋ| log |ℋ|m). However, the authors of [44] do not consider occurrences of several motifs and restrict themselves to the Bernoulli model. The authors of [43] consider the Markov model, still using one motif for TFBSs.

Algebraic Formulae

In a second class of methods [47-52], a preprocessing step computes the generating functions

\[ r_j^k(z) = \sum_{n} r_j^k(n) z^n. \]

In a second step, the probabilities P(L_n(ℋ; k)) are either extracted from the generating function or approximated.
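As a minimal worked case (an illustration added here, not taken from the cited works), consider a motif consisting of a single one-letter word a that occurs with probability p in a Bernoulli text, and write q = 1 − p. The generating function and the extracted p-value can then be written in closed form:

```latex
% Illustration: motif = single letter a, Bernoulli model, p = P(a), q = 1 - p.
% A text of length n has its first occurrence as a suffix iff
% its first n-1 letters differ from a and its last letter is a.
\[
r^{1}(n) = p\,q^{\,n-1}, \qquad
r^{1}(z) = \sum_{n \ge 1} p\,q^{\,n-1} z^{n} = \frac{p z}{1 - q z}.
\]
% Extracting the coefficients and summing them up to n gives the p-value for k = 1.
\[
P(L_n(\mathcal{H}; 1)) = \sum_{t=1}^{n} p\,q^{\,t-1} = 1 - q^{n}.
\]
% The waiting times between successive occurrences are independent, hence for k occurrences:
\[
r^{k}(z) = \left( \frac{p z}{1 - q z} \right)^{k}.
\]
```

For longer or overlapping words the generating functions are no longer this simple, because the overlaps between words enter the expressions.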

In [49,53], the r_j^k(z) are the solutions of a system of equations. To derive these equations, the authors build an automaton that recognizes the languages ℛ_{H_j}^k (one can prove that they are regular).

A language approach [50] or an induction [48] leads to a formal expression that depends on the word overlaps. The main drawback is that these methods need to compute the determinant of a matrix of polynomials with a huge dimension, e.g. O(|ℋ|). This O(|ℋ|²) symbolic computation may be more expensive than the extraction step or the linear computation above, which involve arithmetic operations on real numbers.

When the preprocessing step is achievable, the extraction step amounts to solving a linear recurrence of degree m|ℋ|; therefore, its complexity is O(m|ℋ|n), and a classical optimization yields O(m|ℋ| log n). There exist some good implementations that are numerically stable. One may cite the REGEXPCOUNT [54] or EXCEP [55] programs that rely on the Fast Fourier Transform.
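The logarithmic dependence on n can be illustrated with binary powering of a companion matrix, which is one standard way to evaluate a linear recurrence; whether this is the optimization meant in the text is an assumption. The Python sketch below uses naive matrix products, so its cost in the recurrence degree d is O(d³ log n) and it shows only the behaviour in n. The recurrence in the example, a(n) = (3/4)·a(n−1) + (3/16)·a(n−2) for the probability that a uniform DNA text of length n avoids the word AA, was derived by hand for this illustration.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def linear_recurrence_term(coeffs, initial, n):
    """Return u(n) for u(t) = coeffs[0]*u(t-1) + ... + coeffs[d-1]*u(t-d).

    initial holds u(0), ..., u(d-1); the number of matrix products is O(log n).
    """
    d = len(coeffs)
    if n < d:
        return initial[n]
    # Companion matrix: first row holds the coefficients, the rest shifts the state.
    companion = [list(coeffs)] + [
        [1 if j == i else 0 for j in range(d)] for i in range(d - 1)
    ]
    result = [[1 if i == j else 0 for j in range(d)] for i in range(d)]
    power = n - (d - 1)
    while power > 0:
        if power & 1:
            result = mat_mul(result, companion)
        companion = mat_mul(companion, companion)
        power >>= 1
    # State vector (u(d-1), ..., u(0)) as a column.
    state = [[initial[d - 1 - i]] for i in range(d)]
    return mat_mul(result, state)[0][0]

coeffs = [Fraction(3, 4), Fraction(3, 16)]
initial = [Fraction(1), Fraction(1)]
print(1 - linear_recurrence_term(coeffs, initial, 3))  # 7/64
```

The value 7/64 is the probability that a uniform text of length 3 contains AA at least once; the recurrence has degree 2 = m|ℋ| for this single word of length 2.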

Finally, approximations are available whose computation time is constant with respect to n, but not with respect to ℋ. One approach is the compound Poisson approximation [56], but this approximation is not precise enough [57]. Asymptotic results can also be derived from the algebraic formulae above [44,58], without needing an explicit expression for r_j^k(z), and therefore avoiding the expensive determinant computation. Time complexity is typically that of computing all possible overlaps, that is, approximately O(|ℋ|²). This yields extremely precise results when the expectation of the number of occurrences, nP(H), is very small [59] or close to 1 [51] (the case studied the most often). The case nP(H) ~ 2 is treated in [60]. Nevertheless, extension to larger values of k or to multioccurrences and multisets is still open.

Methods

Here we consider in detail the approach we suggest.

A motif assigned to a TF is a finite set of words ℋ = {H₁, ..., H_r}, where each word represents one putative TF binding site in DNA. Note that words in a motif can generally be of different lengths. However, no word from ℋ can contain another word from ℋ as a substring.
We consider, as an occurrence of motif ℋ in text T, any occurrence of any word H_j ∈ ℋ in T. Below, all texts and words in motifs are sequences over a given alphabet Σ.

To be more precise, there is a probability distribution defined on the set Σⁿ of all texts of length n over the alphabet Σ; the most widely used models are random Bernoulli trials and a Markov model of order K. Denote by L_n(ℋ₁, ..., ℋ_s; k₁, ..., k_s) the set of all texts of length n containing at least k_i possibly overlapping occurrences of each motif ℋ_i, i = 1, ..., s. Then the desired p-value is the probability P(L_n(ℋ₁, ..., ℋ_s; k₁, ..., k_s)) of the set L_n(ℋ₁, ..., ℋ_s; k₁, ..., k_s) with respect to the given probability distribution on Σⁿ.
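To make the definition concrete, the p-value can be evaluated directly by enumerating all texts of length n. The sketch below is only an illustration of the definition under a Bernoulli model, not the algorithm developed in this paper: its cost grows as |Σ|ⁿ, so it is usable only for very short texts. The function names (`count_occurrences`, `in_L`, `brute_force_p_value`) and the representation of motifs as Python sets of strings are assumptions made for this example.

```python
from itertools import product

def count_occurrences(text, motif):
    """Count possibly overlapping occurrences of any word of `motif` in `text`."""
    return sum(text.startswith(w, i) for i in range(len(text)) for w in motif)

def in_L(text, motifs, ks):
    """True if `text` has at least ks[i] occurrences of motifs[i] for every i."""
    return all(count_occurrences(text, m) >= k for m, k in zip(motifs, ks))

def brute_force_p_value(n, motifs, ks, p):
    """Sum the Bernoulli probabilities of all texts of length n that lie in L_n.

    `p` maps each letter of the alphabet to its probability.
    """
    total = 0.0
    for letters in product(p, repeat=n):
        text = "".join(letters)
        if in_L(text, motifs, ks):
            prob = 1.0
            for a in letters:
                prob *= p[a]
            total += prob
    return total

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# One motif consisting of the single word "A", at least one occurrence, n = 3:
# the result should agree with 1 - 0.75**3.
print(brute_force_p_value(3, [{"A"}], [1], uniform))
```

Such an enumeration is mainly useful as a reference against which a faster implementation can be checked on small inputs.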

Our approach to the calculation of this p-value is similar to that published in [61], where it was used to calculate seed sensitivity in local alignment search. The approach exploits the fact that the algorithm of Aho and Corasick [62] can be modified to efficiently determine whether a given text belongs to the set L_n(ℋ₁, ..., ℋ_s; k₁, ..., k_s). The ideas published in [61] and [62] can be adapted to compute the probability P(L_n(ℋ₁, ..., ℋ_s; k₁, ..., k_s)) that a random text T_n ∈ Σⁿ belongs to the set L_n(ℋ₁, ..., ℋ_s; k₁, ..., k_s).

We start from the simplest case of one motif ℋ, for which we calculate the probability P(L_n(ℋ; 1)) that the text T_n contains at least one occurrence of the motif with respect to a Bernoulli probability distribution. More complicated cases (arbitrary number of occurrences, arbitrary number of motifs, Markov distribution) are discussed in the following sections.

Construction of Aho-Corasick traversal

Aho and Corasick [62] proposed an algorithm that determines whether a given text T contains an occurrence of a word from a given set ℋ. The basic data structure is a prefix tree, which is a variant of the classical trie T(ℋ) [42] that may be built on the set of words ℋ. Let Q_ℋ denote the set of prefixes of these words. In the following, we identify a word q ∈ Q_ℋ with the node Node(q) at the end of the branch labeled by q. In particular, the root is identified with the empty string ε. The length of a prefix q is the depth of Node(q).

The classic Aho-Corasick algorithm is a tree traversal determined by a transition function δ: Q_ℋ × Σ → Q_ℋ defined as follows. For any pair (p, a) in Q_ℋ × Σ, δ(p, a) is the longest suffix of the concatenation pa that belongs to Q_ℋ. Note that δ(p, a) = pa if and only if pa ∈ Q_ℋ.

Given a text T read from left to right, let T[i] denote the letter of T at position i. Let q_i be the longest suffix of the text T[1] ⋯ T[i] that belongs to Q_ℋ. The sequence of nodes visited during the traversal is defined by the words q_i, which satisfy the inductive relationship

∀i ≥ 0, q_{i+1} = δ(q_i, T[i + 1]),

with the initial condition q₀ = ε.

Example: Let ℋ be the set {AAA, AAC, ACA, ACC, CCT}. The corresponding tree T(ℋ) is depicted in Figure 1. Values of the δ function are given in Table 1. The Aho-Corasick traversal of the tree T(ℋ) according to the text T = 'ATGCCAACCTT' produces the following sequence of nodes {q_i}_{i ≥ 1} in Q_ℋ (the numbers of the corresponding nodes in Figure 1 are shown in square brackets): A[1], ε[0], ε[0], C[2], CC[5], A[1], AA[3], AAC[7], ACC[9], CCT[10], ε[0].
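The traversal in this example can be reproduced with a few lines of Python that implement δ literally from its definition, as the longest suffix of pa that belongs to the prefix set. This is a sketch for illustration only: it searches suffixes naively instead of using the efficient Aho-Corasick construction, and the helper names (`prefixes`, `delta`, `traverse`) are not taken from the paper. Nodes are represented by their prefix strings, with the empty string standing for ε.

```python
def prefixes(words):
    """Return the set of all prefixes of the given words (including the empty prefix)."""
    return {w[:k] for w in words for k in range(len(w) + 1)}

def delta(q, a, Q):
    """Transition function: the longest suffix of q + a that belongs to Q."""
    s = q + a
    for start in range(len(s) + 1):
        if s[start:] in Q:          # suffixes are tried from longest to shortest
            return s[start:]

def traverse(text, words):
    """Return the sequence q_1, q_2, ... of nodes visited while reading `text`."""
    Q = prefixes(words)
    q, visited = "", []             # q_0 is the root (empty string)
    for a in text:
        q = delta(q, a, Q)
        visited.append(q)
    return visited

motif = {"AAA", "AAC", "ACA", "ACC", "CCT"}
print(traverse("ATGCCAACCTT", motif))
# ['A', '', '', 'C', 'CC', 'A', 'AA', 'AAC', 'ACC', 'CCT', '']
```

The marked nodes AAC, ACC and CCT appear in the output, which reflects the three overlapping occurrences of words from the motif in the text.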

Table 1

Values of the δ(q, α) function for q ∈ Q_ℋ and α = A, C, G, T, constructed for the set ℋ = {AAA, AAC, ACA, ACC, CCT}.

Figure 1

Tree T(ℋ) for the set ℋ = {AAA, AAC, ACA, ACC, CCT}, with dashed links for the δ function. Dashed colored links represent the δ function for the internal node (5), in red, and for the marked node (7), corresponding to the word AAC ∈ ℋ, in purple.

The tree T(ℋ) and the transition function δ can be efficiently constructed with the algorithm proposed by Aho and Corasick [62]. Both the time and the space required by the algorithm are proportional to the sum of the lengths of all words from ℋ.
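A minimal sketch of such a construction is given below, assuming that the words of ℋ are Python strings over a known alphabet. It follows the standard breadth-first scheme with failure links; the names `build_automaton`, `child` and `fail` are illustrative and do not come from the paper, and node indices follow insertion order, so they need not coincide with the numbering in Figure 1. Note that this sketch tabulates δ for every node and every letter, which takes space proportional to |Q_ℋ|·|Σ|; the bound in terms of the sum of word lengths applies to the trie and the failure links.

```python
from collections import deque

def build_automaton(words, alphabet):
    """Build the trie of `words` and the full transition table delta.

    Returns (nodes, delta), where nodes[k] is the prefix labeling node k and
    delta[k][a] is the index of the node reached from node k on letter a.
    """
    nodes = [""]                    # node 0 is the root (empty prefix)
    child = [{}]                    # trie edges of each node
    for w in words:
        k = 0
        for a in w:
            if a not in child[k]:
                child[k][a] = len(nodes)
                nodes.append(nodes[k] + a)
                child.append({})
            k = child[k][a]

    fail = [0] * len(nodes)         # longest proper suffix that is also a node
    delta = [dict() for _ in nodes]
    queue = deque()
    for a in alphabet:              # transitions from the root
        if a in child[0]:
            delta[0][a] = child[0][a]
            queue.append(child[0][a])
        else:
            delta[0][a] = 0
    while queue:                    # breadth-first: shallower nodes are done first
        k = queue.popleft()
        for a in alphabet:
            if a in child[k]:
                c = child[k][a]
                fail[c] = delta[fail[k]][a]
                delta[k][a] = c
                queue.append(c)
            else:
                delta[k][a] = delta[fail[k]][a]
    return nodes, delta

nodes, delta = build_automaton(["AAA", "AAC", "ACA", "ACC", "CCT"], "ACGT")
print(len(nodes))                   # number of prefixes, including the empty one
```

The table `delta` built this way can be compared entry by entry with the definition of δ as the longest suffix of pa in Q_ℋ, which is a convenient correctness check for small motifs.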

The combination of the tree T(ℋ) and the transition function δ allows solving numerous pattern matching problems: searching for the first occurrence of a word from a given set, searching for all occurrences, word counting, etc.

Bernoulli text model. Probability to find at least one occurrence of a single motif

In this section we consider the simplest case: computing the p-value for a single motif in a text T_n of length n, assuming that T_n is generated by independent Bernoulli random trials over the alphabet Σ. The algorithm computes the probabilities P(L_n(ℋ; 1)) by induction on n.

To describe the algorithm, we divide the set Σⁱ of all texts T_i of length i into classes of texts that do and do not contain occurrences of ℋ.

Definition 1 A text T_i belongs to the class C_i(0; q) iff

1. The length of T_i is i,

2. T_i does not contain words from ℋ,

3. The traversal AC(T(ℋ), T_i) ends at node q.

A text T_i belongs to the class G_i(1) iff

(i) The length of T_i is i,

(ii) T_i contains at least one occurrence of a word from ℋ.

For a given number i larger than m, the classes C_i(0; q), where q is in Q_ℋ\ℋ, together with the class G_i(1), form a partition of the set Σⁱ of all texts of length i; i.e., any text of length i belongs either to a class C_i(0; q) for some q in Q_ℋ\ℋ or to the class G_i(1). Indeed, condition 3 means that the longest suffix of T_i in Q_ℋ is q. It follows from condition 2 that the classes C_i(0; q) are empty if q is in ℋ. A text T_i of length i is in G_i(1) if and only if a node of ℋ was visited during the traversal.

Let P(C_n(0; q)) and P(G_n(1)) denote the probabilities that a text T_n belongs to the classes C_n(0; q) and G_n(1), respectively. Then L_n(ℋ; 1) = G_n(1); therefore, the desired p-value P(L_n(ℋ; 1)) is equal to P(G_n(1)).

The algorithm calculates the probabilities P(C_i(0; q)) and P(G_i(1)) by induction on the length i. For i = 0, these probabilities are: P(C₀(0; ε)) = 1; P(C₀(0; q)) = 0 for any q ≠ ε; P(G₀(1)) = 0.

The values of P(C_{i+1}(0; q)) and P(G_{i+1}(1)) are calculated using only the values of P(C_i(0; q)) and P(G_i(1)). Therefore, the required space is proportional to the size of Q_ℋ (see section Extensions and complexity below).

The calculation of the values P(C_{i+1}(0; q)) and P(G_{i+1}(1)) is based on the following observations. Let U be a set of texts of the same length over the alphabet Σ, P(U) the probability of U in the Bernoulli model, and a a character in Σ with probability p(a). Let U·a be the set of all concatenations of texts from U with a, i.e., U·a = {xa | x ∈ U}. Then, in the Bernoulli model,

P(U·a) = P(U)·p(a). (1)

Then the following relations hold for any i ∈ {1, ..., n - 1} and any a ∈ Σ:

(i) if the text T_i contains a word from ℋ, then all its concatenations with characters from Σ also contain a word from ℋ; i.e.,

G_i(1)·a ⊂ G_{i+1}(1). (2)

(ii) if the text T_i does not contain a word from ℋ and belongs to C_i(0; q), i.e., ends with q ∈ Q_ℋ\ℋ, then its concatenation T_i·a belongs to the class determined by the result of the Aho-Corasick transition function δ(q, a); i.e.,

if δ(q, a) ∉ ℋ, then C_i(0; q)·a ⊂ C_{i+1}(0; δ(q, a)), (3)

otherwise C_i(0; q)·a ⊂ G_{i+1}(1). (4)

Remembering that the classes C_i(0; q) for different q and G_i(1) form a partition of Σⁱ, we obtain the following relation for the texts containing words from ℋ:

$$G_{i+1}(1) = \Big\{ \bigcup_{a \in \Sigma} G_i(1) \cdot a \Big\} \cup \Big\{ \bigcup_{(q,a):\, \delta(q,a) \in \mathcal{H}} C_i(0; q) \cdot a \Big\}. \tag{5}$$

Similarly, the classes of texts that do not contain words from ℋ satisfy

$$\forall q' \in Q_{\mathcal{H}} \setminus \mathcal{H}: \quad C_{i+1}(0; q') = \bigcup_{(q,a):\, \delta(q,a) = q'} C_i(0; q) \cdot a. \tag{6}$$

The classes C_i(0; q) for different q in Q_ℋ\ℋ and G_i(1) form a partition of Σⁱ; the classes C_i(0; q) are empty if q is in ℋ. Relations (5) and (6), with the help of (1), yield the recursive expressions for the probabilities P(C_{i+1}(0; q)) and P(G_{i+1}(1)) in the Bernoulli case:

$$P(G_{i+1}(1)) = P(G_i(1)) + \sum_{(q,a):\, \delta(q,a) \in \mathcal{H}} P(C_i(0; q)) \cdot p(a), \tag{7}$$

$$P(C_{i+1}(0; q')) = \sum_{(q,a):\, \delta(q,a) = q'} P(C_i(0; q)) \cdot p(a). \tag{8}$$

The run-time of each step of the computation of P(C_{i+1}(0; q)) and P(G_{i+1}(1)) is O(|Q_ℋ|·|Σ|); therefore, the total time of all n stages of the p-value computation is O(|Q_ℋ|·|Σ|·n).
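The recursion described in this section can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the motif is a collection of non-empty strings in which no word contains another, the Bernoulli model is passed as a dictionary `p` of letter probabilities, and δ is tabulated naively from its definition rather than by the efficient Aho-Corasick construction. The function name `p_value_single_motif` is hypothetical.

```python
def p_value_single_motif(words, p, n):
    """Probability that a Bernoulli text of length n contains a word from `words`."""
    H = set(words)
    Q = {w[:k] for w in H for k in range(len(w) + 1)}     # all prefixes

    def longest_suffix(s):
        # Longest suffix of s that belongs to Q (the empty string always does).
        return next(s[j:] for j in range(len(s) + 1) if s[j:] in Q)

    # Transition table delta(q, a), needed only for non-terminal nodes q.
    delta = {(q, a): longest_suffix(q + a) for q in Q - H for a in p}

    C = {"": 1.0}      # C[q] = P(C_i(0; q)); initially P(C_0(0; eps)) = 1
    G = 0.0            # G = P(G_i(1)); initially P(G_0(1)) = 0
    for _ in range(n):
        C_next = {}
        for q, prob in C.items():
            for a, pa in p.items():
                q2 = delta[(q, a)]
                if q2 in H:
                    G += prob * pa                          # first occurrence ends here
                else:
                    C_next[q2] = C_next.get(q2, 0.0) + prob * pa
        C = C_next
    return G

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(p_value_single_motif({"AAA", "AAC", "ACA", "ACC", "CCT"}, uniform, 11))
```

Only the probabilities for the current length are kept, so the memory used by the loop is proportional to the number of nodes, in line with the space estimate given above; each of the n steps visits every (node, letter) pair at most once.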

The approach described in this section can be readily extended to the case of multiple occurrences of the motif ℋ. The detailed procedure can be found in Additional file 1.

Additional file 1

Bernoulli text model. Probability to find multiple occurrences of a single motif. The detailed description of the algorithm for the p-value calculation in the case of multiple occurrences of a single motif.


Bernoulli text model. Probability to find multiple occurrences of multiple motifs

DNA transcription is usually regulated by several factors that simultaneously interact with DNA and specifically recognize different DNA sites. An individual regulatory segment of DNA can contain many binding sites for several factors, often substantially overlapping with each other [5]. This raises the problem of studying co-occurring motifs.

Let $(\mathcal{H}_1, \ldots, \mathcal{H}_s)$ be $s$ different motifs. Our objective is to calculate the probability that the motifs $(\mathcal{H}_1, \ldots, \mathcal{H}_s)$ have, respectively, at least $(k_1, \ldots, k_s)$ possibly overlapping occurrences in a random text $T_n$ of length $n$. This p-value is the probability $P(L_n(\mathcal{H}_1, \ldots, \mathcal{H}_s; k_1, \ldots, k_s))$ that the text $T_n$ belongs to the set of texts $L_n(\mathcal{H}_1, \ldots, \mathcal{H}_s; k_1, \ldots, k_s)$.
In this section, we suppose that the probability of each text is given by the Bernoulli model. The Markov case is considered in the next subsection. The recursion for multiple occurrences of multiple motifs obtained here is rather intricate. We therefore refer the reader to Additional file 1, where we describe the recursion for the simpler case of multiple occurrences of a single motif.
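For very short texts, the p-value defined above can be checked by direct enumeration. The following sketch is not part of AHOPRO; it is a minimal illustration, assuming that each motif is given as a set of words of equal length and that letters are drawn independently (Bernoulli model). The names `count_occurrences` and `pvalue_by_enumeration` are hypothetical.

```python
from itertools import product

def count_occurrences(text, motif):
    """Number of start positions at which some word of the motif occurs (overlaps allowed)."""
    return sum(any(text.startswith(w, i) for w in motif) for i in range(len(text)))

def pvalue_by_enumeration(motifs, k, n, p):
    """Sum the Bernoulli probabilities of all texts of length n in which
    motif j occurs at least k[j] times. Exponential in n; for checking only."""
    total = 0.0
    for letters in product(sorted(p), repeat=n):
        text = "".join(letters)
        if all(count_occurrences(text, m) >= kj for m, kj in zip(motifs, k)):
            prob = 1.0
            for a in letters:
                prob *= p[a]  # independent letters: multiply letter probabilities
            total += prob
    return total
```

Such an enumeration visits $|\Sigma|^n$ texts, which is why a recursion over classes of texts is needed for realistic lengths.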

Let us consider the union $\mathcal{H} = \mathcal{H}_1 \cup \cdots \cup \mathcal{H}_s$ of the individual motifs. It contains all words that belong to any of the motifs $\mathcal{H}_i$.
The tree $\mathcal{T}(\mathcal{H})$ is constructed for the overall set $\mathcal{H}$; its nodes $Q_{\mathcal{H}}$ are all possible prefixes of all words of the motifs $(\mathcal{H}_1, \ldots, \mathcal{H}_s)$.
A node of the tree $q \in Q_{\mathcal{H}}$ can belong to one motif $\mathcal{H}_k$ or simultaneously to several different motifs from $\{\mathcal{H}_j\}_{1 \le j \le s}$. Let each node $q \in Q_{\mathcal{H}}$ be marked with the numbers $j$ of the motifs $\mathcal{H}_j$ to which it belongs.
Nodes corresponding to proper prefixes of words in $\mathcal{H}$ remain unmarked. The transition function $\delta: Q_{\mathcal{H}} \times \Sigma \to Q_{\mathcal{H}}$ is defined for the unified motif $\mathcal{H}$ exactly as in the case of a single motif.
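The marked tree and its transition function can be sketched as follows. This is an illustration under stated assumptions rather than the AHOPRO data structure: nodes are represented by the prefix strings themselves, and $\delta(q, a)$ is assumed to be the longest suffix of $q \cdot a$ that is a node, which is the usual Aho-Corasick convention. A node is marked with $j$ when some word of $\mathcal{H}_j$ is a suffix of it; for motifs whose words all have the same length this coincides with $q \in \mathcal{H}_j$. The name `build_tree` is hypothetical.

```python
def build_tree(motifs, alphabet="ACGT"):
    """Return (delta, marks) for a list of motifs, each a set of words.

    delta[(q, a)] is the node reached from node q on letter a;
    marks[q] is the frozenset of motif numbers j (0-based) that end at node q.
    """
    # Nodes are all prefixes of all words; "" is the root.
    nodes = {""} | {w[:i] for m in motifs for w in m for i in range(1, len(w) + 1)}

    def longest_suffix_node(s):
        # The empty string is always a node, so the search always succeeds.
        return next(s[i:] for i in range(len(s) + 1) if s[i:] in nodes)

    delta = {(q, a): longest_suffix_node(q + a) for q in nodes for a in alphabet}
    marks = {
        q: frozenset(j for j, m in enumerate(motifs) if any(q.endswith(w) for w in m))
        for q in nodes
    }
    return delta, marks
```

Reading a text letter by letter through `delta` gives, at each position, the longest suffix of the text read so far that is a prefix of some motif word.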

All texts $T_n$ of length $n$ are classified into classes depending on the occurrences of the different motifs $\mathcal{H}_j$. In this case it is difficult to introduce a single target class $G$, since when the target number of occurrences $k_i$ is attained for some motif $\mathcal{H}_i$, the corresponding value $k_j$ may not yet be attained for another motif $\mathcal{H}_j$. Therefore we need to introduce the occurrence index of a set of motifs.

Definition 2 Let the target number of occurrences of motif $\mathcal{H}_i$ be $k_i$. Then the occurrence index $\Lambda_{(k_1, \ldots, k_s)}(l_1, \ldots, l_s)$ of the set of motifs $(\mathcal{H}_1, \ldots, \mathcal{H}_s)$ in a text $T_n$ containing $l_i$ possibly overlapping occurrences of each $\mathcal{H}_i$ is an $s$-vector whose $i$th component is calculated as follows:

$$[\Lambda_{(k_1, \ldots, k_s)}(l_1, \ldots, l_s)]_i = \lambda_i = \begin{cases} l_i & \text{if } l_i \le k_i, \\ k_i & \text{if } l_i > k_i. \end{cases}$$
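In other words, the occurrence index is the vector of occurrence counts capped componentwise at the targets. A one-line sketch (the function name `occurrence_index` is hypothetical):

```python
def occurrence_index(counts, targets):
    """Cap each occurrence count l_i at its target k_i (Definition 2)."""
    return tuple(min(l, k) for l, k in zip(counts, targets))
```

For example, with targets $(3, 3, 2)$ the counts $(0, 5, 2)$ give the index $(0, 3, 2)$: once a target is reached, further occurrences of that motif no longer change the index.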

Definition 3 A text $T_i$ belongs to the class $C_i(\lambda_1, \ldots, \lambda_s; q)$, $0 \le \lambda_j \le k_j$, iff

1. The length of $T_i$ equals $i$,

2. The occurrence index of the motifs $(\mathcal{H}_1, \ldots, \mathcal{H}_s)$ in the text $T_i$ is equal to $(\lambda_1, \ldots, \lambda_s)$,

A text $T_i$ belongs to the class $G_i(k_1, \ldots, k_s)$ if it belongs to the union of classes

$$G_i(k_1, \ldots, k_s) = \bigcup_{q \in \mathcal{H}} C_i(k_1, \ldots, k_s; q).$$

The desired p-value $P(L_n(\mathcal{H}_1, \ldots, \mathcal{H}_s; k_1, \ldots, k_s))$ is equal to $P(G_n(k_1, \ldots, k_s))$. The value is calculated iteratively. Again, we have a sum over all possible tree nodes $q$ and symbols $a$. Now $q'$, the image $\delta(q, a)$ of the transition function, can belong simultaneously to several motifs from $\{\mathcal{H}_j\}_{1 \le j \le s}$. Thus, the probability $P(C_{i+1}(\lambda_1, \ldots, \lambda_s; q'))$ that a text $T_{i+1}$ belongs to the class $C_{i+1}(\lambda_1, \ldots, \lambda_s; q')$ is calculated as

$$P(C_{i+1}(\lambda_1, \ldots, \lambda_s; q')) = \sum_{(q, a):\, \delta(q, a) = q'} \; \sum_{(r_1, \ldots, r_s) \in J} P(C_i(r_1, \ldots, r_s; q)) \cdot p(a) \tag{11}$$

where the second sum runs over all allowed $s$-tuples of indices $(r_1, \ldots, r_s)$, which together form the set $J$. An $s$-tuple $(r_1, \ldots, r_s)$ belongs to $J$ if, for every $j$, it satisfies the following conditions:

1. if $q' \notin \mathcal{H}_j$ then $r_j = \lambda_j$,

2. if $q' \in \mathcal{H}_j$ and $\lambda_j < k_j$ then $r_j = \lambda_j - 1$,

3. if $q' \in \mathcal{H}_j$ and $\lambda_j = k_j$ then $r_j = k_j$ or $r_j = k_j - 1$.
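The three conditions can be read as a rule that lists, for a given target index $(\lambda_1, \ldots, \lambda_s)$ and node $q'$, every predecessor index $(r_1, \ldots, r_s)$ in $J$. The sketch below is illustrative only; the name `predecessor_indices` and the argument `hit` (the set of motif numbers $j$, counted from 0, with $q' \in \mathcal{H}_j$) are hypothetical, and negative indices are discarded because no text has fewer than zero occurrences.

```python
from itertools import product

def predecessor_indices(lam, hit, k):
    """All s-tuples r in J for the target index lam, given the marks of q'."""
    choices = []
    for j, (l, kj) in enumerate(zip(lam, k)):
        if j not in hit:
            options = [l]            # condition 1: index unchanged
        elif l < kj:
            options = [l - 1]        # condition 2: exactly one new occurrence
        else:
            options = [kj, kj - 1]   # condition 3: target reached now or earlier
        choices.append([r for r in options if r >= 0])
    return list(product(*choices))
```

For instance, with $k = (2, 2)$, $\lambda = (1, 2)$ and $q'$ belonging to both motifs, the set $J$ is $\{(0, 2), (0, 1)\}$; with $\lambda_j = 0$ and $q' \in \mathcal{H}_j$ the set is empty, since a text ending with an occurrence of $\mathcal{H}_j$ cannot have index 0 for that motif.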

Implementation details

Our basic data structure is the prefix tree; we use its standard representation 42 [see also Additional files 2 and 3 for tree construction from the PWM motif representation]. Each tree node $q \in Q_{\mathcal{H}}$ is supplied with several additional variables.

Additional file 2

Tree construction from PWM motif representation. A brief description of the procedure for constructing the prefix tree from a PWM motif representation.

Additional file 3

Tree construction from PWM motif representation. Steps of the prefix tree construction for a PWM and a given cut-off.

At stage $(i + 1)$ of the probability computation, the values $P(C_{i+1}(\lambda_1, \ldots, \lambda_s; q))$ are computed from the values $P(C_i(\lambda_1, \ldots, \lambda_s; q))$ obtained at the previous stage of the induction. Therefore, at stage $(i + 1)$ one no longer needs the values calculated at stage $(i - 1)$. Thus, each node is supplied with two $k_1 \times \cdots \times k_s$ arrays of real values, $C_0$ and $C_1$, for storing $P(C_i(\lambda_1, \ldots, \lambda_s; q))$ and $P(C_{i+1}(\lambda_1, \ldots, \lambda_s; q))$ for the different $\lambda_j$. $C_0$ stores the probabilities for even text lengths and $C_1$ those for odd text lengths.
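The alternation between the two arrays can be illustrated for the simpler case of a single motif with target $k$ under the Bernoulli model. This is a sketch, not the AHOPRO code: the function name `pvalue_single` is hypothetical, nodes are stored as prefix strings, each node carries $k + 1$ slots so that every index $0 \le \lambda \le k$ has an entry, and the final value sums the slot $\lambda = k$ over all nodes.

```python
def pvalue_single(words, k, n, p):
    """P(at least k possibly overlapping occurrences of the motif in a text of length n)."""
    alphabet = sorted(p)
    nodes = sorted({""} | {w[:i] for w in words for i in range(1, len(w) + 1)})
    index = {q: t for t, q in enumerate(nodes)}

    def step(q, a):
        s = q + a  # longest suffix of q+a that is a node
        return next(s[i:] for i in range(len(s) + 1) if s[i:] in index)

    delta = [[index[step(q, a)] for a in alphabet] for q in nodes]
    marked = [any(q.endswith(w) for w in words) for q in nodes]

    # C[0] holds the probabilities for even text lengths, C[1] for odd ones.
    C = [[[0.0] * (k + 1) for _ in nodes] for _ in range(2)]
    C[0][index[""]][0] = 1.0  # the empty text is in the root with index 0
    for i in range(n):
        src, dst = C[i % 2], C[(i + 1) % 2]
        for row in dst:  # the values of stage i-1 are no longer needed
            for lam in range(k + 1):
                row[lam] = 0.0
        for t in range(len(nodes)):
            for r in range(k + 1):
                mass = src[t][r]
                if mass == 0.0:
                    continue
                for ai, a in enumerate(alphabet):
                    t2 = delta[t][ai]
                    lam = min(r + 1, k) if marked[t2] else r
                    dst[t2][lam] += mass * p[a]
    return sum(row[k] for row in C[n % 2])
```

Only two layers are alive at any time, so the memory does not grow with the text length $n$.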

In the implementation, the calculation of the values $P(C_{i+1}(\lambda_1, \ldots, \lambda_s; q'))$ from $P(C_i(\lambda_1, \ldots, \lambda_s; q))$ for all $q', q \in Q_{\mathcal{H}}$ and all $(\lambda_1, \ldots, \lambda_s)$ with $0 \le \lambda_j \le k_j$, $1 \le j \le s$, is performed in parallel. Initially we set all the values $P(C_{i+1}(\lambda_1, \ldots, \lambda_s; q'))$ to 0. Then we look over all tuples $(r_1, \ldots, r_s; q)$, where $q \in Q_{\mathcal{H}}$ and $0 \le r_j \le k_j$, $1 \le j \le s$. For each tuple $(r_1, \ldots, r_s; q)$ and each letter $a \in \Sigma$ we find the prefix $q' = \delta(q, a)$ and the value $P(C_i(r_1, \ldots, r_s; q)) \cdot p(a)$. We then add this value to $P(C_{i+1}(\lambda_1, \ldots, \lambda_s; q'))$, where $(\lambda_1, \ldots, \lambda_s; q')$ meets the conditions inverse to those of formula (11):

1. if $q' \notin \mathcal{H}_j$ then $\lambda_j = r_j$,

2. if $q' \in \mathcal{H}_j$ and $r_j < k_j$ then $\lambda_j = r_j + 1$,

3. if $q' \in \mathcal{H}_j$ and $r_j = k_j$ then $\lambda_j = k_j$.

At stage $i = n$ the desired p-value is the sum

$$P(G_n(k_1, \ldots, k_s)) = \sum_{q \in \mathcal{H}} P(C_n(k_1, \ldots, k_s; q)).$$
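The forward accumulation described above can be sketched for several motifs as follows. The code is a minimal illustration under explicit assumptions and should not be read as the AHOPRO implementation: the name `pvalue_bernoulli` is hypothetical, classes are kept in a dictionary instead of per-node arrays, the words of each motif are assumed to have equal length (so at most one occurrence of a motif ends at any position), and the final sum is taken over all nodes $q$ of the tree.

```python
def pvalue_bernoulli(motifs, k, n, p):
    """P(motif j has at least k[j] possibly overlapping occurrences, for every j)."""
    k = tuple(k)
    alphabet = sorted(p)
    nodes = {""} | {w[:i] for m in motifs for w in m for i in range(1, len(w) + 1)}

    def step(q, a):
        s = q + a  # longest suffix of q+a that is a node of the tree
        return next(s[i:] for i in range(len(s) + 1) if s[i:] in nodes)

    delta = {(q, a): step(q, a) for q in nodes for a in alphabet}
    # hit[q][j] is 1 if an occurrence of motif j ends at node q, else 0
    hit = {q: tuple(int(any(q.endswith(w) for w in m)) for m in motifs) for q in nodes}

    cur = {((0,) * len(k), ""): 1.0}  # stage 0: empty text, root node
    for _ in range(n):
        nxt = {}  # all values of stage i+1 start at 0
        for (r, q), mass in cur.items():
            for a in alphabet:
                q2 = delta[(q, a)]
                # inverse conditions: add one where q2 is marked, capped at k_j
                lam = tuple(min(rj + h, kj) for rj, h, kj in zip(r, hit[q2], k))
                nxt[(lam, q2)] = nxt.get((lam, q2), 0.0) + mass * p[a]
        cur = nxt
    return sum(mass for (lam, _), mass in cur.items() if lam == k)
```

As a small check, for the two single-word motifs {AA} and {AC} with $k = (1, 1)$, $n = 3$ and a uniform two-letter alphabet {A, C}, the only qualifying text is AAC, and the sketch returns $1/8$.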

Markov text model

The tree approach and the recursion (11) can be readily extended to calculate p-values of motif occurrences in random texts generated by a Markov model of order $K$. In a Markov model of order $K$, the probability $p(a)$ in (11) depends on the $K$ previous letters. Thus, if the length $|q|$ of the prefix $q$ is less than $K$, one cannot calculate $p(a)$ knowing only the prefix $q$. To overcome this, we divide each class $C_i(r_1, \ldots, r_s; q)$ with $|q| = d < \min(K, i)$ into subclasses $C_i(r_1, \ldots, r_s; q, w)$; each subclass corresponds to a word $w$ of length $\min(K, i) - d$. A text $T_i$ of length $i$ then belongs to the class $C_i(r_1, \ldots, r_s; q, w)$ if the suffix of $T_i$ of length $\min(K, i)$ equals $w \cdot q$.

Figure 2 gives an example for a Markov model of order $K = 1$. The tree is constructed for the set $\mathcal{H}$ = {AAA, AAC, ACA, ACC, CCT}. The text T = ATGCCAACCTT produces the following sequence of nodes $\{q_i\}_{i \ge 1}$ (the numbers of the corresponding nodes in Figure 2 are shown in square brackets): A[4], (ε, T)[3], (ε, G)[2], C[5], CC[8], A[4], AA[6], AAC[10], ACC[12], CCT[13], (ε, T)[3].
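The sequence of nodes in this example can be reproduced with a short sketch. It assumes, as an illustration, that $q$ is the longest suffix of the text read so far that is a prefix of some word, and that $w$ is the part of the last $\min(K, i)$ letters not covered by $q$; the function name `markov_trace` is hypothetical and the node numbers of Figure 2 are not modeled.

```python
def markov_trace(words, text, K=1):
    """Return the list of pairs (q, w) visited while reading the text."""
    nodes = {""} | {w[:i] for w in words for i in range(1, len(w) + 1)}
    q, trace = "", []
    for i in range(1, len(text) + 1):
        s = q + text[i - 1]
        q = next(s[j:] for j in range(len(s) + 1) if s[j:] in nodes)
        tail = text[max(0, i - K):i]  # suffix of T_i of length min(K, i)
        # w is needed only when q is shorter than the Markov context
        w = tail[:len(tail) - len(q)] if len(q) < len(tail) else ""
        trace.append((q, w))
    return trace
```

For the text ATGCCAACCTT and $K = 1$, the pairs with an empty $q$ are `("", "T")`, `("", "G")` and `("", "T")`, matching the nodes (ε, T), (ε, G) and (ε, T) listed above.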

Figure 2

Tree $\mathcal{T}(\mathcal{H})$ for the set $\mathcal{H}$ = {AAA, AAC, ACA, ACC, CCT} with dashed links for the δ function under the Markov model of order 1. Dashed colored links represent the δ function for the internal node (8), in red, and for the marked node (10), corresponding to the word AAC ∈ $\mathcal{H}$, in purple.

The recursive equations for the probabilities $P(L_n(\mathcal{H}; 1))$, $P(L_n(\mathcal{H}; k))$, and $P(L_n(\mathcal{H}_1, \ldots, \mathcal{H}_s; k_1, \ldots, k_s))$ can be obtained from the corresponding formulae (7-8), (11-13) and (16) by replacing the probabilities $p(a)$ with $p(a \mid t[1] \cdots t[K])$, where

$$t[1] \cdots t[K] = \begin{cases} w \cdot q & \text{if } 0 \le d < K, \\ K\text{-suffix of } q & \text{otherwise.} \end{cases}$$

The Markov extension is currently implemented for K = 1.
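For $K = 1$ the substitution amounts to conditioning each letter on the previous one, which is the last letter of $q$ when $q$ is non-empty and the stored word $w$ otherwise. The sketch below illustrates this under assumptions that are not taken from the paper: the first letter is drawn from a separate initial distribution `init`, `trans[b][a]` denotes $p(a \mid b)$, the words of each motif have equal length, and $n \ge 1$. The name `pvalue_markov1` is hypothetical.

```python
def pvalue_markov1(motifs, k, n, init, trans):
    """P(motif j has at least k[j] occurrences, for every j) under a first-order Markov model."""
    k = tuple(k)
    alphabet = sorted(init)
    nodes = {""} | {w[:i] for m in motifs for w in m for i in range(1, len(w) + 1)}

    def advance(r, q, a):
        s = q + a
        q2 = next(s[i:] for i in range(len(s) + 1) if s[i:] in nodes)
        lam = tuple(min(rj + int(any(q2.endswith(w) for w in m)), kj)
                    for rj, m, kj in zip(r, motifs, k))
        w2 = a if q2 == "" else ""  # remember the last letter only at the root
        return lam, q2, w2

    cur = {}
    for a in alphabet:  # stage 1: the first letter has no predecessor
        key = advance((0,) * len(k), "", a)
        cur[key] = cur.get(key, 0.0) + init[a]
    for _ in range(n - 1):
        nxt = {}
        for (r, q, w), mass in cur.items():
            last = q[-1] if q else w  # the context t[1] for K = 1
            for a in alphabet:
                key = advance(r, q, a)
                nxt[key] = nxt.get(key, 0.0) + mass * trans[last][a]
        cur = nxt
    return sum(mass for (lam, _, _), mass in cur.items() if lam == k)
```

Only the root is split into subclasses here, so the number of classes grows by at most $|\Sigma|$ compared with the Bernoulli case.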

Complexity

To summarize, the computation of $P(L_n(\mathcal{H}; k))$ for one set $\mathcal{H}$ requires computing $(P(C_i(l, q)))_{0 \le l < k,\; q \in Q_{\mathcal{H}}}$ for $i \le n$. For each iteration, the time complexity is $O(k |Q_{\mathcal{H}}| |\Sigma|)$, where $|\Sigma|$ is the size of the alphabet. One traverses the tree $n$ times.
As $|Q_{\mathcal{H}}|$ is bounded from above by $m|\mathcal{H}|$, where $m$ is the maximal length of a word in $\mathcal{H}$, this yields an overall $O(n k m |\mathcal{H}| |\Sigma|)$ time complexity and an $O(k m |\mathcal{H}|)$ space complexity.

When several sets are involved, the number of nodes in the tree $\mathcal{T}(\mathcal{H}_1 \cup \cdots \cup \mathcal{H}_s)$ becomes $O(m|\mathcal{H}|)$, with $m$ equal to the maximal length of a word in $\mathcal{H} = \mathcal{H}_1 \cup \cdots \cup \mathcal{H}_s$. The additional memory in each node is $\prod_i k_i$.
Therefore, the time complexity is $O(n m |\Sigma| \prod_i k_i |\mathcal{H}|)$ and the space complexity is $O(m \prod_i k_i |\mathcal{H}|)$. In the Markov model of order $K$, one memorizes $|\Sigma|^{K-d}$ predecessors for each node at depth $d$, $0 \le d < K$. In other words, the number of classes becomes $O(m|\mathcal{H}| + K|\Sigma|^K)$. Therefore, the space complexity is $O((m|\mathcal{H}| + K|\Sigma|^K) \prod_i k_i)$ and the running time is $O(n |\Sigma| (m|\mathcal{H}| + K|\Sigma|^K) \prod_i k_i)$.
This additive increment compares favorably to simple induction methods 4553 that introduce a multiplicative O (K|Σ|^K) factor in time and space complexity for the Markov(K) model.
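The additive growth can be made concrete by enumerating the pairs $(q, w)$ that are actually reachable for a given word set and order $K \ge 1$. The sketch below is illustrative; the name `reachable_classes` is hypothetical, the occurrence indices are ignored, and the start pair `("", "")` stands for the empty text.

```python
def reachable_classes(words, K, alphabet="ACGT"):
    """Set of pairs (q, w) reachable from the empty text under a Markov model of order K."""
    nodes = {""} | {w[:i] for w in words for i in range(1, len(w) + 1)}
    start = ("", "")
    seen, stack = {start}, [start]
    while stack:
        q, w = stack.pop()
        context = w + q if len(q) < K else q[-K:]  # last min(K, i) letters
        for a in alphabet:
            s = q + a
            q2 = next(s[j:] for j in range(len(s) + 1) if s[j:] in nodes)
            tail = (context + a)[-K:]
            w2 = tail[:len(tail) - len(q2)] if len(q2) < len(tail) else ""
            if (q2, w2) not in seen:
                seen.add((q2, w2))
                stack.append((q2, w2))
    return seen
```

For the set {AAA, AAC, ACA, ACC, CCT} the tree has 11 nodes including the root, and for $K = 1$ the enumeration yields 13 pairs, well within the bound $m|\mathcal{H}| + K|\Sigma|^K = 3 \cdot 5 + 4 = 19$. A method that attaches every possible context to every node would instead multiply the number of nodes by a factor that grows with $|\Sigma|^K$.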

Results and discussion

We developed an algorithm for the exact calculation of the p-value for multiple occurrences of multiple motifs with possible overlaps. The running time is linear in the text length and depends on the alphabet size, the maximal motif length, the number of words in the motifs, and the number of occurrences of each motif. The algorithm was implemented in the AHOPRO software. Below we give examples of how p-values can be used for studying gene regulation in silico, particularly for selecting optimal cutoff values for motifs represented by PWMs. In the subsection 'Comparison with simulation and approximation methods' we compare our p-value computations with the results of Monte Carlo simulations and the Poisson approximation. Our results confirm the accuracy of our algorithm and show in which cases the Poisson approximation 811 cannot be employed. In the subsection 'Optimal cutoffs', we apply AHOPRO to choose an appropriate cutoff score for position weight matrices. In the subsection 'Assessment of gene regulation', we show how AHOPRO can be used for studying regulatory regions containing heterotypic clusters of TFBSs to distinguish genes that are regulated by given transcription factors from those that are not.

As a model example, we use in this section the data published in [34] on regulatory clusters in D. melanogaster. This compilation includes information on

(i) known binding motifs for transcription factors,

(ii) known CRM regions, and

(iii) known regulatory interactions.

Comparison with simulation and approximation methods

In our first example we use the even-skipped stripe 2 enhancer (eve2) [63] of length 728 bp, which is known to contain binding sites for the TFs bicoid, kruppel and hunchback. Below we compare p-values calculated by the AHOPRO program and those calculated using the compound Poisson approximation with p-values computed through Monte Carlo simulations.

AHOPRO and Monte Carlo comparisons

Table 2 compares p-values calculated with AHOPRO and with Monte Carlo simulation assuming the Bernoulli model M0. The corresponding results for the first-order Markov model M1 are displayed in Table 3. Letter probabilities for M0 and the transition matrix for M1 were estimated from the eve2 sequence. We used the PWM cutoff values taken from [34], i.e., 5.3, 5.0, and 6.2 for bicoid, kruppel, and hunchback respectively. With these threshold values we found 3, 4, and 2 occurrences of the respective motifs in the eve2 sequence. Tables 2 and 3 list the p-values, i.e., the probabilities of finding no fewer than the observed number of motif occurrences in a random text of length L, where L is the length of the eve2 enhancer. The number of Monte Carlo simulations was set to 10⁶ everywhere, except for the triplet (bcd&kr&hb), where we ran 10⁷ simulations. The probability of finding the observed number of occurrences of (bcd&kr&hb) simultaneously in the same simulated sequence is extremely low; we therefore increased the number of simulations so that the product of the probability and the number of simulations is greater than 1.
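
A Monte Carlo estimate of this kind can be sketched as follows, assuming the M0 model and a motif given as a set of words. The function names, the toy motif and the letter probabilities are hypothetical; the actual simulations used PWM-defined motifs and probabilities estimated from eve2.

```python
import random


def count_occurrences(text, words):
    """Count all, possibly overlapping, occurrences of any word in text."""
    return sum(text.startswith(w, i) for i in range(len(text)) for w in words)


def monte_carlo_pvalue(words, k, n, probs, trials, seed=0):
    """Estimate P(at least k occurrences) in random M0 texts of length n."""
    rng = random.Random(seed)
    letters = sorted(probs)
    weights = [probs[a] for a in letters]
    hits = 0
    for _ in range(trials):
        text = "".join(rng.choices(letters, weights=weights, k=n))
        if count_occurrences(text, words) >= k:
            hits += 1
    return hits / trials


# Hypothetical toy motif; 728 is the eve2 length quoted in the text.
probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
print(monte_carlo_pvalue({"TAATCC", "TAAGCC"}, 1, 728, probs, trials=2000))
```

The estimate is a frequency out of `trials` runs, so an event of probability p is expected to be seen about p times `trials` times; this is why very small p-values need many more simulations than moderate ones.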

Table 2

Comparison of p-values calculated by the AHOPRO program, by Monte Carlo simulations and by compound Poisson distribution formula under the M0 model

| MOTIF, CUTOFF | OCC. | AHOPRO | MONTE CARLO | POISSON | AHOPRO/MC | AHOPRO/POISSON |
|---|---|---|---|---|---|---|
| bcd, 5.3 | 3 | 0.012 | | 0.010 | 1.00 | 1.10 |
| kr, 5.0 | 4 | 0.0044 | | 0.0033 | 1.01 | 1.34 |
| hb, 6.2 | 2 | 0.013 | | 0.012 | 0.99 | 1.04 |
| bcd & kr | 3&4 | 0.00025 | 0.00026 | 3.6E-05 | 0.99 | 7.10 |
| bcd & kr & hb | 3&4&2 | 6.54E-06 | 5.8E-06 | 4.34E-07 | 1.13 | 7.13 |

Comparison of p-values calculated for the Markov(0) model by the AHOPRO program with p-values calculated by Monte Carlo simulations and by Poisson formula for motifs of D. melanogaster developmental transcription factors bicoid, kruppel and hunchback.

Table 3

Comparison of p-values calculated by the AHOPRO program, by Monte Carlo simulation and by compound Poisson distribution formula under the M1 model

| MOTIF, CUTOFF | OCC. | AHOPRO | MONTE CARLO | POISSON | AHOPRO/MC | AHOPRO/POISSON |
|---|---|---|---|---|---|---|
| bcd, 5.3 | 3 | 0.013 | 0.014 | 0.012 | 0.998 | 1.11 |
| kr, 5.0 | 4 | 0.011 | | 0.008 | 1.01 | 1.43 |
| hb, 6.2 | 2 | 0.14 | | 0.11 | 0.9987 | 1.25 |
| bcd & kr | 3&4 | 0.00051 | | 9.62E-05 | 0.9991 | 5.34 |
| bcd & kr & hb | 3&4&2 | 6.9E-05 | 6.97E-05 | 1.08E-05 | 0.9889 | 6.36 |

Comparison of p-values calculated by the AHOPRO program for the Markov(1) model with those calculated by Monte Carlo simulations and by Poisson formula for motifs of D. melanogaster developmental transcription factors bicoid, kruppel, and hunchback.

The results of comparison of the AHOPRO computation with those obtained from simulated random sequences presented in Tables 2 and 3 confirm the accuracy of our algorithm.

Poisson approximation

In practical applications, the compound Poisson distribution [64] is widely used to assess p-values of multiple motif occurrences [28,34,65]. Here we apply it to compute the probability of observing the given number of motif occurrences when the probabilities of individual words are calculated under the M0 or M1 models described above. The comparison given in the corresponding columns of Tables 2 and 3 shows that the Poisson approximation can significantly underestimate the p-value. This most probably happens because the Poisson approximation treats motif occurrences as independent and does not take into account possible overlaps between them. The error increases when the p-value is calculated for simultaneous occurrences of several factors, as in the last two rows. In this case, the Poisson approximation p-value for a combination of several TFs is calculated as the product of the p-values calculated independently for each TF. In reality, motif occurrences can overlap, especially when the motifs resemble each other; the independence assumption therefore fails, which causes the error.
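
The product rule described above can be sketched with a plain Poisson tail. This is a simplification, not the compound Poisson formula of [64]: it assumes that all words of a motif have the same length and that occurrences are independent. The function names and numerical inputs are hypothetical.

```python
import math


def poisson_tail(lam, k):
    """P(N >= k) for N ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return 1.0 - below


def poisson_pvalue(word_probs, word_len, n, k):
    """Approximate P(at least k motif occurrences) in a text of length n.

    word_probs: probabilities of the individual words of the motif,
    all assumed to have length word_len.
    """
    lam = (n - word_len + 1) * sum(word_probs)  # expected number of occurrences
    return poisson_tail(lam, k)


def joint_pvalue(pvalues):
    """Independence assumption: multiply the per-factor p-values."""
    return math.prod(pvalues)


# Hypothetical word probabilities for two motifs in a 728 bp text.
p1 = poisson_pvalue([1e-4, 2e-4], 6, 728, 3)
p2 = poisson_pvalue([3e-4], 7, 728, 4)
print(p1, p2, joint_pvalue([p1, p2]))
```

Multiplying the per-factor tails is exactly where the approximation ignores overlaps between occurrences of different motifs.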

Optimal cutoffs

Below, we use AHOPRO to determine the optimal cutoff values for PWMs of regulatory factors, given the sequences of regulatory regions presumed to interact with the factors. The distribution of occurrences of TF binding sites in the corresponding experimentally confirmed regulatory regions is strongly biased [34]. In CRMs, binding sites often occur in clusters, which is not the case for random sequences.

Different cutoff values correspond to different numbers of putative binding sites of different quality. The higher the cutoff value, the closer the motif occurrences are to the consensus and the smaller the number of motif occurrences. Therefore, for a given factor it is reasonable to select the cutoff value that minimizes the probability of finding, in a random sequence, at least the number of motif occurrences observed in the sequence of the regulatory region.

As an example, we again considered the transcription factors bicoid and kruppel, which are known to regulate the even-skipped stripe 2 (eve2) enhancer. To select the optimal cutoff values we used the following procedure. First, in the sequence of eve2 we counted occurrences of motifs with a score greater than the cutoff, with cutoff values varied from 3 to 8.5. Each pair of cutoff values (S₁, S₂) thus corresponded to (k₁, k₂) occurrences of the bicoid and kruppel motifs respectively. For each pair (k₁, k₂), we computed the p-value P_n(k₁(S₁), k₂(S₂)), denoted below as P(S₁, S₂). This is the probability of obtaining at least k₁ occurrences of bicoid with scores greater than S₁ and at least k₂ occurrences of kruppel with scores greater than S₂. Figure 3 shows a 3D surface in which (x, y, z) corresponds to (S₁, S₂, -log₁₀P(S₁, S₂)), i.e., the cutoff value for the bicoid motif, the cutoff value for the kruppel motif, and the negative logarithm of the corresponding p-value calculated for the M1 model. The view of the surface from above is shown in Figure 3C. The maximal value of -log₁₀P(S₁, S₂), 6.3044, is attained when the bicoid cutoff is S₁ = 5.1 and the kruppel cutoff is S₂ = 5.6. With these cutoff values there are k₁ = 6 and k₂ = 4 occurrences of the bicoid and kruppel motifs, as defined by the corresponding PWMs, in the sequence of the eve2 enhancer. We believe that the sites found at this optimal p-value are the best candidates for functional TF binding sites.
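
The cutoff scan described above amounts to a grid search over pairs (S₁, S₂). The sketch below assumes that the PWM scores of all positions in the enhancer are already available as two lists and that a p-value routine is supplied by the caller; `best_cutoffs`, `count_hits`, the score lists and the toy `pvalue` function are hypothetical stand-ins, not part of AHOPRO.

```python
import math


def count_hits(scores, cutoff):
    """Number of positions whose PWM score is greater than the cutoff."""
    return sum(score > cutoff for score in scores)


def best_cutoffs(scores1, scores2, cutoffs, pvalue):
    """Scan all cutoff pairs and return the one maximizing -log10 of the p-value.

    pvalue(s1, k1, s2, k2) must return the probability of at least k1 and k2
    occurrences at cutoffs s1 and s2; it is supplied by the caller.
    Returns (neg_log10_p, s1, s2, k1, k2).
    """
    best = None
    for s1 in cutoffs:
        k1 = count_hits(scores1, s1)
        for s2 in cutoffs:
            k2 = count_hits(scores2, s2)
            p = max(pvalue(s1, k1, s2, k2), 1e-300)  # guard against log10(0)
            neg_log = -math.log10(p)
            if best is None or neg_log > best[0]:
                best = (neg_log, s1, s2, k1, k2)
    return best


# Cutoff grid 3.0, 3.1, ..., 8.5, matching the range used in the text.
cutoffs = [round(3.0 + 0.1 * i, 1) for i in range(56)]

# Hypothetical scores and a toy p-value function, for illustration only.
toy_pvalue = lambda s1, k1, s2, k2: 10 ** (-(k1 * s1 + k2 * s2) / 10)
print(best_cutoffs([5.2, 5.2, 5.2, 3.1], [6.0, 6.0], cutoffs, toy_pvalue))
```

In practice the `pvalue` argument would wrap an exact computation for the chosen background model; the scan itself does not depend on how the p-value is obtained.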

Figure 3

P-value distribution for eve2 and random sequences

Distribution of log₁₀(p-value) calculated for the M1 model as a function of cutoff values for the PWMs of BICOID and KRUPPEL in the even-skipped stripe 2 enhancer (A) and in a random sequence (B). View from above: eve2 sequence (C), random sequence (D).

For comparison, we simulated random sequences with the same length as the eve2 enhancer and the same dinucleotide probabilities. In most of the simulated sequences, for the bicoid and kruppel cutoff values (S₁, S₂) = (5.1, 5.6) we found no more than one occurrence of each motif. The average number of occurrences is 0.54 for bicoid and 0.31 for kruppel. The average p-value is 0.633. We took one of the random sequences and compared the p-values calculated for various cutoff values in this random sequence (Figures 3B, 3D) and in the real biological sequence of the eve2 enhancer (Figures 3A, 3C). There are two major differences between the p-value distributions in the regulatory sequence and in the random sequence. First, p-values in the random sequence are much greater than those in the enhancer sequence. In particular, the maximal -log(p-value) for this random sequence is about 1.02, which is 6.17 times smaller than the maximal -log(p-value) for the enhancer sequence (see also Table 4). Second, the shapes of the p-value distributions are different. For the enhancer sequence, there are only a few distinct peaks, at (4.3, 5.6), (4.3, 6.8), (5.1, 5.6), and (5.1, 6.8), whereas for the random sequence we see ridges between (2.2, 2.0) and (2.2, 4.8), and between (2.8, 2.0) and (2.8, 4.8). As expected, it is impossible to choose an appropriate cutoff for the PWMs of the factors from the random sequence data (Figures 3B and 3D).
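
One way to generate such random sequences is to sample from a first-order Markov chain whose transition counts are taken from the template sequence, which reproduces the dinucleotide probabilities on average. The sketch below is an assumption about how this could be done, not a description of the procedure actually used; the function name and the template string are hypothetical.

```python
import random
from collections import Counter


def sample_m1(template, rng):
    """Sample a sequence of the same length from a first-order Markov model
    estimated from the template sequence."""
    letters = Counter(template)
    trans = {}
    for a, b in zip(template, template[1:]):  # dinucleotide (transition) counts
        trans.setdefault(a, Counter())[b] += 1
    alphabet = sorted(letters)
    out = rng.choices(alphabet, weights=[letters[a] for a in alphabet])
    while len(out) < len(template):
        # A letter seen only at the very end of the template has no successor;
        # fall back to the single-letter frequencies in that case.
        row = trans.get(out[-1]) or letters
        out.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(out)


# Hypothetical short template; in the text the template is the eve2 enhancer.
rng = random.Random(42)
print(sample_m1("ACGTTGCAACGTAGCTAGGATCCA", rng))
```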

Table 4

Comparison of p-values and cutoff for different sets of DNA sequences

| Regulatory regions regulated by bicoid | Minimal p-value | Cutoff | Regulatory regions not regulated by bicoid | Minimal p-value | Cutoff | Random seq. | Minimal p-value | Cutoff |
|---|---|---|---|---|---|---|---|---|
| Btd crm | 3.24E-05 | 3.4 | Gt p. enh. | 0.023 | 2.7 | seq. 1 | 0.16 | 2.6 |
| Hb P2 | 4.13E-05 | 3.7 | Hb upstream enh. | 0.053 | 4.4 | seq. 2 | 0.12 | 1.7 |
| Kni cis element | 0.01 | 5.3 | Eve stripe 4+6 enh. | 0.41 | 3.6 | seq. 3 | 0.25 | 1.2 |
| Kr CD-1 enh. | 0.0001 | 5.1 | Eve stripe 3+7 enh. | 0.58 | 2.5 | seq. 4 | 0.065 | 1.6 |
| Otd early enh. | 0.024 | | Ftz upstream enh. | 0.037 | 5.8 | seq. 5 | 0.11 | |
| Sal blastoder. enh. | 8.62E-04 | 6.5 | Ftz | 0.28 | 3.3 | seq. 6 | 0.0087 | 3.8 |
| Tll PD enh. | 0.26 | 4.2 | Ubx PBX enh. | 0.196 | 6.7 | seq. 7 | 0.024 | 2.9 |
| Tll AD+PD enh. | 0.025 | 8.1 | Ubx BXD enh. | 0.698 | 4.6 | seq. 8 | 0.17 | 3.4 |
| Eve stripe 2 enh. | 4.04E-05 | 5.1 | Ubx BX enh. (BRE) | 0.05 | 7.5 | seq. 9 | 0.092 | 2.8 |
| Eve stripe 1 enh. | 8.09E-06 | 5.2 | Ems upstream enh. | 0.276 | 4.4 | seq. 10 | 0.052 | 3.6 |
| Eve stripe 5 enh. | 0.27 | 3.8 | En stripe enh. (intr. 1) | 0.049 | | seq. 11 | 0.13 | 1.7 |
| Median | 8.62E-04 | 5.1 | Median | 0.196 | 4.4 | Median | 0.1128 | 2.6 |

Comparison of minimal p-values and best found cutoffs for bicoid PWM calculated (i) in regulatory regions which are regulated by bicoid, (ii) in regulatory regions which are not regulated by bicoid, and (iii) in random sequences of the same length and with the same dinucleotide distribution as in the even-skipped stripe 2 enhancer.

We would also like to address the choice between the M0 and M1 models. We observed that in almost all cases the p-value calculated for the M0 model is smaller than the p-value calculated for the M1 model. This can probably be explained by the fact that the M1 model takes into account more information about the real sequence than the M0 model. Nevertheless, the difference is not crucial; for instance, for the eve2 enhancer the greatest ratio between the p-values calculated for bicoid and kruppel under the M0 and M1 models is about 3.62. The M0 model can therefore also be used in practical applications.

Assessment of gene regulation

Enhancers may contain clusters of TF binding sites for gene regulators. In such cases, p-value computation can be used to distinguish genes that are regulated by a given transcription factor from those that are not. To illustrate this, we took the PWM for the TF bicoid and calculated p-values for different cutoff values in various sets of sequences:

- regulatory regions which are regulated by bicoid, the positive set;

- regulatory regions which are not regulated by bicoid, the negative set;

- random sequences of the same length as eve2 enhancer and with the same dinucleotide distribution, the random set.

The minimal p-value and the corresponding cutoff value for the 11 sequences in each set are presented in Table 4. The p-values calculated for the positive set were generally much smaller than those calculated for the negative and random sets.

The median p-value in the positive set is 8.62E-04. There are, however, some exceptions, for instance the tailless PD enhancer, with a minimal p-value of 0.26, and the even-skipped stripe 5 enhancer, with a minimal p-value of 0.27. Although these genes are reported to be regulated by bicoid and these sequences contain experimentally confirmed individual bicoid binding sites, the sequences do not contain clusters of bicoid binding sites.

Most p-values calculated for the negative set (the second set in Table 4) are considerably higher than those calculated for the positive set. However, we observed rather small p-values for the sequences of the giant posterior enhancer (0.023), the hunchback upstream enhancer (0.053), the fushi tarazu upstream enhancer (0.037), the ultrabithorax BX enhancer (0.05), and the engrailed stripe enhancer (0.049). We believe that this can be explained by the fact that these regions contain clusters of binding sites of regulatory factors whose motifs are similar to the bicoid motif. Indeed, it was experimentally shown that TF kruppel regulates the giant posterior enhancer, TF tailless regulates the hunchback upstream enhancer and the ultrabithorax BX enhancer, and TF fushi tarazu regulates the fushi tarazu upstream enhancer, the ultrabithorax BX enhancer and the engrailed stripe enhancer. The motifs of kruppel, tailless and fushi tarazu all exhibit some similarity to the bicoid motif. This observation shows the need for some sort of conditional p-values in order to distinguish true bicoid clusters from clusters of weak bicoid sites induced by the presence of clusters of other TF sites [67]. Moreover, the apparent false positive hit (p-value = 0.05, cutoff = 7.5) in a region that was not reported to be regulated by bicoid seems to be related to real bicoid binding, although not necessarily functional binding.

For the random set, i.e., sequences simulated with the same dinucleotide probabilities as in the even-skipped stripe 2 enhancer, we observe a rather broad range of minimal p-values, from 0.0087 for the 6th sample to 0.25 for the 3rd sample. It shows that the predictive power of this approach is limited to the case of regulatory sequences containing clusters of motifs.
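
The comparison of the three sets can be retraced from the minimal p-values listed in Table 4. The sketch below recomputes the medians and counts how many sequences fall below a threshold; the threshold of 0.01 and the function name are assumptions chosen for illustration and are not a criterion used in this study.

```python
from statistics import median

# Minimal p-values copied from Table 4, in row order.
positive = [3.24e-05, 4.13e-05, 0.01, 0.0001, 0.024, 8.62e-04,
            0.26, 0.025, 4.04e-05, 8.09e-06, 0.27]
negative = [0.023, 0.053, 0.41, 0.58, 0.037, 0.28,
            0.196, 0.698, 0.05, 0.276, 0.049]
random_set = [0.16, 0.12, 0.25, 0.065, 0.11, 0.0087,
              0.024, 0.17, 0.092, 0.052, 0.13]


def summarize(pvalues, threshold=0.01):
    """Return (median p-value, number of sequences with p-value below threshold)."""
    return median(pvalues), sum(p < threshold for p in pvalues)


for name, values in [("positive", positive), ("negative", negative),
                     ("random", random_set)]:
    print(name, *summarize(values))
```

With the tabulated values the medians are 8.62E-04 for the positive set, 0.196 for the negative set and 0.11 for the random set; Table 4 lists 0.1128 for the random set, presumably computed from unrounded p-values. Six positive, no negative and one random sequence fall below the illustrative 0.01 threshold.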

Conclusion

In this work we have developed an algorithm, inspired by the Aho-Corasick pattern matching algorithm, that allows precise calculation of the probability of finding a given motif conformation in a random text. It was implemented in the AHOPRO software for the Bernoulli model and the Markov model of order 1 of random sequences. There would be no difficulty in extending our approach to Markov models of order k, k > 1. We compared probabilities computed with AHOPRO with those computed by the compound Poisson distribution and showed that, in the case of multiple occurrences of multiple motifs, the Poisson approximation often substantially underestimates the p-value.

As we have demonstrated, the statistical significance of multiple motif occurrences in a text can be efficiently calculated with a simple algorithm. This provides an independent criterion for improving the results of site extraction algorithms, which still perform rather poorly. P-values and E-values are used in programs such as BLAST and are quantities with which practicing biologists are familiar. Adopting this measure for motif extraction (for single or multiple motif occurrences) would therefore greatly help users who employ motif extraction analysis as a preliminary stage for experiments in the lab. Moreover, our algorithm is not tied to a particular motif extraction program and takes the most general motif representation, the list of allowed words [35], as input. It can thus be used when the results of several motif extraction algorithms are compared, for instance in the interpretation of ChIP-chip experiments [5]. In addition, the AHOPRO algorithm can easily be extended to amino acid sequences and applied to the identification of protein domain signatures.

Authors' contributions

VM initiated the study by pointing at the biological problem. JC suggested the initial idea of using Aho-Corasick structure. The final version of the algorithm was developed in discussions between JC, VB, MR and MAR. JC and VB developed the implementation. VB obtained results on simulated and biological sequences. VB designed the web site. MR, MAR, VB and VM participated in manuscript writing. MR and VM coordinated the study. All authors read and approved the final manuscript.

Acknowledgements

Thanks to Andrey Mironov, Stephen Small, Dmitri Papatsenko, Bruno Salvy and Philippe Flajolet for helpful comments and suggestions. Thanks to Alexander Favorov for help with the programming. Thanks to Tim Barker for correcting the English in the manuscript. This research was partially supported by INTAS #04-83-3994 and #05-1000008-8028, French Program EcoNet-12635WG, the RFBR grants 07-04-01584 and 06-04-49249, and by Russian Federation Agency in Science and Innovation State Contract 02.531.11.9003.

References

1. MacIsaac KD, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2006, 2(4):e36. doi:10.1371/journal.pcbi.0020036. PMID 16683017. PMC 1447654.
2. Sandve GK, Drablos F: A survey of motif discovery methods in an integrated framework. Biol Direct 2006, 1:11. doi:10.1186/1745-6150-1-11. PMID 16600018. PMC 1479319.
3. Rombauts S, Florquin K, Lescot M, Marchal K, Rouze P, van de Peer Y: Computational approaches to identify promoters and cis-regulatory elements in plant genomes. Plant Physiol 2003, 132(3):1162-1176. Review. doi:10.1104/pp.102.017715. PMID 12857799. PMC 167057.
4. Bulyk ML: DNA microarray technologies for measuring protein-DNA interactions. Curr Opin Biotechnol 2006, 17(4):422-30. doi:10.1016/j.copbio.2006.06.015. PMID 16839757.
5. Harbison CT, Gordon B, Lee TI, Rinaldi NJ, Macisaac KD, Danford T, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431:99-104. doi:10.1038/nature02800. PMID 15343339.
6. Zhu Z, Shendure J, Church GM: Discovering functional transcription-factor combinations in the human cell cycle. Genome Res 2005, 15(6):848-55. doi:10.1101/gr.3394405. PMID 15930495. PMC 1142475.
7. Clyde DE, Corado MS, Wu X, Pare A, Papatsenko D, Small S: A self-organizing system of repressor gradients establishes segmental complexity in Drosophila. Nature 2003, 426(6968):849-53. doi:10.1038/nature02189. PMID 14685241.
8. Wagner A: Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics 1999, 15(10):776-784. doi:10.1093/bioinformatics/15.10.776. PMID 10705431.
9. Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila. Genome Res 2003, 13(4):579-88. doi:10.1101/gr.668403. PMID 12670999. PMC 430164.
10. Brown CT, Rust AG, Clarke PJ, Pan Z, Schilstra MJ, De Buysscher T, Griffin G, Wold BJ, Cameron RA, Davidson EH, Bolouri H: New computational approaches for analysis of cis-regulatory networks. Dev Biol 2002, 246:86-102. doi:10.1006/dbio.2002.0619. PMID 12027436.
11. Wagner A: A computational genomics approach to the identification of gene networks. Nucleic Acids Res 1997, 25(18):3594-3604. doi:10.1093/nar/25.18.3594. PMID 9278479. PMC 146952.
12. Liaw GJ, Lengyel JA: Control of tailless expression by bicoid, dorsal and synergistically interacting terminal system regulatory elements. Mech Dev 1993, 40(1–2):47-61. doi:10.1016/0925-4773(93)90087-E. PMID 8443106.
13. Jun S, Desplan C: Cooperative interactions between paired domain and homeodomain. Development 1996, 122(9):2639-50. PMID 8787739.
14. Mitashev VI, Koussoulakos S, Zinov'eva RD, Ozerniuk ND, Mikaelian AS, Shmukler E, Smirnova Iu A: [Constructive synergism of regulatory genes expressed in the course of the eye and muscle development and regeneration]. Izv Akad Nauk Ser Biol 2001, 3:261-75. PMID 11433936.
15. Klingenhoff A, Frech K, Werner T: Regulatory modules shared within gene classes as well as across gene classes can be detected by the same in silico approach. In Silico Biol 2002, 2:S17-26. PMID 11808874.
16. Kato M, Hata N, Banerjee N, Futcher B, Zhang MQ: Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol 2004, 5(8):R56. Epub 2004 Jul 28. doi:10.1186/gb-2004-5-8-r56. PMID 15287978. PMC 507881.
17. Hu YJ, Sandmeyer S, McLaughlin C, Kibler D: Combinatorial motif analysis and hypothesis generation on a genomic scale. Bioinformatics 2000, 16(3):222-32. doi:10.1093/bioinformatics/16.3.222. PMID 10869015.
18. Jegga AG, Sherwood SP, Carman JW, Pinski AT, Phillips JL, Pestian JP, Aronow BJ: Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res 2002, 12(9):1408-17. doi:10.1101/gr.255002. PMID 12213778. PMC 186658.
19. Li H, Rhodius V, Gross C, Siggia ED: Identification of the binding sites of regulatory proteins in bacterial genomes. Proc Natl Acad Sci USA 2002, 99(18):11772-7. Epub 2002 Aug 14. doi:10.1073/pnas.112341999. PMID 12181488. PMC 129344.
20. Markstein M, Zinzen R, Markstein P, Yee KP, Erives A, Stathopoulos A, Levine M: A regulatory code for neurogenic gene expression in the Drosophila embryo. Development 2004, 131(10):2387-94. doi:10.1242/dev.01124. PMID 15128669.
21. Makeev V, Lifanov A, Nazina A, Papatsenko D: Distance preferences in distribution of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Res 2003, 31(20):6016-26. doi:10.1093/nar/gkg799. PMID 14530449. PMC 219477.
22. Halfon MS, Michelson AM: Exploring genetic regulatory networks in metazoan development: methods and models. Physiol Genomics 2002, 10(3):131-43. PMID 12209016.
23. Papatsenko D: ClusterDraw web server: a tool to identify and visualize clusters of binding motifs for transcription factors. Bioinformatics 2007, 23(8):1032-1034. doi:10.1093/bioinformatics/btm047. PMID 17308342.
24. Aerts S, Loo PV, Thijs G, Moreau Y, Moor BD: Computational detection of cis-regulatory modules. Bioinformatics 2003, 19(2):II5-II14. doi:10.1093/bioinformatics/btg1052. PMID 14534164.
25. Bailey T, Noble W: Searching for statistically significant regulatory modules. Bioinformatics 2003, 19(2):II16-II25. doi:10.1093/bioinformatics/btg1054. PMID 14534166.
26. Berman B, Pfeiffer B, Laverty T, Salzberg S, Rubin G, Eisen M, Celniker S: Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol 2004, 5(9):R61. doi:10.1186/gb-2004-5-9-r61. PMID 15345045. PMC 522868.
27. Frith M, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 2001, 17(10):878-889. doi:10.1093/bioinformatics/17.10.878. PMID 11673232.
28. Frith MC, Li MC, Weng Z: Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Res 2003, 31(13):3666-3668. doi:10.1093/nar/gkg540. PMID 12824389. PMC 168947.
29. Sosinsky A, Bonin C, Mann R, Honig B: Target Explorer: an automated tool for the identification of new target genes for a specified set of transcription factors. Nucleic Acids Research 2003, 31(13):3589-3592. doi:10.1093/nar/gkg544. PMID 12824372. PMC 168951.
30. Krivan W: Searching for transcription factor binding site clusters: how true are true positives? J Bioinform Comput Biol 2004, 2(2):413-6. doi:10.1142/S021972000400065X. PMID 15297989.
31. Papatsenko D, Makeev V, Lifanov A, Régnier M, Nazina A, Desplan C: Extraction of Functional Binding Sites from Unique Regulatory Regions: The Drosophila Early Developmental Enhancers. Genome Research 2002, 12:470-481. [Preliminary version in Drosophila Workshop, Washington 2001]. Article published online before print in February 2002. doi:10.1101/gr.212502. PMID 11875036. PMC 155290.
32. Markstein M, Markstein P, Markstein V, Levine M: Genome-wide Analysis of Clustered Dorsal Binding Sites Identifies Putative Target Genes in the Drosophila Embryo. PNAS 2002, 99(2):763-768. doi:10.1073/pnas.012591199. PMID 11752406. PMC 117379.
33. Rebeiz M, Reeves NL, Posakony JW: SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering over random expectation. Proc Natl Acad Sci USA 2002, 99(15):9888-93. Epub 2002 Jul 09. doi:10.1073/pnas.152320899. PMID 12107285. PMC 125053.
34. Lifanov A, Makeev V, Nazina A, Papatsenko D: Uniform clusters in Drosophila. Genome Res 2003, 13(4):579-588. doi:10.1101/gr.668403. PMID 12670999. PMC 430164.
35. Staden R: Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 1989, 5(2):89-96. PMID 2720468.
36. Ellington A, Szostak J: In vitro selection of RNA molecules that bind specific ligands. Nature 1990, 346:818-822. doi:10.1038/346818a0. PMID 1697402.
37. Tuerk C, Gold L: Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 1990, 249:505-510. doi:10.1126/science.2200121. PMID 2200121.
38. Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW, Bulyk ML: Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat Biotechnol 2006, 24:1429-1435. doi:10.1038/nbt1246. PMID 16998473.
39. Liu Y, Yokota H: Modeling Transcriptional Regulation in Chondrogenesis Using Particle Swarm Optimization. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB2005 2005, 311-317.
40. IUPAC codes. http://bioinformatics.org/sms2/iupac.html
41. Berg OG: Selection of DNA binding sites by regulatory proteins. Functional specificity and pseudosite competition. J Biomol Struct Dyn 1988, 6(2):275-297. PMID 3271524.
42. Knuth DE: The Art of Computer Programming, Sorting and Searching. Volume 3. Addison-Wesley; 1973.
43. Zhang J, Jiang B, Li M, Tromp J, Zhang X, Zhang M: Computing exact P-values for DNA motifs. Bioinformatics 2007, 23(5):531-537. doi:10.1093/bioinformatics/btl662. PMID 17237046.
44. Hertzberg L, Zuk O, Getz G, Domany E: Finding Motifs in Promoter Regions. Journal of Computational Biology 2005, 12(3):314-330. doi:10.1089/cmb.2005.12.314. PMID 15857245.
45. Robin S, Daudin JJ: Exact distribution of word occurrences in a random sequence of letters. J Appl Prob 1999, 36:179-193. doi:10.1239/jap/1032374240.
46. Chrysaphinou C, Papastavridis S: The Occurrence of Sequence of Patterns in Repeated Dependent Experiments. Theory of Probability and Applications 1990, 79:167-173.
47. Guibas L, Odlyzko A: String Overlaps, Pattern Matching and Nontransitive Games. Journal of Combinatorial Theory, Series A 1981, 30:183-208. doi:10.1016/0097-3165(81)90005-4.
48. Tanushev M, Arratia R: Central Limit Theorem for Renewal Theory for Several Patterns. Journal of Computational Biology 1997, 4:35-44. PMID 9109036.
49. Nicodème P, Salvy B, Flajolet P: Motif Statistics. Theoretical Computer Science 2002, 287(2):593-618. [Preliminary version at ESA'99]. doi:10.1016/S0304-3975(01)00264-X.
50. Régnier M: A Unified Approach to Word Occurrences Probabilities. Discrete Applied Mathematics 2000, 104:259-280. [Special issue on Computational Biology; preliminary version at RECOMB'98]. doi:10.1016/S0166-218X(00)00195-5.
51. Szpankowski W: Average Case Analysis of Algorithms on Sequences. New York: John Wiley and Sons; 2001.
52. Bassino F, Clément J, Fayolle J, Nicodème P: Counting occurrences for a finite set of words: an inclusion-exclusion approach. 2007 International Conference on Analysis of Algorithms (AofA'07), Discrete Mathematics and Theoretical Computer Science 2007, 12.
53. Park Y, Spouge J: Searching for Multiple Words in Markov Sequences. INFORMS Journal of Computing 2004, 16(4):341-347. doi:10.1287/ijoc.1040.0095.
54. Nicodème P: Regexpcount, a symbolic package for counting problems on regular expressions and words. Fundamenta Informaticae 2003, 56(1–2):71-88.
55. Klaerr-Blanchard M, Chiapello H, Coward E: Detecting localized repeats in genomic sequences: A new strategy and its application to B. subtilis and A. thaliana sequences. Comput Chem 2000, 24:57-70. doi:10.1016/S0097-8485(99)00047-9. PMID 10642880.
56. Reinert G, Schbath S: Compound Poisson Approximation for Occurrences of Multiple Words in Markov Chains. Journal of Computational Biology 1998, 5(2):223-253. PMID 9672830.
57. Régnier M, Vandenbogaert M: Comparison of statistical significance criteria. J Bioinform Comput Biol 2006, 4(2):537-551. doi:10.1142/S0219720006002028. PMID 16819801.
58. Régnier M: Mathematical Tools for Regulatory Signals Extraction. In Bioinformatics of Genome Regulation and Structure. Edited by Kolchanov N, Hofestaedt R. Kluwer Academic Publisher; 2004:61-70. [Preliminary version at BGRS'02].
59. Régnier M, Denise A: Rare events and Conditional Events on random strings. DMTCS 2004, 6(2):191-214.
60. Boeva V, Clément J, Régnier M, Vandenbogaert M: Assessing the significance of Sets of Words. CPM'05, of Lecture Notes in Computer Science. Springer-Verlag 2005, 3537:358-370. [Proc. CPM'05, Jeju Island, Korea].
61. Kucherov G, Noé L, Roytberg M: Multi-seed lossless filtration. Proceedings of the 15th Annual Combinatorial Pattern Matching Symposium (CPM), Istanbul (Turkey), of Lecture Notes in Computer Science. Edited by Sahinalp S, Muthukrishnan S, Dogrusoz U. Springer Verlag 2004, 3109:297-310.
62. Aho A, Corasick M: Efficient String Matching. CACM 1975, 18(6):333-340.
63. Small S, Blair A, Levine M: Regulation of even-skipped stripe 2 in the Drosophila embryo. Embo Journal 1992, 11(13):4047-4057. PMID 1327756. PMC 556915.
64. Reinert G, Schbath S: Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J Comput Biol 1998, 5(2):223-53. PMID 9672830.
65. Wasserman W, Fickett J: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278:167-81. doi:10.1006/jmbi.1998.1700. PMID 9571041.
66. Tompa M, Li N, Bailey T, Church G, De Moor B, Eskin E, Favorov A, Frith M, Fu Y, Kent J, Makeev V, Mironov A, Noble W, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: An Assessment of Computational Tools for the Discovery of Transcription Factor Binding Sites. Nature Biotechnology 2005, 23:137-144. doi:10.1038/nbt1053. PMID 15637633.
67. Blanchette M, Sinha S: Separating real motifs from their artifacts. Bioinformatics 2001, 17(Suppl 1):S30-8. PMID 11472990.