Efficient Computation of Clustered-Clumps in Degenerate Strings

Given a finite set of patterns, a clustered-clump is a maximal overlapping set of occurrences of such patterns. Several solutions have been presented for identifying clustered-clumps based on statistical, probabilistic, and most recently, formal language theory techniques. Here, motivated by applications in molecular biology and computer vision, we present efficient algorithms, using String Algorithm techniques, to identify clustered-clumps in a given text. The proposed algorithms compute in O(n + m) time the occurrences of all clustered-clumps for a given set of degenerate patterns $\tilde{P}$ and/or degenerate text $\tilde{T}$ of total lengths m and n, respectively, such that the total number of non-solid symbols in $\tilde{P}$ and $\tilde{T}$ is bounded by a fixed positive integer d.


Introduction
The ability to identify and compute various repeated patterns in strings is known to play a central role in many areas of computer science, including data compression, computer vision, computer-assisted music analysis and molecular biology. One of the most fundamental questions arising in such studies is to locate the regions (windows) of overlapping occurrences of patterns in a given longer string, called the text. This question is of particular interest in molecular biology, e.g. for finding patterns with unexpectedly high or low frequencies and for gene recognition.
In this paper, we consider a recently studied problem of computing clumps in a given text [10,2]. In particular, given a finite set of patterns P, we compute all factors in a given text T such that each factor is composed of the maximal overlapping occurrences of patterns from P; these will be referred to as clustered-clumps hereafter. Such findings may be utilised, for example, for gene prediction, that is, to find genes within a genome based on the occurrences of specific DNA sequence motifs before or after them. Examples of such motifs include gene promoters; start and stop codons; and poly(A) tails. An example of overlapping motifs, specifically, recognition sites to which proteins bind, is presented in [6].
In molecular biology (where sequences are considered as strings over a fixed-size alphabet Σ), if the specific nature of biological data is to be accommodated, some positions in the sequence must be allowed to contain a subset of Σ instead of a single letter from Σ. Such degenerate (indeterminate) symbols can be interpreted as information that the exact letter at the given position is not known, but is suspected to be one of the specified letters.
Other than the aforementioned applications in genomics, identification of clustered-clumps in degenerate data also finds applications in areas such as computer vision or image processing. One such application can be the matching and retrieval of roughly aligned images containing the same scene, except nuances of some regions, and allowing for transformations like shifting, scaling, rotation etc.
Several solutions have been presented for identifying clustered-clumps based on statistical, probabilistic, and most recently, formal language theory techniques [10,2]. To the best of our knowledge, no solution heretofore explores the problem accounting for degeneracy in data. Here, we present efficient algorithms, using String Algorithm techniques, to identify clustered-clumps in a given text. Our solution considers degenerate strings arising from the nature of real data. The proposed algorithms compute in O(n + m) time the occurrences of all clustered-clumps for a given set of degenerate patterns $\tilde{P}$ and/or degenerate text $\tilde{T}$ of total lengths m and n, respectively, such that the total number of non-solid symbols in $\tilde{P}$ and $\tilde{T}$ is bounded by a fixed positive integer d.
The rest of the paper is organised as follows. The next section introduces the vocabulary and the notions that will be used throughout the paper. The algorithmic tools and data structures required to build the solutions are described in Section 3. Section 4 formally defines the problem and its variations, and presents and analyses the algorithms. Finally, Section 5 concludes the paper.

Terminology and Technical Background
We begin with basic definitions and notation. We think of a string X of length n as an array X[1 . . n], where every X[i], 1 ≤ i ≤ n, is a letter drawn from some fixed alphabet Σ of size |Σ| = O(1). The empty string is denoted by ε. The set of all strings over Σ (including the empty string ε) is denoted by Σ*. A string Y is a factor of a string X if there exist two strings U and V such that X = UYV. Hence, we say that there is an occurrence of Y in X, or, simply, that Y occurs in X. Consider the strings X, Y, U, and V such that X = UYV: if U = ε, then Y is a prefix of X; similarly, if V = ε, then Y is a suffix of X.
A degenerate symbol $\tilde{\sigma}$ over an alphabet Σ is a non-empty subset of Σ, i.e., $\tilde{\sigma} \subseteq \Sigma$ and $\tilde{\sigma} \neq \emptyset$. $|\tilde{\sigma}|$ denotes the size of the set and we have $1 \le |\tilde{\sigma}| \le |\Sigma|$. A degenerate string is built over the potential $2^{|\Sigma|} - 1$ non-empty sets of letters belonging to Σ. In other words, a degenerate string $\tilde{X} = \tilde{X}[1 . . n]$ is a string such that every $\tilde{X}[i]$ is a degenerate symbol, 1 ≤ i ≤ n. For example, $\tilde{X}$ = {a,b}{a}{c}{b,c}{a}{a,b,c} is a degenerate string of length 6 over Σ = {a,b,c}. If $|\tilde{X}[i]| = 1$, that is, $\tilde{X}[i]$ is a single letter of Σ, then we say that $\tilde{X}[i]$ is a solid symbol and i is a solid position. Otherwise, $\tilde{X}[i]$ and i are said to be a non-solid symbol and a non-solid position, respectively. For convenience, we often write $\tilde{X}[i] = \sigma$ (σ ∈ Σ) instead of $\tilde{X}[i] = \{\sigma\}$ in the case of solid symbols. Consequently, the degenerate string $\tilde{X}$ mentioned previously will be written as $\tilde{X}$ = {a,b}ac{b,c}a{a,b,c}. A string containing only solid symbols will be called a solid string. A conservative degenerate string is a degenerate string whose number of non-solid symbols is upper-bounded by a fixed positive constant.
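For degenerate strings, the notion of symbol equality is extended to a single-symbol match between two degenerate symbols in the following way. Two degenerate symbols $\tilde{\alpha}_1$ and $\tilde{\alpha}_2$ are said to match (represented as $\tilde{\alpha}_1 \approx \tilde{\alpha}_2$) if $\tilde{\alpha}_1 \cap \tilde{\alpha}_2 \neq \emptyset$. Extending this notion to degenerate strings, we say that two degenerate strings $\tilde{X}$ and $\tilde{Y}$ match (denoted $\tilde{X} \approx \tilde{Y}$) if $|\tilde{X}| = |\tilde{Y}|$ and $\tilde{X}[i] \approx \tilde{Y}[i]$ for every position i. Note that for a fixed-size alphabet, the matching relation can be implemented in O(1) time if degenerate symbols are represented by bit-vectors of size |Σ|.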
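The bit-vector representation can be illustrated with a minimal Python sketch. The function names (`encode`, `parse_degenerate`, `symbols_match`, `strings_match`) and the brace notation parser are assumptions made for illustration; the string-level test assumes that two degenerate strings match when they have equal length and match position by position.

```python
def encode(letters, alphabet):
    """Encode a degenerate symbol (an iterable of letters) as a bit mask."""
    mask = 0
    for letter in letters:
        mask |= 1 << alphabet.index(letter)
    return mask


def parse_degenerate(text, alphabet):
    """Parse notation such as '{a,b}ac{b,c}' into a list of bit masks."""
    masks, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            letters = text[i + 1:j].split(",")
            i = j + 1
        else:
            letters = [text[i]]
            i += 1
        masks.append(encode(letters, alphabet))
    return masks


def symbols_match(a, b):
    """Two degenerate symbols match when their letter sets intersect."""
    return (a & b) != 0


def strings_match(x, y):
    """Position-by-position match of two equal-length degenerate strings."""
    return len(x) == len(y) and all(symbols_match(a, b) for a, b in zip(x, y))


x = parse_degenerate("{a,b}ac{b,c}a{a,b,c}", "abc")
print(strings_match(x, parse_degenerate("bacbab", "abc")))
print(strings_match(x, parse_degenerate("cacbab", "abc")))
```

Each symbol comparison is a single bitwise AND, which is constant time as long as |Σ| fits in a machine word.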
A set of strings $P = \{P_1, \ldots, P_r\}$ is reduced if no $P_i$ is a factor of a $P_j$ with $i \neq j$. For instance, the set {aa,aba} is reduced whereas the sets {aa,aab}, {aa,baa}, and {aa,baab} are non-reduced. In [1], a clustered-clump of a given reduced set of strings $P = \{P_1, \ldots, P_r\}$, where each $P_i$ is of length at least 2, is defined as follows:
Definition 1 (Clustered-Clump). A clustered-clump of a given reduced set of strings $P = \{P_1, \ldots, P_r\}$ is a string W such that any two consecutive positions in W are covered by the same occurrence in W of a string $P_k \in P$. The position i of the string W is covered by a string $P_k$ if $P_k = W[j . . j + |P_k| - 1]$ for some $j \in \{1, \ldots, |W| - |P_k| + 1\}$ and $j \le i \le j + |P_k| - 1$. More formally, W is a clustered-clump for the set P such that $\forall i \in \{1, \ldots, |W|\}$ $\exists P_k \in P$, $\exists j \in Pos_W(P_k)$ such that $j \le i \le j + |P_k| - 1$, where $Pos_W(P_k)$ is the set of positions of occurrences of $P_k$ in W.
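Both notions can be checked directly with a brute-force Python sketch. The names `is_reduced` and `is_clump` are hypothetical, and `is_clump` tests the first sentence of Definition 1 (every pair of consecutive positions is covered by a single occurrence); it is meant as an illustration, not an efficient procedure.

```python
def is_reduced(patterns):
    """True if no pattern is a factor of a different pattern in the list."""
    return not any(
        i != j and p in q
        for i, p in enumerate(patterns)
        for j, q in enumerate(patterns)
    )


def is_clump(word, patterns):
    """True if every two consecutive positions of `word` are covered by
    the same occurrence in `word` of some pattern."""
    spans = []  # 0-based inclusive (start, end) of each pattern occurrence
    for p in patterns:
        start = word.find(p)
        while start != -1:
            spans.append((start, start + len(p) - 1))
            start = word.find(p, start + 1)
    return len(word) >= 2 and all(
        any(s <= i and i + 1 <= e for s, e in spans)
        for i in range(len(word) - 1)
    )


print(is_reduced(["aa", "aba"]), is_reduced(["aa", "baab"]))
print(is_clump("bbaba", ["aba", "bba"]), is_clump("bbabaaba", ["aba", "bba"]))
```

In the last call the pair of adjacent letters aa is not covered by any single occurrence, so the string is not a clustered-clump.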
For a given text (string) T , a factor W is a clustered-clump if it is maximal in the sense that there exists no occurrence of the set P in T that overlaps W without being a factor of it.
Example 1. Consider the set P = {aba, bba} and the text T = bbbabababababbbbabaababb. The clustered-clumps are the factors bbababababa at position 2, bbaba at position 15, and aba at position 20. Notice that the factor ababa at position 6 is not a clustered-clump since it is not maximal. Also, the factor bbabaaba at position 15 does not form a single clustered-clump, because its two-letter factor aa is not covered by an occurrence of either aba or bba.

Algorithmic Tools
In the following we present two fundamental data structures supporting a wide variety of string matching algorithms. Both data structures are used in the proposed algorithms presented in Section 4.

Suffix Tree:
The suffix tree S(X) of a non-empty string X of length n is a compact trie representing all the suffixes of X such that S(X) has n leaves, labelled from 1 to n. Additionally, each edge is labelled with a factor of X. For any i, 1 ≤ i ≤ n, the concatenation of the edges' labels on the path from the root of S(X) to leaf i is precisely the suffix X[i . . n]. For any two suffixes U = X[i . . n] and V = X[j . . n] of X, if W is the longest common prefix of U and V , then the path in S(X) corresponding to W is the same for U and V . For a general introduction of suffix trees, see [3].
The construction of the suffix tree S(X) of the input string X takes O(n) time and space for a string over a fixed-size alphabet [12,8,11]. Once the suffix tree of a given string (called the text) has been constructed, it can be used to support queries that return the occurrences of a given string (called the pattern) in time linear in the length of the pattern.

Aho-Corasick Automaton:
The Aho-Corasick automaton of a set of strings P, denoted A(P), is the minimal partial deterministic finite automaton accepting the set of all strings having a string of P as a suffix (see [5, Section 7.1] for more description and for efficient construction); an example is given in Figure 1. This data structure has an initial state, denoted $s_0$, and a transition function represented by the edges in the figure. A state is marked as terminal if the string it represents is in the set P; note that all the leaves are terminal states. Let 'goto' denote the transition function; then the suffix-link, represented by a dotted line, is defined as follows. For a given non-empty string X such that $s_i$ = goto($s_0$, X), the suffix-link of state $s_i$ points at $s_j$ = goto($s_0$, X′), where X′ is the longest suffix of X such that $s_i \neq s_j$.
The construction of the automaton A(P) together with the suffix-links can be done in linear time and space [3], independent of the alphabet size. Note that the transition function can be implemented in O(1) time for a fixed-size alphabet.
Once the automaton A(P) has been constructed, searching a text T for occurrences of the patterns in P can be realized in time linear in the length of T; this problem is known as the dictionary matching problem. The matching involves the Aho-Corasick automaton scanning the text, reading every letter exactly once. If the automaton is in state $s_i$ and reads letter α of the text, it moves to state goto($s_i$, α) if that transition is defined. Otherwise, it follows suffix-links starting from $s_i$ until it reaches the nearest state $s_j$ for which goto($s_j$, α) is defined, and moves to $s_k$ = goto($s_j$, α). Also, if the automaton encounters a terminal state, it outputs the occurrence(s) of one or more patterns. Note that if P is reduced then at most one pattern from P occurs at each position of the text. In the rest of the paper, we assume that P is a reduced set.
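The scanning procedure can be sketched in Python as follows. This is a textbook-style construction with dictionary-based transitions rather than the exact data structure of [5]; the names `build_automaton` and `search` are hypothetical, and reported positions are 1-based starting positions to match the convention of Example 1.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto transitions, suffix-links and outputs for `patterns`."""
    goto, fail, out = [{}], [0], [[]]  # state 0 is the initial state
    for idx, pattern in enumerate(patterns):
        state = 0
        for letter in pattern:
            if letter not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][letter] = len(goto) - 1
            state = goto[state][letter]
        out[state].append(idx)  # terminal state for this pattern
    queue = deque(goto[0].values())  # depth-1 states link to the root
    while queue:
        state = queue.popleft()
        for letter, target in goto[state].items():
            f = fail[state]
            while f and letter not in goto[f]:
                f = fail[f]
            fail[target] = goto[f].get(letter, 0)
            out[target] = out[target] + out[fail[target]]
            queue.append(target)
    return goto, fail, out


def search(text, patterns):
    """Return (1-based start, pattern) for every occurrence in `text`."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, letter in enumerate(text):
        # Follow suffix-links until a transition on `letter` is defined.
        while state and letter not in goto[state]:
            state = fail[state]
        state = goto[state].get(letter, 0)
        for idx in out[state]:
            hits.append((i - len(patterns[idx]) + 2, patterns[idx]))
    return hits


print(search("bbbabababababbbbabaababb", ["aba", "bba"]))
```

Run on the text and pattern set of Example 1, the sketch reports the occurrences of aba and bba from which the clustered-clumps are assembled.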

Clustered-Clumps Algorithms
The Clustered-Clump problem is formally defined as follows:
Finding Clustered-Clumps
Input: A text T of length n and a set of patterns $P = \{P_1, \ldots, P_r\}$, such that $m = \sum_{1 \le i \le r} |P_i|$.
Output: All clustered-clumps in T.
While the above problem can be solved efficiently using any standard dictionary matching algorithm, extending the definition to include degenerate strings makes the problem more interesting, challenging, and useful for practical applications. In the following, we reformulate the definition to generate three variations of the problem: the patterns are degenerate (but the text is solid), the text is degenerate (but the patterns are solid), and both the text and the patterns are degenerate. Further, we assume that the number of non-solid symbols in either the text or the set of patterns is bounded by a given constant, i.e. we will deal with conservative degenerate strings.

Problem 1: Solid Text & Degenerate Patterns
Problem 1: Finding Clustered-Clumps in Solid Text given Degenerate Patterns
Input: A text T of length n, a set of conservative degenerate patterns $\tilde{P} = \{\tilde{P}_1, \ldots, \tilde{P}_r\}$, and integers d and m, such that the total number of non-solid symbols in $\tilde{P}$ is at most d, and $m = \sum_{1 \le i \le r} |\tilde{P}_i|$.
Output: All clustered-clumps in T.

The solution we propose for this problem is based on the idea used in [9]. Each degenerate pattern $\tilde{P}_i \in \tilde{P}$ can be seen as consisting of solid subpatterns, interspersed with non-solid regions. Let $P_{i,j}$ be a solid subpattern of $\tilde{P}_i$, 1 ≤ i ≤ r, 1 ≤ j ≤ sub(i), where sub(i) denotes the number of solid subpatterns in $\tilde{P}_i$. Additionally, let $\Re^{i}_{j-1,j}$ represent the non-solid region between the subpatterns $P_{i,j-1}$ and $P_{i,j}$. In other words, a pattern $\tilde{P}_i$ can be viewed as:
$$\tilde{P}_i = P_{i,1}\,\Re^{i}_{1,2}\,P_{i,2}\,\cdots\,\Re^{i}_{sub(i)-1,sub(i)}\,P_{i,sub(i)}$$
Note that if the pattern $\tilde{P}_i$ ends with one or more non-solid symbols, then the last non-solid region is represented as $\Re^{i}_{sub(i),\infty}$. The following steps outline our solution:
Step 1: Split: In this step, each degenerate pattern in $\tilde{P}$ is split into its component subpatterns; we call the set of all solid subpatterns so obtained P. Effectively, we are breaking every degenerate pattern into subpatterns by chopping out non-solid regions so that each of the subpatterns is solid.
Step 2: Find occurrences of solid subpatterns in T: We next build the Aho-Corasick automaton of the set P, denoted by A(P). Using the automaton, we compute all the occurrences of the solid subpatterns in the text T.
The occurrences of the subpatterns of the set P are maintained using a boolean matrix Valid of size |P| × n such that we can test in constant time whether or not a specific solid subpattern occurs at a given text position. If an occurrence of $P_{i,j}$ for which $\Re^{i}_{j-1,j}$ exists is found at a position (say k), then we need to check two conditions. First, whether the preceding subpattern $P_{i,j-1}$ is marked true in Valid at the position determined by k and the lengths of $P_{i,j-1}$ and $\Re^{i}_{j-1,j}$. Second, whether the non-solid symbols in $\Re^{i}_{j-1,j}$ match the corresponding positions in T. If j = sub(i), then the non-solid region $\Re^{i}_{sub(i),\infty}$ is also tested for matching.
If both conditions are true, then an occurrence of $P_{i,j}$ is marked true in the matrix Valid. Notice that proceeding in this way, an occurrence marked true for $P_{i,sub(i)}$ corresponds to an occurrence of the degenerate pattern $\tilde{P}_i$ in T.
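Steps 1 and 2 can be sketched in Python for a single degenerate pattern. Several simplifications are assumptions of this illustration: a pattern is given as a list of strings, each listing the letters allowed at one position; `str.find` stands in for the Aho-Corasick automaton, so the sketch does not achieve the linear running time; a list of sets `valid` plays the role of the matrix Valid; and the pattern is assumed to contain at least one solid symbol. Positions are 0-based.

```python
def split(pattern):
    """Step 1: return (offset, solid_subpattern) for each maximal solid run."""
    runs, i = [], 0
    while i < len(pattern):
        if len(pattern[i]) == 1:
            j = i
            while j < len(pattern) and len(pattern[j]) == 1:
                j += 1
            runs.append((i, "".join(pattern[i:j])))
            i = j
        else:
            i += 1
    return runs


def find_all(text, word):
    """Yield every 0-based start of `word` in `text` (stand-in for A(P))."""
    start = text.find(word)
    while start != -1:
        yield start
        start = text.find(word, start + 1)


def degenerate_occurrences(text, pattern):
    """Step 2: 0-based starts of the degenerate pattern in the solid text."""
    runs = split(pattern)
    valid = [set() for _ in runs]  # valid[j]: starts consistent up to run j
    for j, (offset, word) in enumerate(runs):
        # Non-solid region immediately before run j.
        prev_end = 0 if j == 0 else runs[j - 1][0] + len(runs[j - 1][1])
        region = list(range(prev_end, offset))
        if j == len(runs) - 1:
            # The last run also checks a trailing non-solid region, if any.
            region += range(offset + len(word), len(pattern))
        for k in find_all(text, word):
            start = k - offset  # implied start of the whole pattern
            if start < 0 or start + len(pattern) > len(text):
                continue
            if j > 0 and start not in valid[j - 1]:
                continue  # the preceding subpattern was not validated
            if all(text[start + p] in pattern[p] for p in region):
                valid[j].add(start)
    return sorted(valid[-1])


# The pattern a{b,c}ba searched in the text abbaacba.
print(degenerate_occurrences("abbaacba", ["a", "bc", "b", "a"]))
```

As in the text, a start that survives validation for the last solid subpattern corresponds to an occurrence of the whole degenerate pattern.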
Step 3: Compute the locations of clustered-clumps: Using the information about the occurrences of the degenerate patterns in the text, we populate an array LongestOcc of size n that stores the length of the longest pattern occurring at each position of the text. It is easy to see that simple calculations done in a single scan of this array can report the positions of all the clustered-clumps in T ; see Function 1 below for more details.
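A minimal Python sketch of such a single scan is given below. It assumes that LongestOcc is a list indexed by 0-based text position holding 0 where no pattern starts, and that an occurrence extends the current clump whenever it starts at or before the clump's last covered position, which is consistent with Definition 1 and Example 1. The name `clumps_from_longest_occ` is hypothetical.

```python
def clumps_from_longest_occ(longest_occ):
    """Return 1-based inclusive (start, end) pairs of all clustered-clumps.

    longest_occ[i] is the length of the longest pattern occurrence that
    starts at 0-based text position i, or 0 if no pattern starts there.
    """
    clumps = []
    start, end = None, -1  # current clump as 0-based inclusive bounds
    for i, length in enumerate(longest_occ):
        if length == 0:
            continue
        if start is None or i > end:
            # The occurrence shares no position with the current clump.
            if start is not None:
                clumps.append((start + 1, end + 1))
            start, end = i, i + length - 1
        else:
            # The occurrence overlaps the current clump and may extend it.
            end = max(end, i + length - 1)
    if start is not None:
        clumps.append((start + 1, end + 1))
    return clumps


# Example 1: occurrences of aba and bba (all of length 3) start at these
# 1-based positions of T = bbbabababababbbbabaababb.
longest_occ = [0] * 24
for position in (2, 4, 6, 8, 10, 15, 17, 20):
    longest_occ[position - 1] = 3
print(clumps_from_longest_occ(longest_occ))
```

The scan visits each text position once, so this step takes O(n) time once the array is populated.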

Problem 2: Degenerate Text & Solid Patterns
Problem 2: Finding Clustered-Clumps in Degenerate Text given Solid Patterns
Input: A conservative degenerate text $\tilde{T}$ of length n, a set of patterns $P = \{P_1, \ldots, P_r\}$, and integers d and m, such that the total number of non-solid symbols in $\tilde{T}$ is at most d, and $m = \sum_{1 \le i \le r} |P_i|$.
Output: All clustered-clumps in $\tilde{T}$.
Our solution for Problem 2 is built using the recently developed algorithm described in [4], which solves the pattern matching problem in conservative degenerate text in linear time (applying an adapted version of Landau and Vishkin's [7] algorithm for approximate pattern matching). Please refer to [4] for full details; for the sake of completeness, a brief description is provided in the following steps:
Step 1: Substitute: In this step, each of the non-solid symbols occurring in the given degenerate text is replaced by a unique symbol which is not present in Σ. Let Λ be the set of these unique symbols, i.e. $\Lambda = \{\lambda_i\}$ such that 0 < i ≤ d. Note that the text $T_\lambda$ obtained by such a substitution is a solid string. For example, if $\tilde{T}$ = CATTA{A,G}GAGC{T,G}CTTTA as in Example 4, then $T_\lambda$ = CATTA$\lambda_1$GAGC$\lambda_2$CTTTA; here $\Lambda = \{\lambda_1, \lambda_2\}$.
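The substitution step can be sketched in Python as follows. The brace notation for the input, the function name `substitute`, and the representation of each placeholder as the letter λ followed by its index are assumptions made for illustration; the sketch also returns a table recording which letter set each placeholder replaces, since later steps need that information.

```python
def substitute(degenerate_text):
    """Replace each non-solid symbol '{x,y,...}' by a fresh placeholder.

    Returns the solid text as a list of symbols, plus a table mapping each
    placeholder (lambda followed by its index) to the letters it stands for.
    """
    solid, table, i = [], {}, 0
    while i < len(degenerate_text):
        if degenerate_text[i] == "{":
            j = degenerate_text.index("}", i)
            name = "\u03bb" + str(len(table) + 1)  # lambda_1, lambda_2, ...
            table[name] = set(degenerate_text[i + 1:j].split(","))
            solid.append(name)
            i = j + 1
        else:
            solid.append(degenerate_text[i])
            i += 1
    return solid, table


solid, table = substitute("CATTA{A,G}GAGC{T,G}CTTTA")
print("".join(solid))
print(table)
```

Keeping the solid text as a list of symbols ensures that each placeholder counts as a single position, so the length of the substituted text equals the length of the degenerate text.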
Step 2: Find occurrences of solid patterns in $T_\lambda$: We concatenate the text $T_\lambda$ and the patterns into a single string, separated by delimiting symbols $\#_i$, 1 ≤ i ≤ r, where each $\#_i$ is a unique symbol that is not present in Σ ∪ Λ. Next, the suffix tree of the concatenated string is constructed. As detailed in

Conclusion
In this paper, we studied the problem of identifying clustered-clumps in conservative degenerate strings and presented O(n + m)-time algorithms that compute the occurrences of all clustered-clumps for a given set of degenerate patterns $\tilde{P}$ and/or a degenerate text $\tilde{T}$ of total lengths m and n, respectively, such that the total number of non-solid symbols in $\tilde{P}$ and $\tilde{T}$ is bounded by a given constant d. The presented algorithms are promising for applications in genomics and computer vision. We intend to conduct larger-scale experiments using genomic as well as digitized-image datasets. Furthermore, other domains that involve web-mining applications may find the presented solutions interesting and beneficial.