State Complexity of Single-Word Pattern Matching in Regular Languages

Abstract. The state complexity κ(L) of a regular language L is the number of states in the minimal deterministic finite automaton recognizing L. In a general pattern-matching problem one has a set T of texts and a set P of patterns; both T and P are sets of words over a finite alphabet Σ. The matching problem is to determine whether any of the patterns appear in any of the texts, as prefixes, or suffixes, or factors, or subsequences. In previous work we examined the state complexity of these problems when both T and P are regular languages, that is, we computed the state complexity of the languages (PΣ*) ∩ T, (Σ*P) ∩ T, (Σ*PΣ*) ∩ T, and (Σ* ⧢ P) ∩ T, where ⧢ is the shuffle operation. It turns out that the state complexities of these languages match the naïve upper bounds derived by composing the state complexities of the basic operations used in each expression. However, when P is a single word w with κ({w}) ≤ m, κ(T) ≤ n, and Σ has two or more letters, the bounds are drastically reduced to the following: κ((wΣ*) ∩ T) ≤ m + n − 1; κ((Σ*w) ∩ T) ≤ (m − 1)n − (m − 2); κ((Σ*wΣ*) ∩ T) ≤ (m − 1)n; and κ((Σ* ⧢ w) ∩ T) ≤ (m − 1)n. The bounds for factor and subsequence matching are the same as the naïve bounds, but this is not the case for prefix and suffix matching. For unary languages, we have a tight upper bound of m + n − 2 in all four cases.


Introduction
The state complexity of a regular language L, denoted κ(L), is the number of states in the minimal deterministic finite automaton (DFA) recognizing L. The state complexity of an operation on regular languages is the worst-case state complexity of the resulting language, expressed in terms of the input languages' state complexities. A language attaining this worst-case state complexity is called a witness for the operation.
The state complexities of "basic" regular operations such as intersection and concatenation have been thoroughly studied [7,8,9]. There has also been some attention devoted to "combined" operations such as concatenation with Σ* to form languages called ideals [3]. A practical application of ideals is in pattern matching, or finding occurrences of a pattern in a text, commonly as either prefixes, suffixes, factors, or subsequences. (For a detailed treatment of pattern matching, see [4].) Brzozowski et al. [1] formulated several pattern matching problems as the construction of a regular language, using the intersection between a text language T and an ideal of a pattern language P. In the general case, given that κ(T) ≤ n and κ(P) ≤ m, and denoting by ⧢ the shuffle operation, tight state complexity bounds were shown for the four languages (PΣ*) ∩ T, (Σ*P) ∩ T, (Σ*PΣ*) ∩ T, and (Σ* ⧢ P) ∩ T. These bounds are in fact the naïve bounds obtained by composing the state complexity of forming the Σ*-concatenated pattern language with that of intersecting it with the text language. However, apart from prefix matching, these bounds are exponential in m, which leads to the following question: to what degree would restricting P lower the bounds? In this paper, we focus on restricting P to be a single word; that is, P = {w}.
Single-word pattern matching has many practical applications. For example, a common use of the grep utility in Unix is to search for the files in a directory in which a search word appears. In bioinformatics, a DNA sequence t is often searched to locate a sequence of nucleotides w [5]. There has also been work in distributed systems to "learn" common execution patterns from log files and use them to identify anomalous executions in new logs [6].
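In this paper, we show that for languages T and {w} such that κ(T) ≤ n and κ({w}) ≤ m, the following upper bounds hold:
- Prefix: κ((wΣ*) ∩ T) ≤ m + n − 1.
- Suffix: κ((Σ*w) ∩ T) ≤ (m − 1)n − (m − 2).
- Factor: κ((Σ*wΣ*) ∩ T) ≤ (m − 1)n.
- Subsequence: κ((Σ* ⧢ w) ∩ T) ≤ (m − 1)n.
Furthermore, in each case there exist languages T_n and {w}_m that meet the upper bounds. All of these bounds can be achieved using a binary alphabet, but not using a unary alphabet.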
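As a quick numerical illustration of the four single-word bounds stated in the abstract, the following sketch tabulates them for given m and n. The function name `single_word_bounds` and the `unary` flag are illustrative choices, not notation from the paper.

```python
def single_word_bounds(m: int, n: int, unary: bool = False) -> dict:
    """Upper bounds on the state complexity of the four matching languages.

    m bounds the state complexity of {w}, n bounds that of T.
    """
    cases = ("prefix", "suffix", "factor", "subsequence")
    if unary:
        # Over a one-letter alphabet the four cases coincide.
        return {case: m + n - 2 for case in cases}
    return {
        "prefix": m + n - 1,
        "suffix": (m - 1) * n - (m - 2),
        "factor": (m - 1) * n,
        "subsequence": (m - 1) * n,
    }


if __name__ == "__main__":
    print(single_word_bounds(4, 5))
    print(single_word_bounds(4, 5, unary=True))
```

For instance, with m = 4 (a word of length 2) and n = 5, prefix matching needs at most 8 states while factor matching may need 15.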

Terminology and Notation
A deterministic finite automaton (DFA) is a 5-tuple D = (Q, Σ, δ, q_0, F), where Q is a finite non-empty set of states, Σ is a finite non-empty alphabet, δ : Q × Σ → Q is the transition function, q_0 ∈ Q is the initial state, and F ⊆ Q is the set of final states. We extend δ to functions δ : Q × Σ* → Q and δ : 2^Q × Σ* → 2^Q as usual.
A DFA D accepts a word w ∈ Σ* if δ(q_0, w) ∈ F; the language accepted by D is L(D) = {w ∈ Σ* | δ(q_0, w) ∈ F}. If q is a state of D, then the language L_q(D) of q is the language accepted by the DFA (Q, Σ, δ, q, F). Two states p and q of D are indistinguishable if L_p(D) = L_q(D), and a state q is reachable if δ(q_0, u) = q for some u ∈ Σ*. Let L be a language over Σ. The quotient of L by a word x ∈ Σ* is the set x^{−1}L = {y ∈ Σ* | xy ∈ L}. A DFA D is minimal if it has the smallest number of states and the smallest alphabet among all DFAs accepting L(D). It is well known that a DFA is minimal if it uses the smallest alphabet, all of its states are reachable, and no two states are indistinguishable.
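The minimality criterion above (all states reachable, no two states indistinguishable) suggests a direct way to compute κ(L) from any complete DFA for L. The following is a minimal sketch using Moore-style partition refinement; the dictionary encoding of δ and the helper names `state_complexity` and `single_word_dfa` are assumptions made for this illustration.

```python
def state_complexity(alphabet, delta, start, finals):
    """Number of states of the minimal complete DFA equivalent to the given one.

    delta maps (state, letter) pairs to states and must be total.
    """
    # Step 1: keep only the states reachable from the initial state.
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            r = delta[q, a]
            if r not in reachable:
                reachable.add(r)
                stack.append(r)
    # Step 2: refine the final / non-final partition until it is stable.
    block = {q: int(q in finals) for q in reachable}
    while True:
        ids = {}
        new_block = {}
        for q in reachable:
            signature = (block[q],) + tuple(block[delta[q, a]] for a in alphabet)
            new_block[q] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = new_block


def single_word_dfa(w, alphabet):
    """Complete DFA for {w}: states 0..len(w) count matched letters, plus a sink."""
    sink = len(w) + 1
    delta = {}
    for i in range(len(w) + 2):
        for a in alphabet:
            delta[i, a] = i + 1 if i < len(w) and a == w[i] else sink
    return delta, 0, {len(w)}


if __name__ == "__main__":
    print(state_complexity("ab", *single_word_dfa("ab", "ab")))
```

The blocks of the stable partition correspond to the distinct quotients of the language, so their number is the state complexity.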
We sometimes define transition functions as transformations induced by letters, written as a : t where t : Q → Q, for a ∈ Σ. In particular, we use 𝟙 to denote the identity transformation (i.e., δ(q, a) = q for all q ∈ Q), and (q_0, q_1, …, q_{k−1}) to denote a k-cycle, where δ(q_i, a) = q_{i+1} for 0 ≤ i ≤ k − 2 and δ(q_{k−1}, a) = q_0. For states not in {q_0, q_1, …, q_{k−1}}, the k-cycle acts as the identity transformation.
Throughout the paper, we fix w = a_1 ··· a_{m−2}, where a_i ∈ Σ for 1 ≤ i ≤ m − 2. Let w_0 = ε (where ε denotes the empty word) and for 1 ≤ i ≤ m − 2, let w_i = a_1 ··· a_i. We write W = {w_0, w_1, …, w_{m−2}} for the set of all prefixes of w. Note that if the state complexity of {w} is m, then w is of length m − 2: the minimal DFA of {w} has one state for each of the m − 1 prefixes of w, plus one empty state.

Matching a Single Prefix
For prefix matching, the upper bounds derived in the proof below are κ((wΣ*) ∩ T) ≤ m + n − 1 if |Σ| ≥ 2, and κ((wΣ*) ∩ T) ≤ m + n − 2 if |Σ| = 1, where κ({w}) ≤ m and κ(T) ≤ n. Furthermore, these upper bounds are tight.
Remark 1. When |Σ| = 1 (that is, P and T are languages over a unary alphabet), the tight upper bound m + n − 2 actually holds in all four cases we consider in this paper. This is because if L is a language over a unary alphabet Σ, then the ideals LΣ*, Σ*L, Σ*LΣ* and Σ* ⧢ L (the shuffle of Σ* with L) coincide; thus the prefix, suffix, factor and subsequence matching cases coincide.
Proof. We first derive upper bounds for the two cases of |Σ|.
Upper Bounds: Also define α(∅, a) = ∅ for all a ∈ Σ. Let the state reached by w in D_T be q_r = δ(q_0, w); we construct a DFA D_L that accepts L = (wΣ*) ∩ T, as shown in Figure 1.
[Figure 1. Arbitrary DFA D_T; the q_{i_j} are not necessarily distinct.]
Recall that in a DFA D, if state q is reached from the initial state by a word u, then the language of q is equal to the quotient of L(D) by u. Thus the language of state q_r is the quotient of T by w, that is, the set w^{−1}T = {y ∈ Σ* | wy ∈ T}. The DFA D_L accepts a word x if and only if it has the form wy for y ∈ w^{−1}T; we need the prefix w to reach the arbitrary DFA D_T, and w must be followed by a word that sends q_r to an accepting state, that is, a word y in the language of q_r. That is, L is the set of all words of T that begin with w, as required.
It follows that the state complexity of L is less than or equal to m + n − 1 (the m − 2 proper prefixes of w, the state ∅, and the n states of D_T). If |Σ| = 1, all the Σ \ {a_i} are empty and state ∅ is not needed. Hence the state complexity of L is less than or equal to m + n − 2 in this case.
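The upper-bound argument can be made concrete with a small sketch. It assumes, as the surviving text suggests, that D_L consists of a chain of states for the proper prefixes of w, a sink playing the role of ∅, and a copy of D_T entered at q_r. The state names `("w", i)`, `("t", q)`, `"sink"` and the function names are illustrative only.

```python
import itertools


def prefix_match_dfa(w, alphabet, delta_t, start_t, finals_t):
    """Complete DFA for (w Sigma*) ∩ T built from a complete DFA for T (w non-empty)."""
    # q_r is the state of D_T reached by reading w from its initial state.
    q_r = start_t
    for a in w:
        q_r = delta_t[q_r, a]
    sink = "sink"
    delta = {}
    # Chain state ("w", i) records that the proper prefix w[:i] has been read.
    for i in range(len(w)):
        nxt = ("w", i + 1) if i + 1 < len(w) else ("t", q_r)
        for a in alphabet:
            delta[("w", i), a] = nxt if a == w[i] else sink
    for a in alphabet:
        delta[sink, a] = sink
    # Once w has been read, the automaton behaves exactly like D_T.
    for (q, a), r in delta_t.items():
        delta[("t", q), a] = ("t", r)
    return delta, ("w", 0), {("t", q) for q in finals_t}


def accepts(delta, start, finals, word):
    """Run a complete DFA on a word and report acceptance."""
    q = start
    for a in word:
        q = delta[q, a]
    return q in finals


def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    for k in range(max_len + 1):
        for letters in itertools.product(alphabet, repeat=k):
            yield "".join(letters)


if __name__ == "__main__":
    # Example text language: words over {a, b} with an even number of a's.
    even_a = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
    dfa = prefix_match_dfa("ab", "ab", *even_a)
    print(len({q for q, _ in dfa[0]}), "states")
```

With |w| = m − 2 and an n-state D_T, this construction uses (m − 2) + 1 + n = m + n − 1 states, matching the bound above; over a unary alphabet the sink is never entered.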
In the unary case, the witness T satisfies κ(T) = n, where δ_1 is defined by a : (0, 1, …, n − 1), and r = δ_1(0, a^{m−2}). Let D_L be the DFA shown in Figure 2 for the language L = (PΣ*) ∩ T. Obviously D_L has m + n − 2 states and they are all reachable. Since the shortest word accepted from any state is distinct from that of any other state, all the states are pairwise distinguishable. Hence P and T constitute witnesses that meet the required bound.
Construct the DFA D_L for the language L = (PΣ*) ∩ T as shown in Figure 3. It is clear that all states are reachable and distinguishable by their shortest accepted words.

Matching a Single Suffix
Let w, x, y, z ∈ Σ*. We introduce some notation:
- x ≤_p y means x is a prefix of y, and x ≥_s y means x has y as a suffix.
- If x ≥_s y and y ≤_p z, we say y is a bridge from x to z or that y connects x to z. We also denote this by x → y → z.
- x ⇒ y ⇒ z means that y is the longest bridge from x to z. That is, x → y → z, and whenever x → w → z we have |w| ≤ |y|. Equivalently, y is the longest suffix of x that is also a prefix of z.
Proposition 1 concerns the DFA A = (W, Σ, δ_A, w_0, {w_{m−2}}) for Σ*w, in which δ_A(w_i, a) is defined to be the maximal-length bridge from w_i a to w, or equivalently, the longest suffix of w_i a that is also a prefix of w.
We observe that every state w_i ∈ W is reachable from w_0 by the word w_i, and that each state w_i is distinguished from all other states by a_{i+1} ··· a_{m−2}. It remains to be shown that Σ*w = L(A). In the following, for convenience, we simply write δ rather than δ_A.
We claim that for all w_i ∈ W and x ∈ Σ*, we have w_i x ⇒ δ(w_i, x) ⇒ w. That is, the defining property of the transition function extends nicely to words. Recall that the extension of δ to words is defined inductively in terms of the behavior of δ on letters, so it is not immediately clear that this property carries over to words.
We prove this claim by induction on |x|. If x = ε, this is clear. Now suppose x = ya for some y ∈ Σ* and a ∈ Σ, and that w_i y ⇒ δ(w_i, y) ⇒ w. Let w_j = δ(w_i, y) and w_k = δ(w_i, x). Since w_k = δ(w_j, a), by definition we have w_j a ⇒ w_k ⇒ w. Since δ(w_i, y) = w_j, we have w_i y ⇒ w_j ⇒ w. In particular, w_i y ≥_s w_j and thus w_i x = w_i ya ≥_s w_j a. Thus w_i x ≥_s w_j a ≥_s w_k; since also w_k ≤_p w, we have w_i x → w_k → w as required.
Next, we show that whenever w_i x → w_ℓ → w, we have |w_ℓ| ≤ |w_k|. If w_ℓ = ε, this is immediate, so suppose w_ℓ ≠ ε. Since w_i x = w_i ya ≥_s w_ℓ, and w_ℓ is nonempty, it follows that w_ℓ ends with a. Thus w_ℓ = w_{ℓ−1} a. Since w_i ya ≥_s w_{ℓ−1} a, we have w_i y ≥_s w_{ℓ−1}. Additionally, w_{ℓ−1} ≤_p w, so w_i y → w_{ℓ−1} → w. Since w_i y ⇒ w_j ⇒ w, we have |w_{ℓ−1}| ≤ |w_j|. Since w_i y ≥_s w_j and w_i y ≥_s w_{ℓ−1} and |w_j| ≥ |w_{ℓ−1}|, we have w_j ≥_s w_{ℓ−1}. Thus w_j a ≥_s w_{ℓ−1} a = w_ℓ. It follows that w_j a → w_ℓ → w. But recall that δ(w_i, x) = δ(w_j, a) = w_k, so w_j a ⇒ w_k ⇒ w, and |w_ℓ| ≤ |w_k| as required.
Now, we show that A accepts the language Σ*w. Suppose x ∈ Σ*w and write x = yw. The initial state of A is w_0 = ε. We have yw ⇒ δ(ε, yw) ⇒ w, that is, δ(ε, yw) is the longest suffix of yw that is also a prefix of w. But this longest suffix is simply w itself, which is the final state. So x is accepted. Conversely, suppose x ∈ Σ* is accepted by A. Then δ(ε, x) = w, and thus x ⇒ w ⇒ w by definition. In particular, this means x ≥_s w, and so x ∈ Σ*w.
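The DFA A for Σ*w can be sketched directly from its defining property: each transition goes to the longest suffix of w_i a that is also a prefix of w. In the sketch below, states are represented by the prefixes themselves as Python strings; the names `longest_bridge`, `suffix_dfa`, `run` and `words` are illustrative, and the quadratic-time search is chosen for clarity rather than efficiency.

```python
import itertools


def longest_bridge(x, z):
    """Longest suffix of x that is also a prefix of z."""
    for k in range(min(len(x), len(z)), -1, -1):
        if x.endswith(z[:k]):  # k = 0 always succeeds (empty word)
            return z[:k]


def suffix_dfa(w, alphabet):
    """DFA for Sigma* w whose states are the prefixes of w."""
    prefixes = [w[:i] for i in range(len(w) + 1)]
    delta = {(p, a): longest_bridge(p + a, w) for p in prefixes for a in alphabet}
    return delta, "", {w}


def run(delta, start, word):
    """State reached from start after reading word."""
    q = start
    for a in word:
        q = delta[q, a]
    return q


def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    for k in range(max_len + 1):
        for letters in itertools.product(alphabet, repeat=k):
            yield "".join(letters)


if __name__ == "__main__":
    delta, start, finals = suffix_dfa("aba", "ab")
    for x in ("", "ab", "aba", "abab", "ababa"):
        print(repr(x), "->", repr(run(delta, start, x)))
```

The claim proved above says that the state reached on any input x is again the longest suffix of x that is a prefix of w, which is what the accompanying check compares against by brute force.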
Next we establish an upper bound on the state complexity of (Σ*w) ∩ T. The upper bound in this case is quite complicated to derive. Suppose {w} has state complexity m and T has state complexity at most n, for m ≥ 3 and n ≥ 2. Let A be the (m − 1)-state DFA for Σ*w defined in Proposition 1, and let D be an n-state DFA for T with state set Q_n, transition function α, and final state set F. The direct product A × D with final state set {w} × F recognizes (Σ*w) ∩ T. We claim that this direct product has at most (m − 1)n − (m − 2) reachable and pairwise distinguishable states, and thus the state complexity of (Σ*w) ∩ T is at most (m − 1)n − (m − 2).
Since A has m − 1 states and D has n states, there are at most (m − 1)n reachable states. It will suffice to show that for each word w_i with 1 ≤ i ≤ m − 2, there exists a word w_{f(i)} ≠ w_i and a state p_i ∈ Q_n such that (w_i, p_i) is indistinguishable from (w_{f(i)}, p_i). This gives m − 2 states that are each indistinguishable from another state, establishing the upper bound. We take w_{f(i)} to be the longest suffix of w_i that is also a proper prefix of w_i. To find p_i, first observe that there exists a non-final state q ∈ Q_n and a state r ∈ Q_n such that α(r, w) = q. Indeed, if no such states existed, then for all states r, the state α(r, w) would be final. Thus we would have Σ*w ⊆ T, and the state complexity of (Σ*w) ∩ T = Σ*w would be m − 1, which is lower than our upper bound since n ≥ 2. Now, set p_i = α(r, w_i), and note that α(p_i, a_{i+1}) = p_{i+1}, and α(p_i, a_{i+1} ··· a_{m−2}) = q.
To establish the upper bound, we will need two technical lemmas. Their proofs can be found in [2].
We show by induction on m − 2 − i that (w_i, p_i) and (w_{f(i)}, p_i) are indistinguishable. If m − 2 − i = 0, the states in question are (w_{m−2}, p_{m−2}) and (w_{f(m−2)}, p_{m−2}). By Lemma 1, we have δ_A(w_{m−2}, a) = δ_A(w_{f(m−2)}, a) for all a ∈ Σ. Thus non-empty words cannot distinguish the states. But recall that p_{m−2} = q is a non-final state, so the states we are trying to distinguish are both non-final, and thus the empty word does not distinguish the states either. So these states are indistinguishable.
Now, suppose m − 2 − i > 0, that is, i < m − 2. Assume that states (w_{i+1}, p_{i+1}) and (w_{f(i+1)}, p_{i+1}) are indistinguishable. We want to show that (w_i, p_i) and (w_{f(i)}, p_i) are indistinguishable. Since f(i) < i < m − 2, both states are non-final, and thus the empty word cannot distinguish them. By Lemma 1, δ_A(w_i, a) = δ_A(w_{f(i)}, a) for all a ∈ Σ with a ≠ a_{i+1}. So only words that start with a_{i+1} can possibly distinguish the states. But by Lemma 2, letter a_{i+1} sends the states to (w_{i+1}, p_{i+1}) and (w_{f(i+1)}, p_{i+1}), which are indistinguishable by the induction hypothesis. Thus the states cannot be distinguished.
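The suffix-matching bound can be spot-checked experimentally. The sketch below builds the direct product of the DFA for Σ*w with pseudo-randomly generated n-state DFAs and minimizes it. The random sampling, the seed, and all helper names are assumptions of this illustration; a sample of this kind can only fail to contradict the bound, it does not prove it.

```python
import random


def state_complexity(alphabet, delta, start, finals):
    """Number of states of the minimal complete DFA equivalent to the given one."""
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            r = delta[q, a]
            if r not in reachable:
                reachable.add(r)
                stack.append(r)
    block = {q: int(q in finals) for q in reachable}
    while True:  # Moore refinement until the partition is stable
        ids, new_block = {}, {}
        for q in reachable:
            signature = (block[q],) + tuple(block[delta[q, a]] for a in alphabet)
            new_block[q] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = new_block


def longest_bridge(x, z):
    """Longest suffix of x that is also a prefix of z."""
    for k in range(min(len(x), len(z)), -1, -1):
        if x.endswith(z[:k]):
            return z[:k]


def suffix_dfa(w, alphabet):
    """DFA for Sigma* w whose states are the prefixes of w."""
    prefixes = [w[:i] for i in range(len(w) + 1)]
    delta = {(p, a): longest_bridge(p + a, w) for p in prefixes for a in alphabet}
    return delta, "", {w}


def product(alphabet, dfa1, dfa2):
    """Direct product recognizing the intersection of the two languages."""
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    states1 = {q for q, _ in d1}
    states2 = {q for q, _ in d2}
    delta = {((p, q), a): (d1[p, a], d2[q, a])
             for p in states1 for q in states2 for a in alphabet}
    return delta, (s1, s2), {(p, q) for p in f1 for q in f2}


def random_dfa(n, alphabet, rng):
    """Pseudo-random complete DFA with states 0..n-1 (hypothetical test input)."""
    delta = {(q, a): rng.randrange(n) for q in range(n) for a in alphabet}
    return delta, 0, {q for q in range(n) if rng.random() < 0.5}


def suffix_match_complexity(w, alphabet, dfa_t):
    """State complexity of (Sigma* w) ∩ T for the given DFA of T."""
    return state_complexity(alphabet, *product(alphabet, suffix_dfa(w, alphabet), dfa_t))


def bound_holds_on_sample(trials=20, seed=0):
    """Check the bound (m-1)n - (m-2) on a sample of random text DFAs."""
    rng = random.Random(seed)
    for w in ("a", "ab", "aba", "abab"):
        m = len(w) + 2
        for n in range(2, 5):
            for _ in range(trials):
                k = suffix_match_complexity(w, "ab", random_dfa(n, "ab", rng))
                if k > (m - 1) * n - (m - 2):
                    return False
    return True


if __name__ == "__main__":
    print(bound_holds_on_sample())
```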
Next we show that the upper bound of Proposition 2 is tight.

Matching a Single Factor
Proof. Let A = (W, Σ, δ_A, w_0, {w_{m−2}}) be the DFA with transitions defined as follows: for all a ∈ Σ and w_i ∈ W, δ_A(w_i, a) is the longest suffix of w_i a that is also a prefix of w. Recall from Proposition 1 that A recognizes Σ*w. We modify A to obtain a DFA A′ that accepts Σ*wΣ* as follows.
Let A′ = (W, Σ, δ_{A′}, w_0, {w_{m−2}}), where δ_{A′} is defined as follows for each a ∈ Σ: δ_{A′}(w_i, a) = δ_A(w_i, a) for i < m − 2, and δ_{A′}(w_{m−2}, a) = w_{m−2}. Note that A′ is minimal: state w_i can be reached by the word w_i, and states w_i and w_j with i < j are distinguished by a_{j+1} ··· a_{m−2}. It remains to show that A′ accepts Σ*wΣ*. To simplify the notation, we write δ′ instead of δ_{A′} and δ instead of δ_A.
Suppose x is accepted by A′. Write x = yz, where y is the shortest prefix of x such that δ′(ε, y) = w_{m−2}. Since y is minimal in length, for every proper prefix y′ of y, we have δ′(ε, y′) = w_i for some i < m − 2. It follows that δ′(ε, y) = δ(ε, y) by the definition of δ′. So δ(ε, y) = w_{m−2}, and hence y is accepted by A. It follows that y ∈ Σ*w. This implies x = yz ∈ Σ*wΣ*.
Conversely, suppose x ∈ Σ*wΣ*, and let y be the shortest prefix of x that lies in Σ*w. No proper prefix of y is accepted by A, so δ′ and δ agree along y and δ′(ε, y) = δ(ε, y) = w_{m−2}. Since every letter fixes w_{m−2} in A′, x is accepted by A′.
Fix a word w such that {w} has state complexity m, and let A and A′ be the DFAs for Σ*w and Σ*wΣ*, respectively, as described in the proof of Proposition 3. Fix T with state complexity at most n, and let D be an n-state DFA for T with state set Q_n and final state set F. The direct product DFA A′ × D with final state set {w} × F recognizes (Σ*wΣ*) ∩ T. Since A′ × D has (m − 1)n states, this gives an upper bound of (m − 1)n on the state complexity of (Σ*wΣ*) ∩ T.
Proof. Let Σ = {a, b} and let w = b^{m−2}. Let A′ be the DFA for Σ*wΣ*. Let T be the language of Definition 1. The DFA A′ × D is illustrated in Figure 6.
We show that A′ × D has (m − 1)n reachable and pairwise distinguishable states. For reachability, for 0 ≤ i ≤ m − 2 and 0 ≤ q ≤ n − 1, we can reach (b^i, q) from the initial state (ε, 0) by the word a^q b^i. For distinguishability, suppose we have states (b^i, q) and (b^j, q) in the same column q, with i < j. By b^{m−2−j} we reach (b^{m−2+i−j}, q) and (w, q), with b^{m−2+i−j} ≠ w. Then by a we reach (ε, qa) and (w, qa), which are distinguishable by a word in a*. For states in different columns, suppose we have (b^i, p) and (b^j, q) with p < q. By a sufficiently long word in b*, we reach (w, p) and (w, q). These states are distinguishable by a^{n−1−q}. So all reachable states are pairwise distinguishable.
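The factor-matching witness can be checked mechanically for small m and n. Since the statement of Definition 1 is not reproduced here, the sketch below assumes a text DFA consistent with how the proof uses it: letter a acts as the cycle (0, 1, …, n − 1), letter b acts as the identity, and n − 1 is the only final state. That choice, and all function names, are assumptions of this illustration rather than the paper's definition.

```python
def state_complexity(alphabet, delta, start, finals):
    """Number of states of the minimal complete DFA equivalent to the given one."""
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            r = delta[q, a]
            if r not in reachable:
                reachable.add(r)
                stack.append(r)
    block = {q: int(q in finals) for q in reachable}
    while True:  # Moore refinement until the partition is stable
        ids, new_block = {}, {}
        for q in reachable:
            signature = (block[q],) + tuple(block[delta[q, a]] for a in alphabet)
            new_block[q] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = new_block


def longest_bridge(x, z):
    """Longest suffix of x that is also a prefix of z."""
    for k in range(min(len(x), len(z)), -1, -1):
        if x.endswith(z[:k]):
            return z[:k]


def factor_dfa(w, alphabet):
    """DFA for Sigma* w Sigma*: the suffix DFA with its final state made absorbing."""
    prefixes = [w[:i] for i in range(len(w) + 1)]
    delta = {(p, a): w if p == w else longest_bridge(p + a, w)
             for p in prefixes for a in alphabet}
    return delta, "", {w}


def cyclic_text_dfa(n):
    """ASSUMED stand-in for Definition 1: a cycles 0..n-1, b is the identity."""
    delta = {}
    for q in range(n):
        delta[q, "a"] = (q + 1) % n
        delta[q, "b"] = q
    return delta, 0, {n - 1}


def product(alphabet, dfa1, dfa2):
    """Direct product recognizing the intersection of the two languages."""
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    states1 = {q for q, _ in d1}
    states2 = {q for q, _ in d2}
    delta = {((p, q), a): (d1[p, a], d2[q, a])
             for p in states1 for q in states2 for a in alphabet}
    return delta, (s1, s2), {(p, q) for p in f1 for q in f2}


def accepts(delta, start, finals, word):
    """Run a complete DFA on a word and report acceptance."""
    q = start
    for a in word:
        q = delta[q, a]
    return q in finals


def factor_witness_complexity(m, n):
    """State complexity of (Sigma* w Sigma*) ∩ T for w = b^(m-2) and the assumed T."""
    w = "b" * (m - 2)
    return state_complexity("ab", *product("ab", factor_dfa(w, "ab"), cyclic_text_dfa(n)))


if __name__ == "__main__":
    for m, n in ((3, 2), (4, 3), (5, 4)):
        print(m, n, factor_witness_complexity(m, n), (m - 1) * n)
```

Under these assumptions the reachability and distinguishability arguments of the proof apply verbatim, so the computed value should equal (m − 1)n.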

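Matching a Single Subsequence
Proof. Define a DFA A = (W, Σ, δ_A, w_0, {w_{m−2}}), where δ_A(w_i, a) = w_{i+1} if i < m − 2 and a = a_{i+1}, and δ_A(w_i, a) = w_i otherwise.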
Note that A is minimal: state w_i is reached by word w_i and states w_i, w_j with i < j are distinguished by a_{j+1} ··· a_{m−2}. We claim that A recognizes the shuffle Σ* ⧢ w. Write δ rather than δ_A to simplify the notation. Suppose x ∈ Σ* ⧢ w. Then we can write x = x_0 a_1 x_1 a_2 x_2 ··· a_{m−2} x_{m−2}, where x_0, …, x_{m−2} ∈ Σ*. We claim that δ(ε, x_0 a_1 x_1 ··· a_i x_i) = w_j for some j ≥ i. We proceed by induction on i. The base case i = 0 is trivial. Now, suppose that i > 0 and δ(ε, x_0 a_1 x_1 ··· a_{i−1} x_{i−1}) = w_j for some j ≥ i − 1. Then δ(ε, x_0 a_1 x_1 ··· a_i x_i) = δ(w_j, a_i x_i). We consider two cases, j = i − 1 and j ≥ i, and in each we obtain δ(w_j, a_i x_i) = w_k for some k ≥ i. This completes the inductive proof. It follows then that δ(ε, x) = w_{m−2} = w, and so x is accepted by A. Conversely, if x is accepted by A, then it is clear from the definition of the transition function that the letters a_1, a_2, …, a_{m−2} must occur within x in order, and so x ∈ Σ* ⧢ w.
Fix a word w such that {w} has state complexity m, and let A be the DFA for the shuffle Σ* ⧢ w described in the proof of Proposition 4. Fix T with state complexity at most n, and let D be an n-state DFA for T with state set Q_n and final state set F. The direct product DFA A × D with final state set {w} × F recognizes (Σ* ⧢ w) ∩ T. Since A × D has (m − 1)n states, this gives an upper bound of (m − 1)n on the state complexity of (Σ* ⧢ w) ∩ T.
Proof. Let Σ = {a, b} and let w = b^{m−2}. Let A be the DFA for Σ* ⧢ w. Let T be the language of Definition 1. The DFA A × D is illustrated in Figure 7.
We show that A × D has (m − 1)n reachable and pairwise distinguishable states. For reachability, for 0 ≤ i ≤ m − 2 and 0 ≤ q ≤ n − 1, we can reach (b^i, q) from the initial state (ε, 0) by the word a^q b^i. For distinguishability, suppose we have states (b^i, q) and (b^j, q) in the same column q, with i < j. By b^{m−2−j} we reach (b^{m−2+i−j}, q) and (w, q), with b^{m−2+i−j} ≠ w. These states are distinguishable by a word in a*. For states in different columns, suppose we have (b^i, p) and (b^j, q) with p < q. By a sufficiently long word in b*, we reach (w, p) and (w, q). These states are distinguishable by a^{n−1−q}. So all reachable states are pairwise distinguishable.
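For subsequence matching, a natural automaton keeps a counter of how many letters of w have been matched so far and advances only on the next expected letter. The sketch below assumes that this is the DFA A used above, and again assumes a text DFA in which a acts as the cycle (0, 1, …, n − 1), b as the identity, and n − 1 is the only final state, since Definition 1 is not reproduced here. All names are illustrative.

```python
def state_complexity(alphabet, delta, start, finals):
    """Number of states of the minimal complete DFA equivalent to the given one."""
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            r = delta[q, a]
            if r not in reachable:
                reachable.add(r)
                stack.append(r)
    block = {q: int(q in finals) for q in reachable}
    while True:  # Moore refinement until the partition is stable
        ids, new_block = {}, {}
        for q in reachable:
            signature = (block[q],) + tuple(block[delta[q, a]] for a in alphabet)
            new_block[q] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = new_block


def subsequence_dfa(w, alphabet):
    """State i means that w[:i] has been matched as a subsequence so far."""
    delta = {}
    for i in range(len(w) + 1):
        for a in alphabet:
            delta[i, a] = i + 1 if i < len(w) and a == w[i] else i
    return delta, 0, {len(w)}


def is_subsequence(w, x):
    """Reference check: do the letters of w occur in x in order?"""
    remaining = iter(x)
    return all(letter in remaining for letter in w)


def cyclic_text_dfa(n):
    """ASSUMED stand-in for Definition 1: a cycles 0..n-1, b is the identity."""
    delta = {}
    for q in range(n):
        delta[q, "a"] = (q + 1) % n
        delta[q, "b"] = q
    return delta, 0, {n - 1}


def product(alphabet, dfa1, dfa2):
    """Direct product recognizing the intersection of the two languages."""
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    states1 = {q for q, _ in d1}
    states2 = {q for q, _ in d2}
    delta = {((p, q), a): (d1[p, a], d2[q, a])
             for p in states1 for q in states2 for a in alphabet}
    return delta, (s1, s2), {(p, q) for p in f1 for q in f2}


def accepts(delta, start, finals, word):
    """Run a complete DFA on a word and report acceptance."""
    q = start
    for a in word:
        q = delta[q, a]
    return q in finals


def subsequence_witness_complexity(m, n):
    """State complexity of the subsequence-matching language for w = b^(m-2)."""
    w = "b" * (m - 2)
    return state_complexity(
        "ab", *product("ab", subsequence_dfa(w, "ab"), cyclic_text_dfa(n)))


if __name__ == "__main__":
    for m, n in ((3, 2), (4, 3), (5, 4)):
        print(m, n, subsequence_witness_complexity(m, n), (m - 1) * n)
```

Under these assumptions the computed complexity should coincide with the bound (m − 1)n, mirroring the factor-matching case.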

Conclusions
Building on previous work, we investigated the state complexity of "pattern matching" operations on regular languages, based on finding all words in a text language T which contain the single word w as either a prefix, suffix, factor, or subsequence. In all cases, the bounds were significantly lower than the general case, where w is replaced by a regular language P . Prefix matching is now linear in the input languages' state complexities, and the remaining cases are polynomial in the input state complexities. The general bounds were polynomial for prefix matching and exponential in the other cases. It is also worth noting that a binary alphabet is sufficient to reach all these bounds, including subsequence matching, whose bound was defined in terms of a growing alphabet in the general case. For languages with a unary alphabet, the state complexity was linear in all four cases.