State Complexity of Suffix Distance

The neighbourhood of a regular language with respect to the prefix, suffix and subword distance is always regular, and a tight bound for the state complexity of prefix distance neighbourhoods is known. We give upper bounds for the state complexity of the neighbourhood of radius k of an n-state DFA (deterministic finite automaton) language with respect to the suffix distance and the subword distance, respectively. For restricted values of k and n we give a matching lower bound for the state complexity of suffix distance neighbourhoods.


Introduction
Distances between strings and languages are used in many applications [4,7,10,9]. Perhaps the most commonly used distance, the Levenshtein distance (a.k.a. the edit distance), is defined in terms of the number of substitution, insertion and deletion operations needed to transform one string into another. The prefix distance [3,11] of strings x and y is the sum of the lengths of the suffixes of x and y after their longest common prefix. The suffix distance (respectively, the subword distance) of two strings is defined analogously in terms of the longest common suffix (respectively, subword) of the strings.
Calude et al. [2] have shown that additive quasi-distances preserve regularity in the sense that a neighbourhood of a regular language is always regular. The edit distance is the best known example of additive distances. However, not all regularity preserving distances are additive. The prefix, suffix, and subword distances are not additive, but are known to preserve regularity [3].
In general, since the 90's there has been much work on the state complexity of regular languages. Recent surveys on the descriptional complexity of regular languages include [5,6,12]. For regularity preserving distances an important question is to determine the state complexity of the distance, that is, what is the optimal size of a DFA (deterministic finite automaton) recognizing a neighbourhood of radius k of an n state DFA language. In the context of error correction this can be viewed also as the descriptional complexity of error detection [14]. The descriptional complexity of error systems has been considered from a different point of view by Kari and Konstantinidis [8]. They establish upper and lower bounds for the size of DFAs needed to recognize a given error system.
A neighbourhood of a language recognized by a DFA A with respect to the prefix distance, roughly speaking, can be recognized by simulating the computation of A and, for each non-final state, keeping track of the shortest path (up to the radius of the neighbourhood) to a final state of A. Additionally, we just need a number of error states equal to the radius of the neighbourhood. This means that prefix distance is an "inexpensive" operation in terms of state complexity. A tight lower bound for the state complexity of prefix distance neighbourhoods is known both for general regular languages and for finite languages [15,16].
On the other hand, suffix distance (and subword distance) neighbourhoods are considerably more "difficult", that is, more expensive in terms of state complexity, to recognize by a DFA because the computation has no way of knowing where the longest common suffix begins. This means that the computation has to be inherently nondeterministic and, as can perhaps be expected, the state complexity of the neighbourhood depends exponentially on the size of the original DFA and the radius of the neighbourhood. This paper shows that the suffix distance neighbourhood of radius k of an n-state DFA language over an alphabet Σ of size at least 2 can be recognized by a DFA with $\frac{|\Sigma|^k - 1}{|\Sigma| - 1} + 2^n - 1$ states when k < n. If the DFA recognizes a finite language, a separate upper bound for the state complexity of the neighbourhood is given in Theorem 3. We give matching lower bound constructions both for general regular languages and for finite languages using a binary alphabet in the case when n is roughly equal to 2k. For k > n, we show that the suffix distance neighbourhood can be recognized by a DFA with $(k - n) + 2^{n+1} - 2$ states and give matching lower bound constructions for both general regular languages and finite languages over an alphabet of size n + 1. We also show that for the class of suffix-closed languages, the neighbourhood is recognized by a DFA with at most n + k + 1 states and that this bound is tight for all k ∈ N. Finally, we derive an upper bound for the state complexity of subword distance neighbourhoods, but it remains open whether the bound is tight.

Preliminaries
We recall some basic definitions on regular languages and distance measures. For all unexplained notions on finite automata and regular languages the reader may consult the textbook by Shallit [18] or the survey by Yu [19]. A survey of distances is given by Deza and Deza [4].
In the following Σ is always a finite alphabet, the set of strings over Σ is $\Sigma^*$ and ε is the empty string. The set of nonnegative integers is $\mathbb{N}_0$. The cardinality of a finite set S is denoted |S| and the powerset of S is $2^S$.

A deterministic finite automaton (DFA) is a tuple $A = (Q, \Sigma, \delta, q_0, F)$ where Q is a finite set of states, Σ is an alphabet, δ is a partial function $\delta : Q \times \Sigma \to Q$, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is a set of final states. We extend the transition function δ to a partial function $\delta : Q \times \Sigma^* \to Q$ in the usual way. A DFA A is complete if δ is defined for all q ∈ Q and a ∈ Σ.
A string w ∈ Σ * is accepted by A if δ(q 0 , w) ∈ F . The language recognized by A is L(A) = {w ∈ Σ * | δ(q 0 , w) ∈ F }. Two states p and q of A are equivalent if δ(p, w) ∈ F if and only if δ(q, w) ∈ F for every string w ∈ Σ * . A DFA A is minimal if each state q ∈ Q is reachable from the initial state and no two states are equivalent.
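The definitions above can be illustrated with a minimal Python sketch of an incomplete DFA. The dictionary encoding of δ, the function names `run` and `accepts`, and the two-state example automaton for the language a*b are assumptions made for illustration only; they do not appear in the paper.

```python
def run(delta, q0, w):
    """Return delta(q0, w), or None if some transition is undefined.

    delta is a dict mapping (state, symbol) pairs to states; a missing
    key models an undefined transition of an incomplete DFA.
    """
    q = q0
    for a in w:
        if (q, a) not in delta:
            return None
        q = delta[(q, a)]
    return q


def accepts(delta, q0, finals, w):
    """A string w is accepted if delta(q0, w) is defined and final."""
    q = run(delta, q0, w)
    return q is not None and q in finals


# Hypothetical incomplete DFA for the language a*b.
example_delta = {(0, "a"): 0, (0, "b"): 1}
example_finals = {1}

print(accepts(example_delta, 0, example_finals, "aab"))
print(accepts(example_delta, 0, example_finals, "aba"))
```

The second call runs into the undefined transition δ(1, a), which is how an incomplete DFA rejects without a dedicated dead state.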
A nondeterministic finite automaton (NFA) is an extension of a DFA where the transition function is allowed to be multivalued, that is, δ is a function $\delta : Q \times \Sigma \to 2^Q$.

Note that our definition of a DFA allows some transitions to be undefined, that is, by a DFA we mean an incomplete DFA. It is well known that, for a regular language L, the sizes of the minimal incomplete and complete DFAs differ by at most one. The constructions in this paper are more convenient to formulate using incomplete DFAs, but our results would not change in any significant way if we were to require that all DFAs are complete. The (incomplete deterministic) state complexity of a regular language L, sc(L), is the size of the minimal DFA recognizing L.

Distances and neighbourhoods of regular languages
We recall definitions of the distance measures used in the following. Generally, a function $d : \Sigma^* \times \Sigma^* \to [0, \infty)$ is a distance if it satisfies, for all $x, y, z \in \Sigma^*$, the conditions $d(x, y) = 0$ if and only if $x = y$, $d(x, y) = d(y, x)$, and $d(x, z) \le d(x, y) + d(y, z)$. The neighbourhood of a language $L \subseteq \Sigma^*$ of radius $k$ with respect to $d$ is the set $E(L, d, k) = \{w \in \Sigma^* \mid (\exists x \in L)\ d(w, x) \le k\}$.

Let $x, y \in \Sigma^*$. The prefix distance of x and y counts the number of symbols which do not belong to the longest common prefix of x and y [3]. It is defined by
$$d_p(x, y) = |x| + |y| - 2 \cdot \max_{z \in \Sigma^*} \{|z| : x, y \in z\Sigma^*\}.$$
Similarly, the suffix distance of x and y counts the number of symbols which do not belong to the longest common suffix of x and y and is defined by
$$d_s(x, y) = |x| + |y| - 2 \cdot \max_{z \in \Sigma^*} \{|z| : x, y \in \Sigma^* z\}.$$
The subword distance measures the similarity of x and y based on their longest common continuous subword and is defined by
$$d_f(x, y) = |x| + |y| - 2 \cdot \max_{z \in \Sigma^*} \{|z| : x, y \in \Sigma^* z \Sigma^*\}.$$
The term "subword distance" is taken from Choffrut and Pighizzini [3]. However, "subword distance" has also been used for a distance defined in terms of the longest common noncontinuous subword [13].

It is known that neighbourhoods of regular languages with respect to the prefix, suffix and subword distance are always regular [3,15]. We refer to the size of the minimal DFA recognizing the radius k neighbourhood of an n-state DFA language with respect to a distance X simply as the state complexity of distance X. Tight bounds for the state complexity of the prefix distance are known [15]. Optimal bounds for the size of an NFA recognizing a suffix distance, or subword distance, neighbourhood of a regular language are also known [15]. The bounds on the size of the NFAs imply the following upper bounds for the deterministic state complexity of suffix distance and subword distance, respectively.

Proposition 1. Suppose L is a regular language recognized by a DFA with n states and k ∈ N. Then

Finally, we define the function $\psi_A : Q \to \mathbb{N}_0$ to give the length of the shortest path from the initial state $q_0$ to the state q. Formally, $\psi_A$ is defined by
$$\psi_A(q) = \min \{|w| : w \in \Sigma^*,\ \delta(q_0, w) = q\}.$$
Note that under this definition, $\psi_A(q_0) = 0$ for the initial state $q_0$.
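As a concrete illustration of the three distances, the following Python sketch computes each of them as "total length minus twice the length of the longest common prefix, suffix or continuous subword". The function names are chosen here for illustration and are not taken from the paper.

```python
def common_prefix_len(x, y):
    """Length of the longest common prefix of x and y."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def prefix_distance(x, y):
    return len(x) + len(y) - 2 * common_prefix_len(x, y)


def suffix_distance(x, y):
    # The longest common suffix is the longest common prefix of the reversals.
    return len(x) + len(y) - 2 * common_prefix_len(x[::-1], y[::-1])


def longest_common_subword_len(x, y):
    """Length of the longest common continuous subword (dynamic programming)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common subword ending at x[i-2], y[j-2].
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def subword_distance(x, y):
    return len(x) + len(y) - 2 * longest_common_subword_len(x, y)
```

For example, "abba" and "bba" share the suffix "bba", so their suffix distance is 4 + 3 - 6 = 1, while their prefix distance is 7 because they have no common nonempty prefix.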

State Complexity of Suffix Neighbourhoods
In this section, we consider the deterministic state complexity of suffix distance neighbourhoods. First, we construct a DFA for the neighbourhood of an n-state DFA of radius k with respect to the suffix distance $d_s$ when k < n, and then give a matching lower bound when $k = \frac{n}{2}$ for an n-state DFA.

Proposition 2. Let n > k ≥ 0 and L be a regular language recognized by a DFA with n states over an alphabet Σ, with |Σ| ≥ 2. Then there is a DFA recognizing $E(L, d_s, k)$ with at most $\frac{|\Sigma|^k - 1}{|\Sigma| - 1} + 2^n - 1$ states.
Proof. Let L be recognized by the DFA $A = (Q, \Sigma, \delta, q_0, F)$ with |Q| = n. We construct a DFA $A' = (Q', \Sigma, \delta', q'_0, F')$ that recognizes the neighbourhood $E(L, d_s, k)$. First, let us consider what it means if $w \in E(L(A), d_s, k)$. If w is in the neighbourhood, then there exists a word x recognized by A such that $d_s(w, x) \le k$. In other words, we can write $w = w'z$ and $x = x'z$ for words $w', x', z \in \Sigma^*$ such that $|w'| + |x'| \le k$. However, when $A'$ reads w, it is not known where such a common suffix z might begin. A common suffix may begin at each of the first k symbols of w, so $A'$ must keep track of and compute all possible common suffixes that begin at each of the first k symbols of w.
We define the state set by $Q' = \{0, \ldots, k\} \times 2^Q$ and the initial state by $q'_0 = (0, \{q \in Q \mid \psi_A(q) \le k\})$. The set of final states is given by $F' = \{(i, P) \in Q' \mid P \cap F \ne \emptyset\}$. In other words, a state (i, P) of $A'$ is final if and only if P contains a final state of A.
The state set consists of subsets of the original state set with a counter component. The operation of the machine begins by counting the first k steps of computation. On the ith step of the initial k steps, the machine reaches a state containing those states reachable from direct transitions from the set of states from the (i − 1)th computation step and adds every state reachable from q 0 within k − i steps and the counter component is incremented. After the kth computation step, no further steps need to be counted and the counter is no longer incremented since states are no longer added to the existing state sets.
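The transition function $\delta'$ is defined for $a \in \Sigma$ by
$$\delta'((i, P), a) = \begin{cases} \left(i + 1,\ \bigcup_{p \in P} \{\delta(p, a)\} \cup \{q \in Q \mid \psi_A(q) \le k - (i + 1)\}\right) & \text{if } i < k, \\ \left(k,\ \bigcup_{p \in P} \{\delta(p, a)\}\right) & \text{if } i = k, \end{cases}$$
where undefined transitions of A contribute nothing to the union.

We now show that reading a word $w \in \Sigma^*$ reaches the state (i, P) if and only if there exists a word $x \in \Sigma^*$ such that $w = w'z$ and $x = x'z$, where $|w'| \le i$, $|x'| \le k - i$ and $\delta(q_0, x) \in P$.

First, suppose that $\delta'(q'_0, w) = (i, P)$. We write $w = w'z$ with $w', z \in \Sigma^*$, which may possibly be empty. By definition, $\delta'(q'_0, w') = (|w'|, P')$ if $|w'| \le k$, and $P'$ contains all states q such that $\psi_A(q) \le k - |w'|$. In other words, these are the states $\delta(q_0, x')$ where $x' \in \Sigma^*$ is of length at most $k - |w'|$. Choose $q'$ to be one of these states and consider the state $\delta(q', z) = q$. Since $q' \in P'$ and $\delta'(q'_0, w) = \delta'((|w'|, P'), z) = (i, P)$, we have $q \in P$. Thus, there exists a word $x = x'z$ such that $|x'| \le k - i$ and $\delta(q_0, x) \in P$.

Now, conversely, suppose that for an input word $w = w'z$ with $|w'| \le i$, there exists a word $x = x'z$ with $|x'| \le k - i$ such that $q = \delta(q_0, x) \in P$. Since $|x'| \le k - i$, let $q' = \delta(q_0, x')$ and we have $\psi_A(q') \le k - i$. This means that we have $\delta'(q'_0, w') = (|w'|, P')$ with $q' \in P'$. Since $\delta(q', z) = q$, we have $\delta'((|w'|, P'), z) = (i, P)$ with $q \in P$ as desired.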
Thus, $\delta'(q'_0, w) \in F'$ if and only if there exists $x \in L$ such that $|w'| + |x'| \le k$ for $w = w'z$ and $x = x'z$.
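However, not all $(k + 1) \cdot 2^n$ states in $\{0, \ldots, k\} \times 2^Q$ are reachable. Note that for i < k, the only words that can be read to reach a state (i, P) are those of length exactly i. However, there are only $|\Sigma|^i$ words of length exactly i. Thus, the maximum number of reachable states for 0 ≤ i < k is
$$\sum_{i=0}^{k-1} |\Sigma|^i = \frac{|\Sigma|^k - 1}{|\Sigma| - 1}.$$
Furthermore, the empty set $\emptyset \subseteq Q$ does not occur as the second component of a reachable state, so there are at most $2^n - 1$ reachable states of the form (k, P). Thus, $A'$ has at most $\frac{|\Sigma|^k - 1}{|\Sigma| - 1} + 2^n - 1$ reachable states.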
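The counter-and-subset construction described in this proof can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the transition rule follows the prose description (on step i the image of the current set is taken and every state q with ψ_A(q) ≤ k − i is added, and after step k no states are added), the dictionary encoding of δ and all function names are hypothetical, and the two-state example DFA for (ab)* is chosen here only for demonstration.

```python
from collections import deque


def shortest_path_lengths(delta, q0, alphabet):
    """psi_A: breadth-first distance from q0 to every reachable state."""
    psi = {q0: 0}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for a in alphabet:
            r = delta.get((q, a))
            if r is not None and r not in psi:
                psi[r] = psi[q] + 1
                queue.append(r)
    return psi


def suffix_neighbourhood(delta, q0, finals, alphabet, k):
    """Return the reachable states (i, P) and a membership test for E(L, d_s, k)."""
    psi = shortest_path_lengths(delta, q0, alphabet)

    def near(bound):
        # States of A whose shortest path from q0 has length <= bound.
        return frozenset(q for q, d in psi.items() if d <= bound)

    def step(state, a):
        i, P = state
        image = frozenset(delta[(p, a)] for p in P if (p, a) in delta)
        if i < k:
            return (i + 1, image | near(k - i - 1))
        return (k, image)

    start = (0, near(k))
    reachable = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for a in alphabet:
            t = step(s, a)
            # An empty set component models an undefined transition.
            if t[1] and t not in reachable:
                reachable.add(t)
                queue.append(t)

    def accepts(w):
        s = start
        for a in w:
            s = step(s, a)
            if not s[1]:
                return False
        return bool(s[1] & frozenset(finals))

    return reachable, accepts


# Hypothetical example: A recognizes (ab)* with n = 2 states, radius k = 1.
delta = {(0, "a"): 1, (1, "b"): 0}
states, accepts = suffix_neighbourhood(delta, 0, {0}, "ab", k=1)
bound = (2**1 - 1) // (2 - 1) + 2**2 - 1
print(len(states), "reachable states; bound of Proposition 2:", bound)
```

The sketch counts reachable states only and does not minimize the resulting automaton, so it illustrates the upper bound and says nothing about tightness.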
The statement of Proposition 2 assumes that the cardinality of the alphabet is at least two. For suffix distance neighbourhoods of unary languages we have the following bounds. We note that in the unary case the suffix distance coincides with the prefix distance and leave the easy proof for the reader. For a constant size alphabet, the bound of Proposition 2 is significantly better than the bound implied by known results on nondeterministic state complexity in Proposition 1. Next we show that, at least for some values of the radius k, the bound of Proposition 2 is tight.
Then there exists a DFA A n with n states over a binary alphabet such that Proof. Let A n = (Q n , {a, b}, δ n , 0, {0}), shown in Figure 1. The following theorem then follows from Proposition 2 and Lemma 2.
Theorem 1. Let n > k and let L be a regular language recognized by an n-state DFA over an alphabet Σ with |Σ| ≥ 2. Then a DFA recognizing $E(L, d_s, k)$ requires at most $\frac{|\Sigma|^k - 1}{|\Sigma| - 1} + 2^n - 1$ states. There is a family of DFAs with n states over a binary alphabet which reaches this bound when $k = \frac{n}{2}$.
Now we will consider the case when the distance k is greater than the number of states n of the given DFA and give a matching lower bound.

Proposition 3. Let k > n > 0 and L be a regular language recognized by a DFA with n states over an alphabet Σ with |Σ| ≥ 2. Then there is a DFA recognizing $E(L, d_s, k)$ with at most $(k - n) + 2^{n+1} - 2$ states.
Proof. Let $A = (Q, \Sigma, \delta, q_0, F)$ with |Q| = n. We follow the construction given in the proof of Proposition 2 to obtain the DFA $A' = (Q', \Sigma, \delta', q'_0, F')$ that recognizes the neighbourhood $E(L(A), d_s, k)$ with k > n. We note that $\psi_A(q) \le n$ for all q ∈ Q and thus, by the definition of the transition function, we have $\delta'(q'_0, w) = (i, Q)$ for $0 \le i \le k - n$ and all words w of length i. This gives us k − n states. Then on the following n steps, we proceed as in the rest of Proposition 2. This suggests that there are at most $\frac{|\Sigma|^n - 1}{|\Sigma| - 1}$ states. However, in this case, there are far fewer states than this.
To consider how many states there are, we observe that the above bound requires that each word of length $i > k - n$ reaches a different state (i, P), giving us a total of $|\Sigma|^{i-(k-n)}$ states for each i. Then we must consider how many different subsets $P \subseteq Q$ are reachable. Recall that by definition, all states q with $\psi_A(q) \le k - i$ are contained in P for (i, P). Thus, on step i, for two states (i, P) and $(i, P')$, both P and $P'$ contain the subset $\{q \in Q \mid \psi_A(q) \le k - i\}$. Then if P and $P'$ are different, they must contain different subsets of the set $\{q \in Q \mid \psi_A(q) > k - i\}$. Let j be the size of the set $\{q \in Q \mid \psi_A(q) > k - i\}$. Then in order for each word of length i to reach a different state, we must have $|\Sigma|^{i-(k-n)} \le 2^j$ different subsets. This means that we must have at least $(i - (k - n)) \cdot \log_2 |\Sigma|$ states q with $\psi_A(q) > k - i$ on step i of a computation on $A'$. In other words, for each $1 \le i \le \max_{q \in Q} \psi_A(q)$, there are at least $\log_2 |\Sigma|$ states q with $\psi_A(q) = i$. However, since k > n, the number of states of $A'$ is further restricted by this condition. Let $\ell = \max_{q \in Q} \psi_A(q)$. Then there are at most
$$k - \frac{n}{\log_2 |\Sigma|} + \frac{|\Sigma|^{\frac{n}{\log_2 |\Sigma|}} - 1}{|\Sigma| - 1}$$
reachable states for words of length up to k. We observe that this is maximized when |Σ| = 2. That is, for any alphabet of size at least 2, the maximum is achieved when we have for each i exactly one state q such that $\psi_A(q) = i$. In addition to the k − n states counted above, this gives us a maximum of $2^n - 1$ reachable states of the form (i, P) for i < k.
After the kth step of computation, there are $2^n - 1$ reachable states of the form (k, P) as usual. This gives us a total of at most $(k - n) + 2^{n+1} - 2$ states.
We will show that the bound from Proposition 3 is reachable for a family of n state DFAs over an alphabet of size n + 1.
Lemma 3. Let k > n > 0. Then there exists a DFA $B_n$ with n states over an alphabet of size n + 1 such that

Proof. Let $B_n = (Q_n, \Sigma_n, \delta_n, 0, \{0\})$, shown in Figure 2, with $\Sigma_n = \{a_0, a_1, \ldots, a_n\}$ and the transition function defined by $\delta_n(i, a_j) = i + 1 \bmod n$ for all $0 \le i \le n - 1$, $0 \le j \le n$, and $i \ne j$.

Proposition 3 and Lemma 3 can then be summarized in the following theorem.
Theorem 2. Let k > n and let L be a regular language recognized by an n-state DFA over an alphabet Σ with |Σ| ≥ 2. Then a DFA recognizing $E(L, d_s, k)$ requires at most $(k - n) + 2^{n+1} - 2$ states. There is a family of DFAs with n states over an alphabet of size n + 1 which reaches this bound.
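To get a feeling for the two upper bounds, the following Python sketch evaluates the formulas of Theorem 1 and Theorem 2 and compares them with the size (k + 1) · 2^n of the full product state set used in the construction. The function names are hypothetical and the sample parameters are chosen here only for illustration.

```python
def bound_small_radius(sigma, k, n):
    """Upper bound of Theorem 1: (|Sigma|^k - 1)/(|Sigma| - 1) + 2^n - 1, for n > k."""
    if not (sigma >= 2 and n > k >= 0):
        raise ValueError("requires |Sigma| >= 2 and n > k >= 0")
    # The geometric sum is an integer, so integer division is exact.
    return (sigma**k - 1) // (sigma - 1) + 2**n - 1


def bound_large_radius(k, n):
    """Upper bound of Theorem 2: (k - n) + 2^(n+1) - 2, for k > n."""
    if not (k > n > 0):
        raise ValueError("requires k > n > 0")
    return (k - n) + 2 ** (n + 1) - 2


def full_state_set_size(k, n):
    """Size of {0, ..., k} x 2^Q before discarding unreachable states."""
    return (k + 1) * 2**n


print(bound_small_radius(2, 3, 6), full_state_set_size(3, 6))
print(bound_large_radius(8, 3), full_state_set_size(8, 3))
```

For a binary alphabet with k = 3 and n = 6 the first bound evaluates to 7 + 63 = 70 against 256 product states, and for k = 8 and n = 3 the second bound evaluates to 5 + 14 = 19 against 72.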

State Complexity of Subword Distance
Now, we give an upper bound on the deterministic state complexity of subword neighbourhoods by giving a construction for a DFA for the neighbourhood of radius k with respect to the subword distance d f . In the construction we again assume that the cardinality of the alphabet is at least two. For unary alphabets, the subword distance coincides with the suffix distance and a tight bound is obtained from Lemma 1.
The bound of Proposition 4 is significantly better than the bound implied by nondeterministic state complexity [14] (in Proposition 1) for a fixed alphabet Σ. However, we do not know whether the bound is the best possible.

Here, we consider the state complexity of neighbourhoods with respect to the suffix distance of languages which belong to subregular language classes. First, we consider neighbourhoods of finite languages.
Proposition 5. Let n > k ≥ 0 and L be a finite language recognized by a DFA with n states over a binary alphabet. Then there is a DFA recognizing $E(L, d_s, k)$ with at most $2^k + k \cdot 2^{\frac{n}{2}} - 1$ states.
Proof. We use the construction for $A'$ from the proof of Proposition 2. Observe that, as is the case for general regular languages, not all $(k + 1) \cdot 2^n$ states that are defined are reachable. Recall that the states of $A'$ are pairs (i, P) where i is a counter from 0 to k and P is a subset of states of A, and that a word w reaches a state (i, P) if and only if there exists a word $x \in \Sigma^*$ such that $w = w'z$ and $x = x'z$ where $|w'| \le i$, $|x'| \le k - i$ and $\delta(q_0, x) \in P$. We also note that for i < k, a state (i, P) with $P \subseteq Q$ can be reached only on a word of length exactly i, and over a binary alphabet there are $2^i$ such words. This gives us at most $\sum_{i<k} 2^i = 2^k - 1$ reachable states of the form (i, P) for i < k.
It remains to show how many states of the form (k, P ) with P ⊆ Q are reachable. Since P is a subset of the set of states of A, we would like to know how many different subsets P exist such that (k, P ) is reachable. Since A recognizes a finite language, there exists at least one state q of A with ψ A (q) = i that is reachable on some string of length i and is not reachable on any string of length j > i.
Recall that A recognizes a finite language and in each state (k, P) of $A'$, the set P is a subset of states of A. First, we observe that the above property does not hold for subsets $P \subseteq Q$ in states of the form (i, P) with i < k. To see this, we consider some i and observe that, by definition, every state q ∈ Q with $\psi_A(q) \le k - i$ is in every subset P with (i, P) reachable, for all i < k. Hence we can narrow our focus to those states of the form (k, P).
Let (k, T ) be a state that is reached on a word w of length k. Since A is deterministic, there are up to 2 k possible such states.
Let $R_i \subseteq Q$ denote the set of states of A that are not contained in any set $P \subseteq Q$, where (k, P) is reachable on a word of length greater than k + i. In other words, $R_i$ is the set of states of A which become unreachable in $A'$ on a word of length i. We note that $R_i$ must contain at least one element, since A recognizes a finite language.
We write $T = R \cup S$, where $R \subseteq \bigcup_{0 \le i \le k} R_i$ and $S \subseteq Q \setminus R$. We have $|Q \setminus R| \le n - k$, since k < n. From this, we can see that to maximize the number of states that are reachable, each $R_i$ must contain at most one element. This gives us a total of $2^{n-k}$ possible subsets S.
Then for each set $T = R \cup S$ that is reachable on a word of length k, there is a state $T_i = (R \setminus \bigcup_{j=0}^{i} R_j) \cup S$ that is reachable on a word of length k + i for 1 ≤ i ≤ k. Since each $R_i$ has one element, each subset S is contained in up to k different subsets of Q that are reachable in $A'$. This gives $k \cdot 2^{\frac{n}{2}}$ possible subsets that can be reached on each string of length greater than k.
The statement of Proposition 5 assumes that the alphabet is binary. A tight bound is known from Lemma 1 also for finite languages.
Lemma 4. Let $k = \frac{n}{2}$. Then there exists a DFA $C_n$ with n states over a binary alphabet recognizing a finite language such that

Proof. Let $C_n = (Q_n, \{a, b\}, \delta_n, 0, \{n - 1\})$, shown in Figure 3. We construct the DFA $C'_n$ recognizing the neighbourhood by using the construction from Proposition 2.

We can summarize the results of Proposition 5 and Lemma 4 as follows:

Theorem 3. Let L be a finite language recognized by an n-state DFA over an alphabet Σ with |Σ| ≥ 2 and k ≤ n. Then a DFA recognizing $E(L, d_s, k)$ requires at most $\frac{|\Sigma|^k - 1}{|\Sigma| - 1} + k \cdot 2^{\frac{n}{2}} - 1$ states. There is a family of DFAs with n states over a binary alphabet which reaches this bound when $k = \frac{n}{2}$.

Now, we show that if k > n, the lower bound coincides with the upper bound for regular languages.
Theorem 4. Let L be a finite language recognized by an n-state DFA over an alphabet Σ with |Σ| ≥ 2 and k > n. Then a DFA recognizing $E(L, d_s, k)$ requires at most $(k - n) + 2^{n+1} - 2$ states. There is a family of DFAs with n states over an alphabet of size n which reaches this bound.
Proof. Let $D_n = (Q_n, \Sigma_n, \delta_n, 0, \{0\})$, shown in Figure 4, with $\Sigma_n = \{a_0, a_1, \ldots, a_{n-1}\}$ and the transition function defined by $\delta_n(i, a_j) = i + 1$ for all $0 \le i < n - 1$, $0 \le j \le n - 1$, and $i \ne j$.

Next, we consider the class of suffix-closed languages [1]. A language L is suffix-closed if wx ∈ L implies x ∈ L. Here we restrict attention to suffix-closed languages that are regular. We will give a tight bound on the size of the DFA for neighbourhoods of suffix-closed languages with respect to the suffix distance.
Theorem 5. Let L be a suffix-closed language recognized by an n-state DFA. Then a DFA recognizing E(L, d s , k) requires at most n + k + 1 states. For each n ∈ N there exists an n-state DFA E n recognizing a suffix-closed language such that the state complexity of E(L(E n ), d s , k) is n + k + 1 for all k ∈ N.
The DFA E n is shown in Figure 5.
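The definition of a suffix-closed language (wx ∈ L implies x ∈ L) can be checked directly on a finite set of strings. The following Python sketch is only an illustration for finite samples; the function names are hypothetical, and the sketch does not apply to infinite regular languages such as L(E_n).

```python
def is_suffix_closed(language):
    """Check that every suffix of every word of a finite language is in the language."""
    words = set(language)
    return all(w[i:] in words for w in words for i in range(len(w) + 1))


def suffix_closure(language):
    """Smallest suffix-closed superset of a finite language."""
    return {w[i:] for w in language for i in range(len(w) + 1)}


print(is_suffix_closed({"", "b", "ab"}))
print(is_suffix_closed({"ab"}))
print(sorted(suffix_closure({"ab"})))
```

Note that a nonempty suffix-closed language always contains the empty string, since ε is a suffix of every word.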

Conclusion
The state complexity of radius k prefix distance neighbourhoods of an n state DFA language depends linearly on n and on k [15]. As we have seen, the corresponding bounds for the suffix and the subword distance neighbourhoods depend exponentially on n and k and, furthermore, coming up with matching lower bounds is considerably more involved. For suffix distance neighbourhoods where the radius k equals, roughly, half of the number of states n, we have given a matching lower bound construction based on a binary alphabet. However (and perhaps curiously), the construction does not seem to extend, at least not directly, for other values of the radius when k < n.