Unary Words Have the Smallest Levenshtein k-Neighbourhoods

The edit distance (a.k.a. the Levenshtein distance) between two words is defined as the minimum number of insertions, deletions or substitutions of letters needed to transform one word into another. The Levenshtein $k$-neighbourhood of a word $w$ is the set of words that are at edit distance at most $k$ from $w$. This is perhaps the most important concept underlying BLAST, a widely-used tool for comparing biological sequences. A natural combinatorial question is to ask for upper and lower bounds on the size of this set. The answer to this question has important algorithmic implications as well. Myers notes that "such bounds would give a tighter characterisation of the running time of the algorithm" behind BLAST. We show that the size of the Levenshtein $k$-neighbourhood of any word of length $n$ over an arbitrary alphabet is not smaller than the size of the Levenshtein $k$-neighbourhood of a unary word of length $n$, thus providing a tight lower bound on the size of the Levenshtein $k$-neighbourhood. We remark that this result was posed as a conjecture by Dufresne at WCTA 2019.


Introduction
BLAST (Basic Local Alignment Search Tool) is a widely-used tool for comparing biological sequences such as the amino-acid sequences of proteins or the nucleotides of DNA or RNA sequences. A BLAST search compares a subject sequence, called a query, against a database of sequences to identify the ones that resemble the query sequence above a certain threshold. The paper describing BLAST [1] is one of the most highly cited papers in science. According to Myers [8], the most important algorithmic idea underlying BLAST is that of searching for exact matches to words in the neighbourhood of fixed-length fragments selected from the query sequence. We call these fragments words. Let $\delta$ be a sequence comparison measure that, given two words $v$ and $w$, returns a numeric measure $\delta(v, w)$ of the degree to which the two words differ. Given a word $w$, the $k$-neighbourhood of $w$ with respect to $\delta$ is the set of all words whose best alignment with $w$ under measure $\delta$ is no more than $k$. The most widely-used case is where $\delta$ is the edit distance (a.k.a. the Levenshtein distance), which is the minimum number of insertions, deletions or substitutions of letters needed to transform one word into another [6]. When $\delta$ is the Levenshtein distance, we call this neighbourhood the Levenshtein $k$-neighbourhood of $w$ and we denote it by $N_{k,\Sigma}(w)$, where $\Sigma$ is the considered alphabet. We provide an example below.
From an algorithmic point of view, the most natural question is how we can generate the Levenshtein $k$-neighbourhood in time that is proportional to the size of the neighbourhood. In fact, this is the core computational task underlying BLAST. Myers described an algorithm for generating a condensed version of this neighbourhood efficiently (see [8] for more details). Another natural question is how we can compute the size of the Levenshtein $k$-neighbourhood. Touzet gave an algorithm for computing $|N_{k,\Sigma}(w)|$ for a word $w$ of length $n$ over an alphabet $\Sigma$ that works in time linear in $n$ but exponential in $k$ [11]. This algorithm is based on a variant of the so-called Universal Levenshtein Automaton [7], which in turn is based on the Levenshtein automaton of $w$: the non-deterministic finite automaton recognising all words which are at Levenshtein distance at most $k$ from $w$. For other related works, see [2,3,9,10].
From a combinatorial point of view, the most natural question asks for upper and lower bounds on the size of the Levenshtein $k$-neighbourhood. Myers provided recurrences for counting the number of distinct sequences of $k$ edit operations that one could perform on a given word and notes that "such bounds would give a tighter characterisation of the running time of the algorithm" behind BLAST [8]. A word is called unary if it consists of repetitions of a single element of $\Sigma$. The main result of this work can be formally stated as follows.
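Theorem 2. Let $a \in \Sigma$ be an arbitrary element of alphabet $\Sigma$. For any positive integers $n$ and $k$, we have $|N_{k,\Sigma}(a^n)| \le |N_{k,\Sigma}(w)|$ for every word $w \in \Sigma^n$, and the inequality is strict if $w$ is non-unary.
The course of our proof is to construct, for every word $u \in N_{k,\Sigma}(a^n)$, a distinct word $u' \in N_{k,\Sigma}(w)$ that can be obtained by a similar sequence of edit operations. In particular, we show that, for any $n$, $k$, and $\Sigma$,
$$\sum_{i=0}^{k} \sum_{j=i-k}^{k} \binom{n+j}{i} (\sigma-1)^i$$
is the size of the smallest Levenshtein $k$-neighbourhood of a word of length $n$, where $a \in \Sigma$, $\sigma = |\Sigma|$, and $\binom{m}{i} = 0$ whenever $m < i$. We remark that our main result was posed as a conjecture by Dufresne in [5].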

Organisation of the Paper. The basic definitions and notation used throughout are introduced in Section 2. In Section 3, we present the main result of this work for binary alphabets, apart from the strictness of the inequality. We then generalise this result to arbitrary alphabets in Section 4 and prove the strictness of the inequality directly in this more general case. We conclude this paper in Section 5 with some final remarks.

Preliminaries
An alphabet $\Sigma$ is a finite non-empty set of size $\sigma = |\Sigma|$ whose elements are called letters. A word over $\Sigma$ is a sequence of letters from $\Sigma$. We call a word $w$ unary if it consists of repetitions of a single letter of $\Sigma$ and non-unary if it contains at least two distinct letters of $\Sigma$. $\Sigma^n$ denotes the set of words of length $n$ over $\Sigma$ and $\Sigma^*$ denotes the set of finite words over $\Sigma$. For a word $w$, by $|w|$ we denote its length, and by $w[i]$, for $i = 1, \ldots, |w|$, we denote its subsequent letters. The word of length 0 is the empty word, which we denote by $\varepsilon$.
We consider the following elementary edit operations: insertion, deletion, and substitution. For two words $x$ and $y$, we define the edit distance (a.k.a. the Levenshtein distance) as the minimum number of edit operations that transform $x$ to $y$, and we denote it by $\mathrm{Lev}(x, y)$. The function $\mathrm{Lev}$ is then a metric on $\Sigma^*$ [4].
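As an illustration of this definition, the following minimal sketch computes $\mathrm{Lev}(x, y)$ with the textbook row-by-row dynamic programme. The function name `lev` is our own choice for this sketch and is not part of the constructions used later.

```python
def lev(x: str, y: str) -> int:
    """Edit distance between x and y (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance between x[:i] and the empty word
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute cx by cy, or match
            ))
        prev = curr
    return prev[-1]
```

For instance, `lev("aaa", "abba")` counts one substitution and one insertion.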
Given a word $w$, an alphabet $\Sigma$, and a positive integer $k$, we define $N_{k,\Sigma}(w)$ as the set of all words in $\Sigma^*$ that are at Levenshtein distance at most $k$ from $w$. Formally, we have that
$$N_{k,\Sigma}(w) = \{u \in \Sigma^* : \mathrm{Lev}(u, w) \le k\}.$$
We call $N_{k,\Sigma}(w)$ the Levenshtein $(k, \Sigma)$-neighbourhood of $w$.
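For small parameters the neighbourhood can be enumerated directly. The sketch below is a brute-force illustration only (it is not the algorithm of Myers or of Touzet): it applies all single edit operations $k$ times, starting from $w$. The function name `neighbourhood` and the representation of $\Sigma$ as a string of letters are assumptions of this sketch.

```python
def neighbourhood(w: str, k: int, alphabet: str) -> set[str]:
    """All words over `alphabet` at Levenshtein distance at most k from w."""
    seen = {w}
    frontier = {w}
    for _ in range(k):
        nxt = set()
        for v in frontier:
            for i in range(len(v) + 1):
                for c in alphabet:
                    nxt.add(v[:i] + c + v[i:])          # insertion before position i
                if i < len(v):
                    nxt.add(v[:i] + v[i + 1:])          # deletion of v[i]
                    for c in alphabet:
                        nxt.add(v[:i] + c + v[i + 1:])  # substitution of v[i]
        frontier = nxt - seen  # words first reached in this round
        seen |= frontier
    return seen
```

Comparing `len(neighbourhood("aaa", 2, "ab"))` with `len(neighbourhood("aab", 2, "ab"))` gives a quick sanity check of the main result on a tiny instance.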
For any binary alphabet $\Sigma$, we define the complement of a word $w$ over $\Sigma$ as the word obtained by substituting $w[i]$ with the letter $a \ne w[i]$, with $a \in \Sigma$, for all $i = 1, \ldots, |w|$. We denote the complement of $w$ by $\overline{w}$ and we call a single such substitution operation a flip.

Main Result for Binary Alphabets
In this section we consider $\Sigma = \{a, b\}$, write $N_k(w)$ and refer to the Levenshtein $k$-neighbourhood for simplicity. We present the main result but do not show the strictness of the inequality. We generalise this result to an arbitrary alphabet $\Sigma$ and show the strictness in Section 4. Let $N^j_k(w) = \{u \in N_k(w) : |u| = j\}$. Further let $\#_a(u)$ denote the number of $a$'s in word $u$. We illustrate the main ideas of our approach on a simple case and first consider the words of the neighbourhood that are of length at most $n$.
Proof (of Lemma 5). We use the characterisation of Observation 3. Let us argue why $|N^j_k(w)|$, for any word $w$ of length $n$, is at least as big as $|N^j_k(a^n)|$. Consider the following procedure applied on word $w$: the deletion of the first $n - j$ letters of $w$ followed by the flipping of at most $k - (n - j)$ letters. Clearly all words obtained by this procedure are distinct. This procedure thus gives us a subset of $N^j_k(w)$ of size $|N^j_k(a^n)|$.
Let us now consider the case of the words of the neighbourhood that have length greater than $n$. In particular, we denote the neighbourhood $\bigcup_{j>n} N^j_k(w)$ of these words by $N^{>n}_k(w)$ and thus $N_k(w) = N^{\le n}_k(w) \cup N^{>n}_k(w)$, where $N^{\le n}_k(w)$ is defined analogously. For a word $u \in N^{>n}_k(a^n)$, we distinguish between two cases depending on the number of $a$'s in $u$:
Case 1: $\#_a(u) \ge n$.
Case 2: $\#_a(u) < n$.
The following observation states that in each case the word $u \in N^{>n}_k(a^n)$ can be obtained by a restricted sequence of edit operations. Intuitively, in Case 1, we insert the relevant number of $a$'s to reach $\#_a(u)$ because we have fewer $a$'s than needed, and then insert the relevant number of $b$'s. In Case 2, we flip the relevant number of $a$'s to go down to $\#_a(u)$ because we have more $a$'s than what is needed, and then insert the remaining $b$'s to the right of the rightmost flip.

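Observation 6. (1) Any $u \in N^{>n}_k(a^n)$ with $\#_a(u) \ge n$ can be obtained from $a^n$ by the following sequence of at most $k$ edit operations: $\#_a(u) - n$ insertions of $a$'s in the beginning of $a^n$ followed by a sequence of $|u| - \#_a(u)$ insertions of $b$'s.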
Proof Strategy. Let $u$ be an arbitrary element of $N_k(a^n)$, for some positive integers $n$ and $k$. We define a function $f_u : \Sigma^n \to \Sigma^*$, such that:
1. $f_u(w) \in N_k(w)$, for all $w \in \Sigma^n$; and
2. Given $w$ and $f_u(w)$ we can retrieve $u$.
Such an $f_u$ directly yields the desired bound (apart from the strictness) since it implies that for any word $w$ we cannot have $|N_k(w)| < |N_k(a^n)|$. Note that for $|u| \le n$ we already used the same idea to lower bound $|N^{\le n}_k(w)|$ by $|N^{\le n}_k(a^n)|$. Indeed, we implicitly defined $f_u(w)$ for $u \in N^{\le n}_k(a^n)$ that consists in removing the first $n - |u|$ letters of $w$, resulting in a word $w'$, and then flipping the letters of $w'$ at positions $j$ where $u[j] = b$ (see Lemma 5).
Table 1. Let $n = 3$, $w = aab$ and $k = 2$. The table presents an assignment $f_u$ from every word in $N_2(a^3)$ to a different word in $N_2(aab)$ that is used in the proof of the main result. Note, however, that as per Theorem 2 there is at least one more word in $N_2(aab)$, namely, the word $a$.
See Figure 1 for an illustration of operation ins-diff$(w, j, t)$, which inserts after position $j$ of $w$ a block of $t$ letters, each different from $w[j]$. In what follows, we assume that all insertions are with respect to the original indices of $w$. This can be achieved, for example, by performing insertions in a right-to-left manner when they are given as an ordered batch. Before providing the definition of $f_u$, we define two auxiliary operators $g_x$ and $h_x$, for a word $x$.
Let us start with $g_x$. For any word $x$ starting with $a$, we define an operator $g_x$ that can be applied to any word $y$ such that $|y| = \#_a(x)$. Intuitively, to construct word $g_x(y)$, the letters of $y$ get in $g_x(y)$ the positions that letters $a$ possess in $x$, and for every maximal block of $b$'s in between in $x$, i.e. a block consisting of only $b$'s that is neither preceded nor succeeded by a $b$, we apply an ins-diff operation on $y$. Specifically, we define $g_x(y) = v$ as follows: starting with $v = y$, for each maximal block $x[r \,..\, r + t - 1]$ of $b$'s, with $m = \#_a(x[1 \,..\, r - 1])$, we perform ins-diff$(v, m, t)$.
For any word $x$, we also define an operator $h_x$ that takes as input a word $y$ of length $|x|$ and flips its letters on positions in which $x$ has $b$'s.
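The two operators can be sketched in a few lines for the binary alphabet $\{a, b\}$. This is only an illustrative sketch: the Python names `flip`, `ins_diff`, `g` and `h` are ours, positions passed to `ins_diff` are 1-based as in the text, and the maximal blocks of $b$'s are processed from right to left so that every insertion refers to the original indices of $y$.

```python
import re

def flip(c: str) -> str:
    """Complement of a single letter over {a, b}."""
    return "b" if c == "a" else "a"

def ins_diff(w: str, j: int, t: int) -> str:
    """Insert t letters, each different from w[j], after 1-based position j."""
    return w[:j] + flip(w[j - 1]) * t + w[j:]

def g(x: str, y: str) -> str:
    """Operator g_x applied to y; requires x to start with a and |y| = #_a(x)."""
    assert x.startswith("a") and len(y) == x.count("a")
    v = y
    # Right to left: insertions further right never shift positions <= m.
    for block in reversed(list(re.finditer(r"b+", x))):
        m = x[:block.start()].count("a")  # number of a's before the block
        v = ins_diff(v, m, block.end() - block.start())
    return v

def h(x: str, y: str) -> str:
    """Operator h_x applied to y: flip y wherever x has a b."""
    assert len(x) == len(y)
    return "".join(flip(cy) if cx == "b" else cy for cx, cy in zip(x, y))
```

With $x = aababbababaab$ and $y = aaabbaa$, calling `g(x, y)` reproduces the word $aababbbabaaab$ of Example 8.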
We are now in a position to define $f_u(w)$, for all $w \in \Sigma^n$. Recall that $u \in N^{>n}_k(a^n)$. We have the following two cases for $f_u$.
Case 1: $\#_a(u) \ge n$. Let us split $u$ into its shortest suffix $s$ that contains $n$ $a$'s and the remaining (possibly empty) prefix $p$. We then define $f_u(w)$ for words $u$ and $w$ in this case as follows (see Example 9):
$$f_u(w) = p \cdot g_s(w).$$
Example 9. Let $w = aaabbaa$ and $k = 4$; note that $n = 7$. If $u = abaaaaaabaa$, then $\#_a(u) = 9 \ge n$, so we are in Case 1. We have $p = aba$ and $s = aaaaabaa$ is the shortest suffix that contains $n = 7$ occurrences of the letter $a$. Then $f_u(w)$ is constructed by concatenating $p$ with the word $g_s(w)$, which gives $f_u(w) = abaaaabbaaa$.
Case 2: $\#_a(u) < n$. In this case we split $u$ into its shortest suffix $s'$ that contains $|u| - n$ $b$'s and the remaining prefix $p'$. Note that $p'$ is always non-empty. We then define $f_u(w)$ for words $u$ and $w$ in this case as follows (see Example 10):
$$f_u(w) = h_{p'}(w[1 \,..\, |p'|]) \cdot g_x(w)[|p'| + 1 \,..\, |u|], \quad \text{where } x = a^{|p'|} s'.$$
In particular, $\#_a(x) = n$ and so applying $g_x$ is well-defined.
In Table 1 we provide a complete example of applying $f_u$, for each $u \in N_k(a^3)$, to $w = aab$.
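The two cases can be put together in a short, self-contained sketch for words $u$ with $|u| > n$ over $\{a, b\}$. All Python names (`f`, `g`, `h`, `flip`, `suffix_start`) are ours; the sketch assumes that Case 2 concatenates $h_{p'}$ applied to the first $|p'|$ letters of $w$ with the final $|s'|$ letters of $g_x(w)$ for $x = a^{|p'|}s'$, as described in Example 10.

```python
import re

def flip(c: str) -> str:
    return "b" if c == "a" else "a"

def g(x: str, y: str) -> str:
    """g_x(y): keep y on the a-positions of x, insert differing letters for b-blocks."""
    assert x.startswith("a") and len(y) == x.count("a")
    v = y
    for block in reversed(list(re.finditer(r"b+", x))):  # right to left
        m = x[:block.start()].count("a")
        v = v[:m] + flip(v[m - 1]) * (block.end() - block.start()) + v[m:]
    return v

def h(x: str, y: str) -> str:
    """h_x(y): flip y wherever x has a b."""
    return "".join(flip(cy) if cx == "b" else cy for cx, cy in zip(x, y))

def suffix_start(u: str, letter: str, count: int) -> int:
    """0-based start of the shortest suffix of u with `count` occurrences of `letter`."""
    i = len(u)
    while count > 0:
        i -= 1
        if u[i] == letter:
            count -= 1
    return i

def f(u: str, w: str) -> str:
    """f_u(w) for |u| > |w| over the alphabet {a, b}."""
    n = len(w)
    assert len(u) > n
    if u.count("a") >= n:                       # Case 1
        i = suffix_start(u, "a", n)
        p, s = u[:i], u[i:]
        return p + g(s, w)
    i = suffix_start(u, "b", len(u) - n)        # Case 2
    p, s = u[:i], u[i:]
    x = "a" * len(p) + s                        # #_a(x) = n
    return h(p, w[:len(p)]) + g(x, w)[len(p):]  # final |s'| letters of g_x(w)
```

On the inputs of Example 9 the sketch returns $abaaaabbaaa$, and `f(u, "a" * n)` returns `u` itself, matching the property $f_u(a^n) = u$ noted in the final remarks.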
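Let us now show the following fact.
Fact 11. For any $u \in N^{>n}_k(a^n)$ and any $w \in \Sigma^n$, we have $|f_u(w)| = |u|$ and $f_u(w) \in N_k(w)$.
Proof. The first part can be readily verified. As for the second part, one can obtain $f_u(w)$ from $w$ by the same sequence of edit operations (types and positions) that yield $u$ from $a^n$, according to Observation 6.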
We next prove the main lemma on which our main result relies. Pseudocode implementing the algorithm used in the proof of this lemma can be found as Algorithm 1. Note that by considering $a$'s as 0's and $b$'s as 1's, we have that $h_x(y) \otimes y = x$, where $\otimes$ denotes the XOR operation. Consider, for instance, $x = aabab$, $y = aaabb$ and $h_x(y) = aabba$.

Algorithm 1 (only the input, the output and fragments of the pseudocode are recoverable).
Input: Two words $w$ and $f_u(w)$.
Output: A word $u$.
    ...
    prepend($a$, $u$)
    ...
    prepend($b$, $u$)
    ...
15: return $u$

Lemma 12. Let $u$ be an arbitrary element of $N^{>n}_k(a^n)$, for some positive integers $n$ and $k$. Given $w$ and $f_u(w)$, for any $w \in \Sigma^n$, we can retrieve $u$.

Proof. Let us note that, as we only had insertions and flips of letters, conceptually for each position in $w$ there is a corresponding position in $f_u(w)$. The correspondence is given by ignoring the letters of $f_u(w)$ that were inserted by operation ins-diff. Our aim is to find all such pairs of corresponding positions in order to retrieve $u$.
To this end, we will sweep both $w$ and $f_u(w)$ from right to left and prepend letters to an initially empty word $u'$, which in the end will be equal to $u \in N^{>n}_k(a^n)$. We will maintain a position in each of the words, $k_1$ initiated as $|w|$ and $k_2$ initiated as $|f_u(w)|$.
Intuitively, we are first processing the part of $f_u(w)$ that comes from an application of operator $g$ to $w$ or to a suffix of $w$, depending on which case we are in. While processing this part, we maintain the invariant that the letter of $f_u(w)$ at position $k_2$ either corresponds to $w[k_1]$ or was inserted after it by ins-diff (and hence differs from $w[k_1]$), relying on the definition of $g$. Then we can apply the following procedure repeatedly; see Lines 4-10 in Algorithm 1. We compute the rightmost occurrence of $w[k_1]$ in $f_u(w)[1 \,..\, k_2]$; let it be at position $j$. We have that $u[j \,..\, k_2] = ab^{k_2 - j}$. We prepend $ab^{k_2 - j}$ to $u'$, decrement $k_1$ and set $k_2$ to $j - 1$; see Example 13.
Let us now focus on the stopping condition of this procedure, i.e. the point where the remaining prefix of $f_u(w)$ does not originate from an application of $g$. If we are in Case 1, while $k_1 > 0$ we must have that $k_1 \le k_2$. If at some point $k_1$ reaches 0, i.e. we have consumed all of $w$, then we are in Case 1. Thus, $f_u(w)[1 \,..\, k_2] = p$ and we prepend this prefix to $u'$; see Lines 11-12 in Algorithm 1.
Else, if at some point $k_1 = k_2$, i.e. we are left with equal-length prefixes of $w$ and $f_u(w)$, then we are either in Case 2 or in Case 1 with $p = \varepsilon$. By using the XOR operation $w[1 \,..\, k_1] \otimes f_u(w)[1 \,..\, k_2]$, in the former case we retrieve $p'$ and in the latter case we get $a^{k_1}$, which is the missing prefix of $s$. In either case we prepend the result to $u'$; see Lines 13-14 in Algorithm 1.
Example 13. Let $u = abba \in N_2(a^3)$, $w = aab$, and $f_u(w) = abbb$. We write $u'$ for the word built so far. We have $k_1 = 3$ and $k_2 = 4$. At the first iteration of the while loop in Algorithm 1 we have $w[3] = f_u(w)[4] = b$ and so we set $k_1 = 2$, $u' = a$ and $k_2 = 3$. At the second iteration we have $w[2] = a \ne f_u(w)[3] = b$ and so we get $u' = ba$ and $k_2 = 2$. At this point we exit the while loop (because $k_1 = k_2$), and since we are in Case 2 we prepend $w[1 \,..\, 2] \otimes f_u(w)[1 \,..\, 2] = aa \otimes ab = ab$ to $u' = ba$, which gives us $u' = abba$. At this point we have retrieved $u = abba \in N_2(a^3)$.
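The retrieval procedure traced in Example 13 can be sketched as follows. This is a letter-by-letter rendering that we wrote to match the trace of the example and the stopping conditions described in the proof; it is not a transcription of Algorithm 1, and the name `retrieve` is ours. It assumes the binary alphabet $\{a, b\}$ and $|f_u(w)| > |w|$.

```python
def retrieve(w: str, fw: str) -> str:
    """Recover u from w and fw = f_u(w), for |u| > |w| over {a, b}."""
    assert len(fw) > len(w)
    k1, k2 = len(w), len(fw)  # 1-based positions swept right to left
    tail = []                 # letters of u found so far, rightmost first
    while k1 > 0 and k1 != k2:
        if w[k1 - 1] == fw[k2 - 1]:
            tail.append("a")  # fw[k2] is the copy of w[k1]
            k1 -= 1
        else:
            tail.append("b")  # fw[k2] was inserted after w[k1]
        k2 -= 1
    if k1 == 0:
        # All of w is consumed: the rest of fw is the prefix p of Case 1.
        head = fw[:k2]
    else:
        # Equal-length prefixes remain: XOR them position by position.
        head = "".join("a" if x == y else "b" for x, y in zip(w[:k1], fw[:k2]))
    return head + "".join(reversed(tail))
```

On the input of Example 13, `retrieve("aab", "abbb")` goes through exactly the two loop iterations described above before the XOR step.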
By combining Lemmas 5 and 12 and Fact 11 we get $|N^j_k(a^n)| \le |N^j_k(w)|$ for every $j$. This implies our main result for binary alphabets, apart from the strictness of the inequality. We leave the latter for the next section.

Generalisation to Arbitrary Alphabets and Strictness
For an arbitrary alphabet $\Sigma = \{0, \ldots, \sigma - 1\}$ we only need to make minor adjustments in the definition of function $f_u$ and in the algorithm for retrieving $u$ from $w$ and $f_u(w)$. Specifically, we replace the XOR operation by addition/subtraction modulo $\sigma$. Intuitively, one can think of 0's as $a$'s in the binary case, and of non-0's as $b$'s in the binary case.
Definition of $f_u$. Let $u$ and $v$ be two words of equal length. Let us denote by $u \oplus v$ the position-wise sum of words $u$ and $v$ modulo $\sigma$, e.g. for $\sigma = 4$ we have $1312 \oplus 1112 = 2020$. We analogously denote by $u \ominus v$ the position-wise subtraction of words $u$ and $v$ modulo $\sigma$.
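The two position-wise operations are straightforward to sketch. The function names `oplus` and `ominus` are ours, and the sketch assumes $\sigma \le 10$ so that each letter of $\{0, \ldots, \sigma - 1\}$ can be written as a single decimal digit.

```python
def oplus(u: str, v: str, sigma: int) -> str:
    """Position-wise sum of two equal-length digit words modulo sigma."""
    assert len(u) == len(v) and sigma <= 10
    return "".join(str((int(a) + int(b)) % sigma) for a, b in zip(u, v))

def ominus(u: str, v: str, sigma: int) -> str:
    """Position-wise difference of two equal-length digit words modulo sigma."""
    assert len(u) == len(v) and sigma <= 10
    return "".join(str((int(a) - int(b)) % sigma) for a, b in zip(u, v))
```

Subtraction undoes addition, which is what allows the retrieval step to recover the added word: `ominus(oplus(u, v, sigma), v, sigma)` returns `u`.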

We adapt operation ins-diff as follows, with $z$ being a word containing only positive letters:
ins-diff$(w, j, z)$ = insert word $z \oplus (w[j])^{|z|}$ after the letter at position $j$ in word $w$.
Operator $g_x$ can now be applied to any word $y$ such that $|y| = \#_0(x)$. $g_x$ considers maximal blocks of letters in $x$ not containing 0's instead of maximal blocks of $b$'s (in the binary case). For such a block $z$, it performs an ins-diff$(y, j, z)$ operation on a word $y$.
The definition of $f_u$ becomes as follows.
Case 1: $\#_0(u) \ge n$. We split the word $u$ into $p$ and $s$ exactly as in the binary case and define $f_u(w) = p \cdot g_s(w)$; see Example 14.
Example 14. Let $\Sigma = \{0, 1, 2\}$, $w = 201120$ and $k = 7$; note that $n = 6$. If $u = 0210001200210$, then $\#_0(u) = 7 \ge n$, so we are in Case 1. We have $p = 021$ and $s = 0001200210$ is the shortest suffix that contains $n = 6$ occurrences of the letter 0. Then $f_u(w)$ is constructed by concatenating $p$ with the word $g_s(w)$.
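The adapted operators can be sketched as follows for the data of Example 14. The Python names `ins_diff` and `g` are ours, letters are assumed to be single decimal digits (so $\sigma \le 10$), and blocks are again processed from right to left so that positions refer to the original word $y$.

```python
import re

def ins_diff(w: str, j: int, z: str, sigma: int) -> str:
    """Insert z, shifted position-wise by w[j] modulo sigma, after 1-based position j."""
    shift = int(w[j - 1])
    shifted = "".join(str((int(c) + shift) % sigma) for c in z)
    return w[:j] + shifted + w[j:]

def g(x: str, y: str, sigma: int) -> str:
    """g_x(y) over {0, ..., sigma-1}; requires x to start with 0 and |y| = #_0(x)."""
    assert x.startswith("0") and len(y) == x.count("0")
    v = y
    # Maximal blocks of non-0 letters of x, right to left.
    for block in reversed(list(re.finditer(r"[^0]+", x))):
        m = x[:block.start()].count("0")  # number of 0's before the block
        v = ins_diff(v, m, block.group(), sigma)
    return v

# Case 1 of the definition of f_u on the data of Example 14.
w, p, s = "201120", "021", "0001200210"
f_u_w = p + g(s, w, 3)
```

Since every letter of an inserted block $z$ is positive, each shifted letter differs from $w[j]$, which is the property the retrieval step relies on. Applying `g(s, "000000", 3)` returns `s` itself.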
Case 2: $\#_0(u) < n$. The split of $u$ into $p'$ and $s'$ is the same as in the binary case, but instead of the XOR operation we use position-wise addition modulo $\sigma$; see Example 15.
The proof of Lemma 5 that considers words of length at most $n$ in $N_k(w)$ can be directly generalised for arbitrary alphabets, by allowing substitutions of letters instead of flips. This concludes the description of the generalisation.
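The adapted retrieval algorithm has the same interface as Algorithm 1. Input: two words $w$ and $f_u(w)$. Output: a word $u$.
The following theorem, Theorem 2 from the Introduction, summarises all the results and introduces strictness in the inequality.
Proof. We get $|N^j_{k,\Sigma}(a^n)| \le |N^j_{k,\Sigma}(w)|$ for every $j$ by combining the counterparts of Lemmas 5 and 12 and Fact 11 for an arbitrary alphabet. It thus suffices to find some value of $j$ for which this inequality is strict.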
Let us first consider the case that $k < n$, in which we claim that $|N^{n-k}_{k,\Sigma}(a^n)| < |N^{n-k}_{k,\Sigma}(w)|$ for every non-unary word $w$. Note that in this case words of length $n - k$ can be obtained only by performing $k$ deletions, i.e. no insertions or substitutions are allowed. Hence $|N^{n-k}_{k,\Sigma}(a^n)| = 1$. For a non-unary $w$, we can, for instance, delete letters in lexicographic or reverse lexicographic order, breaking ties arbitrarily, obtaining words with different multiplicities for some letter.
Let us now proceed to the complementary case that $k \ge n$. Then, each word $u \in N^{k+1}_{k,\Sigma}(a^n)$ can be obtained by exactly $k + 1 - n$ insertions and at most $n - 1$ substitutions. Let us restrict ourselves to determining the size of $N'(w) \subseteq N^{k+1}_{k,\Sigma}(w)$, defined as the set of elements of $N^{k+1}_{k,\Sigma}(w)$ that can be obtained from $w$ using exactly $k + 1 - n$ insertions and at most $n - 1$ substitutions. In particular, one letter from $w$ remains unchanged and gets shifted to the right by at most $k + 1 - n$ positions (possibly not shifted at all). Thus, each word $u \in N'(w)$ can be obtained as follows. We first choose the position $i$ in $u$ where the shifted letter has landed. For such a position $i$, it is a letter $c$ occurring in $w[\max(1, i - (k + 1 - n)) \,..\, \min(n, i)]$; any of those letters can be chosen by picking a right layout of insertions. We then put $c$ at position $i$ and fill the remaining $k$ positions arbitrarily; see Example 16.
Let us do the above process once for each position $i$, with a fixed letter $\lambda(i)$, arbitrarily chosen from the possible ones. In total, we obtain all words from $\Sigma^{k+1}$ apart from the ones which differ from $\lambda(i)$ on every position $i$. In particular, the total number of words that we get for this specific choice of $\lambda(i)$'s is $\sigma^{k+1} - (\sigma - 1)^{k+1}$ and this is equal to $|N^{k+1}_{k,\Sigma}(a^n)|$. Then, at some position $j$, since $w$ is non-unary we can actually choose a letter $c \ne \lambda(j)$ instead; for instance any position $j$ such that $w[j - 1] \ne w[j]$ will work. Let us now pick this letter $c$ and fill each other position $i$ with a letter different from $\lambda(i)$. This way we obtain a word that was not obtained with the previous choice of $\lambda(i)$'s and hence $|N^{k+1}_{k,\Sigma}(w)| \ge |N'(w)| > |N^{k+1}_{k,\Sigma}(a^n)|$.

Example 16. Let us consider word $w = abc$ and $k = 5$. Every word $u \in N'(w)$ is obtained by 3 insertions and up to 2 substitutions. If the letter $a$ from $w$ is not substituted for, it can land at any of the positions from 1 to 4 in $u$; similarly, $b$ and $c$ can land at positions from 2 to 5 and from 3 to 6, respectively. This is shown schematically in the table of landing positions. Then the $i$-th column specifies the possible choices for $\lambda(i)$, e.g., $\lambda(2) \in \{a, b\}$. Note that if $w$ was unary, then all those sets would be singletons.
One possible choice of $\lambda(1), \ldots, \lambda(6)$ is $a, a, c, a, b, c$. For it, we generate all the words but the $2^6$ words that have no positions in common with $aacabc$. For a different choice of $\lambda(i)$'s, say, $a, b, c, a, b, c$, we obtain a word that was not generated before, e.g., $bbabca$ that has exactly one position in common with $abcabc$.
Let us now complete the picture by showing a closed formula for obtaining the tight lower bound implied by Theorem 2 and thus an efficient way to compute this bound. Specifically,
$$|N_{k,\Sigma}(a^n)| = \sum_{i=0}^{k} \sum_{j=i-k}^{k} \binom{n+j}{i} (\sigma - 1)^i,$$
with the convention that $\binom{m}{i} = 0$ whenever $m < i$.
Proof. We first choose the number $i$ of letters that are different from $a$ in some $u \in N_{k,\Sigma}(a^n)$ and then the length $n + j$ of the word. Note that $j \ge i - k$ since we can have at most $k - i$ deletions as we need at least $i$ insertions or substitutions to have $i$ letters different from $a$.
We then have $\binom{n+j}{i}$ options to choose the positions where the letter is not $a$ and $\sigma - 1$ letters to choose from for each such position.
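The counting argument above translates directly into a double sum. The sketch below assumes that $i$ ranges over $0, \ldots, k$ and $j$ over $i - k, \ldots, k$ (at most $k$ insertions), and that terms with $n + j < i$ contribute nothing; the function name `unary_neighbourhood_size` is ours.

```python
from math import comb

def unary_neighbourhood_size(n: int, k: int, sigma: int) -> int:
    """Number of words at Levenshtein distance at most k from a^n over sigma letters."""
    total = 0
    for i in range(k + 1):               # i letters different from a
        for j in range(i - k, k + 1):    # the word has length n + j
            if n + j >= i:               # otherwise no positions can be chosen
                total += comb(n + j, i) * (sigma - 1) ** i
    return total
```

For $n = 3$, $k = 2$ and $\sigma = 2$ the sum splits by length into $1 + 3 + 7 + 11 + 16$ words of lengths 1 to 5.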

Final Remarks
We showed a tight lower bound on the size of the Levenshtein $k$-neighbourhood. In particular, we defined a function $f_u$ for each word $u \in N_{k,\Sigma}(a^n)$, such that, for any given $w \in \Sigma^n$, we have that $f_u(w) \in N_{k,\Sigma}(w)$ and $f_u(w) \ne f_{u'}(w)$ for $u \ne u'$. Our construction is not the only one possible. For example, in Case 1 of our construction, one could take $f_u(w) = q \cdot g_s(w)$, where $q = p \oplus 1^{|p|}$ (for the binary case, this corresponds to the negation of $p$). However, our construction has a neat property that $f_u(a^n) = u$, for any $u \in N_{k,\Sigma}(a^n)$.
The following two questions remain unanswered:
1. Can a similar approach be employed for showing a tight upper bound on $|N_{k,\Sigma}(w)|$?
2. Touzet gave an algorithm for computing $|N_{k,\Sigma}(w)|$ for a word $w$ of length $n$ over an alphabet $\Sigma$ that works in time linear in $n$ but exponential in $k$ [11]. Can this computation be done in polynomial time or is this problem #P-hard?

Observation 3. Any $u \in N_k(a^n)$ with $|u| \le n$ can be obtained from $a^n$ by the following sequence of at most $k$ edit operations: $n - |u|$ deletions of $a$'s in the beginning of $a^n$ followed by a sequence of $|u| - \#_a(u)$ flips.

Example 4. Let $w = aaaa$ and $k = 2$. Then $u = aba \in N_2(aaaa)$ can be obtained from $aaaa$ by deleting $n - |u| = 1$ letter $a$ to obtain $aaa$ and then by $|u| - \#_a(u) = 1$ flip to obtain $aba$.
Intuitively, the size of the set $N^j_k(a^n)$ is equal to the number of subsets of $\{1, \ldots, j\}$ of size at most $k - (n - j)$; $n - j$ is the number of deletions and $k - (n - j)$ the number of flips.
Lemma 5. If $j \le n$, then $|N^j_k(a^n)| \le |N^j_k(w)|$ for all $w \in \Sigma^n$.

Observation 6, part (2). Any $u \in N^{>n}_k(a^n)$ with $\#_a(u) < n$ can be obtained from $a^n$ by the following sequence of at most $k$ edit operations: $n - \#_a(u)$ flips followed by a sequence of $|u| - n$ insertions of $b$'s. The insertions can be restricted to the part of the word after the rightmost flip.
Example 7. Let $w = aaaa$ and $k = 2$. For Case 1, $u = aaaaba \in N^{>n}_2(aaaa)$ with $\#_a(u) = 5 \ge n = 4$ can be obtained by $\#_a(u) - n = 1$ insertion of $a$ in the beginning of $aaaa$ to obtain $aaaaa$ and then by $|u| - \#_a(u) = 1$ insertion of $b$ to obtain $aaaaba$. For Case 2, $u = aabab \in N^{>n}_2(aaaa)$ with $\#_a(u) = 3 < n = 4$ can be obtained by $n - \#_a(u) = 1$ flip to obtain $aaba$ and then by $|u| - n = 1$ insertion of $b$ to the right of the flip to obtain $aabab$.

Figure 1. For every position $j$ of $w = aaabbaa$ (bottom row), the letter (top row) of which a block inserted by ins-diff$(w, j, 1)$ after position $j$ would consist.
inserted letter: b b b a a b b
w:               a a a b b a a

Note that $|g_x(y)| = |x|$; see Example 8.
Example 8. Let $x = aababbababaab$ and $y = aaabbaa$; note that $\#_a(x) = 7 = |y|$. We have $g_x(y) = v = aababbbabaaab$; recall that we perform this procedure from right to left. Starting from $v = y$, for the first maximal block $x[r \,..\, r + t - 1] = x[13 \,..\, 13] = b$ with $\#_a(x[1 \,..\, r - 1]) = \#_a(x[1 \,..\, 12]) = m = 7$, we perform ins-diff$(v, 7, 1)$, which constructs $aaabbaab$. For the next maximal block $x[r \,..\, r + t - 1] = x[10 \,..\, 10] = b$ with $\#_a(x[1 \,..\, r - 1]) = \#_a(x[1 \,..\, 9]) = m = 5$, we perform ins-diff$(v, 5, 1)$, which constructs $aaabbaaab$. For the next maximal block $x[r \,..\, r + t - 1] = x[8 \,..\, 8] = b$ with $\#_a(x[1 \,..\, r - 1]) = \#_a(x[1 \,..\, 7]) = m = 4$, we perform ins-diff$(v, 4, 1)$, which constructs $aaababaaab$, and so on.
x:      a a b a b b a b a b a a b
y:      a a a b b a a
g_x(y): a a b a b b b a b a a a b
Figure for Example 9:
u:      a b a a a a a a b a a
w:      a a a b b a a
f_u(w): a b a a a a b b a a a

Example 10. Let $w = aaabbaa$ and $k = 4$; note that $n = 7$. If $u = aababbaab$, then $\#_a(u) = 5 < n$, so we are in Case 2. We have $p' = aabab$ and $s' = baab$ is the shortest suffix that contains $|u| - n = 2$ occurrences of letter $b$. Then $f_u(w)$ is constructed by concatenating two words: the first one is $h_{p'}(w')$, where $w' = w[1 \,..\, |p'|]$; and the second one is composed of the final $|s'|$ letters of $g_x(w)$, where $x$ is the word obtained from $u$ by changing the first $n - \#_a(u) = 2$ occurrences of $b$ to $a$ (note that $\#_a(x) = n$ and so applying $g_x$ is well-defined).
Example 15. Let $\Sigma = \{0, 1, 2\}$, $w = 201120$ and $k = 5$; note that $n = 6$. If $u = 21021020$, then $\#_0(u) = 3 < n$, so we are in Case 2. We have $p' = 2102$ and $s' = 1020$ is the shortest suffix that contains $|u| - n = 2$ occurrences of letters different than 0. Then $f_u(w)$ is constructed by concatenating two words: the first one is $h_{p'}(w')$, where $w' = w[1 \,..\, |p'|]$; and the second one is composed of the final $|s'|$ letters of $g_x(w)$, where $x = 0^{|p'|} s'$ is the word obtained from $u$ by changing the first $n - \#_0(u) = 3$ occurrences of non-0 letters to 0.
The retrieval algorithm is an adaptation of Algorithm 1 for $\Sigma = \{0, \ldots, \sigma - 1\}$. Note that the two constructions are identical for $|\Sigma| = 2$, $a = 0$ and $b = 1$.

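Table of landing positions for Example 16 ($w = abc$, $k = 5$); a dash marks a position the letter cannot reach.
position in u:          1 2 3 4 5 6
landing positions of a: a a a a - -
landing positions of b: - b b b b -
landing positions of c: - - c c c c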