State Complexity of Simple Splicing

Splicing, as a binary word/language operation, was inspired by DNA recombination under the action of restriction enzymes and ligases, and was first introduced by Tom Head in 1987. Splicing systems as generative mechanisms were defined as consisting of an initial set of words, called an axiom set, and a set of splicing rules (each encoding a splicing operation) that serve as the computational engine to iteratively generate new strings starting from the axiom set. Since finite splicing systems (splicing systems with a finite axiom set and a finite set of splicing rules) generate a subclass of the family of regular languages, descriptional complexity questions about splicing systems can be answered in terms of the size of the minimal deterministic finite automata that recognize their languages. In this paper we focus on a particular type of splicing system, called simple splicing systems, where the splicing rules are of a particular form. We prove a tight state complexity bound of $2^n - 1$ for (semi-)simple splicing systems with a regular initial language with state complexity $n \geq 3$. We also show that the state complexity of a (semi-)simple splicing system with a finite initial language is at most $2^{n-2} + 1$, and that whether this bound is reachable or not depends on the size of the alphabet and the number of splicing rules.


Introduction
In [10] Head described a language-theoretic operation, called splicing, which models DNA recombination, a cut-and-paste operation on DNA double-stranded molecules, under the action of restriction enzymes and ligases. A splicing system is a formal language model which consists of a set of initial words $I$ (representing double-stranded DNA strings) and a set of splicing rules $R$ (representing restriction enzymes). The most commonly used definition for a splicing rule is a quadruplet of words $r = (u_1, v_1; u_2, v_2)$. This rule splices two words $x_1 u_1 v_1 y_1$ and $x_2 u_2 v_2 y_2$: the words are cut between the factors $u_1, v_1$ and $u_2, v_2$, respectively, and the prefix (the left segment) of the first word is recombined by catenation with the suffix (the right segment) of the second word, yielding $x_1 u_1 v_2 y_2$; see Fig. 1 and also [16]. The words $u_1 v_1$ and $u_2 v_2$ are the restriction sites in the rule $r$. A splicing system generates a language which contains every word that can be obtained by successively applying rules to axioms and the intermediately produced words.
The most natural variant of splicing systems, often referred to as finite splicing systems, is to consider a finite set of axioms and a finite set of rules.
Several different types of splicing systems have been proposed in the literature, and Bonizzoni et al. [1] showed that the classes of languages they generate are related. Shortly after the introduction of splicing in formal language theory, Culik II and Harju [4] proved that finite splicing systems can only generate regular languages; see also [11,15]. Gatterdam [7] gave $(aa)^*$ as an example of a regular language which cannot be generated by a finite splicing system; thus, the class of languages generated by finite splicing systems is strictly included in the class of regular languages.
Descriptional complexity considers the complexity of a language in terms of the size of a computational device (in this case, a splicing system) that generates or recognizes it. For instance, Mateescu et al. [14] consider a number of descriptional complexity measures for simple splicing systems, such as the number of rules, the number of words in the initial language, the maximum length of a word in the initial language, and the sum of the lengths of all words in the initial language. Loos et al. [13] consider the descriptional complexity of finite splicing systems by using the number of rules, the length of the rules, and the size of the initial language as complexity measures. Păun [16] proposed using the radius, that is, the length of the longest $u_i$ in a rule, as a descriptional complexity measure.
As the class of languages generated by splicing systems forms a subclass of the family of regular languages, their descriptional complexity can also be considered in terms of the finite automata that recognize them. For example, Loos et al. [13] gave a bound on the number of states required for a nondeterministic finite automaton to recognize the language generated by an equivalent finite splicing system.
We focus our attention on simple splicing systems, that is, splicing systems where the rules $(u_1, v_1; u_2, v_2)$ are of a particular form: $u_1 = u_2 = a$ is a single letter and $v_1 = v_2 = \varepsilon$ is the empty word. The descriptional complexity of simple splicing systems was considered by Mateescu et al. [14] in terms of the size of a right linear grammar that generates a simple splicing language. Here we consider the descriptional complexity of simple splicing systems in terms of deterministic state complexity [6]. In other words, we measure the descriptional complexity of a simple splicing system in terms of the size of the minimal deterministic finite automaton that recognizes the language generated by the splicing system.
In this paper, we prove tight state complexity bounds for simple and semi-simple splicing systems with regular and finite initial languages. In Sect. 2, we fix notation and definitions for simple splicing systems. We consider the state complexity of simple splicing systems with regular and finite initial languages in Sect. 3. In Sect. 4, we give tight state complexity bounds for semi-simple splicing systems with finite initial languages. We consider the state complexity of the crossover operation related to simple splicing systems in Sect. 5.

Preliminaries
Let $\Sigma$ be a finite alphabet. We denote by $\Sigma^*$ the set of all finite words over $\Sigma$, including the empty word, which we denote by $\varepsilon$. We denote the length of a word $w$ by $|w|$. If $w = xyz$ for $x, y, z \in \Sigma^*$, we say that $x$ is a prefix of $w$, $y$ is a factor of $w$, and $z$ is a suffix of $w$.
A deterministic finite automaton (DFA) is a tuple $A = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$ is an alphabet, $\delta$ is a function $\delta : Q \times \Sigma \to Q$, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is a set of final states. We extend the transition function $\delta$ to a function $Q \times \Sigma^* \to Q$ in the usual way. A DFA $A$ is complete if $\delta$ is defined for all $q \in Q$ and $a \in \Sigma$. We will make use of the notation $q \xrightarrow{w} q'$ for $\delta(q, w) = q'$, where $w \in \Sigma^*$ and $q, q' \in Q$. A state $q \in Q$ is called a sink state if $\delta(q, a) = q$ for all $a \in \Sigma$ and $q \notin F$.
Each letter $a \in \Sigma$ defines a transformation of the state set $Q$. Let $\delta_a : Q \to Q$ be the transformation on $Q$ induced by $a$, defined by $\delta_a(q) = \delta(q, a)$. We extend this definition to words by composing the transformations: for $w = a_1 a_2 \cdots a_k$, the transformation $\delta_w$ is obtained by applying $\delta_{a_1}, \delta_{a_2}, \ldots, \delta_{a_k}$ in this order, so that $\delta_w(q) = \delta(q, w)$.
The language recognized or accepted by $A$ is $L(A) = \{w \in \Sigma^* \mid \delta(q_0, w) \in F\}$. A state $q$ is called reachable if there exists a string $w \in \Sigma^*$ such that $\delta(q_0, w) = q$. Two states $p$ and $q$ of $A$ are said to be equivalent if $\delta(p, w) \in F$ if and only if $\delta(q, w) \in F$ for every word $w \in \Sigma^*$. A DFA $A$ is minimal if each state $q \in Q$ is reachable from the initial state and no two states are equivalent. The state complexity of a regular language $L$ is the number of states of the minimal complete DFA recognizing $L$ [6].
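The definitions of reachability, state equivalence, and state complexity can be made concrete with a small sketch. The function name `state_complexity` and the dictionary encoding of $\delta$ are illustrative assumptions, not notation from the paper; the refinement loop is the textbook Moore-style partition refinement, used here only as one possible way to count the states of the minimal complete DFA.

```python
def state_complexity(alphabet, delta, start, finals):
    """Count the states of the minimal complete DFA equivalent to a complete DFA.

    delta maps (state, letter) -> state and must be defined on every
    reachable state and every letter.
    """
    letters = sorted(alphabet)
    # Step 1: keep only the states reachable from the initial state.
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in letters:
            r = delta[(q, a)]
            if r not in reachable:
                reachable.add(r)
                stack.append(r)
    # Step 2: refine the final / non-final partition until it is stable.
    # Two states stay in one class only if no word distinguishes them.
    cls = {q: int(q in finals) for q in reachable}
    while True:
        sig = {q: (cls[q],) + tuple(cls[delta[(q, a)]] for a in letters)
               for q in reachable}
        ids = {}
        new_cls = {q: ids.setdefault(sig[q], len(ids)) for q in reachable}
        if len(ids) == len(set(cls.values())):
            return len(ids)
        cls = new_cls


# (aa)* over {a}: both states are needed.
even_a = {(0, "a"): 1, (1, "a"): 0}
# a^+ given with a redundant third state: states 1 and 2 are equivalent.
plus_a = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 1}
print(state_complexity({"a"}, even_a, 0, {0}))
print(state_complexity({"a"}, plus_a, 0, {1, 2}))
```

The second automaton shows why minimality matters for the measure: it has three states, but only two equivalence classes survive.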
A nondeterministic finite automaton (NFA) is a tuple $A = (Q, \Sigma, \delta, I, F)$ where $Q$ is a finite set of states, $\Sigma$ is an alphabet, $\delta$ is a function $\delta : Q \times \Sigma \to 2^Q$, $I \subseteq Q$ is a set of initial states, and $F$ is a set of final states. The language recognized by an NFA $A$ is $L(A) = \{w \in \Sigma^* \mid \bigcup_{q \in I} \delta(q, w) \cap F \neq \emptyset\}$. As with DFAs, transitions of $A$ can be viewed as transformations on the state set. Let $\delta_a : Q \to 2^Q$ be the transformation on $Q$ induced by $a$, defined by $\delta_a(q) = \delta(q, a)$. The image of $\delta_a$ is defined by $\operatorname{im} \delta_a = \{\delta(p, a) \mid p \in Q\}$. We make use of the notation $P \xrightarrow{w} P'$ for $P' = \bigcup_{q \in P} \delta(q, w)$, where $w \in \Sigma^*$ and $P, P' \subseteq Q$.
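The notation $P \xrightarrow{a} P'$ on subsets of states is exactly what the subset construction iterates. The following is a minimal sketch of that standard construction; the function name `determinize` and the dictionary encoding of the transition function are assumptions made for illustration.

```python
def determinize(alphabet, delta, initials, finals):
    """Subset construction for an NFA without epsilon-transitions.

    delta maps (state, letter) -> set of states; a missing key means
    that there is no move. Returns the reachable subsets, the DFA
    transitions, the start subset, and the accepting subsets.
    """
    start = frozenset(initials)
    subsets, stack, trans = {start}, [start], {}
    while stack:
        current = stack.pop()
        for a in sorted(alphabet):
            # P --a--> P' with P' the union of delta(q, a) over q in P.
            target = frozenset(r for q in current
                               for r in delta.get((q, a), ()))
            trans[(current, a)] = target
            if target not in subsets:
                subsets.add(target)
                stack.append(target)
    accepting = {s for s in subsets if s & set(finals)}
    return subsets, trans, start, accepting


# NFA for "the second-to-last letter is a" over {a, b}.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
subsets, trans, start, accepting = determinize({"a", "b"}, nfa, {0}, {2})
print(len(subsets), len(accepting))
```

Only reachable subsets are generated, so the number of DFA states produced can be far below the $2^{|Q|}$ worst case.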
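A splicing scheme is a pair $\sigma = (\Sigma, R)$ where $\Sigma$ is an alphabet and $R$ is a set of splicing rules. For a splicing scheme $\sigma = (\Sigma, R)$ and a language $L \subseteq \Sigma^*$, we denote by $\sigma(L)$ the language $\sigma(L) = L \cup \{z \in \Sigma^* \mid (x, y) \vdash_r z \text{ for some } x, y \in L \text{ and } r \in R\}$, where $(x, y) \vdash_r z$ means that $z$ is obtained by splicing $x$ and $y$ with the rule $r$. Then we define $\sigma^0(L) = L$ and $\sigma^{i+1}(L) = \sigma(\sigma^i(L))$ for $i \geq 0$ and $\sigma^*(L) = \bigcup_{i \geq 0} \sigma^i(L)$. For a splicing scheme $\sigma = (\Sigma, R)$ and an initial language $L \subseteq \Sigma^*$, we say the triple $H = (\Sigma, R, L)$ is a splicing system. The language generated by $H$ is defined by $L(H) = \sigma^*(L)$.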
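For a finite language, the first few iterates $\sigma^i(L)$ can be computed directly. The sketch below assumes that $\sigma(L)$ keeps the words of $L$ and adds every word obtained by one application of a rule $(u_1, v_1; u_2, v_2)$ to two words of $L$; the names `splice`, `sigma`, and `sigma_iter` are hypothetical helpers, and rules are encoded as 4-tuples of Python strings.

```python
def splice(x, y, rule):
    """Words obtained from x and y by one application of (u1, v1, u2, v2)."""
    u1, v1, u2, v2 = rule
    out = set()
    for i in range(len(x) + 1):
        # x = x1 u1 v1 y1, with the cut at position i, right after u1.
        if not (x[:i].endswith(u1) and x[i:].startswith(v1)):
            continue
        for j in range(len(y) + 1):
            # y = x2 u2 v2 y2, with the cut at position j, right after u2.
            if y[:j].endswith(u2) and y[j:].startswith(v2):
                out.add(x[:i] + y[j:])  # x1 u1 followed by v2 y2
    return out


def sigma(language, rules):
    """One round: the language itself plus all one-step splices."""
    return set(language) | {z for x in language for y in language
                            for r in rules for z in splice(x, y, r)}


def sigma_iter(language, rules, rounds):
    """The iterate sigma^rounds(language)."""
    current = set(language)
    for _ in range(rounds):
        current = sigma(current, rules)
    return current


rules = [("a", "b", "c", "d")]
axioms = {"xaby", "pcdq"}
print(sorted(sigma_iter(axioms, rules, 1)))
```

In this example the new word contains neither restriction site, so further rounds add nothing. In general the iterates keep growing and $\sigma^*(L)$ is infinite, which is why the number of rounds is a parameter.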
Mateescu et al. [14] define a restricted class of splicing systems called simple splicing systems. A simple splicing system is a triple $H = (\Sigma, M, I)$, where $\Sigma$ is an alphabet, $M \subseteq \Sigma$ is a set of markers, and $I$ is a finite initial language over $\Sigma$. For $a \in M$, we have $(x, y) \vdash_a z$ if and only if $x = x_1 a x_2$, $y = y_1 a y_2$, and $z = x_1 a y_2$ for some $x_1, x_2, y_1, y_2 \in \Sigma^*$. In other words, a simple splicing system is a system in which the set of rules is $\mathcal{M} = \{(a, \varepsilon; a, \varepsilon) \mid a \in M\}$ and the initial language $I$ is finite. Since the rules are determined solely by our choice of $M \subseteq \Sigma$, the set $M$ is used in the definition of the simple splicing system rather than the set of rules $\mathcal{M}$. Based on these properties, one can deduce that the class of languages generated by simple splicing systems is subregular [4,15]. Mateescu et al. [14] show that these languages form a proper subclass of the extended star-free languages.
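A simple splicing step only needs the positions of a marker in both words. The sketch below enumerates the words of a simple splicing language up to a length bound; `simple_splice` and `bounded_language` are hypothetical names, and the bound is an assumption added so that the loop terminates. The result should be read as an under-approximation in general, since derivations that pass through longer words are not explored.

```python
def simple_splice(x, y, markers):
    """All z = x1 a y2 with x = x1 a x2, y = y1 a y2 and a in markers."""
    out = set()
    for i, a in enumerate(x):
        if a in markers:
            for j, b in enumerate(y):
                if b == a:
                    out.add(x[: i + 1] + y[j + 1:])
    return out


def bounded_language(initial, markers, max_len):
    """Words of length <= max_len derivable without exceeding max_len."""
    known = {w for w in initial if len(w) <= max_len}
    frontier = set(known)
    while frontier:
        new = set()
        for x in known:
            for y in known:
                # Only pairs involving a newly found word can give new words.
                if x in frontier or y in frontier:
                    new |= {z for z in simple_splice(x, y, markers)
                            if len(z) <= max_len}
        frontier = new - known
        known |= frontier
    return known


print(sorted(bounded_language({"aba"}, {"a"}, 7), key=len))
```

With $I = \{aba\}$ and $M = \{a\}$, every derivable word has the form $a(ba)^k$, so the finite axiom set already generates an infinite language.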
In this paper, we will relax the condition that the initial language of a simple splicing system must be a finite language. We will also consider simple splicing systems with regular initial languages. By [16], it is clear that such a splicing system will also produce a regular language. In the following, we will use the convention that $I$ denotes a finite language and $L$ denotes an infinite language.

State Complexity of Simple Splicing
In this section, we will give tight state complexity bounds for simple splicing systems with regular and finite initial languages. First, we will define an NFA that recognizes the language of a given simple splicing system. The construction follows a more general construction due to Loos et al. [13] for finite splicing systems. The construction of Loos et al. is a simplification of a construction by Pixton [15], which itself is a simplification of the original proof of regularity of finite splicing due to Culik II and Harju [4].
Proposition 1. Let $H = (\Sigma, M, L)$ be a simple splicing system with a regular initial language $L$ and let $L$ be recognized by a DFA with $n$ states. Then there exists an NFA $A_H$ with $n$ states such that $L(A_H) = L(H)$.
Proof. Let $H = (\Sigma, M, L)$ and let $A = (Q, \Sigma, \delta, q_0, F)$ be a DFA for $L$. We will define an NFA $A_H = (Q', \Sigma, \delta', q_0, F)$ recognizing $L(H)$. First, we describe the construction of [13]. Let $\mathcal{M} = \{(a, \varepsilon; a, \varepsilon) \mid a \in M\}$ be the set of rules for $H$. For each rule $(\alpha_1, \alpha_2; \alpha_3, \alpha_4) \in \mathcal{M}$, add new states and transitions corresponding to $\alpha_1\alpha_4$ and $\alpha_3\alpha_2$. That is, add a path of new states $r_0, \ldots, r_{i+\ell}$ labelled by $\alpha_1\alpha_4$ and a path of new states $s_0, \ldots, s_{j+k}$ labelled by $\alpha_3\alpha_2$, where $i + \ell = |\alpha_1\alpha_4|$ and $j + k = |\alpha_3\alpha_2|$. Consider each path $q \xrightarrow{\alpha_1\alpha_2} q'$ such that $q$ is reachable from the initial state $q_0$ and a final state of $A$ is reachable from $q'$. We add an ε-transition from $q$ to $r_0$ and from $s_{j+k}$ to $q'$. Similarly, for each path $t \xrightarrow{\alpha_3\alpha_4} t'$, add ε-transitions from $t$ to $s_0$ and from $r_{i+\ell}$ to $t'$.
Now, since $H$ is a simple splicing system, this construction can be simplified further. Since every rule of $H$ is of the form $(a, \varepsilon; a, \varepsilon)$, we only need to add states and transitions $p_a \xrightarrow{a} p'_a$ for each rule. Then add ε-transitions from states $q$ of $A$ to $p_a$ if $q$ has an outgoing transition on $a$ to a non-sink state of $A$. From each state $p'_a$, add ε-transitions to each state of $A$ with an incoming transition on $a$. Recall that $\operatorname{im} \delta_a$ is the image of the transformation of $\delta$ induced by $a$, and therefore it is the set of states of $A$ with an incoming transition on $a$. Finally, we can simplify this NFA by removing ε-transitions in the usual way to obtain an NFA $A_H = (Q, \Sigma, \delta', q_0, F)$, where, for $q \in Q$ and $a \in \Sigma$, $\delta'(q, a) = \operatorname{im} \delta_a$ if $a \in M$ and $\delta(q, a)$ is not the sink state, and $\delta'(q, a) = \{\delta(q, a)\}$ otherwise. Figure 2 illustrates the new states and transitions added for $a \in M$ before and after ε-removal. Observe that by removing the ε-transitions, we also remove the states that were added earlier in the construction of $A_H$. Thus, the state set of $A_H$ is exactly the state set of the DFA $A$ recognizing $L$.
Given a splicing system $H = (\Sigma, M, L)$, one can obtain a DFA that recognizes $L(H)$ by performing the subset construction on $A_H$. As shown in Proposition 1, if $L$ is recognized by a DFA with $n$ states, then $A_H$ also has $n$ states. By applying the subset construction and observing that the empty set is not reachable from any subset of $Q$ in $A_H$, this gives an upper bound of $2^n - 1$ states for a DFA equivalent to $A_H$.
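The construction and the subset-construction step can be tried on a small example. The sketch below is one reading of the ε-free NFA described in the proof, stated as an assumption: on a marker $a$, a state whose $a$-successor is not the sink may move to every state with an incoming $a$-transition, and all other transitions are those of the DFA. The names `splicing_nfa`, `step`, `reachable_subsets`, and `accepts` are hypothetical, and every state of the input DFA is assumed to be reachable.

```python
def splicing_nfa(states, alphabet, delta, markers, sink=None):
    """NFA transitions for L(H), built on the state set of a complete DFA."""
    # image[a] is im(delta_a): the states with an incoming a-transition.
    image = {a: frozenset(delta[(q, a)] for q in states) for a in alphabet}
    nfa = {}
    for q in states:
        for a in alphabet:
            target = delta[(q, a)]
            if a in markers and target != sink:
                nfa[(q, a)] = image[a]
            else:
                nfa[(q, a)] = frozenset({target})
    return nfa


def step(nfa, subset, a):
    return frozenset(r for q in subset for r in nfa[(q, a)])


def reachable_subsets(nfa, alphabet, start):
    first = frozenset({start})
    seen, stack = {first}, [first]
    while stack:
        subset = stack.pop()
        for a in alphabet:
            nxt = step(nfa, subset, a)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(nfa, start, finals, word):
    subset = frozenset({start})
    for a in word:
        subset = step(nfa, subset, a)
    return bool(subset & set(finals))


# Minimal complete DFA for {aba}: states 0..3 on the path, 4 is the sink.
dfa = {(0, "a"): 1, (0, "b"): 4, (1, "a"): 4, (1, "b"): 2, (2, "a"): 3,
       (2, "b"): 4, (3, "a"): 4, (3, "b"): 4, (4, "a"): 4, (4, "b"): 4}
nfa = splicing_nfa(range(5), "ab", dfa, {"a"}, sink=4)
subsets = reachable_subsets(nfa, "ab", 0)
print(len(subsets), accepts(nfa, 0, {3}, "ababa"))
```

Here $n = 5$ and only four subsets are reachable, well below $2^n - 1 = 31$; the exponential bound is a worst case that needs carefully chosen witnesses.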
We will now show that there exists a family of regular languages $L_n$ with state complexity $n$ such that a simple splicing system $H = (\Sigma, M, L_n)$ with one marker requires $2^n - 1$ states for an equivalent DFA to recognize it.
Proposition 2. For $|\Sigma| \geq 3$ and $n \geq 3$, there exists a simple splicing system with a regular initial language $H = (\Sigma, M, L_n)$ with $|M| = 1$, where $L_n$ is a regular language with state complexity $n$, such that the minimal DFA for $L(H)$ requires at least $2^n - 1$ states.
Proposition 2 is proved via the family of languages $L_n$ accepted by DFAs $A_n$, shown in Fig. 3, with $M = \{c\}$. Together, Propositions 1 and 2 give the following result.
Theorem 3. For a simple splicing system with a regular initial language $H = (\Sigma, M, L_n)$ where $M \subseteq \Sigma$ and $L_n \subseteq \Sigma^*$ has state complexity $n$, the state complexity of $L(H)$ is at most $2^n - 1$ and this bound can be reached in the worst case.
We will now consider simple splicing systems with a finite initial language. We will show that the upper bound of $2^n - 1$ obtained from Proposition 1 is not reachable in this case.
Proposition 4. Let $H = (\Sigma, M, I)$ be a simple splicing system with a finite initial language, where $I$ is a finite language recognized by a DFA $A$ with $n$ states. Then a DFA recognizing $L(H)$ requires at most $2^{n-2} + 1$ states.
We will show that this bound is reachable. We note that the lower bound witness used in the following lemma is defined over an alphabet with size exponential in the number of states of the DFA recognizing the initial language.
Lemma 5. There exists a simple splicing system with a finite initial language $H = (\Sigma, M, I_n)$, where $I_n$ is a finite language with state complexity $n$, such that a DFA recognizing $L(H)$ requires $2^{n-2} + 1$ states.
Together, Proposition 4 and Lemma 5 give the following result.
Theorem 6. For a simple splicing system with a finite initial language $H = (\Sigma, M, I_n)$ where $M \subseteq \Sigma$ and $I_n \subseteq \Sigma^*$ has state complexity $n$, the state complexity of $L(H)$ is at most $2^{n-2} + 1$ and this bound can be reached in the worst case.
The bound of Lemma 5 is reached by a witness defined over an alphabet of size $2^{n-3} + 1$. An obvious question is whether this bound can be reached via a smaller alphabet. We will consider in the following the state complexity of simple splicing systems with a finite initial language for small, fixed alphabets. We begin with a general observation on the transition function of a DFA recognizing the language of a simple splicing system.
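To get a feel for the sizes involved, the bounds stated so far can be evaluated for a sample value. The choice $n = 6$ is arbitrary and only serves as an illustration.

```latex
\begin{align*}
  \text{regular initial language (Theorem 3):} \quad & 2^{n} - 1 = 2^{6} - 1 = 63, \\
  \text{finite initial language (Theorem 6):} \quad & 2^{n-2} + 1 = 2^{4} + 1 = 17, \\
  \text{alphabet size of the witness of Lemma 5:} \quad & 2^{n-3} + 1 = 2^{3} + 1 = 9.
\end{align*}
```

Both the state bound for finite initial languages and the alphabet size of its witness grow exponentially in $n$, which motivates the question of what happens over small, fixed alphabets.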

Lemma 7. Let H = (Σ, M, L) be a simple splicing system with a regular initial language and let A H be an NFA recognizing L(H). If a ∈ M and δ is the transition function of
First, we will consider simple splicing systems with a finite initial language defined over a unary alphabet.
Proposition 8. Let $H = (\{a\}, M, I)$ be a simple splicing system where $M$ is nonempty and $I$ is a finite language containing a word of length at least 2. Then the minimal DFA recognizing $L(H)$ has exactly two states.
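The unary case can be checked by hand. The following derivation is a sketch under the added assumption that $\varepsilon \notin I$; it takes $M = \{a\}$ and a word $a^k \in I$ with $k \geq 2$.

```latex
\begin{align*}
  (a^{k}, a^{k}) \vdash_a a^{i} \, a \, a^{j}
    &\quad \text{for all } 0 \le i, j \le k - 1, \\
  \text{so } \sigma(I) &\supseteq \{a^{1}, a^{2}, \ldots, a^{2k-1}\}, \\
  \text{and, since } 2k - 1 > k, \text{ iterating gives } L(H) &= \{a^{m} \mid m \ge 1\} = a^{+}.
\end{align*}
```

A complete DFA for $a^{+}$ needs a non-final initial state and one final state with a self-loop on $a$, which accounts for the two states in the statement.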
Next, we consider simple splicing systems with a finite initial language defined over a binary alphabet. We will show that the small size of the alphabet restricts the number of transformations that can be performed on the state set and that, as a result, the upper bound on the number of states falls far below the upper bound of Proposition 4.
Proposition 9. Let $H = (\{a, b\}, M, I)$ be a simple splicing system where $I$ is a finite language with state complexity $n$. Then the minimal DFA recognizing $L(H)$ has at most $2n - 3$ states and this bound is reachable in the worst case.
We will now consider the state complexity of simple splicing systems with a finite initial language defined over a ternary alphabet. We will show that the upper bound of $2^{n-2} + 1$ from Proposition 4 cannot be reached with an alphabet of size 3.
Proposition 10. Let $H = (\{a, b, c\}, M, I)$ be a simple splicing system where $I$ is a finite language with state complexity $n$. Then the minimal DFA recognizing $L(H)$ has at most
We note that the upper bound of Proposition 10 is similar to the state complexity of the reversal operation on finite languages [2]. We will use this result as inspiration for a family of lower bound witnesses in the following lemma.
Lemma 11. There exists a family of finite languages $I_n \subseteq \{a, b, c\}^*$, for $n \geq 4$, recognized by a DFA with $n$ states such that the minimal DFA for a simple splicing system $H = (\{a, b, c\}, M, I_n)$ requires at least
The family of witness languages $I_n$ used to prove Lemma 11 is accepted by DFAs $A_n$, shown in Fig. 4, with $M = \{c\}$. Together, Proposition 10 and Lemma 11 give us the following theorem.
Theorem 12. For a simple splicing system with a finite initial language $H = (\Sigma, M, I_n)$ where $|\Sigma| = 3$, $M \subseteq \Sigma$, and $I_n \subseteq \Sigma^*$ has state complexity $n$, the state complexity of $L(H)$ is at most + 1 states if $n$ is odd and this bound can be reached in the worst case.

State Complexity of Semi-simple Splicing
In this section, we will give tight state complexity bounds for semi-simple splicing systems with regular and finite initial languages. In particular, we will show that the upper bound is reachable for semi-simple splicing systems with a finite initial language defined over a fixed alphabet.
Goode and Pixton [9] generalize simple splicing systems by defining semi-simple splicing systems. A splicing system is semi-simple if every rule is of the form $(a, \varepsilon; b, \varepsilon)$ for $a, b \in \Sigma$. Again, rather than explicitly define a set of rules $\mathcal{M}$, we refer instead to the set $M^{(2)} \subseteq \Sigma \times \Sigma$ of pairs of symbols, which determines the set of rules. As with simple splicing systems, one can conclude that the class of languages generated by semi-simple splicing systems is subregular [4,15].
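A semi-simple rule $(a, \varepsilon; b, \varepsilon)$ cuts the first word after an occurrence of $a$ and the second word after an occurrence of $b$. The sketch below follows the general rule format from the introduction, so the result is $x_1 a y_2$ for $x = x_1 a x_2$ and $y = y_1 b y_2$; the function name `semi_simple_splice` and the encoding of $M^{(2)}$ as a set of pairs are illustrative assumptions.

```python
def semi_simple_splice(x, y, pairs):
    """All z = x1 a y2 with x = x1 a x2, y = y1 b y2 and (a, b) in pairs."""
    return {x[: i + 1] + y[j + 1:]
            for i, a in enumerate(x)
            for j, b in enumerate(y)
            if (a, b) in pairs}


# The pair (a, b): cut "cab" after a, cut "dbe" after b.
print(semi_simple_splice("cab", "dbe", {("a", "b")}))
# Pairs of the form (a, a) give back simple splicing.
print(sorted(semi_simple_splice("aba", "aba", {("a", "a")}), key=len))
```

Unlike simple splicing, the letter at the cut in the second word does not appear in the result, which gives the rules more freedom and is consistent with the smaller alphabets needed for the witnesses in this section.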
In the following, we will give a construction for an NFA that recognizes the language generated by a semi-simple splicing system.As with the NFA for simple splicing systems from Proposition 1, the construction will follow the more general construction for finite splicing systems of Loos et al. [13].
Proposition 13. Let $H = (\Sigma, M^{(2)}, L)$ be a semi-simple splicing system with a regular initial language $L$ recognized by a DFA with $n$ states. Then there exists an NFA $B_H$ with $n$ states such that $L(B_H) = L(H)$.
It is clear from Proposition 13 that for a given regular language $L$ with state complexity $n$, a DFA for the language of a semi-simple splicing system $H = (\Sigma, M^{(2)}, L)$ requires at most $2^n - 1$ states. Since a simple splicing system is also a semi-simple splicing system, the lower bound witness from Proposition 2 applies, so this bound is reached in the worst case. Therefore, we can focus on the more interesting case of semi-simple splicing systems with finite initial languages. First, we observe that even with semi-simple splicing rules, the upper bound on the number of states for a DFA recognizing the language of a semi-simple splicing system with a finite initial language remains the same.
Proposition 14. Let $H = (\Sigma, M^{(2)}, I)$ be a semi-simple splicing system with a finite initial language, where $I$ is a finite language recognized by a DFA $A$ with $n$ states. Then a DFA recognizing $L(H)$ requires at most $2^{n-2} + 1$ states.
The proof of this fact is identical to the proof of Proposition 4.
Recall from Lemma 5 that the lower bound witness for simple splicing systems with a finite initial language was defined over an alphabet with size exponential in the state complexity of the initial language. We will show in the following lemma that for semi-simple splicing systems with a finite initial language, a lower bound witness defined over an alphabet of size 3 exists.

Lemma 15.
Let $n \geq 4$. Then there exists a semi-simple splicing system with a finite initial language $H = (\Sigma, M^{(2)}, I_n)$, where $|\Sigma| = 3$ and $I_n$ is a finite language with state complexity $n$, such that L(H) is recognized by a DFA that requires at least $2^{n-2} + 1$ states.
The family of witness languages $I_n$ of Lemma 15 is accepted by DFAs $A_n$, shown in Fig. 5, with $\Sigma = \{a, b, c\}$ and $M^{(2)} = \{(a, c)\}$. From Proposition 14 and Lemma 15, we have the following result.
Theorem 16. For a semi-simple splicing system with a finite initial language $H = (\Sigma, M^{(2)}, I_n)$ where $M^{(2)} \subseteq \Sigma \times \Sigma$ and $I_n \subseteq \Sigma^*$ has state complexity $n$, the state complexity of $L(H)$ is at most $2^{n-2} + 1$ and this bound can be reached in the worst case.

State Complexity of the Crossover Operation
In this section, we will give tight state complexity bounds for the crossover operation [3], which can be thought of as a single step of semi-simple splicing. Mateescu et al. [14] gave an algebraic characterization of the class of languages generated by simple splicing systems based on the crossover operation. A similar characterization for the class of languages generated by semi-simple splicing systems is given by Ceterchi [3].
Then for two languages $L_1, L_2 \subseteq \Sigma^*$, we have The operation M is a variant of the Latin product defined in [8]. Based on M , we define the crossover operation M for $M \subseteq \Sigma \times \Sigma$ and two languages where $\mathrm{pref}(L_1)$ is the set of prefixes of words in $L_1$ and $\mathrm{suff}(L_2)$ is the set of suffixes of words in $L_2$. From this definition, the crossover operation can be viewed as a combination of operations under each of which the regular languages are closed. Therefore, it is easy to see that the regular languages are closed under crossover. Note that by restricting $M$ to pairs $(a, a)$ for $a \in \Sigma$, we get an operation that can be thought of as a single step of simple splicing. The crossover operation, when restricted to pairs of the form $(a, a)$, has some similarities to several operations that have been studied in the literature, such as the chop operation [12] and the word blending operation [5]. In fact, word blending can be seen as a special case of the crossover operation, taking $M = \{(a, a) \mid a \in \Sigma\}$.
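For finite languages the crossover can be computed from prefixes and suffixes directly. The sketch below rests on an assumed reading of the definition, chosen to match the description of crossover as a single step of semi-simple splicing: a prefix $xa$ of a word in $L_1$ is combined with a suffix $by$ of a word in $L_2$ into $xay$ whenever $(a, b) \in M$. The names `prefixes`, `suffixes`, and `crossover` are hypothetical.

```python
def prefixes(language):
    return {w[:i] for w in language for i in range(len(w) + 1)}


def suffixes(language):
    return {w[i:] for w in language for i in range(len(w) + 1)}


def crossover(l1, l2, pairs):
    """{ x a y : x a in pref(l1), b y in suff(l2), (a, b) in pairs }."""
    return {p + s[1:]
            for p in prefixes(l1) if p
            for s in suffixes(l2) if s
            if (p[-1], s[0]) in pairs}


print(crossover({"cab"}, {"dbe"}, {("a", "b")}))
print(sorted(crossover({"aba"}, {"aba"}, {("a", "a")}), key=len))
```

Because the operation factors through prefix and suffix closure, both of which preserve regularity, this decomposition also reflects why the regular languages are closed under crossover.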
We will now give a DFA construction for the crossover of two regular languages.
Proposition 17. Let $A$ and $B$ be two DFAs defined over $\Sigma$ with $m$ and $n$ states, respectively. Then for any $M \subseteq \Sigma \times \Sigma$, there exists a DFA $C$ such that and the transition function $\delta_C$ is defined for $q \in Q_A$, $P \subseteq Q_B$, and $a \in \Sigma$ by $\delta_C(\langle q, P \rangle, a) = \langle q', P' \rangle$, where $q' = \delta_A(q, a)$ and $P' = \operatorname{im}(\delta_B)_b$, if $(a, b) \in M$ and q is not a sink state;

Conclusion
We have given tight bounds for the state complexity of simple and semi-simple splicing systems and the associated crossover operation. In almost all cases, the exponential upper bound was easily reached via splicing systems defined over a fixed-size alphabet with one rule. The exception is simple splicing systems with a finite initial language, where a natural open problem to consider is the worst-case state complexity when the initial languages are defined over alphabets of size between 3 and $2^{n-3}$.
Together, Proposition 17 and Lemma 18 give us the following theorem.
Theorem 19. For regular languages $L_m$ and $L_n$, with $m, n \geq 3$, defined over an alphabet $\Sigma$, with $|\Sigma| \geq 4$, and a subset $M \subseteq \Sigma \times \Sigma$, if $L_m$ has state complexity $m$ and $L_n$ has state complexity $n$, then the crossover of $L_m$ and $L_n$ with respect to $M$ has state complexity at most $m \cdot 2^n$ and this bound can be reached in the worst case.