Graph-Controlled Insertion-Deletion Systems Generating Language Classes Beyond Linearity

A regulated extension of an insertion-deletion system, known as a graph-controlled insertion-deletion (GCID) system, has several components, each containing some insertion-deletion rules. A rule is applied to a string in a component and the resulting string is moved to the target component specified in the rule. When resources are very limited (especially when deletion is context-free), GCID systems are not known to describe the class of recursively enumerable languages (RE). Hence, it becomes interesting to determine the descriptional complexity of such small GCID systems with respect to language classes below RE. To this end, we consider closure classes of linear languages (LIN). We show that whenever GCID systems describe LIN with $t$ components, one more component suffices to describe, for instance, 2-LIN, and a further additional component suffices to describe the rational closure of LIN.


Introduction
The origin of insertion systems lies in linguistics, under the name of semicontextual grammars [6], as well as in biology. In biology, the insertion operation is found in the process of mismatched annealing in DNA strands [14], and in RNA editing, where some fragments of messenger RNA are inserted or deleted [1]. Further motivation for insertion operations can be found in [8]. The deletion operation, on the other hand, was introduced independently in [10]. Insertion and deletion operations together were introduced in [11]; the corresponding grammatical mechanism is called an insertion-deletion system (abbreviated as ins-del system). Informally, insertion means inserting a string $\eta$ between the strings $w_1$ and $w_2$, whereas deletion means deleting a substring $\delta$ from the string $w_1 \delta w_2$.
Among the several variants of ins-del systems (see, e.g., [15]), we focus on graph-controlled ins-del systems (abbreviated as GCID systems). Such a system was introduced in [5], where the concept of components is introduced, associated with insertion or deletion rules. A transition is performed by choosing any applicable rule from the set of rules of the current component and by moving the resulting string to the target component specified in the rule. The descriptional complexity measures are based on the size, denoted by $(k; n, i', i''; m, j', j'')$, where the parameters from left to right denote (i) the number of components $k$, (ii) the maximal length of the insertion string $n$, (iii) the maximal lengths of the left context and right context used in insertion rules, $i'$ and $i''$, respectively, (iv) the maximal length of the deletion string $m$, and (v) the maximal lengths of the left context and right context used in deletion rules, $j'$ and $j''$, respectively. We also refer to the last six numbers in the septuple as the ID size, where ID stands for insertion-deletion.
(Corresponding author. Email: fernau@uni-trier.de)
It is known that the class of linear languages LIN is not closed under concatenation and Kleene closure. Let $\mathcal{L}_{\cdot}(\mathrm{LIN})$ and $\mathcal{L}_{*}(\mathrm{LIN})$ denote the super-classes of LIN closed under concatenation and Kleene closure, respectively. It is shown in [3] that if GCID systems can describe LIN with ID size $s$ and $t$ components, then this can be extended to GCID systems with ID size $s$ and $t+1$ components that describe $\mathcal{L}_{*}(\mathrm{LIN})$; particular cases of GCID systems with ID size $s$ and $t+2$ components describing $\mathcal{L}_{\cdot}(\mathrm{LIN})$ were also reported. In this paper, we generalize these results to show that even the rational or regular closure of LIN (denoted as $\mathcal{L}_{reg}(\mathrm{LIN})$) can be described by GCID systems with ID size $s$ and $t+2$ components. We also show that a subclass of $\mathcal{L}_{reg}(\mathrm{LIN})$, containing the languages that can be described as the concatenation of two languages from $\mathcal{L}_{*}(\mathrm{LIN})$, can be described by GCID systems with ID size $s$ and $t+1$ components. For the first result, we employ a new normal form for $\mathcal{L}_{reg}(\mathrm{LIN})$. Due to space restrictions, many illustrations, examples and proofs have been suppressed.

Preliminaries
We assume that the reader is familiar with the standard notation used in formal language theory. However, we recall a few notions. Let $\mathbb{N}$ denote the set of positive integers, and $[1 \ldots k] = \{i \in \mathbb{N} : 1 \le i \le k\}$. If $\Sigma$ is an alphabet (finite set), then $\Sigma^*$ denotes the free monoid generated by $\Sigma$. The elements of $\Sigma^*$ are called strings or words; $\lambda$ denotes the empty string. For a string $w \in \Sigma^*$, $w^R$ denotes the reversal (mirror image) of $w$. Likewise, $L^R$ and $\mathcal{L}^R$ are understood for languages $L$ and language families $\mathcal{L}$. The families of linear, context-free and recursively enumerable languages are denoted by LIN, CF and RE, respectively.
The language class LIN is closed neither under concatenation nor under Kleene closure. This motivates the study of several so-called closure classes of the linear languages. A detailed study of these closure classes is given in [12].
Let $\mathcal{L}_{op}(\mathcal{F})$ be the smallest language class containing $\mathcal{F}$ and being closed under the operation $op$. Since LIN is not closed under concatenation and Kleene closure, the closure classes $\mathcal{L}_{\cdot}(\mathrm{LIN})$ and $\mathcal{L}_{*}(\mathrm{LIN})$ are strict supersets of LIN.

The class L
we arrive at the class $k$-LIN, a subclass of $\mathcal{L}_{\cdot}(\mathrm{LIN})$. In other words, $\mathcal{L}_{\cdot}(\mathrm{LIN}) = \bigcup_{k \ge 1} k\text{-LIN}$, and LIN = 1-LIN by definition. Similarly, if $L \in \mathcal{L}_{*}(\mathrm{LIN})$, then either $L \in \mathrm{LIN}$ or $L = (L')^*$ for some linear language $L'$. It is well known that $\mathcal{L}_{*}(\mathrm{LIN})$ and $\mathcal{L}_{\cdot}(\mathrm{LIN})$ are not closed under concatenation and Kleene closure, respectively; see [12]. The class is also considered as an extension of $\mathcal{L}_{*}(\mathrm{LIN})$ in [12]. It has a nice characterization in terms of pushdown automata with finite turns. Continuing to play around with the concatenation and Kleene closure operators and extending our notation to lists of operators, we have $\mathcal{L}_{\cdot,*}(\mathrm{LIN})$, the smallest language family containing LIN and being closed under concatenation and Kleene closure. Recall that $\mathcal{L}_{reg}(\mathrm{LIN})$ is the smallest language family that contains LIN and is closed under the three regular operators: union, concatenation and Kleene closure. In our notation, this corresponds to $\mathcal{L}_{\cup,\cdot,*}(\mathrm{LIN})$.
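The closure operations discussed above can be made concrete with membership tests. The following is a minimal sketch, assuming each linear language is given by a membership predicate; the names `in_anbn`, `in_concat` and `in_star` are hypothetical helpers introduced only for illustration. It decides membership in a $k$-fold concatenation by trying all split points, and membership in a Kleene closure by dynamic programming over prefixes.

```python
from typing import Callable, List

Member = Callable[[str], bool]


def in_anbn(w: str) -> bool:
    """Membership in the linear language {a^n b^n : n >= 0} (illustrative choice)."""
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n


def in_concat(w: str, parts: List[Member]) -> bool:
    """Membership in L_1 L_2 ... L_k, given a membership test for each L_i."""
    if not parts:
        return w == ""
    head, rest = parts[0], parts[1:]
    # Try every split point: the prefix must lie in L_1, the suffix in L_2 ... L_k.
    return any(head(w[:cut]) and in_concat(w[cut:], rest)
               for cut in range(len(w) + 1))


def in_star(w: str, member: Member) -> bool:
    """Membership in L^*, by dynamic programming over the prefixes of w."""
    ok = [True] + [False] * len(w)  # ok[e] is True iff w[:e] lies in L^*
    for end in range(1, len(w) + 1):
        ok[end] = any(ok[start] and member(w[start:end]) for start in range(end))
    return ok[len(w)]
```

For instance, `in_concat(w, [in_anbn, in_anbn])` tests membership in a language that belongs to 2-LIN by construction, and `in_star(w, in_anbn)` tests membership in a language from $\mathcal{L}_{*}(\mathrm{LIN})$.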

Graph-Controlled Insertion-Deletion Systems
We define graph-controlled insertion-deletion systems following [5].
Definition 1. A graph-controlled insertion-deletion system (GCID system for short) with $k$ components is a construct $\Pi = (k, V, T, A, H, i_0, i_f, R)$, where $k$ is the number of components, $V$ is an alphabet, $T \subseteq V$ is the terminal alphabet and $V \setminus T$ is the non-terminal alphabet, $A \subseteq V^*$ is a finite set of axioms, $H$ is a set of labels associated (in a one-to-one manner) to the rules in $R$, $i_0 \in [1 \ldots k]$ is the initial component, $i_f \in [1 \ldots k]$ is the final component, and $R$ is a finite set of rules of the form $(i, r, j)$, where $r$ is an insertion rule of the form $(u, \eta, v)_{ins}$ or a deletion rule of the form $(u, \delta, v)_{del}$, with $i, j \in [1 \ldots k]$. We say that a GCID system handles terminals properly if terminal symbols are only inserted in non-empty contexts containing non-terminals and never get deleted.
An insertion rule of the form $(u, \eta, v)_{ins}$ means that the string $\eta$ is inserted between $u$ and $v$; it corresponds to the rewriting rule $uv \to u\eta v$. Similarly, a deletion rule of the form $(u, \delta, v)_{del}$ means that the string $\delta$ is deleted between $u$ and $v$; this corresponds to the rewriting rule $u\delta v \to uv$. The pair $(u, v)$ is called the context, $\eta$ is called the insertion string, $\delta$ is called the deletion string and $x \in A$ is called an axiom. A rule of the form $l : (i, r, j)$, where $l \in H$ is the label associated to the rule, denotes that the string is sent from component $i$ (for short denoted as Ci) to Cj after the application of the insertion or deletion rule $r$ on the string. If the initial component itself is the final component, then we call the system a returning GCID system.
A graph-controlled ins-del system $\Pi$ is said to be of size $(k; n, i', i''; m, j', j'')$ if $k$ is the number of components and
$n = \max\{|\eta| : (i, (u, \eta, v)_{ins}, j) \in R\}$, $m = \max\{|\delta| : (i, (u, \delta, v)_{del}, j) \in R\}$,
$i' = \max\{|u| : (i, (u, \eta, v)_{ins}, j) \in R\}$, $j' = \max\{|u| : (i, (u, \delta, v)_{del}, j) \in R\}$,
$i'' = \max\{|v| : (i, (u, \eta, v)_{ins}, j) \in R\}$, $j'' = \max\{|v| : (i, (u, \delta, v)_{del}, j) \in R\}$.
In general, we follow the convention of using rule label names that carry some meaning. For instance, if we want to describe the simulation of a rule $p$, then this is usually done by several rules in several components, so that pi.j refers to the $j$th simulation rule in component Ci.
The underlying control graph of a $k$-GCID system $\Pi$ is defined to be a graph with $k$ nodes labelled C1 through Ck. There exists a directed edge from Ci to Cj if and only if there exists a rule of the form $(i, r, j)$ in $R$ of $\Pi$. We also associate a simple undirected graph on $k$ nodes to a GCID system of $k$ components as follows: there is an undirected edge between the nodes Ci and Cj ($i \ne j$) if and only if there exists a rule of the form $(i, r_1, j)$ or $(j, r_2, i)$ in $R$ of $\Pi$. If this underlying undirected simple graph is a tree, then we call a returning GCID system tree-structured. The language class generated by returning GCID systems of size $s$ is denoted by GCID($s$).
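The definitions above can be sketched in code. The following is a minimal illustration, not the formalism of [5] itself: it assumes that every symbol of $V$ is encoded as a single Python character, and the names `Rule`, `apply_rule`, `size` and `control_graph_is_tree` are hypothetical. It shows one rule application, the size septuple computed from a rule set, and the tree test on the underlying undirected graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: int   # component i in which the rule is applied
    kind: str     # "ins" or "del"
    left: str     # left context u
    middle: str   # insertion string eta or deletion string delta
    right: str    # right context v
    target: int   # component j that receives the resulting string


def apply_rule(rule, word):
    """Return all strings reachable from `word` by one application of `rule`."""
    results = set()
    # An insertion needs the context uv; a deletion needs u delta v.
    pattern = rule.left + (rule.middle if rule.kind == "del" else "") + rule.right
    for pos in range(len(word) - len(pattern) + 1):
        if not word.startswith(pattern, pos):
            continue
        cut = pos + len(rule.left)
        if rule.kind == "ins":
            results.add(word[:cut] + rule.middle + word[cut:])       # uv -> u eta v
        else:
            results.add(word[:cut] + word[cut + len(rule.middle):])  # u delta v -> uv
    return results


def size(k, rules):
    """The septuple (k; n, i', i''; m, j', j'') of a rule set with k components."""
    def longest(kind, attr):
        return max((len(getattr(r, attr)) for r in rules if r.kind == kind), default=0)
    return (k,
            longest("ins", "middle"), longest("ins", "left"), longest("ins", "right"),
            longest("del", "middle"), longest("del", "left"), longest("del", "right"))


def control_graph_is_tree(k, rules):
    """Check whether the underlying undirected simple graph on C1..Ck is a tree."""
    edges = {frozenset((r.source, r.target)) for r in rules if r.source != r.target}
    if len(edges) != k - 1:
        return False
    adjacent = {c: set() for c in range(1, k + 1)}
    for edge in edges:
        a, b = tuple(edge)
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, stack = {1}, [1]
    while stack:  # depth-first search from C1
        for nxt in adjacent[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == k
```

A rule with empty contexts, such as `Rule(1, "ins", "", "x", "", 1)`, is context-free and may insert anywhere in the string, which is why `apply_rule` returns a set of results rather than a single string.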

Properties of Closure Classes
In this section, we show some auxiliary results needed to describe the closure classes by GCID systems, and then provide a characterization of the rational closure of LIN. This characterization follows directly from a normal form for regular expressions, which states that each regular expression can be expressed as a finite union of union-free expressions; see [13, Theorem 2].
Proof. The positive closure properties follow in a straightforward inductive way from what is known about LIN and some algebraic identities.
Suppose $\mathcal{L}$ is closed under reversal. Then ( R is not a linear language. Discuss some $d^j w \in L_3$, where $w$ does not start with $d$.
Proposition 3. [12] The following inclusions are true. Moreover, all are strict.
Proposition 4. [12] The following pairs of language classes are incomparable.
The following proposition follows directly from Theorem 2 of [13].
Let us now consider a small example that illustrates this proposition. Consider the language $L$ described as follows.
for linear languages $L_1, L_2, L_3 \subseteq T^*$. Then, we find the following representation:
Due to the previous proposition, we can now focus on expressions that have only concatenation and Kleene star as operations and whose basic elements are linear languages. Recall the well-known equivalence between expressions and (expression) trees, which we use in the following. So, the term subexpression corresponds to a subtree. In this sense, leaf labels can be subexpressions. Also, we consider Kleene star as a unary operation, whereas concatenation can take any arity of at least two. This allows us to assume that stars and concatenations always alternate on any path in the expression tree.
Fig. 1. The expression tree of our example; dotted lines indicate continuation points by joining leaves $i$ and $j$ if $j \in \mathrm{cont}(i)$, suppressing the direction information.
In order to describe our grammar constructions that show how to generate all languages from the regular closure of LIN by appropriate GCID systems, we need to specify which of the linear grammars (associated to the leaves of the expression trees) should be simulated 'next', i.e., after finishing with the simulation of the 'current' grammar.This is formalized in the following with the notion of continuation points, reminiscent of the Glushkov transformation [7].
Assume that $t$ is an expression tree with inner nodes labeled $*$ or $\cdot$, and with leaves labeled with numbers from $[1 \ldots k]$. For $i \in [1 \ldots k]$, we define the set of continuation points $\mathrm{cont}(i) \subseteq [1 \ldots k+1]$ as follows. Here, let $\mathrm{subex}(i)$ denote the smallest subexpression to which $i$ belongs, and let $r(i)$ be the root label of $\mathrm{subex}(i)$. Moreover, let $\mathrm{range}(i)$ be the subinterval of $[1 \ldots k]$ that spans from the first to the last leaf label of $\mathrm{subex}(i)$. Slightly abusing notation, we also write $\mathrm{range}(n)$ for the subinterval of $[1 \ldots k]$ that spans from the first to the last leaf label of the subexpression rooted at some inner node $n$. Hence, $\mathrm{subex}(i) = \mathrm{subex}(r(i))$. Inductively, we set $\mathrm{subex}_1(i) = \mathrm{subex}(i)$, $r_1(i) = r(i)$ and $\mathrm{range}_1(i) = \mathrm{range}(i)$, as well as $r_j(i) = p(r_{j-1}(i))$, where $p$ is the parent function, $\mathrm{subex}_j(i)$ is the subexpression rooted at $r_j(i)$, and $\mathrm{range}_j(i)$ is the subinterval of $[1 \ldots k]$ that spans from the first to the last leaf label of $\mathrm{subex}_j(i)$. We also refer to $j$ as the level. Clearly, at some point $p(r_{j-1}(i))$ is no longer defined, as specified by the height $h(i)$. In the following, let $j \le h(i)$.
- If $r_j(i) = *$ and $i = \max(\mathrm{range}_j(i))$, then $\min(\mathrm{range}_j(i))$ belongs to $\mathrm{cont}(i)$.
- If $r_j(i) = \cdot$ and $i < \max(\mathrm{range}_j(i))$ and either (a) $j = 1$ or (b) $r_{j-1}(i) = *$ and $i = \max(\mathrm{range}_{j-1}(i))$, then let $s_1, \ldots, s_q$ be all the right siblings of (a) $i$ or (b) the root of $\mathrm{subex}_{j-1}(i)$, respectively, such that the labels of $s_1, \ldots, s_q$ are all $*$ except for that of $s_q$, which is $\cdot$, or $s_q$ is a leaf; then, $\min(\mathrm{range}(s_o))$ belongs to $\mathrm{cont}(i)$ for all $1 \le o \le q$. As a special case, if there is no $s_q$ with label $\cdot$ (because all right siblings carry stars), then we have to continue from the beginning, with the left siblings, again from left to right, until we hit the first $s_q$ with label $\cdot$.
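Since the continuation points are described as reminiscent of the Glushkov transformation, the classical follow-set computation may help to build intuition. The sketch below implements that standard construction, not the level-based definition given above, and the two need not coincide on every tree. It assumes a hypothetical encoding where a leaf is an integer in $[1 \ldots k]$, a star node is `("star", child)` and a concatenation node is `("cat", [children])`; leaves are treated as non-skippable, key 0 stands for the start, and $k+1$ marks the end.

```python
def nullable(t):
    """Only starred subexpressions (and concatenations of them) can be skipped."""
    if isinstance(t, int):
        return False
    if t[0] == "star":
        return True
    return all(nullable(child) for child in t[1])


def first(t):
    """Leaves at which a word of the subexpression t may begin."""
    if isinstance(t, int):
        return {t}
    if t[0] == "star":
        return first(t[1])
    out = set()
    for child in t[1]:
        out |= first(child)
        if not nullable(child):
            break
    return out


def last(t):
    """Leaves at which a word of the subexpression t may end."""
    if isinstance(t, int):
        return {t}
    if t[0] == "star":
        return last(t[1])
    out = set()
    for child in reversed(t[1]):
        out |= last(child)
        if not nullable(child):
            break
    return out


def follow_sets(t, k):
    """Glushkov-style follow sets; key 0 is the start, value k + 1 is the end."""
    follow = {i: set() for i in range(k + 1)}

    def visit(node):
        if isinstance(node, int):
            return
        if node[0] == "star":
            visit(node[1])
            for i in last(node[1]):          # loop back to the start of the star
                follow[i] |= first(node[1])
            return
        children = node[1]
        for child in children:
            visit(child)
        for idx, left in enumerate(children):
            for right in children[idx + 1:]:  # skip over starred right siblings
                for i in last(left):
                    follow[i] |= first(right)
                if not nullable(right):
                    break

    visit(t)
    follow[0] |= first(t)
    if nullable(t):
        follow[0].add(k + 1)
    for i in last(t):
        follow[i].add(k + 1)
    return follow
```

For the tree of $(L_1 L_2)^* L_3$, encoded as `("cat", [("star", ("cat", [1, 2])), 3])`, leaf 2 may be followed by leaf 1 (repeating the star) or by leaf 3 (leaving it), which matches what the two rules above yield for this small tree.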
Look again at our example to illustrate this definition, calling the 16 linear languages occurring in the leaves of the expression in Eq. (1) $L_1, \ldots, L_{16}$ from left to right; also cf. Fig. 1. For instance, for $i = 1$, we have $r_1(1) = \cdot$, as the parent operation of [...]. For the computation of the continuation points, only $r_1(1) = \cdot$ is important and yields 2. The case $i = 4$ is more interesting. Again, we have $r_1(4) = \cdot$, $r_2(4) = *$, $r_3(4) = \cdot$ and $r_4(4) = *$, with $h(4) = 4$. However, the level $j = 1$ is no longer of interest, rather $j = 2$, which puts $1 = \min(\mathrm{range}_2(4))$ into $\mathrm{cont}(4)$. Moreover, considering the level $j = 3$, we get the first elements of the ranges of the siblings into the set of continuation points, which are $5 = \min(\mathrm{range}(p(p(5))))$, as $p(p(5))$ describes the right sibling of $p(p(4))$, $9 = \min(\mathrm{range}(p(p(9))))$, as $p(p(9))$ describes the right sibling of $p(p(5))$, as well as 13. The next interesting case happens if $i = 8$, as we now have to continue looking at siblings from the beginning. Finally, with $i = 9$, we see something interesting with $j = 1$, as now the starred subexpression $L_{10}^* = (L_3)^*$ that follows $L_9 = L_1$ can be skipped.
Proposition 6. Let $L \subseteq T^*$. If $L \in \mathcal{L}_{\cdot,*}(\mathrm{LIN})$, given by some expression tree $t$, then there is a context-free grammar $G = (N, T, S_1, P)$ with $L(G) = L$, together with some integer $k \ge 1$ counting the leaves of $t$, satisfying the following:
- $N$ is partitioned into $N_0, N_1, \ldots, N_k$, where for each $i = 1, \ldots, k$, $S_i \in N_i$;
- $N_0 = \{S_1, S_2, S_3, \ldots, S_k, S_{k+1}\}$;
- $P$ can be partitioned into $P_0, P_1, \ldots, P_k$ such that $G_i = (N_i, T, S_i, P_i)$ forms a linear grammar for each $i = 1, \ldots, k$;
If the continuation points satisfy $\mathrm{cont}(i) = \{i + 1\}$ for all $1 \le i \le k$, then this gives a characterization of the language class $\mathcal{L}_{\cdot}(\mathrm{LIN})$.
In order to simplify some of our main results in the following sections, the following observations from [4] are helpful.
Proposition 7. [4] Let $\mathcal{L}$ be a language class that is closed under reversal and let $k, n, i', i'', m, j', j''$ be non-negative integers. The following statements are true.

Describing Closure Classes of Linear Languages
Initially, our main objective was to find how far beyond LIN GCID systems (of the four sizes stated in Proposition 1) can lead us. However, we then succeeded in providing a general result showing that if there exist $\mathrm{GCID}_{SD(T)}$ systems of ID size $(n, i', i''; m, j', j'')$ describing LIN, then these constructions can be extended to $\mathrm{GCID}_{SD(T)}$ systems of the same ID size, at the expense of two more components, to describe $\mathcal{L}_{reg}(\mathrm{LIN})$. Unfortunately, we were not able to describe even CF with GCID systems of these four sizes; this question is left open.
Describing $\mathcal{L}_{reg}(\mathrm{LIN})$ by GCID systems is a rather immediate consequence of Proposition 6. Here, we slightly extend the notion $\mathrm{cont}(i)$ once more to the case $i = 0$. This is of interest when $r_{h(1)}(1) = *$, as it allows us to skip, for instance, to the position $k + 1$ in order to easily incorporate the empty word.
Proof. Let $L \in \mathcal{L}_{\cup,\cdot,*}(\mathrm{LIN})$ for some $L \subseteq T^*$. By Proposition 5, we can assume that $L$ is the finite union of $k$ languages from $\mathcal{L}_{\cdot,*}(\mathrm{LIN})$. We first show how to simulate context-free grammars that are as in Proposition 6, using $\mathrm{GCID}_X(t + 2; n, i', i''; m, j', j'')$ for languages from $\mathcal{L}_{\cdot,*}(\mathrm{LIN})$. By using disjoint nonterminal alphabets, we get a GCID system for the finite union of such languages as well, because we can assume that the constituent systems handle terminals properly.
Since $\mathrm{LIN} \subseteq \mathrm{GCID}_{SD(T)}(t; n, i', i''; m, j', j'')$, each $G_i$ can be simulated by a simple-deleting GCID system $\Pi_i = (t, V_i, T, \{S_i\}, H_i, 1, 1, R_i)$ for $1 \le i \le k$, each of size $(t; n, i', i''; m, j', j'')$. We assume, without loss of generality, that [...]. Let us first consider the case $i' \ge 1$ and $i'' = 0$. We construct a GCID system $\Pi$ for $G$ as follows: [...]
- $R$ is the set with the following rules: for each $1 \le i \le k$ and $c \in \mathrm{cont}(i)$,
$r_i(t + 1).1 : (t + 1, (S_i, S_i, \lambda)_{ins}, t + 2)$,
$r_i(t + 1).(c + 1) : (t + 1, (S_i, S_c, \lambda)_{ins}, 1)$,
$r_i(t + 2).1 : (t + 2, (\lambda, S_i, \lambda)_{del}, t + 1)$;
further, $r_{k+1}(t + 1).1 : (t + 1, (\lambda, S_{k+1}, \lambda)_{del}, 1)$.
Since $L_i = L(G_i)$ is generated by $\Pi_i$, respectively, for $1 \le i \le k$, the linear rules of $G_i$ are simulated by rules of $R_i$ in the first $t$ components, and there is no interference between rules of different systems $\Pi_i$ and $\Pi_j$, since [...]. We start with the axiom $S_0 S_c$ for some $c \in \mathrm{cont}(0)$. $S_0$ is deleted and in C(t + 1) and for some $d \in \mathrm{cont}(c)$. Now, a string $w_c \in L_c$ is produced by simulating $G_c$ in the first $t$ components of the system $\Pi$. In general, the simulation goes from left to right. When the string $w_c \in L_c$ is produced, the terminating rule of $L_c$, namely $h_c.1$, takes the string to component $t + 1$, where we either arrive in configuration $(w_c S_d)_{t+1}$, and the simulation continues with producing a word according to $G_d$, etc. The whole process ends on applying the rule $r_{k+1}(t + 1).1 : (t + 1, (\lambda, S_{k+1}, \lambda)_{del}, 1)$, which deletes the nonterminal $S_{k+1}$.
Conversely, any derivation within $\Pi$ can be split into phases, where each linear phase starts and ends in the first component with a string that starts with a terminal string, followed by $S_i S_c$ for some $c \in \mathrm{cont}(i)$ in the beginning, and by some $X_i v S_c$ at the end of this phase, where $v$ is some terminal string. Now, on applying $h_i 1.1$, $X_i$ gets deleted and the transition phase is initiated, moving a string starting with a terminal string and ending with some $S_c$ into C(t + 1). Now, apart from the special case when $S_{k+1}$ is the last symbol of the string, by applying rules from C(t + 1) or C(t + 2), some string is moved back to C1 that satisfies the conditions stated for the beginning of a linear phase. It is now clear that this alternation of linear and transition phases corresponds to generating words from $L$ from left to right, following some concrete instantiation of the expression tree. The case when $i' = 0$ and $i'' \ge 1$ follows from Propositions 2 and 7.
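The alternation of linear and transition phases can be illustrated at an abstract level, ignoring the individual ins-del rules. The sketch below assumes that the continuation points are given as a dictionary `cont` (with key 0 for the start and $k + 1$ as the end marker) and that each linear language is replaced by a finite sample of its words; `leaf_sequences` and `sample_words` are hypothetical names, and the bound `max_len` only serves to keep the enumeration finite.

```python
from itertools import product


def leaf_sequences(cont, k, max_len):
    """Leaf sequences c_1 ... c_m (m <= max_len) with c_1 in cont[0],
    c_{j+1} in cont[c_j], and k + 1 in cont[c_m]."""
    results = set()

    def extend(seq, current):
        for nxt in cont[current]:
            if nxt == k + 1:
                results.add(tuple(seq))           # derivation may stop here
            elif len(seq) < max_len:
                extend(seq + [nxt], nxt)          # next linear phase simulates G_nxt

    extend([], 0)
    return results


def sample_words(cont, k, samples, max_len):
    """Concatenate one sample word per linear phase, from left to right."""
    words = set()
    for seq in leaf_sequences(cont, k, max_len):
        for choice in product(*(samples[c] for c in seq)):
            words.add("".join(choice))
    return words
```

With `cont = {0: {1, 3}, 1: {2}, 2: {1, 3}, 3: {4}}` and `k = 3`, which is one plausible table for an expression of the shape $(L_1 L_2)^* L_3$, the admissible leaf sequences are exactly the repetitions of 1, 2 followed by a final 3.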

Reducing components for certain closure classes
In this section, we show that with GCID systems of ID size $s$ and $t + 1$ components we can describe [...]. We prove the next theorem by providing three different simulations of its three subsets stated above, in the subsequent theorems.
Starting with the axiom $S_1 \#$ and using rules of $R_1$, a string $w_1 \in L_1$ is produced first, with $h_1 1.1$ being the last rule applied. This leads to $w_1 \#$ in C(t + 1). The only rule in C(t + 1) is then applied, which inserts $S_2$ after $\#$ and moves the string back to C1. Continuing with $w_1 \# S_2$, a string $w_2 \in L(G_2)$ is generated, reaching the configuration $(w_1 \# w_2)_1$, where $\#$ is deleted by r1.1. If r1.1 is applied before $h_1 1.1$, then the string is stuck at C(t + 1), which is not the target component.
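The role of the marker $\#$ in this derivation can be traced with plain string operations. The following sketch is a strong simplification and rests on stated assumptions: each linear phase is collapsed into a single replacement step, the characters `X` and `Y` stand in for the nonterminals $S_1$ and $S_2$, and the words `w1` and `w2` are assumed not to contain `X`, `Y` or `#`. The function name `trace_concatenation` is hypothetical.

```python
def trace_concatenation(w1, w2):
    """Sentential forms of the marker protocol for generating w1 w2."""
    trace = ["X#"]                                  # axiom S_1 #
    trace.append(trace[-1].replace("X", w1, 1))     # linear phase for G_1, ends in C(t+1)
    trace.append(trace[-1].replace("#", "#Y", 1))   # C(t+1): insert S_2 right of '#'
    trace.append(trace[-1].replace("Y", w2, 1))     # linear phase for G_2 in C1..Ct
    trace.append(trace[-1].replace("#", "", 1))     # finally delete the marker '#'
    return trace
```

The marker fixes the position where the second nonterminal is inserted, so the left context of the insertion rule in C(t + 1) has length one regardless of what $w_1$ looks like.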
Since 2-LIN is closed under reversal (due to Proposition 2), the case when $i' = 0$, $i'' \ge 1$ follows from Proposition 7.
Theorem 4. For all integers $t, n, m \ge 1$ and $i', i'', j', j'' \ge 0$ with $i' + i'' \ge 1$ and $X \in \{SD, SDT\}$, if $\mathrm{LIN} \subseteq \mathrm{GCID}_X(t; n, i', i''; m, j', j'')$ was shown by a simple-deleting simulation, then $\mathcal{L} \cup \mathcal{L}^R \subseteq \mathrm{GCID}_X(t + 1; n, i', i''; m, j', j'')$.
The rules in C(t + 1) initiate the simulation of the rules of $G_1$ by inserting $S_1$ after $\#$; thereafter, continuing with $\# S_1 w_2$ from C1, the configuration $(\# w_1 w_2)_{t+1}$ is reached, with $w_1 \in L_1^*$ and $w_2 \in L_2$. Now, if r(t+1).1 is applied, the simulation of $G_1$ is restarted, and after generating $w_1 \in L(G_1)$ the desired number of times, the whole derivation stops. With this observation, we conclude that $\Pi_4$ generates $(L(G_1))^* L(G_2) \in \mathcal{L}$.
Consider the case when $i' = 1$, but we want to prove the inclusion for $\mathcal{L}^R$. We aim at constructing a GCID system $\Pi_4$ for $L_2 L_1^*$. The simulation is identical to the one just presented except for the axiom, which is now $S_2 \#$.
The case when $i' = 0$ and $i'' \ge 1$ follows from the fact that $\mathcal{L} \cup \mathcal{L}^R$ is closed under reversal and from Propositions 2 and 7.
The following proof is a simple extension of Theorem 4. Hence, we give the simulating rules and refrain from explaining the working.
Proof. For $i' \ge 1$ and $i'' = 0$, we construct a GCID system [...]. The case when $i' = 0$ and $i'' \ge 1$ follows from Propositions 2 and 7.
Remark 1. The proof of Theorem 5 can be extended to describe $\{L_1^* L_2^* \cdots L_k^* : L_i \in \mathrm{LIN} \text{ for } 1 \le i \le k\}$. Consider the GCID system $\Pi$ as in Theorem 5 with the alphabet and label set extended from 2 to $k$. Let the axiom be $\#_1 \#_2 \cdots \#_k$. The rules of R, R ∈ $\Pi_3$ are similarly extended to $k$ rules and there are $2k$ rules in R. This shows that $L_1^* L_2^* \cdots L_k^* \in \mathrm{GCID}(t + 1; n, i', i''; m, j', j'')$ under the assumptions of Theorem 5. Since there is no control on the number of applications of the rule $r_i(t + 1).1 : (t + 1, (\#_i, S_i, \lambda)_{ins}, 1)$, we cannot enforce that it is applied exactly once; hence $L_1 L_2$ or $L_1^* L_2$ or $L_1 L_2^*$ alone cannot be produced this way.

Summary and Future Challenges
Up to the present, most of the research on the descriptional complexity of (graph-controlled) insertion-deletion systems has been about limiting the resources in such a way that the systems are still able to describe all recursively enumerable languages. Although we do not have a proof showing that the borderline that we reached is optimal, it might be an idea to look into smaller language classes now. One natural question is to ask with which resources we can still describe all context-free languages. While all results collected in this paper show, in particular, that all linear languages can be described by the corresponding resources, we pose it as a challenge to come up with non-trivial simulations of context-free grammars.
In this paper, we tried to bridge the gap between linear and context-free languages as well as possible. Our main technical contribution is to describe these simulations in quite a general fashion, so that we need not give similar simulations for each specific case of sizes of the systems.
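(i) 2-LIN and $\mathcal{L}_{*}(\mathrm{LIN})$, (ii) 2-LIN and $\mathcal{L} \cup \mathcal{L}^R$, (iii) $\mathcal{L}_{\cdot}(\mathrm{LIN})$ and $\mathcal{L}_{*}(\mathrm{LIN})$, (iv) $\mathcal{L}_{\cdot}(\mathrm{LIN})$ and $\mathcal{L} \cup \mathcal{L}^R$.
[Figure: inclusion diagram over the classes LIN, 2-LIN, $\mathcal{L}_{*}(\mathrm{LIN})$, $\mathcal{L}_{\cdot}(\mathrm{LIN})$, $\mathcal{L} \cup \mathcal{L}^R$, SMLIN, MSLIN, $\mathcal{L}_{reg}(\mathrm{LIN})$, CF.]
The inter-relationship between the closure classes of LIN stated in Propositions 3 and 4 is shown on the left. A (path of) solid arrow(s) from A to B indicates $A \subsetneq B$, and the absence of an arrowed path between A and B means that A and B are incomparable. We also add MSLIN = $\mathcal{L}_{\cdot}(\mathcal{L}_{*}(\mathrm{LIN}))$ and SMLIN = $\mathcal{L}_{*}(\mathcal{L}_{\cdot}(\mathrm{LIN}))$.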

Proposition 5. Let $L \subseteq T^*$. Then $L \in \mathcal{L}_{reg}(\mathrm{LIN})$ if and only if $L$ is the finite union of languages from $\mathcal{L}_{\cdot,*}(\mathrm{LIN})$.
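The normal form behind Proposition 5 can be sketched as a rewriting of expression trees. The code below is an illustration under stated assumptions, not the construction of [13]: it uses the standard identities that concatenation distributes over union and that $(A \cup B)^* = (A^* B^*)^*$, and it encodes expressions as tuples `("leaf", name)`, `("union", [...])`, `("cat", [...])` and `("star", e)`, a hypothetical encoding chosen for this example. The helper `words` evaluates an expression up to a length bound, with finite sample languages at the leaves, so that the decomposition can be checked on small instances.

```python
from itertools import product


def union_free(e):
    """Return union-free expressions whose union is equivalent to e."""
    kind = e[0]
    if kind == "leaf":
        return [e]
    if kind == "union":
        return [u for child in e[1] for u in union_free(child)]
    if kind == "cat":
        # Concatenation distributes over union: take one alternative per factor.
        parts = [union_free(child) for child in e[1]]
        return [("cat", list(choice)) for choice in product(*parts)]
    # Star: (E_1 + ... + E_n)^* = (E_1^* ... E_n^*)^*
    parts = union_free(e[1])
    if len(parts) == 1:
        return [("star", parts[0])]
    return [("star", ("cat", [("star", p) for p in parts]))]


def has_union(e):
    """Check whether a union node occurs anywhere in the expression."""
    if e[0] == "leaf":
        return False
    if e[0] == "star":
        return has_union(e[1])
    return e[0] == "union" or any(has_union(child) for child in e[1])


def words(e, env, n):
    """All words of length at most n described by e; env maps leaf names to finite sets."""
    kind = e[0]
    if kind == "leaf":
        return {w for w in env[e[1]] if len(w) <= n}
    if kind == "union":
        return set().union(*(words(child, env, n) for child in e[1]))
    if kind == "cat":
        acc = {""}
        for child in e[1]:
            part = words(child, env, n)
            acc = {x + y for x in acc for y in part if len(x + y) <= n}
        return acc
    base = words(e[1], env, n)
    acc, frontier = {""}, {""}
    while frontier:  # iterate until no new word of length <= n appears
        frontier = {x + y for x in frontier for y in base if len(x + y) <= n} - acc
        acc |= frontier
    return acc
```

For an expression such as $(A \cup B)(A \cup C)^*$, `union_free` returns two union-free expressions, one starting with $A$ and one starting with $B$, each followed by $(A^* C^*)^*$.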
3. The following table lists the continuation points.