Bounded Repairability for Regular Tree Languages

We study the problem of bounded repairability of a given restriction tree language R into a target tree language T. More precisely, we say that R is bounded repairable with respect to T if there exists a bound on the number of standard tree editing operations necessary to apply to any tree in R to obtain a tree in T. We consider a number of possible specifications for tree languages: bottom-up tree automata (on curry encoding of unranked trees) that capture the class of XML schemas and document type definitions (DTDs). We also consider a special case when the restriction language R is universal (i.e., contains all trees over a given alphabet). We give an effective characterization of bounded repairability between pairs of tree languages represented with automata. This characterization introduces two tools—synopsis trees and a coverage relation between them—allowing one to reason about tree languages that undergo a bounded number of editing operations. We then employ this characterization to provide upper bounds to the complexity of deciding bounded repairability and show that these bounds are tight. In particular, when the input tree languages are specified with arbitrary bottom-up automata, the problem is coNExp-complete. The problem remains coNExp-complete even if we use deterministic nonrecursive DTDs to specify the input languages. The complexity of the problem can be reduced if we assume that the alphabet, the set of node labels, is fixed: the problem becomes PSpace-complete for nonrecursive DTDs and coNP-complete for deterministic nonrecursive DTDs. Finally, when the restriction tree language R is universal, we show that the bounded repairability problem becomes Exp-complete if the target language is specified by an arbitrary bottom-up tree automaton and becomes tractable (P-complete, in fact) when a deterministic bottom-up automaton is used.


INTRODUCTION
A basic problem in data management is to ensure that data is valid, namely that it satisfies all integrity constraints associated with a schema [Bertossi 2011]. Validation of data with respect to a schema is crucial in any database system: if data does not satisfy the integrity constraints, then one cannot guarantee that the output produced by the system is correct. Nevertheless, when data does not satisfy constraints, a natural approach is to attempt a repair, that is, to modify the data minimally so that it becomes valid [Arenas et al. 1999; Afrati and Kolaitis 2009]. We may want to perform this transformation on the data, or we may be merely interested in knowing how difficult it would be to perform it if needed, that is, in determining how far a given collection of data is from satisfying the specification. For example, some applications may retain input data even when it contains a few errors, where "few" could be interpreted as a user-defined bound on the total number of errors or on the fraction of errors over the size of the input [Grahne and Thomo 2004].
On relational data, this problem has been extensively studied under the notion of constraint repair (e.g., see Arenas et al. [1999] and Afrati and Kolaitis [2009]): in this case, the specifications are given by relational integrity constraints, such as keys and foreign keys, and the problem asks to determine how much a database needs to be modified to satisfy a given constraint. This approach has been investigated for a variety of integrity constraints, starting with classical functional and inclusion dependencies [Arenas et al. 1999] and continuing with more expressive constraints such as tuple generating dependencies [Afrati and Kolaitis 2009]. In addition, several different repair operators have been considered in the relational case, including insertions, deletions, and modifications of tuples. Besides finding repairs of relations, this line of research also focuses on querying inconsistent documents via their minimal repairs.
In the XML framework, malformed or nonconformant documents are more the rule than the exception [Chen et al. 2005; Ofuonye et al. 2010]. Indeed, a recent study [Grijzenhout and Marx 2013] shows that although most XML documents are well formed (more than 85%), only 25% of them reference a downloaded schema. Even worse, this study shows that less than 10% of XML documents satisfy their downloaded schema. This means that most of the XML data on the Web can be read, but only a small part of it can be processed automatically. In this scenario, it is natural to look for ways of automatically repairing XML data with respect to a target schema. The idea is that an automatic repair process receives an invalid XML document and produces the best sequence of edit operations that results in a document satisfying the target schema. The edit operations should respect the nested structure of the XML document and modify the document in a minimal way.
The notion of repair for XML data is defined in a natural way when considering documents as trees: in this case, a repair can be simply defined as the tree edit distance [Tai 1979] between the input tree and the repaired tree, that is, the number of atomic edit operations that are needed to get from one tree to another. An atomic edit operation here amounts to inserting, deleting, or modifying a single node in a tree [Bille 2005]. Edit distance can then be lifted to a measure of distance dist(t, T) of a tree t from a specification T: this is nothing but the minimal distance of t to any tree satisfying T. In our setting, t can be seen as an XML document and T as an XML Schema, and hence dist(t, T) measures how difficult it is to repair the data t so as to satisfy T. Furthermore, dist(t, T) can be computed efficiently when T is a regular language (e.g., document type definition (DTD) or XML schema definition) specified by means of an automaton [Wagner 1974; Aho and Peterson 1972].
In this article, we take the next step in repairing XML documents or trees: given two regular specifications S and T over trees, we aim at calculating how difficult it is, in the worst case, to transform an object satisfying S into an object satisfying T. The problem is motivated by considering S to be a source (i.e., a constraint that the input is guaranteed to satisfy) and T to be a target (i.e., a constraint that needs to be enforced). More precisely, we consider the worst case over all trees t satisfying S of the minimum number of edit operations needed to transform t into some tree in T, that is,

cost(S, T) = max_{t ∈ S} min_{t' ∈ T} dist(t, t').

Of course, the preceding cost may be infinite. In this work, we isolate the pairs of schemas S and T such that cost(S, T) is finite, and we give optimal procedures to decide when this happens, namely when schema S is "almost contained" in schema T. Specifically, we say that S is bounded repairable into T when cost(S, T) < ∞, that is, when every document t in S can be repaired to a document t' in T by applying a finite, uniformly bounded number of edits.
The notion of bounded repair is also motivated by the schema matching problem [Rahm and Bernstein 2001]: we would like to identify whether two schemas are semantically related. In our setting, the semantic relation between two schemas is considered at a very low level, namely each schema is seen as a set of documents and not as a set of rules. Then the bounded repair problem states that a source schema S is related to a schema T if any XML document satisfying S can be transformed with a finite, uniformly bounded number of operations into a document satisfying T . Here, it is important to notice that our repair operations are designed to only consider the structural part of the data. Further research needs to be done to include in the analysis the data itself, such as the constraints between attribute values in XML documents, as this is an important aspect to take into account when reasoning on and transforming XML documents.
The following examples give an account of some of the difficulties of telling whether one schema is bounded repairable into another.
Example 1.1. Recall that languages of unranked trees can be specified by means of DTDs, that is, by sets of rules of the form a → L_a, where L_a is a regular language describing the possible sequences of children of an a-labeled node. For the sake of brevity, we will often omit from DTDs the rules of the form a → ε, which denote the fact that a-labeled nodes are leaves.
Consider two DTDs S and T. The left-hand side schema S defines the language of all trees of the form r(d(a, ..., a, b, ..., b), c, ..., c), whereas the right-hand side schema T defines the language of all trees of the form r(a, ..., a, e(b, ..., b, c, ..., c)). We claim that S is repairable into T with a uniformly bounded number of edit operations. Indeed, given a tree r(d(a, ..., a, b, ..., b), c, ..., c) satisfying S, one can first delete the node labeled by d, obtaining the tree r(a, ..., a, b, ..., b, c, ..., c), and then insert a new e-labeled node under the root, which adopts as children all nodes labeled by b or c; this results in a tree r(a, ..., a, e(b, ..., b, c, ..., c)) that satisfies T. Two edit operations thus suffice, regardless of the size of the tree.
It is easy to see that S is bounded repairable into T: any tree r(a(b, ..., b)) in S can be modified into a tree in T by inserting a new c-labeled node as a right sibling of the nodes labeled by b. However, if we replace in both DTDs S and T the rule r → a with the rule r → a*, we obtain a new pair of languages S' and T' such that S' is not bounded repairable into T'. This example suggests that bounded repairability depends on some interplay between the rules of DTDs and, more generally, between the specifications of the labelings of the nodes at different levels of the trees. We will deal with the notion of bounded repairability for schemas that are more general than DTDs, such as schemas that are given by regular tree languages [Schwentick 2007], and which capture the structural part of the W3C's XML schema [Fallside and Walmsley 2004]. We will formalize the edit distance between regular tree languages, and from this we will define the bounded repair problem, that is, the problem of deciding bounded repairability between two given tree languages S and T. Our main result is that it is decidable whether or not S can be repaired into T with a uniformly bounded number of edits.
For regular languages of words, the bounded repair problem was resolved in Benedikt et al. [2013]. There, it was shown that the problem is CONP-complete when the languages are represented by deterministic finite state automata, and a characterization of bounded repairability was given using a coverability relation between chains of connected components of the automata. In the case of tree languages, the problem turns out to be more complex, both in terms of complexity and in terms of proof techniques that are required to solve it. We will provide a characterization of bounded repairability that exploits a suitable notion of component of a stepwise tree automaton [Carme et al. 2004], a form of automaton that turns out to be particularly convenient for analyzing repairs. An additional complication for the tree case is that we need to consider structures of connected components of stepwise tree automata that take the form of trees rather than chains. Our characterization of the bounded repairability of S into T requires that every component structure of S can be "covered" by a component structure of T . The notion of covering is subtle, and the proof that it captures bounded repairability requires lifting the notion of edit from the level of the individual trees to the level of the component trees associated with the automata for S and T .
With an effective characterization at hand, we can decide the bounded repairability problem, and with some additional optimizations, we can give tight complexity bounds. It turns out that, differently from the string setting, the bounded repairability problem is equally complex no matter whether the tree languages are given by nondeterministic automata, deterministic automata, DTDs, or even nonrecursive DTDs. Indeed, for all of these cases, the bounded repair problem is CONEXP-complete. We then look for tractable cases that are obtained by further restricting the tree specifications. For example, we show that the bounded repairability problem becomes much simpler when the source alphabet is fixed and the languages are given by deterministic DTDs, or when the source language is assumed to be trivial, namely the set of all trees over a given finite alphabet.
New material in this article. Preliminary versions of some of the results in this work appeared in Puppis et al. [2012]. However, this article contains substantial new material. We include a full proof of the main characterization result (Theorem 5.7). The upper bounds to the number of repairs (Lemma 5.5 and Proposition 5.8) are also new. As concerns the complexity of deciding bounded repairability, the article provides new complexity bounds that are moreover tight. In the previous work, it was shown that the bounded repair problem for regular tree languages is decidable in EXP 2 and is EXP-hard. Here, we show that the problem is actually CONEXP-complete (Theorems 7.3 and 7.4). We finally include new examples and proofs of other claims that were omitted in the previous work.
Organization. The article is organized as follows. In Section 2, we discuss some related work. In Section 3, we give some preliminaries on trees and regular tree languages, and in Section 4, we define the bounded repairability problem for tree languages. In Section 5, we give the formal statement of our main result, that is, a characterization of those pairs of schemas that are bounded repairable. Section 6 gives a detailed proof of the characterization. In Section 7, we analyze in detail the complexity of the bounded repairability problem. In Section 8, we give another, simple characterization of bounded repairability for the case where the source language is universal, and we accordingly derive new complexity results. Finally, in Section 9, we give our conclusions and future work.

RELATED WORK
Ever since their conception, computers required the input data to follow a set of strict structural and semantic rules, and the failure to do so typically resulted in operations producing unpredictable outputs, a well-known phenomenon of Garbage In, Garbage Out [Babbage 1864]. While initially this phenomenon was attributed to situations of erroneous manual data entry, with the ever-increasing number of applications exchanging data, the phenomenon has gained a new meaning, describing potential problems occurring when two applications attempt to communicate with incompatible protocols [Lidwell et al. 2010]. Although our research aims at solving the problems of the latter scenario, very early in the development of computer science we saw solutions to the problems resulting from erroneous manual data entry. One prominent example is the work on the error-correcting parser for context-free languages [Aho and Peterson 1972], where a malformed input string is repaired by applying a (minimal) number of editing operations that make it conform to the given grammar. Korn et al. [2013] consider a similar problem for XML, where a serialization of an XML document is not well formed (e.g., mismatching opening and closing tags or misspelled tag names) and is repaired to allow parsing into an XML tree. A slightly different variant of the problem is repairing well-formed XML with respect to a given schema in the form of a DTD [Suzuki 2005; Staworko and Chomicki 2006] or XML schema [Staworko et al. 2008]. The validity of the XML document is restored using a minimal set of editing operations (insertion, deletion, and renaming of nodes). Adding a move operation that can change the relative order of elements is challenging because of fundamental computational limitations [Cormode and Muthukrishnan 2007], and approximate measures have been studied for this operation [Boobna and de Rougemont 2004].
Furthermore, repairing XML documents with respect to analogues of classical relational constraints (key and inclusion dependencies) has also been studied [Flesca et al. 2005]. HTML documents often violate the syntactic rules of well formedness and the structural rules imposed by the HTML standard; consequently, repairing them requires methods that diligently combine the approaches of editing the textual serialization and editing the tree representation of the input document [Chen et al. 2005; Ofuonye et al. 2010].
18:6 P. Bourhis et al.
The problem that we study is, however, more general than repairing a single input XML document with respect to a given schema, as we are interested in repairing any input document drawn from a possibly infinite set of documents with a number of operations that is independent of the size of the XML document. One could attempt to approximate the bound on the number of required editing operations by randomly generating input documents [Antonopoulos et al. 2013] and then computing their edit distance to the target regular language. Because in our setting the input document is drawn from a regular language and is repaired with respect to another regular language, the problem is in fact a generalization of a well-known and thoroughly studied problem of containment of two schemas [Comon et al. 2007; Colazzo et al. 2013; Martens et al. 2009]. Closely related is the problem of measuring similarity between two schemas based on a notion of embeddings studied in Fan and Bohannon [2008]. Yet there are significant differences: on the one hand, our semantic characterization of bounded repairability is stronger than the structural similarity determined with embeddings, but on the other hand, the framework of Fan and Bohannon [2008] introduces an additional challenge: it requires the embeddings to be information preserving, that is, for any query that can be evaluated on a document from the source schema, there exists an equivalent query over the corresponding document from the target schema.
The preservation of information expressed with queries is the essence of data exchange source-to-target mappings, and finding (target) solutions for a given source document is a difficult problem in the context of XML [Arenas and Libkin 2008; Amano et al. 2009]. In this setting, the problem of absolute consistency [Bojańczyk et al. 2011], which consists of checking that a solution exists for any possible source instance, bears strong resemblance to the problem of repairability, except that it does not call for using editing operations; consequently, it does not impose any limit on the number of editing operations but merely asks whether a solution can always be found. In fact, the authors propose a solution that uses a notion of a kind in a manner analogous to the connected components used in our approach.

REGULAR LANGUAGES OF TREES
In this article, we work with finite unranked ordered trees whose nodes are labeled over a finite alphabet Σ. Formally, the set of finite unranked ordered trees over Σ (hereafter, simply trees) is inductively defined as follows: (1) every symbol a ∈ Σ is a tree, and (2) if a ∈ Σ, n ∈ N, and t_1, ..., t_n are trees, then a(t_1, ..., t_n) is a tree. A sequence of Σ-trees t_1 · ... · t_n is called a forest. As an example, the left-hand side of Figure 1 shows a tree over the alphabet Σ = {r, a, b, c, d}. We denote by T_Σ the set of all trees over Σ. A (tree) language over Σ is any subset L of T_Σ.
It is useful to identify nodes of an unranked tree with sequences of positive natural numbers. Given a tree t of the form a(t_1, ..., t_n), its domain is the subset of N* that is formally defined as nodes(t) = {ε} ∪ {i · x | x ∈ nodes(t_i) ∧ 1 ≤ i ≤ n}. Note that the root of a tree is represented by ε. Finally, for every node x ∈ nodes(t), we denote by t(x) the label of x in t.
Given a tree t, we introduce two partial orders on the domain nodes(t), which are called ancestor order and post-order and are denoted by anc_t and post_t, respectively. The ancestor order anc_t is nothing but the prefix order on the sequences of positive natural numbers that identify the nodes of t, that is, x anc_t y if and only if x is a prefix of y. The post-order post_t is the total ordering on the nodes of t (i.e., sequences of natural numbers) defined by x post_t y if and only if y is a prefix of x or there exist z, x', y' ∈ N* and i, j ∈ N such that x = z · i · x', y = z · j · y', and i < j.
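To make the node addressing and the two orders concrete, the following Python sketch computes nodes(t) and compares nodes in the ancestor order and in the post-order. The pair representation (label, children) and all function names are assumptions made for illustration only; they do not come from the article.

```python
from functools import cmp_to_key

# Assumed representation: a tree is a pair (label, children);
# a node is a tuple of 1-based child indices, with () standing for the root.

def nodes(tree):
    """Return the domain of the tree: () plus (i,) + x for every node x of child i."""
    _, children = tree
    result = [()]
    for i, child in enumerate(children, start=1):
        result.extend((i,) + x for x in nodes(child))
    return result

def anc_leq(x, y):
    """Ancestor order: x is a prefix of y."""
    return y[:len(x)] == x

def post_leq(x, y):
    """Post-order: y is a prefix of x, or x branches off to the left of y."""
    if anc_leq(y, x):
        return True
    if anc_leq(x, y):
        return False  # x is a proper ancestor of y, so x comes later
    # Neither is a prefix of the other: compare at the first differing index.
    k = next(i for i, (a, b) in enumerate(zip(x, y)) if a != b)
    return x[k] < y[k]

def post_sorted(tree):
    """List the nodes of the tree in increasing post-order."""
    def compare(x, y):
        if x == y:
            return 0
        return -1 if post_leq(x, y) else 1
    return sorted(nodes(tree), key=cmp_to_key(compare))

# Sample tree r(a(b, c), d), chosen only for illustration.
sample = ("r", [("a", [("b", []), ("c", [])]), ("d", [])])
order = post_sorted(sample)  # nodes labeled b, c, a, d, r, in this order
```

On the sample tree, the sorted list visits the children of a node before the node itself and left siblings before right siblings, which is the usual post-order traversal.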
DTDs. We manipulate regular languages of trees mainly by means of automaton-based specifications, which will be formally defined in the next paragraphs. However, we use less expressive specifications, such as XML document type definitions, to give examples of simple tree languages.
An XML DTD is defined as a tuple D = (Σ, d, I), where Σ is a finite alphabet, d is a function that maps symbols from Σ to regular expressions over Σ, and I ⊆ Σ is the set of initial symbols [Comon et al. 2007]. A tree t satisfies the DTD D if t(ε) ∈ I and, for every x ∈ nodes(t), the word t(x · 1) · ... · t(x · n) belongs to the language defined by the regular expression d(t(x)), where n is the number of children of x in t. We denote by L(D) the language of trees satisfying the DTD D. We will often omit the rules of the form a → ε, as well as the set of initial symbols from the definition of a DTD when this set is understood from the context (e.g., when it consists of a single symbol r). As an example, consider the following DTD: One can easily check that the left-hand side tree of Figure 1 satisfies the preceding DTD D.
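As a rough illustration of the satisfaction condition above, the following Python sketch checks a tree against a DTD. It assumes single-character labels, so that Python regular expressions can stand in for the function d; the representation, the function names, and the sample DTD are hypothetical and are not the DTD of the article.

```python
import re

# Assumed representation: a tree is a pair (label, children),
# and a DTD is a dict {label: regex over labels} plus a set of initial symbols.

def satisfies(tree, rules, initial):
    """Check that the root label is initial and that every node obeys its rule."""
    label, _ = tree
    return label in initial and _check(tree, rules)

def _check(tree, rules):
    label, children = tree
    # The word formed by the labels of the children, read from left to right.
    word = "".join(child[0] for child in children)
    # Labels without an explicit rule are treated as leaves (rule a -> epsilon).
    if re.fullmatch(rules.get(label, ""), word) is None:
        return False
    return all(_check(child, rules) for child in children)

# Hypothetical DTD: r -> a* d, a -> b? c (rules for leaves are omitted).
sample_rules = {"r": "a*d", "a": "b?c"}
sample_tree = ("r", [("a", [("b", []), ("c", [])]), ("a", [("c", [])]), ("d", [])])
is_valid = satisfies(sample_tree, sample_rules, {"r"})
```

The check visits each node once and matches the word of its children's labels against the corresponding content model, mirroring the definition one node at a time.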
It is known that DTDs define a proper subclass of regular tree languages [Martens et al. 2006; Martens and Niehren 2007]. Quite interestingly, most of our complexity lower bounds for the bounded repair problem hold even for languages defined by DTDs (see Section 7 for more details).
We will also consider languages defined by nonrecursive deterministic DTDs, which are often used in practice and have been studied extensively in Segoufin and Vianu [2002] and Segoufin and Sirangelo [2007]. Let us define the dependency graph of a DTD D = (Σ, d, I) as the directed graph whose nodes are the letters in Σ and whose edges connect any letter a to a letter b whenever b occurs in some word of the language specified by the regular expression d(a). A DTD D is called nonrecursive if its dependency graph is acyclic. A DTD D is called deterministic if each regular expression d(a) is one-unambiguous (namely, it can be equally seen as a deterministic finite state automaton) [Brüggemann-Klein and Wood 1998] and the set of initial symbols is a singleton.
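The nonrecursiveness condition can be sketched in Python as a cycle check on the dependency graph. The sketch assumes single lowercase letters as labels and approximates "b occurs in the language of d(a)" by "b occurs in the expression d(a)", which is accurate only when no subexpression denotes the empty language; the function names and the sample DTDs are hypothetical.

```python
import re

def dependency_graph(rules):
    """Map each label to the set of labels occurring in its content model."""
    labels = set(rules)
    for regex in rules.values():
        labels |= set(re.findall(r"[a-z]", regex))
    # Labels without an explicit rule are leaves and have no outgoing edges.
    return {a: set(re.findall(r"[a-z]", rules.get(a, ""))) for a in labels}

def is_nonrecursive(rules):
    """A DTD is nonrecursive iff its dependency graph is acyclic (DFS cycle check)."""
    graph = dependency_graph(rules)
    state = {}  # label -> "open" while on the DFS stack, "done" afterwards

    def has_cycle(a):
        if state.get(a) == "open":
            return True
        if state.get(a) == "done":
            return False
        state[a] = "open"
        found = any(has_cycle(b) for b in graph[a])
        state[a] = "done"
        return found

    return not any(has_cycle(a) for a in graph)

# Hypothetical DTDs: the first is nonrecursive, the second has the cycle r -> a -> r.
flat = {"r": "a*d", "a": "b?c"}
looping = {"r": "a*", "a": "(b|r)?"}
results = (is_nonrecursive(flat), is_nonrecursive(looping))
```

A nonrecursive DTD bounds the depth of its trees by the number of letters, which is why the acyclicity test is performed on letters rather than on trees.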
Curry encoding. To ease the definitions of the automaton model and the reasoning on tree repairs, we introduce here the notion of curry encoding, also known as extension encoding, of a tree [Carme et al. 2004; Martens and Niehren 2007]. According to this encoding, any unranked tree over Σ is seen as a binary tree with leaves labeled over Σ and internal nodes labeled by a distinguished symbol @. Formally, the curry encoding is the function ext that injectively maps unranked trees to binary trees as follows: ext(a) = a and ext(a(t_1, ..., t_n)) = @(ext(a(t_1, ..., t_{n−1})), ext(t_n)).
To ease readability, we use the symbol @ as a binary, infix, left-associative operator: for instance, ext(a(t_1, ..., t_n)) = a @ ext(t_1) @ ... @ ext(t_n). The left-hand side of Figure 1 illustrates the encoding of an unranked tree. The inverse ext^{-1} of the encoding is defined by providing the symbol @ with the semantics of the extension operator on unranked trees and by evaluating the expression in a bottom-up fashion, that is, ext^{-1}(a) = a and ext^{-1}(a @ t_1 @ ... @ t_n) = a(ext^{-1}(t_1), ..., ext^{-1}(t_n)).
We observe that there is a one-to-one correspondence between the nodes of an unranked tree and the leaves of the curried encoding. In particular, the root node of an unranked tree corresponds to the left-most leaf of its curry encoding. Moreover, the yield of a curried tree (i.e., the sequence of leaves taken from left to right) corresponds to the standard left-to-right preorder traversal of the corresponding unranked tree. Another observation follows from the semantics of the extension operator: the inner nodes of a curried tree, labeled with @, correspond to the edges of the unranked tree.
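The encoding, its inverse, and the correspondence between the yield and the preorder traversal can be sketched in Python as follows. The representations (pairs for unranked trees, strings and triples for curried trees) and the function names are assumptions made for this illustration, and labels are assumed to differ from the symbol @.

```python
# Assumed representations:
#   unranked tree: (label, [children])
#   curried tree:  a label (leaf) or a triple ("@", left, right)

def ext(tree):
    """Curry encoding: split off the last child, then encode recursively."""
    label, children = tree
    if not children:
        return label
    return ("@", ext((label, children[:-1])), ext(children[-1]))

def ext_inv(curried):
    """Inverse encoding: read @ as 'append the right tree as a new last child'."""
    if isinstance(curried, str):
        return (curried, [])
    _, left, right = curried
    label, children = ext_inv(left)
    return (label, children + [ext_inv(right)])

def leaves(curried):
    """Yield of a curried tree: its leaves from left to right."""
    if isinstance(curried, str):
        return [curried]
    _, left, right = curried
    return leaves(left) + leaves(right)

def preorder(tree):
    """Labels of an unranked tree in left-to-right preorder."""
    label, children = tree
    return [label] + [x for child in children for x in preorder(child)]

# Sample tree r(a(b), c), whose encoding is r @ (a @ b) @ c.
sample = ("r", [("a", [("b", [])]), ("c", [])])
encoded = ext(sample)
```

In the sample, the left-most leaf of the encoding is the root label r, and each @ node stands for one parent-child edge of the unranked tree.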
Hereafter, we will identify unranked trees with their curry encodings. In particular, by a slight abuse of notation, we will denote by T_Σ the set of all curry encodings of trees over Σ, and by nodes(t) the domain of the curry encoding of a tree t.
Contexts. We now fix another special symbol • ∉ Σ that will be used as a placeholder for contexts. Formally, a (curried) context over Σ is the curry encoding of a tree over the alphabet Σ ∪ {•}, with a single node labeled by • (note that in the curry encoding, the symbol • must occur in a leaf). We denote by C_Σ the set of all contexts over Σ. The empty context is the context • having exactly one node. The right-hand side of Figure 1 illustrates the encoding of a context. A context C is horizontal if the placeholder • is the left-most leaf of C. We point out that a horizontal context has the form • @ t_1 @ ... @ t_n and represents the forest, that is, the sequence of trees, (ext^{-1}(t_1), ..., ext^{-1}(t_n)). Note that the empty context is horizontal.
For a context C and a tree t, we denote by C • t the tree obtained from the substitution of • by t in C. Similarly, the composition C_1 • C_2 of two contexts C_1 and C_2 is obtained from the substitution of the placeholder in C_1 by C_2 (this results again in a context in C_Σ). The composition of two horizontal contexts is also horizontal and corresponds to concatenation of the corresponding forests of unranked trees. Note, however, the difference in the order of context composition and the order of forest concatenation: if C = • @ t_1 @ ... @ t_n and C' = • @ t'_1 @ ... @ t'_m, then C • C' = • @ t'_1 @ ... @ t'_m @ t_1 @ ... @ t_n, which represents the forest (ext^{-1}(t'_1), ..., ext^{-1}(t'_m), ext^{-1}(t_1), ..., ext^{-1}(t_n)).
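The reversal of order between context composition and forest concatenation is easy to miss, so here is a small Python sketch of it. Curried trees and contexts are represented as strings and triples ("@", left, right); this representation and the names plug and horizontal are assumptions made for illustration.

```python
HOLE = "•"  # the placeholder symbol, assumed not to be a label

def plug(context, filler):
    """Substitute the placeholder of a curried context by a tree or by another context."""
    if context == HOLE:
        return filler
    if isinstance(context, str):
        return context
    _, left, right = context
    return ("@", plug(left, filler), plug(right, filler))

def horizontal(*trees):
    """Build the horizontal context HOLE @ t1 @ ... @ tn from curried trees."""
    context = HOLE
    for t in trees:
        context = ("@", context, t)
    return context

# C represents the forest (a, b) and C2 represents the forest (c).
C = horizontal("a", "b")
C2 = horizontal("c")
composed = plug(C, C2)   # horizontal context representing the forest (c, a, b)
as_tree = plug(C, "r")   # the curried tree r @ a @ b, i.e., r(a, b)
```

Plugging C2 into C puts the trees of C2 first, because the placeholder of a horizontal context is its left-most leaf.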
Stepwise tree automata. We use stepwise tree automata to specify regular tree languages. These are essentially bottom-up tree automata running on the curry encodings of trees [Carme et al. 2004; Martens and Niehren 2007; Comon et al. 2007]. Formally, a stepwise automaton is a tuple A = (Σ, Q, δ, δ_0, F), where (1) Σ is a finite set of labels, (2) Q is a finite set of states, (3) δ : Q × Q → 2^Q is a transition function, (4) δ_0 : Σ → 2^Q is an assignment of initial states to labels, and (5) F ⊆ Q is a set of final states.
We say that the automaton A is deterministic if δ_0 (respectively, δ) can be described as a partial function from Σ (respectively, Q × Q) to Q. It is often convenient to represent δ_0 and δ as a set of rules. For instance, we write a → q to indicate that q ∈ δ_0(a) and q_1 @ q_2 → q to indicate that q ∈ δ(q_1, q_2).
A run of a stepwise automaton A = (Σ, Q, δ, δ_0, F) on a tree t ∈ T_Σ is a function ρ : nodes(t) → Q such that (1) for every leaf node x, ρ(x) ∈ δ_0(t(x)), and (2) for every inner node x, ρ(x) ∈ δ(ρ(x · 1), ρ(x · 2)) (recall that we represent t with its curry encoding). A run ρ is accepting if ρ(ε) ∈ F. The language recognized by A, denoted by L(A), is the set of all trees t ∈ T_Σ on which A has an accepting run.
Example 3.1. The following will serve as our running example. Consider two DTDs: The following two stepwise automata capture (modulo the curry encoding) the languages defined by the previous DTDs (the underlined states are final, and each rule with q a ? translates to two rules with q a 0 and q a 1 ): Figure 2 presents the (accepting) runs of the automata S and T on some curried trees.
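The definition of a run given above can be sketched in Python by computing, bottom-up, the set of states that some run may assign to the root of a curried tree. The dictionary encoding of δ_0 and δ, the function names, and the sample automaton (which accepts the trees of the form r(a, ..., a)) are hypothetical and are not the automata of Example 3.1.

```python
# Assumed representations:
#   curried tree: a label (leaf) or a triple ("@", left, right)
#   delta0: dict mapping a label to a set of states
#   delta:  dict mapping a pair of states to a set of states

def reachable_states(curried, delta0, delta):
    """States q such that some run labels the root of the curried tree with q."""
    if isinstance(curried, str):
        return set(delta0.get(curried, ()))
    _, left, right = curried
    result = set()
    for q1 in reachable_states(left, delta0, delta):
        for q2 in reachable_states(right, delta0, delta):
            result |= set(delta.get((q1, q2), ()))
    return result

def accepts(curried, delta0, delta, final):
    """A tree is accepted if some run assigns a final state to its root."""
    return bool(reachable_states(curried, delta0, delta) & set(final))

# Hypothetical automaton with rules r -> qr, a -> qa, and qr @ qa -> qr.
delta0 = {"r": {"qr"}, "a": {"qa"}}
delta = {("qr", "qa"): {"qr"}}
final = {"qr"}
flat_tree = ("@", ("@", "r", "a"), "a")   # r @ a @ a, i.e., r(a, a)
deep_tree = ("@", "r", ("@", "a", "a"))   # r @ (a @ a), i.e., r(a(a))
verdicts = (accepts(flat_tree, delta0, delta, final),
            accepts(deep_tree, delta0, delta, final))
```

Working with sets of states handles nondeterminism directly; for a deterministic automaton each computed set has at most one element.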
Stepwise automata capture exactly the class of regular (unranked) tree languages [Carme et al. 2004], and they are more succinct than other models of automata [Martens and Niehren 2007]. Even though other equivalent models of automata, such as unranked tree automata, are more frequently used in practice, these can be converted into stepwise tree automata in polynomial time. This means that algorithms for analyzing stepwise automata provide the same complexity bounds for unranked tree automata; in particular, all of our complexity results apply to both stepwise tree automata and unranked tree automata. The main advantage of using stepwise automata in our proofs is due to their ability to capture in a uniform way the "cyclic behavior" of a regular tree language (as we will see in Section 5, this cyclic behavior is defined in terms of strongly connected components of automata).
In the sequel, we will assume that our stepwise tree automata are trimmed, namely they contain only states that appear in valid accepting runs. Formally, a stepwise tree automaton A = (Σ, Q, δ, δ_0, F) is trimmed if for every state q ∈ Q, there exist t ∈ T_Σ and an accepting run ρ of A on t such that ρ(x) = q for some x ∈ nodes(t). Every stepwise tree automaton can be trimmed in linear time [Comon et al. 2007]. Given that all problems considered in this article are at least P-hard, the assumption that all stepwise tree automata are trimmed is without loss of generality.
As is usual for word automata, we extend the transition function δ of a stepwise automaton to trees in T_Σ and to contexts in C_Σ. More precisely, we define the function δ* : T_Σ → 2^Q such that q ∈ δ*(t) if and only if there exists a run ρ of A on t with ρ(ε) = q. Similarly, we define the function δ*_• : Q × C_Σ → 2^Q such that q' ∈ δ*_•(q, C) if and only if there exists a run of A on C that assigns state q to the placeholder and state q' to the root (i.e., we simulate some computation of A on C under the assumption that the placeholder is assigned state q). In particular, we have δ*_•(q, •) = {q}. By an abuse of notation, we will denote δ* and δ*_• simply by δ.

THE BOUNDED REPAIR PROBLEM FOR TREES
We repair trees by using the standard set of edit operations over nodes [Tai 1979; Bille 2005]. We briefly recall the definitions of the standard edit operations on unranked trees, which are extensions of the edit operations over words. The first operation, called deletion, removes a distinguished (nonroot) node x from a tree t and promotes the subtrees of x to children of its parent. The second operation, called insertion, adds a new node x in an unranked tree t, with a possible adoption of a list of consecutive children of the parent of x whose original position immediately follows the position of x. Figure 3 provides an example of these two operations. The last operation, called relabeling, modifies the label of a node x to a new label. These three operations are the standard edit operations used to define the edit distance between trees (see Bille [2005] for a survey). Given two unranked trees t and t', we denote by dist(t, t') the minimum number of edit operations that are needed for transforming t into t'. Note that the operation of relabeling a node in an unranked tree, which is sometimes used as a standard edit operation, is subsumed by the insertion and deletion of nodes. Whether relabeling is allowed therefore has an impact on the edit distance between two trees. However, it has no impact on the bounded repair problem: the answer is the same whether relabeling is allowed as an atomic operation or simulated by insertion and deletion operations.
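The deletion and insertion operations recalled above can be sketched in Python on unranked trees. The pair representation, the function names, and the use of 0-based child indices for paths are assumptions made for illustration; the demonstration replays the two edits of Example 1.1 on a tree of the form r(d(a, a, b, b), c, c).

```python
# Assumed representation: a tree is a pair (label, [children]);
# a path is a tuple of 0-based child indices, with () standing for the root.

def delete(tree, path):
    """Delete the nonroot node at `path`; its children take its place under the parent."""
    label, children = tree
    i, rest = path[0], path[1:]
    if rest:
        new = children[:i] + [delete(children[i], rest)] + children[i + 1:]
    else:
        new = children[:i] + children[i][1] + children[i + 1:]
    return (label, new)

def insert(tree, path, label, position, adopted):
    """Insert a `label` node as child number `position` of the node at `path`,
    adopting the `adopted` consecutive children that start at that position."""
    root, children = tree
    if path:
        i = path[0]
        child = insert(children[i], path[1:], label, position, adopted)
        new = children[:i] + [child] + children[i + 1:]
    else:
        node = (label, children[position:position + adopted])
        new = children[:position] + [node] + children[position + adopted:]
    return (root, new)

def leaf(label):
    return (label, [])

# Source tree r(d(a, a, b, b), c, c).
source = ("r", [("d", [leaf("a"), leaf("a"), leaf("b"), leaf("b")]), leaf("c"), leaf("c")])
step1 = delete(source, (0,))           # r(a, a, b, b, c, c)
step2 = insert(step1, (), "e", 2, 4)   # r(a, a, e(b, b, c, c))
```

Both functions return new trees and leave their input unchanged, so intermediate steps of a repair can be inspected one edit at a time.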
We identify a close correspondence between the two basic edit operations on unranked trees and two operations on the corresponding curry encodings. Because the operations on unranked trees involve changing the parent of a sequence of consecutive subtrees (forests), the operations on curry encodings involve moving corresponding horizontal contexts.
Deleting an inner node x in an unranked tree (Figure 4) corresponds to deleting the corresponding leaf node x in the curry encoding, identifying the horizontal context C that represents the sequence of children of the deleted node, and replacing the parent @ of C by C. Note that the context C is uniquely defined in the curry encoding and always has a parent @ because we apply the deletion operation to an inner node in the unranked tree. Conversely, inserting a node y in an unranked tree (Figure 5) corresponds to identifying a context C representing the forest of the adopted children, placing a new @ node in place of C while attaching C as the right child of the new node @, and substituting the placeholder of C by the new node y. Here the context C is also uniquely defined by the sequence of consecutive children adopted by the inserted node.
We are interested in studying the bounded repairability problem. This problem was studied for strings by Benedikt et al. [2013], and we extend their setting from strings to trees. We consider two finite alphabets, a source alphabet and a target alphabet, and regular languages S and T over them, called the source and target languages, respectively. Furthermore, given tree languages S and T, we define a repair strategy as any function from trees in S to trees in T.
We are now ready to introduce the problem in which we are mainly interested.
Definition 4.1. Given two regular tree languages S and T, let
\[ \mathrm{cost}(S, T) \;=\; \sup_{t \in S} \, \min_{t' \in T} \, \mathrm{dist}(t, t') \]
be the worst-case cost of repairing S into T. Note that this can be equally defined as the minimum of max_{t∈S} dist(t, f(t)) over all repair strategies f from S to T. If cost(S, T) is finite, then we say that S is bounded repairable into T, and we write cost(S, T) < ∞ for short. Intuitively, this is equivalent to saying that there is a repair strategy f transforming any tree t ∈ S into a tree f(t) ∈ T and having dist(t, f(t)) uniformly bounded by a constant.
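To make the worst-case cost concrete, the following sketch evaluates it by brute force on finite languages. As a loudly stated simplification, it uses strings and the word edit distance as a stand-in for trees and the tree edit distance (the string setting is the one that the tree setting extends); the function names are hypothetical.

```python
def edit_distance(u: str, v: str) -> int:
    """Word edit distance with insertions, deletions, and relabelings
    (a stand-in for the tree edit distance dist)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,             # delete a
                           cur[j - 1] + 1,          # insert b
                           prev[j - 1] + (a != b))) # relabel a into b
        prev = cur
    return prev[-1]


def cost(source, target, dist=edit_distance):
    """Worst case, over the source, of the distance to the closest target."""
    return max(min(dist(s, t) for t in target) for s in source)


def best_repair(s, target, dist=edit_distance):
    """One value f(s) of an optimal repair strategy f."""
    return min(target, key=lambda t: dist(s, t))
```

On finite fragments the difference between bounded and unbounded repairability is already visible: repairing the words a^k with k ≤ n into the empty word costs n, which grows with n, whereas repairing them into words of even length costs at most 1.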
The bounded repair problem amounts to deciding, given two regular tree languages S and T (specified by means of stepwise tree automata or DTDs), whether S is bounded repairable into T.
Example 3.1 (continued). Consider the two DTDs D and D' that we introduced in our running example. For the tree languages specified by D and D', one can check that L(D) is bounded repairable into L(D'). Figure 6 shows how to repair a tree satisfying D into one satisfying D' with five edits: first, one removes the d-labeled node under the root and its b-labeled child; then, one adds a new d-labeled node above the two branches starting with a; and finally, one adds b-labeled leaves under the r daa branches. In fact, similar strategies with edit cost at most five can be used to repair any tree t ∈ L(D) into a tree t' ∈ L(D'). In particular, this shows that cost(L(D), L(D')) < ∞.
It is easy to verify that the bounded repairability relation cost(S, T) < ∞ satisfies the following key properties, which shall be used later:
- subset-subsumption: if S ⊆ T, then cost(S, T) < ∞;
- transitivity: if cost(S, T) < ∞ and cost(T, U) < ∞, then cost(S, U) < ∞;
- union-compatibility: if cost(S, T) < ∞ and cost(S', T') < ∞, then cost(S ∪ S', T ∪ T') < ∞.
The first property, subset-subsumption, is trivial to prove given that cost(S, T) = 0 if and only if S ⊆ T. For the transitivity property, one can easily check that the function dist is a metric over trees and hence satisfies the triangle inequality (i.e., dist(t, t'') ≤ dist(t, t') + dist(t', t'') for any trees t, t', t''). Now, given any t ∈ S, we can choose t' ∈ T with dist(t, t') ≤ cost(S, T) and then t'' ∈ U with dist(t', t'') ≤ cost(T, U), which implies dist(t, t'') ≤ cost(S, T) + cost(T, U). Thus, we conclude that cost(S, U) ≤ cost(S, T) + cost(T, U) is also bounded and the transitivity property is proved. Finally, union-compatibility follows directly from the definition of worst-case cost of repairing a source into a target language. Indeed, if cost(S, T) < ∞ and cost(S', T') < ∞, then cost(S ∪ S', T ∪ T') ≤ max{cost(S, T), cost(S', T')}, and we conclude that cost(S ∪ S', T ∪ T') is bounded as well.

CHARACTERIZATION OF BOUNDED REPAIRABILITY
In this section, we give an effective characterization of the bounded repairability relation between regular tree languages. Similarly to the string setting [Benedikt et al. 2013], this characterization is based on the notion of strongly connected component of the transition graph of a stepwise automaton. In the string case, a suitable coverability relation between chains of components is used to characterize bounded repairability. Because here we work with trees, we need to generalize the notion of coverability to a relation over the so-called synopsis trees-that is, full binary trees with nodes labeled by strongly connected components.

Components of Stepwise Automata
Given a stepwise automaton A = (Σ, Q, δ, δ_0, F), the transition graph of A is the graph G_A = (Q, E_h ∪ E_v), where
\[ E_h = \{ (q_1, q) \mid \exists q_2 \in Q.\ q \in \delta(q_1, q_2) \}, \qquad E_v = \{ (q_2, q) \mid \exists q_1 \in Q.\ q \in \delta(q_1, q_2) \}. \]
We call the edges in E_v vertical and the edges in E_h horizontal. Note that an edge may be both vertical and horizontal. As an example, Figure 7 depicts the transition graphs of the automata S and T of Example 3.1 (dashed arrows represent horizontal edges, solid arrows represent vertical edges).
Recall that a strongly connected component of a graph (or simply a component) is a maximal set of nodes X such that every two nodes x, y in X are connected by a directed path. By SCC(A), we denote the set of all strongly connected components in the transition graph of A. In a way similar to the string setting, we associate with each component X ∈ SCC(A) the language L(A | X) of contexts that are realizable within X:
\[ L(A \mid X) = \{\, C \mid \exists p, q \in X.\ q \in \delta(p, C) \,\}. \]
For example, the contexts realizable within the component { p d 1 } of the automaton of our running example (see also Figure 7) are all of the form •, • @ c, (• @ c) @ c, and so forth.
Because editing operations on unranked trees correspond to operations involving horizontal contexts in the curry encodings, we identify strongly connected components of an automaton that yield only horizontal contexts. A proper manipulation of those components translates to performing a fixed number of editing operations regardless of the contexts such components define, which is the basis of characterizing bounded repairability. Formally, a component X ∈ SCC(A) is horizontal if and only if L(A | X) consists of horizontal contexts only. Similarly, we say that X is trivial if and only if it realizes the empty context only, that is, L(A | X) = {•}. Note that trivial components are horizontal.
As an example, consider again the transition graphs of Figure 7. The components { p d 1 } and {q r 1 } are nontrivial horizontal, as they both realize the contexts •, • @ c, (• @ c) @ c, and so forth. The components { p a 1 } and {q a 1 } are nonhorizontal, as they both realize the contexts •, a @ •, a @ (a @ •), and so forth.
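The classification of components can be sketched in a few lines. The sketch rests on two assumptions that are not spelled out above: a transition q ∈ δ(q1, q2) reads the state q1 of the left argument and the state q2 of the right argument of an application t1 @ t2, so that the edge (q1, q) is horizontal and the edge (q2, q) is vertical (this matches the horizontal contexts of the form • @ c); and the automaton is trimmed, so that every edge inside a component is actually realized by some context. The toy automaton and all names are hypothetical.

```python
def transition_graph(delta):
    """delta maps (q1, q2) to the set of states q with q in delta(q1, q2)."""
    e_h, e_v = set(), set()
    for (q1, q2), targets in delta.items():
        for q in targets:
            e_h.add((q1, q))  # assumed horizontal: left argument -> result
            e_v.add((q2, q))  # assumed vertical: right argument -> result
    return e_h, e_v


def components(states, edges):
    """Strongly connected components via mutual reachability
    (a simple fixpoint, adequate for small automata)."""
    reach = {q: {q} for q in states}
    changed = True
    while changed:
        changed = False
        for p, q in edges:
            for r in states:
                if p in reach[r] and not reach[q] <= reach[r]:
                    reach[r] |= reach[q]
                    changed = True
    sccs, seen = [], set()
    for q in states:
        if q not in seen:
            comp = frozenset(p for p in reach[q] if q in reach[p])
            seen |= comp
            sccs.append(comp)
    return sccs


def classify(comp, e_h, e_v):
    """Trivial: no edge inside; horizontal: no vertical edge inside."""
    def has_inner(edges):
        return any(p in comp and q in comp for p, q in edges)
    if has_inner(e_v):
        return "nonhorizontal"
    if has_inner(e_h):
        return "nontrivial horizontal"
    return "trivial"
```

In the toy automaton used in the test, the state d loops on itself when a c-labeled argument is appended, which gives a nontrivial horizontal component, while the state a1 loops on itself through the right argument, which gives a nonhorizontal component.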

Synopsis Trees
We now introduce a suitable structure that eases the characterization of bounded repairability, which we call the synopsis tree. The structure is a generalization of the chain of components that is used in Theorem 4.1 [Benedikt et al. 2013] to characterize bounded repairability between string languages. Formally, a synopsis tree of an automaton A is any full binary tree whose nodes are labeled with elements of SCC(A). The language [[σ]]_A of curried trees that is induced by a synopsis tree σ of A is defined recursively as follows:
\[ [\![X]\!]_A = \{\, C \bullet a \mid C \in L(A \mid X),\ a \in \Sigma \,\}, \]
\[ [\![X(\sigma_1, \sigma_2)]\!]_A = \{\, C \bullet (t_1 \mathbin{@} t_2) \mid C \in L(A \mid X),\ t_1 \in [\![\sigma_1]\!]_A,\ t_2 \in [\![\sigma_2]\!]_A \,\}. \]
Intuitively, all trees in the language [[σ]]_A are obtained by combining, in a suitable order that is compatible with the structure of σ, some contexts that are realizable within the components of σ. Figure 8 contains two synopsis trees σ and τ, respectively, for the source automaton S and the target automaton T of Example 3.1. An example of a tree induced by the synopsis tree on the left is that of Figure 1.
Next we identify a family of synopsis trees that captures "closely enough" the language recognized by an automaton.

PST(S) denotes the set of all primitive synopsis trees of S.
We observe that the second property stated in Definition 5.1 is equivalent to asking that every component appears at most once in every path of a primitive synopsis tree.
As an example, the tree σ depicted to the left of Figure 8 is a primitive synopsis tree, and it corresponds to the run on the left-hand side of Figure 2 of the automaton S of Example 3.1. On the other hand, the synopsis tree τ depicted to the right is not primitive.
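The path condition mentioned above (every component appears at most once on every path) is easy to test mechanically. In the sketch below, which covers only this second property and not compatibility with the transition function, a synopsis tree is a hypothetical nested tuple: (X,) for a leaf and (X, left, right) for an internal node, where X names a component.

```python
def no_repetition_on_paths(sigma, seen=frozenset()):
    """True iff no component label occurs twice on a root-to-leaf path."""
    label, *subtrees = sigma
    if label in seen:
        return False
    return all(no_repetition_on_paths(s, seen | {label}) for s in subtrees)
```

Note that the same component may still occur on two different branches; only repetitions along a single path are ruled out.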
The idea underlying the notion of primitive synopsis tree is to capture the "cyclic behavior" of the components of the source automaton. This cyclic behavior has to be taken into account in the characterization of bounded repairability because it could generate arbitrarily large fragments of trees that cannot be edited with uniformly bounded cost. Moreover, the use of primitive synopsis trees as a representation of the source language L(S) is sound in the sense that L(S) is contained in the union of the languages induced by primitive synopsis trees.
LEMMA 5.2. For every stepwise automaton S,
\[ L(S) \subseteq \bigcup_{\sigma \in \mathrm{PST}(S)} [\![\sigma]\!]_S. \]
Before entering the details of the proof of Lemma 5.2, we illustrate the main ideas on the example tree t from Figure 1 accepted by the automaton S from Example 3.1. We use the accepting run of S on t (see Figure 2), particularly the transitions in it that induce a change of component along both successors, to decompose t into a binary tree structure, where each node represents a context realizable by some component of S. We present this decomposition in Figure 9, where for a better visualization we annotate the states not on the nodes but on the edges above them (this requires adding a virtual edge entering the root).
PROOF OF LEMMA 5.2. Fix a curried tree t ∈ L (S) and an accepting run ρ of S on t. We need to construct a primitive synopsis tree σ such that t ∈ [[σ ]] S . Recall that a synopsis tree is a tree whose nodes are labeled with strongly connected components of S. To construct σ , we first decompose t into pieces: this will result in a tree-shaped arrangement of contexts, which we refer to as the context decomposition of t. We then show how to turn the context decomposition of t into the desired primitive synopsis tree.
Formally, we represent a context decomposition of t as a subset D of the nodes of t satisfying the following conditions: (1) D contains the root of t; and (2) for every node x in D, either x is a leaf of t or there is a descendant y of x in t such that (i) the states ρ(x) and ρ(y) belong to the same strongly connected component of S, (ii) both successors y · 1 and y · 2 belong to D, and (iii) for every node z ∈ D, if z is a proper descendant of x, then z is a proper descendant of y as well.
Note that the node-to-child relation of t induces an analogous structure on any set D of the preceding form. In particular, we can think of a context decomposition D as a full binary tree. We further associate with each internal node x of D the context D(x) that is obtained by selecting the portion of the tree t that lies between x and the unique node y such that y · 1 and y · 2 are children of x in D; the node that corresponds to y in the resulting context is labeled with the placeholder •. The flattening [[D]] of a context decomposition D is the language of trees that is inductively defined as follows: where, in the second line, C_0 is the context associated with the root of D and D_1 and D_2 are the subtrees of D (D_1 and D_2 can be seen as context decompositions of two disjoint subtrees of t). It is easy to see that t ∈ [[D]] for every context decomposition D of t. We also associate with each decomposition D of t an induced synopsis tree σ_D by replacing every context C labeling an internal node x of D with the strongly connected component X of the state ρ(x) (note that C ∈ L(S | X)).

Due to the similarity between the definition of the flattening [[D]] and the definition of the language [[σ_D]]_S induced by the synopsis tree σ_D, we have [[D]] ⊆ [[σ_D]]_S, and hence t ∈ [[σ_D]]_S. However, given a generic context decomposition D, there is no guarantee that σ_D is a primitive synopsis tree; in particular, it may happen that two consecutive nodes in σ_D are labeled with the same component. To overcome this problem, next we show how to construct a specific context decomposition D of t that satisfies the following additional property: (3) If x is an internal node of D and y · 1 and y · 2 are the two immediate successors of x in D, then the component of ρ(x) is different from the components of ρ(y · 1) and ρ(y · 2).
Clearly, the additional property suffices to conclude that σ D is a primitive synopsis tree and to thus prove the lemma.
To construct a context decomposition D of t that satisfies properties (1) through (3), we follow maximal paths within the same component in the run ρ. More precisely, let x be the root of t, and let X be the component of ρ(x). We distinguish two cases depending on whether or not there is a leaf y of ρ whose state belongs to the component X. If there is such a leaf y, then we define the context decomposition D of t to be the set containing only the two nodes x and y. In this case, the corresponding synopsis tree σ_D is clearly primitive. Otherwise, if all states associated with the leaves of ρ are outside X, we choose any maximal path of ρ that starts in x and visits only states within the component X. Let y be the last node of this path. Clearly, y is not a leaf, and both states ρ(y · 1) and ρ(y · 2) are outside X. By exploiting a simple inductive argument, we can assume that the two subtrees of t rooted at nodes y · 1 and y · 2 admit some context decompositions D_1 and D_2 satisfying (1) through (3). We can thus define our context decomposition D of t to be the set {x} ∪ D_1 ∪ D_2. It is routine to verify that D satisfies properties (1) through (3) and induces a primitive synopsis tree σ_D.
Summing up, we constructed from t and ρ a suitable context decomposition D, and from this we derived the existence of a primitive synopsis tree σ_D such that t ∈ [[σ_D]]_S. This concludes the proof of the lemma.
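The construction used in the proof can be sketched as a recursive procedure that returns the induced synopsis tree directly. The representation is hypothetical: a run on a curried tree is a nested tuple, (state,) for a leaf and (state, left, right) for an application node, and `comp` maps each state to (the name of) its strongly connected component.

```python
def leaf_in_component(node, X, comp):
    """Is there a leaf reachable from `node` through states of X only?"""
    if comp[node[0]] != X:
        return False
    if len(node) == 1:
        return True
    return any(leaf_in_component(child, X, comp) for child in node[1:])


def induced_synopsis(node, comp):
    """Follow a maximal path inside the component of the root state and
    recurse on the two successors at which the component is left."""
    X = comp[node[0]]
    if leaf_in_component(node, X, comp):
        return (X,)  # the whole subtree is seen as one context plus a leaf
    y = node
    while True:
        inside = [child for child in y[1:] if comp[child[0]] == X]
        if not inside:
            break  # both successors of y lie outside X
        y = inside[0]
    return (X, induced_synopsis(y[1], comp), induced_synopsis(y[2], comp))
```

Because no leaf is reachable inside X in the second case, the loop can only stop at an application node, so `y[1]` and `y[2]` are always defined there.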
We also observe the following remark.
Remark 5.3. The height of a primitive synopsis tree of the source automaton S is bounded by the number of components in G S and hence by the number of states of S.
Consequently, PST(S) is a finite set and can be represented in exponential space with respect to the size of S.
To represent the target language and the possible edited trees, one needs a relaxed version of primitive synopsis tree, called the basic synopsis tree, which enforces only the first condition of Definition 5.1. Basic synopsis trees are the analogs of chains of components over dag * (T ) that were used in Theorem 4.1 [Benedikt et al. 2013] to characterize bounded repairability between string languages.
Definition 5.4. A basic synopsis tree of an automaton T is a synopsis tree τ of T that respects the transition function of T (compare to first item of Definition 5.1). We denote by BST(T ) the set of all basic synopsis trees of T .
For example, the tree τ in Figure 8 is a basic synopsis tree that respects the transitions of the run of the automaton T depicted in the right-hand side of Figure 2.
Differently from primitive synopsis trees, basic synopsis trees may contain repeated occurrences of the same component. This implies that the set BST(T ) of all basic synopsis trees of T is potentially infinite. However, this set can be finitely presented by means of a deterministic binary bottom-up tree automaton of size polynomial in the size of T .
The following lemma shows that the language induced by a basic synopsis tree of T is bounded repairable into the language L (T ).
where [[τ]]_T is now seen as a language of unranked trees. In particular,
PROOF. We begin by explaining the main ingredients of the proof. Any tree t ∈ [[τ]]_T can be seen as a composition of contexts, precisely a context C_x for each node x in τ, with C_x ∈ L(T | τ(x)). Every such context can be decorated by a partial run of T that justifies the fact that the context belongs to the language L(T | τ(x)). These partial runs can be used to construct a complete run of T, but only after the insertion of a small number of small pieces of runs, which provide the necessary connections between the partial run of a context C_x and the partial runs of the successor contexts C_{x·1} and C_{x·2}. The existence of these pieces is guaranteed by the fact that the synopsis tree τ respects the transition function of T and by the definition of strongly connected component. We now turn to a more detailed proof.
For every state q of T, we define the automaton T_q = (Σ, Q, δ, δ_0, {q}) that recognizes trees via runs that end with state q at the root. We also define ℓ_{p,q} = min{|C| | q ∈ δ(p, C)} for each pair of states p, q, with q reachable from p in G_T. We take the maximum of all these values (i.e., ℓ = max_{p,q∈Q} ℓ_{p,q}), and we observe that ℓ ≤ 2^{|Q|}. We can associate with each pair p, q ∈ Q such that q is reachable from p a context C_{p,q} of size at most ℓ such that q ∈ δ(p, C_{p,q}). Next we exploit an induction on the size of a basic synopsis tree τ to prove that for all states q ∈ τ(ε),
\[ \mathrm{cost}([\![\tau]\!]_T, L(T_q)) \le 4|\tau| \cdot \ell. \]
Note that to get from the preceding statement to the claim of the lemma, it is sufficient to recall that T is trimmed, and hence any tree in L(T_q) can be repaired into L(T) by simply inserting a context C_{q,q'}, where q' is some state from F.
To prove the statement in the base case τ = X, we consider a generic tree t ∈ [[X]]_T, and we observe that t = C • a for some context C ∈ L(T | X) and some letter a of the alphabet.
Clearly, there exist p', q' ∈ X such that q' ∈ δ(p', C). Since T is trimmed, there exist a symbol b and a state q_0 ∈ δ_0(b) such that p' is reachable from q_0. We define the tree t' = C_{q',q} • C • C_{q_0,p'} • b and observe that t' can be obtained from t (= C • a) by replacing the leaf node a with the tree C_{q_0,p'} • b and by adding the context C_{q',q} at the top. Clearly, t' ∈ L(T_q) and the overall cost of transforming t to t' is at most 4ℓ.
Now, for the inductive step, suppose that τ = X(τ_1, τ_2) and consider t ∈ [[X(τ_1, τ_2)]]_T. By definition, we have t = C • (t_1 @ t_2) for some context C ∈ L(T | X) and some trees t_1 ∈ [[τ_1]]_T and t_2 ∈ [[τ_2]]_T. As before, there exist p', q' ∈ X such that q' ∈ δ(p', C). Since the synopsis tree τ respects the transition function of T, we know that there exist p ∈ X, p_1 ∈ τ_1(ε), and p_2 ∈ τ_2(ε) such that p ∈ δ(p_1, p_2). By inductive hypothesis, there exists t'_1 ∈ L(T_{p_1}) (respectively, t'_2 ∈ L(T_{p_2})) such that dist(t_1, t'_1) ≤ 4|τ_1| · ℓ (respectively, dist(t_2, t'_2) ≤ 4|τ_2| · ℓ). We can then define
\[ t' = C_{q',q} \bullet C \bullet C_{p,p'} \bullet (t'_1 \mathbin{@} t'_2) \]
and claim that t' ∈ L(T_q). Moreover, the preceding tree t' can be obtained from the original tree t = C • (t_1 @ t_2) by first transforming the subtrees t_1 and t_2 into t'_1 and t'_2, respectively, then inserting the context C_{p,p'} between t'_1 @ t'_2 and C, and finally adding the context C_{q',q} at the top. Overall, this transformation costs at most 4|τ_1| · ℓ + 4|τ_2| · ℓ + 2ℓ ≤ 4|τ| · ℓ.
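The arithmetic behind the final inequality of the inductive step can be spelled out as follows. This is a sketch under two assumptions that are not stated explicitly above: the size |τ| counts the nodes of τ, so that |τ| = |τ_1| + |τ_2| + 1, and ℓ is used here as a name for the bound on the size of the connecting contexts C_{p,q}.

```latex
\begin{align*}
4|\tau_1|\cdot\ell + 4|\tau_2|\cdot\ell + 2\ell
  &= 4\,(|\tau_1| + |\tau_2|)\cdot\ell + 2\ell \\
  &= 4\,(|\tau| - 1)\cdot\ell + 2\ell \\
  &= 4|\tau|\cdot\ell - 2\ell \\
  &\le 4|\tau|\cdot\ell .
\end{align*}
```

The term 2ℓ accounts for the two inserted connecting contexts, each of size at most ℓ.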

Coverings
In the previous section, we introduced the concepts of primitive and basic synopsis trees, and we showed that they correspond roughly (i.e., up to boundedly many edits) to trees accepted by the source and target automata, respectively. The remaining part of the puzzle is to relate each primitive synopsis tree of the source automaton S to some basic synopsis tree of the target automaton T , so as to characterize bounded repairability from L (S) to L (T ). This is accomplished by the notion of covering between synopsis trees.
Definition 5.6. Given two stepwise tree automata S and T and two synopsis trees σ of S and τ of T, we say that σ is covered by τ if and only if there is an injective mapping λ from nontrivial nodes of σ to nontrivial nodes of τ such that
(1) λ maps components in a way that is compatible with the languages of contexts, that is, L(S | σ(x)) ⊆ L(T | τ(λ(x))) for every nontrivial node x of σ;
(2) λ preserves the post-order of nontrivial nodes; and
(3) λ preserves the ancestorship of nonhorizontal nodes with their descendants.
We call the mapping λ a covering from σ to τ, and we denote it shortly by λ : σ → τ. Figure 10 presents a covering of a primitive synopsis tree σ of S by a basic synopsis tree τ of T, where the square boxes represent the nontrivial nodes and have double borders when the component is nonhorizontal.
We are now able to state the main characterization theorem of the article.
THEOREM 5.7. Given two stepwise automata S and T, the language L(S) is bounded repairable into the language L(T) if and only if every primitive synopsis tree σ of S is covered by some basic synopsis tree τ of T, namely, for every σ ∈ PST(S) there exist τ ∈ BST(T) and a covering λ : σ → τ.
The proof of the preceding result is given in Section 6. Here we briefly explain the main ideas underlying the definition of covering. We begin by observing that a reasonable strategy for repairing S into T with uniformly bounded cost applies the edit operations only at the "junctions" of the contexts realized by the nontrivial components. Indeed, since nontrivial components of S can realize arbitrarily large repetitions of the same context, we have that either these repetitions do not need any editing at all or they need an arbitrarily large amount of editing. This observation gives an intuitive account for the first condition of Definition 5.6, which enforces containment relationships between languages of contexts realizable within nontrivial components.
As for the other two conditions, it is worth looking at the effect of an edit operation on the curry encoding of an unranked tree t ∈ L(S). Let us consider a node x in t that is about to be deleted by the editing. There is a unique way to represent the curry encoding of t together with the distinguished node x as an expression of the form C • (t' @ (C' • a)), where a is the label of x and C' is a horizontal context representing the forest of subtrees under x. The result of the deletion of node x from t is encoded by the curried tree C • C' • t' (see Figure 4 for an example). Note that this operation does not allow the deletion of the left-most leaf node in the curried tree (this would correspond to deleting the root node in an unranked tree, an operation that is typically prohibited). The operation of inserting a new node y in an unranked tree t can be described in a symmetric way via curry encodings and transpositions of horizontal contexts, that is, given an unranked tree t with curry encoding C • C' • t', where C' is a horizontal context, the curried tree C • (t' @ (C' • a)) represents the unranked tree that results from the insertion of a new a-labeled node y in t having as children the forest represented by C'.
We now observe that the transformations on curried trees that we just described satisfy two crucial properties: (i) they preserve the post-order of the nodes, and (ii) they preserve the ancestorship of nonhorizontal contexts (e.g., the context C of Figure 4) with their descendants. These properties are precisely captured by the last two conditions of Definition 5.6.
We conclude the section by mentioning a strengthening of the "if " direction of Theorem 5.7, which gives an upper bound for the cost of an optimal repair strategy from L (S) to L (T ).

PROOF OF THE MAIN CHARACTERIZATION
The following sections are devoted to proving the two directions of the characterization.

From Covering to Repair
We begin with the proof of the "if" direction of Theorem 5.7. For the rest of the section, we fix two stepwise automata S = (Σ, Q, δ, δ_0, F) and T = (Δ, Q', δ', δ'_0, F') recognizing the source and the target languages, respectively. We then assume that every primitive synopsis tree of S is covered by some basic synopsis tree of T, and we show how to construct a repair strategy from L(S) to L(T) with uniformly bounded cost. The proof basically follows from a series of containments and bounded repairability relations between languages, which can be summarized as follows: L(S) is contained in the union of the languages [[σ]]_S over all σ ∈ PST(S); this union is bounded repairable into the union of the languages [[τ]]_T over the basic synopsis trees τ ∈ BST(T) that cover the trees σ; and the latter union is bounded repairable into L(T). The intermediate bounded repairability relation in the preceding chain is established by the following lemma.
LEMMA 6.1. For every synopsis tree σ of S and every synopsis tree τ of T, if σ is covered by τ, then [[σ]]_S is bounded repairable into [[τ]]_T.
The proof of the preceding lemma is quite technical and will take the entire section. Before entering the details, we briefly discuss how the "if" direction of Theorem 5.7 follows from it. By Lemma 5.2, the source language L(S) is contained in the union of the languages [[σ]]_S induced by all primitive synopsis trees σ ∈ PST(S). By the hypothesis, each of these synopsis trees is covered by some basic synopsis tree τ of the target automaton T. Thus, by Lemma 6.1, each language [[σ]]_S can be repaired with uniformly bounded cost into the language [[τ]]_T for some τ ∈ BST(T). By Lemma 5.5, each language [[τ]]_T can in turn be repaired with uniformly bounded cost into the target language L(T). The result now follows from the fact that there are only finitely many primitive synopsis trees and the fact that bounded repairability is a transitive relation that is moreover preserved by finite unions.
The rest of the section is devoted to the proof of Lemma 6.1. We begin by extending slightly the definition of synopsis tree and by allowing the use of special nodes labeled with ε that represents dummy trivial components. The semantics is extended in the natural way by letting L (A | ε) = {•} (for any stepwise automaton A). Because all trivial components have the same associated language {•}, we shall often identify trivial components of automata with the dummy component ε.
For a technical reason (see the proof of Lemma 6.3 later), we also need to assume that the alphabet of the source automaton S is contained in the alphabet of the target automaton T . Note that this condition can be enforced without loss of generality-that is, without changing the recognized languages.
The first ingredient of the proof shows how to "interpolate" two synopsis trees σ and τ by a third synopsis tree θ of S in such a way that
- θ has the same labels (i.e., components) as σ on the nontrivial nodes, and it covers σ via a bijection between nontrivial nodes of σ and nontrivial nodes of θ that maps any nontrivial node of σ with label X ∈ SCC(S) to a nontrivial node of θ with the same label X (we say that σ is strongly covered by θ); and
- θ has the same domain (i.e., set of nodes) as τ, and it is covered by τ via the identity function between nontrivial nodes (we say that θ is embedded into τ).
It is not difficult to show that such an interpolating synopsis tree θ exists. In other words, if σ is covered by τ , then there is a synopsis tree θ that strongly covers σ and that is embedded in τ .
PROOF. Let λ be a covering function from σ to τ and recall that λ is injective. Let range(λ) denote the set of nodes of τ of the form λ(x) for some x ∈ nodes(σ). Moreover, for every y ∈ range(λ), let λ^{-1}(y) denote the unique node x of σ such that λ(x) = y. The synopsis tree θ has the same domain as τ and the same labels as σ, that is, for all y ∈ nodes(θ) = nodes(τ),
\[ \theta(y) = \begin{cases} \sigma(\lambda^{-1}(y)) & \text{if } y \in \mathrm{range}(\lambda), \\ \varepsilon & \text{otherwise.} \end{cases} \]
Note that λ can be seen as a bijection between the nontrivial nodes of σ and the nontrivial nodes of θ (the latter are precisely the nodes of τ that belong to range(λ)). It follows that σ is strongly covered by θ via the function λ and that θ is embedded into τ via the identity function.
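The construction of the interpolating tree is simple enough to be written down directly. The sketch below reads it as follows: θ is defined on the nodes of τ, a node in the range of λ receives the label of its pre-image in σ, and every other node receives the dummy trivial component ε. Synopsis trees are represented, hypothetically, as dictionaries from node addresses (tuples over {1, 2}) to labels, and λ as a dictionary from nontrivial nodes of σ to nodes of τ.

```python
EPS = "ε"  # the dummy trivial component


def interpolate(sigma, tau, lam):
    """Build theta over the domain of tau with the labels of sigma."""
    inverse = {y: x for x, y in lam.items()}  # well defined: lam is injective
    return {y: (sigma[inverse[y]] if y in inverse else EPS) for y in tau}
```

By construction, the nontrivial nodes of the result are exactly the nodes in the range of λ, provided that λ is defined on nontrivial nodes of σ only.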
A first advantage of considering an interpolating synopsis tree θ that is embedded into τ is that θ and τ have the same structure. As a consequence, we can claim that the language induced by θ is contained in (not just bounded repairable into) the language induced by τ.
LEMMA 6.3. If θ is a synopsis tree of S, τ is a synopsis tree of T, and θ is embedded into τ, then [[θ]]_S ⊆ [[τ]]_T.
PROOF. The proof is by structural induction on θ (or, equally, τ). In the base case, where θ consists of a single node x, we consider the components θ(x) = X and τ(x) = Y. If X is a trivial component, then since the covering function from θ to τ is a bijection between nontrivial nodes, we deduce that Y is also a trivial component, and hence L(S | X) = L(T | Y) = {•}. Now, recall that we assumed that the source alphabet is contained in the target alphabet. From this, it follows that [[θ]]_S ⊆ [[τ]]_T. Otherwise, if X is a nontrivial component, then from the fact that θ is covered by τ, we obtain L(S | X) ⊆ L(T | Y). As before, we conclude that [[θ]]_S ⊆ [[τ]]_T.
For the inductive step, we suppose that θ = X(θ_1, θ_2) and τ = Y(τ_1, τ_2). We consider a generic tree t ∈ [[θ]]_S. By definition, we can write t = C • (t_1 @ t_2) for some context C ∈ L(S | X) and some trees t_1 ∈ [[θ_1]]_S and t_2 ∈ [[θ_2]]_S. We know from the inductive hypothesis that t_i ∈ [[τ_i]]_T for both i = 1 and i = 2. If X is a trivial component, then C is necessarily the trivial context, which is also realizable within the component Y. Otherwise, if X is a nontrivial component, then it must be mapped to the component Y by the embedding function, and hence C ∈ L(T | Y). In both cases, we conclude that t ∈ [[τ]]_T.
Now recall from Section 4 that the bounded repairability relation is transitive and it generalizes containment. In particular, the previous results reduce the statement of Lemma 6.1 to the problem of proving that [[σ]]_S is bounded repairable into [[θ]]_S.
In proving the latter statement, we can take advantage of the fact that σ is strongly covered by θ . In particular, we observe that the strong coverability relation is an equivalence: it is indeed reflexive, symmetric, and transitive (the last two properties follow from the fact that the function that witnesses strong coverability is a bijection between nontrivial nodes that preserves components).
To derive bounded repairability from strong coverability, we associate with each synopsis tree σ of S a suitable normal form σ* that can be used as a canonical representative of the equivalence class of σ induced by the strong coverability relation. We will then prove that [[σ]]_S is bounded repairable into [[σ*]]_S and that [[θ*]]_S is bounded repairable into [[θ]]_S, where each repair strategy can be read off the sequence of generic editing operations that takes σ to its normal form σ* (respectively, θ to its normal form θ*).
In the sequel, we only manipulate synopsis trees of the source automaton S. For this reason, we can omit the subscript S from notations like [[σ ]] S . Next we describe the structure of a synopsis tree in normal form.
Definition 6.4. A synopsis tree σ is in normal form if one of the following cases holds: (1) σ = ε, namely σ consists of a single node labeled with a trivial component; (2) σ = X(α, ε), where X is a nontrivial horizontal component and α is a synopsis tree in normal form; and (3) σ = ε(α, X(β, ε)), where X is a nonhorizontal component and α, β are synopsis trees in normal form.
We observe that the root of a synopsis tree in normal form is a horizontal (possibly trivial) node and its left subtree is also in normal form. In particular, this means that all components along the left-most branch of a synopsis tree in normal form are horizontal.
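The three cases of Definition 6.4 translate directly into a recursive test. In the sketch below, the representation is hypothetical: a component is a pair (name, kind) with kind one of "trivial", "horizontal" (meaning nontrivial horizontal), or "nonhorizontal", and a synopsis tree is a nested tuple (component,) or (component, left, right).

```python
def is_eps(t):
    """A single node labeled with a trivial component."""
    return len(t) == 1 and t[0][1] == "trivial"


def in_normal_form(t):
    if is_eps(t):                                   # case (1): sigma = eps
        return True
    if len(t) != 3:
        return False
    (_, kind), left, right = t
    if kind == "horizontal":                        # case (2): X(alpha, eps)
        return is_eps(right) and in_normal_form(left)
    if kind == "trivial" and len(right) == 3:       # case (3): eps(alpha, X(beta, eps))
        (_, right_kind), beta, right_right = right
        return (right_kind == "nonhorizontal" and is_eps(right_right)
                and in_normal_form(left) and in_normal_form(beta))
    return False


def leftmost_branch_kinds(t):
    """Kinds of the components met along the left-most branch."""
    kinds = [t[0][1]]
    while len(t) == 3:
        t = t[1]
        kinds.append(t[0][1])
    return kinds
```

On trees that pass the test, `leftmost_branch_kinds` only reports "trivial" and "horizontal", in line with the observation above about the left-most branch.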
The following lemma shows that synopsis trees in normal form can be used as canonical representatives of the equivalence classes induced by the strong coverability relation.
LEMMA 6.5. If σ and σ' are two synopsis trees in normal form that strongly cover each other, then σ and σ' are isomorphic.
PROOF. Let σ and σ' be two synopsis trees in normal form, and let λ be a bijection between the nontrivial nodes of σ and the nontrivial nodes of σ' that witnesses the fact that σ is strongly covered by σ'. In the following, we often identify, for the sake of simplicity, the nodes of the synopsis trees σ and σ' with their labels. The proof is by structural induction and case analysis.
For the base case, suppose that σ = ε. Since σ contains only trivial nodes and λ is surjective over nontrivial nodes, σ' contains only trivial nodes as well. Since σ' is in normal form, it follows that σ' = ε.
For the inductive step, we distinguish two cases depending on whether σ is of the form X (σ 1 , ε), where X is a nontrivial horizontal component, or of the form ε(σ 1 , X(σ 2 , ε)), where X is a nonhorizontal component.
In the former case (i.e., σ = X(σ_1, ε)), we recall that the mapping λ is a bijection between nontrivial nodes that preserves the post-order. Because the root X of σ is nontrivial and is the maximal element with respect to the post-order relation, it must be mapped by λ to the root Y of σ'. Moreover, since λ preserves the labels of nontrivial nodes, we have that Y = λ(X) = X. In particular, Y is a nontrivial horizontal component. Since σ' is in normal form, it follows that its right subtree is ε, and hence λ maps the nontrivial nodes of the left subtree σ_1 of σ to the nontrivial nodes of the left subtree σ'_1 of σ'. Finally, since both σ_1 and σ'_1 are synopsis trees in normal form, we obtain from the inductive hypothesis that σ_1 = σ'_1, and hence σ = X(σ_1, ε) = Y(σ'_1, ε) = σ'.
We consider the second case (i.e., σ = ε(σ_1, X(σ_2, ε))), where X is a nonhorizontal component. For the sake of contradiction, suppose that the root Y of σ' is a nontrivial component. Since Y is the last nontrivial node in the post-order traversal of σ' and λ preserves the post-order of nontrivial nodes, the pre-image λ^{-1}(Y) in σ would consist of the last nontrivial node in the post-order traversal of σ, from which λ^{-1}(Y) = X. However, since X is a nonhorizontal component, this would be against the hypothesis that σ' is in normal form (recall that the root of any synopsis tree in normal form is always a horizontal component). Knowing that Y is a trivial node and σ' is in normal form, we obtain Y = ε, and hence σ' is of the form ε(σ'_1, Z(σ'_2, ε)) for some nonhorizontal component Z and two synopsis trees σ'_1 and σ'_2 in normal form. Toward a conclusion, observe that in the post-order traversal of σ, the nontrivial nodes of σ_1 precede the nontrivial nodes of σ_2, and the latter are followed by the nonhorizontal node X. Since λ is a bijection that preserves the post-order of nontrivial nodes and the ancestorship relation with nonhorizontal nodes, we conclude that λ(X) = Z and that the synopsis trees σ_1 and σ'_1 (respectively, σ_2 and σ'_2) strongly cover each other. Using the inductive hypothesis, we finally conclude that σ = ε(σ_1, X(σ_2, ε)) = ε(σ'_1, Z(σ'_2, ε)) = σ'.
Thanks to Lemma 6.5, we can define the normal form σ * of a synopsis tree σ as the unique synopsis tree that is in normal form and that strongly covers σ, provided that this tree exists.
Our next goal is to prove that the normal form σ * of σ indeed exists and that it can be attained by a finite sequence of generic editing operations on synopsis trees. These operations are called promotion, demotion, and reduction, and they are presented in Figure 11. There, ε represents a trivial component, X represents an arbitrary component, H 1 , . . . , H k represent horizontal (possibly trivial) components, and α, β 1 , . . . , β k represent arbitrary synopsis trees. Note that the figure describes the case where promotion, demotion, and reduction operations are applied at the root of a synopsis tree; in general, these operations can be applied to any subtree of a synopsis tree. We write σ → * op σ′ whenever σ′ can be obtained from σ by applying a finite sequence of promotion, demotion, and reduction operations. To give further intuition about these operations, we note an analogy between the operation of promotion, depicted in Figure 11, and the operation of deletion, depicted in Figure 4 (a similar correspondence holds between the operations of demotion and insertion of a new root). In this analogy, the root X of the synopsis tree acts as the context C of the curried tree, the subtree α acts as the curried subtree t′, and the subtree rooted at H 1 acts as the horizontal context C′.
Notice that the editing operations on synopsis trees that we just described preserve the post-order of nontrivial nodes and the ancestorship of nonhorizontal nodes. From this, it follows that they also preserve the strong coverability relation. The following lemma shows that the normal form of a synopsis tree exists and can be obtained via a sequence of promotion, demotion, and reduction operations.

LEMMA 6.6. For every synopsis tree σ, there is σ * in normal form such that σ → * op σ * . Moreover, the number of operations needed to transform σ into σ * is bounded by 2|σ|.
PROOF. The proof goes again by a structural induction on the synopsis tree σ . Intuitively, we first normalize the left and right subtrees of σ separately using induction. Then we complete the normalization process by applying a suitable series of operations on the basis of the component at the root of σ : if this component is nonhorizontal, then we apply a promotion followed by a demotion; if it is horizontal and nontrivial, then we only apply a promotion operation; and if it is trivial, then we apply a promotion followed by a reduction operation.
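As an illustration only, the normalization strategy just outlined can be sketched in Python. The sketch assumes a hypothetical `Node` record whose `kind` field records whether a component is trivial, horizontal, or nonhorizontal; it treats missing children as trivial leaves and realizes a promotion by grafting the normalized left subtree onto the left-most leaf of the normalized right subtree. None of these names come from the paper, and the code is not meant as the authors' construction.

```python
from dataclasses import dataclass
from typing import Optional

TRIVIAL, HORIZONTAL, NONHORIZONTAL = "trivial", "horizontal", "nonhorizontal"


@dataclass(frozen=True)
class Node:
    """Hypothetical binary synopsis-tree node (illustrative names only)."""
    label: str                      # component name; "eps" for a trivial component
    kind: str                       # TRIVIAL, HORIZONTAL, or NONHORIZONTAL
    left: Optional["Node"] = None
    right: Optional["Node"] = None


EPS = Node("eps", TRIVIAL)          # a single trivial leaf


def graft_leftmost(beta: Node, alpha: Node) -> Node:
    """Promotion step: replace the left-most leaf of beta by alpha."""
    if beta.left is None:
        return alpha
    return Node(beta.label, beta.kind, graft_leftmost(beta.left, alpha), beta.right)


def normalize(t: Optional[Node]) -> Node:
    """Normalize both subtrees, promote, then fix the root by cases."""
    if t is None or (t.kind == TRIVIAL and t.left is None and t.right is None):
        return EPS
    alpha, beta = normalize(t.left), normalize(t.right)
    promoted = graft_leftmost(beta, alpha)      # new left subtree of the root
    if t.kind == TRIVIAL:                       # reduction: drop the trivial root
        return promoted
    if t.kind == HORIZONTAL:                    # promotion alone suffices
        return Node(t.label, t.kind, promoted, EPS)
    # nonhorizontal root: demotion places it below a fresh trivial root
    return Node("eps", TRIVIAL, EPS, Node(t.label, t.kind, promoted, EPS))
```

For instance, under these assumptions a nonhorizontal root A with horizontal children B and C is rewritten into a trivial root whose right child is A, with C (now carrying B as its left subtree) as the left subtree of A.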
We can assume without loss of generality that all leaves in the synopsis tree σ are trivial and hence are labeled by ε (indeed, we can append ε-labeled nodes to every leaf of σ without changing its equivalence class). This assumption reduces the base case to the situation where the synopsis tree σ consists of a single ε-labeled node. In this case, the synopsis tree is already in normal form, and hence the lemma is trivially satisfied by letting σ * = ε = σ .
For the inductive step, we assume that σ = X(α, β). First, we transform the subtrees α, β of σ into their corresponding normal forms α * , β * (this can be done since, by inductive hypothesis, α → * op α * and β → * op β * ). We consider the intermediate synopsis tree that we just obtained, namely X(α * , β * ). Since the right subtree β * is in normal form, its left-most branch consists of horizontal components only. We can thus write β * = H 1 (. . . H k (ε, β k ), . . . , β 1 ) for some horizontal (possibly trivial) components H 1 , . . . , H k and some synopsis trees β 1 , . . . , β k . We can then perform a promotion operation at the root and obtain the synopsis tree σ′ = X(H 1 (. . . H k (α * , β k ), . . . , β 1 ), ε).
Using a simple induction on i = k, . . . , 1 and the fact that the subtrees H i (. . . H k (ε, β k ), . . . , β i ) of β * are in normal form, one can easily verify that the subtrees H i (. . . H k (α * , β k ), . . . , β i ) of σ′ are also in normal form. In particular, this shows that the left subtree of σ′ is in normal form.
We now distinguish a few cases depending on whether the component X is horizontal, trivial, or nonhorizontal:
(1) If the component X is horizontal and nontrivial, then σ′ is already in normal form and we can simply let σ * = σ′.
(2) If the component X is trivial, then we "lift" the left subtree of σ′ via a reduction operation. This results in a synopsis tree σ * = H 1 (. . . H k (α * , β k ), . . . , β 1 ) in normal form.
(3) If the component X is nonhorizontal, then we apply a demotion operation to σ′ so as to obtain the synopsis tree σ * = ε(ε, X(H 1 (. . . H k (α * , β k ), . . . , β 1 ), ε)). We observe that both subtrees ε and H 1 (. . . H k (α * , β k ), . . . , β 1 ) are in normal form. Moreover, since X is nonhorizontal, we know that σ * is also in normal form.

We next show that whenever a synopsis tree σ′ is obtained from a synopsis tree σ via one of these operations, the languages [[σ]] and [[σ′]] are repairable one into the other by means of a sequence of editing operations of uniformly bounded length. In other words, a single promotion, demotion, or reduction operation on a synopsis tree σ corresponds to a small number of edits that are applicable to any generic tree in the language [[σ]]. The proof of this result is via a simple analysis of the transformations on unranked trees that are induced by the operations of promotion, demotion, and reduction.

LEMMA 6.7. If σ′ is a synopsis tree obtained from another synopsis tree σ via a single promotion, demotion, or reduction operation, then cost([[σ]], [[σ′]]) ≤ 2 and cost([[σ′]], [[σ]]) ≤ 2.

PROOF. We only prove the first inequality, since the second follows from the fact that standard editing operations on trees can be reverted. In the sequel, we assume that all synopsis trees are related to a stepwise automaton S. We apply a case distinction based on the type of operation that transforms σ into σ′:
(1) Consider a promotion operation, which takes a synopsis tree of the form σ = X(α, H 1 (. . . H k (ε, β k ), . . . , β 1 )) and transforms it into the synopsis tree σ′ = X(H 1 (. . . H k (α, β k ), . . . , β 1 ), ε). Consider also a generic tree t ∈ [[σ]]. For a suitable context C, a suitable horizontal context C′, a tree s, and a symbol a ∈ Σ, we can write t = C • (s @ (C′ • a)). After deleting the a-labeled node from t, we obtain the tree t′ = C • (C′ • s), and after inserting a suitable b-labeled node, we obtain a tree that clearly belongs to [[σ′]].
(2) Consider now a demotion operation, which takes a synopsis tree σ and transforms it into the synopsis tree σ′ = ε(ε, σ). Let t ∈ [[σ]], and let x be the left-most leaf in t (note that this corresponds to the root of the unranked tree ext −1 (t)).
We can write t = C • a, where a is the label of the left-most leaf x of t and C is the horizontal context obtained from t by relabeling x with a placeholder. By applying an insertion operation to t, we obtain the tree t′ = b @ (C • a), which clearly belongs to [[σ′]].
(3) We finally consider a reduction operation, which transforms a synopsis tree σ = ε(α, ε) into the synopsis tree σ′ = α. We can write any generic tree t ∈ [[σ]] as t = s @ a, where s ∈ [[α]] and a ∈ Σ. In this case, it suffices to perform one deletion to obtain the tree t′ = s, which clearly belongs to [[σ′]].
We observe that the repair strategies defined previously can be lifted to trees under any given context C. More precisely, if a tree t ∈ [[σ]] is transformed with editing operations into a tree t′ ∈ [[σ′]], then the tree C • t can be edited into C • t′ using the analogous strategy. This observation is important because the operations of promotion, demotion, and reduction may be applied at arbitrary nodes of synopsis trees.
To conclude the proof, we remark that, thanks to Lemma 6.6, the normal form σ * of any synopsis tree σ can be obtained by applying a sequence of at most 2|σ| promotions, demotions, and reductions.

We conclude the section by proving Proposition 5.8, which essentially gives an upper bound for the cost of an optimal repair strategy from L (S) to L (T ). To prove this proposition, we need to analyze the minimum size of a basic synopsis tree of T that covers a given primitive synopsis tree of S.

LEMMA 6.8. Given a primitive synopsis tree σ of S, if σ is covered by some basic synopsis tree of T , then it is covered by one such tree τ ∈ BST(T ) that has size at most (4|σ| + 1) · |SCC(T )|, where |σ| is the number of nodes of σ and |SCC(T )| is the number of components of T .
PROOF. Let σ ∈ PST(S) and τ ∈ BST(T ) be such that σ → τ, and let λ be the injective function from nontrivial nodes of σ to nontrivial nodes of τ that witnesses σ → τ. We begin by identifying those nodes of τ that belong to the range of λ. Formally, we say that a node y of τ is used if y = λ(x) for some nontrivial node x of σ. Next, we show how to restrict τ to a subset of its nodes having size at most (4|σ| + 1) · |SCC(T )| and such that the induced subgraph is a basic synopsis tree of T that also covers σ.
We first define the set V that only contains the following nodes: (1) the root of τ , (2) the used nodes of τ , and (3) the nodes of τ whose both subtrees contain some used nodes of τ .
We claim that the subgraph of τ induced by V is a tree with at most two children on each node (note that some internal nodes in the induced subgraph V may contain only one child). Consider two nodes y 1 , y 2 in V . Let y be the least common ancestor of y 1 and y 2 in τ . As both subtrees of y in τ contain used nodes-indeed, they contain y 1 and y 2 , respectively-the node y also belongs to V . This shows that V is closed under the least common ancestor, and hence τ restricted to V is a tree with out-degree at most 2. We can also verify that the size of V is at most 2|σ | + 1. Indeed, the number of used nodes in τ is at most |σ |, and so is the number of nodes whose both subtrees contain used nodes.
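The definition of V can be made concrete with a small Python sketch. It assumes a hypothetical encoding of τ as a dictionary `children` that maps each internal node to its pair of children, together with the set `used` of used nodes; both names are illustrative and not taken from the paper.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

# Hypothetical encoding: node -> (left child, right child); leaves are absent.
Children = Dict[Hashable, Tuple[Optional[Hashable], Optional[Hashable]]]


def closure_V(children: Children, root: Hashable, used: Set[Hashable]) -> Set[Hashable]:
    """Collect the root, the used nodes, and every node both of whose
    subtrees contain a used node."""
    V = {root} | set(used)

    def contains_used(node) -> bool:
        # True if the subtree rooted at `node` contains a used node.
        if node is None:
            return False
        left, right = children.get(node, (None, None))
        in_left, in_right = contains_used(left), contains_used(right)
        if in_left and in_right:
            V.add(node)
        return node in used or in_left or in_right

    contains_used(root)
    return V
```

On small inputs one can check directly that the returned set has at most 2·|used| + 1 elements, in line with the bound |V| ≤ 2|σ| + 1 stated above.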
Next we extend the set V minimally in such a way that the induced subgraph of τ is a basic synopsis tree (in particular, it is a binary tree). Formally, we let W be the set of the following nodes: (4) the nodes in V , (5) the nodes of τ with one subtree containing some used nodes and with a label different from that of its parent, and (6) the immediate successors in τ of all previous nodes.
Clearly, the subgraph of τ induced by W, denoted τ | W , is a full binary tree, namely all internal nodes have exactly two children. Essentially, this holds because we included in W the immediate successors of a set of nodes.
It is also easy to see that τ | W is a basic synopsis tree. Indeed, consider a node y ∈ W and its two successors y 1 and y 2 in the induced subgraph τ | W . If both y 1 and y 2 are immediate successors of y in τ , then, since τ respects the transition function δ of T , there exist some states q ∈ τ (y), q 1 ∈ τ (y 1 ), and q 2 ∈ τ (y 2 ) such that q ∈ δ(q 1 , q 2 ). Otherwise, if y 1 is not an immediate successor of y, then all nodes of τ between y and the parent of y 1 must have the same label τ (y) (otherwise, one of these nodes would belong to W, thus contradicting the fact that y 1 is a successor of y in τ | W ). From this, using similar arguments as in the previous case, we conclude that q ∈ δ(q 1 , q 2 ) for some states q ∈ τ (y), q 1 ∈ τ (y 1 ), and q 2 ∈ τ (y 2 ). The case where y 2 is not an immediate successor of y in τ is just symmetric. Overall, this proves that the induced subgraph τ | W respects the transition function of T , and hence it is a basic synopsis tree.
To prove that τ | W covers σ, it suffices to recall that W contains all used nodes of τ, that is, λ(x) ∈ W for all nontrivial nodes x of σ. Hence, the same function λ that witnessed σ → τ can be used to witness σ → τ | W .
It remains to prove that |W| ≤ (4|σ | + 1) · |SCC(T )|. It is easy to see that every node in W\V either has a descendant that is used or is the successor of a node with a used descendant. This means that any subset of W\V containing only nodes that are pairwise incomparable with respect to the ancestor relation has size at most twice the number of used nodes, hence at most 2|σ |. Moreover, if we consider sets of nodes from W\V totally ordered with respect to the ancestor relation, then we observe that such a set has size at most |SCC(T )|: indeed, every two nodes in this set that are consecutive in τ must be labeled with different components. Putting it all together and recalling that |V | ≤ 2|σ |+1, we conclude that |W| ≤ 2|σ |+1+2|σ |·|SCC(T )| ≤ (4|σ |+1)·|SCC(T )|.
We are now ready to derive an upper bound for the cost of an optimal repair strategy from L (S) to L (T ) under the assumption that all primitive synopsis trees of S are covered by basic synopsis trees of T .

PROOF OF PROPOSITION 5.8. Let f be a function that maps every primitive synopsis tree σ of S to a basic synopsis tree τ of T that covers σ. Following the previous results, the strategy for repairing L (S) into L (T ) can be obtained from a series of transformations between languages having the following costs: where Q′ is the set of states of T . In particular, we get As we pointed out previously, any primitive synopsis tree σ of S is of bounded size, precisely |σ| ≤ 2^|Q| , where Q is the set of states of S. As concerns the minimum size of a basic synopsis tree f (σ) that covers σ, by applying Lemma 6.8 we get that for every σ ∈ PST(S), there is τ ∈ BST(T ) such that σ → τ and |τ| ≤ (4|σ| + 1) · |SCC(T )|.

From Repair to Covering
In this section, we prove the "only if" direction of Theorem 5.7. We fix for the rest of the section two stepwise automata S = (Σ, Q, δ, δ 0 , F) and T = (Σ, Q′, δ′, δ′ 0 , F′) recognizing the source and the target languages, respectively. We assume that L (S) is repairable into L (T ) with uniformly bounded cost, and we prove that every primitive synopsis tree of S is covered by some basic synopsis tree of T .

The general idea is to associate with each primitive synopsis tree σ of S a suitable tree t σ ∈ L (S), called the witness tree of σ, such that from any optimal repair of t σ into L (T ) one can extract a basic synopsis tree τ of T that covers σ. Intuitively, the witness tree t σ is obtained from the primitive synopsis tree σ by replacing every nontrivial node x with a sufficiently large number of repetitions of a special context in L (S | σ(x)), called the fingerprint context. The number of repetitions of each fingerprint context will depend on the worst-case repair cost K = dist(L (S), L (T )). Using the definition of the witness tree t σ and the assumption that t σ can be repaired into some tree t′ σ ∈ L (T ) with at most K edits, one can then argue that t′ σ contains at least one copy of the fingerprint context associated with each nontrivial node x of σ, and furthermore that the arrangements of these fingerprint contexts inside t σ and inside t′ σ are the same, both with respect to the post-order relation and with respect to the ancestorship of the nonhorizontal components. One finally looks at some run of T that accepts the tree t′ σ : this run, together with the structure of the fingerprints inside t′ σ , induces a basic synopsis tree τ of T and a coverability relation from σ to τ. Next we illustrate the various definitions and arguments in more detail. We divide the proof into two parts: constructing the witness tree t σ and building the cover from its repair.
Constructing the witness tree. We begin by giving the following lemma, which defines the so-called fingerprint context of a component of S. Basically, the lemma shows that given a component X of S, one can find a context C X that can be "pumped" inside the language L (S | X) (i.e., C X • . . . • C X ∈ L (S | X)) and that characterizes the containment of L (S | X) in the languages L (T | Y ). We say that a context C is cyclic for a component X if there is a state q ∈ X such that q ∈ δ(q, C).

LEMMA 6.9. For all X ∈ SCC(S), there is a cyclic context C X ∈ L (S | X) such that for all Y ∈ SCC(T ), L (S | X) ⊆ L (T | Y ) if and only if C X ∈ L (T | Y ).

PROOF. Let X be a component of S, and let Y 1 , . . . , Y m ∈ SCC(T ) be all components of T . We construct the cyclic context C X by induction on the number m of components of T ; that is, we prove that for every 0 ≤ i ≤ m, there is a cyclic context C i ∈ L (S | X) such that for all 1 ≤ j ≤ i, L (S | X) ⊆ L (T | Y j ) if and only if C i ∈ L (T | Y j ). (⋆)

Clearly, the statement of the lemma follows from (⋆) when we let C X = C m . The base case i = 0 holds vacuously for C i = •, so we focus on the inductive step. Suppose that we defined a context C i that satisfies (⋆) for some index 0 ≤ i < m. To construct a context C i+1 that satisfies (⋆), we need to distinguish two cases, depending on whether L (S | X) ⊆ L (T | Y i+1 ) or not.
If L (S | X) ⊆ L (T | Y i+1 ), then we define C i+1 = C i and observe that (⋆) holds trivially for i + 1. Otherwise, there is a context C ∈ L (S | X) such that C ∉ L (T | Y i+1 ). Since C i is cyclic and C ∈ L (S | X), there exist some states p, q, r ∈ X such that r ∈ δ(r, C i ) and q ∈ δ( p, C). Let C′ and C″ be some other contexts in L (S | X) such that r ∈ δ(q, C′ ) and p ∈ δ(r, C″ ) (such contexts exist since p, q, r are states within the same strongly connected component of S). We then define C i+1 by composing the contexts C i , C″ , C, and C′ so that the resulting context is cyclic. We claim that C i+1 is a cyclic context in L (S | X). Indeed, the runs of the source automaton S that witness r ∈ δ(r, C i ), p ∈ δ(r, C″ ), q ∈ δ( p, C), and r ∈ δ(q, C′ ) can be chained into a cycle that stays inside X. It is also easy to see that C i+1 ∉ L (T | Y i+1 ), since C i+1 contains an occurrence of C. We have just constructed a cyclic context C i+1 ∈ L (S | X) such that L (S | X) ⊆ L (T | Y i+1 ) if and only if C i+1 ∈ L (T | Y i+1 ). To conclude the proof, we recall from the inductive hypothesis that for all 1 ≤ j ≤ i, L (S | X) ⊄ L (T | Y j ) implies that C i ∉ L (T | Y j ), and hence C i+1 ∉ L (T | Y j ) since C i+1 contains an occurrence of C i . All together, this shows that for all 1 ≤ j ≤ i + 1, L (S | X) ⊆ L (T | Y j ) if and only if C i+1 ∈ L (T | Y j ).

For the rest of this section, we fix for each component X of S a context C X that satisfies Lemma 6.9, which we call the fingerprint context. We also fix an arbitrary primitive synopsis tree σ of S.
Next we construct the so-called witness tree t σ by structural induction on the primitive synopsis tree σ. In doing so, we will guarantee that there exists a run of S on t σ that assigns to the root of t σ some state that belongs to the same component that labels the root of σ, namely δ(t σ ) ∩ σ(ε) ≠ ∅. We omit the construction for the base case, where σ is a singleton, as it can be easily derived from what follows. We assume that X is the component at the root of σ and that σ 1 and σ 2 are the (nonempty) left and right subtrees of σ. Suppose that t σ 1 and t σ 2 are the recursively defined witness trees for σ 1 and σ 2 . Moreover, choose arbitrarily some q 1 (respectively, q 2 ) in the nonempty set δ(t σ 1 ) ∩ σ 1 (ε) (respectively, δ(t σ 2 ) ∩ σ 2 (ε)). The witness tree t σ for σ is obtained from the following series of transformations (we suggest that the reader refer to Figure 12 for a graphical representation):
(1) The first transformation merges the trees t σ 1 and t σ 2 into a single tree that induces a run of S ending in the component X. We know from the definition of primitive synopsis tree that σ respects the transition function of S. In particular, this means that there exist some states q ∈ X, q 1 ∈ σ 1 (ε), and q 2 ∈ σ 2 (ε) such that q ∈ δ(q 1 , q 2 ).
(2) The second transformation extends the tree obtained in the previous step in such a way that one can later append repetitions of the fingerprint context C X . This is done by identifying a "recurrent" state q′ such that q′ ∈ δ(q′, C X ) (this state exists since C X is cyclic) and then connecting it to the state q via a suitable context C ∈ L (S | X) such that q′ ∈ δ(q, C) (note that q and q′ belong to the same component X). The resulting tree is obtained by plugging the tree from the previous step into the context C. To prevent an edit of the witness tree t σ from modifying the ancestorship of C X with the nodes of the two subtrees t σ 1 and t σ 2 , we further assume that if X is a nonhorizontal component, then the context C that is used for connecting q to q′ is of the form C K n.h. • C′, where K = dist(L (S), L (T )), C′, C n.h. ∈ L (S | X), C n.h. is some cyclic nonhorizontal context, and C K n.h. is the K-fold repetition of C n.h. (recall that the ancestorship of nonhorizontal contexts is preserved by the editing operations).
(3) The last transformation adds a sufficiently large repetition of the fingerprint context C X . For this, we define H = m · (2K + 1), where K = dist(L (S), L (T )) and m is the number of components of T . We then attach to the tree constructed so far the H-fold repetition C H X of the fingerprint context C X , finally obtaining the desired witness tree t σ .
We observe that, thanks to the preceding constructions, the automaton S admits a run on t σ that ends in state q′ ∈ X. This shows that the invariant δ(t σ ) ∩ X ≠ ∅ is satisfied. We also remark that it may happen that δ(t σ ) ∩ F = ∅, and hence t σ ∉ L (S). Technically speaking, this could violate the claim that one can repair t σ into L (T ) with at most K edits. However, from the assumption that S is trimmed, it follows that there is a context C F such that δ(q′, C F ) ∩ F ≠ ∅, and hence one can always prolong t σ to obtain a tree inside the language L (S). From now on, we assume for the sake of simplicity that t σ ∈ L (S).
Building the covering from a repaired witness tree. We now turn toward extracting a covering of σ from a repair of t σ . We fix, once and for all, the tree t′ σ in the target language L (T ) that is obtained by repairing t σ with at most K edits.
We recall that the witness tree t σ contains H = m · (2K + 1) copies of the fingerprint context C X , for each node x in σ, where X = σ(x). As a consequence, the repaired tree t′ σ must contain an m-fold repetition C m X of each fingerprint context C X . In the following, we will look at the occurrences of these fingerprint contexts inside the repaired tree t′ σ and compare their post-order and ancestor relationships with those for the analogous occurrences in t σ . We need some preliminary definitions.
Given a context C and a node x of a tree t, we say that C occurs at node x if there exist a context C′ and a tree t′ such that (i) t can be written as C′ • C • t′ and (ii) x is the node from where the subtree C • t′ of t hangs. We denote an occurrence of a context C at a node x of a tree t by the pair (C, x). Furthermore, we say that two occurrences (C, x) and (C′, x′) of two contexts inside the same tree t are nonoverlapping if {x} · nodes(C) ∩ {x′} · nodes(C′) = ∅, and that they are in post-order relation (respectively, in ancestor relation) if so are the nodes x and x′. The following lemma shows that the occurrences of the contexts C m X inside t′ σ are in the same post-order relation as some corresponding occurrences inside t σ , and similarly for the ancestor relation when X is a nonhorizontal component.

LEMMA 6.10. One can find a mapping f from the nontrivial nodes x of the primitive synopsis tree σ to the nodes f (x) of the repaired witness tree t′ σ such that
(1) for every nontrivial node x of σ with X = σ(x), the context C m X occurs at node f (x) of t′ σ ;
(2) for all nontrivial nodes x, y of σ, x post σ y if and only if f (x) post t′ σ f (y);
(3) for all nontrivial nodes x, y of σ such that σ(x) is a nonhorizontal component, x anc σ y if and only if f (x) anc t′ σ f (y).

PROOF. We begin by establishing a property that concerns the occurrences of contexts in a tree that has been edited. Intuitively, the following claim implies that if t′ is a tree obtained from t by applying at most K edit operations and t contains an occurrence of the (2K + 1)-fold repetition of a context C, then t′ contains at least one occurrence of the same context C.

CLAIM 1. Let t be a curried tree, and let t′ be the curried tree obtained from t after a deletion or an insertion of a single node. If t contains at least n nonoverlapping occurrences of the same context C, then t′ contains at least n − 2 nonoverlapping occurrences of C.
PROOF. We prove the claim for the deletion operation only, as the arguments for the case of an insertion are similar. Let x be the node that is deleted from t. As mentioned in Section 5.3, there is a unique way to represent the deletion of x using composition of trees and contexts, namely we can write where a is the label of node x and C is the horizontal context that represents the forest of subtrees under x. The result of the deletion of node x gives the curried tree Note that the deletion operation, performed on curry encodings, removes exactly two nodes: the a-labeled leaf that corresponds to x and the @-labeled node y that connects t to C •a. All other nodes are preserved (but possibly rearranged) by this transformation. In particular, this means that if (C, z) is an occurrence of the context C in t that does not overlap with the a-labeled node x nor with the @-labeled node y, then C occurs in either C , C , or t . This shows that C occurs at least once in t . In general, suppose that there are n nonoverlapping occurrences of C in t. Since the deletion operation affects only two nodes, x and y, we have that, in the worst-case, all but two of these occurrences of C can be found in t , and hence t contains at least n − 2 occurrences of C. Finally, it is easy to see that the deletion operation preserves the property of occurrences of being nonoverlapping.
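The counting behind the choice H = m · (2K + 1) can be checked with a few lines of Python. This is only an arithmetic sketch under the bound of Claim 1 (each single-node edit removes at most two nonoverlapping occurrences); the function names are hypothetical.

```python
def repetitions(m: int, K: int) -> int:
    """H = m * (2K + 1): copies of the fingerprint context in the witness tree."""
    return m * (2 * K + 1)


def guaranteed_survivors(n: int, edits: int) -> int:
    """Lower bound from Claim 1: every edit removes at most two of the
    n nonoverlapping occurrences."""
    return max(0, n - 2 * edits)


def surviving_blocks(m: int, K: int) -> int:
    """Number of nonoverlapping m-fold blocks guaranteed to survive K edits."""
    blocks = repetitions(m, K) // m      # 2K + 1 nonoverlapping m-fold blocks
    return guaranteed_survivors(blocks, K)
```

Under these assumptions `surviving_blocks(m, K)` evaluates to 1 for every m ≥ 1 and K ≥ 0, which is why 2K + 1 blocks of length m suffice.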
We continue now with the proof of Lemma 6.10. We consider a nontrivial node x of the primitive synopsis tree σ and let X = σ(x) be the associated component. By definition of t σ , we know that the context C H X occurs in t σ . We also recall that H = m · (2K + 1). In particular, t σ contains 2K + 1 nonoverlapping occurrences of the context C m X . As t′ σ is obtained from t σ by applying at most K edit operations to it, we know from the preceding claim that the context C m X occurs at least once in t′ σ . We denote by f (x) some node of t′ σ where C m X occurs. We have just proved the first part of the lemma.

For the second part, let x and y be two distinct nontrivial nodes of σ, and let X = σ(x) and Y = σ(y). Furthermore, let f (x) and f (y) be the nodes in t′ σ where the contexts C m X and C m Y occur. Note that the occurrences of these two contexts in t σ are nonoverlapping. Since deletion and insertion operations preserve the property of context occurrences of being nonoverlapping, we have that (C m X , f (x)) and (C m Y , f (y)) are also nonoverlapping occurrences in t′ σ . Now suppose that x post σ y. Let x′ and y′ be the nodes in t σ that carry the corresponding occurrences of the contexts C H X and C H Y , respectively. It is routine to check, by exploiting the recursive definition of t σ , that x′ post t σ y′. In addition, we know that edit operations preserve the post-order relationships between nodes. As discussed previously, at least one occurrence of C m X (respectively, C m Y ) inside C H X (respectively, C H Y ) is not affected by the edit operations that transform t σ into t′ σ . This means that the corresponding occurrences (C m X , f (x)) and (C m Y , f (y)) in t′ σ are in the same post-order relationship as x′ and y′, namely f (x) post t′ σ f (y). The converse implication follows from the fact that post σ is a total order (hence, it is sufficient to swap the roles of x and y above).
We finally check the last condition. Suppose that X = σ(x) is a nonhorizontal component and that x anc σ y. As before, let x′ and y′ be the nodes in t σ that carry the occurrences of C H X and C H Y , respectively. Thanks to the construction of t σ , we have x′ anc t σ y′. Moreover, recall that during the construction of t σ , we inserted K copies of the nonhorizontal context C n.h. immediately below C H X (and thus above C H Y ). This implies that the path in t σ that connects the node x′ to its descendant y′ visits at least K right edges. If we now look at the unranked tree ext −1 (t σ ) encoded by t σ , we observe that there is a bijection between the right edges in t σ and the vertical edges in ext −1 (t σ ), and this bijection preserves the ancestor order. This means that the two portions of the unranked tree ext −1 (t σ ) that are encoded by C H X and C H Y , namely ext −1 (C H X ) and ext −1 (C H Y ), are separated by at least K vertical edges. Finally, each deletion or insertion operation performed on ext −1 (t σ ) can only bring two nodes closer by one level at a time. This means that after at most K edit operations, in the resulting curried tree t′ σ the occurrence of C m X at node f (x) is still above the occurrence of C m Y at node f (y). We have just proved that x anc σ y implies f (x) anc t′ σ f (y). The converse implication follows by symmetric arguments.
It now remains to show how to extract a basic synopsis tree τ that covers σ from the tree t′ σ . Recall that t′ σ ∈ L (T ), and let ρ be an accepting run of T on t′ σ . Further, let f be the mapping from nontrivial nodes of σ to nodes of t′ σ , as defined in Lemma 6.10. Consider a nontrivial node x of σ, and let X be its label. We know that the context C m X occurs at node f (x) in t′ σ . Let y be the node of the fingerprint context C X that is labeled with the placeholder symbol •. Clearly, C X occurs in its m-fold iteration C m X at positions ε, y, y · y, . . . , y m−1 . Analogous (nonoverlapping) occurrences exist in t′ σ , namely at positions y x,0 = f (x), y x,1 = f (x) · y, y x,2 = f (x) · y · y, . . . , y x,m−1 = f (x) · y m−1 . For convenience, let y x,m = f (x) · y m .
Next we consider the states that occur at the m + 1 nodes y x,0 , y x,1 , . . . , y x,m of the run ρ on t′ σ . By the pigeonhole principle, we know that two among these states, say ρ(y x,i ) and ρ(y x, j ) for some 0 ≤ i < j ≤ m, belong to the same component Y of T . In fact, from the definition of strongly connected component, we can even assume that j = i + 1, from which we immediately obtain C X ∈ L (T | Y ). It is now time to exploit the property of the fingerprint context C X , which is shown in Lemma 6.9. In particular, from the fact that C X ∈ L (T | Y ), we derive that L (S | X) ⊆ L (T | Y ).
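The pigeonhole step can be illustrated by a short Python sketch. It assumes that the components of the states ρ(y x,0 ), . . . , ρ(y x,m ) are given as a list (a hypothetical input format); the function returns the first pair of positions that carry the same component.

```python
from typing import Hashable, Optional, Sequence, Tuple


def repeated_component(components: Sequence[Hashable]) -> Optional[Tuple[int, int]]:
    """Return (i, j) with i < j such that positions i and j carry the same
    component, or None if all entries are pairwise distinct."""
    first_seen = {}
    for j, comp in enumerate(components):
        if comp in first_seen:
            return first_seen[comp], j
        first_seen[comp] = j
    return None
```

Whenever the list has m + 1 entries drawn from at most m components, the result is never `None`; the refinement to j = i + 1 relies on the properties of strongly connected components and is not captured by this sketch.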
We have just shown that it is possible to find a mapping from any nontrivial node x of σ to a node y x (= y x,i ) in t′ σ such that L (S | X) ⊆ L (T | Y ), where X = σ(x) and Y is the component of the state ρ(y x ). Thanks to Lemma 6.10, we can also claim that this mapping satisfies three properties: it sends distinct nontrivial nodes of σ to distinct nodes of t′ σ , it preserves the post-order, and it preserves the ancestorship of the nodes labeled by nonhorizontal components.

Toward a conclusion, we now define the basic synopsis tree τ that covers σ. The domain of τ coincides with the domain of t′ σ (i.e., nodes(τ) = nodes(t′ σ )). The labeling function of τ maps every trivial node x of τ to the trivial component ε and every nontrivial node x to the component that contains the state ρ(y x ) associated with the corresponding node y x in t′ σ . It is easy to see that τ satisfies the properties of a basic synopsis tree (in particular, it respects the transitions of T because its labeling is essentially the lifting of a valid run ρ of T ). It remains to define the mapping λ that witnesses the coverability of σ by τ: for this, we simply let λ(x) = y x for every nontrivial node x of σ. The fact that λ satisfies Definition 5.6 follows easily from the properties described by the preceding three items (e.g., the fact that λ is injective follows from the first item). This proves that every primitive synopsis tree of S is covered by a basic synopsis tree of T .

COMPLEXITY ANALYSIS
In this section, we investigate the complexity of deciding whether a regular tree language S is bounded repairable into a regular tree language T , and we assume that S and T are represented by automata S and T , respectively. One can propose a straightforward decision procedure following the characterization of bounded repairability with synopsis trees (Theorem 5.7): for every primitive synopsis tree of S, it suffices to guess a covering basic synopsis tree of T . We also recall from Remark 5.3 that the size of a primitive synopsis tree of S is bounded by a function exponential in the number of strongly connected components of S. Hence, by Lemma 6.8, an analogous bound holds for the size of basic synopsis trees of T . It is also easy to see that testing the covering of a primitive synopsis tree by a basic synopsis tree can be performed efficiently in the size of the synopsis trees. These observations show that deciding bounded repairability for languages represented by tree automata is in EXPSPACE.
We show, however, that a more efficient procedure exists: rather than inspecting individual elements of PST(S) and verifying that they are covered by elements of BST(T ), it checks inclusion of the sets of normalized synopsis trees. More precisely, we first relabel synopsis trees in BST(T ) with compatible connected components of S, and as a result, we deal with synopsis trees labeled with elements of SCC(S). Next, we define a serialization of a synopsis tree, that is, a string representation of the synopsis tree, and show that the serialization of a synopsis tree is the same as the serialization of its normal form. Naturally, this reduces bounded repairability to the inclusion of the serializations of PST(S) and BST(T ), respectively. Testing this inclusion is not trivial: although both sets PST(S) and BST(T ) can be captured with tree automata, their serialized versions are string languages that need not be regular. We show, however, that serializations of flattened versions of PST(S) and BST(T ) can be captured with context-free grammars and, moreover, that the context-free grammar for PST(S) is nonrecursive. We then use existing results on testing inclusion of a nonrecursive context-free grammar in another (possibly recursive) context-free grammar and obtain a CONEXP upper bound. Finally, we show that the CONEXP upper bound is tight even if we restrict the tree languages provided as input to those definable with deterministic nonrecursive DTDs.

Upper Bound
We show that the complexity of testing bounded repairability between two regular tree languages represented with tree automata is in CONEXP. For the remainder of this section, we fix automata S and T that recognize the source tree language and the target tree language, respectively.
We begin by recalling the notion of embedding from Section 6.1. Given a synopsis tree θ of S and a synopsis tree τ of T, we say that θ is embedded into τ if θ and τ have the same domain (i.e., nodes(θ) = nodes(τ)) and θ is covered by τ via the identity function (i.e., L(S | θ(x)) ⊆ L(T | τ(x)) for all nodes x). We define the set Emb_S(τ) = {θ synopsis tree of S : θ is embedded into τ} of all synopsis trees of S that are embedded into τ, and we extend the notation to any set W of synopsis trees by letting Emb_S(W) = ⋃_{τ ∈ W} Emb_S(τ). Now we introduce a variant of the notion of serialization for synopsis trees, and we show that it can be used as an alternative representation of the normal form that we introduced in Section 6. Such a serialization takes a synopsis tree θ and produces a well-nested word θ̂ over the alphabet tags(S), which consists of opening tags of the form ⟨X⟩ and closing tags of the form ⟨/X⟩, with X ∈ SCC(S). It is important to remark that the serialization θ̂ does not represent the specific tree θ, but rather the class of synopsis trees that have the same normal form as θ. Formally, we define the serialization θ̂ of a synopsis tree θ of S recursively as follows:
θ̂ = θ̂_1 · θ̂_2 if θ = X(θ_1, θ_2) and X is a trivial component,
θ̂ = θ̂_1 · θ̂_2 · ⟨X⟩ · ⟨/X⟩ if θ = X(θ_1, θ_2) and X is a nontrivial horizontal component,
θ̂ = ⟨X⟩ · θ̂_1 · θ̂_2 · ⟨/X⟩ if θ = X(θ_1, θ_2) and X is a nonhorizontal component.
Note that the trivial components disappear in the serialization θ̂ of a synopsis tree θ. As usual, we extend serializations to sets of synopsis trees by letting Û = {θ̂ : θ ∈ U}. It is easy to see that serializations are unaffected by the editing operations on synopsis trees that are used to attain the normal form.
PROOF. By induction, it suffices to show that applying a single editing operation does not change the serialization. A quick inspection of the definitions of the editing operations of promotion, demotion, and reduction (see Figure 11) shows that the claim holds trivially.
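The recursive definition of the serialization can be illustrated with a minimal Python sketch. The class names `Component` and `Synopsis`, the string-valued `kind` field, and the convention that a missing child contributes the empty word are assumptions made for this illustration only; they are not part of the formal development.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    name: str
    kind: str  # "trivial", "horizontal" (nontrivial horizontal) or "nonhorizontal"


@dataclass(frozen=True)
class Synopsis:
    label: Component
    left: Optional["Synopsis"] = None
    right: Optional["Synopsis"] = None


def serialize(theta: Optional[Synopsis]) -> List[str]:
    """Return the serialization of theta as a list of opening and closing tags."""
    if theta is None:  # assumption: an absent child contributes the empty word
        return []
    x = theta.label
    inner = serialize(theta.left) + serialize(theta.right)
    if x.kind == "trivial":
        # Trivial components leave no trace in the serialization.
        return inner
    if x.kind == "horizontal":
        # The tags of a nontrivial horizontal component follow its subtrees.
        return inner + [f"<{x.name}>", f"</{x.name}>"]
    if x.kind == "nonhorizontal":
        # The tags of a nonhorizontal component enclose its subtrees.
        return [f"<{x.name}>"] + inner + [f"</{x.name}>"]
    raise ValueError(f"unknown component kind: {x.kind}")
```

For instance, a nonhorizontal root whose subtrees contain only trivial and horizontal components serializes to a single pair of enclosing tags wrapped around the tag pairs of the horizontal components.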
The preceding lemma, together with the results proven in Section 6.1, implies the following.
COROLLARY 7.2. L(S) is bounded repairable into L(T) if and only if Û ⊆ V̂, where U = PST(S) and V = Emb_S(BST(T)).
PROOF. To prove the left-to-right implication, suppose that L(S) is bounded repairable into L(T), and consider a primitive synopsis tree σ ∈ U. We know from Theorem 5.7 that σ is covered by some basic synopsis tree τ of T. Moreover, by Lemma 6.2, there exists a synopsis tree θ of S that "interpolates" σ and τ, namely such that σ is strongly covered by θ and θ is embedded into τ. In particular, θ belongs to the set Emb_S(τ), which is contained in V. Moreover, since σ is strongly covered by θ, we know from Lemma 6.5 that the normal forms of σ and θ coincide, and hence by Lemma 7.1 we have σ̂ = θ̂. We conclude that σ̂ belongs to V̂.
The proof of the converse direction is symmetric: we assume that Û ⊆ V̂, we consider a primitive synopsis tree σ of S, and we prove that σ is covered by some basic synopsis tree of T. Indeed, since σ ∈ U and Û ⊆ V̂, we have σ̂ ∈ V̂. This means that there exist a synopsis tree θ of S and a basic synopsis tree τ of T such that θ is embedded into τ and θ̂ = σ̂. In particular, Lemma 7.1 implies that θ and σ have the same normal form, and hence they strongly cover each other. It follows that σ is strongly covered by θ, which in turn is embedded into τ. We conclude that σ is covered by τ, and hence by Theorem 5.7, L(S) is bounded repairable into L(T).
We conclude the section by showing how to effectively test the inclusion from Corollary 7.2. For this, we introduce some context-free grammars that capture the languages Û and V̂, where U = PST(S) and V = Emb_S(BST(T)). We can define these grammars on the basis of the components of S and T and the transitions of S and T lifted to these components. More precisely, the grammar G_S that defines the language Û uses nonterminals X, X_1, X_2, … that correspond to components of S and rules of the forms
X ::= ⟨X⟩ ⟨/X⟩,
X ::= X_1 X_2 if X is a trivial component,
X ::= X_1 X_2 ⟨X⟩ ⟨/X⟩ if X is a nontrivial horizontal component,
X ::= ⟨X⟩ X_1 X_2 ⟨/X⟩ if X is a nonhorizontal component,
where X_1 ≠ X ≠ X_2, q ∈ δ(q_1, q_2) for some q ∈ X, q_1 ∈ X_1, q_2 ∈ X_2, and δ is the transition function of S. The grammar G_{S,T} that defines the language V̂ uses the same nonterminals X, X_1, X_2 ∈ SCC(S) and the same rules as earlier, but instead of enforcing X_1 ≠ X ≠ X_2 and q ∈ δ(q_1, q_2) for some q ∈ X, q_1 ∈ X_1, q_2 ∈ X_2, it requires that there exist some components Y, Y_1, Y_2 of the target automaton T such that (i) L(S | X) ⊆ L(T | Y), (ii) L(S | X_1) ⊆ L(T | Y_1), (iii) L(S | X_2) ⊆ L(T | Y_2), and (iv) q ∈ γ(q_1, q_2) for some q ∈ Y, q_1 ∈ Y_1, and q_2 ∈ Y_2, where γ is the transition function of T.
Although testing the inclusion of two generic context-free languages is known to be undecidable [Hopcroft and Ullman 1979], here we can exploit the fact that the grammar G_S is nonrecursive to decide the inclusion L(G_S) ⊆ L(G_{S,T}). Indeed, a nonrecursive grammar defines a finite language of words whose lengths are uniformly bounded by an exponential in the size of the grammar. Consequently, a nondeterministic Turing machine can guess a word w of length at most exponential in the size of G_S and decide the noncontainment L(G_S) ⊈ L(G_{S,T}) by checking that w ∈ L(G_S) and w ∉ L(G_{S,T}). It is also known that the membership problem of context-free languages can be solved in time polynomial in the size of the grammar and the length of the word [Younger 1967; Earley 1970]. The only subtlety here is that although the grammars G_S and G_{S,T} are of polynomial size with respect to S and T, constructing them takes exponential time in the size of S and T: indeed, the definition of G_{S,T} requires checking some containment relationships between the languages recognized by the components of S and T. The latter problem is EXP-complete [Seidl 1990], and hence its cost is dominated by the time that is required to guess the word w. We can thus claim the following complexity upper bound for the bounded repair problem.
THEOREM 7.3. The bounded repair problem between languages represented by stepwise tree automata is in CONEXP.
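The inclusion test can be mimicked deterministically on small inputs, as in the following Python sketch. It is only an illustration under stated assumptions: grammars are encoded as dictionaries mapping each nonterminal to a list of right-hand sides (tuples of symbols), every symbol that is not a key is treated as a terminal, and the nondeterministic guess of the word w is replaced by an exhaustive enumeration of the finite language of the nonrecursive grammar. The function names are hypothetical.

```python
from functools import lru_cache
from itertools import product


def finite_language(grammar, start):
    """All words of a NONRECURSIVE grammar (a recursive one would not terminate)."""
    @lru_cache(maxsize=None)
    def words(sym):
        if sym not in grammar:  # terminal symbol
            return frozenset({(sym,)})
        out = set()
        for rhs in grammar[sym]:
            for combo in product(*(words(s) for s in rhs)):
                out.add(tuple(t for part in combo for t in part))
        return frozenset(out)
    return words(start)


def derives(grammar, start, word):
    """Membership test for an arbitrary context-free grammar (fixpoint over spans)."""
    n = len(word)
    # A fact (sym, i, j) states that sym derives word[i:j].
    facts = {(a, i, i + 1) for i, a in enumerate(word) if a not in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                for i in range(n + 1):
                    ends = {i}  # positions reachable from i by matching rhs so far
                    for sym in rhs:
                        ends = {k for j in ends for k in range(j, n + 1)
                                if (sym, j, k) in facts}
                    for k in ends:
                        if (nt, i, k) not in facts:
                            facts.add((nt, i, k))
                            changed = True
    return (start, 0, n) in facts


def find_counterexample(g_s, start_s, g_t, start_t):
    """Return a word of L(g_s) outside L(g_t), or None if the inclusion holds."""
    for w in sorted(finite_language(g_s, start_s)):
        if not derives(g_t, start_t, w):
            return w
    return None
```

The enumeration is exponential in the worst case, which matches the length bound on the guessed word; the membership test itself runs in time polynomial in the grammar size and the word length.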

Lower Bound
Here we show that the complexity bound established in Theorem 7.3 is tight. More precisely, we prove a matching CONEXP lower bound for the bounded repair problem, which remarkably holds even for tree languages represented by nonrecursive deterministic DTDs.
We recall the results in Champavère et al. [2009], particularly Proposition 4 and Theorem 5, which show that any deterministic DTD can be transformed, in polynomial time, into an equivalent deterministic stepwise automaton. This means that the complexity lower bound for the bounded repair problem of languages represented by (nonrecursive) deterministic DTDs can be immediately transferred to languages represented by deterministic stepwise automata.
We also recall the folklore PSPACE upper bound for the containment problem of nondeterministic DTDs: given two DTDs D and D′, one can decide whether the language defined by D is contained in the language defined by D′ by first removing the useless rules and then checking that for all letters a in the alphabet of D, the regular language associated with a in D is contained in the regular language associated with a in D′. This upper bound is tight due to the PSPACE-hardness of containment of regular expressions [Stockmeyer and Meyer 1973]. We finally observe that the complexity of the containment problem drops to P as soon as deterministic DTDs are considered. Interestingly, the situation is completely different for the complexity of the bounded repair problem.
THEOREM 7.4. The bounded repair problem between languages represented by nonrecursive deterministic DTDs is CONEXP-hard.
PROOF. The proof is by a reduction from the problem of tiling a square grid of exponential size [Boas 1997]. An instance of the latter problem is given by a tuple I = (n, S, H, V, s_⊥, s_⊤), where n is a natural number encoded in unary and representing the width 2^n of the square grid, S is a finite set of tiles, H, V ⊆ S × S are the sets of horizontal and vertical constraints, and s_⊥ and s_⊤ are the tiles that should mark the lower left and upper right corners. A tiling is a function f mapping pairs (i, j) ∈ {1, …, 2^n} × {1, …, 2^n} to tiles f(i, j) ∈ S. We say that a tiling f satisfies the constraints of I if the following conditions are satisfied: (1) f(1, 1) = s_⊥ and f(2^n, 2^n) = s_⊤, (2) (f(i, j − 1), f(i, j)) ∈ H for all 1 ≤ i ≤ 2^n and all 1 < j ≤ 2^n, and (3) (f(i − 1, j), f(i, j)) ∈ V for all 1 < i ≤ 2^n and all 1 ≤ j ≤ 2^n.
The exponential tiling problem is the problem of deciding whether there exists a tiling f that satisfies all constraints in a given instance I. This problem is known to be NEXP-complete [Boas 1997]. Now we fix an instance I = (n, S, H, V, s_⊥, s_⊤) of the exponential tiling problem, and we construct some regular languages S, T of unranked trees such that S is bounded repairable into T if and only if there is no tiling satisfying the constraints of I. The basic idea is to let the source language S contain encodings of the possible tilings and the target language T contain modified encodings that expose violations of some constraint of I. The intended relation between S and T can be phrased as follows: if every tree in S can be transformed into a tree in T with a small (i.e., uniformly bounded) number of edits, then every tiling of the exponential grid violates some constraint in I, and vice versa. To prevent a repair strategy from modifying the encoded tiling with a bounded number of edits, we will allow some redundancy in our encodings. For convenience, we first describe the languages S and T as if they were given by means of stepwise automata of polynomial size with respect to the instance I. Toward the end of the proof, we will show how to modify the constructions to get languages representable by nonrecursive deterministic DTDs of polynomial size.
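The three conditions on a tiling translate directly into a checker, sketched below in Python for illustration. The encoding of a tiling as a dictionary from pairs (i, j) to tiles and the function names `satisfies` and `exists_tiling` are assumptions of this sketch; the brute-force search is meant only for tiny values of n, since the grid has 2^n × 2^n cells.

```python
from itertools import product


def satisfies(n, f, H, V, s_bot, s_top):
    """Check conditions (1)-(3) for a tiling f given as a dict {(i, j): tile}."""
    N = 2 ** n
    # (1) corner tiles
    if f[(1, 1)] != s_bot or f[(N, N)] != s_top:
        return False
    # (2) horizontal constraints between (i, j - 1) and (i, j)
    horizontal_ok = all((f[(i, j - 1)], f[(i, j)]) in H
                        for i in range(1, N + 1) for j in range(2, N + 1))
    # (3) vertical constraints between (i - 1, j) and (i, j)
    vertical_ok = all((f[(i - 1, j)], f[(i, j)]) in V
                      for i in range(2, N + 1) for j in range(1, N + 1))
    return horizontal_ok and vertical_ok


def exists_tiling(n, tiles, H, V, s_bot, s_top):
    """Brute-force search over all tilings; feasible only for very small n."""
    N = 2 ** n
    cells = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]
    for choice in product(sorted(tiles), repeat=len(cells)):
        if satisfies(n, dict(zip(cells, choice)), H, V, s_bot, s_top):
            return True
    return False
```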
Source language. We begin by describing the trees in the language S. For the sake of brevity, we let N = 2^n be the width of the grid to be tiled and consider a generic tiling f : {1, …, N} × {1, …, N} → S. A tree that encodes the tiling f is labeled over an alphabet consisting of tiles in S, separator symbols [ and ], and a dummy symbol #. Each cell (i, j) in the grid is encoded by a series of consecutive leaves that spell out a word of the form [ … [ f(i, j) … f(i, j) ] … ], where the symbol f(i, j) occurs at least once and the square brackets are not necessarily well parenthesized. The repetitions of the symbols f(i, j) are used to ensure robustness to any repair strategy of bounded cost. From now on, such repetitions will be simply represented by a superscript +. The above word encoding a cell (i, j) is called a cell block. Cell blocks are then juxtaposed to form the frontier of a tree, following the left-to-right bottom-to-top order of the corresponding cells in the grid:
[⁺ f(1, 1)⁺ ]⁺ [⁺ f(1, 2)⁺ ]⁺ … [⁺ f(1, N)⁺ ]⁺ [⁺ f(2, 1)⁺ ]⁺ … [⁺ f(N, N)⁺ ]⁺
Finally, #-labeled internal nodes are introduced to guarantee that the frontier is well formed, namely that it contains exactly N rows, each one consisting of N cell blocks. This can be done by enforcing the existence of 2n + 1 levels above the frontier and by requiring that each internal node at level ℓ = 0, …, 2n − 1 has exactly two children, and each internal node at level 2n has a cell block as children (see Figure 13).
The source language S is defined as the set of all tree-shaped encodings of tilings f that satisfy the first constraint of I, namely those tilings f such that f(1, 1) = s_⊥ and f(N, N) = s_⊤. The language S is clearly regular. Furthermore, it is not difficult to construct a stepwise automaton S that recognizes S and has size polynomial in n and |S|, and hence also in |I| (we omit the formal definition of such an automaton).
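To make the redundancy in the frontier concrete, the following Python sketch builds the sequence of leaf labels for a tiling and recovers the tiling from it. It is a simplification under stated assumptions: every symbol of every cell block is repeated the same number k of times (the encoding itself allows arbitrary positive repetition counts), tiles are strings different from the two brackets, and the names `frontier` and `decode` are hypothetical.

```python
from itertools import groupby


def frontier(n, f, k=1):
    """Leaf labels of the encoding of tiling f, each symbol repeated k times."""
    N = 2 ** n
    word = []
    for i in range(1, N + 1):        # rows, bottom to top
        for j in range(1, N + 1):    # cells of a row, left to right
            word += ["["] * k + [f[(i, j)]] * k + ["]"] * k
    return word


def decode(word):
    """Recover the sequence of tiles by collapsing runs and dropping brackets."""
    return [s for s, _ in groupby(word) if s not in ("[", "]")]
```

Because `decode` only looks at maximal runs of equal symbols, its result does not depend on k; this is the sense in which the repetitions do not change the encoded tiling.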
Target language. We now turn to the target language T , which intuitively contains encodings of S modified in a suitable way so as to expose violations of horizontal or vertical constraints, which can then be checked by an automaton of small size.
We begin by analyzing the simpler case of a tiling f that violates a horizontal constraint, say between two tiles f(i, j − 1) and f(i, j). Observe that in the frontier of every tree of S that encodes f, the violating tiles are represented by two consecutive cell blocks of the form [⁺ f(i, j − 1)⁺ ]⁺ and [⁺ f(i, j)⁺ ]⁺. It is then convenient to expose the violation at the least common ancestor of these two cell blocks, which must occur at some level ℓ ∈ {n, …, 2n − 1}. For example, this can be done by relabeling the least common ancestor with the pair (f(i, j − 1), f(i, j)) (∉ H). In this case, the modified encoding looks like the unranked tree in Figure 14 (for the sake of clarity, we highlighted the cell blocks corresponding to the tiles that violate the horizontal constraint). We denote by T_H the language of all trees that can be obtained by relabeling a node of some tree in S as described earlier. We observe that the language T_H is regular, and furthermore, one can construct a stepwise automaton that recognizes T_H and has size polynomial in |I|.
We now deal with the case of a tiling f that violates a vertical constraint, say between tiles f(i − 1, j) and f(i, j). The basic idea here is to "hide" under a new subtree the factor of the frontier that starts with the corresponding occurrence of the cell block [⁺ f(i − 1, j)⁺ ]⁺ and ends just before the occurrence of the cell block [⁺ f(i, j)⁺ ]⁺. Note that this factor contains exactly N cell blocks, so it can be hidden under a complete binary tree of height n, such as the one depicted in Figure 15.
Similarly, the remaining part of the frontier consists of N − 1 sequences, each one containing N cell blocks, so this shape can be enforced using an almost complete binary tree of height 2n, where exactly one node at level n (e.g., the right-most one) is a leaf. Putting all together, the modified encoding for the tiling f has the shape depicted in Figure 16 (as before, we highlighted the cell-blocks corresponding to the tiles that violate the vertical constraint).
Accordingly, we define the language T_V of all unranked trees of the preceding form for all possible choices of 1 < i ≤ N and 1 ≤ j ≤ N such that (f(i − 1, j), f(i, j)) ∉ V. Note that the latter condition can be checked by a small automaton that compares the highlighted cell blocks in the figure. In particular, the language T_V is recognized by a stepwise automaton of size polynomial in |I|.
We can finally construct the target language T as the union of T H and T V , and recall that this is also recognized by an automaton of polynomial size in |I|.
Reduction. Now we need to argue that S is bounded repairable into T if and only if every tiling of the exponential grid violates some constraint of I. We begin with the easier direction, which assumes that every tiling violates some constraint of I. Consider a generic tree t ∈ S that encodes a tiling f, and let t′ ∈ T be the modified encoding that exposes a violation of a horizontal or vertical constraint, as described earlier. We observe that the frontiers of t and t′ spell out the same sequence of cell blocks. In particular, t′ can be obtained from t by deleting all internal nodes and by inserting new internal nodes. Since the number of internal nodes in t and t′ is uniformly bounded by a constant (roughly O(2^(2n))), we know that S is bounded repairable into T.
As for the other direction, suppose that there is a tiling f that satisfies all constraints of I. We fix an arbitrarily large number K and prove that some tree t ∈ S requires at least K edits to be transformed into a tree of T. The tree t is nothing but the encoding of the tiling f, where each symbol in a cell block is repeated K times; more precisely, t is the unranked tree of Figure 13 where every superscript + is replaced with K. We observe that every transformation of t consisting of fewer than K edits preserves at least one occurrence of each symbol in the frontier, and it also preserves the postorder relationships between these occurrences. Furthermore, note that the occurrences of the symbols [ and ] ensure that no such transformation changes or mixes the order of the cell blocks. This means that every such transformation produces a tree whose frontier contains a subsequence that encodes the same tiling f as t. Consider now a generic tree t′ ∈ T. We observe that the frontier of t′ contains exactly N^2 cell blocks, so it encodes a tiling f′ of the exponential grid. Moreover, by definition of T, the tiling f′ must violate some constraint of I. We thus conclude that f and f′ must be different tilings, and hence t′ cannot be obtained from t by an editing of cost less than K. The preceding argument holds for any arbitrarily large number K, so this proves that S is not bounded repairable into T.
From automata to DTDs. It remains to show how to modify the languages S and T in such a way that they can be succinctly described by nonrecursive deterministic DTDs. The general idea is to annotate the internal nodes of the trees in S and T with enough information so as to ease a deterministic top-down processing. First of all, we need to annotate the internal nodes of all trees of S and T with their levels: this is possible thanks to the fact that the considered trees have height at most 3n + 2. In addition, we mark the left-most and right-most paths of the trees of S with special labels, say 1 and 2, respectively (the marking at the root is irrelevant): this makes it possible to check, by means of a DTD, that the first and last cell blocks are of the form [⁺ s_⊥⁺ ]⁺ and [⁺ s_⊤⁺ ]⁺. As for the trees in T, the crux is to ease the certification of a violation of a horizontal or vertical constraint. To do so, we can promote the information about the violating tiles up to the root. More precisely, on the trees depicted in Figure 14 and Figure 16, we consider the access path to the first highlighted cell block and annotate all internal nodes along this path with the corresponding tile; in a similar way, we add a second annotation for the access path to the second highlighted cell block. For example, the parent of the cell block [⁺ f(i, j)⁺ ]⁺ of the tree of Figure 16 will be labeled with the tuple (#, 2n, f(i − 1, j), f(i, j)), where 2n is the level of that node, and f(i − 1, j) and f(i, j) indicate the tiles corresponding to the first and second highlighted cell blocks.
The additional information on the labeling of the trees of S and T makes it easy to describe these languages by means of nonrecursive deterministic DTDs of size polynomial in |I|. Finally, because only internal nodes are affected by the new annotation, the same arguments for the proof of the reduction can be used here.
Combining Theorems 7.3 and 7.4, we obtain that the bounded repair problem for tree languages represented by all standard specifications [Martens et al. 2006] (that is, stepwise tree automata, deterministic stepwise tree automata, XML schemas, DTDs, and nonrecursive deterministic DTDs) is CONEXP-complete.

Simpler Instances
To find subcases of the bounded repair problem with a lower complexity, we consider a specialization of the problem where the alphabet of the source language is fixed. We show that in this case, the problem is PSPACE-complete for languages represented by nondeterministic DTDs and CONP-complete for languages represented by deterministic DTDs.
Let us first discuss the complexity upper bounds. Suppose that D is a DTD defining a source language over the fixed alphabet Σ. A close inspection of the translation from DTDs to stepwise automata [Champavère et al. 2009] discloses the following crucial property (see the Appendix for the proof).
LEMMA 7.5. Given a nondeterministic (respectively, deterministic) DTD D that defines a source language S over an alphabet Σ, one can compute in polynomial time a nondeterministic (respectively, deterministic) stepwise automaton S = (Σ, Q, δ, δ_0, F) that recognizes S and whose state space can be partitioned into k ≤ 2|Σ| subsets Q_1, …, Q_k such that
- every component of S is contained in some set Q_i, and
- for all states q_1, q_2, q ∈ Q, if q ∈ δ(q_1, q_2) and q_2 and q are in different components, then q_2 ∈ Q_i and q ∈ Q_j for some 1 ≤ i < j ≤ k.
For example, the automaton S described in Example 3.1 is a deterministic stepwise automaton whose state space can be partitioned into nine sets that satisfy the first part of the claim. Lemma 7.5 implies that any path in the transition graph of S (see the left-hand side graph of Figure 7) traverses at most 2|Σ| − 1 vertical edges that connect pairs of states in different components. As a consequence, any primitive synopsis tree of S has size at most |Q|^(2|Σ|), that is, polynomial in the size of S when Σ is fixed.
Putting together Lemma 7.5, Corollary 7.2, and Theorem 5.7, one obtains a PSPACE (respectively, CONP) algorithm that decides whether cost(S, T) < ∞, where S and T are languages defined by nondeterministic (respectively, deterministic) DTDs and S is over a fixed alphabet Σ. The algorithm has the same structure as the algorithm sketched before Theorem 7.3. Namely, it translates the input DTDs into equivalent stepwise automata S and T, then it translates S and T into the grammars G_S and G_{S,T}, and finally it checks whether L(G_S) ⊆ L(G_{S,T}). As stated previously, the last step of the algorithm can be done in CONP by universally guessing a word w of polynomial size from L(G_S) and checking whether w ∈ L(G_{S,T}) in polynomial time. Note that the translation of S and T into G_S and G_{S,T} takes polynomial space for nondeterministic DTDs and polynomial time for deterministic DTDs. The blow-up of the complexity for the former is because one has to check language containment between regular languages, which can be done with polynomial space.
PROPOSITION 7.6. The bounded repair problem between a source language represented by a nondeterministic (respectively, deterministic) DTD over a fixed alphabet and a target language represented by a nondeterministic (respectively, deterministic) DTD over an arbitrary alphabet is in PSPACE (respectively, in CONP).
Finally, we show that even strong restrictions, including fixing both alphabets, cannot get us below PSPACE in the nondeterministic case. Indeed, one can easily reduce the containment problem between regular expressions to a bounded repair problem between languages defined by nonrecursive nondeterministic DTDs, thus showing that the latter problem is PSPACE-hard. To see this, consider two regular expressions E_1 and E_2. Let # be a fresh symbol, and let E_1^# and E_2^# be the expressions obtained from E_1 and E_2, respectively, by substituting every occurrence of a symbol a with the expression a*#. Let r be another fresh letter reserved for the roots of the trees. Clearly, the language defined by E_1 is contained in the language defined by E_2 if and only if the DTD r → E_1^# is bounded repairable into the DTD r → E_2^# (one direction is trivial and the other is easily shown by contraposition). As the latter DTDs are nonrecursive (they define trees of height two), this shows that the bounded repair problem between nonrecursive nondeterministic DTDs is PSPACE-hard, and this holds even when the alphabets are fixed.
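The substitution used in this reduction is purely syntactic, as the following Python sketch suggests. It assumes, for illustration only, that regular expressions are written in the syntax of Python's `re` module with single alphanumeric characters as symbols and only `|`, `*`, and parentheses as operators; the function name `hash_expand` is hypothetical. Each substituted expression is parenthesized so that an operator following the original symbol applies to the whole replacement.

```python
import re


def hash_expand(expr: str) -> str:
    """Replace every symbol a of expr by the expression (a*#)."""
    # Alphanumeric characters are the symbols; operators are copied unchanged.
    return "".join(f"({c}*#)" if c.isalnum() else c for c in expr)


# Each original symbol becomes a block of repetitions closed by the fresh symbol #.
expanded = hash_expand("a(b|c)*")
matches_padded_word = re.fullmatch(expanded, "aa#bbb#c#") is not None
```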
We can also provide a CONP lower bound for the analogous problem when the languages are represented by nonrecursive deterministic DTDs over fixed alphabets. This lower bound follows easily from a reduction from the validity problem for propositional formulas in disjunctive form. A similar reduction was given in Benedikt et al. [2013] for languages of words recognized by deterministic finite automata. The additional complication here is that we have to fix the source and the target alphabets; however, the reduction is still possible by encoding the valuation of each variable with a block of nodes labeled over a binary alphabet.
PROPOSITION 7.7. The bounded repair problem between languages represented by nonrecursive nondeterministic (respectively, deterministic) DTDs with both source and target alphabets fixed is PSPACE-hard (respectively, CONP-hard).

THE UNIVERSAL CASE
In this section, we consider the so-called universal case of the bounded repairability problem, namely a variant of the problem where the source language is assumed universal (i.e., equal to T_Σ, the set of all trees over the alphabet Σ) and the target language is represented by a stepwise automaton T.
We recall the assumption that any stepwise automaton T is trimmed (i.e., every state of T appears in some accepting run of T on some input tree). Under this assumption, we say that an automaton T is complete over Σ if for every tree t ∈ T_Σ there is a (possibly nonaccepting) run of T on t.
Here we also make use of deterministic visibly pushdown transducers [Raskin and Servais 2008; Alur and Madhusudan 2009] as suitable devices that transform unranked trees in a streaming fashion. These devices receive the serialized version of an unranked tree and output the serialized version of another unranked tree. By a slight abuse of notation, we identify unranked trees with their serializations.
The following result gives equivalent conditions for bounded repairability in the universal case.
PROPOSITION 8.1. Given an alphabet Σ and an automaton T = (Σ, Q, δ, δ_0, F), the following conditions are equivalent:
- T_Σ is bounded repairable into L(T),
- T is complete over Σ, and
- there exist k ∈ N and a deterministic visibly pushdown transducer that receives any unranked tree t over Σ and outputs an unranked tree t′ such that dist(t, t′) ≤ k and ext(t′) ∈ L(T).
PROOF. Here we only prove that the second item implies the third one (the other implications are explained in the Appendix). Suppose that T = (Σ, Q, δ, δ_0, F) is a (trimmed) stepwise automaton that is complete over Σ. It is not difficult to show that, since T is complete over Σ, T_Σ is bounded repairable into L(T). The interesting result is that when we identify unranked trees with their serializations, the repair strategy of T_Σ into L(T) can be implemented by a deterministic visibly pushdown transducer. More specifically, the deterministic visibly pushdown transducer outputs, at the very first step and independently of the input, a fixed prefix of a serialized unranked tree (this represents a portion of the repaired tree); then it copies the input t as a continuation of the prefix formerly constructed, mimicking at the same time the computation of the stepwise automaton T on ext(t); and finally, the transducer terminates by outputting a suitable suffix in such a way that the corresponding repaired tree belongs to the language ext^(−1)(L(T)). The difficult part of this proof is to show that there is a single prefix that, no matter how it is prolonged, can be completed into a serialized tree that belongs to the language ext^(−1)(L(T)). Namely, to complete the proof, we need to show the following claim.
CLAIM 2. There are a symbol a ∈ Σ, a state p ∈ Q, and a sequence of unranked trees u_1, …, u_n over Σ such that for every unranked tree t over Σ, there is a sequence of unranked trees v_1, …, v_m over Σ satisfying p ∈ δ(ext(a(u_1, …, u_n, t, v_1, …, v_m))).
We prove the claim by contraposition. Suppose that (⋆) for every symbol a ∈ Σ, every state p ∈ Q, and every sequence u_1, …, u_n of trees, there exists a tree t such that for every sequence of trees v_1, …, v_m, p ∉ δ(ext(a(u_1, …, u_n, t, v_1, …, v_m))). We fix an arbitrary symbol a ∈ Σ and an enumeration p_1, …, p_N of all states in Q. Then, by applying the hypothesis (⋆) to the symbol a, to each state p ∈ {p_1, …, p_N}, and to increasing sequences of trees u_1, …, u_n, we construct a tree over Σ on which T has no valid run (this would imply that T is not complete over Σ). First, we let p = p_1 and n = 0, and we obtain from (⋆) that there is a tree t_1 such that p_1 ∉ δ(ext(a(t_1, v_1, …, v_m))) for all sequences of trees v_1, …, v_m. Similarly, if we let p = p_2, n = 1, and u_1 = t_1, we know from (⋆) that there is a tree t_2 such that p_2 ∉ δ(ext(a(t_1, t_2, v_1, …, v_m))) for all sequences of trees v_1, …, v_m. By applying a simple inductive argument, we can construct a sequence of trees t_1, …, t_N such that for every index 1 ≤ i ≤ N, p_i ∉ δ(ext(a(t_1, …, t_N))). Since p_1, …, p_N are all and only the states of T, we derive that T has no valid run on ext(a(t_1, …, t_N)). This shows that T is not complete over Σ.
From the preceding characterization, one can derive a polynomial-time algorithm that decides whether T_Σ is bounded repairable into L(T) when T is given by a deterministic stepwise automaton. For this, it is sufficient to turn T into a trimmed deterministic automaton T′ = (Σ, Q′, δ′, δ′_0, F′) over Σ and then check that (i) for every symbol a ∈ Σ, δ′_0(a) ≠ ∅ and (ii) for every pair of states q_1, q_2 ∈ Q′, δ′(q_1, q_2) ≠ ∅. When the target language is represented by a nondeterministic stepwise automaton T, the complexity increases to EXP: one can simply determinize T and then use the decision procedure for the deterministic case.
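The polynomial-time test for the deterministic case can be sketched in Python as follows. The encoding is an assumption of this sketch: the initial function and the transition function are partial maps given as dictionaries (a missing key plays the role of an empty set of successor states), and the names `trim` and `universal_bounded_repairable` are hypothetical.

```python
def trim(alphabet, delta0, delta, final):
    """Keep only the states that occur in some accepting run."""
    # States reachable bottom-up from the initial states.
    reach = {delta0[a] for a in alphabet if a in delta0}
    changed = True
    while changed:
        changed = False
        for (q1, q2), q in delta.items():
            if q1 in reach and q2 in reach and q not in reach:
                reach.add(q)
                changed = True
    # Reachable states from which an accepting state can still be produced.
    useful = set(final) & reach
    changed = True
    while changed:
        changed = False
        for (q1, q2), q in delta.items():
            if q in useful and q1 in reach and q2 in reach:
                for p in (q1, q2):
                    if p not in useful:
                        useful.add(p)
                        changed = True
    delta0_t = {a: delta0[a] for a in alphabet
                if a in delta0 and delta0[a] in useful}
    delta_t = {(q1, q2): q for (q1, q2), q in delta.items()
               if {q1, q2, q} <= useful}
    return useful, delta0_t, delta_t


def universal_bounded_repairable(alphabet, delta0, delta, final):
    """Conditions (i) and (ii) on the trimmed deterministic automaton."""
    states, d0, d = trim(alphabet, delta0, delta, final)
    return (all(a in d0 for a in alphabet)
            and all((q1, q2) in d for q1 in states for q2 in states))
```

Trimming before the check matters: a transition into a state from which no accepting state can be produced is discarded, so the automaton then counts as incomplete on the corresponding pair of states.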
As one could expect, the above complexity bounds (i.e., P for deterministic stepwise automata and EXP for nondeterministic stepwise automata) are tight. The hardness proofs can be derived from reductions of the emptiness and universality problems, respectively, on the corresponding classes of automata (see the Appendix).
PROPOSITION 8.2. The bounded repair problem in the universal case when the target language is represented by a nondeterministic (respectively, deterministic) stepwise automaton is EXP-complete (respectively, P-complete).

CONCLUSIONS
In this article, we have investigated the bounded repairability problem for regular tree languages. We have provided an effective characterization of bounded repairability and characterized the complexity of testing whether a given source language S is bounded repairable with respect to a given target language T. The characterization can be used with several different formalisms for representing the tree languages: tree automata, XML schemas, and DTDs, as well as their nonrecursive and deterministic restrictions. Although generally the problem is CONEXP-complete, its complexity is considerably reduced for DTDs over fixed alphabets. In the latter case, the problem becomes CONP-complete or PSPACE-complete, depending on whether the DTDs are deterministic or not. Finally, we have also considered the variant of the problem when the source language is set to be universal. In this case, we have shown that the problem is EXP-complete in general and becomes tractable (P-complete, in fact) when a deterministic bottom-up automaton is used.
Several directions for future work can be envisioned. Bounded repairability is essentially a generalization of inclusion between tree languages modulo a bounded number of editing operations. One could attempt to further generalize it by allowing several editing operations that are bounded by a ratio of the size of the input tree. In Benedikt et al. [2014], it is shown how such a generalized notion of repairability can be computed for regular string languages. It would be interesting to see if the employed methods can be adapted to the setting of regular tree languages. Another direction is bounded repairability in the streaming setting: not only the pair of source and target languages need to be bounded repairable but also the repair must be executable by a transducer, namely a machine with a possibly infinite state space that makes one pass over the serialization of the input tree while producing a serialization of the output tree. Proposition 8.1 shows that in the unrestricted case, visibly pushdown transducers are expressive enough to implement bounded-cost streaming repairs whenever these exist. In the general case, however, visibly pushdown transducers may be too limiting [Bourhis et al. 2013].

ELECTRONIC APPENDIX
The electronic appendix for this article can be accessed in the ACM Digital Library.