Decidable XPath Fragments in the Real World

XPath is arguably the most popular query language for selecting elements in XML documents. Besides query evaluation, query satisfiability and containment are the main computational problems for XPath; they are useful, for instance, to detect dead code or validate query optimisations. These problems are undecidable in general, but several fragments have been identified over time for which satisfiability (or query containment) is decidable: CoreXPath 1.0 and 2.0 without so-called data joins, fragments with data joins but limited navigation, etc. However, these fragments are often given in a simplified syntax, and sometimes w.r.t. a simplified XPath semantics. Moreover, they have been studied mostly with theoretical motivations, with little consideration for the practically relevant features of XPath. To investigate the practical impact of these theoretical fragments, we design a benchmark compiling thousands of real-world XPath queries extracted from open-source projects, and match them against syntactic fragments from the literature. We investigate how to extend these fragments with seldom-considered features such as free variables, data tests, data joins, and the last() and id() functions, for which we provide both undecidability and decidability results. We analyse the coverage of the original and extended fragments, and further provide a glimpse at which other practical features might be worth investigating in the future.


INTRODUCTION
The XPath language [47] is arguably the most popular querying language for selecting elements in XML documents. It is embedded in the XML processing languages XSLT [23] and XQuery [48], and widely used in general-purpose languages like Java or C# through third-party libraries. It combines the ability to navigate the XML tree (which finds its roots in modal logic [3]) with that of comparing data values found in several, distantly related attributes.
Alongside the evaluation of the set of elements selected in a document [e.g. 5,13,16,35], the main computational problem associated with a query is the satisfiability problem: given an XPath query, and optionally a schema for the type of XML documents under consideration, does there exist at least one XML document on which this query would select some node? This abstract question actually makes it possible to answer several questions on the reliability and performance of a query; to wit:
Usefulness: is the query useful at all, i.e. will it select some parts of a well-formed document? The question is far from trivial in an environment where the data can be retrieved from external sources, prone to changes in their structure. This allows simple optimisations in batch processes, with significant savings [36].
Optimisation: can a query be replaced by another, simpler query? This is a classical question in query rewriting in database theory. This type of optimisation has for instance been applied to XPath queries without data joins by Genevès and Vion-Dury [33], with an appreciable impact on performance.
Session 5: Semistructured Data and Knowledge Graphs, Logic, and Verification PODS '19, June 30-July 5, 2019, Amsterdam, Netherlands
Unfortunately, the satisfiability problem is in general undecidable. Even CoreXPath 1.0, a simplified syntax based on the navigational fragment of XPath 1.0, is intractable, more exactly EXP-complete already when only using the child axis [3,12,45]. Thus, with the exception of the work of Genevès et al. [32] on CoreXPath 1.0, most of the literature on the topic is of a theoretical nature [e.g. 2, 8, 15, 24-26, 29, 31, 39], and focuses on decidability and complexity questions in variants of CoreXPath 1.0 that allow limited forms of data joins. While these results are technically impressive, it is not immediately clear how much is gained in practice by handling data joins. Indeed, data joins are not the sole source of difficulty in XPath: many real-life XPath queries perform calls to a standard library of functions [22] (including arithmetic and string-manipulating functions) that also lead to undecidable satisfiability.
Also, XPath 1.0 dates back to 1999; the more recent versions 2.0 and 3.0 feature path intersections, for loops, etc. [52]. As it evolves in pace with XQuery, XPath includes more and more general programming constructs, and is arguably not just a domain-specific language for path navigation. Quite tellingly, in our benchmark, we found that only half of the queries use navigation.
In this paper, we evaluate the practical applicability of XPath fragments proposed in the theoretical literature from a basic, syntactic perspective: how many queries are captured by these fragments? The first step to this end is the compilation of a benchmark of 21,141 real-world XPath queries [6]. The queries are extracted from open-source projects that rely heavily on XSLT or XQuery. As described in Sec. 2, the benchmark offers for each XPath query an XML syntax tree representation in XQueryX [21]. The tools and the resulting benchmark are available from [6,7] under open source licenses.
Naturally, the syntax defined in these works is simplified and was never meant to be used directly against concrete XPath inputs, while we need a concrete syntax for each one of these fragments in order to implement it in Relax NG. This leads to the interesting question of which XPath features can be 'reasonably' handled in these fragments without losing the decidability of satisfiability or worsening its complexity (see Sec. 5.1). In turn, answering this type of question requires the definition of a palatable semantics for a substantial subset of XPath 3.0, which we provide in Sec. 3.
As could be expected, naively implementing the syntactic fragments from the literature does not lead to a good coverage (Sec. 4.2), and one should extend them whenever reasonable. In Sec. 5, we propose six extensions of the original fragments and evaluate their coverage on the benchmark. These extensions are root navigation, free variables, data tests against constants, positive data joins, and restricted calls to the functions last() and id(). Just as interestingly, we exhibit several cases where these extensions cannot be handled.
We analyse our experimental results in Sec. 6, notably concluding that higher coverage is obtained through basic extensions than by using complex academic fragments. We also identify increased function support as a promising direction for improved practical satisfiability checking, with an especially high potential for XPath queries from XSLT sources. We conclude in Sec. 7. Additional contents can be found in the full version of the paper available from https://hal.inria.fr/hal-01852475.

A REAL-WORLD BENCHMARK
We explain here the technical aspects of the construction of a benchmark of 21,141 queries [6]: the parser we developed to this end (Sec. 2.1), the sources we employed (App. A.2), and the way we processed the benchmark to check whether a given query belongs to a syntactic XPath fragment (Sec. 2.3). We finish the section by mentioning the limitations of the current benchmark.

Parser
We have slightly modified the W3C parser for XQuery 3.0 from https://www.w3.org/2013/01/qt-applets/, which is (almost) a superset of XPath 3.0, so that we can also use it for XPath queries extracted from XSLT documents. This parser uses a grammar automatically extracted from the language specification, so we are confident in its results. Our implementation [7] (1) extracts XPath queries from XQuery files, by selecting 'maximal XPath subtrees' from the XQuery syntax tree, and (2) outputs syntax trees in the XML format XQueryX [21]; this is what we process to determine to which XPath fragments each query belongs. Duplicate queries within each source are removed. See App. B for an example XQuery document and the corresponding parser output.

Sources
Three XSLT and twenty-five XQuery sources have been chosen by searching through open source GitHub projects containing XSLT or XQuery files, selecting the most popular projects from which we could extract at least 50 queries. We also added one large project not hosted on GitHub, namely DocBook XSL.  The XSLT projects aim to translate enriched text documents between different formats. The XQuery projects we include are most often libraries. The detailed composition is presented in App. A.2. We make no formal claim about the coverage of the benchmark, as it is certainly biased by the restriction to XSLT and XQuery sources. We rather see it as a first open-source release, which could be later enriched by adding XPath queries embedded in other programming languages (e.g., Python, Perl, ECMAScript).

Properties of the Benchmark
Standard Coverage. We exploit this benchmark by validating the syntax trees in XQueryX format against Relax NG [20] specifications. Table 1 presents the number of queries that fall within the scope of the three major revisions of the XPath standard (queries are unique as strings, per source). We can observe that the queries extracted from XSLT files are nearly all XPath 1.0 queries, which contrasts with queries extracted from XQuery sources, which rely more often on advanced XPath features from XPath 2.0 and XPath 3.0. (The output XQueryX representations of a handful of queries do not validate against XPath 3.0, due to out-of-bounds constant numerals; this is a very marginal effect.) Note that the coverage of XPath 1.0, 2.0, and 3.0 given in Tab. 1 does not restrict function calls to the standard library [22]. The last column 'XPath 3.0 std' shows the coverage of XPath 3.0 when restricted to standard functions. We see here an essential limitation of analysing XPath queries in isolation, without support for non-standard functions, and in particular for user-defined functions: more than half of the queries extracted from XQuery documents are beyond the scope of our analyses.

Functions. We show in Fig. 1 the number of occurrences for each of the 400 most frequently occurring functions (among 1,600) and the associated accumulated percentage of the total number of function calls. Darker dots correspond to standard XPath functions, and the darker line corresponds to the accumulated percentage achieved by these standard functions. Arithmetic operations do not figure here as they are classified syntactically as operators in XPath. They occur more than the tenth most frequent function. The standard functions only represent 57.23% of the function calls in the benchmark. This is mostly due to queries from XQuery sources, which routinely use functions defined in the surrounding XQuery programs: when restricting to these sources, standard XPath functions represent 42.93% of the function calls.
By contrast, when restricting to XSLT sources, we find only 210 functions and standard XPath functions represent 76.32% of the calls. Moreover, still within XSLT sources, the 16 functions with more than 100 occurrences each all belong to the XSLT or XPath standard, and account for 78.35% of the occurrences of function calls. In the XSLT sources, there are 4,650 queries (31.69%) performing at least one function call, roughly as many as the 4,556 queries (70.46%) found in the XQuery sources.

Size. Figure 2 shows the distribution of query sizes, defined as the number of nodes in their syntax trees. As might be expected, a majority of the queries have size at most 13, but there are nevertheless 256 queries of size 100 or more.

Limitations
The benchmark is made of uncurated data, thus no distinction is made between tiny XPath queries and more interesting ones. For instance, only 9,852 of the queries in the benchmark use at least one axis step. Also, as seen in Tab. 1, the numbers of queries from the various sources are not balanced. Two further limitations concern specific uses of the benchmark.
XPath satisfiability: in both XSLT and XQuery files, no schema information on the XML to be processed is available. Furthermore, it seems likely that most queries are satisfiable, quite possibly all of them.
XPath evaluation: similarly, the benchmark does not provide examples of input XML documents on which the XSLT or XQuery should be evaluated.
The first limitation could be lifted by inspecting each source and manually adding the relevant schema when it can be identified.

Related Work
Regarding XPath and XQuery, the previous works [e.g. 4, 30, 49] on benchmarks focus on evaluating the performance of processors. These benchmarks are synthetic and of limited size, and accompanied by XML documents against which they should be evaluated. Compared to these works, we carry out a large-scale analysis of real-world queries, and analyse them with respect to the satisfiability problem rather than the evaluation problem; for the sake of completeness, we report on the coverage of our syntactic fragments on two such synthetic benchmarks in App. A.1. Several large-scale studies analyse real-world SPARQL queries harvested from semantic web search logs [17,46]. Thanks to the availability of SPARQL logs, the latest one [17] includes over 50 million unique queries and carries out detailed analyses of query features that are relevant to their evaluation, including whether the queries belong to specific SPARQL fragments. While we did not focus on XPath fragments for which evaluation would be more efficient, our benchmark could certainly be exploited in this direction.

XPATH 3.0
The XPath 3.0 specification is arguably too complex to be reasoned about directly. We work instead with a well-defined sub-language, designed to capture accurately the constructions we witnessed in the benchmark. In order to be compatible with the semantics in the XPath literature, we provide a semantics on data trees, but in Sec. 3.6 we show how to capture the actual XPath semantics on XML documents.

Data Trees
Our models are an abstraction of XML DOM trees called data trees, which are finite trees where each node carries both a label from a finite alphabet Σ and a datum from an infinite countable domain D equipped with an order <. Formally, a data tree is a finite rooted ordered unranked tree with labels in Σ × D: it is a pair t = (ℓ, δ) of functions ℓ: N → Σ and δ: N → D with a common non-empty finite set of nodes N ⊆ ℕ* as domain; N must be prefix-closed (if p·i ∈ N for some p ∈ ℕ* and i ∈ ℕ, then p ∈ N) and predecessor-closed (if p·(i + 1) ∈ N for some p ∈ ℕ* and i ∈ ℕ, then p·i ∈ N); in particular it contains a root node ε. Nodes in N, being finite sequences of natural numbers, are totally ordered by the lexicographic ordering, which is known in this context as the document order and denoted by ≪. Figure 3 displays an example of a data tree. When working with first-order or monadic second-order logic, a data tree is seen as a finite relational structure (N, ↓, →, ∼, (P_a)_{a∈Σ}, (P_d)_{d∈D}), where ↓ denotes the child relation, → the next-sibling relation, ∼ data equivalence, the P_a the labelling predicates, and the P_d the data predicates.
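As an illustration, the definition of data trees can be sketched in Python. This is only a sketch under our own assumptions: nodes are represented as tuples of integers (the empty tuple standing for the root ε), and the example labels and data are hypothetical rather than those of Fig. 3.

```python
def is_tree_domain(nodes):
    """Check that `nodes` is non-empty, prefix-closed and predecessor-closed."""
    if not nodes:
        return False
    for p in nodes:
        if p == ():
            continue  # the root has no parent and no previous sibling
        parent, i = p[:-1], p[-1]
        if parent not in nodes:                        # prefix-closed
            return False
        if i > 0 and parent + (i - 1,) not in nodes:   # predecessor-closed
            return False
    return True


def document_order(nodes):
    """List the nodes in document order (lexicographic ordering)."""
    # Python compares tuples lexicographically, a strict prefix being smaller.
    return sorted(nodes)


# Hypothetical data tree t = (label, datum): both maps share the same domain.
label = {(): "n", (0,): "o", (1,): "t", (1, 0): "c"}
datum = {(): "x", (0,): "y", (1,): "y", (1, 0): "z"}
```

Checking the parent and the previous sibling of every node suffices, since the two closure properties then follow by induction on the length of the node address.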

Syntax
While our implementation works with concrete syntax, for the sake of readability we use an abstract syntax throughout the paper. It is nevertheless compatible with XPath 3.0: all the examples in this paper are written in actual XPath. Let X be a countably infinite set of variables, and F be a ranked alphabet of function names; we denote by F_n its subset of symbols with arity n. As usual, our language has multiple sorts:
axes denote directions in the data tree, with abstract syntax
α ::= self | child | descendant | following-sibling | parent | ancestor | preceding-sibling
path expressions describe binary relations between the nodes in a data tree, with abstract syntax
π ::= α::
where $x ranges over X, n over ℕ, and f over F_n, while node expressions describe sets of nodes, with abstract syntax
φ :
where a ranges over Σ, d over D, △ over {eq, ne} and △⁺ over {eq, ne, le, lt, ge, gt}. Note that only eq and ne are allowed when comparing paths, while the ordered structure of D is only available when comparing a path and a data constant: none of the fragments we consider is known to allow the richer path comparisons. We provide a detailed breakdown of how many queries from our benchmark use each one of these syntactic constructs in Sec. 3.7.

Data Tree Semantics
For a fixed data tree of domain N, we give in figures 4 and 5 the semantics ⟦α⟧_A of axes, and the semantics ⟦π⟧^ν_P and ⟦φ⟧^ν_N of path and node expressions. The semantics is relative to a current variable valuation ν: X → 2^N. The semantics of node expressions are sets of nodes of N, while those of axes and path expressions are sets of pairs of nodes.
In order to interpret functions, each function symbol f from F_n comes with a semantics ⟦f⟧_F : (2^N)^n → 2^N. For instance, false() and not() are technically XPath functions, with semantics ⟦false⟧_F ≝ ∅ and ⟦not⟧_F(S) ≝ N \ S for all S ⊆ N. Likewise, we interpret each comparison operator △⁺ as a data relation ⟦△⁺⟧_∼ ⊆ D × D: ⟦eq⟧_∼ is the equality = over D, ⟦ne⟧_∼ the disequality ≠, ⟦lt⟧_∼ the strict order <, etc. For binary relations R, R′, we employ relational composition.
Beware that the semantics of a path expression π changes when seen as a node expression. In particular, a variable $x is a path expression, and when seen as a node formula, ⟦$x⟧^ν_N = N unless ν($x) = ∅. XPath provides two quantifiers: for expressions bind singleton node sets, while let expressions bind node sets.
Our semantics is in line with the ones found in the literature. However, we note that it slightly differs from the actual XPath semantics. In particular, we only account for pure functions acting on paths, while functions from XPath's large standard library [22] may be polymorphic and have side-effects, two features that are anyway out of the reach of the current decidable fragments. Also, variables in XPath are bound to ordered collections of nodes and data values, while we only consider sets of nodes. This simpler semantics is not restrictive for our purposes, as discussed in Sec. 3.6.
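The remark on variables can be made concrete with a small Python sketch. It assumes, consistently with the statement above but without access to the semantics of figures 4 and 5, that a variable seen as a path relates every node to every node of its valuation, and that a path seen as a node expression holds at the nodes from which it selects something. All function names are ours.

```python
def false_f(nodes):
    """Semantics of false(): the empty set of nodes."""
    return set()


def not_f(nodes, s):
    """Semantics of not(): the complement of S within the node set N."""
    return nodes - s


def var_as_path(nodes, valuation, x):
    """Assumed path semantics of $x: every node is related to each node of ν($x)."""
    return {(p, q) for p in nodes for q in valuation[x]}


def path_as_node_expr(relation):
    """A path seen as a node expression: the nodes where it selects some node."""
    return {p for (p, _) in relation}


# Hypothetical node set and valuation.
nodes = {(), (0,), (1,)}
nu = {"x": {(0,)}, "y": set()}
```

With these definitions, the node semantics of $x is the whole set N, while that of $y, whose valuation is empty, is the empty set.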

The Satisfiability Problem
In this paper, we focus on the satisfiability problem: given a node expression φ, does there exist a data tree t such that t |= φ? As path expressions are also node expressions, this also captures the satisfiability of path expressions. In presence of a DTD, the data tree t should additionally belong to the DTD's language.

Remark 3.2.
A related problem is query containment: for node expressions φ and φ′, we say that φ is contained in φ′ if, for all data trees and all variable valuations ν, ⟦φ⟧^ν_N ⊆ ⟦φ′⟧^ν_N. This is equivalent to asking the unsatisfiability of φ and not(φ′), so it reduces to the satisfiability problem when negation is allowed, which will not always be our case. The problem of path containment asks the same question for path expressions π and π′, and is not captured by satisfiability. Furthermore, one might also consider variants where we ask whether for all data trees and variable valuations ν and ν′, ⟦φ⟧^ν_N ⊆ ⟦φ′⟧^{ν′}_N, which leads to rather different complexities [45].
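The reduction from containment to unsatisfiability can be spelled out as follows. This is a sketch that assumes the semantics of not() given in Sec. 3.3 and that the conjunction "and" is interpreted as set intersection; the \llbracket and \rrbracket macros are those of the stmaryrd package.

```latex
\begin{align*}
\varphi \text{ is contained in } \varphi'
  &\iff \forall t\,\forall\nu.\
        \llbracket\varphi\rrbracket^{\nu}_{N}
        \subseteq \llbracket\varphi'\rrbracket^{\nu}_{N} \\
  &\iff \forall t\,\forall\nu.\
        \llbracket\varphi\rrbracket^{\nu}_{N}
        \cap \bigl(N \setminus \llbracket\varphi'\rrbracket^{\nu}_{N}\bigr)
        = \emptyset \\
  &\iff \forall t\,\forall\nu.\
        \llbracket\varphi \text{ and } \mathrm{not}(\varphi')\rrbracket^{\nu}_{N}
        = \emptyset .
\end{align*}
```

The last line states exactly that φ and not(φ′) is unsatisfiable.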

Syntactic Sugar
We may use . ≝ self::* for referring to the current point of focus, .. ≝ parent::* for its parent, α::a ≝ α::*[a] for testing the label found after an axis step, and a single label a as a path formula for child::a.
3.5.2 XPath 2.0 Sugar. As shown by ten Cate and Lutz [52], path intersection π intersect π′ and complementation π except π′, as introduced in XPath 2.0, can be expressed using for loops. Similarly, node quantification some $x in π satisfies φ can be expressed in our syntax, and every $x in π satisfies φ is defined dually. By first defining the non-standard
future(φ) ≝ (following::* union descendant::*)[φ]
singleton(π) ≝ π and not(π intersect π/future(true()))
we can also express standard node comparisons π ≪ π′ with
singleton(π) and singleton(π′) and π′ intersect (π/future(true())) (4)
In XPath, a path expression can select the last (according to the document order) node among those selected by a path π. This often appears as a predicate π[last()] or π[position() = last()] that calls the nullary last() function [22]. However, this syntax cannot be handled with our simplified semantics (and is a bit problematic), thus we shall only consider the one-argument version of last() [22], with semantics ⟦last⟧_F(S) ≝ max_≪(S) for any S ⊆ N. Then last(π) can be expressed in XPath 2.0 by
π except (π/ancestor::* union preceding::*)
In the data tree of Fig. 3, when evaluated at the c node, the path last(ancestor-or-self::*/child::lang) returns the lang node with data value 'fr'. We will discuss the last() function further in Sec. 5.2.5.
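The one-argument last() and its encoding through except can be compared on node sets with a short Python sketch. It relies on two assumptions of ours: nodes are tuples of integers ordered lexicographically, and the encoding is read as removing from S every node that is an ancestor of, or precedes, some node of S. The value of last() on the empty set is not specified above; we take it to be the empty set.

```python
def last(s):
    """One-argument last(): the maximum of S in document order, as a node set."""
    return {max(s)} if s else set()


def strictly_before(nodes, p):
    """Ancestors and preceding nodes of p: all nodes before p in document order."""
    return {q for q in nodes if q < p}


def last_via_except(nodes, s):
    """S minus the nodes that are an ancestor of, or precede, some node of S."""
    earlier = set()
    for p in s:
        earlier |= strictly_before(nodes, p)
    return s - earlier


# Hypothetical tree domain and selected set S.
nodes = {(), (0,), (0, 0), (1,), (1, 0)}
s = {(0,), (0, 0), (1,)}
```

Both functions agree on this example: the only node of S with no later node of S is (1,).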

XML Semantics
The XPath data model [54] specifies that nodes can fall into several categories, with multiple types and accessors. In data trees, there are only nodes, and two accessors ℓ and δ . Nevertheless, a large part of the XPath data model can be handled.
3.6.1 Data Tree of an XML Document. Elements, attributes, text nodes, and comments can all be encoded using distinguished labels: we let Σ = E ⊎ A ⊎ {text, comment} where E is the set of element labels and A of attribute labels.
We see all the values as belonging to D. The data in D associated with attribute nodes, text nodes, and comment nodes is their string value; for an element node, it is the concatenation of the string values of all its element and text children. The following XML document corresponds to the data tree of Fig. 3:
<n lang="en"><o id="23">John Doe</o><t ref="23" lang="fr">John <c>Doe</c></t></n>
3.6.2 XML-Specific Syntax. When considering XML documents rather than data trees as models, some additional features of XPath become meaningful. We enrich the syntax with the axis
α ::= · · · | attribute
and five node tests
τ ::= attribute() | comment() | element() | text()
φ ::= · · · | τ
π ::= · · · | α::node()
3.6.3 Interpretation into Data Trees. Given an XPath node expression φ with the XML semantics of [47], we interpret it as an XPath node expression ⌜φ⌝ and not(//.[notxml]) that uses the data tree semantics of Sec. 3.3.
The first conjunct ⌜φ⌝ is defined by induction on φ; the case of node tests τ is straightforward. The semantics of atomic steps is modified to only visit element nodes, except when using the attribute axis or the node() test, and to forbid horizontal axes in attribute nodes. The second conjunct not(//.[notxml]) ensures that the data tree is indeed the encoding of an XML document, by forbidding the node expression notxml everywhere in the tree. We define it by first ensuring that attribute, comment, and text nodes are leaves:
(comment or text or ⋁_{a∈A} a) and child::*
Note that this does not enforce the XML standard of having at most one a-labelled attribute for every element. This might actually be desirable, for instance for handling set-valued attributes, like the class ones in HTML 5. But it can be otherwise remedied with a disjunction with (7):
· · · or ⋁_{a∈A} (for $x in child::a return for $y in child::a return .[not($x is $y)])
In case we are working with a fragment without for and is, but with data joins, (8) ensures instead that all the a-labelled attributes share the same data:
· · · or ⋁_{a∈A} (child::a ne child::a)
We might want to ensure that identifiers are unique. To simplify matters, let us assume that all the unique identifiers use the attribute name id ∈ A; then we add
· · · or id and (. eq future(id))
Finally, we should also ensure that data values are consistent throughout the tree. Remember that the value of an element node should be the concatenation of the values of its element and text children. There is no way to do this without access to string-processing functions, so the XML semantics and data tree semantics do not quite coincide. Most of the literature accordingly restricts data joins π △ π′ and data tests π △⁺ d to paths ending with an attribute step: only π/@a △ π′/@a′ and π/@a △⁺ d are allowed in their syntax.
We do not enforce this restriction in our concrete syntax specifications, but the effect is limited: there are only 381 occurrences in the benchmark of a data test π △ + d where π is neither a function call nor a variable and does not end with an attribute step. Table 2 shows the number of queries that use each specific axis, first for each type of source, and then globally; attribute and child are the most prominent axes. Tables 3 and 4 show, for each syntactic construct, the number of queries that use it. Table 3 focuses on constructs from our restricted syntax, while Tab. 4 presents XPath constructs not supported by our abstract syntax. The difference between XSLT and XQuery sources is marked, with 'advanced' or unsupported constructs (let, for, map, etc.) used almost only in XQuery, and navigation and data tests against constants used significantly more in XSLT.
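The encoding of XML documents into data trees described in Sec. 3.6.1 can be sketched with Python's standard xml.etree.ElementTree module. Several choices below are assumptions of ours and not taken from the paper: attribute nodes are placed before the other children, attribute labels are written with a leading "@", and comments are ignored since the default parser discards them.

```python
import xml.etree.ElementTree as ET


def encode(xml_text):
    """Return (label, datum) maps keyed by node addresses (tuples of ints)."""
    label, datum = {}, {}

    def visit(elem, addr):
        label[addr] = elem.tag
        # Datum of an element: concatenation of all the text found below it.
        datum[addr] = "".join(elem.itertext())
        i = 0
        for name, value in elem.attrib.items():   # assumption: attributes first
            label[addr + (i,)] = "@" + name
            datum[addr + (i,)] = value
            i += 1
        if elem.text:                             # text before the first child
            label[addr + (i,)] = "text"
            datum[addr + (i,)] = elem.text
            i += 1
        for child in elem:
            visit(child, addr + (i,))
            i += 1
            if child.tail:                        # text right after this child
                label[addr + (i,)] = "text"
                datum[addr + (i,)] = child.tail
                i += 1

    visit(ET.fromstring(xml_text), ())
    return label, datum


XML = '<n lang="en"><o id="23">John Doe</o><t ref="23" lang="fr">John <c>Doe</c></t></n>'
label, datum = encode(XML)
```

On the example document of Sec. 3.6.1, the t element receives the datum "John Doe", obtained by concatenating its text child "John " with the datum "Doe" of its c child.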

XPATH FRAGMENTS
We present here the fragments with decidable satisfiability and containment we have considered in our experiments.
As there is such an abundant literature on the topic [e.g. 12, 13, 24-26, 29, 34, 39, 45, 50, 52], this is clearly an incomplete sample, but we think it is representative of the main lines of investigation.
The fragments we consider in our experiments and their inclusions are shown in Fig. 6, along with the complexity of satisfiability in each fragment. Regarding complexity, we use the DAG-size of the input expression, where isomorphic sub-expressions are shared. Note that Fig. 6 reports the complexity for the original logics, thus for EMSO² and non-mixing MSO constraints, the complexity of the XPath fragments we translate into the logics might be lower.
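The difference between the size of a syntax tree and its DAG-size can be illustrated with a short Python sketch. The representation of expressions as nested tuples (operator followed by sub-expressions) is an assumption made for the example.

```python
def tree_size(expr):
    """Number of nodes of the syntax tree of expr = (operator, child, ...)."""
    return 1 + sum(tree_size(child) for child in expr[1:])


def dag_size(expr):
    """Number of distinct sub-expressions, structurally equal ones being shared."""
    seen = set()

    def visit(e):
        if e in seen:
            return          # already counted: shared sub-expression
        seen.add(e)
        for child in e[1:]:
            visit(child)

    visit(expr)
    return len(seen)


# Hypothetical expression: a path composed with itself twice over.
step = ("child::*",)
two = ("/", step, step)
four = ("/", two, two)
```

Repeating this doubling k times yields a syntax tree with 2^(k+1) - 1 nodes but only k + 1 distinct sub-expressions, so the DAG-size can be exponentially smaller than the tree size.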

.2). Ten Cate and Lutz
where $x and $y range over X; we call the resulting fragment CoreXPath 2.0. The syntax of node identity tests in [52] is slightly more restrictive than ours, but this can be fixed by defining π is π′ as
singleton(π) and singleton(π′) and for $x in π return for $y in π′ return .[$x is $y] (10)
A deeper difference lies in the semantics of variables: ten Cate and Lutz [52] assume that valuations map to single nodes. This does not make any difference regarding bound variables, since in CoreXPath 2.0 they must be bound by a for expression, but it does make one for free variables. This is not an issue, since the decision procedure for CoreXPath 2.0 can handle those.
Indeed, satisfiability in CoreXPath 2.0 is decidable in time bounded by a tower of exponentials, whose height depends on the size of the expression. This is seen by reducing the problem to satisfiability in MSO(↓, →, (P a ) a ∈Σ ), using the usual standard translation of CoreXPath expressions into MSO formulae [see, e.g., 3,44]; the resulting TOWER complexity upper bound is tight [52,Thm. 31]. Thus our free XPath variables translate directly into free second-order variables in MSO(↓, →, (P a ) a ∈Σ ).

Data XPath.
A well-studied XPath fragment with the ability to test data equality and disequality is DataXPath [31]. It is obtained by adding joins to the syntax of CoreXPath as
φ ::= · · · | π △ π
Although DataXPath satisfiability is undecidable [31], restricting the navigational power restores decidability [27]. The first decidable fragment we consider is VerticalXPath, shown decidable by Figueira
As seen in Fig. 6, the complexity of the satisfiability problem in these three fragments varies considerably: DownwardXPath is EXP-complete, but VerticalXPath and ForwardXPath are ACKERMANN-hard [28]. It is also notable that satisfiability of DownwardXPath becomes ACKERMANN-hard in the presence of DTDs [28].

Non-Mixing MSO Constraints.
In [19], Czerwiński, David, Murlak, and Parys define MSO constraints as formulae of the form ψ(x̄) ⟹ η_∼(x̄) ∧ η_≁(x̄), where ψ is an MSO formula over the signature (↓, →, (P_a)_{a∈Σ}) with the first-order variables x̄ as its free variables, and η_∼ and η_≁ are positive Boolean combinations of atoms, over the respective signatures (∼, (P_d)_{d∈D}) and (≁, (¬P_d)_{d∈D}). Hence data tests and data joins are permitted, as long as they are not mixed. Satisfiability is called 'consistency' in this context, and is decidable [19,Thm. 4]; better complexities are achievable when restricting ψ to conjunctive queries.
To quote Czerwiński et al. [19], their 'results imply decidability … of the containment problem in the presence of a schema for unions of XPath queries without negation, where each query uses either equality or inequality, but never both.' Here is indeed a NonMixingXPath fragment of XPath, which can be translated to MSO constraints: where △-expressions, for △ in {eq, ne}, are defined by π_△ ::= α: where φ is any CoreXPath 2.0 node expression. As this fragment embeds CoreXPath 2.0, its satisfiability problem is TOWER-hard, which matches the upper bound for MSO constraints [19].
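The non-mixing requirement quoted above can be approximated by a simple syntactic check, sketched here in Python. This is not the NonMixingXPath grammar itself, which also restricts negation: the sketch only tests that an expression does not use both eq and ne, and the encoding of expressions as nested tuples with string leaves is a hypothetical choice of ours.

```python
def comparison_operators(expr):
    """Collect the data comparison operators (eq, ne) occurring in expr."""
    ops = set()
    if isinstance(expr, tuple):
        if expr[0] in ("eq", "ne"):
            ops.add(expr[0])
        for child in expr[1:]:
            ops |= comparison_operators(child)
    return ops


def is_non_mixing(expr):
    """True if expr uses equality or disequality tests, but never both."""
    return comparison_operators(expr) != {"eq", "ne"}


# Hypothetical expressions: operators are tuple heads, atomic paths are strings.
only_eq = ("or", ("eq", "child::a", "child::b"), ("eq", "child::c", "child::d"))
mixed = ("and", ("eq", "child::a", "child::b"), ("ne", "child::c", "child::d"))
```

Here only_eq passes the check while mixed does not; an expression without any data comparison passes trivially.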

Baseline Benchmark Results
We have implemented the fragments of this section as Relax NG schemas. In each case we included obvious extensions, such as the syntactic sugars discussed in Sec. 3.5 (in particular, the last() function is included in CoreXPath 2.0 and NonMixingXPath).
The results of these fragments on the benchmark are presented in grey in figures 7 and 8 (p. 12). The fragments allowing free variables, namely CoreXPath 2.0, EMSO²XPath, and NonMixingXPath, have the best baseline coverage. We see here the practical interest of a fragment like NonMixingXPath with restricted negation but some support for variables, data tests π △ d, and data joins π △ π. The other fragments have an essentially negligible coverage in the XQuery benchmarks (Fig. 8). The support for unrestricted joins π △ π′ in the fragments of DataXPath from Sec. 4.1.4 has a very limited effect, and indeed we only found 65 relevant instances in the entire benchmark, i.e. where neither π nor π′ is a variable or a function call. Of course, the fragments defined in the literature were not meant to be run against concrete XPath queries; we will see in the next section that several extensions can be made to these fragments.

EXTENSIONS
In this section, we introduce several extensions of the fragments from Sec. 4, while preserving the decidability and complexity of the corresponding satisfiability problems. As seen in Tab. 5, we consider first 'basic' extensions with considerable impact on the benchmark coverage in sections 5.2.1 to 5.2.3, and then 'advanced' ones with smaller impact in sections 5.2.4 to 5.2.6.

Handling an Extension in a Fragment
A first way to prove that an extension can be handled is, when given a node expression φ in the extended syntax, to compute an equivalent node expression φ′ in the original fragment, i.e. such that for all data trees and valuations ν, ⟦φ⟧^ν_N = ⟦φ′⟧^ν_N, and similarly for path expressions. In such a case, we say that the extension can be expressed in the fragment, and polynomially so if φ′ can be computed in polynomial time.
A second way is instead, when given a node expression φ in the extended syntax, to compute an equisatisfiable node expression φ ′ in the original fragment, i.e. such that there exists t with t |= φ if and only if there exists t ′ with t ′ |= φ ′ . In such a case, we say that the extension can be encoded in the fragment, and that it can be polynomially encoded if φ ′ can be computed in polynomial time. Clearly, an extension that can be (polynomially) expressed can also be (polynomially) encoded, but the converse might not hold.
Last of all, the proof techniques employed to show the decidability of satisfiability in a fragment might be directly adaptable to handle the extension at hand.

Extensions
5.2.1 /π: Root Navigation. In XPath, navigation to the root is possible through the /π construct as well as using the nullary root() function [22], with semantics ⟦root⟧_F ≝ {ε}. We naturally allow these features in CoreXPath 1.0 and 2.0 as well as in the NonMixingXPath and EMSO²XPath fragments, where navigation to the root is captured by
ancestor-or-self::*[not(parent::*)] (11)
The same goes for VerticalXPath but not for the two other DataXPath fragments, where one cannot navigate upwards. It is clear that root navigation is not expressible in these fragments. In fact, it cannot even be encoded in ForwardXPath, since it becomes undecidable when extended with navigation to the root, as can be seen by adapting the proofs from [28]. We leave open the question whether there is a (polynomial) encoding of root navigation in DownwardXPath.
Finally, regarding PositiveXPath, Hidders's original fragment [37] allowed root navigation, and his proof of a small model property (showing the satisfiability problem to be in NP) applies mutatis mutandis to the fragment with data joins defined in Sec. 4.1.1.
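The root-selecting expression (11) can be checked on a small example. The following is a minimal sketch over a toy tree model: the `Node` class and the helper names are hypothetical illustrations, not part of the paper's formal development or of any XPath library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A node of a toy unranked tree (hypothetical model for illustration)."""
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str) -> "Node":
        """Create a child with the given label and return it."""
        child = Node(label, parent=self)
        self.children.append(child)
        return child


def ancestor_or_self(node: Node) -> List[Node]:
    """Nodes on the ancestor-or-self axis, listed from `node` upwards."""
    result = []
    current: Optional[Node] = node
    while current is not None:
        result.append(current)
        current = current.parent
    return result


def root_via_axes(node: Node) -> List[Node]:
    """Evaluate ancestor-or-self::*[not(parent::*)] at `node`."""
    return [n for n in ancestor_or_self(node) if n.parent is None]


root = Node("a")
b = root.add("b")
c = b.add("c")
d = root.add("d")

# From every context node, the expression selects exactly the root.
for context in (root, b, c, d):
    assert root_via_axes(context) == [root]
```

The sketch also makes the limitation of the downward and forward fragments visible: the evaluation relies on following `parent` pointers, which is exactly the upward navigation those fragments lack.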

$x: Free Variables.
In most of our fragments, free variables are admissible. Indeed, any formula φ over Σ and $\mathbb{D}$ with a (necessarily finite) set of free variables $X \subseteq \mathcal{X}$ can be translated into an equisatisfiable formula $\varphi_X$ over $\Sigma \times 2^{X}$ and $\mathbb{D}$ with no free variables. Let us write $a_S$ for $(a, S) \in \Sigma \times 2^{X}$.
The key translation steps are: Assuming without loss of generality that the variables bound by constructs such as for or let do not belong to X, they are not affected by this translation. Note that although the encoding is exponential, the extension does not actually impact the complexity of satisfiability: in PositiveXPath and CoreXPath 1.0, a polynomial encoding can be obtained, leveraging a variant of the semantics where multiple propositions may hold at a node. In VerticalXPath, satisfiability is ACKERMANN-hard, so an exponential blow-up will not have an effect on the worst-case complexity. The decision procedures for the fragments CoreXPath 2.0, NonMixingXPath, and EMSO 2 XPath are based on second-order logic, hence they actually allow XPath variables in their baseline version.
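To make the source of the exponential blow-up concrete, the extended labelling set Σ × 2^X can be enumerated directly. This is only a sketch of the alphabet, not of the translation itself; the function names are hypothetical, and the reading of a label (a, S) as "a node labelled a at which the variables in S are bound" is our assumption.

```python
from itertools import chain, combinations
from typing import FrozenSet, Iterable, List, Tuple

Label = Tuple[str, FrozenSet[str]]


def powerset(variables: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets of the given variables, smallest first."""
    items = sorted(variables)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def extended_alphabet(sigma: Iterable[str], variables: Iterable[str]) -> List[Label]:
    """Enumerate the pairs (a, S) with a in sigma and S a subset of variables."""
    subsets = powerset(variables)
    return [(a, s) for a in sorted(sigma) for s in subsets]


labels = extended_alphabet({"a", "b"}, {"x", "y"})

# The alphabet has |sigma| * 2^|X| labels: exponential in the number of variables.
assert len(labels) == 2 * 2 ** 2
assert ("a", frozenset({"x"})) in labels
```

With two labels and two variables the alphabet already has eight letters, which is why a polynomial encoding needs a semantics where several propositions may hold at one node rather than one letter per subset.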
We finally observe that this translation is not available in DownwardXPath and ForwardXPath, because they cannot express the root path of (12). In fact, free variables cannot be encoded in ForwardXPath; this is similar to Prop. 5.1.

Proposition 5.3. Satisfiability in ForwardXPath extended with one free variable is undecidable.

We now consider the extension with direct comparisons against constants from $\mathbb{D}$, assuming that < is a dense total order:

$\varphi ::= \cdots \mid \pi \mathrel{\triangle^{+}} d$

where $\triangle^{+} \in \{\text{eq}, \text{ne}, \text{le}, \text{lt}, \text{ge}, \text{gt}\}$ and $d \in \mathbb{D}$.
We show that any formula over Σ and $\mathbb{D}$ featuring comparisons against constants in a finite subset $D \subseteq \mathbb{D}$ can be transformed into an equisatisfiable formula over an extended labelling set $\Sigma \times C_D$. This is similar to the treatment of free variables in Sec. 5.2.2, but requires including data consistency constraints in the encoded formula when the fragment at hand supports data joins. Crucially, these consistency constraints are mixing, and thus not available in NonMixingXPath. Regarding PositiveXPath, the small model property [37, Lem. 1] still holds in the presence of data tests, hence satisfiability remains in NP for this extension.
Going slightly further, we allow comparisons against constant data expressions in our fragments, where constant expressions are built from constant values and the deterministic context-insensitive functions of the XPath specification [22]; e.g., @n eq 3 + 1 is allowed.
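The reason a finite label component can stand in for data tests is that, over a total order, the outcome of every comparison against a constant depends only on where the datum sits relative to the finitely many constants. The sketch below illustrates this with one natural choice of regions (each constant, plus the open gaps around them). This choice is our assumption and is not claimed to be the paper's definition of C_D; all names are hypothetical.

```python
import operator
from bisect import bisect_left
from typing import Sequence, Tuple

Region = Tuple[str, int]

OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}


def region(value: float, constants: Sequence[float]) -> Region:
    """Classify `value` w.r.t. the sorted, distinct `constants`.

    ("eq", i)  means value == constants[i];
    ("gap", i) means constants[i-1] < value < constants[i],
               where the two outermost gaps are unbounded.
    """
    i = bisect_left(constants, value)
    if i < len(constants) and constants[i] == value:
        return ("eq", i)
    return ("gap", i)


def holds(op: str, reg: Region, j: int) -> bool:
    """Decide `value op constants[j]` from the region of `value` alone."""
    kind, i = reg
    if kind == "eq":
        sign = (i > j) - (i < j)      # sign of constants[i] - constants[j]
    else:
        sign = -1 if i <= j else 1    # gap i sits just below constants[i]
    return OPS[op](sign, 0)


constants = [1.0, 4.0, 7.0]
samples = [0.0, 1.0, 2.5, 4.0, 5.0, 7.0, 9.0]

# The region determines the outcome of every test against a constant.
for v in samples:
    for j, d in enumerate(constants):
        for op, fn in OPS.items():
            assert holds(op, region(v, constants), j) == fn(v, d)

# n constants induce 2n + 1 regions; each sample above hits a distinct one.
assert len({region(v, constants) for v in samples}) == 2 * len(constants) + 1
```

Density of the order matters for satisfiability rather than for this check: it guarantees that every gap between two constants actually contains data values, so no region is vacuous.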

π △ π: Positive Data Joins.
We observe that most of our fragments can be extended to allow restricted occurrences of data joins. Intuitively, we allow data joins in positions that guarantee that the join will be evaluated only once during satisfaction checking, which allows us to replace it by two tests against a specially chosen data constant. Hence, we extend any fragment with

$\pi^{+} ::= \pi \mid /\pi^{+} \mid \pi^{+}/\pi^{+} \mid \pi^{+}[\varphi^{+}] \mid \pi^{+} \text{ union } \pi^{+} \mid \pi^{+} \text{ except } \pi \mid \pi^{+} \text{ intersect } \pi^{+} \mid \text{some } \$x \text{ in } \pi^{+} \text{ satisfies } \varphi^{+}$

$\varphi^{+} ::= \varphi \mid \pi^{+} \mid \varphi^{+} \text{ or } \varphi^{+} \mid \varphi^{+} \text{ and } \varphi^{+} \mid \pi^{+} \text{ is } \pi^{+} \mid \pi^{+} \mathrel{\triangle} \pi^{+} \mid \pi^{+} \mathrel{\triangle} d$

Productions for constructs such as navigation to the root, intersection, node comparison, etc. should only be considered in fragments where they are allowed. This works for all relevant fragments: NonMixingXPath does not support the mixing data tests required in our encoding, and PositiveXPath already supports positive data joins in its baseline version.
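The idea of replacing a join by two tests against a common constant can be illustrated at the level of data values. The sketch below covers only the equality join under an existential reading (some datum reached on the left equals some datum reached on the right). Both this reading and the helper names are assumptions made for illustration; the sketch does not reproduce the paper's encoding.

```python
from typing import Collection, Optional


def join_eq(left: Collection[int], right: Collection[int]) -> bool:
    """Existential equality join: some datum occurs on both sides."""
    return any(a == b for a in left for b in right)


def data_test_eq(values: Collection[int], d: int) -> bool:
    """Data test against the constant d: some selected datum equals d."""
    return any(v == d for v in values)


def witness(left: Collection[int], right: Collection[int]) -> Optional[int]:
    """A constant d satisfying both data tests, if one exists."""
    common = set(left) & set(right)
    return min(common) if common else None


cases = [([1, 2, 3], [3, 4]), ([1, 2], [3, 4]), ([], [1]), ([5, 5], [5])]

for left, right in cases:
    d = witness(left, right)
    # The join holds exactly when some constant passes both tests.
    assert join_eq(left, right) == (d is not None)
    if d is not None:
        assert data_test_eq(left, d) and data_test_eq(right, d)
```

The restriction to positions where the join is evaluated only once is what makes a single constant sufficient: a join evaluated at several nodes could need a different witness at each of them.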

last(): Positional Predicates.
The typical use of last() in XPath is through a positional predicate π[position() = last()] or π[last()] that only keeps the last node in the document order among all those selected by π. We can also check whether a node is the ith one for some i > 0 with π[i], or not the last one with π[position() != last()], or not the ith one with π[position() != i]. As seen in Tab. 5, these constructions are quite frequent in the benchmark. Here we discuss the case of last() and its negation, but the other positional predicates can be handled in a similar fashion.
As explained in Sec. 3.5.2, this kind of predicate is not supported in our simplified semantics, thus we rather focus on the one-argument functions last(π) and notlast(π) (the latter is not a standard function), with semantics $\llbracket\mathrm{last}\rrbracket_{F}(S) \stackrel{\text{def}}{=} \max_{\ll} S$ and $\llbracket\mathrm{notlast}\rrbracket_{F}(S) \stackrel{\text{def}}{=} S \setminus \{\max_{\ll} S\}$ for any S ⊆ N. Recall from Sec. 3.5.2 that last() can be expressed natively in CoreXPath 2.0 and thus in NonMixingXPath. The question here is to what extent it can be handled in the other fragments. Our first result is that in some cases, the last() and notlast() functions cannot be expressed. This result, proved using bisimulation techniques, shows that last() cannot be expressed in VerticalXPath nor in DownwardXPath, even for simple one-step paths. Furthermore, we can show that it cannot be polynomially encoded in DownwardXPath by adapting the hardness proofs of [28].
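The set-based semantics of last and notlast given above can be sketched in a few lines. We assume, for illustration only, that each node is identified by its rank in document order, so that the maximum for ≪ is the numeric maximum; returning an empty set on an empty argument is also our convention.

```python
from typing import Set


def last(selected: Set[int]) -> Set[int]:
    """The last selected node in document order, as a singleton set."""
    return {max(selected)} if selected else set()


def notlast(selected: Set[int]) -> Set[int]:
    """All selected nodes except the last one in document order."""
    return selected - last(selected)


nodes = {2, 5, 9}

assert last(nodes) == {9}
assert notlast(nodes) == {2, 5}
# The two functions partition their argument.
assert last(nodes) | notlast(nodes) == nodes
assert last(set()) == set() and notlast(set()) == set()
```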
However, we can still look for some uses of last() that can be reasonably allowed. In particular, the path expressions from the statement of Prop. 5.7 can be handled in ForwardXPath, or at least in its regular extension [51] with new axes and the Kleene plus and Kleene star operators on paths:

$\alpha ::= \cdots \mid \text{previous-sibling} \mid \text{next-sibling}$
$\pi ::= \cdots \mid \pi^{+} \mid \pi^{*}$

Their semantics are defined by $\llbracket\text{previous-sibling}\rrbracket_{A} \stackrel{\text{def}}{=} {\to}^{-1}$, and dually for next-sibling.

As far as the computational complexity of satisfiability is concerned, this extension comes 'for free' in CoreXPath 1.0 [3,44]. We show in the full version that, using this regular extension, we can handle last(π) and notlast(π) on any one-step path argument of the form π = α::*[φ].

[13,43]. In full generality, this mechanism relies on attributes declared as identifiers in the XML schema, but we shall consider a simpler setting where the @id attribute plays this role. The function id() then takes a path as argument, and returns the nodes of the document (if any) that have an @id attribute matching the datum found at the end of the path given as argument. In Fig. 3, evaluating id(@ref) at the t node returns the o node.
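The simplified id() mechanism described above, where the @id attribute plays the role of the identifier, can be sketched as a lookup over the nodes of a document. The `Node` class, the function name, and the two-node document are hypothetical; the node names t and o merely echo the example, and the attribute values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A toy element with a name and string-valued attributes."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)


def id_jump(document: List[Node], data: List[str]) -> List[Node]:
    """Nodes whose @id attribute matches one of the given data values."""
    wanted = set(data)
    return [n for n in document if n.attrs.get("id") in wanted]


t = Node("t", {"ref": "n1"})
o = Node("o", {"id": "n1"})
document = [t, o]

# Evaluating id(@ref) at t: collect the data of @ref, then jump to matching @id.
assert id_jump(document, [t.attrs["ref"]]) == [o]
# A datum that matches no @id yields no node.
assert id_jump(document, ["missing"]) == []
```

The jump ignores the tree structure entirely, which gives some intuition for why combining id() with data joins or node tests is so expressive.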
Adding id() jumps makes most of the fragments undecidable, as soon as we can also use (full) data joins or node tests π is π ′ . The proof reduces from Post's correspondence problem.
We show in the full paper that limited support for the id() function can be provided in VerticalXPath and EMSO 2 XPath.

Extended Benchmark Results
We have implemented the extensions of this section as Relax NG schemas. The results on the benchmark are presented in violet in figures 7 and 8. Generally, the differences observed before still hold but are significantly lessened. Strikingly, the extensions even bring CoreXPath 1.0 above NonMixingXPath: the latter only supports non-mixing data tests. It also differentiates VerticalXPath from the two other DataXPath fragments, due to its support for root paths, which in turn makes it possible to support free variables.
Looking at the influence of each extension separately, Tab. 6 shows that the 'basic' extensions of sections 5.  predicates, we found that many occurrences are of the form last($x), which is outside the scope of our treatment in Sec. 5.2.5. Regarding id(), Tab. 5 shows that there are very few occurrences of id() in the benchmark. A quick investigation of the usage of the id attribute in the benchmark shows that developers rather interact with it through variables and data tests. CoreXPath 2.0. We look more closely at the differences between the fragments in Sec. 6.1.

DISCUSSION
Obviously, the coverage of XPath queries extracted from XQuery files is quite poor compared to that of XSLT files. Among the other factors, we see that the size of the query is (negatively) correlated with coverage (Tab. 7). Another correlated factor is the presence of at least one axis step: the combined coverage is 74.50% for queries with an axis step, but only 48.79% for queries without any. The main factor we identify is however the presence of non-standard or unsupported function calls in the query, which we discuss in Sec. 6.2.

Comparisons Between Fragments
In the case of the extended fragments of Sec. 5, the inclusions of Fig. 6 are slightly changed: CoreXPath 2.0 now contains NonMixingXPath, and PositiveXPath is included in CoreXPath 2.0 but disjoint from NonMixingXPath.
These theoretical inclusions are reflected in the difference matrix shown in Tab. 8 and the accompanying chord graph 4 of Fig. 9. There are three maximal incomparable fragments, namely CoreXPath 2.0, VerticalXPath, and EMSO 2 XPath. Among the fragments of DataXPath, VerticalXPath benefits from the support of most extensions, while-as seen in Tab. 2-horizontal navigation is not used very frequently in the benchmark. The coverage of the extended CoreXPath 2.0 is almost as large as the combined coverage: only 30 queries from VerticalXPath are not captured by CoreXPath 2.0, and they all contain data joins under a negation; only 12 queries from EMSO 2 XPath are not captured, which include 11 queries with id() plus one of the previous 30.
From a more practical perspective, we think that the extended versions of PositiveXPath and CoreXPath 1.0 are the most promising ones: satisfiability has a more manageable complexity (NP-complete and EXP-complete, resp.), and the coverage is not too far behind CoreXPath 2.0 (with 942 and 57 fewer queries, resp.). Note that PositiveXPath is nearly included in CoreXPath 1.0, with only four queries (featuring intersections) not captured by CoreXPath 1.0.

Supporting Functions
Due to the large number of calls to non-standard functions in the benchmark, the coverage of 'XPath 3.0 std' (cf. Tab. 1) is an upper bound on the achievable coverage. With respect to the number of queries captured by 'XPath 3.0 std', the combined coverage is 78.33%, and the precise coverage varies between 75.71% for PositiveXPath and 82.14% for CoreXPath 2.0 in XSLT sources and between 56.30% for NonMixingXPath and 60.00% for CoreXPath 2.0 in XQuery sources: the latter sources are more complex even when leaving aside their higher reliance on non-standard functions.

4 A chord between fragments i and j has thickness proportional to entry (i, j) on its i end, and to entry (j, i) on its j end. The color of the chord is the one of the 'winning' value. An interactive version is available online at http://www.lsv.fr/~schmitz/xpparser/.
A remaining issue is the support of standard functions. For instance, the four functions that occur the most frequently in the benchmark are, in decreasing order, count(), concat(), local-name(), and contains(), and they are all standard; local-name() is supported in our fragments, but the remaining three are not.
6.2.1 Aggregation. CoreXPath 1.0 extended with node expressions count(π ) △ + i for an integer i can be translated into the two-variable fragment of first-order logic with counting on trees, which has an EXPSPACE decision procedure [9]. There are 314 occurrences of such expressions out of the 624 occurrences of count() in the benchmark, but unfortunately not a single query is gained by adding this feature to the extended CoreXPath 1.0 fragment. Capturing more occurrences of count() requires arithmetic operations, and leads to an undecidable fragment akin to AggXPath [13].

String Processing and Arithmetic.
A promising direction for supporting more functions is the move to SMT solvers, even though it might also mean moving to semideciding satisfiability. Linear arithmetic is supported by all solvers, while theories comprising string concatenation, string length, and substring operations are also supported [e.g. 1,40,42,53,55]. SMT solvers have already been used in [11] to check XQuery inputs, using the classical interval encoding of trees, and the approach could be enriched to cover basic arithmetic and string support. Furthermore, a custom finite tree theory may be added to SMT solvers for more efficiency [e.g. 14, 18]. These considerations lead us to adding support in the extended PositiveXPath for linear arithmetic, the standard functions concat(), contains(), string-length(), and similar ones such as ends-with(). We view this fragment as a good candidate for practical satisfiability checking. At 62.75%, the coverage of this fragment, shown in yellow in figures 7 and 8, bests the combined coverage of our other fragments. This translates in particular to 84.77% of the subset of XSLT queries captured by 'XPath 3.0 std'.

Session 5: Semistructured Data and Knowledge Graphs, Logic, and Verification PODS '19, June 30-July 5, 2019, Amsterdam, Netherlands
This new fragment is incomparable with the others, with a new combined coverage of 67.40% (83.55% for XSLT sources). If we restrict our attention to the 14,732 queries (69.68%) that only use not() and the functions supported in this new fragment, the new combined coverage reaches 96.69% (98.10% for XSLT sources). Thus our fragments cover nearly all the queries that do not use unsupported or non-standard functions, from which we argue that improved function support is the most promising research avenue in order to gain over the extended fragments described in Sec. 5.

CONCLUDING REMARKS
We have designed a benchmarking infrastructure for testing the practical relevance of XPath fragments, based on the XQueryX format and Relax NG schemas. We have used this benchmark of over 20,000 queries, extracted from the XSLT and XQuery files of open-source projects, to evaluate the syntactic coverage of state-of-the-art XPath fragments for which decidability is known (or still open in the case of EMSO 2 XPath). Concerning the benchmark itself, it would of course be interesting to incorporate new sources, to confirm our observations on a larger scale.
Our analysis shows that, in a hypothetical satisfiability checker for XPath, the differences between the fragments defined in the theoretical literature are not as important as the differences introduced by the front-end translating real XPath inputs to the restricted syntax on which the decision procedure operates. Among the features that such a front-end should support, the most impactful ones would be free variables, data tests against constants (and constant expressions), positive joins, and positional predicates.
According to our benchmark results, such a front-end combined with the decidable fragments from the literature would cover about 70%-75% of the XPath queries found in XSLT files. However, due to the high reliance on user-defined or ill-supported functions, this drops to less than 30% for XPath queries from XQuery files: full-blown program analysis techniques seem necessary for XQuery.
As the support of XPath functions is a key factor, a promising approach might be to harness the power of modern SMT solvers to handle string-manipulating functions and linear arithmetic, which might cover 77.44% of the XPath queries from the XSLT sources.

Synthetic Benchmarks
We did not include earlier synthetic benchmarks in our analysis, as we focus on real-world queries. We report here on two well-known synthetic benchmarks from the literature, XMark [49] and XPathMark [30]. The latter comprises two collections of queries: functional queries (XPathMark-FT) check the functional correctness of XPath processors, while performance queries (XPathMark-PT) measure the performance of query evaluation in XPath processors.
A.1.1 Composition. Table 9 shows the composition of these benchmarks following the same format as in Tab. 1. The composition of these two synthetic benchmarks differs significantly from Tab. 1. The queries in these benchmarks are almost all from XPath 1.0, and although XMark is a benchmark of XQuery formulae, it employs very few non-standard functions or complex constructs such as map.
A.1.2 Coverage. The coverage of the various syntactic fragments of XPath over these synthetic benchmarks is shown in Fig. 10, following the same format as figures 7 and 8.
As these sources total less than 200 queries, it is hard to draw any conclusions from this analysis. Nevertheless, we observe that the improvement brought by extensions is much more significant than for real-world queries, which might be explained by the fact that synthetic benchmarks aim for conciseness while attempting to achieve a complete coverage of XPath features. Similarly, the coverage of the extended VerticalXPath fragment is significantly lower, compared to other extended fragments, than for real-world queries, due to the higher proportion of horizontal navigation. Overall, the coverage of extended fragments is slightly lower than the one achieved for XSLT sources in the real-world benchmark.