XML-search query language: needs and requirements

Jovan Pehcevski [HREF1],
School of Computer Science and Information Technology [HREF2], RMIT University, Melbourne, Victoria 3000
jovanp@cs.rmit.edu.au

James Thom [HREF3],
School of Computer Science and Information Technology [HREF2], RMIT University, Melbourne, Victoria 3000
jat@cs.rmit.edu.au

Anne-Marie Vercoustre [HREF4],
CSIRO, Mathematical and Information Sciences [HREF5], South Clayton, Victoria 3169
Anne-Marie.Vercoustre@csiro.au

Abstract

This paper explores the needs of XML-search and compares the INEX experience in evaluating XML retrieval with the recently proposed W3C requirements for extending XQuery and XPath to handle full-text retrieval. Our experience in INEX gave insight into some problems of combining ranking with XML structured queries and, more generally, into the needs from the users' perspective. INEX has focused on how to express the user needs through topics that may involve structural XML elements. How such user needs can be translated into XML queries has an impact on the requirements for an XML-search language. The analysis in this paper shows that the W3C requirements for extending XQuery and XPath with full-text search seem to address most of the needs identified through our first experiments with INEX. In particular, they make it possible to specify an implementation-dependent scoring function that allows ranking of answers. However, there are still issues concerning how the W3C requirements cater for combining structural requirements with ranking.

1. Introduction

The Extensible Markup Language (XML) [6] developed by the World Wide Web Consortium (W3C) [HREF6] can be used both for exchange of data in Web-based information systems and for representing the logical structure of any kind of document that contains free-text information, particularly document collections such as scientific articles, business reports, and heterogeneous Web pages. XML is a meta language allowing users to design markup languages that can express their various interests. As large XML collections are being deployed, there is an important need for an adequate query language to search these collections.

There are two aspects of XML which a standard query language should support.

The document-centric aspect focuses on XML applications that use the existing markup mainly for describing the logical document structure. A query language that supports this aspect will allow for flexible formulation of query conditions with respect to both the content and the structure of XML documents. XQL [20] is an example of a query language that addresses the document-centric aspect of XML.
The data-centric aspect focuses on XML applications that exchange data in a structured form conforming to a previously defined schema. Typical examples are EDI applications, such as orders, bills, or product catalogues. A query language that supports this aspect will offer many operators for restructuring the result along with other aggregation operators. These operators are widely used in standard database query languages, such as SQL. XML-QL [12] is an example of a query language that addresses the data-centric aspect of XML.

XQuery [5], the current W3C proposal for a standard XML query language, takes advantage of both content and structure specified within XML collections, and typically specifies exact answers to a user's query by strict satisfaction of logical query conditions. In this respect it is often grouped with database-oriented query languages, which find exact matches for the users' queries in the underlying document or data collection. However, XQuery manages to combine some features of XQL with those of XML-QL and thus addresses both aspects of XML described above.

The XML information retrieval community, on the other hand, follows the document-centric aspect, and has been interested in ranked answers to the users' queries along with successful satisfaction of their information needs.

There has been a lot of research on XML search in recent years from both the database and information retrieval communities, and the W3C is showing a willingness to define a standard that closes the gap between these communities.

Thus it is timely to examine the needs and requirements of XML-Search for full-text as two separate recent initiatives have been tackling the problem of bridging this gap. In Section 2 we introduce the Initiative for the Evaluation of XML retrieval, INEX [HREF7], organized by the European DELOS Network of Excellence for Digital Libraries, which held its first workshop in December 2002. In February 2003 the XQuery working group within the W3C published the XQuery and XPath Full-Text Requirements [7] and XQuery and XPath Full-Text Use Cases [1]; we outline some of the main requirements of these working drafts in Section 3. In Section 4 we consider the additional needs for document-centric XML retrieval, and Section 5 outlines some issues with expressing those needs. Related work on XML and information retrieval is surveyed in Section 6. We present our conclusion in Section 7 that the proposed extensions to XQuery do not fully bridge the gap between the database and information retrieval communities.

2. The INEX initiative

INEX is a coordinated international effort evaluating content-based XML retrieval [HREF7]. In its first year, INEX involved groups from around the world undertaking a series of retrieval tasks (topics) on a large collection of XML documents (XML sources for about 12,000 articles in IEEE Computer Society publications since 1995). Each group submitted suggestions for topics (queries), and altogether 60 topics were selected from those submitted by the participating groups. Of these, 30 topics specified content only (CO topics) and corresponded to conventional information retrieval queries, although the answers to the CO topics could be elements instead of whole documents. The other 30 topics were more constrained, specifying both content and structural aspects (CAS topics). Figure 1 shows a slightly modified Topic 14, which is derived from one of the CAS topics proposed by the CSIRO group to find figures that describe the Corba architecture and the paragraphs that refer to those figures [11,22].

   <?xml version="1.0" encoding="ISO-8859-1"?>
   <!DOCTYPE INEX-Topic SYSTEM "inex-topics.dtd">
   <INEX-Topic topic-id="14" query-type="CAS" ct-no="075">
   <Title>
      <te>//fig,//p,//ip1</te>
      <cw>Corba architecture</cw>
      <ce>//fgc</ce>
      <cw>Figure Corba Architecture</cw>
      <ce>//p, //ip1</ce>
   </Title>
   <Description>
      Find figures that describe the Corba architecture
      and the paragraphs that refer to those figures.
   </Description>
   <Narrative>
      To be relevant a figure must describe the standard
      Corba architecture or a system architecture that
      relies heavily on Corba. A figure describing a
      particular aspect of a system will not be regarded
      as relevant even though the system may rely on
      Corba otherwise. Retrieved components would
      ideally contain both the figure and the paragraph
      referring to it.
   </Narrative>
   <Keywords>
      CORBA ORB Object Request Broker Architecture
      interface invocation interoperability
      communication protocols IDL
   </Keywords>
   </INEX-Topic>

Figure 1: INEX topic 14

The language for topic description is an extension of the one used for TREC evaluations [HREF14]. Within the <Title>, the <cw> elements specify the query terms for the topic. The CAS topics also define <ce> content element constraints, <te> target element constraints, or both: the <ce> parts specify the document elements in which the query terms should appear, and the <te> part specifies the target elements for the answers. In the example shown in Figure 1, the query terms "Corba" and "Architecture" should appear within the figure's caption (given by the XPath [4] expression //fgc), and the query terms "Figure", "Corba" and "Architecture" should appear within a paragraph (given by the XPaths //p and //ip1). The target to be returned as an answer is a figure (the element defined by the XPath //fig) or a paragraph (an element defined by the XPaths //p and //ip1). For topics where no target was specified, the retrieval system should return an element or elements from matching documents that most closely meet the information need expressed in the query. The other elements in Topic 14 are as follows. The <Description> contains a one- or two-sentence natural language definition of the information need, the <Narrative> provides a detailed explanation of the topic statement in terms of what makes a document or element relevant or not, and the <Keywords> element is a set of query terms that are used in the collection exploration phase during the topic development process. We will discuss limitations of this language for describing topics in Section 5.
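As an illustration only, the following Python sketch reads the <Title> of Figure 1 and separates the target elements from the (query terms, content elements) pairs. It assumes that each <cw> applies to the <ce> that immediately follows it, which is our reading of the example rather than a rule stated by INEX; the function names parse_title and split_paths are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the <Title> element from Figure 1.
TITLE_XML = """
<Title>
   <te>//fig,//p,//ip1</te>
   <cw>Corba architecture</cw>
   <ce>//fgc</ce>
   <cw>Figure Corba Architecture</cw>
   <ce>//p, //ip1</ce>
</Title>
"""


def split_paths(text):
    """Split a comma-separated list of XPath expressions."""
    return [p.strip() for p in (text or "").split(",") if p.strip()]


def parse_title(title_xml):
    """Return (target_paths, conditions) for a topic <Title>.

    Assumption: a <cw> applies to the <ce> that immediately follows it.
    A <cw> with no following <ce> gets an empty context list.
    """
    title = ET.fromstring(title_xml.strip())
    targets, conditions, pending_terms = [], [], None
    for child in title:
        if child.tag == "te":
            targets.extend(split_paths(child.text))
        elif child.tag == "cw":
            if pending_terms is not None:
                # Previous <cw> had no <ce>: record it as unconstrained.
                conditions.append((pending_terms, []))
            pending_terms = (child.text or "").split()
        elif child.tag == "ce":
            conditions.append((pending_terms or [], split_paths(child.text)))
            pending_terms = None
    if pending_terms is not None:
        conditions.append((pending_terms, []))
    return targets, conditions


targets, conditions = parse_title(TITLE_XML)
for terms, contexts in conditions:
    print(terms, "within", contexts)
print("return", targets)
```

Keeping the terms and their contexts together as pairs makes the structure of the topic explicit, which is the information a translator into a native query language would need.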

Each group was able to submit three runs with variations of their system and return (up to) the top 100 answer elements for each topic. These were then combined into a pool for each topic of between one and two thousand answer elements, which were then assessed for relevance, usually by the participating author who had originally formulated the topic. The surrounding and component elements of the answer elements in the pool were also assessed. There were two dimensions to the assessments: four degrees of relevance (from irrelevant to highly relevant) and four levels of document coverage (from no coverage to exact coverage). These two dimensions can then be combined in various ways; for example, a strict measure would only count answer elements that are both highly relevant and have exact coverage as relevant when computing recall and precision.
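The strict measure can be sketched in a few lines of Python. The numeric coding of relevance (0 to 3), the label "exact" for exact coverage, and the function names are assumptions made for this illustration; they are not the official INEX encoding or evaluation software.

```python
HIGHLY_RELEVANT = 3  # assumption: relevance degrees coded 0 (irrelevant) to 3
EXACT = "exact"      # hypothetical label for the "exact coverage" level


def strict_relevant(relevance, coverage):
    """Strict measure: highly relevant AND exact coverage."""
    return relevance == HIGHLY_RELEVANT and coverage == EXACT


def precision_recall(ranked, assessments, cutoff=100):
    """Precision and recall of a ranked run under the strict measure.

    ranked      -- element identifiers in rank order
    assessments -- dict: element identifier -> (relevance, coverage)
    """
    relevant = {e for e, (r, c) in assessments.items() if strict_relevant(r, c)}
    retrieved = ranked[:cutoff]
    hits = sum(1 for e in retrieved if e in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy assessments; identifiers and labels are made up for the example.
assessments = {
    "a": (3, "exact"),
    "b": (3, "none"),
    "c": (2, "exact"),
    "d": (3, "exact"),
}
print(precision_recall(["a", "b", "c"], assessments))
```

Other combinations of the two dimensions would only require a different strict_relevant function.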

The INEX experiments provide valuable experience with a large collection of XML documents and help identify the requirements for realistic document-centric XML user queries as compared to data-centric queries sent by applications to XML data repositories.

3. W3C full-text requirements

In February 2001, the W3C published the XML Query Requirements Working Draft [10], which lists the basic requirements that a standard XML query language should meet. As observed by Fuhr and Grossjohann [14], among all the features listed in the draft, the only feature specifically related to the document-centric aspect is the ability to process simple text conditions; the other 18 features follow from the data-centric aspect. It is desirable that a standard XML query language also addresses the document-centric aspect of XML, which requires that the language incorporates the common concepts developed in the area of information retrieval.

Recently, the XQuery working group within the W3C published the XQuery and XPath Full-Text Requirements [7] and XQuery and XPath Full-Text Use Cases [1] Working Drafts, which describe combining full-text search and ranking capabilities within XQuery and XPath. While these documents are still works in progress, they clearly identify the primary functionalities which the W3C wishes to achieve with these requirements. Specifically, full-text search is defined as "an extension to the XQuery/XPath language [that] ... provides a way to query text which has been tokenized"[7]. Thus the requirements include: word and phrase search, support for stopwords and stemming, proximity search, boolean operators, word normalization, diacritics, and ranking.

The requirements for the scope of the input queries [7] include that XQuery/XPath Full-Text must allow search within an arbitrary structure (or XPath expression), must allow a query to return arbitrary nodes, must allow the combination of predicates on different parts of the searched document "tree", and must be able to support full-text search within element attributes.

Ranking is introduced with the use case SCORE that "reflects relevance of matched material"[7]. The SCORE (a float, in the range of 0-1) therefore represents a similarity measure between a particular document fragment and a given query.

The XQuery/XPath Full-Text Use Cases working draft [1] follows on from the initial full-text requirements and identifies the types of queries in the SCORE case. It explicitly states that these queries relate to the ranking of the document fragments and thus cannot be specified as queries that include Boolean operators. Therefore, these queries would include multiple words or phrases, which is very similar to the input of the search engine in traditional information retrieval systems. The draft furthermore specifies the context with which the users would like to associate their information needs, and the resulting fragments that would be presented as the final answer to the query. The answers may be fragments referring to embedded elements within the documents, computed element scores (possibly over a specified threshold), or they may be constructed from multiple elements (for example, paragraph and figure as in INEX Topic 14). Since the implementation of the features in the SCORE case is left to the designers of the retrieval system, Figure 2 outlines an example of rewriting the "12.2.1 Q1 Multiple Word Query" [1] SCORE use case with INEX topic 14.

   Find all paragraphs and figures containing the words
   "Figure" "Corba" "Architecture".

   This query finds multiple words within an element, returning paragraphs
   and figures containing all the words first, then those with fewer,
   then those with one.

	1. Operands: "Corba" "Architecture"

	2. Functionality: word query, stemming, implementation-defined scoring

	3. Context: article//p, article//fig

	4. Return: article//p, article//fig

	5. Comments: This query on multiple words can not be written as a query 
	   with pure Boolean full-text predicate without additional functionality 
	   (e.g., and, or, proximity). This is however a common scored query. 
	   The scoring methodology in this use case is for illustrative purposes only. 
	   Scoring methodologies will be implementation-defined.

	6. Version: For consideration in v.1

Figure 2: The Multiple Word Query SCORE use case for INEX topic 14

Figure 2 shows some limitations of the current SCORE use case proposal. It is not clear how multiple occurrences of <ce> elements in the INEX topic can be rewritten with the SCORE use case. There is no clear explanation of how different query terms (operands) can be associated with different context elements. For example, it does not seem possible to specify the terms "Figure", "Corba" and "architecture" for the context article//p and the terms "Corba" and "architecture" for the context article//fig. Also, it is unclear how a relevant figure can be linked with the referring paragraph. In this sense, the W3C SCORE proposal only partially addresses the requirements of INEX Topic 14.

4. User information needs for document-centric XML retrieval

The recent W3C XQuery and XPath Full-Text proposals incorporate both ranking and relevance, two concepts widely recognized by the information retrieval community. However, the actual satisfaction of users' information needs is the most important part of judging the usefulness of a particular XML content-and-structure-based retrieval system. Therefore, we should be able to identify the basic retrieval needs that a particular XML-search query language should address. Briefly, they relate to the whole XML document retrieval process, involving the expression of the user's query (or selection of the information need), ranking the final answers according to their similarity with the query, and the actual presentation of the final answers (or projection of the information need).

Information retrieval, in the most general sense, is the process of finding the information relevant to users' needs. More specifically, the information is often contained in document collections, and users' needs are typically represented as queries against those collections. The queries are then submitted to an information retrieval system (for example, a search engine), which extracts answers from the underlying document collection that best match the users' information needs. In general, the system does not provide direct answers to the users' queries; instead, it identifies what documents are relevant to some particular query. The latter constitutes the central part of extending the document-centric aspect of the XML retrieval process, as we now discuss.

Ranked information retrieval is used to find relevant documents or document fragments in a particular flat-text or structured document collection, respectively. At the end of the process, the users are presented with an ordered list of documents or fragments that most closely match the information need expressed in the query. In this case, a query represents a list of terms that are used to give an indication of whether a document or a fragment is likely to be relevant or not. There are no strict semantics (as with Boolean retrieval), and partial matching (indicating that a document or a fragment may contain any of the query terms) is possible. The relevance between the document or the fragment and a given query is based on a similarity measure, and most information retrieval systems use some statistical estimation of similarity, based on the words that occur in both the query and the documents in the collection.
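To make the idea of a statistical similarity measure concrete, the sketch below ranks text fragments against a query using a simple TF-IDF weighting and cosine similarity. This is one common choice, picked here for illustration; it is not the measure used by any particular INEX participant or required by the W3C drafts, and the smoothed IDF formula and the function names are our assumptions. Because all weights are non-negative, the scores fall between 0 and 1.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def rank(query, fragments):
    """Return (score, fragment_id) pairs, best match first.

    fragments -- dict: fragment identifier -> text
    Partial matches score above zero; fragments sharing no term score 0.0.
    """
    docs = {fid: Counter(tokenize(text)) for fid, text in fragments.items()}
    n = len(docs)
    df = Counter()  # number of fragments containing each term
    for counts in docs.values():
        df.update(counts.keys())

    def weights(counts):
        # tf * smoothed idf; terms absent from the collection are dropped.
        return {t: tf * math.log(1 + n / df[t]) for t, tf in counts.items() if df[t]}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
            sum(w * w for w in b.values())
        )
        return dot / norm if norm else 0.0

    q = weights(Counter(tokenize(query)))
    scored = [(cosine(q, weights(counts)), fid) for fid, counts in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


fragments = {
    "p1": "the corba architecture figure",
    "p2": "corba interface",
    "p3": "database systems",
}
for score, fid in rank("corba architecture", fragments):
    print(fid, round(score, 3))
```

The fragment containing both query terms is ranked first, the fragment containing only one of them second, and the fragment with no query term receives a score of zero.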

Following the extended document-centric aspect of XML described above, we identify the three basic needs for an XML retrieval system.

First, the retrieval system should be able to allow the users to express their queries without knowing the structure of the underlying data. It should be noted that users might wish to satisfy their needs by entering a set of words identified as queries to the retrieval system without specifying the context where the particular word appears, in which case the system itself should be responsible for further refinement of the users' queries.
Second, the retrieval system should be able to rank the final results according to their similarity with the users' query.
Third, the retrieval system should be able to return document fragments, rather than full documents, whether or not the type of the fragment returned is specified in the input query. If the type of the returned fragment is not specified, the retrieval system would then decide the best fragments to return. In this case when arbitrary fragments are returned, the knowledge of the actual context of the returned fragment with respect to the structured document hierarchy can be very helpful for the overall satisfaction of the users' information needs.

5. Expressing the query

The INEX group designed a test set of topics that represent realistic users' queries to the IEEE document collection. The group did not aim at defining a query language for XML search, but a way to describe structured and unstructured retrieval needs from a large collection of XML documents. Those retrieval needs could then be (automatically) translated into native queries for a particular search engine, possibly into the XQuery/XPath search language once the full-text search requirements are implemented in it.

In this section we analyse a few issues that have arisen from our experience of INEX, and relate these to the full text requirements for XML querying.

5.1 Ranking similarity

The topics were described in XML using a mix of natural language description within a more formal <Title> tag, where both CO and CAS queries could be expressed more precisely. The decision was to use simple XPath expressions to specify the elements which were to be selected or returned.

For example, the query: "retrieve bibliographic references for works by W. Bruce Croft where Croft is not the author of the paper containing the reference" was initially defined using the syntax:

<Title>
   <te>//bb</te>
   <cw> W. Bruce Croft </cw>
   <ce>//bb/au</ce>
   <cw> not (W. Bruce Croft)</cw>
   <ce> //fm/au </ce>
</Title>

At the INEX workshop, improvements to this syntax were considered. It would have been better to write the query using XPath expressions and the predicate "contains" in the following way:

<Title>
   <te>//bb[./au, contains("W. Bruce Croft")]</te>
   <ce>//fm/au[., not contains("W. Bruce Croft")]</ce>
</Title>

However, the XPath predicate "contains" is too strict for what a real query engine will have to deal with. Indeed, the actual name of the author could be written as Bruce Croft or W.B. Croft. In order to address this issue, the first INEX workshop proposed introducing the predicate "about" to define a query (or a structured query, in a more general sense) where the constraints can be relaxed. The same would apply to the topic in Figure 1, where one looks for figures "about" the Corba architecture, and where some other terms can be introduced. The implementation of the predicate "about" is left to the choice of a particular system.
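The difference between the two predicates can be illustrated with a small Python sketch. The strict contains below is a plain substring test; the relaxed about is only one possible implementation, chosen here as an assumption: it returns the fraction of query tokens found in the element text. Since the implementation of "about" is left to each system, nothing in this sketch should be read as the INEX definition.

```python
import re


def tokens(text):
    """Lower-case alphanumeric tokens; punctuation such as the dot in 'W.' is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains(element_text, phrase):
    """Strict predicate: the exact phrase must occur in the text."""
    return phrase in element_text


def about(element_text, query):
    """Relaxed predicate (assumed form): share of query tokens present, 0.0 to 1.0."""
    query_tokens = set(tokens(query))
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(tokens(element_text))) / len(query_tokens)


query = "W. Bruce Croft"
for author in ["W. Bruce Croft", "Bruce Croft", "W.B. Croft", "J. Smith"]:
    print(author, contains(author, query), about(author, query))
```

With the strict predicate only the first spelling matches, whereas the relaxed predicate gives the variant spellings a positive score that a system could use for ranking.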

This is very much in line with the W3C requirement about the SCORE function that can be used to allow implementation dependent similarity measures. However, as outlined in Section 3, it is not clear from the requirements whether it is possible to introduce a SCORE function on two or more different contexts, such as defined by the XPaths //bb/au and //fm/au in the above example.

5.2 Links to figures

The current INEX language for defining the <Title> of a topic is not able to properly express the information need described in Figure 1. It would be very desirable, though, to use XPath for expressing a query for retrieving the figures "about the Corba architecture" for which the paragraphs containing references to those figures already exist, as follows:

//fig[about(., "Corba architecture") and
      ./@id = //p[about(., "Corba architecture")]//ref/@rid]

Although the current implementation of XPath does not allow the figures and the referring paragraphs to be retrieved together, this would be possible with XQuery. However, the full power of XQuery is not required by INEX; rather, a simple language to express references and realistic queries is sufficient, such as a language that may be inspired by WEBSQL [18].
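Outside XPath, the join between figures and referring paragraphs is easy to express procedurally. The following Python sketch, using only the standard library, pairs each figure about the query with the paragraphs about the query that reference it. The sample document is made up for the example (only the names fig, p, ref, id, rid and fgc come from the expressions and figure above), and the about function is a crude stand-in that requires every query word to occur in the element text.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; element and attribute names follow the XPath above.
DOC = """
<article>
  <sec>
    <p>The Corba architecture is shown in <ref rid="f1"/>.</p>
    <p>Results are plotted in <ref rid="f2"/>.</p>
    <fig id="f1"><fgc>Corba architecture overview</fgc></fig>
    <fig id="f2"><fgc>Response times</fgc></fig>
    <fig id="f3"><fgc>Corba architecture layers</fgc></fig>
  </sec>
</article>
"""


def about(element, query):
    """Crude stand-in for 'about': every query word occurs in the element text."""
    text = "".join(element.itertext()).lower()
    return all(word in text for word in query.lower().split())


def figures_with_referring_paragraphs(root, query):
    """Return (figure, paragraphs) pairs, interpreted strictly:
    a figure with no referring paragraph about the query is not returned."""
    results = []
    for fig in root.iter("fig"):
        fig_id = fig.get("id")
        if fig_id is None or not about(fig, query):
            continue
        paragraphs = [
            p
            for p in root.iter("p")
            if about(p, query)
            and any(ref.get("rid") == fig_id for ref in p.iter("ref"))
        ]
        if paragraphs:
            results.append((fig, paragraphs))
    return results


root = ET.fromstring(DOC.strip())
pairs = figures_with_referring_paragraphs(root, "Corba architecture")
for fig, paragraphs in pairs:
    print(fig.get("id"), "referred to by", len(paragraphs), "paragraph(s)")
```

In this sample, figure f3 is about the Corba architecture but is not returned, because no paragraph refers to it.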

For example, a figure that is referred to by a paragraph could be denoted:

fig[p -> .]

where the "->" means following a link or reference. Similarly a paragraph that contains a reference to a figure could be denoted:

p [. -> fig]

We can now specify figures about the Corba architecture that are referred to by paragraphs about the Corba architecture as:

fig[about (., "Corba Architecture") and p[about(., "Corba architecture")] ->.]

From the W3C XQuery requirements, there is no doubt that the above predicates will be interpreted in the strict sense and that figures without referring paragraphs would not be returned. Is this really what users want, or would they prefer to get those as well, maybe with a lower score?

5.3 Ignore presentation tags

The XML collection used for the INEX experiments in 2002 contains documents where a significant portion of the markup relates to presentation rather than content. For example, the tag <it> only affects the appearance of a given document phrase, for emphasis. These tags are likely to be of little interest to the users, and queries must be able to ignore them. In our current implementation we decided a priori which tags our indexer would ignore. However, it would be preferable to be able to specify this at query time; that is, the query language should support the ability to ignore certain tags in particular contexts.
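One way to picture query-time tag ignoring is a phrase search in which the ignored tags are transparent while all other tags break a phrase. The Python sketch below follows that interpretation; the boundary marker, the function names and the rule that non-ignored tags interrupt phrases are assumptions made for the illustration, not behaviour defined by INEX or the W3C drafts.

```python
import xml.etree.ElementTree as ET

BOUNDARY = " | "  # hypothetical marker that a phrase may not span


def flat_text(element, ignore):
    """Text of an element; tags in `ignore` are transparent, other tags break phrases."""
    parts = [element.text or ""]
    for child in element:
        inner = flat_text(child, ignore)
        parts.append(inner if child.tag in ignore else BOUNDARY + inner + BOUNDARY)
        parts.append(child.tail or "")
    return "".join(parts)


def contains_phrase(element, phrase, ignore=frozenset()):
    """Case-insensitive phrase test; `ignore` is supplied at query time."""
    text = " ".join(flat_text(element, ignore).lower().split())
    return phrase.lower() in text


p = ET.fromstring("<p>The <it>Corba</it> architecture relies on an <it>ORB</it>.</p>")
print(contains_phrase(p, "Corba architecture"))                 # <it> breaks the phrase
print(contains_phrase(p, "Corba architecture", ignore={"it"}))  # <it> is ignored
```

Passing the set of ignored tags as a parameter of the query, rather than fixing it in the indexer, is the point of the requirement discussed above.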

It is pleasing to observe such use cases [1] within the W3C requirements.

5.4 Definition of "roles"

Topic 14 in Figure 1 also illustrates the need for defining classes of tags. These are also referred to as structural roles [23]. The INEX DTD is quite complex and, for example, involves various types of paragraph elements: ip1 encodes the "first" paragraph of a section, while there are also other paragraphs such as p, p1, p2, or p3. It would be very tedious to repeat the same XPath with all the different types of paragraphs that can be involved. Instead, the INEX group proposed the definition of generic names for groups of tags that can be used for specifying a query.

For example:

define $par as ip1, p, p1, p2,
define $fig as fig, figw

We think that the definition of roles is an important requirement for the XQuery/XPath search language that is currently missing.
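A system that lacks native support for roles could expand them before evaluating a query. The Python sketch below rewrites an XPath-like string containing $role names into one concrete path per member tag. The ROLES table copies the tag lists from the example above; the $name syntax inside a path and the function name expand_roles are assumptions for the illustration.

```python
import re

# Tag lists copied from the role definitions in the example above.
ROLES = {
    "par": ["ip1", "p", "p1", "p2"],
    "fig": ["fig", "figw"],
}


def expand_roles(path, roles=ROLES):
    """Expand every $role in an XPath-like string into concrete paths.

    Raises KeyError if the path uses a role that has not been defined.
    """
    match = re.search(r"\$(\w+)", path)
    if match is None:
        return [path]
    expanded = []
    for tag in roles[match.group(1)]:
        rewritten = path[: match.start()] + tag + path[match.end():]
        expanded.extend(expand_roles(rewritten, roles))
    return expanded


print(expand_roles("//$fig"))
print(expand_roles("//sec/$par"))
```

The expansion grows multiplicatively when several roles appear in one path, which is one reason why direct support in the query language would be preferable to rewriting.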

5.5 Target elements

Content only (CO) queries do not specify the type of elements that must be returned as an answer to a user query. The underlying system may choose the granularity of the answers. In some cases a single paragraph will be relevant, in others a section or the full document. Which elements are actually the best answer to a user query must be assessed by humans.

Structured (CAS) topics can define the type of elements to be returned (as specified within the target elements <te>). There has been a strong debate within the INEX workgroup between the structuralists (database-oriented) and the others (information retrieval-oriented) as to whether elements other than the target elements could receive any positive assessment. For the first year, the decision was that only answers corresponding to target elements should be assessed positively. However, it is not quite clear whether other types of elements (for example, embedding elements or descendant elements) could also receive some positive score in the evaluation process. When there are no target elements in a structured query, the system can decide the best elements to rank and return, as in the CO queries.

The W3C proposal includes a requirement for returning arbitrary elements, but these still have to be specified in the query (requirement 6.2.3 in the XQuery/XPath Full-Text Requirements working draft [7]). It does not, however, support the idea that the underlying system can decide the best granularity of elements to return and rank them.

5.6 Grouping and joins

The specification of the INEX answers does not allow for grouping. For example, for the topic in Figure 1, there was no way to indicate that one result would be made of a figure and its associated referring paragraphs. Even if the specification language catered for this, the expected format of the answers would still not support it. The current INEX format for each result to a query is a pair (URI, XPath), where URI is the address of the document in the INEX collection and XPath is the absolute XPath of the element within this document.

The W3C requirements [7] implicitly suggest that the elements returned for a document are grouped together. Since the actual syntax is not defined yet, one can just guess that it would use the grouping mechanism that exists in XQuery, where result elements can be explicitly built as part of the query.
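Grouping can also be applied as a post-processing step on the flat list of (URI, XPath) pairs. The Python sketch below groups a ranked result list by document while preserving rank order; the URIs and paths are invented for the example, and the function name is hypothetical.

```python
def group_by_document(results):
    """Group ranked (uri, xpath) pairs by document.

    Documents appear in the order of their best-ranked element, and the
    elements of each document keep their relative rank order.
    """
    groups = {}
    for uri, xpath in results:
        groups.setdefault(uri, []).append(xpath)
    return list(groups.items())


# Invented result list in the (URI, XPath) format described above.
ranked_results = [
    ("doc1.xml", "/article[1]/bdy[1]/sec[2]/fig[1]"),
    ("doc2.xml", "/article[1]/bdy[1]/sec[1]/p[3]"),
    ("doc1.xml", "/article[1]/bdy[1]/sec[2]/p[4]"),
]
for uri, xpaths in group_by_document(ranked_results):
    print(uri, xpaths)
```

This only groups elements of the same document; associating a particular figure with its referring paragraphs would still require the link information discussed in Section 5.2.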

6. Related work on XML and information retrieval

The need for XML information retrieval has already been identified in the research community [9,2]. In a general sense, two information retrieval approaches that deal with retrieval from structured (XML) documents can be identified. The first, the content-based approach, addresses the ranking issue but largely ignores the structural conditions specified in the document schema. Here, the retrieval system determines some sequence of document sentences (for example, passages) that best fit the information need expressed in the query [17]. In this case, particular XML tags may be used as passage delimiters. On the other hand, the structural approach [19] is very much in line with the document-centric aspect of XML retrieval. It deals with both the content and structure of the underlying data, relies on Boolean retrieval and does not support any kind of ranking of the final results.

In an effort to combine the structural and content-based approaches described above, Fuhr and Grossjohann [14] present XIRQL, an XML query language that "extends" XQL by incorporating the notions of weighting, relevance-oriented search, vague data types and semantic relativism. Although the XIRQL query language goes some way towards bridging the gap between the two approaches to XML retrieval, it is still a very complicated language for the average user.

The rest of this section discusses some existing XML information retrieval systems, grouping them explicitly in three categories. These categories represent the three basic user needs that an XML retrieval system should address, as outlined in Section 4.

6.1 Expressing the queries

Traditional information retrieval techniques, deployed with XML collections, usually assume that the granularity of the indexes for a particular XML collection is restricted to whole documents or a few predefined elements within the documents, so the basic term statistics on which further ranking will be based should refer to a very precise scope.

Wolff et al. [23] introduce structural roles that can associate query terms with a set of elements in the XML documents. These roles can then be used to explicitly define the scope of the users' queries, by distinguishing among various existing semantic contexts in the XML document collection. In that way, the concept of roles can be seen as a contribution towards improving both expression of the users' queries and the process of delivering the final answers from the XML search engine. Section 5.4 outlines one possible case, among many others, which finds this concept very useful.

Grabs and Schek [15] follow the observation that even a single XML document may have very heterogeneous content, containing many different parts, or categories. Here, the data for the basic term statistics are generated on-the-fly at query time, depending on the category to which the user query belongs (unlike XIRQL, where the basic statistics data are generated during the indexing time, so the assignment of categories to indexing elements is static).

Egnor and Lord [11] present a commercial system for structured information retrieval using XML that incorporates the notion of user interaction. Although their approach is oriented towards the data-centric aspect of XML retrieval, one of the main ideas behind their XML retrieval system is a user interface model that uses dialogue-oriented interaction with the users of the system. The interaction process may allow further refinement of the initial queries, which in turn may result in construction of more complex structured queries.

In addition to their previous work with XIRQL, Grossjohann et al. [16] present a user interface for formulating queries in an XPath-like language. Their "Query by Example" interface allows the users to express various query conditions. Users can make a query condition by picking a favourite term in an XML document, in which case they will obtain a list of all the preferred regular path expressions for the location of that term together with some form of generalizations (by using the // operator and the * wild card). The form also allows users to specify their favourite result elements.

6.2 Ranking models

Two major theoretical paradigms of ranked information retrieval [3] are the vector space models, which use geometric interpretations of the document and query space, and probabilistic models, which evaluate the probability of a document being relevant to a query. Although these models rely on different theoretical foundations, both are ultimately based on fundamental statistics derived from a document collection, which in turn represent information about the distribution of terms within documents and within the collection as a whole.

For the ranking of the final answers to the users' queries, Wolff et al. [23] extend the probabilistic model by incorporating the notion of both the roles and the structural elements that comprise the documents in the XML collection.

A number of ranking techniques for extending the conventional vector space model to XML retrieval have also been proposed.

Schlieder and Meuss [21] present an XML retrieval technique that adopts the similarity measure of the vector space model, incorporates the document structure, and furthermore supports structured queries. Their query model is based on the tree matching formalism, representing users' queries and XML documents as "labelled trees". Users can use this formalism to formulate their queries without knowing the exact structure of the underlying data, and partial matches of the users' query structure are also supported.

Grabs and Schek [15] present a retrieval model for "flexible XML retrieval" that incorporates single-category, multi-category and nested retrieval. Their approach uses the vector space ranking model for calculating the relevance of a particular category to the query.

Carmel et al. [8] express the users' needs as different "XML fragments" that might have a structure similar to that of the elements within the XML documents in a particular collection. An important aspect of this approach is that it uses a variation of the vector space model that allows ranking not only on the text of the elements but also on the structural context of the elements. We refer to this aspect again in Section 7.

6.3 Result visualization

For the purpose of visualizing the final results of the user query, Grossjohann et al. [16] present a unified form containing the tree structure of the underlying XML schema together with different fragments that may belong to different XML documents. To achieve a meaningful granularity of the final answers, their final presentation omits every element that is neither a retrieved element nor an ancestor of a retrieved element.
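The pruning rule described above can be sketched as follows. This is a minimal illustration built on Python's standard xml.etree.ElementTree module, not the implementation of [16]; the function name `prune`, the `is_hit` predicate, and the sample document with its `id` attributes are all assumptions introduced for the example. The sketch follows the rule literally, so descendants of a retrieved element are dropped unless they are retrieved themselves.

```python
import xml.etree.ElementTree as ET


def prune(element, is_hit):
    """Copy `element`, keeping only retrieved elements and their ancestors.

    `is_hit` is a caller-supplied predicate that says whether an element
    was retrieved. Returns None when nothing in this subtree survives.
    """
    kept = [p for p in (prune(child, is_hit) for child in element) if p is not None]
    if not kept and not is_hit(element):
        return None
    copy = ET.Element(element.tag, dict(element.attrib))
    if is_hit(element):
        # Only retrieved elements keep their own text; ancestors act as context.
        copy.text = element.text
    copy.extend(kept)
    return copy


if __name__ == "__main__":
    doc = ET.fromstring(
        "<article><title>T</title>"
        "<sec><p id='1'>a</p><p id='2'>b</p></sec>"
        "<sec><p id='3'>c</p></sec></article>"
    )
    retrieved_ids = {"2"}  # hypothetical retrieval result
    pruned = prune(doc, lambda e: e.get("id") in retrieved_ids)
    print(ET.tostring(pruned, encoding="unicode"))
```

Keeping the ancestors preserves the path from the document root to each answer, which is what lets the user see where a retrieved fragment sits in the overall structure.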

Although the visualization of the final answers to the users' queries contributes greatly to the overall usefulness of a particular XML retrieval system, the key questions regarding the effectiveness of such a system still remain open. What constitutes an acceptable answer to a user query in an XML environment, and how can the retrieval system effectively deliver that answer to the user so that the initial information need is satisfied?

7. Conclusions and future work

Retrieval of XML documents does require a more complex language than the initial one proposed by INEX, but does not require the full power of XQuery. INEX is interested in defining a simple subset of XQuery/XPath that will be sufficient for expressing realistic queries over XML document collections. However, this is not to say that even this simple language can be used directly by end users, nor that the system interface would not generate much more complex queries for an underlying XML search engine. Many other kinds of semi-structured XML are used in a variety of applications ranging from data interchange to semi-structured documents. How well the types of information needs expressed in the INEX dataset apply to other document classes should also be explored.

The W3C full-text requirements and use cases seem to address most of the requirements identified through our first experiments with INEX. Regarding the INEX content-only (CO) queries, the W3C requires that XQuery/XPath Full-Text must allow a query to return arbitrary elements [7]. However, it is not possible to let the retrieval system decide which types of elements are best to rank and return as answers. Also, the INEX predicate "about" is only partially addressed through the SCORE function. The future work of the XML-search community may be to define and implement good "SCORE" functions and to define new methods for evaluating results from structured queries that take into account the actual users' needs.

Although the SCORE case extends the existing XQuery language with ranking capabilities, the language still assumes that all the structured parts of the query have to be fulfilled before ranking the final results. However, it could be very desirable for the XML retrieval system to support additional ranking on the document structure, allowing users to associate some arbitrary context with their information needs. That would mean that the final rank of a particular document fragment would include not only its ranking score for the query terms, but also a measure of how similar the fragment is to the initial context in the query.
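One way to picture such a combined rank is sketched below. This is purely illustrative and is not part of the W3C proposals or of any system cited here: the structural measure (longest common subsequence of tag names, normalized by the longer path), the linear combination, and the weight `alpha` are all hypothetical choices.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def context_similarity(query_path, element_path):
    """Hypothetical structural score in [0, 1] between two element paths."""
    q = query_path.strip("/").split("/")
    e = element_path.strip("/").split("/")
    return lcs_length(q, e) / max(len(q), len(e))


def combined_score(content_score, query_path, element_path, alpha=0.7):
    """Blend a content score with structural similarity.

    alpha is an assumed tuning weight: 1.0 ignores structure entirely,
    0.0 ranks on structure alone.
    """
    structure = context_similarity(query_path, element_path)
    return alpha * content_score + (1 - alpha) * structure


if __name__ == "__main__":
    # A fragment whose path only partly matches the query context is still
    # ranked, rather than being filtered out by a strict structural test.
    print(combined_score(0.5, "/article/sec/p", "/article/bdy/sec/p"))
```

Under a scheme of this kind the structural part of the query contributes to the ranking instead of acting as a filter that must be satisfied first.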

So, it is still an open question as to how well the proposed extensions to XQuery bridge the gap between the data-centric database community and the document-centric information retrieval community.

References

[1] S. Amer-Yahia and P. Case (editors), "XQuery and XPath Full-Text Use Cases", W3C Working Draft, February 2003.
[HREF11]

[2] R. Baeza-Yates, N. Fuhr, Y. Maarek (editors), "Second Edition of the XML and Information Retrieval", A SIGIR 2002 Workshop, Tampere, Finland, August 2002.

[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[4] A. Berglund, S. Boag, D. Chamberlin, M. Fernandez, M. Kay, J. Robie and J. Simeon (editors), "XML Path Language (XPath) 2.0", W3C Working Draft, November 2002.
[HREF8]

[5] S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie and J. Simeon (editors), "XQuery 1.0: An XML Query Language", W3C Working Draft, November 2002.
[HREF12]

[6] T. Bray, J. Paoli, C. M. Sperberg-McQueen and E. Maler (editors), "Extensible Markup Language (XML) 1.0 (Second Edition)", W3C Recommendation, October 2000.
[HREF15]

[7] S. Buxton and M. Rys (editors), "XQuery and XPath Full-Text Requirements", W3C Working Draft, May 2003.
[HREF13]

[8] D. Carmel, N. Efraty, G.M. Landau, Y. Maarek and Y. Mass, "An Extension of the Vector Space Model for Querying XML Documents via XML Fragments", In Proceedings of the 2nd XML and Information Retrieval Workshop - 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 2002.

[9] D. Carmel, Y. Maarek, A. Soffer (editors), "XML and Information Retrieval", A SIGIR 2000 Workshop, Athens, Greece, July 2000.

[10] D. Chamberlin, P. Fankhauser, M. Marchiori and J. Robie (editors), "XML Query Requirements", W3C Working Draft, February 2001.
[HREF9]

[11] N. Craswell, D. Hawking, A. Krumpholz, I. Mathieson, J. A. Thom, A.-M. Vercoustre, P. Wilkins and M. Wu, "XML document retrieval with PADRE", In Proceedings of the 7th Australasian Document Computing Symposium, pages 79-86, Sydney, Australia, December 2002.

[12] A. Deutsch, M. Fernandez, D. Florescu, A. Levy and D. Suciu, "XML-QL: A query language for XML", Submission to the W3C, NOTE-xml-ql-19980819, August 1998.
[HREF16]

[13] D. Egnor and R. Lord, "Structured Information Retrieval using XML", In Proceedings of the 1st XML and Information Retrieval Workshop - 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, July 2000.

[14] N. Fuhr and K. Grossjohann, "XIRQL: A Query Language for Information Retrieval in XML Documents", In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172-180, New Orleans, USA, September 2001.

[15] T. Grabs and H-J. Schek, "Generating Vector Spaces On-the-fly for Flexible XML Retrieval", In Proceedings of the 2nd XML and Information Retrieval Workshop - 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 2002.

[16] K. Grossjohann, N. Fuhr, D. Effing and S. Kriewel, "Query Formulation and Result Visualization for XML Retrieval", In Proceedings of the 2nd XML and Information Retrieval Workshop - 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 2002.

[17] M. Kaszkiel, J. Zobel and R. Sacks-Davis, "Efficient Passage Ranking for Document Databases", ACM Transactions on Information Systems, 17(4):406-439, 1999.

[18] A. Mendelzon, G. Mihaila and T. Milo, "Querying the World Wide Web", Journal of Digital Libraries, 1(1):68-88, 1997.

[19] G. Navarro and R. Baeza-Yates, "Proximal Nodes: a Model to Query Document Databases by Content and Structure", ACM Transactions on Information Systems, 15(4):400-435, 1997.

[20] J. Robie, J. Lapp and D. Schach, "XML Query Language (XQL)", QL'98 - The Query Languages Workshop, W3C, December 1998.
[HREF10]

[21] T. Schlieder and H. Meuss, "Querying and Ranking XML Documents", Journal of the American Society for Information Science and Technology, 53(6):489-503, March 2002.

[22] A.-M. Vercoustre, J. A. Thom, A. Krumpholz, I. Mathieson, P. Wilkins, M. Wu, N. Craswell and D. Hawking, "CSIRO INEX experiments: XML search using PADRE", In Proceedings of the INEX 2002 Workshop, Dagstuhl, Germany, December 2002.

[23] J.E. Wolff, H. Florke and A. B. Cremers, "Searching and browsing collections of structural information", In Proceedings of IEEE Advances in Digital Libraries, pages 141-150, 2000.

Hypertext References

[HREF1]: http://www.cs.rmit.edu.au/~jovanp/
[HREF2]: http://www.cs.rmit.edu.au/
[HREF3]: http://www.cs.rmit.edu.au/~jat/
[HREF4]: http://www.cmis.csiro.au/Anne-Marie.Vercoustre
[HREF5]: http://www.ted.cmis.csiro.au/
[HREF6]: http://www.w3.org
[HREF7]: http://qmir.dcs.qmw.ac.uk/inex/
[HREF8]: http://www.w3.org/TR/xpath20/
[HREF9]: http://www.w3.org/TR/xmlquery-req
[HREF10]: http://www.w3.org/TandS/QL/QL98/pp/xql.html
[HREF11]: http://www.w3.org/TR/xmlquery-full-text-use-cases
[HREF12]: http://www.w3.org/TR/xquery/
[HREF13]: http://www.w3.org/TR/xmlquery-full-text-requirements
[HREF14]: http://trec.nist.gov/
[HREF15]: http://www.w3.org/TR/REC-xml
[HREF16]: http://www.w3.org/TR/NOTE-xml-ql/

Copyright

© Copyright 2003, Jovan Pehcevski, James Thom and CSIRO Australia. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web. No Rights to Research Data is given. CSIRO and the Author/s remain free to use their own research data including tables, formulae, diagrams and the outputs of scientific instruments.