Towards a common framework for linguistic annotation

Nancy Ide 1 Laurent Romary 1
1 LANGUE ET DIALOGUE - Human-machine dialogue with a significant language component
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: Corpora are now being annotated for a variety of linguistic features, including not only phenomena such as morpho-syntactic category and syntactic structure, but also discourse structure, co-reference, etc. Typically, annotation schemes, even those representing the same phenomenon, are developed independently at different annotation sites; as a result, merging and comparing annotated data is difficult or impossible. Similarly, annotations for different phenomena are difficult to combine in a way that allows relationships and patterns among different linguistic levels to be studied. We have been working to develop a framework for linguistic annotation intended to resolve most, if not all, of these difficulties. The framework consists of two fundamental pieces: (1) a generalized, abstract model that captures the underlying structure of linguistic annotations; and (2) a means to identify and formally define common (core) annotation categories, together with mechanisms to map equivalences, refine or modify existing categories, and specify hierarchical relations among categories at different levels of specificity. The aim is to provide an infrastructure for linguistic annotation that enables the commonality needed for reuse and merging, and is at the same time flexible enough to allow for user-specific annotation practices. To this end, we have outlined an annotation framework instantiated via the Extensible Markup Language (XML) and the Resource Description Framework (RDF) and demonstrated its applicability to the representation of lexical information (Ide et al., 2000) and syntactic annotation (Ide and Romary, 2001). In this paper, we outline the principles and mechanisms that support the proposed framework and demonstrate its use and flexibility.
Because it is based on existing and emerging data representation standards and informed by state-of-the-art methods from areas such as database theory, object-oriented design, and knowledge representation, we believe that the framework supports the most advanced and efficient means to exploit annotated corpora. The framework we are proposing would serve as a central repository and service for annotators, providing off-the-shelf formats and categories together with scripts and tools for using and modifying them, creating new categories, and mapping among them. A core feature of the framework is an RDF-based data category registry, whose implementation depends critically on input from the research community. Our goal is therefore not only to present our results so far, but also to solicit input and feedback from corpus annotators and users that can contribute to further development.
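
To make the idea of a data category registry more concrete, the following is a minimal sketch in Python, not the RDF-based implementation described above. It only illustrates the three mechanisms the abstract names: core categories, hierarchical relations among categories of different specificity, and equivalence mappings from scheme-specific tags. The class names, method names, scheme names ("schemeA", "schemeB"), and tags are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DataCategory:
    """A core annotation category; `parent` names a more general category."""
    name: str
    parent: Optional[str] = None


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for a data category registry."""
    categories: Dict[str, DataCategory] = field(default_factory=dict)
    # Maps (scheme name, scheme-specific tag) to a core category name.
    equivalences: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add(self, name: str, parent: Optional[str] = None) -> None:
        """Register a core category, optionally as a refinement of `parent`."""
        if parent is not None and parent not in self.categories:
            raise KeyError(f"unknown parent category: {parent}")
        self.categories[name] = DataCategory(name, parent)

    def map_tag(self, scheme: str, tag: str, core: str) -> None:
        """Declare that `tag` in `scheme` is equivalent to core category `core`."""
        if core not in self.categories:
            raise KeyError(f"unknown core category: {core}")
        self.equivalences[(scheme, tag)] = core

    def resolve(self, scheme: str, tag: str) -> str:
        """Return the core category a scheme-specific tag maps to."""
        return self.equivalences[(scheme, tag)]

    def is_a(self, name: str, ancestor: str) -> bool:
        """True if `name` equals `ancestor` or is a refinement of it."""
        current: Optional[str] = name
        while current is not None:
            if current == ancestor:
                return True
            current = self.categories[current].parent
        return False

    def comparable(self, a: Tuple[str, str], b: Tuple[str, str]) -> bool:
        """True if two scheme tags resolve to categories on the same
        hierarchy path, i.e. one is equal to or a refinement of the other."""
        ca, cb = self.resolve(*a), self.resolve(*b)
        return self.is_a(ca, cb) or self.is_a(cb, ca)


# Illustrative data only: two hypothetical schemes with different granularity.
registry = Registry()
registry.add("noun")
registry.add("commonNoun", parent="noun")
registry.add("verb")
registry.map_tag("schemeA", "NN", "commonNoun")
registry.map_tag("schemeB", "N", "noun")
registry.map_tag("schemeB", "V", "verb")
```

Under these assumptions, annotations from the two schemes can be compared at the level of the shared core category: `registry.comparable(("schemeA", "NN"), ("schemeB", "N"))` holds because "commonNoun" is registered as a refinement of "noun", whereas the "NN" and "V" tags resolve to unrelated categories.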
