
The Colorado Richly Annotated Full Text (CRAFT) Corpus: Multi-Model Annotation In The Biomedical Domain

Abstract : A major question in linguistics is whether theoretical accounts of general language hold for specific domains. Similarly, in natural language processing, it is clear that general-domain solutions often fail when applied to specialized domains. One such specialized domain, which is increasingly seen as crucial to understanding human biology and disease, is the biomedical domain. For this reason, biomedical corpus construction has been an area of considerable activity in recent years; for example, just in the past five years: (ordered by year of publication and then by first author's last name). Historically, the great majority of work in biomedical natural language processing has been done using abstracts of journal articles. In contrast, the Colorado Richly Annotated Full Text (CRAFT) corpus consists entirely of full-text journal articles. The primary motivation for the annotation project was the accumulating body of evidence indicating that the bodies of journal articles contain much information that is not present in the abstracts, and that the textual and structural characteristics of article bodies differ from those of abstracts [8, 26, 90, 84, 18, 2, 48, 51, 13]. When we began the project, there was no large resource of full-text journal articles for system building or evaluation; other than the CRAFT corpus, this continues to be the case. Earlier projects on full-text biomedical journal articles typically lack manual annotation, and none that we are aware of annotates linguistic structure.
Document type :
Book sections

Cited literature [75 references]
Contributor : Karën Fort
Submitted on : Tuesday, June 2, 2015 - 3:11:39 PM
Last modification on : Wednesday, December 9, 2020 - 3:08:58 PM
Long-term archiving on : Tuesday, April 25, 2017 - 1:07:13 AM


Files produced by the author(s)


  • HAL Id : hal-01159065, version 1


Kevin Bretonnel Cohen, Karin Verspoor, Karën Fort, Christopher Funk, Michael Bada, et al. The Colorado Richly Annotated Full Text (CRAFT) Corpus: Multi-Model Annotation In The Biomedical Domain. Handbook of Linguistic Annotation, 2016. ⟨hal-01159065⟩


