GROBID - Information Extraction from Scientific Publications

Abstract : Scientific papers offer a wealth of information that makes it possible to put the corresponding work in context and to offer a wide range of services to researchers. GROBID is a high-performing software environment for extracting such information, including metadata, bibliographic references and entities, from scientific texts. Most modern digital library techniques rely on the availability of high-quality textual documents. In practice, however, the majority of full-text collections are in raw PDF or in incomplete and inconsistent semi-structured XML. To address this fundamental issue, development of the Java library GROBID started in 2008 [1]. The tool uses Conditional Random Fields (CRF), a machine-learning technique, to extract and restructure content automatically from raw and heterogeneous sources into uniform, standard TEI (Text Encoding Initiative) documents.
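
To illustrate why uniform TEI output is convenient for downstream services, here is a minimal sketch in Python of reading a title and cited-work titles from a TEI document. The sample document is a hand-written, heavily simplified fragment and is not actual GROBID output; the function name `read_tei` and the sample content are assumptions made for illustration. Only the TEI namespace and element names (`teiHeader`, `titleStmt`, `listBibl`, `biblStruct`, `analytic`) come from the TEI standard itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified TEI fragment (an assumption for illustration,
# not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Paper</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <back>
      <listBibl>
        <biblStruct>
          <analytic><title>First cited work</title></analytic>
        </biblStruct>
        <biblStruct>
          <analytic><title>Second cited work</title></analytic>
        </biblStruct>
      </listBibl>
    </back>
  </text>
</TEI>"""


def read_tei(tei_xml):
    """Return the document title and the titles of cited works (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    # Document title from the TEI header.
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
        default="",
        namespaces=TEI_NS,
    )
    # Titles of the bibliographic references listed in the back matter.
    references = [
        el.text or ""
        for el in root.iterfind(
            ".//tei:listBibl/tei:biblStruct/tei:analytic/tei:title", TEI_NS
        )
    ]
    return {"title": title, "references": references}


if __name__ == "__main__":
    print(read_tei(SAMPLE_TEI))
```

Because every document shares the same element structure, the same few path expressions work across a whole collection, which is the practical benefit of converting heterogeneous sources to one standard format.
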

https://hal.inria.fr/hal-01673305
Contributor : Laurent Romary
Submitted on : Friday, December 29, 2017 - 10:00:22 AM
Last modification on : Thursday, April 4, 2019 - 1:31:11 AM

Files

03-romary-final.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Identifiers

  • HAL Id : hal-01673305, version 1

Citation

Laurent Romary, Patrice Lopez. GROBID - Information Extraction from Scientific Publications. ERCIM News, ERCIM, 2015, Scientific Data Sharing and Re-use, 100. ⟨hal-01673305⟩
