Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

Abstract: We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, which aim to understand unstructured information and make it accessible, are building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine follows a cascade architecture in which each model specialises in a single task. The models are trained with the Conditional Random Field (CRF) algorithm [12] to extract quantities (atomic values, intervals and lists), units (such as length, weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), which extracts semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised materials characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
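
To illustrate the normalisation step mentioned in the abstract, here is a minimal Python sketch of converting an already extracted value and unit to SI. It is not the Grobid-quantities implementation: the `Measurement` class, the `normalise` function and the small `SI_CONVERSIONS` table are hypothetical names introduced only for this example, and the CRF-based extraction itself is assumed to have happened upstream.

```python
from dataclasses import dataclass

# Hypothetical lookup table (an assumption for this sketch, not taken from
# Grobid-quantities): unit symbol -> (SI unit, factor, offset).
# The SI value is computed as raw_value * factor + offset.
SI_CONVERSIONS = {
    "km": ("m", 1000.0, 0.0),
    "cm": ("m", 0.01, 0.0),
    "g": ("kg", 0.001, 0.0),
    "GPa": ("Pa", 1e9, 0.0),
    "°C": ("K", 1.0, 273.15),
    "K": ("K", 1.0, 0.0),
}


@dataclass(frozen=True)
class Measurement:
    """An atomic quantity: a numeric value paired with a unit symbol."""
    value: float
    unit: str


def normalise(measurement: Measurement) -> Measurement:
    """Return the measurement expressed in the corresponding SI unit."""
    if measurement.unit not in SI_CONVERSIONS:
        raise ValueError(f"No SI conversion known for unit {measurement.unit!r}")
    si_unit, factor, offset = SI_CONVERSIONS[measurement.unit]
    return Measurement(measurement.value * factor + offset, si_unit)


def normalise_interval(low: Measurement, high: Measurement) -> tuple:
    """Normalise both bounds of an interval such as '10 to 20 cm'."""
    return normalise(low), normalise(high)


if __name__ == "__main__":
    # A critical temperature and a pressure, as might be found in a paper.
    print(normalise(Measurement(-196.0, "°C")))
    print(normalise(Measurement(1.5, "GPa")))
```

A table keyed on unit symbols keeps the sketch short. A real system would also need to handle compound units, prefixes and ambiguous symbols, which this example does not attempt.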

https://hal.inria.fr/hal-02294424
Contributor: Luca Foppiano
Submitted on: Saturday, September 28, 2019 - 9:01:21 PM
Last modification on: Friday, October 18, 2019 - 10:54:40 AM

Licence

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Luca Foppiano, Laurent Romary, Masashi Ishii, Mikiko Tanifuji. Automatic Identification and Normalisation of Physical Measurements in Scientific Literature. DocEng '19 - ACM Symposium on Document Engineering 2019, Sep 2019, Berlin, Germany. pp.1-4, ⟨10.1145/3342558.3345411⟩. ⟨hal-02294424v2⟩
