Automatic Identification and Normalisation of Physical Measurements in Scientific Literature
Conference paper, 2019

Luca Foppiano, Laurent Romary, Masashi Ishii, Mikiko Tanifuji

Abstract

We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, which aim to make unstructured information understandable and accessible, are the building blocks of large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine follows a cascade architecture, in which each model is specialised in solving one specific task. The models are trained with the Conditional Random Field (CRF) algorithm [12] to extract quantities (atomic values, intervals and lists), units (such as length or weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), which extracts semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science (NIMS) in Japan, it is used in an ongoing project to discover new superconducting materials. Normalised material characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
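
The normalisation step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the Grobid-quantities implementation, which relies on trained CRF models and a much richer unit lexicon. The conversion table and the names `SI_CONVERSIONS`, `Quantity`, `parse_value`, `normalise` and `normalise_interval` are all hypothetical, and the sketch assumes that the value and the unit have already been identified in the text. Alphabetic values (such as "twenty") are not handled.

```python
import re
from dataclasses import dataclass

# Hypothetical conversion table: unit symbol -> (SI unit, factor, offset).
# A real system would use a far larger unit lexicon.
SI_CONVERSIONS = {
    "mm": ("m", 1e-3, 0.0),
    "cm": ("m", 1e-2, 0.0),
    "km": ("m", 1e3, 0.0),
    "g": ("kg", 1e-3, 0.0),
    "°C": ("K", 1.0, 273.15),
    "K": ("K", 1.0, 0.0),
    "GPa": ("Pa", 1e9, 0.0),
}

# Matches scientific notation written as "1.2 x 10^3".
SCI_PATTERN = re.compile(r"([-+]?\d+(?:\.\d+)?)\s*x\s*10\^([-+]?\d+)")


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str


def parse_value(raw: str) -> float:
    """Parse a numeric ("12.5", "1.2e3") or scientific ("1.2 x 10^3") value."""
    text = raw.strip().replace("×", "x")
    match = SCI_PATTERN.fullmatch(text)
    if match:
        return float(match.group(1)) * 10 ** int(match.group(2))
    return float(text)


def normalise(raw_value: str, unit: str) -> Quantity:
    """Convert an atomic value with a known unit to its SI equivalent."""
    si_unit, factor, offset = SI_CONVERSIONS[unit]
    return Quantity(parse_value(raw_value) * factor + offset, si_unit)


def normalise_interval(low: str, high: str, unit: str) -> tuple[Quantity, Quantity]:
    """Normalise both bounds of an interval such as "10 to 20 cm"."""
    return normalise(low, unit), normalise(high, unit)
```

For example, `normalise("39", "°C")` returns a `Quantity` expressed in kelvin, and `normalise_interval("10", "20", "cm")` returns both bounds in metres. Keeping the offset separate from the factor lets the same table cover temperature scales as well as purely multiplicative units.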
Main file
Automatic_Identification_and_Normalisation_of_Physical_Measurements_in_Scientific_Literature(1).pdf (508.31 KB)
presentation-doceng-2019_reviewed.pdf (1.04 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02294424, version 1 (26-09-2019)
hal-02294424, version 2 (28-09-2019)

Licence

Attribution - CC BY 4.0

Identifiers

HAL Id: hal-02294424
DOI: 10.1145/3342558.3345411

Cite

Luca Foppiano, Laurent Romary, Masashi Ishii, Mikiko Tanifuji. Automatic Identification and Normalisation of Physical Measurements in Scientific Literature. DocEng '19 - ACM Symposium on Document Engineering 2019, Sep 2019, Berlin, Germany. pp.1-4, ⟨10.1145/3342558.3345411⟩. ⟨hal-02294424v2⟩