Conference paper, Year: 2019

Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

Abstract

We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, which aim to understand unstructured information and make it accessible, are building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine follows a cascade architecture, in which each model is specialised in solving a specific task. The models are trained with the Conditional Random Field (CRF) algorithm [12] to extract quantities (atomic values, intervals and lists), units (such as length, weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), for extracting semantic measurements and meaning in Earth Science.
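
As an illustration of the normalisation step mentioned in the abstract, here is a minimal Python sketch that parses a value written in numeric, alphabetic or scientific notation and converts it to an SI unit. It is not the Grobid-quantities implementation, which the abstract describes as a cascade of CRF models. The conversion table, the word list, the `Measurement` class and the function names `parse_value` and `normalise` are all hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical conversion table: unit symbol -> (SI unit, factor, offset).
# Illustrative only; it is NOT the unit data used by Grobid-quantities.
SI_CONVERSIONS = {
    "km": ("m", 1000.0, 0.0),
    "cm": ("m", 0.01, 0.0),
    "g": ("kg", 0.001, 0.0),
    "mg": ("kg", 1e-6, 0.0),
    "min": ("s", 60.0, 0.0),
    "h": ("s", 3600.0, 0.0),
    "°C": ("K", 1.0, 273.15),
}

# Hypothetical, deliberately tiny lookup for alphabetic values.
NUMBER_WORDS = {"one": 1.0, "two": 2.0, "three": 3.0, "ten": 10.0}

# Matches scientific notation written as "3 x 10^4" or "3 × 10^-2".
SCI_PATTERN = re.compile(r"([-+]?\d*\.?\d+)\s*[x×]\s*10\^([-+]?\d+)")


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str


def parse_value(raw: str) -> float:
    """Parse a value given in alphabetic, scientific or plain numeric form."""
    text = raw.strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    match = SCI_PATTERN.fullmatch(text)
    if match:
        return float(match.group(1)) * 10 ** int(match.group(2))
    # float() also accepts the "1.5e3" form of scientific notation.
    return float(text)


def normalise(raw_value: str, unit: str) -> Measurement:
    """Convert a raw value and unit to SI; unknown units pass through unchanged."""
    value = parse_value(raw_value)
    if unit not in SI_CONVERSIONS:
        return Measurement(value, unit)
    si_unit, factor, offset = SI_CONVERSIONS[unit]
    return Measurement(value * factor + offset, si_unit)
```

For example, `normalise("2.5", "km")` returns `Measurement(value=2500.0, unit='m')`. A lookup table keeps the sketch short; a real system would also need to handle compound units, intervals and lists, which this example does not attempt.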
Main file
Automatic_Identification_and_Normalisation_of_Physical_Measurements_in_Scientific_Literature (1).pdf (508.22 KB)
presentation-doceng-2019_reviewed.pdf (1.04 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02294424, version 1 (26-09-2019)
hal-02294424, version 2 (28-09-2019)

Licence

Attribution

Cite

Luca Foppiano, Laurent Romary, Masashi Ishii, Mikiko Tanifuji. Automatic Identification and Normalisation of Physical Measurements in Scientific Literature. DocEng '19 - ACM Symposium on Document Engineering 2019, Sep 2019, Berlin, Germany. pp.1-4, ⟨10.1145/3342558.3345411⟩. ⟨hal-02294424v1⟩