Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, which aim to understand unstructured information and make it accessible, are the building blocks of large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine follows a cascade architecture, where each model is specialised in the resolution of a specific task. The models are trained using the Conditional Random Field (CRF) algorithm [12] to extract quantities (atomic values, intervals and lists), units (such as length, weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), which extracts semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised material characteristics (such as critical temperature, pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].


INTRODUCTION
The data overflow in scientific publications makes rapid access to relevant information a challenging issue for both researchers and readers. One of the essential elements found in scientific literature is the physical quantity or measurement, which combines a quantified value, its unit (such as grams or micrometres) and the quantified object or substance. The automatic extraction of measurements has been studied for many years. Although the technology has evolved, several challenges remain: (1) natural language and writing styles allow many variant expressions (for example, length can be expressed as m, meter or metre); (2) different units of measurement overlap in notation (picohenry, a unit of inductance, and acidity share the same notation, pH); (3) physical quantities are scaled by their accompanying units (e.g., 1 lb = 453.6 g), meaning that combining value and unit and normalising the result are necessary for semantic recognition. The need for precise automatic generation of databases from physical measurements is common to a wide range of domains.
In this paper, we present Grobid-quantities, an open-source application for identifying, parsing and normalising measurements from scientific and patent literature. Using Conditional Random Fields (CRF) [12], it provides a machine learning framework that extracts measurements in a robust manner and then normalises them to the International System of Units (SI). This article is organised as follows. In Section 2 we introduce related work. We then describe the system in Section 3 and report its evaluation results in Section 4. Use cases and future scope are described in Section 5. Section 6 concludes the paper.

RELATED WORK
Attempts to extract measurements from text have been made using rule-based approaches (formal grammar engines, lookups in terminological databases) and ML approaches. A known commercial tool, Quantalyze, was reported by [10] to show weak recall and to support only a limited subset of units [3]. Another approach [1], using GATE (General Architecture for Text Engineering), addressed the identification of numeric properties in patents. [2] investigated issues specific to Russian-derived languages. These approaches either lack generalisation to an extensive corpus or deal mainly with specific languages. [4] described an attempt to recognise units by looking up terms in an ontology, using ML in combination with pattern matching and string metrics. Other ML-based approaches exist, although they are limited to specific domains: [11] and [8] describe the extraction of measurements from experimental results in biology and nanocrystal device development, respectively. Our work is not restricted to a specific domain or subset of measurements, and it includes a normalisation process.

SYSTEM DESCRIPTION
Grobid-quantities is a Java web application based on Grobid (GeneRation Of BIbliographic Data) [6] [13], a machine learning framework for parsing and structuring raw documents such as PDF or plain text. Grobid-quantities is designed to process large quantities of data, either via the web through a REST API or locally via the file system (batch mode). Output information is standardised as stand-off annotations, which can be stored in databases or indexed in search engines. Each annotation can be visualised on top of PDFs using the Grobid built-in positional coordinates.
The data model (Figure 1) is founded on the concept of Measurement, which links an object or a substance with one or more quantities. We defined four Measurement types: (a) atomic, for a single measurement (e.g., 10 grams); (b) interval (from 3 to 5 km) and (c) range (100 ± 4 mm), for continuous values; and (d) list, for discrete values. A Quantity links the quantitative value and the unit. Since data extracted from PDFs unavoidably presents irregular tokens caused by wrong UTF-8 encoding or missing fonts, we designed this model to allow partial results.
The Value and Unit entities allow three different representations (Figure 1): Raw holds the text as it appears in the input; Parsed unifies the value into a numerical expression and describes the unit with its properties (system, type); Normalised contains the value and unit transformed to the SI system. The Value object supports four types of representation: numeric (2, 1000), alphabetic (two, thousand), scientific notation (3 · 10^5), and time, which is also an expression of measurement. Unit objects are organised following the SI, which allows representing units as products of simpler components (e.g. m/s as m · s^-1), further decomposed into triples (prefix, base and power).
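The data model described above can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the system itself is a Java application. All class and field names below are assumptions made for illustration and do not reproduce the actual Grobid-quantities classes. Optional fields reflect the requirement that the model allows partial results.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class MeasurementType(Enum):
    ATOMIC = "atomic"      # a single measurement, e.g. 10 grams
    INTERVAL = "interval"  # from 3 to 5 km
    RANGE = "range"        # 100 ± 4 mm
    LIST = "list"          # discrete values


@dataclass
class Unit:
    raw: str                          # text as it appears in the input
    system: Optional[str] = None      # parsed property, e.g. "imperial"
    unit_type: Optional[str] = None   # parsed property, e.g. "mass"
    normalised: Optional[str] = None  # SI unit, when normalisation succeeded


@dataclass
class Value:
    raw: str                            # text as it appears in the input
    parsed: Optional[float] = None      # unified numerical expression
    normalised: Optional[float] = None  # value expressed in the SI unit


@dataclass
class Quantity:
    value: Value
    unit: Optional[Unit] = None  # may be missing in a partial result


@dataclass
class Measurement:
    kind: MeasurementType
    quantities: List[Quantity] = field(default_factory=list)
    quantified: Optional[str] = None  # the object or substance


# A complete result: "10 grams", normalised to kilograms.
ten_grams = Measurement(
    kind=MeasurementType.ATOMIC,
    quantities=[
        Quantity(
            Value(raw="10", parsed=10.0, normalised=0.01),
            Unit(raw="grams", unit_type="mass", normalised="kg"),
        )
    ],
)

# A partial result: the value was found but no unit could be attached.
partial = Measurement(
    kind=MeasurementType.ATOMIC,
    quantities=[Quantity(Value(raw="two", parsed=2.0))],
)
```

Keeping every parsed and normalised field optional lets a consumer store what was recognised even when a later step fails.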

Architecture
The system takes text or PDF as input (PDF content is extracted in a structured way using the Grobid framework) and performs three steps: (a) tokenisation, (b) measurement extraction and parsing and (c) quantity normalisation. Each step is summarised below.

Tokenisation.
This process splits input data into tokens. Grobid-quantities uses a two-phase tokenisation: (1) it first splits the text at punctuation marks, then (2) re-tokenises each resulting token to separate adjacent digits and alphabetic characters. Given the example 25mˆ2, the first phase returns the list [25m, ˆ, 2] and the second then recursively divides 25m, giving [25, m, ˆ, 2].
Table 1 describes the labels predicted by the Quantities CRF model. Notice that, to reconstruct complex structured objects from the flat sequence generated by the engine, additional labels are necessary (such as <unitLeft> and <unitRight> for units).
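The two-phase tokenisation described above can be sketched as follows. This is a minimal illustration and not the Grobid-quantities tokeniser: the function names and the set of punctuation marks are assumptions, and the second phase uses a regular expression where the text describes a recursive split.

```python
import re
from typing import List

# Phase 1 delimiters: an assumed, illustrative set of punctuation marks.
# \u02c6 is the modifier circumflex that appears in the example 25mˆ2.
_PUNCTUATION = re.compile(r"([\^\u02c6/(),;:\s])")

# Phase 2: runs of digits (with optional decimals), runs of letters,
# or any other single character.
_DIGITS_OR_LETTERS = re.compile(r"\d+(?:\.\d+)?|[^\W\d_]+|.")


def split_on_punctuation(text: str) -> List[str]:
    """Phase 1: split at punctuation, keeping each mark as a token."""
    return [t for t in _PUNCTUATION.split(text) if t and not t.isspace()]


def tokenise(text: str) -> List[str]:
    """Phase 2: separate adjacent digits and letters in each token."""
    tokens: List[str] = []
    for token in split_on_punctuation(text):
        if _PUNCTUATION.fullmatch(token):
            tokens.append(token)  # punctuation marks are kept unchanged
        else:
            tokens.extend(_DIGITS_OR_LETTERS.findall(token))
    return tokens


print(split_on_punctuation("25m\u02c62"))  # ['25m', 'ˆ', '2']
print(tokenise("25m\u02c62"))              # ['25', 'm', 'ˆ', '2']
```

Splitting digits from letters in a second pass means that a value and a unit written without a space, as in 25m, still reach the sequence labeller as separate tokens.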
Previous work (Section 2) made extensive use of databases or ontologies, and our solution follows a similar approach. We created a list of units (in English, French and German) with their characteristics, namely system (SI base, SI derived, imperial, ...) and type (volume, length, ...), and their representations: notations (m³, mˆ3), ...
DocEng '19, September 23-26, 2019, Berlin, Germany
The Units CRF model works at character level and uses the Unit Lexicon to highlight known units or prefixes. The input tokens are parsed and transformed into a product of triples (prefix, base, power), as shown in Table 2. For example, Kg/mm^2 corresponds to Kg · mm^-2 and becomes [(K, g, 1), (m, m, -2)] as a product of triples.

Label       Description           Example
<prefix>    prefix of the unit    km^2
<base>      unit base             km^2
<pow>       unit power            km^2
<other>     everything else       -

We then use the structured triples to fetch the corresponding information (system, type) from the Unit Lexicon and attach it to the resulting object. This implementation processes the unit characters in right-to-left order. Priority modifiers, such as parentheses, are ignored: they are generally infrequent in unit expressions and require more complex logic.
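The decomposition of a unit expression into (prefix, base, power) triples can be illustrated with a small rule-based sketch. It deliberately replaces the character-level CRF model with a lexicon lookup and a regular expression, and it reads the expression from left to right, so it shows the target structure and not the method used in the system. The tiny lexicon and the function names are assumptions. Like the implementation described above, it ignores parentheses.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, int]  # (prefix, base, power)

# Tiny illustrative lexicon; "K" is accepted as a kilo prefix only to
# reproduce the Kg/mm^2 example from the text.
PREFIXES = {"K", "k", "m", "c", "M", "G", "n", "p"}
BASES = {"g", "m", "s", "K", "Pa", "H"}

# A factor is a symbol followed by an optional exponent, e.g. mm^2 or s^-1.
_FACTOR = re.compile(r"([A-Za-z]+)\s*[\^\u02c6]?\s*(-?\d+)?")


def split_symbol(symbol: str) -> Tuple[str, str]:
    """Split a unit symbol into (prefix, base) using the lexicon."""
    if symbol in BASES:
        return "", symbol
    if symbol[0] in PREFIXES and symbol[1:] in BASES:
        return symbol[0], symbol[1:]
    raise ValueError(f"unknown unit symbol: {symbol!r}")


def parse_unit(expression: str) -> List[Triple]:
    """Turn a unit expression into a product of triples."""
    triples: List[Triple] = []
    for index, part in enumerate(expression.split("/")):
        sign = 1 if index == 0 else -1  # factors after "/" are inverted
        for factor in re.split("[\u00b7*]", part):
            match = _FACTOR.fullmatch(factor.strip())
            if match is None:
                raise ValueError(f"cannot parse factor: {factor!r}")
            prefix, base = split_symbol(match.group(1))
            power = int(match.group(2) or 1)
            triples.append((prefix, base, sign * power))
    return triples


print(parse_unit("Kg/mm^2"))  # [('K', 'g', 1), ('m', 'm', -2)]
```

Checking the whole symbol against the base units first resolves the ambiguity between m as a base (metre) and m as a prefix (milli) in this sketch; a statistical model can use context instead.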
In parallel, the Values CRF model unifies the format of identified values into numerical formats. It supports four types: numeric, alphabetic, scientific notation, and time expression (see Table 3). Different techniques are applied for each type: alphabetic expressions are looked up in a word-to-number gazetteer, while scientific notations are parsed and calculated mathematically. Time expressions are further segmented using the Grobid built-in Date CRF model.
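The per-type handling of values can be sketched as follows. The sketch assumes the value type is detected with simple rules, whereas the system uses a CRF model, and it leaves out time expressions. The gazetteer content, the accepted notation and the function name are illustrative assumptions.

```python
import re
from typing import Optional

# Assumed miniature word-to-number gazetteer.
WORD_TO_NUMBER = {
    "one": 1, "two": 2, "three": 3, "ten": 10,
    "hundred": 100, "thousand": 1000,
}

# Scientific notation written as: mantissa, multiplication sign, base,
# caret, exponent. For example 3 · 10^5.
_SCIENTIFIC = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*[\u00b7x\u00d7*]\s*(\d+)\s*\^\s*(-?\d+)"
)


def parse_value(raw: str) -> Optional[float]:
    """Return the numeric form of a raw value, or None if unparsable."""
    text = raw.strip().lower()
    # Alphabetic: look the expression up in the gazetteer.
    if text in WORD_TO_NUMBER:
        return float(WORD_TO_NUMBER[text])
    # Scientific notation: parse the parts and compute the result.
    match = _SCIENTIFIC.fullmatch(text)
    if match:
        mantissa = float(match.group(1))
        base, exponent = int(match.group(2)), int(match.group(3))
        return mantissa * base ** exponent
    # Numeric: plain conversion.
    try:
        return float(text)
    except ValueError:
        return None  # keep the raw form only (partial result)


print(parse_value("thousand"))         # 1000.0
print(parse_value("3 \u00b7 10^5"))    # 300000.0
```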

Normalisation.
The extracted measurements are transformed to the base SI unit (grams to kg, Celsius to Kelvin, etc.). We used an external Java library called Units of Measurement [5], which provides a set of standard interfaces and implementations for safely handling units and quantities. Manipulating measurements through transformations often leads to mistakes caused by incorrect rounding and approximation, which such a library helps to avoid. At the time of writing, the final revised version of this library has been accepted through the Java standardisation process as JSR-385.
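The transformation to SI units can be illustrated with a short sketch. It does not use the Units of Measurement library mentioned above: it assumes a small hand-written conversion table in which each unit maps to its SI unit through a multiplicative factor and an additive offset. Decimal arithmetic is used to limit the rounding problems mentioned in the text. The table content and the function name are illustrative.

```python
from decimal import Decimal
from typing import Dict, Tuple

# unit -> (SI unit, multiplicative factor, additive offset)
TO_SI: Dict[str, Tuple[str, Decimal, Decimal]] = {
    "g": ("kg", Decimal("0.001"), Decimal("0")),
    "mm": ("m", Decimal("0.001"), Decimal("0")),
    "km": ("m", Decimal("1000"), Decimal("0")),
    "°C": ("K", Decimal("1"), Decimal("273.15")),
    "mK": ("K", Decimal("0.001"), Decimal("0")),
    "MPa": ("Pa", Decimal("1e6"), Decimal("0")),
    "GPa": ("Pa", Decimal("1e9"), Decimal("0")),
}


def normalise(value: str, unit: str) -> Tuple[Decimal, str]:
    """Convert a parsed value and unit to the corresponding SI unit."""
    si_unit, factor, offset = TO_SI[unit]
    return Decimal(value) * factor + offset, si_unit


print(normalise("25", "°C"))  # (Decimal('298.15'), 'K')
```

A dedicated library remains preferable in practice, since it also covers compound units and checks dimensional consistency, which this table does not.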

EVALUATION AND RESULTS
We trained and evaluated our system's models using a dataset based on 32 scientific publications (English, Open Access (OA)) and three patents (with translations in English, French and German), randomly selected from different domains such as medicine, robotics, astronomy, and physiology. The training data was generated automatically and then corrected and cross-checked by three annotators. We used 10-fold cross-validation to evaluate each CRF model independently and to produce precision, recall and f1 scores, as summarised in Table 4.
The Quantities CRF model reported an f1 macro average of 80.14%, with precision and recall of 84.93% and 76.86%, respectively. The paragraph accuracy was 68.97%, indicating that more than two thirds of the evaluated paragraphs were correctly labelled. These scores are promising, considering the complexity of the task and the rather small size of the training corpus. In particular, <list> and <unitRight> require more examples.
The Units CRF model f1 macro average was 98.86%, with precision and recall reaching 98.75% and 98.97%, respectively. Compared with our other models, its performance was extremely high (more than 10 percentage points higher in f1 score). This difference can be attributed to (a) the data distribution and (b) the lower variability of unit expressions. We analysed the training data and noticed that the distribution is biased toward simple units (composed of a single triple). Intuitively, this makes sense, because simple units are statistically more frequent; on the other hand, it highlights the need for more complex examples in our dataset. Secondly, unit expressions appear, by nature, with lower variability, leading to more duplicates than in the other models' training datasets. For example, the expressions 1% and 2% have two different values (1, 2) and the same unit (%), which would therefore appear twice. Since we cannot alter the statistical distribution of the dataset, a separate, independent evaluation corpus would give a more precise measure of the model's generalisation capabilities.
The Values CRF model scored an f1 macro average of 85.64%, with precision and recall at 81.82% and 93.29%, respectively. We noticed that <base>, <pow> and <time> have lower f1 scores. While <base> and <pow> require more training data, <time> expressions may overlap with <number>, suggesting that more contextual information should be introduced.
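For reference, macro-averaged scores can be computed as in the sketch below. It assumes that the macro average is the unweighted mean of the per-label precision, recall and f1 scores; the text does not state which variant was used. The label counts are invented for the example and are not taken from the evaluation.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)
Scores = Tuple[float, float, float]  # (precision, recall, f1)


def prf(tp: int, fp: int, fn: int) -> Scores:
    """Precision, recall and f1 for a single label."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_label: Dict[str, Counts]) -> Scores:
    """Unweighted mean of the per-label scores."""
    scores = [prf(*counts) for counts in per_label.values()]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


# Invented counts, for illustration only.
example = {"<prefix>": (8, 2, 2), "<base>": (9, 1, 3)}
precision, recall, f1 = macro_average(example)
```

Under this definition the macro f1 is an average of per-label f1 scores, so it need not equal the harmonic mean of the macro precision and macro recall.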

APPLICATIONS
Normalised data extraction has recently become a strong requirement in materials research. The inverse problem, in which high-performance materials are predicted from their properties, is expected to be solved with well-organised big data. At the National Institute for Materials Science (NIMS), a project to discover new superconducting materials from scientific literature is in progress. The system being developed relies on Grobid-quantities to extract and normalise superconducting properties, such as critical temperature (Tc), expressed in mK and K, and critical pressure, expressed in Pa, MPa and GPa [9].
Grobid-quantities was showcased in a Text and Data Mining (TDM) platform (within the scope of the French nationwide ISTEX [7] project), where it provided measurement annotations used to prototype a quantities-based semantic search (the demo can be accessed at https://traces1.inria.fr/istex_sample/).
Finally, it was also used in Marve [10], a system for extracting semantic measurements and meaning in Earth Science.

CONCLUSION
In this paper, we presented Grobid-quantities, a system for extracting and normalising measurements from scientific and patent literature. The project, the training data and the documentation are accessible on GitHub.
Results are promising, and the integration in real production platforms shows that the application has reached a certain level of maturity. Our dataset, although it requires more training examples, is released as open access and can be improved by external contributors. Moreover, as previously discussed, the introduction of an end-to-end evaluation corpus could provide more objective evaluation results.
In the future, we plan to introduce recurrent neural networks (RNN) and embeddings for sequence labelling. In particular, contextualised embeddings trained with values and units could improve the model's generalisation. Finally, we plan to integrate domain information and additional layout features (such as superscripts and subscripts) to improve unit discrimination.