Skip to Main content Skip to Navigation
New interface
Conference papers

Extracting Linked Data from statistic spreadsheets

Abstract : Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistic databases such as those compiled by state agencies are often used as trusted data sources, as they contain valuable, high-quality information. However, their usability is limited when they are shared in a format such as HTML or spreadsheets: this makes it hard to find the most relevant dataset for checking a specific claim, or to quickly extract from a dataset the best answer to a given query. In this work, we provide a conceptual model for the open data comprised in statistics published by INSEE, the national French economic and societal statistics institute. Then, we describe a novel method for extracting RDF Linked Open Data, to populate an instance of this model. We used our method to produce RDF data out of 20k+ Excel spreadsheets, and our validation indicates a 91% rate of successful extraction. Further, we also present a novel algorithm enabling the exploitation of such statistic tables, by (i) identifying the statistic datasets most relevant for a given fact-checking query, and (ii) extracting from each dataset the best specific (precise) query answer it may contain. We have implemented our approach and experimented on the complete corpus of statistics obtained from INSEE, the French national statistic institute. Our experiments and comparisons demonstrate the effectiveness of our proposed method.
Document type :
Conference papers
Complete list of metadata

Cited literature [31 references]  Display  Hide  Download
Contributor : Tien-Duc Cao Connect in order to contact the contributor
Submitted on : Wednesday, November 7, 2018 - 1:17:23 PM
Last modification on : Wednesday, April 6, 2022 - 3:48:49 PM
Long-term archiving on: : Friday, February 8, 2019 - 2:17:40 PM


Files produced by the author(s)


  • HAL Id : hal-01915148, version 1


Tien-Duc Cao, Ioana Manolescu, Xavier Tannier. Extracting Linked Data from statistic spreadsheets. Conférence sur la Gestion de Données – Principes, Technologies et Applications, Oct 2018, Bucarest, Romania. ⟨hal-01915148⟩



Record views


Files downloads