Other publications

Graph integration of structured, semistructured and unstructured data for data journalism

Abstract : Nowadays, journalism is facilitated by the existence of large amounts of digital data sources, including many Open Data ones. Such data sources are extremely heterogeneous, ranging from highly structured (relational databases), semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows. These are difficult to set up not only for arbitrary heterogeneous inputs, but also given that users may want to add (or remove) datasets to (from) the corpus. We describe a complete approach for integrating dynamic sets of heterogeneous data sources along the lines described above: the challenges we faced to make such graphs useful, allow their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.

Cited literature [54 references]
Contributor : Ioana Manolescu
Submitted on : Thursday, October 29, 2020 - 11:00:47 PM
Last modification on : Thursday, October 27, 2022 - 1:45:02 PM




  • HAL Id : hal-02904797, version 2
  • ARXIV : 2007.12488


Oana Balalau, Catarina Conceição, Helena Galhardas, Ioana Manolescu, Tayeb Merabti, et al. Graph integration of structured, semistructured and unstructured data for data journalism. 36ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (informal publication only), 2020. ⟨hal-02904797v2⟩


