
Keyword Search in Heterogeneous Data Sources

Abstract : Data journalism is the field of investigative journalism that relies first and foremost on digital data. As more and more human activity leaves digital traces, data journalism is growing in importance. Major journalism projects increasingly involve diverse data sources with heterogeneous data models, different structures, or no structure at all; Offshore Leaks is a prime example. Inspired by our collaboration with Le Monde, a leading French newspaper, we designed a novel content management architecture, together with an algorithm for exploiting such heterogeneous corpora through keyword search: given a set of search terms, find links between them within and across the different datasets, which we interconnect in a graph. Our work is related to keyword search in structured and unstructured data, but data heterogeneity makes the problem computationally harder. We analyze the performance of our algorithm on real-life datasets.
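To make the search task in the abstract concrete, here is a minimal sketch of finding a link between two search terms in a graph that interconnects items from several sources. It is not the algorithm of the paper: it assumes a tiny hand-made graph, matches keywords by substring on node labels, and uses a plain breadth-first search. All node names, the `EDGES` list, and the functions `build_graph`, `matches`, and `find_link` are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy data: each node label is prefixed with the kind of source
# it might come from (CSV, JSON, text, RDF). None of this is from the paper.
EDGES = [
    ("csv:company/Acme Holdings", "json:officer/Alice Martin"),
    ("json:officer/Alice Martin", "text:article-12"),
    ("text:article-12", "rdf:politician/Bob Durand"),
    ("csv:company/Acme Holdings", "csv:address/Panama City"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def matches(graph, keyword):
    """Return the nodes whose label contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return {node for node in graph if kw in node.lower()}


def find_link(graph, kw_a, kw_b):
    """Return a shortest path from any node matching kw_a to any node
    matching kw_b, or None if the two keywords are not connected."""
    sources = matches(graph, kw_a)
    targets = matches(graph, kw_b)
    parent = {s: None for s in sources}   # also serves as the visited set
    queue = deque(sorted(sources))        # multi-source breadth-first search
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:       # walk back to a source node
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


GRAPH = build_graph(EDGES)
link = find_link(GRAPH, "acme", "durand")
```

In this toy graph, the link between "acme" and "durand" crosses four sources: a CSV row, a JSON record, a text document, and an RDF resource. Connecting more than two search terms at once requires a tree rather than a path, which is a harder problem than this pairwise sketch addresses.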
Document type :
Preprints, Working Papers, ...

Cited literature [19 references]
Contributor : Ioana Manolescu
Submitted on : Thursday, April 30, 2020 - 4:35:09 PM
Last modification on : Friday, August 5, 2022 - 12:39:58 PM




  • HAL Id : hal-02559688, version 1


Felipe Cordeiro, Helena Galhardas, Julien Leblay, Ioana Manolescu, Tayeb Merabti. Keyword Search in Heterogeneous Data Sources. 2020. ⟨hal-02559688⟩


