Query-Oriented Summarization of RDF Graphs

Šejla Čebirić 1, 2, 3 François Goasdoué 3, 4 Ioana Manolescu 1, 2, 3
3 CEDAR - Rich Data Analytics at Cloud Scale
LIX - Laboratoire d'informatique de l'École polytechnique [Palaiseau], Inria Saclay - Ile de France
4 SHAMAN - Symbolic and Human-centric view of dAta MANagement
Abstract : RDF is the data model of choice for Semantic Web applications. RDF graphs are often large and heterogeneous, thus users may have a hard time getting familiar with the structure and semantics of a graph, or determining whether a graph is useful for a certain application. We consider answering such questions by inspecting a graph summary, a compact structure conveying as much information as possible about the input graph. A summary is representative of a graph if it represents both its explicit and implicit triples, the latter resulting from RDF Schema constraints. To ensure representativeness, we define a novel RDF-specific summarization framework based on RDF node equivalence and graph quotients; our framework can be instantiated with many different RDF node equivalence relations. We show that our summaries are representative, and establish a sufficient condition on the RDF equivalence relation to ensure that a graph can be efficiently summarized, without materializing its implicit triples. We demonstrate that the state-of-the-art bisimulation equivalence relations between graph nodes fit into our framework. Further, we instantiate the framework through four novel summaries, based on the new concept of property cliques, specifically tailored to cope with highly heterogeneous RDF graphs; we show that they are orders of magnitude more compact than bisimulation summaries. Finally, we show that the bisimulation and two of our clique summaries can be built efficiently so that they represent the explicit and implicit data of the input graph without saturating the graph. The performance benefits of our efficient summarization method are confirmed through a set of experiments.
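
To make the idea of a quotient-based summary concrete, the following is a minimal Python sketch of the general construction: every node is replaced by its equivalence class, and each data triple becomes an edge between classes. The function names (`quotient_summary`, `outgoing_property_signature`) and the toy equivalence relation (two nodes are equivalent when they have the same set of outgoing properties) are assumptions chosen for illustration. They are not the property-clique or bisimulation relations of this work, and the sketch ignores RDF Schema constraints and implicit triples.

```python
def outgoing_property_signature(triples):
    """Toy equivalence (an assumption for illustration): map each node
    to the frozenset of properties on its outgoing edges."""
    signature = {}
    for s, p, o in triples:
        signature.setdefault(s, set()).add(p)
        signature.setdefault(o, set())  # objects may have no outgoing edges
    return {node: frozenset(props) for node, props in signature.items()}


def quotient_summary(triples, class_of):
    """Quotient graph: one summary node per equivalence class, and one
    summary edge per (class of subject, property, class of object)."""
    return {(class_of[s], p, class_of[o]) for s, p, o in triples}


# Hypothetical example data: two publications with the same structure.
triples = [
    ("a1", "author", "p1"),
    ("a1", "title", "t1"),
    ("a2", "author", "p2"),
    ("a2", "title", "t2"),
]

class_of = outgoing_property_signature(triples)
summary = quotient_summary(triples, class_of)
```

In this example, `a1` and `a2` fall into the same class, so the four input triples collapse into two summary triples. A coarser or finer equivalence relation would yield a more or less compact summary, which is the degree of freedom the framework exposes.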
