Scalable Iterative Graph Duplicate Detection

Abstract: Duplicate detection identifies different representations of the same real-world object in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality (effectiveness). Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (ddg) to large amounts of data and pairwise comparisons, using the support of a relational database management system. To this end, we first present a framework that generalizes the ddg process. We then present algorithms to scale ddg in space (amount of data processed with bounded main memory) and in time. Finally, we extend our framework to allow batched and parallel ddg, thus further improving efficiency. Experiments on data up to two orders of magnitude larger than the data considered so far in ddg show that our methods achieve the goal of scaling ddg to large volumes of data.
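
The general idea of iterative duplicate detection over relationships can be illustrated with a small in-memory sketch. This is not the algorithm or framework from the article, and it uses no database: the function names, the token-based similarity, the equal weighting, and the 0.7 threshold are all assumptions chosen for illustration. It shows only the iterative step: when two objects are classified as duplicates, pairs of objects that reference them are compared again.

```python
from collections import deque


def token_sim(a, b):
    """Jaccard similarity of lowercase word sets (a stand-in text measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def detect_duplicates(objects, related, threshold=0.7, weight=0.5):
    """objects: id -> (kind, name); related: id -> ids that object references.

    Hypothetical helper for illustration only; returns sorted duplicate clusters.
    """
    ids = sorted(objects)
    parent = {i: i for i in ids}  # union-find forest over object ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Reverse edges: dependents[t] holds the objects whose score depends on t.
    dependents = {i: set() for i in ids}
    for src, targets in related.items():
        for t in targets:
            dependents[t].add(src)

    def score(a, b):
        text = token_sim(objects[a][1], objects[b][1])
        # Compare referenced objects by their current cluster representative.
        ra = {find(r) for r in related.get(a, ())}
        rb = {find(r) for r in related.get(b, ())}
        if not ra and not rb:
            return text  # no relationships: fall back to text alone
        return (1 - weight) * text + weight * len(ra & rb) / len(ra | rb)

    def same_kind(a, b):
        return objects[a][0] == objects[b][0]

    # Start with every same-kind pair; no blocking or pruning in this sketch.
    queue = deque((a, b) for i, a in enumerate(ids)
                  for b in ids[i + 1:] if same_kind(a, b))
    while queue:
        a, b = queue.popleft()
        if find(a) == find(b) or score(a, b) < threshold:
            continue
        parent[find(a)] = find(b)  # classify the pair as duplicates
        root = find(b)
        members = [i for i in ids if find(i) == root]
        # Re-compare pairs of objects that reference the merged cluster.
        affected = sorted(set().union(*(dependents[m] for m in members)))
        for i, x in enumerate(affected):
            for y in affected[i + 1:]:
                if same_kind(x, y) and find(x) != find(y):
                    queue.append((x, y))

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


objects = {
    "m1": ("movie", "The Matrix"),
    "m2": ("movie", "Matrix"),
    "m3": ("movie", "Speed"),
    "p1": ("person", "Keanu Reeves"),
    "p2": ("person", "keanu reeves"),
}
related = {"m1": ["p1"], "m2": ["p2"], "m3": ["p1"]}
clusters = detect_duplicates(objects, related)
print(clusters)  # [['m1', 'm2'], ['m3'], ['p1', 'p2']]
```

In this toy data, the two movie titles alone score below the threshold. They are merged only after the two person records are found to be duplicates and the movie pair is compared a second time. Calling `detect_duplicates(objects, {})` without relationships leaves `m1` and `m2` in separate clusters.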
Document type:
Journal article
IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, 2012, 24 (11), pp.2094-2108

Cited literature [29 references]

https://hal.inria.fr/hal-00757604
Contributor: Melanie Herschel
Submitted on: Tuesday, November 27, 2012 - 11:53:44
Last modified on: Thursday, January 11, 2018 - 06:24:28
Document(s) archived on: Thursday, February 28, 2013 - 03:43:30

File

TKDE2012a_herschel.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-00757604, version 1

Citation

Melanie Herschel, Felix Naumann, Sascha Szott, Maik Taubert. Scalable Iterative Graph Duplicate Detection. IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, 2012, 24 (11), pp.2094-2108. 〈hal-00757604〉

Metrics

Record views

216

File downloads

556