Efficient and Effective Duplicate Detection in Hierarchical Data

Luís Leitão 1 Pável Calado 1 Melanie Herschel 2, 3, *
* Corresponding author
2 OAK - Database optimizations and architectures for complex large data
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
3 BD
LRI - Laboratoire de Recherche en Informatique
Abstract: Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, such as XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, we present a novel pruning strategy that achieves significant gains over the unoptimized version of the algorithm. Through experiments, we show that our algorithm achieves high precision and recall scores on several datasets. XMLDup also outperforms another state-of-the-art duplicate detection solution, in terms of both efficiency and effectiveness.
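The abstract gives no formulas, so the following is only a toy sketch of the general idea it describes: a duplicate score that combines the values inside two XML elements with the way those values are nested, plus an early-exit rule that skips work once a pair can no longer reach a threshold. It is not the XMLDup Bayesian network. The function names `text_sim` and `dup_score`, the product-of-factors combination, the 0.5 penalty for one-sided structure, and the sample movie records are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_sim(a, b):
    """String similarity in [0, 1] for two text values (None treated as empty)."""
    a, b = (a or "").strip().lower(), (b or "").strip().lower()
    if not a and not b:
        return 1.0  # nothing to compare, so no evidence against a match
    return SequenceMatcher(None, a, b).ratio()


def dup_score(e1, e2, threshold=0.0):
    """Toy duplicate score for two elements with the same tag (hypothetical).

    The score is a product of factors: one for the elements' own text and
    one per child tag. Factors not yet evaluated are optimistically assumed
    to be 1.0, so the running product is an upper bound on the final score.
    Once it drops below `threshold`, the remaining factors are skipped
    (pruning) and 0.0 is returned.
    """
    score = text_sim(e1.text, e2.text)
    if score < threshold:
        return 0.0
    tags = sorted({c.tag for c in e1} | {c.tag for c in e2})
    for tag in tags:
        kids1, kids2 = e1.findall(tag), e2.findall(tag)
        if not kids1 or not kids2:
            factor = 0.5  # assumed penalty: structure present on one side only
        else:
            # Match each left child to its best partner on the right.
            factor = sum(max(dup_score(a, b) for b in kids2) for a in kids1) / len(kids1)
        score *= factor
        if score < threshold:
            return 0.0  # pruned: this pair cannot reach the threshold
    return score


# Hypothetical records, not taken from the paper's datasets.
m1 = ET.fromstring("<movie><title>Pulp Fiction</title><year>1994</year></movie>")
m2 = ET.fromstring("<movie><title>Pulp Fiction</title><year>1994</year></movie>")
m3 = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")
print(dup_score(m1, m2), dup_score(m1, m3, threshold=0.5))
```

Because every factor lies in [0, 1], the running product can only shrink, which is what makes the early exit safe in this sketch. The pruning strategy in the paper operates on its own network and may work quite differently.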
Document type:
Journal article
IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, 2012, 99 (PrePrints), 〈10.1109/TKDE.2012.60〉


https://hal.inria.fr/hal-00722505
Contributor: Melanie Herschel
Submitted on: Friday, April 11, 2014 - 15:16:13
Last modified on: Monday, May 28, 2018 - 14:38:02
Document(s) archived on: Friday, July 11, 2014 - 10:37:15

File

leitao_calado_herschel_tkde12....
Publisher files allowed on an open archive

Citation

Luís Leitão, Pável Calado, Melanie Herschel. Efficient and Effective Duplicate Detection in Hierarchical Data. IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, 2012, 99 (PrePrints), 〈10.1109/TKDE.2012.60〉. 〈hal-00722505〉
