Efficient and Effective Duplicate Detection in Hierarchical Data

Luís Leitão; Pável Calado; Melanie Herschel

doi:10.1109/TKDE.2012.60

Journal article, IEEE Transactions on Knowledge and Data Engineering, Year: 2012


Luís Leitão

Role: Author

Instituto Superior Técnico, Universidade Técnica de Lisboa

Pável Calado

Role: Author

Instituto Superior Técnico, Universidade Técnica de Lisboa

Melanie Herschel

Role: Corresponding author
PersonId: 928516


Database optimizations and architectures for complex large data

Abstract

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, such as XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, we present a novel pruning strategy that achieves significant gains over the unoptimized version of the algorithm. Through experiments, we show that our algorithm achieves high precision and recall scores on several datasets. XMLDup also outperforms another state-of-the-art duplicate detection solution in terms of both efficiency and effectiveness.
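The abstract does not specify the structure of the Bayesian network or its probability formulas, so the following Python sketch is only a loose illustration of the two ideas it names: scoring two XML elements from both their values and their nesting, and abandoning a comparison early once it can no longer reach a threshold. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `text_sim` and `dup_score`, the use of a plain product of per-node scores, the string similarity from `difflib`, and the sample movie records.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_sim(a, b):
    """String similarity in [0, 1]; a stand-in for a value-match probability."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def dup_score(x, y, threshold=0.0):
    """Illustrative score in [0, 1] that elements x and y are duplicates.

    The score is a product of the text similarity of the two elements and,
    for every child tag they share, the best score among same-tag child
    pairs. Every factor is at most 1, so the running product is an upper
    bound on the final score: once it drops below `threshold`, the
    comparison is abandoned and 0.0 is returned (the pruning step).
    """
    if x.tag != y.tag:
        return 0.0
    running = 1.0
    tx, ty = (x.text or "").strip(), (y.text or "").strip()
    if tx and ty:
        running *= text_sim(tx, ty)
    # Assumption: child tags present in only one element are ignored.
    shared_tags = sorted({c.tag for c in x} & {c.tag for c in y})
    for tag in shared_tags:
        if running == 0.0 or running < threshold:
            return 0.0  # pruned: the threshold is out of reach
        # A child pair must score at least this much to keep the product alive.
        needed = min(1.0, threshold / running)
        running *= max(
            dup_score(cx, cy, needed)
            for cx in x.findall(tag)
            for cy in y.findall(tag)
        )
    return running if running >= threshold else 0.0


# Hypothetical sample records, used only to exercise the sketch.
a = ET.fromstring(
    "<movie><title>The Matrix</title><year>1999</year>"
    "<actor>Keanu Reeves</actor></movie>"
)
b = ET.fromstring(
    "<movie><title>Matrix, The</title><year>1999</year>"
    "<actor>K. Reeves</actor></movie>"
)
c = ET.fromstring(
    "<movie><title>Casablanca</title><year>1942</year>"
    "<actor>Humphrey Bogart</actor></movie>"
)

if __name__ == "__main__":
    print("a vs b:", dup_score(a, b))
    print("a vs c:", dup_score(a, c))
    print("a vs c, pruned at 0.5:", dup_score(a, c, threshold=0.5))
```

With a threshold of 0.0 nothing is pruned and the full score is returned. With a positive threshold, pairs whose score would fall below it return 0.0 without visiting the remaining children, while pairs that reach the threshold receive the same score as in the unpruned run.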

Domains

Databases [cs.DB]

Main file

leitao_calado_herschel_tkde12.pdf (541.48 KB)

Origin: Publisher files allowed on an open archive

Contributor: Melanie Herschel

https://inria.hal.science/hal-00722505

Submitted on: Friday, April 11, 2014, 15:16:13

Last modified on: Tuesday, February 27, 2024, 09:16:33

Long-term archiving on: Friday, July 11, 2014, 10:37:15

Dates and versions

hal-00722505, version 1 (11-04-2014)

Identifiers

HAL Id: hal-00722505, version 1
DOI: 10.1109/TKDE.2012.60

Cite

Luís Leitão, Pável Calado, Melanie Herschel. Efficient and Effective Duplicate Detection in Hierarchical Data. IEEE Transactions on Knowledge and Data Engineering, 2012, 99 (PrePrints), ⟨10.1109/TKDE.2012.60⟩. ⟨hal-00722505⟩


Collections

EC-PARIS CNRS INRIA UMR8623 INRIA2 LRI-LAHDAK UNIV-PARIS-SACLAY

274 Views

881 Downloads
