Journal article, IEEE Transactions on Knowledge and Data Engineering, 2012

Efficient and Effective Duplicate Detection in Hierarchical Data

Luís Leitão, Pável Calado, Melanie Herschel

Abstract

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, such as XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, we present a novel pruning strategy capable of significant gains over the unoptimized version of the algorithm. Through experiments, we show that our algorithm achieves high precision and recall scores on several datasets. XMLDup also outperforms another state-of-the-art duplicate detection solution, in terms of both efficiency and effectiveness.
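To make the general idea concrete, the following is a minimal illustrative sketch, not the XMLDup algorithm and not its Bayesian network. It assumes a much simpler model: the score of two elements is the product of their own text similarity and the averaged best-match scores of their same-tag children. Because every factor is at most 1, the running product is an upper bound on the final score, so evaluation can stop as soon as that bound drops below a threshold. The names `text_similarity` and `duplicate_score`, the use of `difflib.SequenceMatcher`, and the 0.5 penalty for a missing child are all assumptions made for this example.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def text_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for any value-level measure."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def duplicate_score(x, y, threshold=0.0):
    """Score in [0, 1] combining own text and same-tag children.

    Returns 0.0 early once the running product, which can only shrink,
    has fallen below `threshold` (a simple form of pruning).
    """
    score = 1.0
    # Compare the elements' own text only if at least one has any.
    if (x.text or "").strip() or (y.text or "").strip():
        score *= text_similarity(x.text, y.text)

    for tag in sorted({c.tag for c in x} | {c.tag for c in y}):
        if score < threshold:
            return 0.0  # pruned: remaining children cannot raise the score
        xs, ys = x.findall(tag), y.findall(tag)
        if not xs or not ys:
            score *= 0.5  # assumed penalty when one side lacks this child
            continue
        # Each child of x is scored against its best counterpart in y.
        best = [max(duplicate_score(cx, cy) for cy in ys) for cx in xs]
        score *= sum(best) / len(best)

    return score if score >= threshold else 0.0


a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")

print(duplicate_score(a, b))                 # identical elements
print(duplicate_score(a, c))                 # different movies
print(duplicate_score(a, c, threshold=0.9))  # pruned after the title check
```

In this sketch the pruned call never compares the `year` children, since the title mismatch alone already rules out a score of 0.9 or more. A real system would replace the product rule with a proper probabilistic model and choose the order in which children are evaluated with care.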
Main file: leitao_calado_herschel_tkde12.pdf (541.48 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-00722505 , version 1 (11-04-2014)

Cite

Luís Leitão, Pável Calado, Melanie Herschel. Efficient and Effective Duplicate Detection in Hierarchical Data. IEEE Transactions on Knowledge and Data Engineering, 2012, 99 (PrePrints), ⟨10.1109/TKDE.2012.60⟩. ⟨hal-00722505⟩