E. Rahm and H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Engineering Bulletin, vol.23, pp.3-13, 2000.

F. Naumann and M. Herschel, An Introduction to Duplicate Detection, Synthesis Lectures on Data Management, vol.2, issue.1, 2010.
DOI : 10.2200/S00262ED1V01Y201003DTM003

R. Ananthakrishna, S. Chaudhuri, and V. Ganti, Eliminating Fuzzy Duplicates in Data Warehouses, Conference on Very Large Databases (VLDB), pp.586-597, 2002.
DOI : 10.1016/B978-155860869-6/50058-5

D. V. Kalashnikov and S. Mehrotra, Domain-independent data cleaning via analysis of entity-relationship graph, ACM Transactions on Database Systems, vol.31, issue.2, pp.716-767, 2006.
DOI : 10.1145/1138394.1138401

M. Weis and F. Naumann, DogmatiX tracks down duplicates in XML, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD '05, pp.431-442, 2005.
DOI : 10.1145/1066157.1066207

L. Leitão, P. Calado, and M. Weis, Structure-based inference of XML similarity for fuzzy duplicate detection, Proceedings of the sixteenth ACM conference on information and knowledge management, CIKM '07, pp.293-302, 2007.
DOI : 10.1145/1321440.1321483

A. M. Kade and C. A. Heuser, Matching XML documents in highly dynamic applications, Proceedings of the eighth ACM symposium on Document engineering, DocEng '08, pp.191-198, 2008.
DOI : 10.1145/1410140.1410178

D. Milano, M. Scannapieco, and T. Catarci, Structure aware XML object identification, VLDB Workshop on Clean Databases (CleanDB), 2006.

P. Calado, M. Herschel, and L. Leitão, An Overview of XML Duplicate Detection Algorithms, Soft Computing in XML Data Management, Studies in Fuzziness and Soft Computing, 2010.
DOI : 10.1007/978-3-642-14010-5_8

S. Puhlmann, M. Weis, and F. Naumann, XML duplicate detection using sorted neighborhoods, Conference on Extending Database Technology (EDBT), pp.773-791, 2006.

J. C. Carvalho and A. S. da Silva, Finding similar identities among objects from multiple web sources, Proceedings of the fifth ACM international workshop on Web information and data management, WIDM '03, pp.90-93, 2003.
DOI : 10.1145/956699.956719

M. A. Hernández and S. J. Stolfo, The merge/purge problem for large databases, Conference on the Management of Data (SIGMOD), pp.127-138, 1995.

J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of plausible inference, 1988.

L. Leitão and P. Calado, Duplicate detection through structure optimization, Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, pp.443-452, 2011.
DOI : 10.1145/2063576.2063644

E. H. Simpson, Measurement of Diversity, Nature, vol.163, issue.4148, p.688, 1949.
DOI : 10.1038/163688a0

H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, Support vector regression machines, Advances in Neural Information Processing Systems (NIPS), pp.155-161, 1996.

S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, Optimization by Simulated Annealing, Science, vol.220, issue.4598, pp.671-680, 1983.
DOI : 10.1126/science.220.4598.671

T. Joachims, Making large-scale support vector machine learning practical, Advances in Kernel Methods: Support Vector Learning, pp.169-184, 1999.

Z. Nie, Y. Zhang, J. Wen, and W. Ma, Object-level ranking, Proceedings of the 14th international conference on World Wide Web, WWW '05, pp.567-574, 2005.
DOI : 10.1145/1060745.1060828

L. Chen, L. Zhang, F. Jing, K. Deng, and W. Ma, Ranking web objects from multiple communities, Proceedings of the 15th ACM international conference on Information and knowledge management, CIKM '06, pp.377-386, 2006.
DOI : 10.1145/1183614.1183670