Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web

Abstract: In the Web of data, entities are described by inter-linked data rather than documents on the Web. In this work, we focus on entity resolution in the Web of data, i.e., identifying descriptions that refer to the same real-world entity. To reduce the required number of pairwise comparisons, methods for entity resolution perform blocking as a pre-processing step. A blocking technique places similar entity descriptions into blocks and executes comparisons only between descriptions within the same block. We experimentally evaluate blocking techniques proposed for the Web of data and present dataset characteristics that determine the effectiveness and efficiency of such methods. Furthermore, we analyze the characteristics of the missed matching entity descriptions and examine different types of links that blocking techniques can potentially identify.

I. INTRODUCTION

Nowadays, knowledge bases (KBs) offer comprehensive, machine-readable descriptions of a large variety of real-world entities (e.g., persons, places) published on the Web as Linked Data (LD). Although KBs (e.g., DBpedia, Freebase) may be derived from the same data source (e.g., Wikipedia), they may provide multiple descriptions of the same entities. This is mainly due to the different information extraction tools and curation policies [3] employed by KBs, resulting in complementary and sometimes conflicting descriptions. Entity resolution (ER) aims to identify descriptions that refer to the same entity within or across KBs [2], [4]. Compared to data warehouses, the new ER challenges stem from the openness of the Web of data in describing entities by an unbounded number of KBs, the semantic and structural diversity of the descriptions provided across domains even for the same entities, and the autonomy of KBs in terms of adopted processes for creating and curating descriptions.
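The blocking step described in the abstract can be illustrated with a minimal sketch. It assumes a simple token-based scheme, chosen here only for illustration: every token appearing in any attribute value defines a block, and two descriptions become a candidate pair if they share at least one block. The function names, the dictionary layout of the descriptions, and the sample data are all hypothetical and are not taken from the paper.

```python
import re
from collections import defaultdict
from itertools import combinations


def token_blocking(descriptions):
    """Group description ids by the tokens found in their attribute values.

    `descriptions` maps an entity id to a dict of attribute -> value
    (a hypothetical layout used only for this sketch).
    """
    blocks = defaultdict(set)
    for entity_id, attributes in descriptions.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block holding a single description suggests no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect each pair of descriptions that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative data only: two hypothetical KBs describing overlapping entities.
descriptions = {
    "kb1:Athens": {"label": "Athens", "country": "Greece"},
    "kb2:Athina": {"name": "Athens city", "locatedIn": "Greece"},
    "kb1:Paris": {"label": "Paris", "country": "France"},
    "kb2:Lyon": {"name": "Lyon", "locatedIn": "France"},
}

pairs = candidate_pairs(token_blocking(descriptions))
all_pairs = len(descriptions) * (len(descriptions) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
for pair in sorted(pairs):
    print(pair)
```

In this toy input, the two descriptions of Athens share tokens and end up in a common block, while the Paris and Lyon descriptions are also paired because both mention France. Blocking only proposes candidates; a later matching step still has to decide which candidate pairs actually refer to the same entity.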
In general, deciding effectively and efficiently whether two descriptions refer to the same entity is challenged by the scale, diversity and graph structuring of the descriptions in the Web. This requires an understanding of the relationships among somehow similar descriptions that goes beyond duplicate detection. Also, the huge volume of entity collections that we need to resolve in the Web makes it prohibitive to examine all descriptions pairwise. In this context of big Web data, blocking is typically used as a pre-processing step for ER to reduce the number of required comparisons. After blocking, each description is compared only to others placed within the same block. The desiderata of blocking are to place (i) similar
Document type: Conference paper
2015 IEEE International Conference on Big Data (IEEE BigData 2015), Oct 2015, Santa Clara, CA, United States. 2015, 〈http://cci.drexel.edu/bigdata/bigdata2015/〉. 〈10.1109/BigData.2015.7363781〉

https://hal.inria.fr/hal-01199399
Contributor: Vassilis Christophides
Submitted on: Tuesday, 15 September 2015 - 12:59:45
Last modified on: Wednesday, 13 January 2016 - 15:59:19
Document(s) archived on: Tuesday, 29 December 2015 - 07:12:12

File

Big Data Entity Resolution.pdf
Files produced by the author(s)

Citation

Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides. Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web. 2015 IEEE International Conference on Big Data (IEEE BigData 2015), Oct 2015, Santa Clara, CA, United States. 2015, 〈http://cci.drexel.edu/bigdata/bigdata2015/〉. 〈10.1109/BigData.2015.7363781〉. 〈hal-01199399〉
