An Extensible Framework for Data Cleaning

Abstract: Data integration solutions that handle large amounts of data have been in strong demand over the last few years. Besides the traditional data integration problems (e.g. schema integration, local-to-global schema mappings), three additional data problems have to be dealt with: (1) the absence of universal keys across different databases, known as the object identity problem, (2) the existence of keyboard errors in the data, and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is collectively called the data cleaning process. In this work, we propose a framework that offers the fundamental services required by this process: data transformation, duplicate elimination and multi-table matching. These services are implemented using a set of purposely designed macro-operators. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability to explicitly include human interaction in the process. The main novelty of the work is that the framework permits the following performance optimizations, which are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation. We measure the benefits of each.
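To make the object identity and keyboard-error problems concrete, here is a minimal Python sketch of approximate matching between two record sets that share no key. It is an illustration only, not the report's macro-operators or its SQL extension: the function names (`edit_distance`, `match_records`), the record layout, the sample data and the threshold of 2 are all assumptions, and the naive nested loop uses none of the optimizations listed in the abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def match_records(left, right, max_distance=2):
    """Pair records whose normalized names differ by at most max_distance edits.

    Hypothetical helper: a naive nested-loop comparison of every pair.
    """
    pairs = []
    for l in left:
        for r in right:
            name_l = l["name"].strip().lower()
            name_r = r["name"].strip().lower()
            if edit_distance(name_l, name_r) <= max_distance:
                pairs.append((l["id"], r["id"]))
    return pairs


# Two made-up sources with unrelated identifiers and one typing error.
customers_a = [{"id": "A1", "name": "Maria Keller"},
               {"id": "A2", "name": "John Doe"}]
customers_b = [{"id": "B7", "name": "Maria Kellre"},
               {"id": "B9", "name": "Jane Roe"}]

matches = match_records(customers_a, customers_b)
print(matches)
```

The quadratic pairwise comparison in this sketch becomes expensive on large tables, which is the kind of cost the performance optimizations named in the abstract are meant to address.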
Document type:
Report
[Research Report] RR-3742, INRIA. 1999

https://hal.inria.fr/inria-00072922
Contributor: Rapport de Recherche Inria
Submitted on: Wednesday, May 24, 2006 - 11:17:04
Last modified on: Friday, May 25, 2018 - 12:02:05
Document(s) archived on: Sunday, April 4, 2010 - 23:27:53

Identifiers

  • HAL Id : inria-00072922, version 1

Citation

Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon. An Extensible Framework for Data Cleaning. [Research Report] RR-3742, INRIA. 1999. 〈inria-00072922〉
