An Extensible Framework for Data Cleaning

Helena Galhardas; Daniela Florescu; Dennis Shasha; Eric Simon

Report (Research Report) Year: 1999

Helena Galhardas

Role: Author

Information Mediation Systems

Daniela Florescu

Role: Author

Information Mediation Systems

Dennis Shasha

Role: Author
PersonId: 833427

Information Mediation Systems

Eric Simon

Role: Author

Information Mediation Systems

Abstract

Data integration solutions that handle large amounts of data have been in strong demand in recent years. Besides the traditional data integration problems (e.g., schema integration, local-to-global schema mappings), three additional data problems have to be dealt with: (1) the absence of universal keys across different databases, known as the object identity problem, (2) the existence of keyboard errors in the data, and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is collectively called the data cleaning process. In this work, we propose a framework that offers the fundamental services required by this process: data transformation, duplicate elimination, and multi-table matching. These services are implemented using a set of purpose-designed macro-operators. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is its ability to explicitly include human interaction in the process. The main novelty of the work is that the framework permits the following performance optimizations, which are tailored to data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down, and short-circuited computation. We measure the benefits of each.
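To make the first two problems in the abstract concrete (no shared key, plus typing errors), the following is a minimal sketch of approximate duplicate detection on a single text field. It is not the report's macro-operators or its SQL extension; the function names, the `name` field, the sample records, and the 0.2 distance threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, max_ratio: float = 0.2) -> bool:
    """Hypothetical matching rule: normalized edit distance under a threshold."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= max_ratio


def candidate_duplicates(records, field="name", max_ratio=0.2):
    """Compare every pair of records; return id pairs that look like duplicates."""
    return [(r["id"], s["id"])
            for r, s in combinations(records, 2)
            if similar(r[field], s[field], max_ratio)]


# Illustrative records with no shared key and one typing error.
records = [
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "Jon Smith"},
    {"id": 3, "name": "Mary Jones"},
]
pairs = candidate_duplicates(records)
print(pairs)  # [(1, 2)]
```

This sketch compares all pairs, which is quadratic in the number of records; avoiding that cost is the kind of concern the optimizations named in the abstract are aimed at. A threshold-based rule also leaves borderline pairs undecided, which is where human review could be brought in.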

Keywords

DATA INTEGRATION; DATA CLEANING; QUERY OPTIMIZATION; QUERY LANGUAGE; APPROXIMATE JOIN; DATA TRANSFORMATION

Domains

Other [cs.OH]

Main file

RR-3742.pdf (563.56 KB)

https://inria.hal.science/inria-00072922

Submitted on: Wednesday, May 24, 2006, 11:17:04

Last modified on: Wednesday, August 30, 2023, 12:29:48

Long-term archiving on: Sunday, April 4, 2010, 23:27:53

Dates and versions

inria-00072922, version 1 (24-05-2006)

Identifiers

HAL Id: inria-00072922, version 1

Cite

Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon. An Extensible Framework for Data Cleaning. [Research Report] RR-3742, INRIA. 1999. ⟨inria-00072922⟩

Collections

INRIA INRIA-RRRT INRIA2 LARA
