Declarative Data Cleaning : Language, Model, and Algorithms

Abstract : The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in interdisciplinary fields (e.g., in environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge with these tools is the design of a data flow graph that effectively generates clean data and performs efficiently on large sets of input data. The difficulty comes from (i) a lack of clear separation between the logical specification of data transformations and their physical implementation and (ii) a lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.
Document type :
Reports

https://hal.inria.fr/inria-00072476
Contributor : Rapport de Recherche Inria
Submitted on : Wednesday, May 24, 2006 - 10:04:08 AM
Last modification on : Friday, May 25, 2018 - 12:02:05 PM
Long-term archiving on : Sunday, April 4, 2010 - 11:09:31 PM

Identifiers

  • HAL Id : inria-00072476, version 1

Citation

Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian Saita. Declarative Data Cleaning : Language, Model, and Algorithms. [Research Report] RR-4149, INRIA. 2001. ⟨inria-00072476⟩
