
Declarative Data Cleaning : Language, Model, and Algorithms

Abstract : The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in interdisciplinary fields (e.g., in environmental science), existing ETL (Extraction, Transformation, Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge with these tools is the design of a data flow graph that effectively generates clean data and performs efficiently on large sets of input data. The difficulty comes from (i) the lack of a clear separation between the logical specification of data transformations and their physical implementation and (ii) the lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.
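As a rough illustration of point (i), the separation between a logical specification and its physical implementation, the following minimal Python sketch matches duplicate bibliographic references in two ways. It is not the language or the algorithms of the report: the names `normalize`, `same_reference`, `match_nested_loop`, and `match_hash` are hypothetical, and exact equality of normalized titles is an assumption standing in for whatever matching criterion a real cleaning program would use.

```python
from itertools import combinations


def normalize(title: str) -> str:
    """Lowercase the title, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(cleaned.split())


def same_reference(a: str, b: str) -> bool:
    """Logical specification (assumed): two references match if their
    normalized titles are equal. This says WHAT matches, not HOW to find it."""
    return normalize(a) == normalize(b)


def match_nested_loop(records):
    """Physical implementation 1: compare every pair of records (quadratic)."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if same_reference(a, b)
    }


def match_hash(records):
    """Physical implementation 2: group records by normalized title, then
    pair up records within each group. Same result, fewer comparisons."""
    buckets = {}
    for i, record in enumerate(records):
        buckets.setdefault(normalize(record), []).append(i)
    return {
        (i, j)
        for ids in buckets.values()
        for i, j in combinations(ids, 2)
    }


# Illustrative input only; not data from the report.
records = [
    "Declarative Data Cleaning: Language, Model, and Algorithms",
    "declarative data cleaning : language, model, and algorithms",
    "An Unrelated Title",
]

pairs_a = match_nested_loop(records)
pairs_b = match_hash(records)
print(pairs_a == pairs_b, sorted(pairs_a))
```

Both functions satisfy the same logical predicate, so one can be swapped for the other for performance reasons without changing the cleaning specification itself.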

Contributor : Rapport de Recherche Inria
Submitted on : Wednesday, May 24, 2006 - 10:04:08 AM
Last modification on : Thursday, February 3, 2022 - 11:18:55 AM
Long-term archiving on : Sunday, April 4, 2010 - 11:09:31 PM


  • HAL Id : inria-00072476, version 1



Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian Saita. Declarative Data Cleaning : Language, Model, and Algorithms. [Research Report] RR-4149, INRIA. 2001. ⟨inria-00072476⟩


