A Data Cleaning Solution by Perl Scripts for the KDD Cup 2003 Task 2

Martine Cadot 1 Joseph Di Martino 2
1 ORPAILLEUR - Knowledge representation, reasoning
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
2 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : In this paper, we present our solution to Task 2 of the KDD Cup 2003 competition. Our approach is a data cleaning methodology implemented as Perl scripts. The scripts use regular expressions to automatically extract relevant information from the 35,472 LaTeX texts, and the expressions were tuned through statistical investigation of those texts. With this solution we obtained 144,087 associations.
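As an illustration of the kind of regular-expression extraction the abstract describes, the following minimal sketch pulls bibliography entries out of a LaTeX source. It is written in Python rather than the authors' Perl, and it is not their script: the pattern, the function name `extract_bibitems`, the choice of `\bibitem` entries as the "relevant information", and the sample text are all assumptions made for this example.

```python
import re

# Hypothetical pattern (not from the paper): matches \bibitem{key} or
# \bibitem[label]{key}, then lazily captures the entry text up to the next
# \bibitem, the end of the bibliography environment, or the end of the input.
BIBITEM = re.compile(
    r"\\bibitem(?:\[[^\]]*\])?\{([^}]*)\}(.*?)"
    r"(?=\\bibitem|\\end\{thebibliography\}|\Z)",
    re.DOTALL,
)


def extract_bibitems(latex_source):
    """Return a list of (key, entry text) pairs, one per \\bibitem found."""
    entries = []
    for key, body in BIBITEM.findall(latex_source):
        # Collapse line breaks and repeated spaces so entries compare cleanly.
        entries.append((key.strip(), " ".join(body.split())))
    return entries


# Made-up sample input, for illustration only.
sample = r"""
\begin{thebibliography}{9}
\bibitem{a1} A. Author, Nucl. Phys. B 123 (1999) 45.
\bibitem[B2]{b2} B. Writer,
  hep-th/9901001.
\end{thebibliography}
"""

for key, text in extract_bibitems(sample):
    print(key, "->", text)
```

Real LaTeX sources vary far more than this sample does, which is presumably why the abstract mentions tuning the expressions against statistics gathered from the texts themselves.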
Document type :
Journal articles
https://hal.inria.fr/inria-00100173
Contributor : Joseph Di Martino
Submitted on : Tuesday, September 26, 2006 - 10:15:09 AM
Last modification on : Friday, February 26, 2021 - 3:28:05 PM

Identifiers

  • HAL Id : inria-00100173, version 1

Citation

Martine Cadot, Joseph Di Martino. A Data Cleaning Solution by Perl Scripts for the KDD Cup 2003 Task 2. SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, Association for Computing Machinery (ACM), 2004, 5 (2), pp.158-159. ⟨inria-00100173⟩
