A Data Cleaning Solution by Perl Scripts for the KDD Cup 2003 Task 2

Martine Cadot 1 Joseph Di Martino 2
1 ORPAILLEUR - Knowledge representation, reasoning
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
2 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : In this paper, we present our solution to Task 2 of the KDD Cup 2003 competition. Our approach is a data cleaning methodology built on Perl scripts. The scripts use regular expressions to automatically extract relevant information from the 35,472 LaTeX texts, and we tuned these expressions through statistical investigations of the texts. With this solution we obtained 144,087 associations.
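The abstract does not reproduce the scripts themselves, so the following is only a minimal sketch of the general technique it describes: regular-expression extraction from LaTeX source. It is written in Python rather than the authors' Perl, and it assumes, purely for illustration, that the information of interest sits in \bibitem entries. The pattern BIBITEM_RE, the function extract_bibitems, and the sample text are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical pattern (an assumption, not the authors' expression):
# capture the key of each \bibitem and the text that follows it, up to
# the next \bibitem, the end of the bibliography, or the end of input.
BIBITEM_RE = re.compile(
    r"\\bibitem(?:\[[^\]]*\])?\{([^}]*)\}"  # optional [label], then {key}
    r"(.*?)"                                # entry body, non-greedy
    r"(?=\\bibitem|\\end\{thebibliography\}|\Z)",
    re.DOTALL,
)


def extract_bibitems(latex_source):
    """Return a list of (key, cleaned_text) pairs, one per \\bibitem."""
    entries = []
    for key, body in BIBITEM_RE.findall(latex_source):
        # Drop LaTeX comments: an unescaped % up to the end of the line.
        body = re.sub(r"(?<!\\)%.*", "", body)
        # Collapse line breaks and runs of spaces into single spaces.
        body = " ".join(body.split())
        entries.append((key, body))
    return entries


# Made-up sample input; the entries are placeholders, not real references.
SAMPLE = r"""
\begin{thebibliography}{9}
\bibitem{a00} A. Author, % check year
  J. Example 1 (2000) 1.
\bibitem[B01]{b01} B. Author,
  J. Example 2 (2001) 2.
\end{thebibliography}
"""

entries = extract_bibitems(SAMPLE)
```

Calling extract_bibitems(SAMPLE) yields one (key, text) pair per entry, with comments and line breaks removed. Counting how often such a pattern matches across a corpus, and inspecting the files where it fails, is one plausible way to refine an expression statistically, though the abstract does not say which statistics the authors used.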
Document type :
Journal articles

https://hal.inria.fr/inria-00100173
Contributor : Joseph Di Martino
Submitted on : Tuesday, September 26, 2006 - 10:15:09 AM
Last modification on : Thursday, January 11, 2018 - 6:19:55 AM

Identifiers

  • HAL Id : inria-00100173, version 1

Citation

Martine Cadot, Joseph Di Martino. A Data Cleaning Solution by Perl Scripts for the KDD Cup 2003 Task 2. SIGKDD Explorations, ACM, 2004, 5 (2), pp.158-159. ⟨inria-00100173⟩
