Improved CHAID Algorithm for Document Structure Modelling

Abdel Belaïd; Philippe Moinel; Yves Rangoni

doi:10.1117/12.839794

Conference paper, Year: 2010

Abdel Belaïd

Role: Author
PersonId : 830137

READ

Philippe Moinel

Role: Author

Laboratoire Lorrain de Recherche en Informatique et ses Applications

Yves Rangoni

Role: Author
PersonId : 830537

READ

Abstract

This paper proposes a technique for the logical labelling of document images. It uses a decision-tree-based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis, and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested: the first uses one tree per logical element, the second uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
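As a rough illustration of the chi-squared criterion that gives CHAID its name, the sketch below scores each candidate feature by the Pearson chi-squared statistic of its contingency table against the logical labels and picks the highest-scoring one as the split. It is a minimal sketch only: it does not reproduce the I-CHAID refinements, the logical rules added by the authors, or SIPINA's implementation. The feature names (`font`, `position`), the label names, and the toy data are all hypothetical, and the raw statistics are only comparable here because both toy features have the same number of categories.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic of the feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Expected count if the feature and the label were independent.
            expected = value_count * label_count / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def best_split(blocks, labels, feature_names):
    """Return the feature with the largest chi-squared score, plus all scores."""
    scores = {
        name: chi_squared([block[name] for block in blocks], labels)
        for name in feature_names
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: one dict of physical features per text block.
blocks = [
    {"font": "large", "position": "top"},
    {"font": "large", "position": "top"},
    {"font": "small", "position": "top"},
    {"font": "small", "position": "bottom"},
    {"font": "small", "position": "bottom"},
    {"font": "small", "position": "top"},
]
labels = ["title", "title", "author", "body", "body", "body"]

if __name__ == "__main__":
    feature, scores = best_split(blocks, labels, ["font", "position"])
    print(feature, scores)
```

A full CHAID-style learner would also merge feature categories that are not significantly different, compare features through adjusted p-values rather than raw statistics, and recurse on each resulting subset of blocks.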

Keywords

Document Image Analysis and Recognition; Physical and logical layout analysis; OCR; Improved CHAID Algorithm; XML based formats

Domains

Signal and Image Processing [eess.SP]

Main file

DRR_Improved_CHAID.pdf (466.45 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/inria-00579684

Submitted on: Thursday, March 24, 2011, 15:48:44

Last modified on: Monday, September 11, 2023, 17:41:19

Long-term archiving on: Saturday, June 25, 2011, 02:49:41

Dates and versions

inria-00579684 , version 1 (24-03-2011)

Identifiers

HAL Id : inria-00579684 , version 1
DOI : 10.1117/12.839794

Cite

Abdel Belaïd, Philippe Moinel, Yves Rangoni. Improved CHAID Algorithm for Document Structure Modelling. Document Recognition and Retrieval XVII, Jan 2010, San Jose, United States. pp.7, ⟨10.1117/12.839794⟩. ⟨inria-00579684⟩

Collections

CNRS INRIA UNIV-LORRAINE LORIA
