Segmentation et classification des zones d'une page de document

Jean-Marc Vauthier 1 Abdel Belaïd 1
1 READ - Recognition of writing and analysis of documents
LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : This paper proposes a methodology for segmenting complex documents based on textual content and shape. The textual content corresponds to printed text and is verified by word-level analysis using a dictionary and regular expressions adapted to noise. This makes it possible to locate expressions of interest (address, phone number, etc.). The non-textual content is segmented into zones according to the size of the connected components and the distance between them, so that zones can be classified as logo, signature, or table. To do this, features such as run lengths and bi-level co-occurrence are extracted. The classification is based on a modified boosting method with decision trees; the modification concerns how the probability of drawing each training sample is computed. Compared to OCR systems, which can classify text, tables, and pictures, our methodology improves performance and detects additional zone types such as handwritten text, logos, signatures, tables, and stamps.
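The text-verification step described in the abstract (a dictionary plus regular expressions adapted to noise) can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion table, the ten-digit phone pattern, the digit threshold, the similarity cutoff, and the function names `find_phone_numbers` and `match_keyword` are all assumptions chosen for illustration.

```python
import re
from difflib import get_close_matches

# Assumed OCR confusion table: letters that are often misread digits.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Assumed pattern: five pairs of digit-like characters, optionally
# separated by a space, dot, or hyphen (a ten-digit phone number).
DIGITLIKE = r"[0-9OolISB]"
PHONE_RE = re.compile(
    rf"(?<!\w)(?:{DIGITLIKE}{{2}}[ .-]?){{4}}{DIGITLIKE}{{2}}(?!\w)"
)

# Assumed threshold: require this many true digits so that ordinary
# words made of digit-like letters are not accepted as phone numbers.
MIN_TRUE_DIGITS = 6


def find_phone_numbers(text):
    """Return normalized ten-digit strings for phone-like spans in noisy text."""
    results = []
    for match in PHONE_RE.finditer(text):
        raw = re.sub(r"[ .-]", "", match.group())
        if sum(ch.isdigit() for ch in raw) < MIN_TRUE_DIGITS:
            continue  # too few real digits: probably not a number
        results.append(raw.translate(OCR_DIGIT_FIXES))
    return results


def match_keyword(word, lexicon, cutoff=0.75):
    """Return the lexicon entry closest to an OCR'd word, or None."""
    hits = get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return hits[0] if hits else None


if __name__ == "__main__":
    lexicon = ["telephone", "address", "fax"]
    print(match_keyword("Te1ephone", lexicon))
    print(find_phone_numbers("Tel: O3 83 59 2O 00"))
```

The regular expression widens each digit position to accept common OCR confusions, and the dictionary lookup uses approximate string matching, so both checks tolerate a few misrecognized characters. A real system would tune the confusion table and thresholds to the OCR engine in use.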

Cited literature [3 references]

https://hal.inria.fr/hal-00779232
Contributor : Abdel Belaid
Submitted on : Wednesday, January 23, 2013 - 5:28:07 PM
Last modification on : Tuesday, December 18, 2018 - 4:38:02 PM
Document(s) archived on : Wednesday, April 24, 2013 - 3:54:26 AM

File

cifed2012_submission_24.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00779232, version 1

Citation

Jean-Marc Vauthier, Abdel Belaïd. Segmentation et classification des zones d'une page de document. CIFED-CORIA, Mar 2012, Bordeaux, France. ⟨hal-00779232⟩
