Conference papers

Segmentation et classification des zones d'une page de document (Segmentation and classification of the zones of a document page)

Jean-Marc Vauthier 1, Abdel Belaïd 1
1 READ - Recognition of writing and analysis of documents
LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : This paper proposes a methodology for segmenting complex documents based on their textual content and on shape. The textual content is the printed text; it is verified by word-level analysis using a dictionary and variable regular expressions that are adapted to noise. This makes it possible to locate expressions of interest (address, phone number, etc.). The non-textual content is segmented into zones according to the size of, and the distance between, connected components, so that zones such as logos, signatures, and tables can be classified. To that end, features such as run lengths and bi-level co-occurrence are extracted. The classification relies on a modified boosting method with decision trees; the modification concerns how the probability of drawing each training sample is computed. Compared with OCR systems, which can classify text, tables, and pictures, our methodology improves performance and detects additional zone types such as handwritten text, logos, signatures, tables, and stamps.
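
The grouping of connected components into zones by size and distance can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the distance measure, the thresholds, or the merging strategy, so the gap definition (largest axis-wise gap between bounding boxes), the union-find merging, and the names box_gap, group_components, max_gap and min_area are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Bounding box of one connected component: (x0, y0, x1, y1) in pixels.
# This representation is an assumption; the abstract does not specify one.
Box = Tuple[int, int, int, int]


def box_gap(a: Box, b: Box) -> int:
    """Largest axis-wise gap between two boxes (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def group_components(boxes: List[Box], max_gap: int, min_area: int = 0) -> List[Box]:
    """Merge components closer than max_gap into zones.

    Components whose area is below min_area are discarded first
    (hypothetical noise filter standing in for the size criterion).
    """
    kept = [b for b in boxes if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]
    parent = list(range(len(kept)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of components that lie within max_gap of each other.
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if box_gap(kept[i], kept[j]) <= max_gap:
                parent[find(i)] = find(j)

    # The zone of a group is the bounding box of all its members.
    zones: Dict[int, Box] = {}
    for i, b in enumerate(kept):
        root = find(i)
        if root in zones:
            z = zones[root]
            zones[root] = (min(z[0], b[0]), min(z[1], b[1]),
                           max(z[2], b[2]), max(z[3], b[3]))
        else:
            zones[root] = b
    return sorted(zones.values())


# Two nearby components, one distant component, and one speck of noise.
components = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110), (50, 50, 51, 51)]
zones = group_components(components, max_gap=5, min_area=4)
```

In this sketch the two nearby components merge into one zone, the distant one stays separate, and the one-pixel speck is dropped. Each resulting zone would then be passed to feature extraction and classification, which the sketch does not cover.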

Cited literature : 3 references
Contributor : Abdel Belaid
Submitted on : Wednesday, January 23, 2013 - 5:28:07 PM
Last modification on : Saturday, October 16, 2021 - 11:26:09 AM
Long-term archiving on : Wednesday, April 24, 2013 - 3:54:26 AM




  • HAL Id : hal-00779232, version 1



Jean-Marc Vauthier, Abdel Belaïd. Segmentation et classification des zones d'une page de document. CIFED-CORIA, Mar 2012, Bordeaux, France. ⟨hal-00779232⟩


