LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : This paper describes a labeling approach for the automatic recognition of Tables of Contents (ToC). A prototype is used for the electronic consultation of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR, and recognizes the ToC by labeling the text without using any a priori model. Labeling is based on Part-of-Speech (PoS) tagging, which is initiated by a primary labeling of text components using specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to the article fields "title" and "authors". Unlabeled tokens are assigned to one field or the other either by applying PoS correction rules or by using a structure model generated from well-detected articles. The prototype performs well across different ToC layouts and character recognition qualities. Without manual intervention, it achieved a 96.3% rate of correct segmentation and a 93.0% rate of correct field extraction on 38 journals comprising 2,020 articles.
Document type :
Journal articles

https://hal.inria.fr/inria-00100452
Contributor : Publications Loria
Submitted on : Tuesday, September 26, 2006 - 2:45:53 PM
Last modification on : Friday, February 26, 2021 - 3:28:06 PM

Identifiers

• HAL Id : inria-00100452, version 1

Citation

Abdel Belaïd. Recognition of Table of Contents for Electronic Library Consulting. International Journal on Document Analysis and Recognition, Springer Verlag, 2001, 4 (1), pp.35-45. ⟨inria-00100452⟩