Journal articles

Recognition of Table of Contents for Electronic Library Consulting

Abdel Belaïd
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: This paper describes a labeling approach for the automatic recognition of Tables of Contents (ToC). A prototype is used for electronic consulting of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR and recognizes the ToC by labeling text, without using any a priori model. Labeling is based on Part-of-Speech (PoS) tagging, which is initiated by a primary labeling of text components using specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to the article fields "title" and "authors". Unlabeled tokens are assigned to one field or the other either by applying PoS correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily on different ToC layouts and character recognition qualities. Without manual intervention, it achieved a 96.3% rate of correct segmentation and a 93.0% rate of correct field extraction on 38 journals comprising 2020 articles.
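The labeling idea summarized in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the dictionaries, the tag names (NAME, WORD, UNK), the single correction rule, and the function names are all hypothetical and chosen only to show the general flow of dictionary-based primary labeling, grouping of tags, reduction to the "title" and "authors" fields, and absorption of unlabeled tokens.

```python
from itertools import groupby

# Hypothetical toy dictionaries; the actual dictionaries are not given in this record.
FIRST_NAMES = {"abdel", "marie", "john"}
COMMON_WORDS = {"recognition", "of", "table", "contents", "for",
                "electronic", "library", "consulting"}

# Assumed mapping from tag classes to article fields.
TAG_TO_FIELD = {"NAME": "authors", "WORD": "title"}


def primary_label(token):
    """Assign a coarse tag to a token by dictionary lookup."""
    key = token.lower().strip(".,;:")
    if key in FIRST_NAMES:
        return "NAME"
    if key in COMMON_WORDS:
        return "WORD"
    return "UNK"  # unlabeled token, resolved later


def label_fields(tokens):
    """Group tokens into (field, text) pairs for one ToC entry."""
    tags = [primary_label(t) for t in tokens]

    # Illustrative correction rule: an unlabeled token inherits the tag
    # of the token immediately before it.
    for i, tag in enumerate(tags):
        if tag == "UNK" and i > 0:
            tags[i] = tags[i - 1]

    # Leading unlabeled tokens take the first known tag, if any exists.
    first_known = next((t for t in tags if t != "UNK"), None)
    if first_known is None:
        return []
    tags = [first_known if t == "UNK" else t for t in tags]

    # Merge runs of identical tags and reduce each run to an article field.
    fields = []
    for tag, run in groupby(zip(tags, tokens), key=lambda pair: pair[0]):
        fields.append((TAG_TO_FIELD[tag], " ".join(tok for _, tok in run)))
    return fields
```

For example, calling `label_fields("Recognition of Table of Contents Abdel Belaïd".split())` groups the first five tokens under "title" and the last two under "authors", with the surname absorbed by the correction rule because it is absent from the toy dictionaries.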
Document type: Journal articles

Contributor: Publications Loria
Submitted on: Tuesday, September 26, 2006 - 2:45:53 PM
Last modification on: Wednesday, March 9, 2022 - 5:40:04 PM


  • HAL Id : inria-00100452, version 1



Abdel Belaïd. Recognition of Table of Contents for Electronic Library Consulting. International Journal on Document Analysis and Recognition, Springer Verlag, 2001, 4 (1), pp.35-45. ⟨inria-00100452⟩


