Recognition of Table of Contents for Electronic Library Consulting

Abdel Belaïd 1
1 READ - READ
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: This paper describes a labeling approach for the automatic recognition of Tables of Contents (ToC). A prototype is used for electronic consulting of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR and labels the text without using any a priori model. Labeling is based on Part-of-Speech (PoS) tagging, which is initialized by a primary labeling of text components using specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to the article fields "title" and "authors". Unlabeled tokens are integrated into one field or the other, either by applying PoS correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily across different ToC layouts and character recognition qualities. Without manual intervention, a correct segmentation rate of 96.3% was obtained on 38 journals comprising 2020 articles, along with a correct field extraction rate of 93.0%.
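The labeling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the two small dictionaries, the tag names `AUTHOR` and `TITLE`, and the single "inherit from the nearest labeled neighbor" correction rule are all assumptions made for illustration, standing in for the PoS tagging, class grouping, and structure model that the abstract mentions but does not detail.

```python
import re

# Hypothetical mini-dictionaries; the dictionaries used in the paper
# are not given in the abstract.
FIRST_NAMES = {"abdel", "marie", "john"}
COMMON_WORDS = {"recognition", "of", "table", "contents", "for", "the"}


def primary_label(token):
    """Primary labeling: tag a token from dictionary lookups, or None."""
    low = token.lower().strip(".,")
    if re.fullmatch(r"[A-Z]\.", token):  # an initial such as "A."
        return "AUTHOR"
    if low in FIRST_NAMES:
        return "AUTHOR"
    if low in COMMON_WORDS:
        return "TITLE"
    return None  # unknown token, resolved by the correction rule below


def label_line(line):
    """Label every token of one ToC entry."""
    tokens = line.split()
    tags = [primary_label(t) for t in tokens]
    # Assumed correction rule: an unlabeled token takes the tag of the
    # nearest labeled token to its left, else to its right.
    for i in range(len(tags)):
        if tags[i] is None:
            left = next((tags[j] for j in range(i - 1, -1, -1) if tags[j]), None)
            right = next((tags[j] for j in range(i + 1, len(tags)) if tags[j]), None)
            tags[i] = left or right or "TITLE"
    return list(zip(tokens, tags))


def extract_fields(line):
    """Reduce the token labels to the two article fields."""
    fields = {"title": [], "authors": []}
    for token, tag in label_line(line):
        fields["title" if tag == "TITLE" else "authors"].append(token)
    return {name: " ".join(words) for name, words in fields.items()}


entry = extract_fields("Recognition of Table of Contents A. Belaïd")
```

Here the surname is absent from both dictionaries, so it is assigned to the authors field only because the preceding initial was already labeled. A real system would need much richer dictionaries and rules to cope with OCR noise and varied layouts.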
Document type:
Journal article
International Journal on Document Analysis and Recognition, Springer Verlag, 2001, 4 (1), pp.35-45

https://hal.inria.fr/inria-00100452
Contributor: Publications Loria
Submitted on: Tuesday, 26 September 2006 - 14:45:53
Last modified on: Tuesday, 24 April 2018 - 13:34:30

Identifiers

  • HAL Id : inria-00100452, version 1

Citation

Abdel Belaïd. Recognition of Table of Contents for Electronic Library Consulting. International Journal on Document Analysis and Recognition, Springer Verlag, 2001, 4 (1), pp.35-45. 〈inria-00100452〉
