Recognition-based Approach of Numeral Extraction in Handwritten Chemistry Documents using Contextual Knowledge

Nabil Ghanmi 1 Abdel Belaid 1
1 READ - Recognition of writing and analysis of documents
LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract: This paper presents a complete procedure that uses contextual and syntactic information to identify and recognize amount fields in the table regions of chemistry documents. The proposed method comprises two main modules. First, a structural analysis based on connected component (CC) dimensions and positions identifies some special symbols and clusters the other CCs into three groups: fragments of characters, isolated characters, and connected characters. Then, each group of CCs is processed separately. Fragments of characters are merged with the nearest character or string using rules based on geometric relationships. The isolated characters are sent to a recognition module to identify the numeral components. For connected characters, the final decision on the nature of the string (numeric or non-numeric) is based on a global score computed over the full string using the height regularity property and the recognition probabilities of its segmented fragments. Finally, a simple syntactic verification at the table row level is carried out to correct any remaining errors. Experiments are carried out on real-world chemistry documents provided by our industrial partner eNovalys. The results show the effectiveness of the proposed system in extracting amount fields.
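The grouping of CCs and the global score described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no thresholds or formula, so the size ratios, the definition of height regularity (one minus the coefficient of variation), the equal weighting, the decision threshold, and all function names below are assumptions chosen only for illustration.

```python
from statistics import mean, pstdev


def classify_component(width, height, ref_width, ref_height):
    """Label a connected component from its bounding-box size.

    ref_width and ref_height stand for a typical character size
    (for example the median over the table region). The 0.5 and 1.5
    ratios are assumed values, not taken from the paper.
    """
    if height < 0.5 * ref_height and width < 0.5 * ref_width:
        return "fragment"
    if width > 1.5 * ref_width:
        return "connected"
    return "isolated"


def height_regularity(heights):
    """Return 1.0 for identical heights, decreasing toward 0 as they vary."""
    avg = mean(heights)
    if avg == 0:
        return 0.0
    return max(0.0, 1.0 - pstdev(heights) / avg)


def numeric_score(heights, digit_probs, weight=0.5):
    """Blend height regularity with the mean digit-recognition probability.

    heights: heights of the segmented fragments of one connected string.
    digit_probs: recognizer confidence that each fragment is a digit.
    The linear blend and its weight are assumptions.
    """
    return weight * height_regularity(heights) + (1 - weight) * mean(digit_probs)


def is_numeric(heights, digit_probs, threshold=0.7):
    """Decide numeric vs. non-numeric with an assumed threshold."""
    return numeric_score(heights, digit_probs) >= threshold
```

For example, under these assumptions a string whose fragments all have height 20 and digit probabilities of 0.9 would be accepted as numeric, while a string with very uneven fragment heights and low digit probabilities would be rejected.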
Document type:
Conference paper
11th IAPR International workshop on Document Analysis Systems, Apr 2016, Santorini, Greece

Cited literature: 13 references

https://hal.inria.fr/hal-01321269
Contributor: Nabil Ghanmi
Submitted on: Wednesday, May 25, 2016 - 12:14:01
Last modified on: Thursday, January 11, 2018 - 06:25:25
Document(s) archived on: Friday, August 26, 2016 - 10:45:14

File

GHANMI_DAS-1.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01321269, version 1


Citation

Nabil Ghanmi, Abdel Belaid. Recognition-based Approach of Numeral Extraction in Handwritten Chemistry Documents using Contextual Knowledge. 11th IAPR International workshop on Document Analysis Systems, Apr 2016, Santorini, Greece. 〈hal-01321269〉
