Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012

Abstract: The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented:

* Text-graphic segmentation based on connected-component clustering;
* Line segment bridging with an adaptive, oriented filter;
* Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry;
* Text object recognition using a noisy channel model to enhance the results of a commercial OCR package.

Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score=1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification.

The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.
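The averaged F-score mentioned in the abstract could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the abstract does not say how the node and edge scores were combined, so an unweighted mean is assumed here, and the function names (`f_score`, `undirected`, `average_f_score`) and the representation of nodes as `(id, shape)` pairs are hypothetical.

```python
from typing import FrozenSet, Hashable, Iterable, Set, Tuple

Node = Tuple[str, str]   # (node id, shape label); an assumed representation
Edge = Tuple[str, str]   # (source id, target id)


def f_score(predicted: Set[Hashable], truth: Set[Hashable]) -> float:
    """F1 score between two sets; taken to be 1.0 when both are empty."""
    if not predicted and not truth:
        return 1.0
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def undirected(edges: Iterable[Edge]) -> Set[FrozenSet[str]]:
    """Drop edge direction so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def average_f_score(
    pred_nodes: Iterable[Node],
    true_nodes: Iterable[Node],
    pred_edges: Iterable[Edge],
    true_edges: Iterable[Edge],
) -> float:
    """Assumed combination: unweighted mean of node and edge F-scores."""
    node_f = f_score(set(pred_nodes), set(true_nodes))
    edge_f = f_score(undirected(pred_edges), undirected(true_edges))
    return (node_f + edge_f) / 2
```

Under these assumptions, a drawing scores 1.00 only when every node is assigned its ground-truth shape and every ground-truth connection is found, whichever way the detected arrows point.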
Document type: Conference paper
CLEF 2012, Sep 2012, Roma, Italy. 2012

Contributor: Laurent Romary
Submitted on: Thursday, 6 September 2012 - 15:55:11
Last modified on: Friday, 3 November 2017 - 08:24:01
Document(s) archived on: Friday, 16 December 2016 - 11:11:04




  • HAL Id : hal-00728779, version 1



Andrew Thean, Jean-Marc Deltorn, Patrice Lopez, Laurent Romary. Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012. CLEF 2012, Sep 2012, Roma, Italy. 2012. 〈hal-00728779〉


