Text extraction from graphical document images using sparse representation

Thai V. Hoang 1 Salvatore Tabbone 1
1 QGAR - Querying Graphics through Analysis and Recognition
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: This paper presents a novel method for extracting text from graphical document images. A graphical document image containing text and graphics components is treated as a two-dimensional signal in which text and graphics have different morphological characteristics. The proposed algorithm relies on a sparse representation framework with two appropriately chosen discriminative overcomplete dictionaries, each giving a sparse representation of one type of signal and a non-sparse representation of the other. Text and graphics components are separated by promoting a sparse representation of the input image in these two dictionaries. In post-processing, heuristic rules group text components into text strings. The proposed method overcomes the problem of text touching graphics. Preliminary experiments show promising results on different types of documents.
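The separation principle described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the two dictionaries used here (an orthonormal DCT basis for the smooth component and the identity basis for isolated spikes), the hard-thresholding schedule, and the names `dct_matrix`, `hard_threshold`, and `separate` are all assumptions chosen for illustration. The paper works on two-dimensional images with overcomplete dictionaries that this record does not specify.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k is the k-th cosine atom."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d

def hard_threshold(c, thr):
    """Keep coefficients whose magnitude exceeds thr, zero the rest."""
    return np.where(np.abs(c) > thr, c, 0.0)

def separate(x, n_iter=50):
    """Split x into a part sparse in the DCT and a part sparse in time.

    Illustrative alternating scheme: each component is re-estimated from
    the residual left by the other, keeping only large coefficients in
    its own dictionary, while the threshold decreases linearly.
    """
    D = dct_matrix(x.size)
    smooth = np.zeros_like(x)
    spikes = np.zeros_like(x)
    thr_max = max(np.abs(D @ x).max(), np.abs(x).max())
    for thr in np.linspace(thr_max, 0.0, n_iter, endpoint=False):
        # Component that is sparse in the DCT dictionary.
        smooth = D.T @ hard_threshold(D @ (x - spikes), thr)
        # Component that is sparse in the identity dictionary.
        spikes = hard_threshold(x - smooth, thr)
    return smooth, spikes

# Toy signal (assumed for the demo): one cosine atom plus two spikes.
n = 64
true_smooth = 20.0 * dct_matrix(n)[3]
true_spikes = np.zeros(n)
true_spikes[[10, 40]] = 5.0

smooth, spikes = separate(true_smooth + true_spikes)
print("max error, smooth part:", np.abs(smooth - true_smooth).max())
print("max error, spike part: ", np.abs(spikes - true_spikes).max())
```

The sketch works because each component is sparse in one dictionary and spread out in the other: a spike leaks only small coefficients into every cosine atom, and a cosine atom contributes only small values at each sample, so thresholding assigns each large coefficient to the dictionary that represents it compactly.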
Document type:
Conference paper
International Workshop on Document Analysis Systems - DAS'2010, Jun 2010, Boston, United States. ACM, pp.143-150, 2010, ACM International Conference Proceeding Series. 〈10.1145/1815330.1815349〉
Cited literature [30 references]

https://hal.inria.fr/inria-00494513
Contributor: Thai V. Hoang
Submitted on: Wednesday, June 23, 2010 - 14:23:01
Last modified on: Thursday, January 11, 2018 - 06:23:16
Document(s) archived on: Friday, September 24, 2010 - 17:27:15

File

DAS2010.pdf
Files produced by the author(s)

Citation

Thai V. Hoang, Salvatore Tabbone. Text extraction from graphical document images using sparse representation. International Workshop on Document Analysis Systems - DAS'2010, Jun 2010, Boston, United States. ACM, pp.143-150, 2010, ACM International Conference Proceeding Series. 〈10.1145/1815330.1815349〉. 〈inria-00494513〉

Metrics

Record views: 145
File downloads: 800