A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research

Jin Chen; Daniel Lopresti; Bart Lamiroy

doi:10.1145/2034617.2034620

Conference paper, Year: 2011

Jin Chen

Role: Author

Computer Science & Engineering Department

Daniel Lopresti

Role: Author
PersonId : 835805

Computer Science & Engineering Department

Bart Lamiroy

Role: Author
PersonId : 1298
IdHAL : bart-lamiroy
ORCID : 0000-0003-0871-0149
IdRef : 111726980

Querying Graphics through Analysis and Recognition

Abstract

Traditionally, document image analysis (DIA) is conducted on datasets that are prepared for research purposes. Many existing handwriting datasets, however, do not necessarily represent the range of problems we wish to solve in real life. In this work, we introduce a noisy and unstructured handwriting dataset that aims to promote and evaluate robust document analysis algorithms for real-world challenges, with an emphasis on the process of building and curating the dataset. First, we explain the data acquisition process and characterize the dataset's critical features as noisy and unstructured. Then, we discuss a set of real-world scenarios that might benefit from using our notebook dataset. This is an ongoing activity: so far we have collected 18 handwritten notebooks from nine college students, for a total of 499 pages. We expect to collect over 100 notebooks, or about 3,000 pages, from at least 50 students. This dataset is available to the research community via the Lehigh Document Analysis and Exploitation (DAE) platform.

Domains

Computer Vision and Pattern Recognition [cs.CV]

https://inria.hal.science/inria-00627844

Submitted on: Thursday, September 29, 2011, 16:36:58

Last modified on: Friday, March 24, 2023, 14:52:54

Dates et versions

inria-00627844 , version 1 (29-09-2011)

Identifiers

HAL Id : inria-00627844 , version 1
DOI : 10.1145/2034617.2034620

Cite

Jin Chen, Daniel Lopresti, Bart Lamiroy. A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research. Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data - (J-MOCR-AND 2011), IAPR, Sep 2011, Beijing, China. ⟨10.1145/2034617.2034620⟩. ⟨inria-00627844⟩

Collections

CNRS INRIA UNIV-LORRAINE LORIA
