Conference papers

Semi-Synthetic Data Augmentation of Scanned Historical Documents

Romain Karpinski 1 Abdel Belaïd 2
2 READ - Recognition of writing and analysis of documents
LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : This paper proposes a new, fully automatic method for generating semi-synthetic images of historical documents in order to increase the number of training samples in small datasets. The method extracts background-only images (BOIs) and text-only images (TOIs) from two different sources and mixes them to create semi-synthetic images. The TOIs are extracted using a binary mask obtained by binarizing the image. The BOIs are reconstructed from the original image by replacing the text pixels using an inpainting method. Finally, a TOI can be efficiently integrated into a BOI in the gradient domain, creating a new semi-synthetic image. The idea behind this technique is to automatically obtain documents that are close to real ones but have different backgrounds, so as to highlight the content. Experiments are conducted on the public HisDB dataset, which contains few labeled images. We show that the proposed method improves performance on a semantic segmentation task and a baseline extraction task.
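The three steps named in the abstract (binarization, inpainting, gradient-domain integration) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say which binarization, inpainting, or blending algorithms are used, so this sketch assumes Otsu thresholding, a simple diffusion-based inpainting, and a Gauss-Seidel solution of the discrete Poisson equation. All function names are hypothetical, images are assumed to be same-sized grayscale grids (lists of lists of integers in 0..255), and text is assumed to be darker than the background.

```python
def otsu_threshold(img):
    """Gray level that maximizes the between-class variance (Otsu)."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def text_mask(img):
    """Binary mask: True where a pixel is assumed to be (dark) text."""
    t = otsu_threshold(img)
    return [[v <= t for v in row] for row in img]


def neighbours(y, x, h, w):
    """4-connected neighbours that lie inside the image."""
    cand = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
    return [(j, i) for j, i in cand if 0 <= j < h and 0 <= i < w]


def inpaint(img, mask, iters=200):
    """Build a BOI: fill masked pixels by diffusing surrounding values.

    Assumption: a basic harmonic fill stands in for the inpainting method.
    """
    h, w = len(img), len(img[0])
    known = [v for r, m in zip(img, mask) for v, f in zip(r, m) if not f]
    mean = sum(known) / len(known)
    out = [[mean if mask[y][x] else float(img[y][x]) for x in range(w)]
           for y in range(h)]
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    nb = neighbours(y, x, h, w)
                    out[y][x] = sum(out[j][i] for j, i in nb) / len(nb)
    return out


def blend(toi_src, mask, boi, iters=500):
    """Gradient-domain integration: inside the mask, match the gradients
    of the text source; outside the mask, keep the background values."""
    h, w = len(boi), len(boi[0])
    out = [[float(v) for v in row] for row in boi]
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    nb = neighbours(y, x, h, w)
                    acc = sum(out[j][i] + toi_src[y][x] - toi_src[j][i]
                              for j, i in nb)
                    out[y][x] = acc / len(nb)
    # Clip to the valid gray range, since preserved contrast can overshoot.
    return [[min(255, max(0, round(v))) for v in row] for row in out]


def semi_synthetic(text_src, background_src):
    """Mix the text of one image with the background of another."""
    boi = inpaint(background_src, text_mask(background_src))
    return blend(text_src, text_mask(text_src), boi)


if __name__ == "__main__":
    text_src = [[200] * 5 for _ in range(5)]
    text_src[2][2] = text_src[2][3] = 50      # a short dark stroke
    background_src = [[180] * 5 for _ in range(5)]
    background_src[1][1] = 40                 # text to be removed
    for row in semi_synthetic(text_src, background_src):
        print(row)
```

Working in the gradient domain means that the contrast between the stroke and its surroundings is carried over from the text source, while the absolute gray levels adapt to the new background. The pure-Python loops are only practical for tiny grids; on real scans one would more plausibly rely on an image library (for example, OpenCV provides Otsu thresholding, inpainting, and seamless cloning), which is again an assumption rather than something stated in the abstract.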
Document type :
Conference papers
Cited literature [18 references]

https://hal.inria.fr/hal-02460891
Contributor : Abdel Belaid
Submitted on : Thursday, January 30, 2020 - 12:34:50 PM
Last modification on : Friday, January 15, 2021 - 5:42:02 PM

File

Romain_Icdar_Augmentation.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02460891, version 1

Citation

Romain Karpinski, Abdel Belaïd. Semi-Synthetic Data Augmentation of Scanned Historical Documents. ICDAR, Sep 2019, Sydney, Australia. ⟨hal-02460891⟩
