
Semi-Synthetic Data Augmentation of Scanned Historical Documents

Romain Karpinski 1 Abdel Belaïd 2 
2 READ - Recognition of writing and analysis of documents
LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : This paper proposes a new, fully automatic method for generating semi-synthetic images of historical documents in order to increase the number of training samples in small datasets. The method extracts background-only images (BOIs) and text-only images (TOIs) from two different sources and mixes them to create semi-synthetic images. The TOIs are extracted using a binary mask obtained by binarizing the image. The BOIs are reconstructed from the original image by replacing the TOI pixels using an inpainting method. Finally, a TOI can be efficiently integrated into a BOI in the gradient domain, thus creating a new semi-synthetic image. The idea behind this technique is to automatically obtain documents that are close to real ones but have different backgrounds, so as to highlight the content. Experiments are conducted on the public HisDB dataset, which contains few labeled images. We show that the proposed method improves results on semantic segmentation and baseline extraction.
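
The three steps named in the abstract (binarization to get a text mask, inpainting to get a BOI, gradient-domain integration of a TOI) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which binarization or inpainting algorithm is used, so Otsu thresholding and a simple diffusion fill are assumed here as stand-ins, and the gradient-domain step is assumed to be a discrete Poisson solve. All function names (`otsu_threshold`, `text_mask`, `solve_in_mask`, `make_semi_synthetic`) are hypothetical, and images are plain lists of rows of 8-bit grey levels with identical sizes.

```python
# Illustrative sketch only: pure-Python stand-ins, not the paper's code.

def otsu_threshold(img):
    """Grey level (0-255) maximising between-class variance (assumed binarizer)."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def text_mask(img):
    """Binary mask: True where a pixel is at or below the threshold (assumed ink)."""
    t = otsu_threshold(img)
    return [[v <= t for v in row] for row in img]


def neighbours(y, x, h, w):
    """4-connected neighbours that lie inside the image."""
    cand = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
    return [(j, i) for j, i in cand if 0 <= j < h and 0 <= i < w]


def solve_in_mask(base, mask, guide=None, iters=500):
    """Gauss-Seidel solve of the discrete Poisson equation inside `mask`.

    Pixels outside the mask keep the values of `base` (boundary condition).
    With guide=None the target gradient is zero, which gives a diffusion fill
    (the assumed inpainting stand-in). Otherwise the gradients of `guide`
    are imposed, which is the gradient-domain integration step.
    """
    h, w = len(base), len(base[0])
    out = [[float(v) for v in row] for row in base]
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                if not mask[y][x]:
                    continue
                nb = neighbours(y, x, h, w)
                acc = 0.0
                for j, i in nb:
                    acc += out[j][i]
                    if guide is not None:
                        acc += guide[y][x] - guide[j][i]
                out[y][x] = acc / len(nb)
    return out


def make_semi_synthetic(text_src, background_src):
    """Put the text of `text_src` onto the inpainted background of `background_src`."""
    boi = solve_in_mask(background_src, text_mask(background_src))  # background-only image
    toi = text_mask(text_src)                                       # where the text lives
    blended = solve_in_mask(boi, toi, guide=text_src)
    # Gradient-domain results can leave the 8-bit range, so clamp and round.
    return [[min(255, max(0, round(v))) for v in row] for row in blended]
```

A call such as `make_semi_synthetic(page_a, page_b)` returns an image with the text of `page_a` on the background of `page_b`. The pure-Python loops are only practical for tiny arrays; for real scans, library routines such as OpenCV's `cv2.threshold` (with `cv2.THRESH_OTSU`), `cv2.inpaint`, and `cv2.seamlessClone` cover the same three roles, although nothing in this record states that the authors used them.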
Document type :
Conference papers
Cited literature : 18 references
Contributor : Abdel Belaid
Submitted on : Thursday, January 30, 2020 - 12:34:50 PM
Last modification on : Wednesday, November 3, 2021 - 7:09:40 AM

  • HAL Id : hal-02460891, version 1



Romain Karpinski, Abdel Belaïd. Semi-Synthetic Data Augmentation of Scanned Historical Documents. ICDAR, Sep 2019, Sydney, Australia. ⟨hal-02460891⟩