Conference paper, Year: 2019

Semi-Synthetic Data Augmentation of Scanned Historical Documents

Romain Karpinski
  • Role: Author
  • PersonId: 987957

Abstract

This paper proposes a new, fully automatic method for generating semi-synthetic images of historical documents in order to increase the number of training samples in small datasets. The method extracts background-only images (BOIs) and text-only images (TOIs) taken from two different sources and mixes them to create semi-synthetic images. The TOIs are extracted using a binary mask obtained by binarizing the image. The BOIs are reconstructed from the original image by replacing the TOI pixels using an inpainting method. Finally, a TOI can be efficiently integrated into a BOI in the gradient domain, creating a new semi-synthetic image. The idea behind this technique is to automatically obtain documents that are close to real ones but have different backgrounds, so as to highlight the content. Experiments are conducted on the public HisDB dataset, which contains few labeled images. We show that the proposed method improves performance on semantic segmentation and baseline extraction tasks.
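The three steps named in the abstract (binarization, inpainting, gradient-domain integration) can be illustrated with a minimal pure-Python sketch on small grayscale grids. This is not the authors' implementation: the function names are hypothetical, the fixed global threshold stands in for whatever binarization the paper uses, and a simple Gauss-Seidel Poisson solver stands in for both the inpainting method and the gradient-domain blending.

```python
def binarize(img, threshold=128):
    """Return a mask that is True where a pixel is darker than the
    threshold (assumed here to be ink). The threshold is an assumption."""
    return [[px < threshold for px in row] for row in img]


def poisson_fill(target, mask, source=None, iters=200):
    """Solve a discrete Poisson equation inside `mask` by Gauss-Seidel sweeps.

    Pixels outside the mask keep the values of `target` and act as the
    boundary condition. With source=None the guidance field is zero, which
    gives a smooth diffusion-style inpainting. With a source image, the
    source gradients are imposed inside the mask (gradient-domain blending).
    The mask must leave at least some pixels unmasked.
    """
    h, w = len(target), len(target[0])
    out = [[float(v) for v in row] for row in target]
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                if not mask[y][x]:
                    continue
                total, count = 0.0, 0
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += out[ny][nx]
                        if source is not None:
                            # Guidance term: gradient of the source image.
                            total += source[y][x] - source[ny][nx]
                        count += 1
                out[y][x] = total / count
    return out


def semi_synthetic(text_doc, background_doc, threshold=128):
    """Mix the text of one document with the background of another."""
    text_mask = binarize(text_doc, threshold)
    bg_mask = binarize(background_doc, threshold)
    # Background-only image: erase the ink of the background document.
    boi = poisson_fill(background_doc, bg_mask)
    # Integrate the text of the other document in the gradient domain.
    blended = poisson_fill(boi, text_mask, source=text_doc)
    # Clamp to the valid 8-bit gray range.
    return [[min(255.0, max(0.0, v)) for v in row] for row in blended]
```

Both images are assumed to have the same size. Because the blending step copies gradients rather than raw intensities, the pasted ink keeps its contrast relative to the new background instead of carrying over the gray level of the page it came from.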
Main file
Romain_Icdar_Augmentation.pdf (13.47 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02460891, version 1 (30-01-2020)

Identifiers

  • HAL Id: hal-02460891, version 1

Cite

Romain Karpinski, Abdel Belaïd. Semi-Synthetic Data Augmentation of Scanned Historical Documents. ICDAR, Sep 2019, Sydney, Australia. ⟨hal-02460891⟩
