Conference paper. Year: 2020

Use of language models for document stream segmentation

Chems Eddine Neche
  • Role: Author

Abstract

Segmenting a page stream into single documents is a very common task, carried out in companies and administrations when processing their incoming mail. It is not straightforward, because document boundaries are not always obvious and it is not always easy to find common features between the pages of the same document. In this paper, we compare existing segmentation models and propose a new one, named AGRU, based on GRUs (Gated Recurrent Units) and an attention mechanism. This model uses the text content of the previous page and the current page to determine whether both pages belong to the same document. Thanks to its attention mechanism, the model is able to recognize words that characterize the first page of a document. Training and evaluation are carried out on two datasets: Tobacco-800 and READ-Corpus. The former is a public dataset, on which our model reaches an F1 score of 90%; the latter is private, and our model reaches an F1 score of 96% on it.
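The pairwise formulation described in the abstract (deciding, for each page, whether it continues the document of the previous page) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `segment_stream`, `f1_score`, and `toy_same_document` are hypothetical helpers, the keyword rule is only a stand-in for a trained classifier and is not the AGRU model, and the abstract does not state how its F1 score is computed, so the per-page "starts a new document" labelling used below is one plausible choice.

```python
from typing import Callable, List, Sequence


def segment_stream(
    pages: Sequence[str],
    same_document: Callable[[str, str], bool],
) -> List[List[str]]:
    """Group a page stream into documents using pairwise decisions."""
    documents: List[List[str]] = []
    for index, page in enumerate(pages):
        # The first page always opens a document. A later page opens a new
        # one only when the classifier says it does not continue the
        # previous page.
        if index == 0 or not same_document(pages[index - 1], page):
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def f1_score(expected: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 over per-page 'starts a new document' labels (assumed protocol)."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(1 for e, p in pairs if e and p)
    false_pos = sum(1 for e, p in pairs if not e and p)
    false_neg = sum(1 for e, p in pairs if e and not p)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def toy_same_document(previous_page: str, current_page: str) -> bool:
    """Hypothetical stand-in classifier: a keyword rule, NOT the AGRU model."""
    first_page_cues = ("dear", "invoice", "subject:")
    return not current_page.lower().startswith(first_page_cues)


stream = [
    "Dear Sir, please find attached",
    "the figures you requested.",
    "Invoice no. 17",
    "Subject: meeting notes",
    "continued discussion of item 2",
]
documents = segment_stream(stream, toy_same_document)
document_lengths = [len(doc) for doc in documents]
```

In a real system, `same_document` would wrap a trained model that reads the text of both pages; the grouping loop stays the same whatever classifier is plugged in.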
Main file
version_soumise.pdf (1.59 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02975046, version 1 (22-10-2020)

Identifiers

  • HAL Id: hal-02975046, version 1

Cite

Chems Eddine Neche, Yolande Belaïd, Abdel Belaïd. Use of language models for document stream segmentation. International Conference on Pattern Recognition Applications and Methods, Feb 2020, Valletta, Malta. ⟨hal-02975046⟩
174 views
472 downloads
