Conference paper, Year: 2023

Automatically Inferring the Document Class of a Scientific Article

Abstract

We consider the problem of automatically inferring the (LaTeX) document class used to write a scientific article from its PDF representation. Applications include improving the performance of information extraction techniques that rely on the style used in each document class, and determining the publisher of a given scientific article. We introduce two approaches: a simple classifier based on hand-coded document style features, and a CNN-based classifier that takes as input the bitmap representation of the first page of the PDF article. We experiment on a dataset of around 100k articles from arXiv, where labels come from the LaTeX source associated with each article. Results show that the CNN approach significantly outperforms the one based on simple document style features, reaching over 90% average F1-score on the task of distinguishing among several dozen of the most common document classes.
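The abstract states that labels come from the LaTeX source associated with each article. As an illustration only, the following minimal sketch shows one way such a label could be read from a source file, assuming the class is declared with a standard `\documentclass` command. The function names `strip_comments` and `infer_label` are hypothetical and are not taken from the paper, whose actual labeling procedure may differ.

```python
import re
from typing import Optional

# Matches \documentclass[options]{name}; the [options] part is optional.
_DOCCLASS_RE = re.compile(r"\\documentclass\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}")


def strip_comments(tex: str) -> str:
    """Drop LaTeX comments: an unescaped % up to the end of its line.

    Simplification (assumption): a literal backslash followed by a
    comment sign (\\\\%) is not handled specially.
    """
    return re.sub(r"(?<!\\)%.*", "", tex)


def infer_label(tex: str) -> Optional[str]:
    """Return the first declared document class name, or None if absent."""
    match = _DOCCLASS_RE.search(strip_comments(tex))
    return match.group(1).strip() if match else None
```

For example, calling `infer_label` on a source that begins with `\documentclass[sigconf]{acmart}` would yield the label `acmart`, while a commented-out declaration is ignored.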
Main file: main.pdf (1.64 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04138880, version 1 (23-06-2023)

License

Attribution

Cite

Antoine Gauquier, Pierre Senellart. Automatically Inferring the Document Class of a Scientific Article. DocEng 2023 - 23rd ACM Symposium on Document Engineering, Aug 2023, Limerick, Ireland. ⟨10.1145/3573128.3604894⟩. ⟨hal-04138880⟩