Conference paper, 2023

Automatically Inferring the Document Class of a Scientific Article

Abstract

We consider the problem of automatically inferring the (LaTeX) document class used to write a scientific article from its PDF representation. Applications include improving the performance of information extraction techniques that rely on the style used in each document class, and determining the publisher of a given scientific article. We introduce two approaches: a simple classifier based on hand-coded document style features, and a CNN-based classifier that takes as input the bitmap representation of the first page of the PDF article. We experiment on a dataset of around 100k articles from arXiv, where labels come from the LaTeX source document associated with each article. Results show that the CNN approach significantly outperforms the one based on simple document style features, reaching over 90% average F1-score on the task of distinguishing among several dozen of the most common document classes.
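The abstract states that labels come from the LaTeX source of each article, but this record does not say how they are extracted. As a minimal sketch, assuming the label is simply the argument of the first uncommented `\documentclass` command, the extraction could look like the following. The function names `strip_comments` and `extract_document_class` are hypothetical and are not taken from the paper.

```python
import re
from typing import Optional

# Matches \documentclass[options]{name}; the optional [options] part is skipped.
DOCUMENTCLASS_RE = re.compile(r"\\documentclass\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}")


def strip_comments(tex: str) -> str:
    """Drop LaTeX comments: everything from an unescaped % to the end of the line.

    Simplification: a % preceded by an escaped backslash (\\\\%) is not handled.
    """
    return "\n".join(re.sub(r"(?<!\\)%.*", "", line) for line in tex.splitlines())


def extract_document_class(tex: str) -> Optional[str]:
    """Return the class name of the first uncommented \\documentclass, or None."""
    match = DOCUMENTCLASS_RE.search(strip_comments(tex))
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    source = "\\documentclass[sigconf]{acmart}\n\\begin{document}\nHello\n\\end{document}\n"
    print(extract_document_class(source))
```

Comments are stripped first so that a commented-out `\documentclass` line does not produce a wrong label. A real pipeline would also have to decide which file of a multi-file arXiv submission is the main one, which this sketch does not attempt.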
Main file: main.pdf (1.64 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04138880, version 1 (23-06-2023)

Licence

Attribution

Cite

Antoine Gauquier, Pierre Senellart. Automatically Inferring the Document Class of a Scientific Article. DocEng 2023 - 23rd ACM Symposium on Document Engineering, Aug 2023, Limerick, Ireland. ⟨10.1145/3573128.3604894⟩. ⟨hal-04138880⟩