Automatically Inferring the Document Class of a Scientific Article

Antoine Gauquier; Pierre Senellart

doi:10.1145/3573128.3604894

Conference paper, Year: 2023



Antoine Gauquier

Role: Author
PersonId: 1288282
IdHAL: antoine-gauquier
ORCID: 0009-0005-9573-6364

Value from Data

Ecole nationale supérieure Mines-Télécom Lille Douai

Télécom Paris

Pierre Senellart

Role: Author
PersonId: 11778
IdHAL: pierre-senellart
ORCID: 0000-0002-7909-5369
IdRef: 124713769

Value from Data

Institut universitaire de France

Abstract

We consider the problem of automatically inferring the (LaTeX) document class used to write a scientific article from its PDF representation. Applications include improving the performance of information extraction techniques that rely on the style used in each document class, or determining the publisher of a given scientific article. We introduce two approaches: a simple classifier based on hand-coded document style features, and a CNN-based classifier that takes as input the bitmap representation of the first page of the PDF article. We experiment on a dataset of around 100k articles from arXiv, where labels come from the source LaTeX document associated with each article. Results show that the CNN approach significantly outperforms the one based on simple document style features, reaching over 90% average F1-score on the task of distinguishing among several dozen of the most common document classes.
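The abstract states that labels come from the LaTeX source associated with each article. A minimal sketch of how such a label might be read from a source file is given below; the helper names (`strip_comments`, `infer_label`) and the regular expression are illustrative assumptions, not the authors' pipeline. The sketch ignores the legacy `\documentstyle` command and treats any unescaped `%` as the start of a comment, which is a simplification.

```python
import re

# Hypothetical pattern: \documentclass, optional [options], then {class name}.
_DOCCLASS_RE = re.compile(r"\\documentclass\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}")


def strip_comments(tex: str) -> str:
    """Remove LaTeX comments: an unescaped % up to the end of its line."""
    return re.sub(r"(?<!\\)%.*", "", tex)


def infer_label(tex_source: str):
    """Return the document class named in a LaTeX source, or None if absent."""
    match = _DOCCLASS_RE.search(strip_comments(tex_source))
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    sample = (
        "% \\documentclass{article}\n"      # commented out, must be ignored
        "\\documentclass[sigconf]{acmart}\n"
        "\\begin{document}\n"
    )
    print(infer_label(sample))
```

Stripping comments first matters because authors often leave an earlier `\documentclass` line commented out at the top of the file; without that step the first regular-expression match could return the wrong class.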

Keywords

Scholarly articles; Information extraction; Image classification; Document class; PDF; Information systems

Domains

Information retrieval [cs.IR]

Main file

main.pdf (1.64 MB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-04138880

Submitted on: Friday, June 23, 2023, 11:52:33

Last modified on: Friday, April 19, 2024, 16:18:58

Long-term archiving on: Sunday, September 24, 2023, 19:34:30

Dates and versions

hal-04138880, version 1 (23-06-2023)

License

Attribution

Identifiers

HAL Id: hal-04138880, version 1
DOI: 10.1145/3573128.3604894

Cite

Antoine Gauquier, Pierre Senellart. Automatically Inferring the Document Class of a Scientific Article. DocEng 2023 - 23rd ACM Symposium on Document Engineering, Aug 2023, Limerick, Ireland. ⟨10.1145/3573128.3604894⟩. ⟨hal-04138880⟩


Collections

INSTITUT-TELECOM ENS-PARIS CNRS INRIA PARISTECH INRIA2 PSL IP_PARIS ANR PRAIRIE-IA IMT-NORD-EUROPE

