Data-Efficient French Language Modeling with CamemBERTa

Wissam Antoun; Benoît Sagot; Djamé Seddah

doi:10.18653/v1/2023.findings-acl.320

Conference paper. Year: 2023

Wissam Antoun

Role: Author

Automatic Language Modelling and ANAlysis & Computational Humanities

Benoît Sagot

Role: Author

Automatic Language Modelling and ANAlysis & Computational Humanities

Djamé Seddah

Role: Author
PersonId: 11545
IdHAL: djameseddah
IdRef: 086185136

Automatic Language Modelling and ANAlysis & Computational Humanities

Abstract

Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and computational power, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective. We evaluate our model's performance on a variety of French downstream tasks and datasets, including question answering, part-of-speech tagging, dependency parsing, named entity recognition, and the FLUE benchmark, and compare against CamemBERT, the state-of-the-art monolingual model for French. Our results show that, given the same amount of training tokens, our model outperforms BERT-based models trained with MLM on most tasks. Furthermore, our new model reaches similar or superior performance on downstream tasks compared to CamemBERT, despite being trained on only 30% of its total number of input tokens. In addition to our experimental results, we also publicly release the weights and code implementation of CamemBERTa, making it the first publicly available DeBERTaV3 model outside of the original paper and the first openly available implementation of a DeBERTaV3 training objective.

https://gitlab.inria.fr/almanach/CamemBERTa

Keywords

NLP, BERT, DeBERTa, CamemBERT, French NLP, Efficient Pre-Training

Domains

Document and Text Processing

Main file

French_DeBERTa___ACL_2023___Arxiv (1).pdf (215.91 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03963729

Submitted on: Wednesday, 27 March 2024, 12:06:05

Last modified on: Friday, 29 March 2024, 03:22:27

Dates and versions

hal-03963729, version 1 (30-01-2023)

hal-03963729, version 2 (27-03-2024)

License

Attribution

Identifiers

HAL Id: hal-03963729, version 2
arXiv: 2306.01497
DOI: 10.18653/v1/2023.findings-acl.320

Cite

Wissam Antoun, Benoît Sagot, Djamé Seddah. Data-Efficient French Language Modeling with CamemBERTa. 61st Annual Meeting of the Association for Computational Linguistics (ACL’23), Jul 2023, Toronto, Canada. pp.5174-5185, ⟨10.18653/v1/2023.findings-acl.320⟩. ⟨hal-03963729v2⟩

Collections

INRIA INRIA2 ANR PRAIRIE-IA

285 views

94 downloads
