Conference paper, Year: 2022

Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

Abstract

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual datasets. Confirming these results on a much larger dataset of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.
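
The abstract describes fine-tuning a character-based language model on a small treebank for token-level tasks such as part-of-speech tagging. The snippet below is a minimal illustrative sketch of that general recipe, not the authors' code: it uses the publicly available character-level model google/canine-s from Hugging Face transformers as a stand-in for a model pre-trained on NArabizi, and a hypothetical 17-tag POS label set.

```python
# Minimal sketch (not the authors' code): a character-level encoder set up
# for POS tagging, with google/canine-s standing in for a character-based
# model pre-trained on NArabizi data.
import torch
from transformers import CanineForTokenClassification, CanineTokenizer

NUM_POS_TAGS = 17  # hypothetical tag set size (e.g. Universal POS tags)

tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineForTokenClassification.from_pretrained(
    "google/canine-s", num_labels=NUM_POS_TAGS
)

# CANINE operates directly on characters, so the sequence dimension of the
# logits corresponds to Unicode code points (plus special tokens), not to
# word pieces from a fixed subword vocabulary.
sentence = "3lach makatjiwbnich"  # illustrative NArabizi-style text
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_chars + 2, NUM_POS_TAGS)

# During fine-tuning, character-level predictions would be aligned back to
# word-level POS labels (e.g. by keeping the prediction of each word's first
# character), mirroring the usual subword-to-word alignment trick.
predictions = logits.argmax(dim=-1)
print(predictions.shape)
```

A character-level input representation of this kind is one way to cope with the high spelling variability of user-generated NArabizi text, since no fixed subword vocabulary has to cover every spelling variant.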

Dates and versions

hal-03527328, version 1 (15-01-2022)

Identifiers

Cite

Arij Riabi, Benoît Sagot, Djamé Seddah. Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?. Seventh Workshop on Noisy User-generated Text (W-NUT 2021, colocated with EMNLP 2021), Jan 2022, Punta Cana, Dominican Republic. ⟨hal-03527328⟩