Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi

Benjamin Muller; Benoît Sagot; Djamé Seddah

Preprint, Working Paper. Year: 2021

Benjamin Muller

Role: Author

Automatic Language Modelling and ANAlysis & Computational Humanities

Benoît Sagot

Role: Author
PersonId: 1461
IdHAL: bsagot
ORCID: 0000-0002-0107-8526
IdRef: 177454229

Automatic Language Modelling and ANAlysis & Computational Humanities

Djamé Seddah

Role: Author
PersonId: 11545
IdHAL: djameseddah
IdRef: 086185136

Automatic Language Modelling and ANAlysis & Computational Humanities

Abstract

Building natural language processing systems for non-standardized and low-resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained language models provides new modeling tools to tackle this challenge. In this work, we study the ability of multilingual language models to process an unseen dialect. We take user-generated North African Arabic as our case study, a resource-poor dialectal variety of Arabic with frequent code-mixing with French and written in Arabizi, a non-standardized transliteration of Arabic to Latin script. Focusing on two tasks, part-of-speech tagging and dependency parsing, we show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: (i) across scripts, using Modern Standard Arabic as a source language, and (ii) from a distantly related language, unseen during pretraining, namely Maltese. Our results constitute the first successful transfer experiments on this dialect, thus paving the way for the development of an NLP ecosystem for resource-scarce, non-standardized and highly variable vernacular languages.

Domains

Document and Text Processing

Contributor: Djamé Seddah

https://inria.hal.science/hal-03161677

Submitted on: Sunday, March 7, 2021, 23:33:46

Last modified on: Thursday, February 1, 2024, 10:05:16

Dates and versions

hal-03161677, version 1 (07-03-2021)

Identifiers

HAL Id: hal-03161677, version 1
arXiv: 2005.00318

Cite

Benjamin Muller, Benoît Sagot, Djamé Seddah. Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi. 2021. ⟨hal-03161677⟩

Collections

UNIV-RENNES1 INRIA IRISA INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES ANR PRAIRIE-IA UR1-MATH-NUM
