Conference paper, Year: 2019

Enhancing BERT for Lexical Normalization

Abstract

Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC) in a resource-scarce scenario, we study the ability of BERT (Devlin et al., 2018) to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need of any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work to adapt and analyse the ability of this model to handle noisy UGC data.
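The abstract frames lexical normalisation as a per-token prediction task on top of BERT. As a rough illustration only (not the authors' released code), the sketch below shows how such a framing might look with the Hugging Face transformers library: each noisy input token is trained, through the masked-LM head, to predict (the first sub-word of) its normalised counterpart. The sentence pair, model checkpoint and all other details are illustrative assumptions.

    # Illustrative sketch, not the authors' implementation: lexical
    # normalisation as token-level prediction with a BERT masked-LM head.
    # Assumes the Hugging Face `transformers` library; example data is made up.
    import torch
    from transformers import BertTokenizerFast, BertForMaskedLM

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    model = BertForMaskedLM.from_pretrained("bert-base-cased")

    noisy = ["c", "u", "tmrw"]               # noisy UGC tokens
    gold = ["see", "you", "tomorrow"]        # normalised targets

    # Encode the noisy tokens; each word's first sub-word position is trained
    # to emit the first sub-word of the corresponding normalised token.
    enc = tokenizer(noisy, is_split_into_words=True, return_tensors="pt")
    labels = torch.full_like(enc["input_ids"], -100)  # -100 = ignored by the loss
    word_ids = enc.word_ids()
    for pos, w in enumerate(word_ids):
        if w is not None and word_ids[pos - 1] != w:  # first sub-word of each word
            labels[0, pos] = tokenizer.convert_tokens_to_ids(
                tokenizer.tokenize(gold[w])[0]
            )

    out = model(**enc, labels=labels)  # cross-entropy over the vocabulary
    out.loss.backward()                # a fine-tuning step would follow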
Main file
Enhancing_BERT_for_lexical_normalisation_WNUT2019_proceeding-5.pdf (207.86 Ko)
Origin: Files produced by the author(s)

Dates and versions

hal-02294316, version 1 (30-09-2019)

Identifiers

  • HAL Id: hal-02294316, version 1

Cite

Benjamin Muller, Benoît Sagot, Djamé Seddah. Enhancing BERT for Lexical Normalization. The 5th Workshop on Noisy User-generated Text (W-NUT), Nov 2019, Hong Kong, China. ⟨hal-02294316⟩
