
Enhancing BERT for Lexical Normalization

Abstract: Language-model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC) in a resource-scarce scenario, we study the ability of BERT (Devlin et al., 2018) to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need for any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work adapting and analysing the ability of this model to handle noisy UGC data.
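The token-prediction framing described in the abstract can be illustrated with a toy sketch: each noisy input token is mapped to a canonical output token. This is not the authors' architecture; a frequency-based lookup stands in for BERT's fine-tuned token-level predictor, and the training pairs are hypothetical.

```python
# Illustrative sketch of lexical normalisation as per-token prediction.
# A frequency-based lookup stands in for a fine-tuned BERT token
# predictor; all names and data below are hypothetical.
from collections import Counter, defaultdict

def train(pairs):
    """Learn, for each noisy token, its most frequent canonical form."""
    counts = defaultdict(Counter)
    for noisy, canonical in pairs:
        # Assumes a 1-to-1 token alignment between the two sentences.
        for src, tgt in zip(noisy.split(), canonical.split()):
            counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

def normalise(model, sentence):
    """Predict one canonical token per input token (identity fallback)."""
    return " ".join(model.get(tok, tok) for tok in sentence.split())

# Hypothetical UGC training pairs (noisy -> canonical).
pairs = [
    ("u r gr8", "you are great"),
    ("c u l8r", "see you later"),
]
model = train(pairs)
print(normalise(model, "u r l8r"))  # -> "you are later"
```

The identity fallback mirrors a property of the task: most tokens in UGC are already canonical, so a normaliser should copy unknown tokens unchanged.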
Document type: Conference papers

Cited literature: 28 references
Contributor: Benoît Sagot
Submitted on: Monday, September 30, 2019 - 6:33:53 PM
Last modification on: Wednesday, June 8, 2022 - 12:50:06 PM
Long-term archiving on: Monday, February 10, 2020 - 2:14:32 AM


Files produced by the author(s)


  • HAL Id: hal-02294316, version 1



Benjamin Muller, Benoît Sagot, Djamé Seddah. Enhancing BERT for Lexical Normalization. The 5th Workshop on Noisy User-generated Text (W-NUT), Nov 2019, Hong Kong, China. ⟨hal-02294316⟩


