Imputing out-of-vocabulary embeddings with LOVE makes language models robust with little cost

State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary (OOV) words. To address this issue, we follow the principle of mimick-like models, which generate vectors for unseen words by learning the behavior of pre-trained embeddings from the surface form of words alone. We present a simple contrastive learning framework, LOVE, which extends the word representation of an existing pre-trained language model (such as BERT) and makes it robust to OOV words with few additional parameters. Extensive evaluations demonstrate that our lightweight model matches or outperforms prior competitors, both on original datasets and on corrupted variants. Moreover, it can be used in a plug-and-play fashion with FastText and BERT, where it significantly improves their robustness.
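
As a rough illustration of the contrastive idea described in the abstract, the sketch below computes an InfoNCE-style loss over pairs of a word and a corrupted spelling of it. This is not the LOVE implementation: fixed character-trigram counts stand in for a trainable surface-form encoder, pairing each word with a typo variant is an assumption made for the example, and the names `trigram_counts`, `cosine`, `info_nce` and the temperature value are hypothetical.

```python
import math
from collections import Counter


def trigram_counts(word):
    """Surface-form features: character trigram counts of the padded word."""
    padded = f"<{word}>"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] matches anchors[i]; the other
    entries of `positives` act as in-batch negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


# Assumed toy data: each word is paired with a misspelled variant.
words = ["language", "robust", "embedding"]
typos = ["langauge", "robsut", "embeding"]

anchors = [trigram_counts(w) for w in words]
aligned = [trigram_counts(t) for t in typos]
shuffled = aligned[1:] + aligned[:1]  # deliberately wrong pairing

print("loss with correct pairs: ", info_nce(anchors, aligned))
print("loss with shuffled pairs:", info_nce(anchors, shuffled))
```

The loss is lower when each word is paired with its own variant than when the pairs are shuffled. A trainable encoder optimized on such a loss would be pushed to place a word and its corrupted forms close together, which is the intuition behind robustness to unseen surface forms.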

Keywords

Language models, Word embeddings, Out-of-vocabulary, OOV words

Domains

Artificial Intelligence [cs.AI]

Main file

Imputing OOV Embeddings with LOVE.pdf (787.54 KB)

Origin: Files produced by the author(s)

Contributor: Lihu Chen

https://hal.science/hal-03613101

Submitted on: Saturday, March 19, 2022, 10:52:09 AM

Last modification on: Wednesday, August 30, 2023, 4:27:19 AM

Dates and versions

hal-03613101, version 2 (19-03-2022)

Identifiers

HAL Id: hal-03613101, version 2

Cite

Lihu Chen, Gaël Varoquaux, Fabian Suchanek. Imputing out-of-vocabulary embeddings with LOVE makes language models robust with little cost. ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, May 2022, Dublin, Ireland. ⟨hal-03613101⟩

Collections

INSTITUT-TELECOM, INRIA, INRIA2, UNIV-PARIS-SACLAY, LTCI, INFRES, DIG, IP_PARIS, ANR, GS-ENGINEERING, GS-COMPUTER-SCIENCE
