Conference paper, 2023

On the Robustness of Text Vectorizers

Abstract

A fundamental issue in machine learning is the robustness of the model with respect to changes in the input. In natural language processing, models typically contain a first embedding layer, transforming a sequence of tokens into vector representations. While robustness with respect to changes in continuous inputs is well understood, the situation is less clear when considering discrete changes, for instance replacing one word with another in an input sentence. Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and demonstrate how the constants involved are affected by the length of the document. These findings are exemplified through a series of numerical examples.
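To make the flavor of such a bound concrete: a Hölder-type robustness statement with respect to the Hamming distance says roughly that ‖Φ(x) − Φ(x′)‖ ≤ C · d_H(x, x′)^α for an embedding map Φ, a constant C, and an exponent α (with α = 1 in the Lipschitz case). The following is a minimal sketch, not the paper's experimental code, that probes this behavior for TF-IDF using scikit-learn's TfidfVectorizer on an invented toy corpus: a single-word substitution (Hamming distance 1) moves the embedding by a bounded amount.

```python
# Minimal sketch (assumed setup, not the paper's code): measure how far a
# TF-IDF embedding moves when a document is perturbed at Hamming distance 1.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a bird flew over the mat",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

doc = "the cat sat on the mat"
perturbed = "the dog sat on the mat"  # one word replaced

# Hamming distance between the two (equal-length) token sequences.
hamming = sum(a != b for a, b in zip(doc.split(), perturbed.split()))

# Euclidean distance between the TF-IDF embeddings. TfidfVectorizer
# L2-normalizes each row by default, so any two embeddings are at
# distance at most 2 regardless of document content.
emb = vectorizer.transform([doc, perturbed]).toarray()
delta = np.linalg.norm(emb[0] - emb[1])

print(f"Hamming distance:   {hamming}")
print(f"Embedding distance: {delta:.4f}")
```

Repeating this experiment over documents of increasing length would expose the dependence of the constant C on the document length, which is what the paper's quantitative bounds characterize.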
Main file: catellier23a.pdf (949.93 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-04403681, version 1 (18-01-2024)


Cite

Rémi Catellier, Samuel Vaiter, Damien Garreau. On the Robustness of Text Vectorizers. ICML 2023 - Fortieth International Conference on Machine Learning, Jul 2023, Honolulu, United States. ⟨hal-04403681⟩