Adapting Language Models When Training on Privacy-Transformed Data

Mehmet Ali Tugtekin Turan; Dietrich Klakow; Emmanuel Vincent; Denis Jouvet

Conference paper. Year: 2022



Mehmet Ali Tugtekin Turan

Role: Author
PersonId : 1095307

Speech Modeling for Facilitating Oral-Based Communication

Dietrich Klakow

Role: Author
PersonId : 1095147

Universität des Saarlandes [Saarbrücken]

Emmanuel Vincent

Role: Author
PersonId : 1256
IdHAL : emmanuelv
ORCID : 0000-0002-0183-7289
IdRef : 089360176

Speech Modeling for Facilitating Oral-Based Communication

Denis Jouvet

Role: Author
PersonId : 15904
IdHAL : denis-jouvet
IdRef : 029418666

Speech Modeling for Facilitating Oral-Based Communication

Abstract

In recent years, voice-controlled personal assistants have revolutionized interaction with smart devices and mobile applications. System providers use the data these assistants collect to train language models (LMs). Because each spoken message can reveal personal information, private information must be removed from the input sentences. Our data sanitization process relies on recognizing named entities and replacing them with other words from the same class. However, this may harm LM training because privacy-transformed data is unlikely to match the test distribution. This paper addresses that gap by focusing on the adaptation of LMs initially trained on privacy-transformed sentences, using a small amount of original untransformed data. To do so, we combine class-based LMs, which provide an effective approach to overcoming data sparsity in the context of n-gram LMs, and neural LMs, which handle longer contexts and can yield better predictions. Our experiments show that training an LM on privacy-transformed data results in a relative 11% word error rate (WER) increase compared to training on the original untransformed data, and that adapting that model on a limited amount of original untransformed data leads to a relative 8% WER improvement over the model trained solely on privacy-transformed data.
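The same-class replacement idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: it assumes a small hand-written lookup table (the hypothetical `GAZETTEER`) in place of a real named-entity recognizer, and the class names and helper `sanitize` are invented for illustration only.

```python
import random

# Hypothetical toy gazetteer (assumption): entity class -> known surface forms.
# A real system would use a trained named-entity recognizer instead.
GAZETTEER = {
    "PERSON": ["alice", "bob", "carol"],
    "CITY": ["paris", "berlin", "madrid"],
}

# Reverse lookup: surface form -> entity class.
WORD_TO_CLASS = {
    word: cls for cls, words in GAZETTEER.items() for word in words
}


def sanitize(sentence, rng):
    """Replace each known entity with a different word of the same class."""
    out = []
    for token in sentence.split():
        cls = WORD_TO_CLASS.get(token.lower())
        if cls is None:
            # Not a recognized entity: keep the token unchanged.
            out.append(token)
        else:
            # Pick a substitute from the same class, excluding the original.
            candidates = [w for w in GAZETTEER[cls] if w != token.lower()]
            out.append(rng.choice(candidates))
    return " ".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    print(sanitize("call alice in paris", rng))
```

Because substitutes come from the same class, the sentence structure is preserved while the specific names change, which is also why the transformed text may no longer match the word distribution of real test data.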

Keywords

language model adaptation; privacy-preserving learning; speech recognition; class-based language modeling

Domains

Computation and Language [cs.CL]; Machine Learning [cs.LG]

Main file

Paper_1854.pdf (230.95 KB)

Origin: Files produced by the author(s)

Contributor: Emmanuel Vincent

https://inria.hal.science/hal-03189354

Submitted on: Sunday, May 8, 2022, 21:08:51

Last modified on: Thursday, February 1, 2024, 10:03:33

Dates and versions

hal-03189354 , version 1 (03-04-2021)

hal-03189354 , version 2 (08-05-2022)

Identifiers

HAL Id : hal-03189354 , version 2

Cite

Mehmet Ali Tugtekin Turan, Dietrich Klakow, Emmanuel Vincent, Denis Jouvet. Adapting Language Models When Training on Privacy-Transformed Data. LREC 2022 - 13th Language Resources and Evaluation Conference, Jun 2022, Marseille, France. ⟨hal-03189354v2⟩


Collections

UNIV-RENNES1 CNRS INRIA IRISA GRID5000 UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SILECS HYAIAI UR1-MATH-NUM

