Preprints, Working Papers, ...

Adapting Language Models When Training on Privacy-Transformed Data

Mehmet Ali Tugtekin Turan 1, Dietrich Klakow 2, Emmanuel Vincent 1, Denis Jouvet 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In recent years, voice-controlled personal assistants have revolutionized interaction with smart devices and mobile applications. The utterances collected by these dialogue tools are then used by system providers to improve and retrain language models (LMs). Each spoken message reveals personal information, so the private data must be removed from the input utterances. However, this may harm LM training because privacy-transformed data is unlikely to match the test distribution. This paper aims to fill this gap by focusing on the adaptation of an LM initially trained on privacy-transformed utterances. Our data sanitization process relies on named-entity recognition. We propose an LM adaptation strategy over the private data with minimum losses. Class-based modeling is an effective approach to overcome data sparsity in n-gram model training. Neural LMs, on the other hand, can handle longer contexts, which can yield better predictions. Our methodology combines the predictive power of class-based models with the generalization capability of neural models. With the privacy transformation, we observe a relative 11% increase in word error rate (WER) compared to an LM trained on the clean data. Despite the privacy-preserving transformation, we can still achieve comparable accuracy: empirical evaluations attain a relative WER improvement of 8% over the initial model.
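As a rough illustration of the two ideas the abstract relies on, the sketch below replaces entity words with class tokens and computes a relative WER change. It is not the authors' pipeline: the `ENTITY_CLASSES` lexicon is a hypothetical stand-in for a real named-entity recognizer, the function names are assumptions, and the WER values are made up to show the arithmetic only.

```python
# Hypothetical toy lexicon standing in for a named-entity recognizer.
# A real system would tag entities with a trained NER model instead.
ENTITY_CLASSES = {
    "alice": "<PERSON>",
    "paris": "<LOCATION>",
    "monday": "<DATE>",
}


def sanitize(utterance: str) -> str:
    """Replace each known entity word with its class token."""
    tokens = utterance.lower().split()
    return " ".join(ENTITY_CLASSES.get(tok, tok) for tok in tokens)


def relative_wer_change(baseline_wer: float, new_wer: float) -> float:
    """Relative WER change in percent; positive means WER increased."""
    return 100.0 * (new_wer - baseline_wer) / baseline_wer


if __name__ == "__main__":
    # Privacy transformation of a single utterance.
    print(sanitize("Call Alice in Paris on Monday"))
    # Illustrative numbers only (not taken from the paper).
    print(round(relative_wer_change(10.0, 11.1), 1))
```

Mapping entities to class tokens is also what makes a class-based LM a natural fit here: the model can learn statistics over `<PERSON>` or `<LOCATION>` slots without ever seeing the private values.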

https://hal.inria.fr/hal-03189354
Contributor : Mehmet Ali Tugtekin Turan
Submitted on : Saturday, April 3, 2021 - 12:35:37 PM
Last modification on : Wednesday, April 7, 2021 - 3:30:40 AM

File

Paper_1854.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03189354, version 1

Citation

Mehmet Ali Tugtekin Turan, Dietrich Klakow, Emmanuel Vincent, Denis Jouvet. Adapting Language Models When Training on Privacy-Transformed Data. 2021. ⟨hal-03189354⟩
