Conference paper. Year: 2023

Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling


Abstract

Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained by quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness has not been extensively investigated. This work focuses on improving the invariance of discrete input representations to non-spoken augmentations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn invariant discrete speech representations for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines on both encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show that the proposed approach outperforms the evaluated baselines.
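The abstract does not spell out how robustness is measured, so the following is only a minimal sketch of one plausible way to score it: compare the discrete unit sequence of a clean utterance with the sequence obtained after an augmentation, using an edit distance. The function names (`collapse_repeats`, `unit_mismatch_rate`), the merging of repeated units, and the normalization by the clean sequence length are all assumptions made for illustration, not the authors' definition.

```python
from itertools import groupby
from typing import List, Sequence


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units (assumed preprocessing step)."""
    return [unit for unit, _ in groupby(units)]


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def unit_mismatch_rate(clean: Sequence[int], augmented: Sequence[int]) -> float:
    """Edit distance between collapsed sequences, divided by the clean length.

    Hypothetical score: 0.0 means the units were unaffected by the
    augmentation; larger values mean less invariance.
    """
    a = collapse_repeats(clean)
    b = collapse_repeats(augmented)
    if not a:
        return 0.0 if not b else 1.0
    return edit_distance(a, b) / len(a)


# Made-up unit IDs, for illustration only.
clean_units = [5, 5, 12, 12, 12, 7, 3, 3]
stretched_units = [5, 5, 5, 12, 12, 12, 12, 7, 7, 3, 3, 3]  # same units, longer runs
noisy_units = [5, 12, 12, 9, 7, 7, 3]                       # one spurious unit (9)

print(unit_mismatch_rate(clean_units, stretched_units))
print(unit_mismatch_rate(clean_units, noisy_units))
```

Under these assumptions, a pure change in unit durations (as a time-stretch might ideally produce) scores 0.0, while the single spurious unit in the second example scores 0.25.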
Main file
2209.15483.pdf (454.16 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04208443, version 1 (21-09-2023)


Cite

Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, et al. Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling. 20th International Conference on Spoken Language Translation (IWSLT 2023), Jul 2023, Toronto, Canada. pp.465-477, ⟨10.18653/v1/2023.iwslt-1.46⟩. ⟨hal-04208443⟩
