Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

Itai Gat; Felix Kreuk; Tu Anh Nguyen; Ann Lee; Jade Copet; Gabriel Synnaeve; Emmanuel Dupoux; Yossi Adi

doi:10.18653/v1/2023.iwslt-1.46

Conference paper, Year: 2023

Itai Gat (Author): Meta AI
Felix Kreuk (Author): Meta AI
Tu Anh Nguyen (Author): Meta AI; Automatic Language Modelling and ANAlysis & Computational Humanities
Ann Lee (Author): Meta AI
Jade Copet (Author): Meta AI
Gabriel Synnaeve (Author): Meta AI
Emmanuel Dupoux (Author): Meta AI; École des hautes études en sciences sociales
Yossi Adi (Author): Meta AI

Abstract

Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained by quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness has not been extensively investigated. This work focuses on improving the invariance of discrete input representations to non-spoken augmentations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate that current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn invariant discrete speech representations for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show that the proposed approach outperforms the evaluated baselines.
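
The abstract says robustness is measured by comparing the discrete units of a signal with those of an augmented copy, but it does not give the formal definition. The following is a minimal sketch under stated assumptions: it uses a length-normalized Levenshtein distance between two unit sequences as a stand-in metric. The function names (`levenshtein`, `collapse_repeats`, `unit_mismatch_rate`), the choice to collapse repeated units, and the unit IDs are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Edit distance between two sequences (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Drop consecutive duplicates, e.g. [5, 5, 12] becomes [5, 12].

    Assumption for illustration: an augmentation such as time-stretch changes
    how many frames each unit spans, so runs are collapsed before comparing.
    """
    out: List[int] = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def unit_mismatch_rate(clean: Sequence[int], augmented: Sequence[int]) -> float:
    """Edit distance between collapsed sequences, divided by the clean length."""
    ref = collapse_repeats(clean)
    hyp = collapse_repeats(augmented)
    if not ref:
        raise ValueError("clean unit sequence must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical unit IDs for one utterance and a time-stretched copy of it.
clean_units = [5, 5, 12, 12, 12, 7]
augmented_units = [5, 12, 12, 9, 7]
rate = unit_mismatch_rate(clean_units, augmented_units)
```

Under this stand-in metric, a perfectly invariant representation would give a rate of 0 for every augmentation that leaves the spoken content unchanged. In practice the unit sequences would come from quantizing the features of a self-supervised speech encoder, which this sketch does not attempt to reproduce.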

Domains

Computer Science [cs]

Main file

2209.15483.pdf (454.16 KB)

Origin: Files produced by the author(s)

Contributor: Tu Anh Nguyen

https://cnrs.hal.science/hal-04208443

Submitted on: Thursday, September 21, 2023, 22:15:43

Last modified on: Monday, March 18, 2024, 10:24:07

Long-term archiving on: Friday, December 22, 2023, 18:01:33

Dates and versions

hal-04208443, version 1 (21-09-2023)

Identifiers

HAL Id: hal-04208443, version 1
DOI: 10.18653/v1/2023.iwslt-1.46

Cite

Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, et al. Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling. 20th International Conference on Spoken Language Translation (IWSLT 2023), Jul 2023, Toronto, Canada. pp. 465-477, ⟨10.18653/v1/2023.iwslt-1.46⟩. ⟨hal-04208443⟩

Collections

CNRS INRIA EHESS INRIA2
