Journal article, Transactions of the Association for Computational Linguistics, Year: 2021

On Generative Spoken Language Modeling from Raw Audio

Abstract

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
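
The three-stage baseline described in the abstract (encoder, language model, decoder) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the abstract does not say how units are obtained, so the sketch uses nearest-centroid quantization; a bigram count model stands in for the generative language model; and the "decoder" simply returns centroid vectors where a real system would synthesize a waveform. The function names `encode`, `train_bigram_lm`, `generate`, and `decode` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def encode(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    The resulting integer sequence plays the role of "pseudo-text".
    Assumption: squared Euclidean distance to a fixed set of centroids.
    """
    units = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


def train_bigram_lm(sequences):
    """Count unit-to-unit transitions (a stand-in for a neural LM)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(lm, start, length, rng):
    """Sample a unit sequence from the bigram counts, starting at `start`."""
    out = [start]
    while len(out) < length:
        followers = lm.get(out[-1])
        if not followers:  # no observed continuation for this unit
            break
        candidates, weights = zip(*followers.items())
        out.append(rng.choices(candidates, weights=weights)[0])
    return out


def decode(units, centroids):
    """Toy decoder: look up one centroid per unit instead of a waveform."""
    return [centroids[u] for u in units]


if __name__ == "__main__":
    # Three centroids means three discrete units; the paper compares 50, 100, 200.
    centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    frames = [(0.1, -0.1), (0.9, 1.2), (0.1, 0.8), (0.0, 0.1)]
    units = encode(frames, centroids)
    lm = train_bigram_lm([units])
    sample = generate(lm, units[0], 6, random.Random(0))
    print(units, sample, decode(sample, centroids))
```

In this sketch the number of centroids is the knob that corresponds to the number of discrete units discussed in the abstract; changing it changes the granularity of the pseudo-text seen by the language model.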
Main file
2102.01192.pdf (1.32 MB)
Origin: files produced by the author(s)

Dates and versions

hal-03329219, version 1 (11-10-2021)

Cite

Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 2021, ⟨10.1162/tacl_a_00430⟩. ⟨hal-03329219⟩