
On Generative Spoken Language Modeling from Raw Audio

Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
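The abstract describes a three-stage pipeline: a speech encoder that discretizes raw audio into pseudo-text units, a generative language model trained on those units, and a decoder that maps generated units back to a waveform. The sketch below illustrates that data flow only; every component here is a toy stand-in (an assumption, not the paper's method) — the actual systems use CPC/wav2vec 2.0/HuBERT features quantized with k-means, a Transformer language model, and a neural vocoder.

```python
# Toy illustration of the encoder -> LM -> decoder pipeline from the abstract.
# All three components are simplified stand-ins, not the models used in the paper.
import random

NUM_UNITS = 50  # the paper studies codebooks of 50, 100, or 200 units

def encode(audio, num_units=NUM_UNITS):
    """Speech encoder stand-in: uniformly quantize each sample in [-1, 1]
    into one of `num_units` discrete pseudo-text units. (The paper instead
    runs k-means on self-supervised features such as HuBERT.)"""
    return [min(int((x + 1.0) / 2.0 * num_units), num_units - 1) for x in audio]

class UnigramLM:
    """Generative LM stand-in over pseudo-text units: a unigram model
    fit by counting. (The paper trains a Transformer LM on the units.)"""
    def __init__(self, sequences):
        self.counts = {}
        for seq in sequences:
            for u in seq:
                self.counts[u] = self.counts.get(u, 0) + 1

    def sample(self, length, rng):
        units, weights = zip(*self.counts.items())
        return rng.choices(units, weights=weights, k=length)

def decode(units, num_units=NUM_UNITS):
    """Speech decoder stand-in: map each unit back to the center of its
    quantization bin. (The paper uses a unit-conditioned vocoder.)"""
    return [(u + 0.5) / num_units * 2.0 - 1.0 for u in units]

# End-to-end flow: raw "audio" -> units -> train LM -> generate -> "waveform".
rng = random.Random(0)
audio = [rng.uniform(-1.0, 1.0) for _ in range(100)]
units = encode(audio)
lm = UnigramLM([units])
generated = lm.sample(20, rng)
waveform = decode(generated)
```

The point of the sketch is that once speech is reduced to discrete units, the middle stage is ordinary language modeling on a "pseudo-text" vocabulary, which is what lets text-LM machinery apply to raw audio.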
Document type: Journal articles
Contributor: Emmanuel Dupoux
Submitted on: Monday, October 11, 2021 - 11:37:45 AM
Last modification on: Friday, November 18, 2022 - 9:23:14 AM
Long-term archiving on: Wednesday, January 12, 2022 - 6:41:02 PM
Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 2021. ⟨10.1162/tacl_a_00430⟩. ⟨hal-03329219⟩