
On Generative Spoken Language Modeling from Raw Audio

Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
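The abstract describes a three-stage pipeline: a speech encoder that discretizes raw audio into pseudo-text units, a generative language model trained on those units, and a decoder that maps generated units back to a waveform. The sketch below illustrates that data flow only; every component here is a toy stand-in (an assumption, not the paper's method) — the actual systems use CPC/wav2vec 2.0/HuBERT features quantized with k-means, a Transformer language model, and a neural vocoder.

```python
# Toy illustration of the encoder -> LM -> decoder pipeline from the abstract.
# All three components are simplified stand-ins, not the models used in the paper.
import random

NUM_UNITS = 50  # the paper studies codebooks of 50, 100, or 200 units

def encode(audio, num_units=NUM_UNITS):
    """Speech encoder stand-in: uniformly quantize each sample in [-1, 1]
    into one of `num_units` discrete pseudo-text units. (The paper instead
    runs k-means on self-supervised features such as HuBERT.)"""
    return [min(int((x + 1.0) / 2.0 * num_units), num_units - 1) for x in audio]

class UnigramLM:
    """Generative LM stand-in over pseudo-text units: a unigram model
    fit by counting. (The paper trains a Transformer LM on the units.)"""
    def __init__(self, sequences):
        self.counts = {}
        for seq in sequences:
            for u in seq:
                self.counts[u] = self.counts.get(u, 0) + 1

    def sample(self, length, rng):
        units, weights = zip(*self.counts.items())
        return rng.choices(units, weights=weights, k=length)

def decode(units, num_units=NUM_UNITS):
    """Speech decoder stand-in: map each unit back to the center of its
    quantization bin. (The paper uses a unit-conditioned vocoder.)"""
    return [(u + 0.5) / num_units * 2.0 - 1.0 for u in units]

# End-to-end flow: raw "audio" -> units -> train LM -> generate -> "waveform".
rng = random.Random(0)
audio = [rng.uniform(-1.0, 1.0) for _ in range(100)]
units = encode(audio)
lm = UnigramLM([units])
generated = lm.sample(20, rng)
waveform = decode(generated)
```

The point of the sketch is that once speech is reduced to discrete units, the middle stage is ordinary language modeling on a "pseudo-text" vocabulary, which is what lets text-LM machinery apply to raw audio.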
Document type: Journal articles
Contributor: Emmanuel Dupoux
Submitted on: Monday, October 11, 2021 - 11:37:45 AM
Last modification on: Friday, November 18, 2022 - 9:23:14 AM
Long-term archiving on: Wednesday, January 12, 2022 - 6:41:02 PM
Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 2021. ⟨10.1162/tacl_a_00430⟩. ⟨hal-03329219⟩