VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Abstract : We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open dataset to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages, totaling 17.3K hours. We provide speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised ASR and speech-to-text translation under challenging out-of-domain settings.
Document type :
Books

https://hal.inria.fr/hal-03329290
Contributor : Emmanuel Dupoux
Submitted on : Monday, October 11, 2021 - 11:08:45 AM
Last modification on : Friday, November 18, 2022 - 9:23:14 AM
Long-term archiving on : Wednesday, January 12, 2022 - 6:41:04 PM

File

2101.00390.pdf
Files produced by the author(s)

Citation

Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, et al. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. 2021, ⟨10.18653/v1/2021.acl-long.80⟩. ⟨hal-03329290⟩
