Conference papers

Deep Variational Generative Models for Audio-visual Speech Separation

Viet-Nhat Nguyen 1 Mostafa Sadeghi 1, 2 Elisa Ricci 3 Xavier Alameda-Pineda 1, 4 
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology
2 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In this paper, we address audio-visual speech separation given a single-channel audio recording and the visual information (lip movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) together with the visual data. The visual modality also serves as a prior for the latent variables, through a visual network. At test time, the learned generative model (for both speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for background noise. All the latent variables and noise parameters are then estimated by a Monte Carlo expectation-maximization algorithm. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches as well as a supervised deep learning-based technique.
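As a rough illustration of the test-time combination described in the abstract, the sketch below builds an NMF noise variance as the product of two non-negative matrices and combines it with per-speaker speech variances. It is not the paper's algorithm: the function names `nmf_variance` and `wiener_gains` are hypothetical, the speech variances are made-up numbers standing in for VAE decoder outputs, and the Wiener-like gain (speaker variance divided by total mixture variance, assuming independent sources) is an assumption that the abstract does not state. The Monte Carlo expectation-maximization step that would actually estimate these quantities is not shown.

```python
from typing import List

Matrix = List[List[float]]


def nmf_variance(W: Matrix, H: Matrix) -> Matrix:
    """Noise variance V[f][n] = sum_k W[f][k] * H[k][n], with W and H non-negative."""
    n_frames = len(H[0])
    return [
        [sum(w_fk * H[k][n] for k, w_fk in enumerate(row)) for n in range(n_frames)]
        for row in W
    ]


def wiener_gains(speech_vars: List[Matrix], noise_var: Matrix) -> List[Matrix]:
    """Per-speaker gain (assumed, not from the abstract):
    speaker variance divided by the total mixture variance."""
    n_freqs, n_frames = len(noise_var), len(noise_var[0])
    gains = []
    for v in speech_vars:
        gains.append([
            [
                v[f][n] / (sum(u[f][n] for u in speech_vars) + noise_var[f][n])
                for n in range(n_frames)
            ]
            for f in range(n_freqs)
        ])
    return gains


# Toy example: 2 frequency bins, 3 frames, rank-1 noise model.
W = [[1.0], [0.5]]          # spectral pattern of the noise
H = [[2.0, 1.0, 4.0]]       # temporal activations of the noise
noise_var = nmf_variance(W, H)

# Hypothetical per-speaker variances (in the paper these would come from the VAE).
speech_vars = [
    [[2.0, 1.0, 4.0], [1.0, 0.5, 2.0]],   # speaker 1
    [[4.0, 2.0, 8.0], [2.0, 1.0, 4.0]],   # speaker 2
]
gains = wiener_gains(speech_vars, noise_var)
```

Multiplying each gain element-wise with the mixture spectrogram would give a per-speaker estimate under these assumptions. Because the noise variance is strictly positive in this toy example, the gains of the two speakers sum to less than one in every time-frequency bin, and the remainder is attributed to noise.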
Contributor : Perception team
Submitted on : Friday, September 4, 2020 - 2:36:19 PM
Last modification on : Wednesday, May 4, 2022 - 11:58:03 AM



  • HAL Id : hal-02930662, version 1
  • ARXIV : 2008.07191


Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci, Xavier Alameda-Pineda. Deep Variational Generative Models for Audio-visual Speech Separation. MLSP 2021 - IEEE International Workshop on Machine Learning for Signal Processing, Oct 2021, Gold Coast, Australia. ⟨hal-02930662⟩


