Journal articles

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Mostafa Sadeghi 1, 2 Xavier Alameda-Pineda 1
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology
2 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract: In this paper, we address unsupervised (unknown noise) speech enhancement using latent variable generative models. We propose to learn a generative model of clean speech spectrograms based on a variational autoencoder (VAE), where a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e., lip images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference network in which the audio and visual information are fused. Moreover, during speech enhancement, visual data are used to initialize the latent variables, thus providing a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual VAE outperforms the standard audio-only counterpart on speech enhancement.
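To make the idea of a mixture of inference networks more concrete, the following is a minimal sketch, assuming the posterior is a two-component mixture of diagonal Gaussians, one component from an audio encoder and one from a visual encoder. The function name, the mixing weight `pi_audio`, and the diagonal-Gaussian form are illustrative assumptions and are not taken from the paper; the encoder outputs are passed in as plain lists rather than computed by real networks.

```python
import math
import random


def sample_mixture_posterior(mu_a, var_a, mu_v, var_v, pi_audio, rng):
    """Draw one latent vector from a two-component Gaussian mixture.

    mu_a, var_a: mean and diagonal variance from a hypothetical audio encoder.
    mu_v, var_v: mean and diagonal variance from a hypothetical visual encoder.
    pi_audio:    assumed probability of picking the audio component, in [0, 1].
    rng:         a random.Random instance, passed in for reproducibility.
    """
    # Pick which encoder's Gaussian to sample from.
    use_audio = rng.random() < pi_audio
    mu, var = (mu_a, var_a) if use_audio else (mu_v, var_v)
    # Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [m + math.sqrt(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, var)]
    return z, use_audio


rng = random.Random(0)
z, from_audio = sample_mixture_posterior(
    [0.0, 0.0], [1.0, 1.0],   # placeholder audio-encoder outputs
    [1.0, 1.0], [0.5, 0.5],   # placeholder visual-encoder outputs
    0.5, rng,
)
```

Under the same assumptions, starting the latent variables from the visual component alone would correspond to calling the function with `pi_audio=0.0`, which is one way to read the visual initialization described in the abstract.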

https://hal.inria.fr/hal-02926172
Contributor: Team Perception
Submitted on: Tuesday, March 9, 2021 - 9:34:24 AM
Last modification on: Tuesday, May 25, 2021 - 11:31:47 AM
Long-term archiving on: Thursday, June 10, 2021 - 6:19:23 PM

File

ave_tsp-R3.pdf
Files produced by the author(s)

Citation

Mostafa Sadeghi, Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, Institute of Electrical and Electronics Engineers, 2021, 69, pp.1899-1909. ⟨10.1109/TSP.2021.3066038⟩. ⟨hal-02926172⟩
