Unsupervised Speech Enhancement using Dynamical Variational Autoencoders

Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for modeling speech spectrograms. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, thereby exploiting both unsupervised representation learning and dynamics modeling of speech signals. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
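
The abstract describes the method only at a high level. As a rough illustration of how a speech variance prior and a nonnegative matrix factorization (NMF) noise model can be combined, the sketch below builds a noise variance as the product of two nonnegative matrices and applies a Wiener-like gain to a noisy spectrogram. It is not the authors' VEM algorithm: the speech variance is a hard-coded placeholder standing in for what a pre-trained DVAE decoder might output, the Wiener-like gain is an assumption about how the two variances would be used, and all function names and toy values are hypothetical.

```python
# Illustrative sketch only. All names and numbers are hypothetical and the
# speech variance is a placeholder, not the output of a trained DVAE.

def nmf_variance(W, H):
    """Noise variance as the nonnegative product W @ H (F x K times K x T)."""
    n_freq, n_comp, n_frames = len(W), len(H), len(H[0])
    return [[sum(W[f][k] * H[k][t] for k in range(n_comp))
             for t in range(n_frames)]
            for f in range(n_freq)]

def wiener_gain(speech_var, noise_var, eps=1e-12):
    """Per-bin gain v_speech / (v_speech + v_noise); eps avoids division by zero."""
    return [[vs / (vs + vn + eps) for vs, vn in zip(row_s, row_n)]
            for row_s, row_n in zip(speech_var, noise_var)]

def enhance(noisy_stft, speech_var, noise_var):
    """Scale each complex time-frequency bin of the noisy STFT by the gain."""
    gain = wiener_gain(speech_var, noise_var)
    return [[g * x for g, x in zip(row_g, row_x)]
            for row_g, row_x in zip(gain, noisy_stft)]

# Toy example with 2 frequency bins, 3 frames, and 1 NMF component.
W = [[1.0], [2.0]]                      # spectral template (assumed values)
H = [[0.5, 1.0, 0.0]]                   # temporal activations (assumed values)
noise_var = nmf_variance(W, H)

speech_var = [[0.5, 1.0, 2.0],          # placeholder for a speech prior variance
              [1.0, 2.0, 4.0]]
noisy_stft = [[1 + 1j, 2 + 0j, 0 + 1j],
              [2 + 2j, 0 + 4j, 1 - 1j]]

gain = wiener_gain(speech_var, noise_var)
enhanced = enhance(noisy_stft, speech_var, noise_var)
```

In this toy setting the gain is close to 0.5 wherever speech and noise variances are equal and close to 1 in the last frame, where the noise activation is zero. In an actual VEM procedure the NMF parameters and the latent variables would be estimated iteratively from the noisy signal rather than fixed by hand.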

Keywords

Speech enhancement; Noise measurement; Training; Recording; Inference algorithms; Time-domain analysis; Time series analysis

Domains

Artificial Intelligence [cs.AI]; Machine Learning [cs.LG]

Main file

Bie et al-2022-Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders.pdf (6.95 MB)

Origin: Files produced by the author(s)

Contributor: Xavier Alameda-Pineda

https://inria.hal.science/hal-03295630

Submitted on: Friday, December 16, 2022, 13:55:06

Last modified on: Thursday, April 4, 2024, 21:08:22

Dates and versions

hal-03295630, version 1 (16-12-2022)

Identifiers

HAL Id: hal-03295630, version 1
arXiv: 2106.12271
DOI: 10.1109/TASLP.2022.3207349

Cite

Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin. Unsupervised Speech Enhancement using Dynamical Variational Autoencoders. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2022, 30, pp. 2993-3007. ⟨10.1109/TASLP.2022.3207349⟩. ⟨hal-03295630⟩


Collections

UNIV-RENNES1 UGA CNRS INRIA INSA-RENNES IRISA GIPSA IETR SUP_IETR LJK LJK_GI GIPSA-CRISSP CENTRALESUPELEC IETR-FAST INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE GIPSA-PPC MIAI ANR UR1-MATH-NUM HUB-IA LJK-GI-ROBOTLEARN NANTES-UNIVERSITE IETR-AIMAC
