J. S. Lim, Speech enhancement, 1983.

J. Benesty, S. Makino, and J. Chen, Speech enhancement, 2006.

P. C. Loizou, Speech Enhancement: Theory and Practice, 2007.

W. Sumby and I. Pollack, Visual contribution to speech intelligibility in noise, The Journal of the Acoustical Society of America, vol.26, issue.2, pp.212-215, 1954.

N. Erber, Auditory-visual perception of speech, Journal of Speech and Hearing Disorders, vol.40, issue.4, pp.481-492, 1975.

A. Macleod and Q. Summerfield, Quantifying the contribution of vision to speech perception in noise, British Journal of Audiology, vol.21, issue.2, pp.131-141, 1987.

L. Girin, G. Feng, and J. Schwartz, Noisy speech enhancement with filters estimated from the speaker's lips, Proc. European Conference on Speech Communication and Technology, pp.1559-1562, 1995.

L. Girin, J. Schwartz, and G. Feng, Audio-visual enhancement of speech in noise, The Journal of the Acoustical Society of America, vol.109, issue.6, pp.3007-3020, 2001.

J. W. Fisher III, T. Darrell, W. T. Freeman, and P. A. Viola, Learning joint statistical models for audio-visual fusion and segregation, Proc. Advances in Neural Information Processing Systems (NIPS), pp.772-778, 2001.

S. Deligne, G. Potamianos, and C. Neti, Audiovisual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization), Proc. IEEE International Workshop on Sensor Array and Multichannel Signal Processing, pp.68-71, 2002.

R. Goecke, G. Potamianos, and C. Neti, Noisy audio feature enhancement using audio-visual speech data, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.2025-2028, 2002.

J. R. Hershey and M. Casey, Audio-visual sound separation via hidden Markov models, Proc. Advances in Neural Information Processing Systems (NIPS), pp.1173-1180, 2002.

A. H. Abdelaziz, S. Zeiler, and D. Kolossa, Twin-HMM-based audio-visual speech enhancement, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.3726-3730, 2013.

T. Afouras, J. S. Chung, and A. Zisserman, The conversation: Deep audio-visual speech enhancement, Proc. Conference of the International Speech Communication Association (INTERSPEECH), pp.3244-3248, 2018.

A. Gabbay, A. Shamir, and S. Peleg, Visual speech enhancement, Proc. Conference of the International Speech Communication Association (INTERSPEECH), pp.1170-1174, 2018.

A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, Seeing through noise: Speaker separation and enhancement using visually derived speech, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.3051-3055, 2018.

J. Hou, S. Wang, Y. Lai, Y. Tsao, H. Chang et al., Audio-visual speech enhancement using multimodal deep convolutional neural networks, IEEE Transactions on Emerging Topics in Computational Intelligence, vol.2, issue.2, pp.117-128, 2018.

M. Gogate, A. Adeel, R. Marxer, J. Barker, and A. Hussain, DNN driven speaker independent audio-visual mask estimation for speech separation, Proc. Conference of the International Speech Communication Association (INTERSPEECH), pp.2723-2727, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01868604

Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.716-720, 2018.

S. Leglaive, L. Girin, and R. Horaud, A variance modeling framework based on variational autoencoders for speech enhancement, Proc. IEEE International Workshop on Machine Learning for Signal Processing, pp.1-6, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01832826

K. Sekiguchi, Y. Bando, K. Yoshii, and T. Kawahara, Bayesian multichannel speech enhancement with a deep speech prior, Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp.1233-1239, 2018.

S. Leglaive, L. Girin, and R. Horaud, Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.101-105, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02005102

S. Leglaive, U. Şimşekli, A. Liutkus, L. Girin, and R. Horaud, Speech enhancement with variational autoencoders and alpha-stable distributions, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.541-545, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02005106

M. Pariente, A. Deleforge, and E. Vincent, A statistically principled and computationally efficient approach to speech enhancement using variational autoencoders, Proc. Conference of the International Speech Communication Association (INTERSPEECH), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02089062

K. Sohn, H. Lee, and X. Yan, Learning structured output representation using deep conditional generative models, Proc. Advances in Neural Information Processing Systems (NIPS), pp.3483-3491, 2015.

X. Li and R. Horaud, Multichannel speech enhancement based on time-frequency masking using subband long short-term memory, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp.298-302, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02264247

A. H. Abdelaziz, NTCD-TIMIT: A new database and baseline for noise-robust audio-visual speech recognition, Proc. Conference of the International Speech Communication Association (INTERSPEECH), pp.3752-3756, 2017.

M. Cooke, J. Barker, S. Cunningham, and X. Shao, An audiovisual corpus for speech perception and automatic speech recognition, The Journal of the Acoustical Society of America, vol.120, issue.5, pp.2421-2424, 2006.

S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.27, issue.2, pp.113-120, 1979.

J. S. Lim and A. V. Oppenheim, Enhancement and bandwidth compression of noisy speech, Proceedings of the IEEE, vol.67, issue.12, pp.1586-1604, 1979.

Y. Ephraim and D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.32, issue.6, pp.1109-1121, 1984.

R. Martin, Speech enhancement based on minimum mean-square error estimation and supergaussian priors, IEEE Transactions on Speech and Audio Processing, vol.13, issue.5, pp.845-856, 2005.

J. Erkelens, R. Hendriks, R. Heusdens, and J. Jensen, Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.6, pp.1741-1752, 2007.

Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.33, issue.2, pp.443-445, 1985.

I. Cohen and B. Berdugo, Speech enhancement for nonstationary noise environments, Signal Processing, vol.81, issue.11, pp.2403-2418, 2001.

C. Févotte, N. Bertin, and J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis, Neural Computation, vol.21, issue.3, pp.793-830, 2009.

K. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, Speech denoising using nonnegative matrix factorization with priors, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4029-4032, 2008.

B. Raj, R. Singh, and T. Virtanen, Phoneme-dependent NMF for speech enhancement in monaural mixtures, Proc. Conference of the International Speech Communication Association (INTERSPEECH), pp.1217-1220, 2011.

N. Mohammadiha, P. Smaragdis, and A. Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.10, pp.2140-2151, 2013.

D. Wang and J. Chen, Supervised speech separation based on deep learning: An overview, IEEE Transactions on Audio, Speech, and Language Processing, vol.26, issue.10, pp.1702-1726, 2018.

X. Lu, Y. Tsao, S. Matsuda, and C. Hori, Speech enhancement based on deep denoising autoencoder, Proc. Conference of the International Speech Communication Association (INTERSPEECH), pp.436-440, 2013.

Y. Xu, J. Du, L. Dai, and C. Lee, A regression approach to speech enhancement based on deep neural networks, IEEE Transactions on Audio, Speech, and Language Processing, vol.23, issue.1, pp.7-19, 2015.

S. Fu, Y. Tsao, and X. Lu, SNR-aware convolutional neural network modeling for speech enhancement, Proc. Conference of the International Speech Communication Association (INTERSPEECH), pp.3768-3772, 2016.

Y. Wang and D. Wang, Towards scaling up classification-based speech separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.7, pp.1381-1390, 2013.

Y. Wang, A. Narayanan, and D. Wang, On training targets for supervised speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.12, pp.1849-1858, 2014.

F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux et al., Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, Proc. International Conference on Latent Variable Analysis and Signal Separation, pp.91-99, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01163493

D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, Semi-supervised learning with deep generative models, Proc. Advances in Neural Information Processing Systems (NIPS), pp.3581-3589, 2014.

H. Kameoka, L. Li, S. Inoue, and S. Makino, Supervised determined source separation with multichannel variational autoencoder, Neural Computation, vol.31, issue.9, pp.1-24, 2019.

L. Li, H. Kameoka, and S. Makino, Fast MVAE: Joint separation and classification of mixed sources based on multichannel variational autoencoder with auxiliary classifier, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.546-550, 2019.

S. Inoue, H. Kameoka, L. Li, S. Seki, and S. Makino, Joint separation and dereverberation of reverberant mixtures with multichannel variational autoencoder, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.96-100, 2019.

I. Almajai and B. Milner, Visually derived Wiener filters for speech enhancement, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.6, pp.1642-1651, 2010.

G. C. G. Wei and M. A. Tanner, A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms, Journal of the American Statistical Association, vol.85, issue.411, pp.699-704, 1990.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An introduction to variational methods for graphical models, Machine Learning, vol.37, issue.2, pp.183-233, 1999.

D. M. Blei, A. Kucukelbir, and J. D. Mcauliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association, vol.112, issue.518, pp.859-877, 2017.

S. Petridis, T. Stafylakis, P. Ma, and F. Cai, End-to-end audiovisual speech recognition, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.6548-6552, 2018.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot et al., β-VAE: Learning basic visual concepts with a constrained variational framework, Proc. International Conference on Learning Representations (ICLR), 2017.

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B (Methodological), vol.39, issue.1, pp.1-38, 1977.

C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2005.

C. Févotte and J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural Computation, vol.23, issue.9, pp.2421-2456, 2011.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett et al., TIMIT acoustic phonetic continuous speech corpus, Linguistic Data Consortium, 1993.

H. Hirsch, FaNT-filtering and noise adding tool, 2005.

J. Thiemann, N. Ito, and E. Vincent, The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings, Proc. International Congress on Acoustics, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00796707

P. Smaragdis, B. Raj, and M. Shashanka, Supervised and semi-supervised separation of sounds from single-channel mixtures, Proc. International Conference on Independent Component Analysis and Signal Separation, pp.414-421, 2007.

G. J. Mysore and P. Smaragdis, A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proc. International Conference on Learning Representations (ICLR), 2015.

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, issue.4, pp.1462-1469, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00544230

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.749-752, 2001.

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.7, pp.2125-2136, 2011.

A. A. Nugraha, K. Sekiguchi, and K. Yoshii, A deep generative model of speech complex spectrograms, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.905-909, 2019.