D. Wang and J. Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.26, pp.1702-1726, 2018.

D. S. Williamson, Y. Wang, and D. Wang, Complex ratio masking for monaural speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, issue.3, pp.483-492, 2016.

S. Fu, T. Hu, Y. Tsao, and X. Lu, Complex spectrogram enhancement by convolutional neural network with multi-metrics learning, International Workshop on Machine Learning for Signal Processing, pp.1-6, 2017.

Z. Wang, P. Wang, and D. Wang, Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.28, pp.1778-1787, 2020.

K. Tan, J. Chen, and D. Wang, Gated residual networks with dilated convolutions for monaural speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, issue.1, pp.189-198, 2019.

F. Weninger, F. Eyben, and B. Schuller, Single-channel speech separation with memory-enhanced recurrent neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.3709-3713, 2014.

J. Chen and D. Wang, Long short-term memory for speaker generalization in supervised speech separation, The Journal of the Acoustical Society of America, vol.141, issue.6, pp.4705-4714, 2017.

Y. Wang, K. Han, and D. Wang, Exploring monaural features for classification-based speech segregation, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.2, pp.270-279, 2013.

Y. Jiang, D. Wang, R. Liu, and Z. Feng, Binaural classification for reverberant speech segregation using deep neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.12, pp.2112-2121, 2014.

J. Heymann, L. Drude, and R. Haeb-Umbach, Neural network based spectral mask estimation for acoustic beamforming, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.196-200, 2016.

C. Boeddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach, Exploring practical aspects of neural mask-based beamforming for far-field speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6697-6701, 2018.

X. Zhang and D. Wang, Deep learning based binaural speech separation in reverberant environments, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue.5, pp.1075-1084, 2017.

Z. Wang and D. Wang, Combining spectral and spatial features for deep learning based blind speaker separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, pp.457-468, 2019.

T. Yoshioka, Z. Chen, C. Liu, X. Xiao, H. Erdogan et al., Low-latency speaker-independent continuous speech separation, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6980-6984, 2019.

P. Pertilä and J. Nikunen, Distant speech separation using predicted time-frequency masks from spatial features, Speech Communication, vol.68, pp.97-106, 2015.

S. Chakrabarty and E. A. Habets, Time-frequency masking based online multi-channel speech enhancement with convolutional recurrent neural networks, IEEE Journal of Selected Topics in Signal Processing, vol.13, issue.4, pp.787-799, 2019.

X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey et al., Deep beamforming networks for multi-channel speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5745-5749, 2016.

B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani, Neural network adaptive beamforming for robust multichannel speech recognition, Interspeech, 2016.

Z. Meng, S. Watanabe, J. R. Hershey, and H. Erdogan, Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.271-275, 2017.

X. Li, L. Girin, S. Gannot, and R. Horaud, Non-stationary noise power spectral density estimation based on regional statistics, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.181-185, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01250892

T. Gerkmann and R. C. Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.4, pp.1383-1393, 2012.

S. Gannot, D. Burshtein, and E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech, IEEE Transactions on Signal Processing, vol.49, issue.8, pp.1614-1626, 2001.

X. Li, L. Girin, R. Horaud, and S. Gannot, Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.320-324, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01119186

X. Li, S. Leglaive, L. Girin, and R. Horaud, Audio-noise power spectral density estimation using long short-term memory, IEEE Signal Processing Letters, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02100059

A. Schwarz and W. Kellermann, Coherent-to-diffuse power ratio estimation for dereverberation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, pp.1006-1018, 2015.

S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, A consolidated perspective on multimicrophone speech enhancement and source separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue.4, pp.692-730, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01414179

Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech and Signal Processing, vol.32, issue.6, pp.1109-1121, 1984.

I. Cohen and B. Berdugo, Speech enhancement for non-stationary noise environments, Signal Processing, vol.81, issue.11, pp.2403-2418, 2001.

M. Brandstein and D. Ward, Microphone arrays: signal processing techniques and applications, Springer, 2013.

X. Li and R. Horaud, Multichannel speech enhancement based on time-frequency masking using subband long short-term memory, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02264247

P. Wang, K. Tan, and D. Wang, Bridging the gap between monaural speech enhancement and recognition with distortion-independent acoustic modeling, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.28, pp.39-48, 2019.

Y. Wang, A. Narayanan, and D. Wang, On training targets for supervised speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, pp.1849-1858, 2014.

J. Barker, R. Marxer, E. Vincent, and S. Watanabe, The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.504-511, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01211376

F. Chollet, Keras, https://keras.io, 2015.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol.2, pp.749-752, 2001.

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.7, pp.2125-2136, 2011.

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, issue.4, pp.1462-1469, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00544230

J. F. Santos, M. Senoussaoui, and T. H. Falk, An improved non-intrusive intelligibility metric for noisy and reverberant speech, International Workshop on Acoustic Signal Enhancement (IWAENC), pp.55-59, 2014.

T. Hori, Z. Chen, J. R. Hershey, J. Le Roux, V. Mitra et al., The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.475-481, 2015.

X. Anguera, C. Wooters, and J. Hernando, Acoustic beamforming for speaker diarization of meetings, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.7, pp.2011-2022, 2007.