J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, Deep clustering: Discriminative embeddings for segmentation and separation, ICASSP, pp. 31-35, 2016.

Y. Luo and N. Mesgarani, Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.

I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson et al., Universal sound separation, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 175-179, 2019.

K. Reindl, Y. Zheng, and W. Kellermann, Speech enhancement for binaural hearing aids based on blind source separation, ISCCSP, pp. 1-6, 2010.

C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann et al., Front-end processing for the CHiME-5 dinner party scenario, 2018.

T. Menne, I. Sklyar, R. Schlüter, and H. Ney, Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech, 2019.

T. von Neumann, K. Kinoshita, L. Drude, C. Boeddeker, M. Delcroix et al., End-to-end training of time domain audio separation and recognition, ICASSP, pp. 7004-7008, 2020.

T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani et al., All-neural online source separation, counting, and diarization for meeting analysis, ICASSP, pp. 91-95, 2019.

E. Tzinis, S. Wisdom, J. R. Hershey, A. Jansen, and D. P. Ellis, Improving universal sound separation using sound classification, ICASSP, pp. 96-100, 2020.

S. Wisdom, H. Erdogan, D. P. Ellis, R. Serizel, N. Turpault et al., What's all the fuss about free universal sound separation data?, 2020.

A. Narayanan and D. Wang, Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92-101, 2014.

S. Settle, J. Le Roux, T. Hori, S. Watanabe, and J. R. Hershey, End-to-end multi-speaker speech recognition, ICASSP, pp. 4819-4823, 2018.

S. Watanabe, T. Hori, J. Le Roux, and J. R. Hershey, Student-teacher network learning with enhanced features, ICASSP, pp. 5275-5279, 2017.

D. Bagchi, P. Plantinga, A. Stiff, and E. Fosler-Lussier, Spectral feature mapping with mimic loss for robust speech recognition, ICASSP, pp. 5609-5613, 2018.

N. Turpault, R. Serizel, A. P. Shah, and J. Salamon, Sound event detection in domestic environments with weakly labeled data and soundscape synthesis, Workshop on Detection and Classification of Acoustic Scenes and Events, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02160855

G. Dekkers, S. Lauwereins, B. Thoen, M. W. Adhana, H. Brouckxon et al., The SINS database for detection of daily activities in a home environment using an acoustic sensor network, 2017.

A. Mesaros, T. Heittola, and T. Virtanen, TUT database for acoustic scene classification and sound event detection, EUSIPCO, 2016.

L. Delphin-Poulat and C. Plapous, Mean teacher with data augmentation for DCASE 2019 task 4, DCASE2019 Challenge technical report, 2019.

A. Tarvainen and H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, Advances in Neural Information Processing Systems, pp. 1195-1204, 2017.

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, SDR - half-baked or well done?, ICASSP, pp. 626-630, 2019.

P. Wang and K. Tan, Bridging the gap between monaural speech enhancement and recognition with distortion-independent acoustic modeling, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 39-48, 2019.

H. Seki, T. Hori, S. Watanabe, J. Le Roux, and J. R. Hershey, A purely end-to-end system for multi-speaker speech recognition, 2018.

X. Chang, Y. Qian, K. Yu, and S. Watanabe, End-to-end monaural multi-speaker ASR system without pretraining, ICASSP, pp. 6256-6260, 2019.

D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, Permutation invariant training of deep models for speaker-independent multi-talker speech separation, ICASSP, pp. 241-245, 2017.

M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901-1913, 2017.

M. Pariente, S. Cornell, J. Cosentino, S. Sivasankaran, E. Tzinis et al., Asteroid: The PyTorch-based audio source separation toolkit for researchers, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02962964

J. Johnson, A. Alahi, and L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, ECCV, pp. 694-711, 2016.

F. G. Germain, Q. Chen, and V. Koltun, Speech denoising with deep feature losses, 2018.

Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, End-to-end neural speaker diarization with permutation-free objectives, Interspeech, pp. 4300-4304, 2019.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.