N. Ryant, E. Bergelson, K. Church, A. Cristia, J. Du et al., Enhancement and analysis of conversational speech: JSALT 2017, in ICASSP, pp.5154-5158, 2018.

G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba et al., Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, in Interspeech, pp.2808-2812, 2018.

M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova et al., BUT system for DIHARD speech diarization challenge, in Interspeech, pp.2798-2802, 2018.

N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du et al., The second DIHARD diarization challenge: Dataset, task, and baselines, in Interspeech, pp.978-982, 2019.

J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications, Elsevier, 2015.

S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer, 2017.

E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement, Wiley, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01881431

R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister et al., Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, IEEE Signal Processing Magazine, vol.36, issue.6, pp.111-124, 2019.

K. Boakye, B. Trueba-Hornero, O. Vinyals, and G. Friedland, Overlapped speech detection for improved speaker diarization in multiparty meetings, in ICASSP, pp.4353-4356, 2008.

K. Boakye, O. Vinyals, and G. Friedland, Two's a crowd: Improving speaker diarization by automatically identifying and excluding overlapped speech, in Interspeech, pp.32-35, 2008.

S. Otterson and M. Ostendorf, Efficient use of overlap information in speaker diarization, in ASRU, pp.683-686, 2007.

J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, and A. Rubio, Efficient voice activity detection algorithms using long-term speech information, Speech Communication, vol.42, issue.3-4, pp.271-287, 2004.

L. P. García-Perera, J. Villalba, H. Bredin, J. Du, D. Castán, B. Gill, S. Ben-Yair, X. Abdoli, W. Wang, H. Bouaziz et al., Speaker detection in the wild: Lessons learned from JSALT 2019, in Odyssey, pp.415-422, 2020.

S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora et al., CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings, CHiME, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02546993

J. Geiger, F. Eyben, B. Schuller, and G. Rigoll, Detecting overlapping speech with long short-term memory recurrent neural networks, in Interspeech, pp.1668-1672, 2013.

V. Andrei, H. Cucu, and C. Burileanu, Detecting overlapped speech on short timeframes using deep learning, in Interspeech, pp.1198-1202, 2017.

N. Sajjan, S. Ganesh, N. Sharma, S. Ganapathy, and N. Ryant, Leveraging LSTM models for overlap detection in multi-party meetings, in ICASSP, pp.5249-5253, 2018.

I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban et al., The AMI meeting corpus, in 5th International Conference on Methods and Techniques in Behavioral Research, pp.137-140, 2005.

M. Kunešová, M. Hrúz, Z. Zajíc, and V. Radová, Detection of overlapping speech for the purposes of speaker diarization, in International Conference on Speech and Computer, pp.247-257, 2019.

L. Bullock, H. Bredin, and L. P. García-Perera, Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection, in ICASSP, pp.7114-7118, 2020.

F. Stöter, S. Chakrabarty, B. Edler, and E. Habets, Classification vs. regression in supervised learning for single channel speaker count estimation, in ICASSP, pp.436-440, 2018.

F. Stöter, S. Chakrabarty, B. Edler, and E. A. Habets, CountNet: Estimating the number of concurrent speakers using supervised learning, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, issue.2, pp.268-282, 2019.

V. Andrei, H. Cucu, and C. Burileanu, Overlapped speech detection and competing speaker counting -humans versus deep learning, IEEE Journal of Selected Topics in Signal Processing, vol.13, issue.4, pp.850-862, 2019.

S. Bai, J. Kolter, and V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, arXiv preprint, 2018.

Y. Luo and N. Mesgarani, Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, pp.1256-1266, 2019.

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al., MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint, 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), pp.1026-1034, 2015.

J. L. Ba, J. R. Kiros, and G. E. Hinton, Layer normalization, arXiv preprint, 2016.

L. Liu, H. Jiang, P. He, W. Chen, X. Liu et al., On the variance of the adaptive learning rate and beyond, arXiv preprint, 2019.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, LibriSpeech: An ASR corpus based on public domain audio books, in ICASSP, pp.5206-5210, 2015.

D. Diaz-Guerra, A. Miguel, and J. R. Beltran, gpuRIR: A Python library for room impulse response simulation with GPU acceleration, arXiv preprint, 2018.

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, Montreal Forced Aligner: Trainable text-speech alignment using Kaldi, in Interspeech, pp.498-502, 2017.

D. Povey, A. Ghoshal et al., The Kaldi speech recognition toolkit, in ASRU, 2011.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common objects in context, in European Conference on Computer Vision (ECCV), pp.740-755, 2014.

J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, vol.7, pp.1-30, 2006.