N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du et al., The second DIHARD diarization challenge: Dataset, task, and baselines, Proc. Interspeech, 2019.

G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba et al., Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, Proc. Interspeech, 2018.

J. Barker, S. Watanabe, E. Vincent, and J. Trmal, The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines, Proc. Interspeech, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01744021

N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du et al., Second DIHARD challenge evaluation plan, Linguistic Data Consortium, Tech. Rep., 2019.

M. Diez, BUT system for DIHARD speech diarization challenge 2018, Proc. Interspeech, pp.2798-2802, 2018.

F. G. Germain, Q. Chen, and V. Koltun, Speech denoising with deep feature losses, 2018.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proc. IEEE CVPR, pp.770-778, 2016.

J. Hu, L. Shen, and G. Sun, Squeeze-and-excitation networks, Proc. IEEE CVPR, pp.7132-7141, 2018.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

T. Salimans and D. P. Kingma, Weight normalization: A simple reparameterization to accelerate training of deep neural networks, Advances in Neural Information Processing Systems, pp.901-909, 2016.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, LibriSpeech: an ASR corpus based on public domain audio books, Proc. ICASSP, pp.5206-5210, 2015.

B. Külebi and A. Öktem, Building an open source automatic speech recognition system for Catalan, Proc. IberSPEECH, pp.2018-2024, 2018.

Surfingtech, ST-CMDS-20170001_1, Free ST Chinese Mandarin Corpus.

Freesound.

YouTube.

D. Snyder, G. Chen, and D. Povey, MUSAN: A Music, Speech, and Noise Corpus, 2015.

R. Scheibler, E. Bezzam, and I. Dokmanić, Pyroomacoustics: A Python package for audio room simulation and array processing algorithms, Proc. ICASSP. IEEE, pp.351-355, 2018.

G. Gelly and J. Gauvain, Minimum word error training of RNN-based voice activity detection, Proc. Interspeech, 2015.

pyannote-audio: neural building blocks for speaker diarization.

S. Chakroborty, A. Roy, and G. Saha, Improved closed set text-independent speaker identification by combining MFCC with evidence from flipped filter banks, International Journal of Signal Processing, vol.4, issue.2, pp.114-122, 2007.

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, Phoneme recognition using time-delay neural networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.37, issue.3, pp.328-339, 1989.

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-vectors: robust DNN embeddings for speaker recognition, Proc. ICASSP, pp.5329-5333, 2018.

F. Chollet, Keras, 2015.

M. Abadi, TensorFlow: large-scale machine learning on heterogeneous systems, 2015.

V. Nair and G. Hinton, Rectified linear units improve restricted Boltzmann machines, Proc. ICML, pp.807-814, 2010.

S. Ioffe and C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, Proc. ICML, pp.448-456, 2015.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp.249-256, 2010.

D. P. Kingma and J. Ba, Adam: a method for stochastic optimization, Proc. ICLR, pp.1-15, 2015.

X. Anguera, C. Wooters, and J. Hernando, Acoustic beamforming for speaker diarization of meetings, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.7, pp.2011-2021, 2007.

L. Sun, J. Du, C. Jiang, X. Zhang, S. He et al., Speaker diarization with enhancing speech for the first DIHARD challenge, Proc. Interspeech, pp.2793-2797, 2018.

Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li et al., Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network, 2018 IEEE Spoken Language Technology Workshop (SLT), pp.558-565, 2018.

Anonymous, Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition, IEEE Automatic Speech Recognition and Understanding Workshop (Submitted), 2019.

C. Knapp and G. Carter, The Generalized Correlation Method for Estimation of Time Delay, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.24, issue.4, pp.320-327, 1976.

A. Spriet, M. Moonen, and J. Wouters, Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction, Signal Processing, vol.84, issue.12, pp.2367-2387, 2004.

J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, Deep clustering: Discriminative embeddings for segmentation and separation, Proc. ICASSP. IEEE, pp.31-35, 2016.

J. Patino, H. Delgado, and N. Evans, The EURECOM submission to the first DIHARD Challenge, Proc. Interspeech, 2018.

R. Yin, H. Bredin, and C. Barras, Neural speech turn segmentation and affinity propagation for speaker diarization, Proc. Interspeech, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01912236