T. Virtanen, B. Raj, and R. Singh, Techniques for Noise Robustness in Automatic Speech Recognition, 2012.

J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications. Elsevier, 2015.

S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, New Era for Robust Speech Recognition: Exploiting Deep Learning, 2017.

E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01881431

S. Makino, Ed., Audio Source Separation, 2018.
URL : https://hal.archives-ouvertes.fr/inria-00544199

R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister et al., Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, IEEE Signal Processing Magazine, vol.36, issue.6, pp.111-124, 2019.

A. Moreno, B. Lindberg, C. Draxler, G. Richard, K. Choukri et al., SPEECHDAT-CAR: A large speech database for automotive environments, Proc. 2nd Int. Conf. on Language Resources and Evaluation (LREC), 2000.

J. H. Hansen, P. Angkititrakul, J. Plucienkowski, S. Gallant, U. Yapanel et al., "CU-Move": Analysis & corpus development for interactive in-vehicle speech systems, Proc. Eurospeech, pp.2023-2026, 2001.

L. Lamel, F. Schiel, A. Fourcin, J. Mariani, and H. Tillman, The translingual English database (TED), Proc. 3rd Int. Conf. on Spoken Language Processing (ICSLP), 1994.

E. Zwyssig, F. Faubel, S. Renals, and M. Lincoln, Recognition of overlapping speech using digital MEMS microphone arrays, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp.7068-7072, 2013.

J. Barker, R. Marxer, E. Vincent, and S. Watanabe, The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes, Computer Speech and Language, vol.46, pp.605-626, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01382108

E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, An analysis of environment, microphone and data simulation mismatches in robust speech recognition, Computer Speech and Language, vol.46, pp.535-557, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01399180

G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel et al., The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, Proc. 8th Int. Conf. on Language Resources and Evaluation (LREC), pp.114-118, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00712591

P. Bell, M. J. Gales, T. Hain, J. Kilgour, P. Lanchantin et al., The MGB challenge: Evaluating multi-genre broadcast media recognition, Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp.687-693, 2015.

J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, The PASCAL CHiME speech separation and recognition challenge, Computer Speech and Language, vol.27, issue.3, pp.621-633, 2013.
URL : https://hal.archives-ouvertes.fr/inria-00584051

E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta et al., The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines, Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp.126-130, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00796625

A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, and M. Omologo, WOZ acoustic data collection for interactive TV, Proc. 6th Int. Conf. on Language Resources and Evaluation (LREC), pp.2330-2334, 2008.

M. Vacher, B. Lecouteux, P. Chahuara, F. Portet, B. Meillon et al., The Sweet-Home speech and multimodal corpus for home automation interaction, Proc. 9th Int. Conf. on Language Resources and Evaluation (LREC), pp.4499-4509, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00953006

M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi et al., The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments, Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp.275-282, 2015.

N. Bertin, E. Camberlein, E. Vincent, R. Lebarbenchon, S. Peillon et al., A French corpus for distant-microphone speech processing in real homes, Proc. Interspeech, pp.2781-2785, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01343060

N. Bertin, E. Camberlein, R. Lebarbenchon, E. Vincent, S. Sivasankaran et al., VoiceHome-2, an extended corpus for multichannel speech processing in real homes, Speech Communication, vol.106, pp.68-78, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01923108

W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer et al., Toward human parity in conversational speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue.12, pp.2410-2423, 2017.

G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas et al., English conversational telephone speech recognition by humans and machines, Proc. Interspeech, pp.132-136, 2017.

J. J. Godfrey, E. C. Holliman, and J. McDaniel, SWITCHBOARD: Telephone speech corpus for research and development, Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol.1, pp.517-520, 1992.

M. Harper, The automatic speech recognition in reverberant environments (ASpIRE) challenge, Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp.547-554, 2015.

A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart et al., The ICSI meeting corpus, Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp.364-367, 2003.

D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu et al., The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms, Language Resources and Evaluation, vol.41, issue.3-4, pp.389-407, 2007.

S. Renals, T. Hain, and H. Bourlard, Interpretation of multiparty meetings: The AMI and AMIDA projects, Proc. 2nd Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), pp.115-118, 2008.

Lincoln Laboratory speech enhancement corpus (LLSEC), 1996.

A. Stupakov, E. Hanusa, D. Vijaywargi, D. Fox, and J. Bilmes, The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments, Computer Speech and Language, vol.26, issue.1, pp.52-66, 2011.

C. Fox, Y. Liu, E. Zwyssig, and T. Hain, The Sheffield wargames corpus, Proc. Interspeech, pp.1116-1120, 2013.

C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco et al., Voices obscured in complex environmental settings (VOiCES) corpus, Proc. Interspeech, pp.1566-1570, 2018.

M. Van Segbroeck, A. Zaid, K. Kutsenko, C. Huerta, T. Nguyen et al., DiPCo: Dinner party corpus, 2019.

J. Barker, S. Watanabe, E. Vincent, and J. Trmal, The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines, Proc. Interspeech, pp.1561-1565, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01744021

N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du et al., The second DIHARD diarization challenge: Dataset, task, and baselines, Proc. Interspeech, pp.978-982, 2019.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.

V. Manohar, S. J. Chen, Z. Wang, Y. Fujita, S. Watanabe et al., Acoustic modeling for overlapping speech recognition: JHU CHiME-5 challenge system, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp.6665-6669, 2019.

D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu et al., Semi-orthogonal low-rank matrix factorization for deep neural networks, Proc. Interspeech, pp.3743-3747, 2018.

C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann et al., Front-end processing for the CHiME-5 dinner party scenario, Proc. 5th Int. Workshop on Speech Processing in Everyday Environments, pp.35-40, 2018.

X. Anguera, C. Wooters, and J. Hernando, Acoustic beamforming for speaker diarization of meetings, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.7, pp.2011-2021, 2007.

L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, NARA-WPE: A Python package for weighted prediction error dereverberation in NumPy and TensorFlow for online and offline processing, ITG Fachtagung Sprachkommunikation (ITG), 2018.

T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. Juang, Speech dereverberation based on variance-normalized delayed linear prediction, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.7, pp.1717-1731, 2010.

A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, VoxCeleb: Large-scale speaker verification in the wild, Computer Speech and Language, vol.60, p.101027, 2020.

G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba et al., Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, Proc. Interspeech, pp.2808-2812, 2018.

P. Kenny, Bayesian speaker verification with heavy tailed priors, Proc. Odyssey, 2010.

P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur, Acoustic modelling from the signal domain using CNNs, Proc. Interspeech, pp.3434-3438, 2016.

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-vectors: Robust DNN embeddings for speaker recognition, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp.5329-5333, 2018.

X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland et al., Speaker diarization: A review of recent research, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.2, pp.356-370, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00733397

J. Du, T. Gao, L. Sun, F. Ma, Y. Fang et al., The USTC-iFlytek systems for CHiME-5 challenge, Proc. 5th Int. Workshop on Speech Processing in Everyday Environments, pp.11-15, 2018.