J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications, 2015.

E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01120685

S. Makino (Ed.), Audio Source Separation, 2018.
DOI : 10.1007/978-3-319-73031-8

J. H. Hansen, P. Angkititrakul, J. Plucienkowski, S. Gallant, U. Yapanel et al., CU-Move: Analysis & corpus development for interactive in-vehicle speech systems, Proc. Eurospeech, pp.2023-2026, 2001.

L. Lamel, F. Schiel, A. Fourcin, J. Mariani, and H. Tillman, The translingual English database (TED), Proc. 3rd Int. Conf. on Spoken Language Processing (ICSLP), 1994.

E. Zwyssig, F. Faubel, S. Renals, and M. Lincoln, Recognition of overlapping speech using digital MEMS microphone arrays, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.7068-7072, 2013.
DOI : 10.1109/ICASSP.2013.6639033

J. Barker, R. Marxer, E. Vincent, and S. Watanabe, The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes, Computer Speech & Language, vol.46, pp.605-626, 2017.
DOI : 10.1016/j.csl.2016.10.005

URL : https://hal.archives-ouvertes.fr/hal-01382108

E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, An analysis of environment, microphone and data simulation mismatches in robust speech recognition, Computer Speech & Language, vol.46, pp.535-557, 2017.
DOI : 10.1016/j.csl.2016.11.005

URL : https://hal.archives-ouvertes.fr/hal-01399180

G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel et al., The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, Proc. 8th Int. Conf. on Language Resources and Evaluation (LREC), pp.114-118, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00712591

P. Bell, M. J. Gales, T. Hain, J. Kilgour, P. Lanchantin et al., The MGB challenge: Evaluating multi-genre broadcast media recognition, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.687-693, 2015.
DOI : 10.1109/ASRU.2015.7404863

URL : http://eprints.whiterose.ac.uk/101807/1/mgb-asru2015.pdf

J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, The PASCAL CHiME speech separation and recognition challenge, Computer Speech & Language, vol.27, issue.3, pp.621-633, 2013.
DOI : 10.1016/j.csl.2012.10.004

URL : https://hal.archives-ouvertes.fr/inria-00584051

E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta et al., The second 'CHiME' speech separation and recognition challenge: An overview of challenge systems and outcomes, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.162-167, 2013.
DOI : 10.1109/ASRU.2013.6707723

A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, and M. Omologo, WOZ acoustic data collection for interactive TV, Proc. 6th Int. Conf. on Language Resources and Evaluation (LREC), pp.2330-2334, 2008.
DOI : 10.1007/s10579-010-9116-x

URL : http://dicit.fbk.eu/Publications/Brutt_Cristoforetti_Kellermann_Marquardt_Omologo_LREC08.pdf

M. Vacher, B. Lecouteux, P. Chahuara, F. Portet, B. Meillon et al., The Sweet-Home speech and multimodal corpus for home automation interaction, Proc. 9th Int. Conf. on Language Resources and Evaluation (LREC), pp.4499-4509, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00953006

M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi et al., The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.275-282, 2015.
DOI : 10.1109/ASRU.2015.7404805

N. Bertin, E. Camberlein, E. Vincent, R. Lebarbenchon, S. Peillon et al., A French Corpus for Distant-Microphone Speech Processing in Real Homes, Interspeech 2016, pp.2781-2785, 2016.
DOI : 10.21437/Interspeech.2016-1384

URL : https://hal.archives-ouvertes.fr/hal-01343060

W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer et al., Achieving human parity in conversational speech recognition, 2017.
DOI : 10.1109/TASLP.2017.2756440

G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas et al., English Conversational Telephone Speech Recognition by Humans and Machines, Interspeech 2017, 2017.
DOI : 10.21437/Interspeech.2017-405

URL : http://arxiv.org/pdf/1703.02136

J. J. Godfrey, E. C. Holliman, and J. McDaniel, SWITCHBOARD: Telephone speech corpus for research and development, Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp.517-520, 1992.
DOI : 10.1109/ICASSP.1992.225858

M. Harper, The automatic speech recognition in reverberant environments (ASpIRE) challenge, Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.547-554, 2015.

A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart et al., The ICSI Meeting Corpus, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.364-367, 2003.
DOI : 10.1109/ICASSP.2003.1198793

D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu et al., The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms, Language Resources and Evaluation, vol.41, issue.3-4, 2007.
DOI : 10.1007/s10579-007-9054-4

S. Renals, T. Hain, and H. Bourlard, Interpretation of multiparty meetings: the AMI and AMIDA projects, 2008 Hands-Free Speech Communication and Microphone Arrays (HSCMA), pp.115-118, 2008.
DOI : 10.1109/HSCMA.2008.4538700

A. Stupakov, E. Hanusa, D. Vijaywargi, D. Fox, and J. Bilmes, The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments, Computer Speech & Language, vol.26, issue.1, pp.52-66, 2011.
DOI : 10.1016/j.csl.2010.12.003

C. Fox, Y. Liu, E. Zwyssig, and T. Hain, The Sheffield wargames corpus, Proc. Interspeech, pp.1116-1120, 2013.

C. Knapp and G. Carter, The generalized correlation method for estimation of time delay, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.24, issue.4, pp.320-327, 1976.
DOI : 10.1109/TASSP.1976.1162830

X. Anguera, C. Wooters, and J. Hernando, Acoustic Beamforming for Speaker Diarization of Meetings, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.7, pp.2011-2023, 2007.
DOI : 10.1109/TASL.2007.902460

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.

J. R. Novak, N. Minematsu, and K. Hirose, WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding, Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pp.45-49, 2012.

J. Wu and S. Khudanpur, Building a topic-dependent maximum entropy model for very large corpora, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.777-780, 2002.
DOI : 10.1109/ICASSP.2002.5743833

T. Alumäe and M. Kurimo, Efficient estimation of maximum entropy language models with N-gram features: an SRILM extension, Proc. Interspeech, 2010.

A. Stolcke, SRILM - an extensible language modeling toolkit, Proc. Interspeech, pp.901-904, 2002.

V. Peddinti, V. Manohar, Y. Wang, D. Povey, and S. Khudanpur, Far-Field ASR Without Parallel Data, Interspeech 2016, pp.1996-2000, 2016.
DOI : 10.21437/Interspeech.2016-1475

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar et al., Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI, Interspeech 2016, pp.2751-2755, 2016.
DOI : 10.21437/Interspeech.2016-595

S. Tokui, K. Oono, S. Hido, and J. Clayton, Chainer: a next-generation open source framework for deep learning, Proc. Workshop on Machine Learning Systems (LearningSys) at the 29th Conference on Neural Information Processing Systems (NIPS), 2015.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang et al., Automatic differentiation in PyTorch, Proc. NIPS 2017 Workshop on The Future of Gradient-based Machine Learning Software and Techniques (Autodiff), 2017.

S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, Hybrid CTC/Attention Architecture for End-to-End Speech Recognition, IEEE Journal of Selected Topics in Signal Processing, vol.11, issue.8, pp.1240-1253, 2017.
DOI : 10.1109/JSTSP.2017.2763455