N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma, Finding approximate local minima faster than gradient descent, STOC, 2017. DOI: 10.1145/3055399.3055464. URL: http://arxiv.org/pdf/1611.01146

D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg et al., Deep Speech 2: End-to-end speech recognition in English and Mandarin, ICML, 2016.

A. Anandkumar and R. Ge, Efficient approaches for escaping higher order saddle points in non-convex optimization, COLT, 2016.

D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta et al., Globally normalized transition-based neural networks, ACL, 2016. DOI: 10.18653/v1/p16-1231

D. Arpit, S. K. Jastrzebski, N. Ballas, D. Krueger, E. Bengio et al., A closer look at memorization in deep networks, ICML, 2017.

J. Ba and R. Caruana, Do deep nets really need to be deep?, NIPS, 2014.

L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, Maximum mutual information estimation of hidden Markov model parameters for speech recognition, ICASSP, 1986.

P. L. Bartlett, For valid generalization the size of the weights is more important than the size of the network, NIPS, 1996.

Y. Bengio and Y. LeCun, Scaling learning algorithms towards AI, Large-Scale Kernel Machines, vol.34, pp.1-41, 2007.

M. Bianchini and F. Scarselli, On the complexity of neural network classifiers: A comparison between shallow and deep architectures, IEEE Trans. Neural Netw. Learning Syst, vol.25, issue.8, pp.1553-1565, 2014.

L. Bottou, O. Chapelle, D. Decoste, and J. Weston, Large-Scale Kernel Machines, 2007.

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, ICASSP, pp.4960-4964, 2016.

J. Chen, L. Wu, K. Audhkhasi, B. Kingsbury, and B. Ramabhadran, Efficient one-vs-one kernel ridge regression for speech recognition, ICASSP, 2016.

C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen et al., State-of-the-art speech recognition with sequence-to-sequence models, ICASSP, 2018.

A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun, The loss surfaces of multilayer networks, AISTATS, 2015.

K. L. Clarkson, Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm, ACM Trans. Algorithms, vol.6, issue.4, 2010.

G. Cybenko, Approximation by superpositions of a sigmoidal function, MCSS, vol.2, issue.4, pp.303-314, 1989.

G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Trans. Audio, Speech & Language Processing, vol.20, issue.1, pp.30-42, 2012. DOI: 10.1109/tasl.2011.2134090. URL: http://www.cs.toronto.edu/%7Egdahl/papers/DRAFT_DBN4LVCSR-TransASLP.pdf

B. Dai, B. Xie, N. He, Y. Liang, A. Raj et al., Scalable kernel methods via doubly stochastic gradients, NIPS, 2014.

Y. N. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS, 2014.

D. Decoste and B. Schölkopf, Training invariant support vector machines, Machine Learning, vol.46, pp.161-190, 2002.

N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Trans. Audio, Speech & Language Processing, vol.19, issue.4, pp.788-798, 2011.

J. C. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research, vol.10, pp.2899-2934, 2009.

J. Fiscus, G. Doddington, A. Le, G. Sanders, M. Przybocki et al., NIST Rich Transcription evaluation data. Linguistic Data Consortium, 2003.

M. J. F. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech & Language, vol.12, issue.2, pp.75-98, 1998.

M. J. F. Gales and S. J. Young, The application of hidden Markov models in speech recognition, Foundations and Trends in Signal Processing, vol.1, issue.3, pp.195-304, 2007.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett et al., TIMIT acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993.

M. Gibson and T. Hain, Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition, INTERSPEECH, 2006.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS, 2010.

I. J. Goodfellow, Y. Bengio, and A. C. Courville, Deep Learning, Adaptive Computation and Machine Learning series, MIT Press, 2016.

A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks, ICML, 2006.

R. Hamid, Y. Xiao, A. Gittens, and D. Decoste, Compact random feature maps, ICML, 2014.

S. Han, J. Pool, J. Tran, and W. J. Dally, Learning both weights and connections for efficient neural network, NIPS, 2015.

W. K. Härdle, M. Müller, S. Sperlich, and A. Werwatz, Nonparametric and semiparametric models, 2004.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, 2016. DOI: 10.1109/cvpr.2016.90. URL: http://arxiv.org/pdf/1512.03385

G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, vol.29, issue.6, pp.82-97, 2012.

K. Hornik, M. B. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, vol.2, issue.5, pp.359-366, 1989. DOI: 10.1016/0893-6080(89)90020-8

P. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran, Kernel methods match deep neural networks on TIMIT, ICASSP, 2014.

G. J. Jameson, A simple proof of Stirling's formula for the gamma function, The Mathematical Gazette, vol.99, pp.68-74, 2015.

J. Kaiser, B. Horvat, and Z. Kacic, A novel loss function for the overall risk criterion based discriminative training of HMM models, INTERSPEECH, 2000.

P. Kar and H. Karnick, Random feature maps for dot product kernels, AISTATS, 2012.

B. Kingsbury, Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling, ICASSP, 2009.

B. Kingsbury, J. Cui, X. Cui, M. J. F. Gales et al., A high-performance Cantonese keyword search system, ICASSP, 2013.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012. DOI: 10.1145/3065386. URL: http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

Q. V. Le, T. Sarlós, and A. J. Smola, Fastfood - computing Hilbert space expansions in loglinear time, ICML, 2013.

Z. Lu, D. Guo, A. Bagheri-Garakani, K. Liu, A. May et al., A comparison between deep neural nets and kernel acoustic models for speech recognition, ICASSP, 2016. URL: https://hal.archives-ouvertes.fr/hal-01329772

A. May, M. Collins, D. J. Hsu, and B. Kingsbury, Compact kernel models for acoustic modeling via random feature selection, ICASSP, 2016.

C. A. Micchelli, Y. Xu, and H. Zhang, Universal kernels, Journal of Machine Learning Research, vol.7, pp.2651-2667, 2006.

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, Recurrent neural network based language model, INTERSPEECH, 2010.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, ICLR Workshop, 2013.

A. Mohamed, G. E. Dahl, and G. E. Hinton, Acoustic modeling using deep belief networks, IEEE Trans. Audio, Speech & Language Processing, vol.20, issue.1, pp.14-22, 2012.

G. F. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, On the number of linear regions of deep neural networks, NIPS, 2014.

N. Morgan and H. Bourlard, Generalization and parameter estimation in feedforward nets: Some experiments, NIPS, 1990.

B. Neyshabur, R. Tomioka, and N. Srebro, In search of the real inductive bias: On the role of implicit regularization in deep learning, ICLR (Workshop), 2015.

J. Pennington and Y. Bahri, Geometry of neural network loss surfaces via random matrix theory, ICML, 2017.

J. Pennington, F. X. Yu, and S. Kumar, Spherical random features for polynomial kernels, NIPS, 2015.

J. C. Platt, Fast training of support vector machines using sequential minimal optimization, Advances in Kernel Methods - Support Vector Learning, 1998.

D. Povey and B. Kingsbury, Evaluation of proposed modifications to MPE for large scale discriminative training, ICASSP, 2007.

D. Povey and P. C. Woodland, Minimum phone error and I-smoothing for improved discriminative training, ICASSP, 2002.

D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon et al., Boosted MMI for model and feature-space discriminative training, ICASSP, 2008.

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar et al., Purely sequence-trained neural networks for ASR based on lattice-free MMI, INTERSPEECH, 2016.

A. Rahimi and B. Recht, Random features for large-scale kernel machines, NIPS, 2007.

T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novák, and A. Mohamed, Making deep belief networks effective for large vocabulary continuous speech recognition, ASRU, 2011.

T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, Low-rank matrix factorization for deep neural network training with high-dimensional output targets, ICASSP, 2013.

T. N. Sainath, B. Kingsbury, H. Soltau, and B. Ramabhadran, Optimization techniques to improve training speed of deep neural networks for large speech tasks, IEEE Trans. Audio, Speech & Language Processing, vol.21, issue.11, pp.2267-2276, 2013.

T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, Deep convolutional neural networks for LVCSR, ICASSP, 2013. DOI: 10.1109/icassp.2013.6639347

H. Sak, A. W. Senior, and F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, INTERSPEECH, 2014.

G. Saon, T. Sercu, S. J. Rennie, and H. Kuo, The IBM 2016 English conversational telephone speech recognition system, INTERSPEECH, 2016.

G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas et al., English conversational telephone speech recognition by humans and machines, INTERSPEECH, 2017. DOI: 10.21437/interspeech.2017-405. URL: http://arxiv.org/pdf/1703.02136

B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, 2002.

F. Seide, G. Li, X. Chen, and D. Yu, Feature engineering in context-dependent deep neural networks for conversational speech transcription, ASRU, 2011.

F. Seide, G. Li, and D. Yu, Conversational speech transcription using context-dependent deep neural networks, INTERSPEECH, 2011.

T. Sercu and V. Goel, Advances in very deep convolutional neural networks for LVCSR, INTERSPEECH, 2016.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

H. Soltau, G. Saon, and B. Kingsbury, The IBM Attila speech recognition toolkit, SLT, 2010.

H. Soltau, G. Saon, and T. N. Sainath, Joint training of convolutional and non-convolutional neural networks, ICASSP, 2014.

S. Sonnenburg and V. Franc, COFFIN: a computational framework for linear SVMs, ICML, 2010.

I. Steinwart, Sparseness of support vector machines - some asymptotically sharp bounds, NIPS, 2003.

N. Ström, Sparse connection and pruning in large dynamic artificial neural networks, EUROSPEECH, 1997.

M. Sundermeyer, R. Schlüter, and H. Ney, LSTM neural networks for language modeling, INTERSPEECH, 2012.

I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, NIPS, 2014.

I. W. Tsang, J. T. Kwok, and P. Cheung, Core vector machines: Fast SVM training on very large data sets, Journal of Machine Learning Research, vol.6, pp.363-392, 2005.

V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, MMIE training of large vocabulary recognition systems, Speech Communication, vol.22, issue.4, pp.303-314, 1997.

E. van den Berg, B. Ramabhadran, and M. Picheny, Training variance and performance evaluation of neural networks in speech, 2017.

A. Vedaldi and A. Zisserman, Efficient additive kernels via explicit feature maps, IEEE Trans. Pattern Anal. Mach. Intell, vol.34, issue.3, pp.480-492, 2012. DOI: 10.1109/tpami.2011.153

K. Veselý, A. Ghoshal, L. Burget, and D. Povey, Sequence-discriminative training of deep neural networks, INTERSPEECH, 2013.

C. K. I. Williams and M. Seeger, Using the Nyström method to speed up kernel machines, NIPS, 2000.

B. Xie, Y. Liang, and L. Song, Diverse neural network learns true target functions, AISTATS, 2017.

W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer et al., Toward human parity in conversational speech recognition, IEEE/ACM Trans. Audio, Speech & Language Processing, vol.25, issue.12, pp.2410-2423, 2017.

J. Xue, J. Li, and Y. Gong, Restructuring of deep neural network acoustic models with singular value decomposition, INTERSPEECH, 2013.

Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. J. Smola et al., Deep fried convnets, ICCV, 2015. DOI: 10.1109/iccv.2015.173

I. E. Yen, T. Lin, S. Lin, P. Ravikumar, and I. S. Dhillon, Sparse random feature algorithm as coordinate descent in Hilbert space, NIPS, 2014.

F. X. Yu, S. Kumar, H. A. Rowley, and S. Chang, Compact nonlinear maps and circulant extensions, 2015.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, ICLR, 2017.