O. Abdel-hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn et al., Convolutional Neural Networks for Speech Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.10, pp.1533-1545, 2014.
DOI : 10.1109/TASLP.2014.2339736

J. Ba and R. Caruana, Do deep nets really need to be deep?, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pp.2654-2662, 2014.

L. Bahl, P. Brown, P. de Souza, and R. Mercer, Maximum mutual information estimation of hidden Markov model parameters for speech recognition, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.49-52, 1986.
DOI : 10.1109/ICASSP.1986.1169179

P. L. Bartlett, O. Bousquet, and S. Mendelson, Localized Rademacher complexities, Proceedings of the 15th Annual Conference on Computational Learning Theory, COLT '02, pp.44-58, 2002.

M. Bianchini and F. Scarselli, On the Complexity of Neural Network Classifiers: A Comparison Between Shallow and Deep Architectures, IEEE Transactions on Neural Networks and Learning Systems, vol.25, issue.8, pp.1553-1565, 2014.
DOI : 10.1109/TNNLS.2013.2293637

C. Cheng and B. Kingsbury, Arccosine kernels: Acoustic modeling with infinite neural networks, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5200-5203, 2011.
DOI : 10.1109/ICASSP.2011.5947529

K. L. Clarkson, Coresets, Sparse Greedy Approximation, and the Frank-Wolfe Algorithm, ACM Transactions on Algorithms, vol.6, issue.4, pp.63:1-63:30, 2010.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems, vol.2, issue.4, pp.303-314, 1989.

G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.1, pp.30-42, 2012.
DOI : 10.1109/tasl.2011.2134090

B. Dai, B. Xie, N. He, Y. Liang, A. Raj et al., Scalable kernel methods via doubly stochastic gradients, Advances in Neural Information Processing Systems 27, 2014.

D. Decoste and B. Schölkopf, Training Invariant Support Vector Machines, Machine Learning, vol.46, issue.1/3, pp.161-190, 2002.
DOI : 10.1023/A:1012454411458

L. Deng, G. Tür, X. He, and D. Z. Hakkani-Tür, Use of kernel deep convex networks and end-to-end learning for spoken language understanding, 2012 IEEE Spoken Language Technology Workshop (SLT), pp.210-215, 2012.
DOI : 10.1109/SLT.2012.6424224

J. C. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research, vol.10, pp.2899-2934, 2009.

J. Fiscus, G. Doddington, A. Le, G. Sanders, M. Przybocki et al., NIST Rich Transcription evaluation data. https://catalog.ldc.upenn, Linguistic Data Consortium Catalog No. LDC2007S10, 2003.

M. Gales and S. Young, The Application of Hidden Markov Models in Speech Recognition, Foundations and Trends® in Signal Processing, vol.1, issue.3, pp.195-304, 2007.
DOI : 10.1561/2000000004

M. J. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech & Language, vol.12, issue.2, pp.75-98, 1998.
DOI : 10.1006/csla.1998.0043

URL : http://svr-www.eng.cam.ac.uk/~mjfg/lintran_CSL.ps.gz

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett et al., DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993.

M. Gibson and T. Hain, Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition, INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, 2006.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, pp.249-256, 2010.

R. Hamid, Y. Xiao, A. Gittens, and D. Decoste, Compact random feature maps, Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pp.21-26, 2014.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, IEEE Signal Processing Magazine, vol.29, issue.6, pp.82-97, 2012.
DOI : 10.1109/MSP.2012.2205597

G. E. Hinton, S. Osindero, and Y. W. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, vol.18, issue.7, pp.1527-1554, 2006.
DOI : 10.1162/neco.2006.18.7.1527

URL : http://www.cs.berkeley.edu/~ywteh/research/ebm/nc2006.pdf

P.-S. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran, Kernel methods match deep neural networks on TIMIT, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.205-209, 2014.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pp.6-11, 2015.

G. J. Jameson, A simple proof of Stirling's formula for the gamma function, The Mathematical Gazette, vol.115, issue.544, pp.68-74, 2015.
DOI : 10.2307/2323256

J. Kaiser, B. Horvat, and Z. Kacic, A novel loss function for the overall risk criterion based discriminative training of HMM models, Sixth International Conference on Spoken Language Processing, pp.887-890, 2000.

P. Kar and H. Karnick, Random feature maps for dot product kernels, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, pp.583-591, 2012.

B. Kingsbury, Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3761-3764, 2009.
DOI : 10.1109/ICASSP.2009.4960445

Q. V. Le, T. Sarlós, and A. J. Smola, Fastfood - approximating kernel expansions in loglinear time, Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pp.16-21, 2013.

N. Le-roux, M. W. Schmidt, and F. R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting, pp.2672-2680, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00674995

Z. Lu, D. Guo, A. Bagheri Garakani, K. Liu, A. May et al., A comparison between deep neural nets and kernel acoustic models for speech recognition, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5070-5074, 2016.
DOI : 10.1109/ICASSP.2016.7472643

URL : https://hal.archives-ouvertes.fr/hal-01329772

A. May, M. Collins, D. J. Hsu, and B. Kingsbury, Compact kernel models for acoustic modeling via random feature selection, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.2424-2428, 2016.
DOI : 10.1109/ICASSP.2016.7472112

A. Mohamed, G. E. Dahl, and G. Hinton, Acoustic Modeling Using Deep Belief Networks, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.1, pp.14-22, 2012.
DOI : 10.1109/TASL.2011.2109382

G. F. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, On the number of linear regions of deep neural networks, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pp.2924-2932, 2014.

N. Morgan and H. Bourlard, Generalization and parameter estimation in feedforward nets: Some experiments, Advances in Neural Information Processing Systems 2, 1990.

J. Pennington, F. X. Yu, and S. Kumar, Spherical random features for polynomial kernels, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp.1846-1854, 2015.

J. C. Platt, Fast Training of Support Vector Machines using Sequential Minimal Optimization, Advances in Kernel Methods - Support Vector Learning, 1998.

D. Povey and P. C. Woodland, Minimum phone error and i-smoothing for improved discriminative training, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.105-108, 2002.
DOI : 10.1109/icassp.2002.1005687

URL : https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester1_2007_8/povey_mpe.pdf

D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon et al., Boosted MMI for model and feature-space discriminative training, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4057-4060, 2008.
DOI : 10.1109/ICASSP.2008.4518545

D. Povey and B. Kingsbury, Evaluation of Proposed Modifications to MPE for Large Scale Discriminative Training, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '07, pp.321-324, 2007.
DOI : 10.1109/ICASSP.2007.366914

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar et al., Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI, Interspeech 2016, 2016.
DOI : 10.21437/Interspeech.2016-595

A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems 20: Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, pp.1177-1184, 2007.

A. Rahimi and B. Recht, Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, Advances in Neural Information Processing Systems 21: Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, pp.1313-1320, 2008.

T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arısoy, and B. Ramabhadran, Low-rank Matrix Factorization for Deep Neural Network Training with High-dimensional Output Targets, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6655-6659, 2013.

T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak et al., Making Deep Belief Networks effective for large vocabulary continuous speech recognition, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pp.30-35, 2011.
DOI : 10.1109/ASRU.2011.6163900

URL : http://www.cs.toronto.edu/%7Easamir/papers/asru11.pdf

T. N. Sainath, B. Kingsbury, H. Soltau, and B. Ramabhadran, Optimization techniques to improve training speed of deep neural networks for large speech tasks, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.11, pp.2267-2276, 2013.

H. Sak, A. W. Senior, and F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, pp.338-342, 2014.

B. Schölkopf and A. Smola, Learning with kernels, 2002.

F. Seide, G. Li, X. Chen, and D. Yu, Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pp.24-29, 2011.
DOI : 10.1109/ASRU.2011.6163899

F. Seide, G. Li, and D. Yu, Conversational speech transcription using context-dependent deep neural networks, 12th Annual Conference of the International Speech Communication Association, pp.437-440, 2011.

H. Soltau, G. Saon, and B. Kingsbury, The IBM Attila speech recognition toolkit, 2010 IEEE Spoken Language Technology Workshop, pp.97-102, 2010.
DOI : 10.1109/SLT.2010.5700829

S. Sonnenburg and V. Franc, COFFIN: A computational framework for linear SVMs, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp.999-1006, 2010.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol.15, issue.1, pp.1929-1958, 2014.

I. Steinwart, Sparseness of support vector machines - some asymptotically sharp bounds, Advances in Neural Information Processing Systems 16, 2004.

I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, On the importance of initialization and momentum in deep learning, Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pp.16-21, 2013.

I. W. Tsang, J. T. Kwok, and P. Cheung, Core Vector Machines: Fast SVM Training on Very Large Data Sets, Journal of Machine Learning Research, vol.6, pp.363-392, 2005.

V. Valtchev, J. Odell, P. Woodland, and S. Young, MMIE training of large vocabulary recognition systems, Speech Communication, vol.22, issue.4, pp.303-314, 1997.
DOI : 10.1016/S0167-6393(97)00029-0

A. Vedaldi and A. Zisserman, Efficient Additive Kernels via Explicit Feature Maps, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.3, pp.480-492, 2012.
DOI : 10.1109/TPAMI.2011.153

URL : http://eprints.pascal-network.org/archive/00006964/01/vedaldi10.pdf

K. Veselý, A. Ghoshal, L. Burget, and D. Povey, Sequence-discriminative training of deep neural networks, 14th Annual Conference of the International Speech Communication Association, pp.2345-2349, 2013.

C. K. Williams and M. Seeger, Using the Nyström method to speed up kernel machines, Advances in Neural Information Processing Systems 13, pp.682-688, 2001.

W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer et al., Achieving human parity in conversational speech recognition
DOI : 10.1109/taslp.2017.2756440

E. Yen, T. Lin, S. Lin, P. K. Ravikumar, and I. S. Dhillon, Sparse random feature algorithm as coordinate descent in Hilbert space, Advances in Neural Information Processing Systems 27, 2014.

F. X. Yu, S. Kumar, H. A. Rowley, and S. Chang, Compact nonlinear maps and circulant extensions