F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with Sparsity-Inducing Penalties, Foundations and Trends in Machine Learning, vol.4, issue.1, pp.1-106, 2012.
DOI : 10.1561/2200000015

URL : https://hal.archives-ouvertes.fr/hal-00613125

A. Beck and M. Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, SIAM Journal on Imaging Sciences, vol.2, issue.1, pp.183-202, 2009.
DOI : 10.1137/080716542

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A Neural Probabilistic Language Model, Journal of Machine Learning Research, vol.3, pp.1137-1155, 2003.

URL : https://hal.archives-ouvertes.fr/hal-01434258

P. Brucker, An O(n) algorithm for quadratic knapsack problems, Operations Research Letters, vol.3, issue.3, pp.163-166, 1984.
DOI : 10.1016/0167-6377(84)90010-5

L. Burget, P. Matejka, P. Schwarz, O. Glembek, and J. H. Cernocky, Analysis of Feature Extraction and Channel Compensation in a GMM Speaker Recognition System, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.7, pp.1979-1986, 2007.
DOI : 10.1109/TASL.2007.902499

Y. Chang and M. Collins, Exact decoding of phrase-based translation models through Lagrangian relaxation, Proc. Conf. Empirical Methods for Natural Language Processing, pp.26-37, 2011.

S. F. Chen and R. Rosenfeld, A survey of smoothing techniques for ME models, IEEE Transactions on Speech and Audio Processing, vol.8, issue.1, pp.37-50, 2000.
DOI : 10.1109/89.817452

T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, 1990.

S. Della Pietra, V. Della Pietra, and J. Lafferty, Inducing features of random fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.19, issue.4, pp.380-393, 1997.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, Efficient projections onto the ℓ1-ball for learning in high dimensions, Proc. 25th Int. Conf. Machine Learning, 2008.

R. Giegerich and S. Kurtz, From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction, Algorithmica, 1997.

J. Goodman, A bit of progress in language modelling, Computer Speech and Language, pp.403-434, 2001.

J. Goodman, Exponential priors for maximum entropy models, Proc. North American Chapter of the Association for Computational Linguistics, 2004.

C. Hu, J. T. Kwok, and W. Pan, Accelerated gradient methods for stochastic optimization and online learning, Advances in Neural Information Processing Systems, 2009.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach, Proximal methods for hierarchical sparse coding, Journal of Machine Learning Research, vol.12, pp.2297-2334, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00516723

R. Kneser and H. Ney, Improved backing-off for M-gram language modeling, 1995 International Conference on Acoustics, Speech, and Signal Processing, 1995.
DOI : 10.1109/ICASSP.1995.479394

A. F. Martins, N. A. Smith, P. M. Aguiar, and M. A. Figueiredo, Structured sparsity in structured prediction, Proc. Conf. Empirical Methods for Natural Language Processing, pp.1500-1511, 2011.

P. McCullagh and J. Nelder, Generalized Linear Models, Chapman and Hall, 1989.

A. Mnih and G. Hinton, Three new graphical models for statistical language modelling, Proceedings of the 24th international conference on Machine learning, ICML '07, 2007.
DOI : 10.1145/1273496.1273577

A. Mnih and G. Hinton, A scalable hierarchical distributed language model, Advances in Neural Information Processing Systems, 2008.

Y. Nesterov, Gradient methods for minimizing composite objective function, CORE Discussion Paper, 2007.

B. Roark, M. Saraclar, M. Collins, and M. Johnson, Discriminative language modeling with conditional random fields and the perceptron algorithm, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL '04, 2004.
DOI : 10.3115/1218955.1218962

A. Stolcke, SRILM - an extensible language modeling toolkit, Proc. Int. Conf. Spoken Language Processing, pp.901-904, 2002.

E. Ukkonen, On-line construction of suffix trees, Algorithmica, 1995.

S. Vargas, P. Castells, and D. Vallet, Explicit relevance models in intent-oriented information retrieval diversification, Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR '12, pp.75-84, 2012.
DOI : 10.1145/2348283.2348297

F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh, A stochastic memoizer for sequence data, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553518

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.149.7670

F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh, The sequence memoizer, Communications of the ACM, vol.54, issue.2, pp.91-98, 2011.
DOI : 10.1145/1897816.1897842

J. Wu and S. Khudanpur, Efficient training methods for maximum entropy language modeling, Proc. 6th Int. Conf. Spoken Language Processing, pp.114-117, 2000.

P. Zhao, G. Rocha, and B. Yu, The composite absolute penalties family for grouped and hierarchical variable selection, The Annals of Statistics, vol.37, pp.3468-3497, 2009.