R. C. Moore and W. Lewis, Intelligent selection of language model training data, Proceedings of the ACL 2010 Conference Short Papers, pp.220-224, 2010.

L. Lamel, J. Gauvain, V. Le, I. Oparin, and S. Meng, Improved models for Mandarin speech-to-text transcription, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4660-4663, 2011.
DOI : 10.1109/ICASSP.2011.5947394

A. Rousseau, P. Deléglise, and Y. Estève, Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks, Proc. of LREC, pp.3935-3939, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433246

B. Dalvi, C. Xiong, and J. Callan, A language modeling approach to entity recognition and disambiguation for search queries, Proceedings of the first international workshop on Entity recognition & disambiguation, ERD '14, pp.45-54, 2014.
DOI : 10.1145/2633211.2634347

P. Koehn and B. Haddow, Towards effective use of training data in statistical machine translation, Proceedings of the Seventh Workshop on Statistical Machine Translation, pp.317-321, 2012.

P. Goyal, L. Behera, and T. M. McGinnity, A novel neighborhood based document smoothing model for information retrieval, Information Retrieval, vol.43, issue.1, pp.391-425, 2013.
DOI : 10.1007/s10791-012-9202-3

R. D. Brown, Finding and identifying text in 900+ languages, Digital Investigation, pp.34-43, 2012.
DOI : 10.1016/j.diin.2012.05.004

M. Hamdani, P. Doetsch, M. Kozielski, A. E. Mousa, and H. Ney, The RWTH Large Vocabulary Arabic Handwriting Recognition System, 2014 11th IAPR International Workshop on Document Analysis Systems, pp.111-115, 2014.
DOI : 10.1109/DAS.2014.61

R. Rosenfeld, Two decades of statistical language modeling: where do we go from here?, Proceedings of the IEEE, vol.88, issue.8, 2000.
DOI : 10.1109/5.880083

L. R. Rabiner and B. Juang, Statistical methods for the recognition and understanding of speech, Encyclopedia of language and linguistics, 2004.

S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J. Bonastre et al., The ESTER Phase II evaluation campaign for the rich transcription of French broadcast news, Interspeech, pp.1149-1152, 2005.

S. Galliano, G. Gravier, and L. Chaubard, The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts, Interspeech, pp.2583-2586, 2009.

G. Gravier, G. Adda, N. Paulson, M. Carré, A. Giraudel et al., The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, LREC-Eighth international conference on Language Resources and Evaluation, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00712591

Y. Estève, T. Bazillon, J. Antoine, F. Béchet, and J. Farinas, The EPAC corpus: Manual and automatic annotations of conversational speech in French broadcast news, LREC, 2010.
URL : https://hal.archives-ouvertes.fr/hal-01433895

T. Bazillon, Transcription et traitement manuel de la parole spontanée pour sa reconnaissance automatique, 2011.

D. Klakow, Selecting articles from the language model training corpus, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), pp.1695-1698, 2000.
DOI : 10.1109/ICASSP.2000.862077

X. Shen and B. Xu, The study of the effect of training set on statistical language modeling, INTERSPEECH, pp.721-724, 2001.

H. Wang, J. Gao, K. Lee, and M. L. Li, A unified approach to statistical language modeling for Chinese, 2000.

J. Gao, J. Goodman, M. Li, and K. Lee, Toward a unified approach to statistical language modeling for Chinese, ACM Transactions on Asian Language Information Processing, vol.1, issue.1, pp.3-33, 2002.
DOI : 10.1145/595576.595578

K. Yasuda, R. Zhang, H. Yamamoto, and E. Sumita, Method of selecting training data to build a compact and efficient translation model, IJCNLP, pp.655-660, 2008.

G. Foster, C. Goutte, and R. Kuhn, Discriminative instance weighting for domain adaptation in statistical machine translation, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp.451-459, 2010.

A. Axelrod, X. He, and J. Gao, Domain adaptation via pseudo in-domain data selection, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.355-362, 2011.

H. Schwenk, A. Rousseau, and M. Attik, Large, pruned or continuous space language models on a GPU for statistical machine translation, Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pp.11-19, 2012.

A. Mendonça, D. Graff, and D. DiPersio, French Gigaword Second Edition, 2009.

D. Jouvet and D. Langlois, A Machine Learning Based Approach for Vocabulary Selection for Speech Transcription, Text, Speech, and Dialogue, pp.60-67, 2013.
DOI : 10.1007/978-3-642-40585-3_9
URL : https://hal.archives-ouvertes.fr/hal-00834302

A. Stolcke, SRILM - an extensible language modeling toolkit, INTERSPEECH, 2002.

A. Stolcke, J. Zheng, W. Wang, and V. Abrash, SRILM at sixteen: Update and outlook, Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, p.5, 2011.

S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Proceedings of the 34th annual meeting on Association for Computational Linguistics, pp.310-318, 1996.