R. Al-Rfou, B. Perozzi, and S. Skiena, Polyglot: Distributed word representations for multilingual NLP, Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp.183-192, 2013.

A. Baevski, S. Edunov, Y. Liu, L. Zettlemoyer, and M. Auli, Cloze-driven pretraining of self-attention networks, 2019.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics, vol.5, pp.135-146, 2017.

O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow et al., Findings of the 2018 conference on machine translation (WMT18), Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pp.272-303, 2018.

J. Callan, M. Hoy, C. Yoo, and L. Zhao, ClueWeb09 data set, 2009.

Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le et al., Transformer-XL: Attentive language models beyond a fixed-length context, 2019.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv e-prints, 2018.

E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, Learning word vectors for 157 languages, Proceedings of the 11th Language Resources and Evaluation Conference, 2018.

A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou et al., FastText.zip: Compressing text classification models, 2016.

A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, Bag of tricks for efficient text classification, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol.2, pp.427-431, 2017.

T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin, Advances in pre-training distributed word representations, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC, 2018.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, Proceedings of the 26th International Conference on Neural Information Processing Systems, vol.2, pp.3111-3119, 2013.

R. Parker, D. Graff, J. Kong, K. Chen, and K. Maeda, English Gigaword Fifth Edition, Linguistic Data Consortium, 2011.

J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.1532-1543, 2014.

M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark et al., Deep contextualized word representations, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.2227-2237, 2018.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pre-training, 2018.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei et al., Language models are unsupervised multitask learners, OpenAI Blog, vol.1, p.8, 2019.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pp.6000-6010, 2017.

Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov et al., XLNet: Generalized autoregressive pretraining for language understanding, 2019.

Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun et al., Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015 IEEE International Conference on Computer Vision, ICCV 2015, pp.19-27, 2015.