A. Burns, R. Tan, K. Saenko, S. Sclaroff, and B. A. Plummer, Language Features Matter: Effective language representations for vision-language tasks, 2019.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, 2017.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.

V. Gabeur, C. Sun, K. Alahari, and C. Schmid, CVPR 2020 video pentathlon challenge: Multi-modal transformer for video retrieval, CVPR Video Pentathlon Workshop, 2020.

D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba et al., Jointly discovering visual objects and spoken words from raw sensory input, 2018.

S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen et al., CNN architectures for large-scale audio classification, 2017.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol. 9, no. 8, 1997.

J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, Squeeze-and-excitation networks, IEEE Trans. Pattern Analysis and Machine Intelligence, 2019.

G. Huang, Z. Liu, and K. Q. Weinberger, Densely connected convolutional networks, 2016.

A. Karpathy, A. Joulin, and L. Fei-Fei, Deep fragment embeddings for bidirectional image sentence mapping, 2014.

B. Klein, G. Lev, G. Sadeh, and L. Wolf, Associating neural word embeddings with deep image representations using Fisher vectors, 2015.

R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, Dense-captioning events in videos, 2017.

K. H. Lee, X. Chen, G. Hua, H. Hu, and X. He, Stacked cross attention for image-text matching, 2018.

Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman, Use what you have: Video retrieval using representations from collaborative experts, 2019.

A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic et al., End-to-end learning of visual representations from uncurated instructional videos, 2019.

A. Miech, I. Laptev, and J. Sivic, Learning a text-video embedding from incomplete and heterogeneous data, 2018. URL: https://hal.archives-ouvertes.fr/hal-01975102

A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev et al., HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips, 2019. URL: https://hal.archives-ouvertes.fr/hal-02433497

T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, Efficient estimation of word representations in vector space, 2013.

N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury, Learning joint embedding with multimodal cues for cross-modal video-text retrieval, 2018.

N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury, Joint embeddings with multimodal cues for video-text retrieval, IJMIR, 2019.

A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, A dataset for movie description, 2015.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, 2014.

C. Sun, F. Baradel, K. Murphy, and C. Schmid, Learning video representations using contrastive bidirectional transformer, arXiv:1906.05743, 2019.

C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, VideoBERT: A joint model for video and language representation learning, 2019.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, 2017.

M. Wray, D. Larlus, G. Csurka, and D. Damen, Fine-grained action retrieval through multiple parts-of-speech embeddings, 2019.

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi et al., Google's neural machine translation system: Bridging the gap between human and machine translation, 2016.

S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification, 2017.

E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, Distance metric learning, with application to clustering with side-information, 2002.

J. Xu, T. Mei, T. Yao, and Y. Rui, MSR-VTT: A large video description dataset for bridging video and language, 2016.

Y. Yu, J. Kim, and G. Kim, A joint sequence fusion model for video question answering and retrieval, 2018.

Y. Yu, H. Ko, J. Choi, and G. Kim, End-to-end concept word detection for video captioning, retrieval, and question answering, 2017.

B. Zhang, H. Hu, and F. Sha, Cross-modal and hierarchical modeling of video and text, 2018.

Y. Zhang, R. Jin, and Z.-H. Zhou, Understanding bag-of-words model: A statistical framework, International Journal of Machine Learning and Cybernetics, vol. 1, pp. 43–52, 2010.

B. Zhou, À. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, Places: A 10 million image database for scene recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 40, pp. 1452–1464, 2018.