Language Features Matter: Effective language representations for vision-language tasks, 2019. ,
Quo vadis, action recognition? A new model and the kinetics dataset, 2017. ,
Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. ,
CVPR 2020 video pentathlon challenge: Multi-modal transformer for video retrieval, CVPR Video Pentathlon Workshop, 2020. ,
Jointly discovering visual objects and spoken words from raw sensory input, 2018. ,
CNN architectures for large-scale audio classification, 2017. ,
Long short-term memory, Neural Computation, vol.9, issue.8, 1997. ,
Squeeze-and-excitation networks, IEEE Trans. Pattern Analysis and Machine Intelligence, 2019. ,
Densely connected convolutional networks, 2016. ,
Deep fragment embeddings for bidirectional image sentence mapping, 2014. ,
Associating neural word embeddings with deep image representations using fisher vectors, 2015. ,
Dense-captioning events in videos, 2017. ,
Stacked cross attention for image-text matching, 2018. ,
Use what you have: Video retrieval using representations from collaborative experts, 2019. ,
, Endto-End Learning of Visual Representations from Uncurated Instructional Videos, 2019.
Learning a text-video embedding from incomplete and heterogeneous data, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01975102
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 2019. ,
URL : https://hal.archives-ouvertes.fr/hal-02433497
Efficient estimation of word representations in vector space, 2013. ,
Learning joint embedding with multimodal cues for cross-modal video-text retrieval, 2018. ,
Joint embeddings with multimodal cues for video-text retrieval, IJMIR, 2019. ,
A dataset for movie description, 2015. ,
Two-stream convolutional networks for action recognition in videos, 2014. ,
, Learning video representations using contrastive bidirectional transformer. arXiv 1906, p.5743, 2019.
Videobert: A joint model for video and language representation learning, 2019. ,
Attention is all you need, 2017. ,
Fine-grained action retrieval through multiple parts-of-speech embeddings, 2019. ,
Google's neural machine translation system: Bridging the gap between human and machine translation, 2016. ,
Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification, 2017. ,
Distance metric learning, with application to clustering with side-information, 2002. ,
MSR-VTT: A large video description dataset for bridging video and language, 2016. ,
A joint sequence fusion model for video question answering and retrieval, 2018. ,
End-to-end concept word detection for video captioning, retrieval, and question answering, 2017. ,
Cross-modal and hierarchical modeling of video and text, 2018. ,
Understanding bag-of-words model: a statistical framework, International Journal of Machine Learning and Cybernetics, vol.1, pp.43-52, 2010. ,
Places: A 10 million image database for scene recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, vol.40, pp.1452-1464, 2018. ,