Conference papers

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Abstract : Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multimodal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
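The contrastive training idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the code below assumes a common in-batch softmax formulation over dot-product scores; the function names `contrastive_loss` and `predict_answer` and the plain NumPy arrays standing in for the two transformers' output embeddings are hypothetical and not taken from the paper.

```python
import numpy as np


def contrastive_loss(vq_emb, ans_emb):
    """In-batch softmax contrastive loss (an assumed formulation).

    vq_emb:  (B, D) embeddings from a video-question encoder.
    ans_emb: (B, D) embeddings from an answer encoder.
    Row i of each array is a matched pair; all other rows in the
    batch act as negatives.
    """
    scores = vq_emb @ ans_emb.T                           # (B, B) similarities
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the matched answer for each video-question.
    return float(-np.mean(np.diag(log_probs)))


def predict_answer(vq_emb, vocab_emb):
    """Pick, for each video-question, the index of the highest-scoring answer.

    vocab_emb: (V, D) embeddings of every candidate answer.
    """
    return np.argmax(vq_emb @ vocab_emb.T, axis=1)
```

Because answers are scored by comparing embeddings rather than by a fixed classification layer, a scheme of this kind can rank any answer that can be embedded, which is one way to accommodate an open answer vocabulary.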
Contributor : Antoine Yang
Submitted on : Monday, August 30, 2021 - 12:17:47 PM
Last modification on : Wednesday, June 8, 2022 - 12:50:06 PM


Files produced by the author(s)


  • HAL Id : hal-03328749, version 1
  • ARXIV : 2012.00451



Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid. Just Ask: Learning to Answer Questions from Millions of Narrated Videos. ICCV 2021 - IEEE International Conference on Computer Vision, Oct 2021, Montréal, Canada. ⟨hal-03328749⟩


