Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

Conference paper, Year: 2021



Antoine Yang

Role: Author
PersonId : 748008
IdHAL : antoine-yang
ORCID : 0000-0002-7258-571X

Models of visual object recognition and scene understanding

Antoine Miech

Role: Author

DeepMind [London]

Josef Sivic

Role: Author

Czech Institute of Informatics, Robotics and Cybernetics [Prague]

Ivan Laptev

Role: Author

Models of visual object recognition and scene understanding

Cordelia Schmid

Role: Author

Models of visual object recognition and scene understanding

Abstract

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multimodal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
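
The contrastive training idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a generic in-batch softmax contrastive loss over dot-product scores, which is not necessarily the exact formulation used in the paper. The function names (contrastive_loss, predict_answer) and the toy two-dimensional embeddings are hypothetical; in practice the embeddings would come from the video-question transformer and the answer transformer.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean softmax cross-entropy over a batch.

    Row i of vq_embeddings is assumed to be paired with row i of
    answer_embeddings; the other answers in the batch act as negatives.
    This is a generic in-batch contrastive loss (an assumption), not
    the authors' exact objective.
    """
    if len(vq_embeddings) != len(answer_embeddings):
        raise ValueError("batch sizes must match")
    losses = []
    for i, vq in enumerate(vq_embeddings):
        scores = [dot(vq, ans) for ans in answer_embeddings]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)


def predict_answer(vq_embedding, answer_embeddings):
    """Return the index of the candidate answer with the highest score."""
    scores = [dot(vq_embedding, ans) for ans in answer_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy batch: two video-question embeddings and their paired answers.
vq_batch = [[4.0, 0.0], [0.0, 4.0]]
answers = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(vq_batch, answers)
shuffled = contrastive_loss(vq_batch, list(reversed(answers)))
print(aligned < shuffled)
```

Because answers are scored by similarity in a shared embedding space rather than by a fixed classification layer, a function like predict_answer can rank any list of candidate answers, which is one plausible way to handle an open answer vocabulary.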

Domains

Computer Vision and Pattern Recognition [cs.CV]; Machine Learning [cs.LG]; Computation and Language [cs.CL]; Artificial Intelligence [cs.AI]; Computer Science [cs]

Main file

2012.00451.pdf (6.46 MB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-03328749

Submitted on: Monday, 30 August 2021, 12:17:47

Last modified on: Friday, 19 April 2024, 16:18:58

Dates and versions

hal-03328749, version 1 (30-08-2021)

Identifiers

HAL Id: hal-03328749, version 1
arXiv: 2012.00451

Cite

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid. Just Ask: Learning to Answer Questions from Millions of Narrated Videos. ICCV 2021 - IEEE International Conference on Computer Vision, Oct 2021, Montréal, Canada. ⟨hal-03328749⟩


Collections

ENS-PARIS CNRS INRIA INRIA2 GENCI PSL ANR PRAIRIE-IA

