Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

Conference paper. Year: 2022

Antoine Yang

Role: Author
PersonId: 748008
IdHAL: antoine-yang
ORCID: 0000-0002-7258-571X

Models of visual object recognition and scene understanding

Antoine Miech

Role: Author
PersonId: 1041372

DeepMind [London]

Josef Sivic

Role: Author
PersonId: 945630

Czech Institute of Informatics, Robotics and Cybernetics [Prague]

Ivan Laptev

Role: Author
PersonId: 865349

Models of visual object recognition and scene understanding

Cordelia Schmid

Role: Author
PersonId: 831154

Models of visual object recognition and scene understanding

Abstract

Video question answering (VideoQA) is a complex task that requires diverse multimodal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multimodal inputs. In contrast, we build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. Specifically, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train these modules using Web-scraped multimodal data, and (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings.
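
The inference step described in the abstract, where the answer is read off a masked position, can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the names `build_prompt`, `answer_zero_shot` and `toy_scorer`, and the scores are all hypothetical. In particular, `toy_scorer` is a stand-in for a frozen bidirectional language model combined with visual inputs, and it ignores the video entirely.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"  # placeholder mask token (assumption; depends on the tokenizer)


def build_prompt(question: str) -> str:
    """Turn a question into a cloze-style prompt whose masked slot is the answer.

    The template is illustrative only.
    """
    return f"Question: {question} Answer: {MASK}."


def answer_zero_shot(
    prompt: str,
    candidates: List[str],
    mask_scorer: Callable[[str], Dict[str, float]],
) -> str:
    """Return the candidate answer with the highest score at the masked position."""
    if MASK not in prompt:
        raise ValueError("prompt must contain a mask token")
    scores = mask_scorer(prompt)
    # Restrict the prediction to the answer vocabulary; unseen candidates rank last.
    return max(candidates, key=lambda a: scores.get(a, float("-inf")))


def toy_scorer(prompt: str) -> Dict[str, float]:
    """Hypothetical stand-in for a frozen BiLM: fixed scores for the masked token."""
    return {"the": 5.0, "guitar": 2.1, "piano": 0.4, "drums": -0.3}


prompt = build_prompt("What is the man playing?")
answer = answer_zero_shot(prompt, ["guitar", "piano", "drums"], toy_scorer)
print(answer)
```

Restricting the argmax to a candidate list keeps a high-scoring but irrelevant token (here "the") from being returned as the answer.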

Keywords

Video Understanding; Video Question Answering; Zero-Shot Learning; Computer Vision; Vision and Language

Domains

Computer Science [cs]; Artificial Intelligence [cs.AI]; Computer Vision and Pattern Recognition [cs.CV]; Machine Learning [cs.LG]; Computation and Language [cs.CL]

Main file

cr.pdf (3.41 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03807016

Submitted on: Monday, October 10, 2022, 09:54:52

Last modified on: Friday, April 19, 2024, 16:18:58

Dates and versions

hal-03807016 , version 1 (08-10-2022)

hal-03807016 , version 2 (10-10-2022)

Identifiers

HAL Id: hal-03807016, version 2

Cite

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid. Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. NeurIPS 2022 - 36th Conference on Neural Information Processing Systems, Nov 2022, New Orleans, United States. ⟨hal-03807016v2⟩

Collections

ENS-PARIS CNRS INRIA INRIA2 GENCI PSL ANR PRAIRIE-IA
