Zero-Shot Video Question Answering via Frozen Bidirectional Language Models - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, Year: 2022

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Abstract

Video question answering (VideoQA) is a complex task that requires diverse multimodal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multimodal inputs. In contrast, we build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. Specifically, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multimodal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings.
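
Step (iii) of the abstract, answering by filling in a masked token, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the authors' code: the prompt template, the function names `build_prompt` and `pick_answer`, and the toy logits that stand in for the output of a frozen bidirectional language model at the mask position. In the actual approach the model would also receive video features through the light trainable modules, which this sketch omits.

```python
import math

MASK = "[MASK]"  # placeholder mask token; the real token depends on the tokenizer


def build_prompt(question, mask_token=MASK):
    """Turn a question into a cloze-style prompt (hypothetical template)."""
    return f"Question: {question.rstrip('?')}? Answer: {mask_token}."


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_answer(mask_logits, candidates):
    """Pick the candidate answer with the highest probability at the mask.

    mask_logits: dict mapping a token to its logit at the masked position
                 (here toy numbers, in practice produced by the frozen model).
    candidates:  list of single-token candidate answers.
    """
    scores = [mask_logits[c] for c in candidates]
    probs = softmax(scores)  # normalised over the candidate set only
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs[best]


# Toy logits standing in for a model's prediction at the [MASK] position.
toy_logits = {"dog": 2.0, "cat": 0.5, "car": -1.0}

prompt = build_prompt("What animal is running")
answer, prob = pick_answer(toy_logits, ["dog", "cat", "car"])
print(prompt)
print(answer, round(prob, 3))
```

Restricting the softmax to a fixed candidate list is one simple design choice for mapping masked-token predictions to answers; multi-token answers would need additional handling that this sketch does not cover.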
Main file: cr.pdf (3.41 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03807016, version 1 (08-10-2022)
hal-03807016, version 2 (10-10-2022)

Identifiers

  • HAL Id: hal-03807016, version 2

Cite

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid. Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. NeurIPS 2022 - 36th Conference on Neural Information Processing Systems, Nov 2022, New Orleans, United States. ⟨hal-03807016v2⟩