Conference Papers Year: 2021

On the hidden treasure of dialog in video question answering


High-level understanding of stories in video such as movies and TV shows from raw data is extremely challenging. Modern video question answering (VideoQA) systems often use additional human-made sources like plot synopses, scripts, video descriptions or knowledge bases. In this work, we present a new approach to understanding the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into a text description via dialog summarization, much like recent methods treat video. The input of each modality is encoded independently by transformers, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotation or human-made plot summaries. It even outperforms human evaluators who have never watched any whole episode before.
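As a rough illustration of the "soft temporal attention" idea mentioned in the abstract, the sketch below pools per-window answer scores over a long input using softmax weights. It is not the authors' implementation: the function names, the windowing, and the assumption that each temporal window yields one attention logit and one score per candidate answer are all hypothetical, chosen only to make the general technique concrete.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_temporal_attention(
    window_scores: Sequence[Sequence[float]],
    window_logits: Sequence[float],
) -> List[float]:
    """Pool answer scores over temporal windows (hypothetical helper).

    window_scores[t][a] is the score of candidate answer a in window t.
    window_logits[t] is an unnormalized relevance score for window t.
    Returns one pooled score per candidate answer.
    """
    if not window_scores or len(window_scores) != len(window_logits):
        raise ValueError("need one logit per non-empty list of windows")
    # Soft localization: every window contributes, weighted by relevance.
    weights = softmax(window_logits)
    num_answers = len(window_scores[0])
    return [
        sum(w * scores[a] for w, scores in zip(weights, window_scores))
        for a in range(num_answers)
    ]


if __name__ == "__main__":
    # Made-up numbers: three windows, two candidate answers.
    scores = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
    logits = [0.0, 2.0, 0.0]
    print(soft_temporal_attention(scores, logits))
```

Because the weights come from a softmax rather than a hard argmax, the pooling stays differentiable and no window is discarded outright; a window with a much larger logit simply dominates the pooled score.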
Main file
C119.iccv21.vqa.pdf (1.15 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03530160, version 1 (17-01-2022)




Deniz Engin, François Schnitzler, Ngoc Q K Duong, Yannis Avrithis. On the hidden treasure of dialog in video question answering. ICCV 2021 - IEEE/CVF International Conference on Computer Vision, Oct 2021, Virtual, France. pp.1-10. ⟨hal-03530160⟩