On the hidden treasure of dialog in video question answering

Deniz Engin; François Schnitzler; Ngoc Q K Duong; Yannis Avrithis

Conference paper, Year: 2021


Deniz Engin

Role: Author

Creating and exploiting explicit links between multimedia fragments

InterDigital Communications

François Schnitzler

Role: Author
PersonId: 808827
ORCID: 0000-0003-1304-2157

InterDigital Communications

Ngoc Q K Duong

Role: Author

InterDigital Communications

Yannis Avrithis

Role: Author
PersonId: 20705
IdHAL: yannis-avrithis
ORCID: 0000-0001-7476-4482
IdRef: 253126193

Creating and exploiting explicit links between multimedia fragments

Abstract

High-level understanding of stories in video such as movies and TV shows from raw data is extremely challenging. Modern video question answering (VideoQA) systems often use additional human-made sources like plot synopses, scripts, video descriptions or knowledge bases. In this work, we present a new approach to understand the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into text description via dialog summarization, much like recent methods treat video. The input of each modality is encoded by transformers independently, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotation or human-made plot summaries. It even outperforms human evaluators who have never watched any whole episode before.
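The soft temporal attention and fusion steps mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a long input is split into temporal windows, that each window yields one score per candidate answer, and that each window also has a relevance logit. The function names, the per-window logits, and the averaging used for fusion are hypothetical choices made for illustration. They are not taken from the paper and may differ from the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_temporal_attention(
    window_scores: Sequence[Sequence[float]],
    window_logits: Sequence[float],
) -> List[float]:
    """Pool per-window answer scores into one score per candidate answer.

    window_scores[t][a]: score of candidate answer a computed from window t.
    window_logits[t]: hypothetical relevance logit of window t (assumption).
    """
    if len(window_scores) != len(window_logits):
        raise ValueError("one logit is required per window")
    # Soft localization: every window contributes, weighted by its relevance.
    weights = softmax(window_logits)
    num_answers = len(window_scores[0])
    return [
        sum(w * scores[a] for w, scores in zip(weights, window_scores))
        for a in range(num_answers)
    ]


def fuse_modalities(per_modality_scores: Sequence[Sequence[float]]) -> List[float]:
    """Late fusion by averaging answer scores across modalities (assumption)."""
    num_modalities = len(per_modality_scores)
    return [sum(column) / num_modalities for column in zip(*per_modality_scores)]


# Illustrative values only: three dialog windows, two candidate answers.
dialog_scores = soft_temporal_attention(
    [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]],
    [0.0, 2.0, 0.0],
)
video_scores = [0.6, 0.4]
fused = fuse_modalities([dialog_scores, video_scores])
predicted_answer = max(range(len(fused)), key=fused.__getitem__)
```

Because the window weights come from a softmax rather than a hard argmax, no window is discarded outright. That is one plausible reading of "soft" localization over long inputs.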

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

C119.iccv21.vqa.pdf (1.15 MB)

Origin: Files produced by the author(s)

Contributor: Yannis Avrithis

https://inria.hal.science/hal-03530160

Submitted on: Monday, January 17, 2022, 17:19:54

Last modified on: Tuesday, January 16, 2024, 16:29:54

Long-term archiving on: Monday, April 18, 2022, 21:12:22

Dates and versions

hal-03530160, version 1 (17-01-2022)

Identifiers

HAL Id: hal-03530160, version 1

Cite

Deniz Engin, François Schnitzler, Ngoc Q K Duong, Yannis Avrithis. On the hidden treasure of dialog in video question answering. ICCV 2021 - IEEE/CVF International Conference on Computer Vision, Oct 2021, Virtual, France. pp.1-10. ⟨hal-03530160⟩


Collections

UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA CENTRALESUPELEC INRIA2 GENCI UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

41 views

47 downloads
