Multi-modal Transformer for Video Retrieval

Valentin Gabeur; Chen Sun; Karteek Alahari; Cordelia Schmid

doi:10.1007/978-3-030-58548-8_13

Conference paper. Year: 2020

Valentin Gabeur

Role: Author

Apprentissage de modèles à partir de données massives (Learning models from massive data)

Google France

Chen Sun

Role: Author

Google France

Karteek Alahari

Role: Author
PersonId : 19670
IdHAL : karteek
ORCID : 0000-0002-1838-5936
IdRef : 196283892

Apprentissage de modèles à partir de données massives (Learning models from massive data)

Cordelia Schmid

Role: Author
PersonId : 831154

Google France

Abstract

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.

Keywords

video; language; retrieval; multi-modal; cross-modal; temporality; transformer; attention

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

main.pdf (1.53 MB)

Origin: Files produced by the author(s)

Contributor: THOTH Team

https://inria.hal.science/hal-02903209

Submitted on: Monday, July 20, 2020, 17:55:19

Last modified on: Friday, April 26, 2024, 12:57:38

Long-term archiving on: Tuesday, December 1, 2020, 02:01:15

Dates and versions

hal-02903209, version 1 (20-07-2020)

Identifiers

HAL Id: hal-02903209, version 1
DOI: 10.1007/978-3-030-58548-8_13

Cite

Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid. Multi-modal Transformer for Video Retrieval. ECCV 2020 - European Conference on Computer Vision, Aug 2020, Glasgow, United Kingdom. pp.214-229, ⟨10.1007/978-3-030-58548-8_13⟩. ⟨hal-02903209⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA INSMI LJK LJK_GI INRIA2 LJK-GI-THOTH UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES ANR UR1-MATH-NUM

466 views

635 downloads
