A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking

Vedran Vukotić; Christian Raymond; Guillaume Gravier

doi:10.1109/MMUL.2018.023121161

Journal article: IEEE MultiMedia. Year: 2018

Vedran Vukotić

Role: Author
PersonId : 8581
IdHAL : vvukotic

Creating and exploiting explicit links between multimedia fragments

Christian Raymond

Role: Author
PersonId : 1778
IdHAL : christian-raymond
IdRef : 099236486

Creating and exploiting explicit links between multimedia fragments

Guillaume Gravier

Role: Author
PersonId : 1046
IdHAL : guig
ORCID : 0000-0002-2266-5682
IdRef : 110355415

Creating and exploiting explicit links between multimedia fragments

Abstract

With the recent resurgence of neural networks and the proliferation of massive amounts of unlabeled data, unsupervised learning algorithms have become very popular for organizing and retrieving large video collections, a task known as video hyperlinking. Information stored as video typically contains two modalities, an audio one and a visual one, which multimodal systems use jointly by fusing them. Multimodal autoencoders have long been used for multimodal fusion. In this work, we start by evaluating different initial single-modal representations for automatic speech transcripts and for video keyframes. We then evaluate different autoencoding methods for multimodal fusion in an offline setup. The best-performing setup is then evaluated in a live setup at TRECVID's 2016 video hyperlinking task. As in the offline evaluations, we show that focusing on crossmodal translations as a way of performing multimodal fusion yields improved multimodal representations, and that our simple system, trained in an unsupervised manner with no external information, defines the new state of the art in a live video hyperlinking setup. We conclude by analyzing data gathered after the live evaluations at TRECVID 2016 and by expressing our thoughts on the overall performance of our proposed system.
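The abstract does not spell out the architecture, so the following is only a minimal sketch of one way crossmodal translation can serve as multimodal fusion, loosely guided by the keywords "tied weights" and "bidirectional learning". It assumes two small translation networks (speech to visual, visual to speech) whose central weight matrices are shared in transposed form, and it takes the concatenated central activations as the fused representation. The function names, layer sizes, `tanh` activations, and random (untrained) weights are all illustrative assumptions, not details taken from the article.

```python
import numpy as np


def init_params(dim_speech, dim_visual, dim_hidden, dim_embed, seed=0):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    return {
        "W_s_in": w(dim_speech, dim_hidden),   # speech input layer
        "W_v_in": w(dim_visual, dim_hidden),   # visual input layer
        "W_a": w(dim_hidden, dim_embed),       # central weights, tied across directions
        "W_b": w(dim_embed, dim_hidden),       # central weights, tied across directions
        "W_s_out": w(dim_hidden, dim_speech),  # speech output layer
        "W_v_out": w(dim_hidden, dim_visual),  # visual output layer
    }


def forward(params, x_speech, x_visual):
    """Run both crossmodal translations and return the fused embedding."""
    # Direction 1: speech -> visual, using W_a then W_b in the centre.
    h_s = np.tanh(x_speech @ params["W_s_in"])
    z_s = np.tanh(h_s @ params["W_a"])
    visual_hat = np.tanh(z_s @ params["W_b"]) @ params["W_v_out"]

    # Direction 2: visual -> speech, reusing the same central matrices transposed.
    h_v = np.tanh(x_visual @ params["W_v_in"])
    z_v = np.tanh(h_v @ params["W_b"].T)
    speech_hat = np.tanh(z_v @ params["W_a"].T) @ params["W_s_out"]

    # Assumed fusion step: concatenate the central activations of both directions.
    fused = np.concatenate([z_s, z_v], axis=1)
    return fused, speech_hat, visual_hat


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def demo():
    """Embed three hypothetical video segments and compare the first two."""
    rng = np.random.default_rng(1)
    x_speech = rng.normal(size=(3, 8))    # stand-in transcript features
    x_visual = rng.normal(size=(3, 12))   # stand-in keyframe features
    params = init_params(dim_speech=8, dim_visual=12, dim_hidden=6, dim_embed=4)
    fused, speech_hat, visual_hat = forward(params, x_speech, x_visual)
    return fused, speech_hat, visual_hat, cosine_similarity(fused[0], fused[1])
```

In a hyperlinking setting, fused embeddings of this kind could be compared with cosine similarity to rank candidate target segments for a given anchor segment. The sketch omits training entirely; a real system would first fit the weights, for example by minimizing a reconstruction error on each translated modality, which is again an assumption rather than something the abstract states.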

Keywords

bidirectional learning; tied weights; shared weights; deep learning; neural networks; multimodal retrieval; crossmodal; video retrieval; unsupervised representation learning; multimodal autoencoders; multimodal fusion; video hyperlinking

Domains

Multimedia [cs.MM]; Neural networks [cs.NE]; Computer vision and pattern recognition [cs.CV]; Information retrieval [cs.IR]

Main file

Vukotic - A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking - last revision.pdf (4.19 MB)

Origin: Files produced by the author(s)

Contributor: Guillaume Gravier

https://inria.hal.science/hal-01848539

Submitted on: Monday, 20 August 2018, 17:09:50

Last modified on: Friday, 24 March 2023, 14:53:08

Long-term archiving on: Wednesday, 21 November 2018, 14:04:58

Dates and versions

hal-01848539 , version 1 (24-07-2018)

hal-01848539 , version 2 (20-08-2018)

Identifiers

HAL Id : hal-01848539 , version 2
DOI : 10.1109/MMUL.2018.023121161

Cite

Vedran Vukotić, Christian Raymond, Guillaume Gravier. A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking. IEEE MultiMedia, 2018, 25 (2), pp.11-23. ⟨10.1109/MMUL.2018.023121161⟩. ⟨hal-01848539v2⟩

Collections

UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-INSA-R CENTRALESUPELEC INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

415 views

461 downloads
