Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking

Vedran Vukotić; Christian Raymond; Guillaume Gravier

Conference paper, Year: 2016


Vedran Vukotić

Role: Author
PersonId : 8581
IdHAL : vvukotic

Creating and exploiting explicit links between multimedia fragments

Christian Raymond

Role: Author
PersonId : 1778
IdHAL : christian-raymond
IdRef : 099236486

Laboratoire Informatique d'Avignon

Guillaume Gravier

Role: Author
PersonId : 1046
IdHAL : guig
ORCID : 0000-0002-2266-5682
IdRef : 110355415

Creating and exploiting explicit links between multimedia fragments

Abstract

Video hyperlinking is a classic example of a multimodal problem. Common approaches to such problems are early fusion of the initial modalities and crossmodal translation from one modality to the other. Recently, deep neural networks, especially deep autoencoders, have proven promising both for crossmodal translation and for early fusion via multimodal embedding. A particular architecture, the bidirectional symmetrical deep neural network, has been shown to yield improved multimodal embeddings over classical autoencoders while also being able to perform crossmodal translation. In this work, we first focus on evaluating good single-modal continuous representations for both textual and visual information. Word2Vec and paragraph vectors are evaluated for representing collections of words, such as parts of automatic transcripts and multiple visual concepts, while different deep convolutional neural networks are evaluated for directly embedding visual information, avoiding the creation of visual concepts. Second, we evaluate methods for multimodal fusion and crossmodal translation, with different single-modal pairs, in the task of video hyperlinking. Bidirectional (symmetrical) deep neural networks were shown to successfully tackle downsides of multimodal autoencoders and yield a superior multimodal representation. In this work, we extensively test them in different settings, with different single-modal representations, within the context of video hyperlinking.
Our novel bidirectional symmetrical deep neural networks are compared to classical autoencoders and are shown to yield multimodal embeddings that significantly (α = 0.0001) outperform those obtained by deep autoencoders, with an absolute improvement in precision at 10 of 14.1% when embedding visual concepts and automatic transcripts, and an absolute improvement of 4.3% when embedding automatic transcripts with features obtained with very deep convolutional neural networks, reaching a precision at 10 of 80%.
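The idea of a bidirectional symmetrical network with tied weights can be illustrated with a small forward-pass sketch. The abstract does not spell out the layer layout, so everything below is an assumption made for illustration: the class name `BidirectionalNet`, the layer sizes, the `tanh` activations, the choice to tie only the two central weight matrices (each direction uses the transpose of the other's), and the choice to build the multimodal embedding by concatenating the two central activations. Training is omitted and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)


def init(n_in, n_out):
    """Small random weight matrix (illustrative initialisation only)."""
    return rng.normal(scale=0.1, size=(n_in, n_out))


class BidirectionalNet:
    """Two crossmodal networks (text -> visual, visual -> text)
    whose central weights are shared. Hypothetical layout."""

    def __init__(self, dim_text, dim_visual, dim_hidden, dim_embed):
        # Outer layers are modality specific and not shared.
        self.W_text_in = init(dim_text, dim_hidden)
        self.W_vis_in = init(dim_visual, dim_hidden)
        self.W_text_out = init(dim_hidden, dim_text)
        self.W_vis_out = init(dim_hidden, dim_visual)
        # Central weights, used by both directions (assumed tying scheme).
        self.W_a = init(dim_hidden, dim_embed)
        self.W_b = init(dim_embed, dim_hidden)

    def text_to_visual(self, x_text):
        h1 = np.tanh(x_text @ self.W_text_in)
        z = np.tanh(h1 @ self.W_a)        # central representation
        h2 = np.tanh(z @ self.W_b)
        return z, h2 @ self.W_vis_out     # crossmodal translation

    def visual_to_text(self, x_vis):
        h1 = np.tanh(x_vis @ self.W_vis_in)
        z = np.tanh(h1 @ self.W_b.T)      # tied: transpose of W_b
        h2 = np.tanh(z @ self.W_a.T)      # tied: transpose of W_a
        return z, h2 @ self.W_text_out

    def embed(self, x_text, x_vis):
        """Multimodal embedding: concatenated central activations."""
        z_text, _ = self.text_to_visual(x_text)
        z_vis, _ = self.visual_to_text(x_vis)
        return np.concatenate([z_text, z_vis], axis=-1)


# Toy usage with made-up dimensions: 3 video segments,
# 10-d text features and 12-d visual features.
net = BidirectionalNet(dim_text=10, dim_visual=12, dim_hidden=8, dim_embed=4)
x_text = rng.normal(size=(3, 10))
x_vis = rng.normal(size=(3, 12))
embedding = net.embed(x_text, x_vis)
print(embedding.shape)
```

In a hyperlinking setting, segments would then be compared by a similarity measure (for example cosine similarity) between such embeddings. That step is also an assumption here, not something the abstract specifies.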

Keywords

neural networks; deep learning; convolutional neural networks; CNN; deep neural networks; DNN; representation; embedding; multimodal; crossmodal; autoencoder; video and text; video hyperlinking; bidirectional learning; retrieval; video retrieval; tied weights; shared weights; multimodal fusion

Domains

Multimedia [cs.MM]; Artificial Intelligence [cs.AI]

Main file

vukotic_iV&L-MM_2016.pdf (468.1 KB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-01374727

Submitted on: Friday, September 30, 2016, 23:03:58

Last modified on: Friday, March 24, 2023, 14:53:02

Long-term archiving on: Saturday, December 31, 2016, 16:39:03

Dates and versions

hal-01374727, version 1 (30-09-2016)

Identifiers

HAL Id: hal-01374727, version 1

Cite

Vedran Vukotić, Christian Raymond, Guillaume Gravier. Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking. ACM Multimedia 2016 Workshop: Vision and Language Integration Meets Multimedia Fusion (iV&L-MM'16), ACM Multimedia, Oct 2016, Amsterdam, Netherlands. ⟨hal-01374727⟩

Collections

INSTITUT-TELECOM UNIV-AVIGNON UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-INSA-R CENTRALESUPELEC IRISA-D6 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC LIA UNIV-RENNES UR1-MATH-NUM

929 views

1218 downloads
