End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on four downstream tasks across eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

miech17ICCV.pdf (3.17 MB)

Origin: Files produced by the author(s)

Contributor: Antoine Miech

https://inria.hal.science/hal-01569540

Submitted on: Friday, July 28, 2017, 01:26:22

Last modified on: Monday, December 11, 2023, 11:30:52

Dates and versions

hal-01569540, version 1 (27-07-2017)

hal-01569540, version 2 (28-07-2017)

Identifiers

HAL Id: hal-01569540, version 2
arXiv: 1912.06430

Cite

Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. CVPR 2020 - IEEE Conference on Computer Vision and Pattern Recognition, Jun 2020, Seattle / Virtual, United States. ⟨hal-01569540v2⟩

Collections

ENS-PARIS CNRS INRIA INRIA2 PSL ANR PRAIRIE-IA
