End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Abstract : Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
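The abstract describes MIL-NCE as a way to handle the loose alignment between a video clip and its narration: instead of contrasting a single (clip, caption) pair against negatives, a bag of candidate narrations near the clip is treated jointly as positives. Below is a minimal, hedged sketch of an MIL-NCE-style objective using NumPy; the embedding dimensions, function name, and inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mil_nce_loss(video_emb, pos_caps, neg_caps):
    """Sketch of an MIL-NCE-style loss for one video clip.

    video_emb: (d,) clip embedding.
    pos_caps:  (K, d) bag of candidate narration embeddings near the clip
               (positives; some may be misaligned, hence the bag).
    neg_caps:  (N, d) narration embeddings from other videos (negatives).

    Loss = -log( sum_{p in bag} exp(s_p) / sum_{all} exp(s) ),
    i.e. softmax mass assigned to the positive bag as a whole.
    """
    pos_scores = pos_caps @ video_emb            # (K,) dot-product similarities
    neg_scores = neg_caps @ video_emb            # (N,)
    all_scores = np.concatenate([pos_scores, neg_scores])

    # log-sum-exp with max subtraction for numerical stability
    m_all = all_scores.max()
    log_denom = m_all + np.log(np.exp(all_scores - m_all).sum())
    m_pos = pos_scores.max()
    log_num = m_pos + np.log(np.exp(pos_scores - m_pos).sum())

    return float(log_num - log_denom) * -1.0

# Toy usage: one aligned caption, one misaligned one in the bag, one negative.
v = np.array([1.0, 0.0])
loss = mil_nce_loss(v,
                    pos_caps=np.array([[1.0, 0.0], [0.2, 0.5]]),
                    neg_caps=np.array([[-1.0, 0.0]]))
```

Because the numerator sums over the whole bag, a single well-aligned narration in the bag is enough to drive the loss down, which is the multiple-instance-learning intuition the abstract appeals to.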
Document type: Conference papers
Contributor: Antoine Miech
Submitted on: Friday, July 28, 2017 - 1:26:22 AM
Last modification on: Thursday, March 17, 2022 - 10:08:53 AM

  • HAL Id: hal-01569540, version 2
  • arXiv: 1912.06430



Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. CVPR 2020 - IEEE Conference on Computer Vision and Pattern Recognition, Jun 2020, Seattle / Virtual, United States. ⟨hal-01569540v2⟩