
End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work, we propose a new learning approach, MIL-NCE, capable of addressing the misalignments inherent to narrated videos. With this approach, we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
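The abstract names MIL-NCE but does not spell it out. For orientation, the idea combines multiple-instance learning (each clip is paired with a small bag of candidate narrations, since speech and video are only loosely aligned in time) with noise-contrastive estimation (a softmax contrasting the positive bag against negatives). The sketch below is a minimal, hypothetical PyTorch rendering written from the abstract's description alone, not the authors' implementation; the function name, tensor shapes, and the restriction of negatives to other clips' narrations within the batch are illustrative assumptions.

```python
import torch

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Sketch of a MIL-NCE-style loss (assumed shapes, not the paper's code).

    video_emb: (B, D) clip embeddings f(x).
    text_emb:  (B, K, D) embeddings g(y) of K candidate narrations per clip,
               e.g. captions temporally close to the clip.
    All K candidates of clip i form its positive bag; every candidate of
    every other clip in the batch serves as a negative.
    """
    B, K, D = text_emb.shape
    # Dot-product similarity of every clip to every candidate: (B, B*K)
    sim = video_emb @ text_emb.reshape(B * K, D).t()
    sim = sim.reshape(B, B, K)
    # MIL numerator: log-sum-exp over each clip's own positive bag
    pos = sim[torch.arange(B), torch.arange(B)]                # (B, K)
    numerator = torch.logsumexp(pos, dim=1)                    # (B,)
    # NCE denominator: log-sum-exp over all candidates, positives included
    denominator = torch.logsumexp(sim.reshape(B, -1), dim=1)   # (B,)
    return (denominator - numerator).mean()
```

Under these assumptions, with f a 3D CNN over clips and g a sentence encoder, one would call mil_nce_loss(f(clips), g(narrations)) on each mini-batch; the flattened denominator already covers mismatched (clip, narration) pairs drawn across the batch as negatives.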
Document type: Conference papers

Cited literature: 40 references

https://hal.inria.fr/hal-01569540
Contributor: Antoine Miech
Submitted on: Friday, July 28, 2017 - 1:26:22 AM
Last modification on: Tuesday, January 19, 2021 - 10:16:03 AM

File

miech17ICCV.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01569540, version 2
  • arXiv: 1912.06430

Citation

Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. CVPR 2020 - IEEE Conference on Computer Vision and Pattern Recognition, Jun 2020, Seattle / Virtual, United States. ⟨hal-01569540v2⟩

Metrics

Record views: 330
File downloads: 929