End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Abstract : Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
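The abstract describes MIL-NCE as a way to handle the loose alignment between a video clip and its narration: instead of contrasting a single (clip, caption) pair against negatives, a bag of candidate narrations near the clip is treated jointly as positives. Below is a minimal, hedged sketch of an MIL-NCE-style objective using NumPy; the embedding dimensions, function name, and inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mil_nce_loss(video_emb, pos_caps, neg_caps):
    """Sketch of an MIL-NCE-style loss for one video clip.

    video_emb: (d,) clip embedding.
    pos_caps:  (K, d) bag of candidate narration embeddings near the clip
               (positives; some may be misaligned, hence the bag).
    neg_caps:  (N, d) narration embeddings from other videos (negatives).

    Loss = -log( sum_{p in bag} exp(s_p) / sum_{all} exp(s) ),
    i.e. softmax mass assigned to the positive bag as a whole.
    """
    pos_scores = pos_caps @ video_emb            # (K,) dot-product similarities
    neg_scores = neg_caps @ video_emb            # (N,)
    all_scores = np.concatenate([pos_scores, neg_scores])

    # log-sum-exp with max subtraction for numerical stability
    m_all = all_scores.max()
    log_denom = m_all + np.log(np.exp(all_scores - m_all).sum())
    m_pos = pos_scores.max()
    log_num = m_pos + np.log(np.exp(pos_scores - m_pos).sum())

    return float(log_num - log_denom) * -1.0

# Toy usage: one aligned caption, one misaligned one in the bag, one negative.
v = np.array([1.0, 0.0])
loss = mil_nce_loss(v,
                    pos_caps=np.array([[1.0, 0.0], [0.2, 0.5]]),
                    neg_caps=np.array([[-1.0, 0.0]]))
```

Because the numerator sums over the whole bag, a single well-aligned narration in the bag is enough to drive the loss down, which is the multiple-instance-learning intuition the abstract appeals to.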
Document type: Conference papers
Contributor: Antoine Miech
Submitted on: Friday, July 28, 2017 - 1:26:22 AM
Last modification on: Thursday, March 17, 2022 - 10:08:53 AM

  • HAL Id: hal-01569540, version 2
  • arXiv: 1912.06430



Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. CVPR 2020 - IEEE Conference on Computer Vision and Pattern Recognition, Jun 2020, Seattle / Virtual, United States. ⟨hal-01569540v2⟩