
Large-scale Learning from Video and Natural Language

Antoine Miech 1
1 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique - ENS Paris, Inria de Paris
Abstract : The goal of this thesis is to build and train machine learning models capable of understanding the content of videos. Current video understanding approaches mainly rely on large-scale manually annotated video datasets for training. However, collecting and annotating such datasets is cumbersome, expensive and time-consuming. To address this issue, this thesis focuses on leveraging large amounts of readily available but noisy annotations in the form of natural language. In particular, we exploit a diverse corpus of textual metadata such as movie scripts, web video titles and descriptions, or automatically transcribed speech obtained from narrated videos. Training video models on such readily available textual data is challenging, as such annotations are often imprecise or wrong. In this thesis, we introduce learning approaches to deal with weak annotation and design specialized training objectives and neural network architectures.
Contributor : Antoine Miech
Submitted on : Sunday, December 20, 2020 - 9:43:25 PM
Last modification on : Tuesday, January 11, 2022 - 11:16:06 AM
Long-term archiving on : Sunday, March 21, 2021 - 6:19:17 PM




  • HAL Id : tel-03084216, version 1



Antoine Miech. Large-scale Learning from Video and Natural Language. Computer Vision and Pattern Recognition [cs.CV]. PSL Research University, 2020. English. ⟨tel-03084216⟩


