
Structured Learning from Videos and Language

Jean-Baptiste Alayrac 1, 2
1 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique de l'École normale supérieure, Inria de Paris
2 SIERRA - Statistical Machine Learning and Parsimony
DI-ENS - Département d'informatique de l'École normale supérieure, CNRS - Centre National de la Recherche Scientifique, Inria de Paris
Abstract: The goal of this thesis is to develop models, representations and structured learning algorithms for the automatic understanding of complex human activities from instructional videos narrated with natural language.

We first introduce a model that, given a set of narrated instructional videos describing a task, generates the list of action steps needed to complete the task and localizes them in the visual and textual streams. To that end, we make two assumptions. First, people perform actions as they mention them, i.e., there is a strong temporal correlation between text and video. Second, complex tasks are composed of an ordered sequence of action steps. Equipped with these two assumptions, our model first clusters the textual inputs and then uses this output to refine the localization of the action steps in the video. We evaluate our model on a newly collected dataset of instructional videos depicting five different complex goal-oriented tasks, such as changing a car tire or repotting a plant.

We then present an approach that links actions to the objects they manipulate. More precisely, we focus on actions that modify the state of a specific object, such as pouring coffee into a cup or opening a door. Such actions are an inherent part of instructional videos. Our method optimizes a joint cost between actions and object states under constraints that reflect our assumption that object-state changes and manipulation actions follow a consistent temporal order. We demonstrate experimentally that object states help localize actions and, conversely, that action localization improves object-state recognition.

All our models are based on discriminative clustering, a technique that lets us leverage the weak supervision readily available in instructional videos.
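The ordering assumption above admits a compact illustration: given per-frame costs of assigning each video frame to one of K ordered steps (in the thesis such costs come from the discriminative clustering objective; here the cost matrix is an arbitrary, hypothetical example), the minimum-cost monotone assignment can be computed by dynamic programming. A minimal sketch, not the thesis implementation:

```python
import numpy as np

def ordered_assignment(step_cost):
    """Minimum-cost assignment of T frames to K ordered steps.

    step_cost[t, k] is the cost of assigning frame t to step k (hypothetical
    values here; in the thesis these come from discriminative clustering).
    Frames are assigned in non-decreasing step order, starting at step 0
    and ending at step K - 1, matching the ordered-steps assumption.
    """
    T, K = step_cost.shape
    dp = np.full((T, K), np.inf)
    dp[0, 0] = step_cost[0, 0]
    for t in range(1, T):
        for k in range(K):
            # Frame t - 1 was either in the same step or the previous one.
            prev = dp[t - 1, k]
            if k > 0:
                prev = min(prev, dp[t - 1, k - 1])
            dp[t, k] = step_cost[t, k] + prev
    # Backtrack from the last frame in the last step.
    k, path = K - 1, [K - 1]
    for t in range(T - 1, 0, -1):
        if k > 0 and dp[t - 1, k - 1] <= dp[t - 1, k]:
            k -= 1
        path.append(k)
    return list(reversed(path)), dp[T - 1, K - 1]
```

For example, with two steps whose costs favor the first two frames for step 0 and the last two for step 1, the recovered assignment is the monotone sequence [0, 0, 1, 1].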
To solve the resulting optimization problems, we rely on an optimization technique well suited to this structure: the Frank-Wolfe algorithm. Since scaling our approaches to thousands of videos is essential in the context of narrated instructional videos, we also present several improvements that make the Frank-Wolfe algorithm faster and more computationally efficient. In particular, we propose three main modifications to the Block-Coordinate Frank-Wolfe algorithm: gap-based sampling, away-step and pairwise block Frank-Wolfe steps, and a scheme for caching the oracle calls. We show the effectiveness of these improvements on four challenging structured prediction tasks, including foreground/background segmentation and human pose estimation.
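Gap-based sampling can be illustrated on a toy problem: Block-Coordinate Frank-Wolfe on a separable quadratic over a product of simplices, where blocks are sampled with probability proportional to their current Frank-Wolfe duality gap. The objective, block sizes and sampling details below are illustrative assumptions, not the thesis setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def fw_block_step(x_i, c_i):
    """One Frank-Wolfe step for min 0.5 * ||x_i - c_i||^2 over the simplex.

    Returns the updated block and its duality gap before the step."""
    grad = x_i - c_i
    s = np.zeros_like(x_i)
    s[np.argmin(grad)] = 1.0              # linear oracle over the simplex
    d = s - x_i
    gap = -grad @ d                       # FW gap: <grad, x_i - s>
    if gap <= 0.0:
        return x_i, 0.0
    gamma = min(1.0, gap / (d @ d))       # exact line search for the quadratic
    return x_i + gamma * d, gap

def bcfw_gap_sampling(C, iters=200):
    """Block-Coordinate FW with gap-based block sampling (toy objective).

    C[i] is the (hypothetical) target of block i; each block lives on a
    probability simplex and is updated independently."""
    n, k = C.shape
    X = np.full((n, k), 1.0 / k)          # start at the simplex barycenter
    gaps = np.full(n, 1.0)                # optimistic initial gap estimates
    for _ in range(iters):
        p = gaps + 1e-12                  # keep every block sampleable
        i = rng.choice(n, p=p / p.sum())  # large-gap blocks sampled more often
        X[i], gaps[i] = fw_block_step(X[i], C[i])
    return X, gaps
```

Sampling blocks in proportion to their gap focuses computation on the coordinates that are farthest from optimality, which is the intuition behind the gap-based sampling modification mentioned above.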

Cited literature: 198 references
Contributor: Jean-Baptiste Alayrac
Submitted on: October 1, 2018
Last modified on: October 15, 2021
Long-term archiving on: January 2, 2019




  • HAL Id: tel-01885412, version 1


Jean-Baptiste Alayrac. Structured Learning from Videos and Language. Computer Vision and Pattern Recognition [cs.CV]. Ecole normale supérieure - ENS PARIS, 2018. English. ⟨tel-01885412v1⟩


