Learning to localize goal-oriented actions with weak supervision

Dimitri Zhukov

Thesis. Year: 2021

Localisation faiblement supervisée des actions orientées vers un but

Dimitri Zhukov

Role: Author

Département d'informatique - ENS Paris

Models of visual object recognition and scene understanding

Abstract

The goal of this thesis is to develop methods for the automatic understanding of video content. We focus on instructional videos that demonstrate how to perform complex tasks, such as making an omelette or hanging a picture. First, we investigate learning visual models for the steps of tasks using only a list of steps for each task, instead of costly and time-consuming human annotations. Our model shares information between tasks at the sub-step level, effectively multiplying the amount of available training data. We demonstrate the benefits of our method on CrossTask, a newly collected dataset of instructional videos. Next, we present a method for isolating task-related actions from the surrounding background that does not rely on human supervision. Finally, we learn to associate natural language instructions with the corresponding objects within the 3D scene reconstructed from the videos.

Translated from the French abstract: The goal of this thesis is to develop methods for the automatic understanding of instructional videos, which demonstrate human tasks such as making an omelette or hanging a painting. We first propose a method for learning actions from only a script for each task, instead of manual annotations. Our model reduces the amount of training data required by sharing information between tasks. We evaluate our approach on a new dataset, CrossTask. We then present an unsupervised method for isolating task-related actions from their context. Finally, we propose an approach for associating textual instructions with the corresponding objects in the 3D scene reconstructed from the videos.

Keywords

Computer vision, Action recognition, Video understanding, Unsupervised learning, Weakly supervised learning, Instructional videos

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

thesis_final.pdf (22.19 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/tel-03518272

Submitted on: Sunday, January 9, 2022, 14:26:52

Last modified on: Friday, April 19, 2024, 16:18:56

Dates and versions

tel-03518272, version 1 (09-01-2022)

Identifiers

HAL Id: tel-03518272, version 1

Cite

Dimitri Zhukov. Learning to localize goal-oriented actions with weak supervision. Computer Vision and Pattern Recognition [cs.CV]. PSL University, 2021. English. ⟨NNT : ⟩. ⟨tel-03518272⟩

Collections

ENS-PARIS CNRS INRIA THESES-ENS INRIA2 PSL ANR PRAIRIE-IA
