Learning visual models for person detection and action prediction

Tuan-Hung Vu

Thesis, Year: 2018

Apprentissage de modèles visuels pour la détection de personnes et la prédiction d’actions

Tuan-Hung Vu

Role: Author
PersonId : 959043

Département d'informatique - ENS Paris

Models of visual object recognition and scene understanding

Abstract

In this thesis, we address person detection and action prediction in visual data. We develop models that learn representations for visual data and the structure of the output space while making use of contextual cues and temporal consistency. We also propose a predictive model to anticipate a person's attention in given static scenes. In the first part of the thesis, we explore the strong association between scene categories and actions. Based on that understanding, we formulate a new task of predicting human actions in static scenes. To train and evaluate the proposed model, we collect a new dataset of scene-action associations, named the SUN Action dataset. The success of this task enables potential applications such as affordance geo-localization. The second part of the thesis focuses on person and generic object detection in videos. First, we construct contextual models to enhance person detection in individual frames. We train and evaluate our method on our new HollywoodHeads dataset, which contains annotated human heads in movies. Our models consistently improve detection performance over baseline detectors. Second, we introduce a novel convolutional neural network architecture that operates on short clips of frames to leverage temporal consistency and learn spatio-temporal representations. Through experiments, we demonstrate the benefit of our spatio-temporal representations for object detection in videos. Last, we learn video representations that incorporate multiscale information on coarse time scales and design practical frameworks that achieve accuracy, efficiency and predictive power. Compared to per-frame features, our video representations show the largest detection improvement on frames degraded by fast motion.

Keywords

Scene understanding; Action prediction; Spatiotemporal visual representation; Object detection in videos; Deep convolutional neural networks

Computer vision; Object detection; Deep learning

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

thesis_Tuan-Hung.pdf (27.67 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/tel-01861455

Submitted on: Friday, 24 August 2018, 14:44:24

Last modified on: Friday, 19 April 2024, 16:18:56

Long-term archiving on: Sunday, 25 November 2018, 14:08:31

Dates and versions

tel-01861455, version 1 (24-08-2018)

Identifiers

HAL Id: tel-01861455, version 1

Cite

Tuan-Hung Vu. Learning visual models for person detection and action prediction. Computer Vision and Pattern Recognition [cs.CV]. Ecole Normale Superieure de Paris - ENS Paris, 2018. English. ⟨NNT : ⟩. ⟨tel-01861455⟩

Collections

ENS-PARIS CNRS INRIA THESES-ENS INRIA2 PSL

743 views

422 downloads
