Localizing spatially and temporally objects and actions in videos

Vicky Kalogeiton 1, 2
Abstract: The rise of deep learning has facilitated remarkable progress in video understanding. This thesis addresses three important tasks of video understanding: video object detection, joint object and action detection, and spatio-temporal action localization.

Object class detection is one of the most important challenges in computer vision. Object detectors are usually trained on bounding boxes from still images. Recently, video has been used as an alternative source of data. Yet, training an object detector on one domain (either still images or videos) and testing on the other results in a significant performance gap compared to training and testing on the same domain. In the first part of this thesis, we examine the reasons behind this performance gap. We define and evaluate five domain shift factors: spatial location accuracy, appearance diversity, image quality, aspect distribution, and object size and camera framing. We examine the impact of these factors by comparing detection performance before and after cancelling them out. The results show that all five factors affect the performance of the detectors, and that their combined effect explains the performance gap.

While most existing approaches for detection in videos treat objects or human actions separately, in the second part of this thesis we aim at detecting non-human-centric actions, i.e., objects performing actions, such as a cat eating or a dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both object detection and action detection benefit from this joint learning. In experiments on the A2D dataset, we obtain state-of-the-art results on segmentation of object-action pairs.
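The multitask idea above can be illustrated with a minimal sketch: one shared representation, two classification losses (object and action) summed into a single joint objective. This is not the architecture used in the thesis; the function names and the plain log-softmax losses are illustrative assumptions.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of raw scores
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    # negative log-probability of the ground-truth class
    return -math.log(softmax(logits)[label])

def joint_loss(obj_logits, act_logits, obj_label, act_label):
    """Multitask objective (sketch): the object head and the action head
    share the same input features upstream, and their two cross-entropy
    losses are summed so that both tasks are learned jointly."""
    return cross_entropy(obj_logits, obj_label) + cross_entropy(act_logits, act_label)
```

For example, `joint_loss([2.0, 0.0, 0.0], [0.0, 3.0], 0, 1)` is small because both heads already favor the correct classes ("cat", "eating"); raising the correct logits further lowers the joint loss, which is what drives the joint learning.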
In the third part, we are the first to propose an action tubelet detector that leverages the temporal continuity of videos instead of operating at the frame level, as state-of-the-art approaches do. Just as modern detectors rely on anchor boxes, our tubelet detector is based on anchor cuboids: it takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. Our tubelet detector outperforms all state-of-the-art methods on the UCF-Sports, J-HMDB, and UCF-101 action localization datasets, especially at high overlap thresholds. The improvement in detection performance is explained by both more accurate scores and more precise localization.
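The data structures involved can be sketched as follows: an anchor cuboid is an anchor box replicated over the K input frames, and per-frame regressions deform it into a tubelet (one box per frame plus a single score). The translation-only regression and the helper names are simplifying assumptions for illustration, not the regression actually used.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Tubelet:
    boxes: List[Box]   # one bounding box per frame of the input sequence
    score: float       # a single confidence score for the whole tubelet

def anchor_cuboid(anchor: Box, num_frames: int) -> List[Box]:
    """An anchor cuboid (sketch): the same anchor box replicated
    across the K frames of the input sequence."""
    return [anchor] * num_frames

def regress_tubelet(cuboid: List[Box], deltas: List[Tuple[float, float]],
                    score: float) -> Tubelet:
    # apply a hypothetical per-frame (dx, dy) translation to each box
    boxes = [(x1 + dx, y1 + dy, x2 + dx, y2 + dy)
             for (x1, y1, x2, y2), (dx, dy) in zip(cuboid, deltas)]
    return Tubelet(boxes=boxes, score=score)
```

Scoring the whole cuboid at once, rather than each frame independently, is what yields the more accurate scores mentioned above: the confidence is pooled over the sequence instead of fluctuating frame by frame.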

Cited literature [248 references]

https://hal.inria.fr/tel-01674504
Contributor: Thoth Team
Submitted on: Wednesday, January 3, 2018 - 3:36:15 AM
Last modification on: Wednesday, April 11, 2018 - 1:59:15 AM
Long-term archiving on: Wednesday, April 4, 2018 - 12:35:02 PM

File

vicky-phdthesis.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: tel-01674504, version 1

Citation

Vicky Kalogeiton. Localizing spatially and temporally objects and actions in videos. Computer Vision and Pattern Recognition [cs.CV]. University of Edinburgh; INRIA Grenoble, 2017. English. ⟨tel-01674504⟩
