
Learning to annotate dynamic video scenes

Abstract: Videos often depict complex scenes including people, objects and interactions between them and the environment. Relations between agents are likely to evolve over time, and agents can perform actions. Automatic understanding of video data is complicated, as it requires localizing the agents both in space and time. Moreover, one needs to automatically describe the relations between agents and how these evolve over time. Modern approaches to computer vision rely heavily on supervised learning, where annotated samples are provided to an algorithm that learns parametric models. However, for rich data such as video, the labeling process becomes costly and complicated. Moreover, symbolic labels are not sufficient to encode the complex interactions between people, objects and scenes. Natural language offers much richer descriptive power and is thus a practical modality for annotating video data. Therefore, in this thesis we propose to focus on jointly modeling video and text. We explore such joint models in the context of movies with associated movie scripts, which provide accurate descriptions of the depicted events. The main challenge we face is that movie scripts do not provide precise temporal and spatial localization of objects and actions.

We first present a model for automatically annotating person tracks in movies with person and action labels. The model uses a discriminative clustering cost function, and weak supervision in the form of constraints that we obtain from scripts. This approach allows us to localize, in space and time, the agents and the actions they perform, as described in the script. However, the spatial and temporal localization relies on the use of person detection tracks.

In a second contribution, we describe a model for aligning sentences with frames of the video. The optimal temporal correspondence is again obtained using a discriminative model under temporal ordering constraints.
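As a rough illustration of the first contribution, a DIFFRAC-style discriminative clustering objective with script-derived label constraints can be sketched in a few lines of NumPy. This is a minimal stand-in, not the thesis implementation: the regularizer `lam`, the greedy coordinate-descent solver and the candidate-set encoding are illustrative assumptions (the thesis optimizes a convex relaxation of such an objective).

```python
import numpy as np

def diffrac_cost_matrix(X, lam=1e-2):
    """Implicit cost matrix B of a DIFFRAC-style objective.

    For any one-hot labeling Z of the n tracks with features X (n x d),
    min_W ||Z - X W||^2 + n*lam*||W||^2 equals trace(Z^T B Z), so the
    linear classifier W never has to be formed explicitly.
    """
    n, d = X.shape
    return np.eye(n) - X @ np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T)

def assign_labels(B, candidates, n_iter=50):
    """Greedy coordinate descent on trace(Z^T B Z).

    candidates[i] lists the labels allowed for track i (e.g. the character
    or action names the script mentions around that time); a singleton list
    acts as a hard, script-derived constraint.
    """
    y = np.array([c[0] for c in candidates])  # feasible starting labeling
    for _ in range(n_iter):
        changed = False
        for i in range(len(y)):
            # Cost of giving track i label k is B_ii + 2 * sum_{j != i, y_j = k} B_ij,
            # so it suffices to compare the cross terms over each candidate label.
            scores = {k: B[i, y == k].sum() - (B[i, i] if y[i] == k else 0.0)
                      for k in candidates[i]}
            best = min(scores, key=scores.get)
            if best != y[i]:
                y[i] = best
                changed = True
        if not changed:
            break
    return y
```

On toy two-cluster data (with a constant bias feature appended) and a couple of "anchored" tracks, each sweep of the descent never increases the clustering cost; the greedy rounding here is only a stand-in for the relaxed optimization used in the thesis.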
This alignment model is applied to two datasets: one composed of videos associated with a stream of symbolic labels; a second composed of videos with textual descriptions in the form of key steps towards a goal (cooking recipes, for instance).
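The temporal ordering constraint of the second contribution can be illustrated with a standard dynamic program: given a score S[t, k] for assigning frame t to sentence k, the best monotone (order-preserving) assignment is computed in O(TK). The discriminative scores and the exact constraint set in the thesis are more involved; this is a minimal sketch assuming every frame gets a sentence, the first frame starts at the first sentence, and the last frame ends at the last one.

```python
import numpy as np

def align(S):
    """Monotone alignment of T frames to K ordered sentences.

    S[t, k] is the score for assigning frame t to sentence k. Returns an
    array a of length T with non-decreasing entries in {0, ..., K-1}
    maximizing sum_t S[t, a[t]], assuming T >= K.
    """
    T, K = S.shape
    D = np.full((T, K), -np.inf)
    D[0, 0] = S[0, 0]  # the first frame must belong to the first sentence
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]                           # same sentence
            step = D[t - 1, k - 1] if k > 0 else -np.inf  # next sentence
            D[t, k] = S[t, k] + max(stay, step)
    # Backtrack from the last sentence at the last frame.
    a = np.empty(T, dtype=int)
    a[-1] = K - 1
    for t in range(T - 2, -1, -1):
        k = a[t + 1]
        a[t] = k if (k == 0 or D[t, k] >= D[t, k - 1]) else k - 1
    return a
```

For a toy score matrix with a clear block structure (each sentence scoring high on its own span of frames), the recovered assignment is the expected non-decreasing segmentation.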
Contributor: Piotr Bojanowski
Submitted on: Monday, September 12, 2016 - 4:27:10 PM
Last modification on: Thursday, July 1, 2021 - 5:58:09 PM
Long-term archiving on: Tuesday, December 13, 2016 - 3:43:25 PM



  • HAL Id: tel-01364560, version 1



Piotr Bojanowski. Learning to annotate dynamic video scenes. Computer Vision and Pattern Recognition [cs.CV]. Ecole normale supérieure, 2016. English. ⟨tel-01364560⟩


