ActionVLAD: Learning spatio-temporal aggregation for action classification - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper. Year: 2017

ActionVLAD: Learning spatio-temporal aggregation for action classification

Abstract

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
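
The abstract does not give the aggregation formula, so the following is only a minimal sketch of what pooling jointly across space and time could look like, assuming a VLAD-style soft assignment of each local descriptor to a set of anchor points. The function name `vlad_aggregate`, the `alpha` parameter, and the toy inputs are hypothetical; in a trainable network the anchors would be learned parameters rather than fixed lists.

```python
import math

def vlad_aggregate(features, centers, alpha=1.0):
    """Aggregate local descriptors into one VLAD-style vector (illustrative).

    features: list of D-dim descriptors gathered from every spatial
              location of every sampled frame, pooled as one set.
    centers:  list of K D-dim anchor points (assumed fixed here).
    alpha:    hypothetical sharpness of the soft assignment.
    """
    K, D = len(centers), len(centers[0])
    vlad = [[0.0] * D for _ in range(K)]
    for x in features:
        # Soft-assign the descriptor to each anchor: softmax over
        # negative scaled squared distances (max-shifted for stability).
        logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                  for c in centers]
        top = max(logits)
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        # Accumulate the weighted residual to every anchor.
        for k, c in enumerate(centers):
            w = weights[k] / total
            for j in range(D):
                vlad[k][j] += w * (x[j] - c[j])
    # Flatten to a K*D vector and L2-normalize it.
    flat = [v for row in vlad for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy input: 2 frames x 2 locations, 2-dim descriptors.
frames = [[[0.0, 0.1], [1.0, 0.9]], [[0.1, 0.0], [0.9, 1.0]]]
# Joint pooling: flatten frames and locations into a single set.
features = [x for frame in frames for x in frame]
centers = [[0.0, 0.0], [1.0, 1.0]]
video_vec = vlad_aggregate(features, centers)
print(len(video_vec))  # K * D = 4
```

Because the sum runs over one flat set of descriptors, the result does not depend on frame order or spatial position. Under finding (ii), a sketch like this would be run once per stream, with separate anchors for appearance and motion features, rather than on a mixed set.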
Main file
Girdhar17.pdf (8.05 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-01678686, version 1 (09-01-2018)

Identifiers

Cite

Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. IEEE Conference on Computer Vision and Pattern Recognition, 2017, Honolulu, United States. ⟨hal-01678686⟩
