H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, Dynamic image networks for action recognition, CVPR, 2016.
DOI : 10.1109/CVPR.2016.331

URL : https://pure.uva.nl/ws/files/19630210/cvpr2016bilen.pdf

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High Accuracy Optical Flow Estimation Based on a Theory for Warping, ECCV, 2004.
DOI : 10.1007/978-3-540-24673-2_3

C. Cao, Y. Zhang, C. Zhang, and H. Lu, Action recognition with joints-pooled 3D deep convolutional descriptors, IJCAI, 2016.

Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.143

J. Carreira and A. Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.502

G. Chéron, I. Laptev, and C. Schmid, P-CNN: Pose-Based CNN Features for Action Recognition, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.368

A. Diba, V. Sharma, and L. Van Gool, Deep Temporal Linear Encoding Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.168

URL : http://arxiv.org/pdf/1611.06678

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015.

W. Du, Y. Wang, and Y. Qiao, RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.402

Y. Du, W. Wang, and L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, CVPR, 2015.

C. Feichtenhofer, A. Pinz, and R. Wildes, Spatiotemporal residual networks for video action recognition, NIPS, 2016.

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional Two-Stream Network Fusion for Video Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.213

R. Girdhar and D. Ramanan, Attentional pooling for action recognition, NIPS, 2017.

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298676

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS, 2010.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, ICML, 2015.

S. D. Jain, B. Xiong, and K. Grauman, FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.228

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, Towards Understanding Action Recognition, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.396

URL : https://hal.archives-ouvertes.fr/hal-00906902

V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, Action Tubelet Detector for Spatio-Temporal Action Localization, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.472

URL : https://hal.archives-ouvertes.fr/hal-01519812

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

A. Kläser, M. Marszałek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, 2008.
DOI : 10.5244/C.22.99

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.
DOI : 10.1145/3065386

URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126543

I. Laptev, On space-time interest points, IJCV, issue.1, 2005.
DOI : 10.1007/s11263-005-1838-7

URL : http://kth.diva-portal.org/smash/get/diva2:442088/FULLTEXT01

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_48

J. Liu, A. Shahroudy, D. Xu, and G. Wang, Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition, ECCV, 2016.

A. Newell, K. Yang, and J. Deng, Stacked Hourglass Networks for Human Pose Estimation, ECCV, 2016.

URL : http://arxiv.org/pdf/1603.06937

X. Peng and C. Schmid, Multi-region Two-Stream R-CNN for Action Detection, ECCV, 2016.

URL : https://hal.archives-ouvertes.fr/hal-01349107

S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos, Proceedings of the British Machine Vision Conference 2016, 2016.
DOI : 10.5244/C.30.58

A. Shahroudy, J. Liu, T. Ng, and G. Wang, NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.115

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, CRCV-TR-12-01, 2012.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, JMLR, issue.5, 2014.

L. Sun, K. Jia, K. Chen, D. Y. Yeung, B. E. Shi et al., Lattice Long Short-Term Memory for Human Action Recognition, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.236

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298594

P. Tokmakov, K. Alahari, and C. Schmid, Learning Motion Patterns in Videos, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.64

URL : https://hal.archives-ouvertes.fr/hal-01427480

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.510

D. Tran, J. Ray, Z. Shou, S. Chang, and M. Paluri, Convnet architecture search for spatiotemporal feature learning, arXiv, 2017.

C. Wang, Y. Wang, and A. L. Yuille, An approach to pose-based action recognition, CVPR, 2013.

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV, 2016.

B. X. Nie, C. Xiong, and S. Zhu, Joint action recognition and pose estimation from video, CVPR, 2015.

J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, CVPR, 2015.

C. Zach, T. Pock, and H. Bischof, A duality based approach for realtime TV-L1 optical flow, Pattern Recognition, 2007.

A. Zisserman, J. Carreira, K. Simonyan, W. Kay, B. Zhang et al., The Kinetics Human Action Video Dataset, arXiv, 2017.

M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.316