R. Poppe, A survey on vision-based human action recognition, Image and Vision Computing, vol.28, issue.6, pp.976-990, 2010.
DOI : 10.1016/j.imavis.2009.11.014

G. Cheng, Y. Wan, A. N. Saudagar, K. Namuduri, and B. P. Buckles, Advances in human action recognition: A survey, arXiv: Computer Vision and Pattern Recognition, 2015.

S. Herath, M. T. Harandi, and F. Porikli, Going deeper into action recognition: A survey, Image and Vision Computing, vol.60, pp.4-21, 2017.
DOI : 10.1016/j.imavis.2017.01.010

URL : http://arxiv.org/pdf/1605.04988

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, pp.3551-3558, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj, Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition, In: CVPR, pp.204-212, 2015.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, In: NIPS, pp.568-576, 2014.

J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, In: CVPR, pp.4694-4702, 2015.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4489-4497, 2015.
DOI : 10.1109/ICCV.2015.510

URL : http://arxiv.org/pdf/1412.0767

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, In: ECCV, pp.20-36, 2016.
DOI : 10.1007/978-3-319-46484-8_2

URL : http://arxiv.org/pdf/1608.00859

R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell, ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.337

URL : https://hal.archives-ouvertes.fr/hal-01678686

J. Carreira and A. Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.502

URL : http://arxiv.org/pdf/1705.07750

J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, Video-based sign language recognition without temporal segmentation, arXiv preprint, 2018.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM, vol.60, issue.6, pp.1097-1105, 2012.
DOI : 10.1145/3065386

URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
DOI : 10.1109/CVPR.2016.90

URL : http://arxiv.org/pdf/1512.03385

L. Ran, Y. Zhang, W. Wei, and Q. Zhang, A Hyperspectral Image Classification Framework with Spatial Pixel Pair Features, Sensors, vol.17, issue.10, 2017.
DOI : 10.3390/s17102421

URL : https://doi.org/10.3390/s17102421

F. Perronnin and C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383266

URL : http://www.xrce.xerox.com/Publications/Attachments/2006-034/2006-034.pdf

L. Ran, Y. Zhang, Q. Zhang, and T. Yang, Convolutional Neural Network-Based Robot Navigation Using Uncalibrated Spherical Images, Sensors, vol.17, issue.6, 2017.
DOI : 10.3390/s17061341

URL : http://www.mdpi.com/1424-8220/17/6/1341/pdf

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, In: CVPR, pp.1725-1732, 2014.
DOI : 10.1109/cvpr.2014.223

URL : http://www.cs.cmu.edu/~rahuls/pub/cvpr2014-deepvideo-rahuls.pdf

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, In: CVPR, pp.2625-2634, 2015.
DOI : 10.1109/tpami.2016.2599174

URL : https://doi.org/10.1109/tpami.2016.2599174

S. Ji, W. Xu, M. Yang, and K. Yu, 3D Convolutional Neural Networks for Human Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.1, pp.221-231, 2013.
DOI : 10.1109/TPAMI.2012.59

URL : http://www.dbs.informatik.uni-muenchen.de/%7Eyu_k/icml2010_3dcnn.pdf

Z. Lan, Y. Zhu, A. G. Hauptmann, and S. Newsam, Deep Local Video Feature for Action Recognition, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.1219-1225, 2017.
DOI : 10.1109/CVPRW.2017.161

URL : http://arxiv.org/pdf/1701.07368

I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision, vol.64, issue.2-3, pp.107-123, 2005.
DOI : 10.1007/s11263-005-1838-7

URL : http://kth.diva-portal.org/smash/get/diva2:442088/FULLTEXT01

P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-Temporal Features, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp.65-72, 2005.
DOI : 10.1109/VSPETS.2005.1570899

S. Sadanand and J. J. Corso, Action bank: A high-level representation of activity in video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.1234-1241, 2012.
DOI : 10.1109/CVPR.2012.6247806

P. Scovanner, S. Ali, and M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, Proceedings of the 15th international conference on Multimedia, MULTIMEDIA '07, pp.357-360, 2007.
DOI : 10.1145/1291233.1291311

A. Kläser, M. Marszałek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, pp.275-276, 2008.
DOI : 10.5244/C.22.99

URL : https://hal.archives-ouvertes.fr/inria-00514853

N. Dalal, B. Triggs, and C. Schmid, Human Detection Using Oriented Histograms of Flow and Appearance, In: ECCV, pp.428-441, 2006.
DOI : 10.1007/11744047_33

URL : https://hal.archives-ouvertes.fr/inria-00548587

R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, NetVLAD: CNN architecture for weakly supervised place recognition, In: CVPR, pp.5297-5307, 2016.
DOI : 10.1109/tpami.2017.2711011

URL : https://hal.archives-ouvertes.fr/hal-01557234

P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition, IEEE Transactions on Circuits and Systems for Video Technology, vol.27, issue.12, pp.2613-2622, 2017.
DOI : 10.1109/TCSVT.2016.2576761

L. Sun, K. Jia, D. Y. Yeung, and B. E. Shi, Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4597-4605, 2015.
DOI : 10.1109/ICCV.2015.522

URL : http://arxiv.org/pdf/1510.00562

G. Varol, I. Laptev, and C. Schmid, Long-Term Temporal Convolutions for Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.6, 2017.
DOI : 10.1109/TPAMI.2017.2712608

URL : https://hal.archives-ouvertes.fr/hal-01241518

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, In: ICML, pp.448-456, 2015.

L. Wang, J. Xue, N. Zheng, and G. Hua, Automatic salient object extraction with contextual cue, 2011 International Conference on Computer Vision, pp.105-112, 2011.
DOI : 10.1109/ICCV.2011.6126231

L. Wang, G. Hua, R. Sukthankar, J. Xue, and N. Zheng, Video object discovery and co-segmentation with extremely weak supervision, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.2074-2088, 2017.
DOI : 10.1007/978-3-319-10593-2_42

URL : http://www.cs.stevens.edu/%7Eghua/publication/ECCV14a.pdf

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, Language modeling with gated convolutional networks, arXiv preprint arXiv:1612.08083, 2016.

A. Miech, I. Laptev, and J. Sivic, Learnable pooling with context gating for video classification, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01547378

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional Two-Stream Network Fusion for Video Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1933-1941, 2016.
DOI : 10.1109/CVPR.2016.213

URL : http://arxiv.org/pdf/1604.06573

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, CRCV-TR-12-01, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, pp.2556-2563, 2011.
DOI : 10.1109/ICCV.2011.6126543

URL : http://dspace.mit.edu/bitstream/1721.1/69981/1/Poggio-HMDB.pdf

Y.-G. Jiang, J. Liu, A. R. Zamir, I. Laptev, M. Piccardi et al., THUMOS challenge: Action recognition with a large number of classes, 2013.

Q. Zhang and G. Hua, Multi-View Visual Recognition of Imperfect Testing Data, Proceedings of the 23rd ACM international conference on Multimedia, MM '15, pp.561-570, 2015.
DOI : 10.1145/2020408.2020593

J. Deng, W. Dong, R. Socher, L. J. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.248-255, 2009.
DOI : 10.1109/CVPR.2009.5206848

Q. Zhang, G. Hua, W. Liu, Z. Liu, and Z. Zhang, Auxiliary Training Information Assisted Visual Recognition, IPSJ Transactions on Computer Vision and Applications, vol.7, issue.0, pp.138-150, 2015.
DOI : 10.2197/ipsjtcva.7.138

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, In: ICLR, 2015.

Q. Zhang, G. Hua, W. Liu, Z. Liu, and Z. Zhang, Can Visual Recognition Benefit from Auxiliary Information in Training?, Lecture Notes in Computer Science, vol.9003, pp.65-80, 2015.
DOI : 10.1007/978-3-319-16865-4_5

C. Zach, T. Pock, and H. Bischof, A Duality Based Approach for Realtime TV-L1 Optical Flow, Pattern Recognition, pp.214-223, 2007.
DOI : 10.1007/978-3-540-74936-3_22

X. Wang, A. Farhadi, and A. Gupta, Actions ~ Transformations, In: CVPR, 2016.