S. Ji, W. Xu, M. Yang, and K. Yu, 3D Convolutional Neural Networks for Human Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.1, pp.221-231, 2013.
DOI : 10.1109/TPAMI.2012.59

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, In: CVPR, pp.1725-1732, 2014.

Q. Zhang, H. Abeida, M. Xue, W. Rowe, and J. Li, Fast implementation of sparse iterative covariance-based estimation for source localization, The Journal of the Acoustical Society of America, vol.131, issue.2, pp.1249-1259, 2012.
DOI : 10.1121/1.3672656

L. Ran, Y. Zhang, Q. Zhang, and T. Yang, Convolutional Neural Network-Based Robot Navigation Using Uncalibrated Spherical Images, Sensors, vol.17, issue.6, 2017.
DOI : 10.3390/s17061341

H. Abeida, Q. Zhang, J. Li, and N. Merabtine, Iterative Sparse Asymptotic Minimum Variance Based Approaches for Array Processing, IEEE Transactions on Signal Processing, vol.61, issue.4, pp.933-944, 2013.
DOI : 10.1109/TSP.2012.2231676

J. Carreira and A. Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4724-4733, 2017.
DOI : 10.1109/CVPR.2017.502

L. Wang, J. Xue, N. Zheng, and G. Hua, Automatic Salient Object Extraction with Contextual Cue, In: ICCV, pp.105-112, 2011.

L. Wang, G. Hua, R. Sukthankar, J. Xue, and N. Zheng, Video object discovery and co-segmentation with extremely weak supervision, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.10, pp.2074-2088, 2017.

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3431-3440, 2015.
DOI : 10.1109/CVPR.2015.7298965

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, In: NIPS, pp.568-576, 2014.

J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, In: CVPR, pp.4694-4702, 2015.

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, In: ECCV, pp.20-36, 2016.
DOI : 10.1007/978-3-319-46484-8_2

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, In: CVPR, pp.2625-2634, 2015.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4489-4497, 2015.
DOI : 10.1109/ICCV.2015.510

G. Chéron, I. Laptev, and C. Schmid, P-CNN: Pose-Based CNN Features for Action Recognition, 2015 IEEE International Conference on Computer Vision (ICCV), pp.3218-3226, 2015.
DOI : 10.1109/ICCV.2015.368

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional Two-Stream Network Fusion for Video Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1933-1941, 2016.
DOI : 10.1109/CVPR.2016.213
URL : http://arxiv.org/pdf/1604.06573

J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, Video-based sign language recognition without temporal segmentation. arXiv preprint, 2018.

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, pp.3551-3558, 2013.
DOI : 10.1109/ICCV.2013.441
URL : https://hal.archives-ouvertes.fr/hal-00873267

C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., pp.32-36, 2004.
DOI : 10.1109/ICPR.2004.1334462
URL : http://www.nada.kth.se/%7Ecaputo/publik/icpr04actions.pdf

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, arXiv preprint arXiv:1212.0402, 2012.

H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre, HMDB51: A Large Video Database for Human Motion Recognition, In: High Performance Computing in Science and Engineering, pp.571-582, 2013.
DOI : 10.1007/978-3-642-33374-3_41
URL : http://cbcl.mit.edu/publications/ps/Kuehne_etal_iccv11.pdf

M. T. Luong, H. Pham, and C. D. Manning, Effective Approaches to Attention-based Neural Machine Translation, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
DOI : 10.18653/v1/D15-1166
URL : http://arxiv.org/pdf/1508.04025

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, In: ICML, pp.2048-2057, 2015.

V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, Recurrent models of visual attention, In: NIPS, pp.2204-2212, 2014.

H. Wang, A. Kläser, C. Schmid, and C. L. Liu, Action recognition by dense trajectories, CVPR 2011, pp.3169-3176, 2011.
DOI : 10.1109/CVPR.2011.5995407
URL : https://hal.archives-ouvertes.fr/inria-00583818

I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision, vol.64, issue.2-3, pp.107-123, 2005.
DOI : 10.1007/s11263-005-1838-7
URL : http://kth.diva-portal.org/smash/get/diva2:442088/FULLTEXT01

L. Ran, Y. Zhang, W. Wei, and Q. Zhang, A Hyperspectral Image Classification Framework with Spatial Pixel Pair Features, Sensors, vol.17, issue.10, 2017.
DOI : 10.3390/s17102421
URL : https://doi.org/10.3390/s17102421

J. Wang, Z. Liu, Y. Wu, and J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.1290-1297, 2012.
DOI : 10.1109/CVPR.2012.6247813

Y. Du, W. Wang, and L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, In: CVPR, pp.1110-1118, 2015.

Q. Zhang and G. Hua, Multi-View Visual Recognition of Imperfect Testing Data, Proceedings of the 23rd ACM international conference on Multimedia, MM '15, pp.561-570, 2015.

Q. Zhang, G. Hua, W. Liu, Z. Liu, and Z. Zhang, Can Visual Recognition Benefit from Auxiliary Information in Training?, Lecture Notes in Computer Science, vol.9003, pp.65-80, 2015.
DOI : 10.1007/978-3-319-16865-4_5
URL : http://www.cs.stevens.edu/%7Eghua/publication/ACCV14b.pdf

Q. Zhang, G. Hua, W. Liu, Z. Liu, and Z. Zhang, Auxiliary Training Information Assisted Visual Recognition, IPSJ Transactions on Computer Vision and Applications, vol.7, issue.0, pp.138-150, 2015.
DOI : 10.2197/ipsjtcva.7.138
URL : https://www.jstage.jst.go.jp/article/ipsjtcva/7/0/7_138/_pdf

L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal et al., Describing Videos by Exploiting Temporal Structure, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4507-4515, 2015.
DOI : 10.1109/ICCV.2015.512
URL : http://arxiv.org/pdf/1502.08029

A. Gaidon, Z. Harchaoui, and C. Schmid, Temporal Localization of Actions with Actoms, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.11, pp.2782-2795, 2013.
DOI : 10.1109/TPAMI.2013.65
URL : https://hal.archives-ouvertes.fr/hal-00687312

Q. Zhang, H. Abeida, M. Xue, W. Rowe, and J. Li, Fast implementation of sparse iterative covariance-based estimation for array processing, 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp.2031-2035, 2011.
DOI : 10.1109/ACSSC.2011.6190383

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
DOI : 10.1109/CVPR.2016.90
URL : http://arxiv.org/pdf/1512.03385

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, In: ICML, pp.448-456, 2015.

J. Deng, W. Dong, R. Socher, L. J. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.248-255, 2009.
DOI : 10.1109/CVPR.2009.5206848

A. Paszke, S. Gross, S. Chintala, and G. Chanan, PyTorch, 2017.

Z. Cai, L. Wang, X. Peng, and Y. Qiao, Multi-view Super Vector for Action Recognition, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.596-603, 2014.
DOI : 10.1109/CVPR.2014.83

X. Peng, L. Wang, X. Wang, and Y. Qiao, Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice, Computer Vision and Image Understanding, vol.150, pp.109-125, 2016.
DOI : 10.1016/j.cviu.2016.03.013

L. Wang, Y. Qiao, and X. Tang, MoFAP: A Multi-level Representation for Action Recognition, International Journal of Computer Vision, vol.119, issue.3, pp.254-271, 2016.
DOI : 10.1007/s11263-015-0859-0

L. Sun, K. Jia, D. Y. Yeung, and B. E. Shi, Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4597-4605, 2015.
DOI : 10.1109/ICCV.2015.522

L. Wang, Y. Qiao, and X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, In: CVPR, pp.4305-4314, 2015.

G. Varol, I. Laptev, and C. Schmid, Long-Term Temporal Convolutions for Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.6, 2017.
DOI : 10.1109/TPAMI.2017.2712608
URL : https://hal.archives-ouvertes.fr/hal-01241518

W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao, A Key Volume Mining Deep Framework for Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1991-1999, 2016.
DOI : 10.1109/CVPR.2016.219

B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, Modeling video evolution for action recognition, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5378-5387, 2015.
DOI : 10.1109/CVPR.2015.7299176

B. Ni, P. Moulin, X. Yang, and S. Yan, Motion Part Regularization: Improving action recognition via trajectory group selection, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3698-3706, 2015.
DOI : 10.1109/CVPR.2015.7298993