M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici et al., YouTube-8M: A large-scale video classification benchmark, 2016.

E. Ahmed, M. Jones, and T. K. Marks, An improved deep learning architecture for person re-identification, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3908-3916, 2015.
DOI : 10.1109/CVPR.2015.7299016

R. Al-rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau et al., Theano: A python framework for fast computation of mathematical expressions, 2016.

M. R. Amer, S. Todorovic, A. Fern, and S. Zhu, Monte Carlo Tree Search for Scheduling Activity Recognition, 2013 IEEE International Conference on Computer Vision, pp.1353-1360, 2013.
DOI : 10.1109/ICCV.2013.171

R. Araujo and M. S. Kamel, A semi-supervised temporal clustering method for facial emotion analysis, 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp.1-6, 2014.
DOI : 10.1109/ICMEW.2014.6890712

K. Avgerinakis, K. Adam, A. Briassouli, and Y. Kompatsiaris, Moving camera human activity localization and recognition with motionplanes and multiple homographies, 2015 IEEE International Conference on Image Processing (ICIP), pp.2085-2089, 2015.
DOI : 10.1109/ICIP.2015.7351168

M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, Action Classification in Soccer Videos with Long Short-Term Memory Recurrent Neural Networks, International Conference on Artificial Neural Networks, pp.154-159, 2010.
DOI : 10.1007/978-3-642-15822-3_20
URL : https://hal.archives-ouvertes.fr/hal-01381827

M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, Sequential Deep Learning for Human Action Recognition, International Workshop on Human Behavior Understanding, pp.29-39, 2011.
DOI : 10.1007/978-3-642-25446-8_4
URL : https://hal.archives-ouvertes.fr/hal-01354493

N. Ballas, L. Yao, and A. Courville, Delving deeper into convolutional networks for learning video representations, Proc. International Conference on Learning Representations, 2016.

I. Bayer and T. Silbermann, A multi modal approach to gesture recognition from audio and video data, Proceedings of the 15th ACM on International conference on multimodal interaction, ICMI '13, pp.461-466, 2013.
DOI : 10.1145/2522848.2532592

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.157-166, 1994.
DOI : 10.1109/72.279181

H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, Dynamic image networks for action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3034-3042, 2016.

N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, Using Convolutional 3D Neural Networks for User-independent continuous gesture recognition, 2016 23rd International Conference on Pattern Recognition (ICPR), 2016.
DOI : 10.1109/ICPR.2016.7899606

X. Chai, Z. Liu, F. Yin, Z. Liu, and X. Chen, Two streams Recurrent Neural Networks for Large-Scale Continuous Gesture Recognition, 2016 23rd International Conference on Pattern Recognition (ICPR), 2016.
DOI : 10.1109/ICPR.2016.7899603

R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal, Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.471-478, 2013.
DOI : 10.1109/CVPRW.2013.153

R. Chavarriaga, H. Sagha, and J. del R. Millán, Ensemble creation and reconfiguration for activity recognition: An information theoretic approach, 2011 IEEE International Conference on Systems, Man, and Cybernetics, pp.2761-2766, 2011.
DOI : 10.1109/ICSMC.2011.6084090

C. Chen, B. Zhang, Z. Hou, J. Jiang, M. Liu et al., Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features, Multimedia Tools and Applications, pp.1-19, 2016.
DOI : 10.1109/TIP.2006.884956

W. Chen and J. J. Corso, Action Detection by Implicit Intentional Motion Clustering, 2015 IEEE International Conference on Computer Vision (ICCV), pp.3298-3306, 2015.
DOI : 10.1109/ICCV.2015.377

G. Chéron, I. Laptev, and C. Schmid, P-CNN: Pose-Based CNN Features for Action Recognition, 2015 IEEE International Conference on Computer Vision (ICCV), pp.3218-3226, 2015.
DOI : 10.1109/ICCV.2015.368

R. Collobert, S. Bengio, and J. Mariéthoz, Torch: A modular machine learning software library, 2002.

Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan et al., Deep Structured Models For Group Activity Recognition, Proceedings of the British Machine Vision Conference 2015, pp.179-180, 2015.
DOI : 10.5244/C.29.179

Z. Deng, A. Vahdat, H. Hu, and G. Mori, Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.516

A. Diba, A. Mohammad-pazandeh, H. Pirsiavash, and L. Van-gool, DeepCAMP: Deep Convolutional Action & Attribute Mid-Level Patterns, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.387
URL : http://arxiv.org/pdf/1608.03217

Y. Du, W. Wang, and L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1110-1118, 2015.

J. Duan, S. Zhou, J. Wan, X. Guo, and S. Z. Li, Multi-modality fusion based on consensus-voting and 3d convolution for isolated gesture recognition. arXiv preprint, 2016.

I. C. Duta, B. Ionescu, K. Aizawa, and N. Sebe, Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos, International Conference on Multimedia Modeling, pp.365-378, 2017.
DOI : 10.1109/ICCV.2013.442

T. Eleni, Gesture recognition with a convolutional long short term memory recurrent neural network, ESANN, 2015. URL https

J. L. Elman, Finding Structure in Time, Cognitive Science, vol.14, issue.2, pp.179-211, 1990.
DOI : 10.1207/s15516709cog1402_1

H. J. Escalante, C. A. Hernández, L. E. Sucar, and M. Montes, Late fusion of heterogeneous methods for multimedia image retrieval, Proceeding of the 1st ACM international conference on Multimedia information retrieval, MIR '08, pp.172-179, 2008.
DOI : 10.1145/1460096.1460125

H. J. Escalante, I. Guyon, V. Athitsos, P. Jangyodsuk, and J. Wan, Principal motion components for gesture recognition using a single example, 2015.

H. J. Escalante, E. F. Morales, and L. E. Sucar, A naïve Bayes baseline for early gesture recognition, pp.91-99, 2016.

H. J. Escalante, V. Ponce, J. Wan, M. Riegler, A. Clapes et al., ChaLearn Joint Contest on Multimedia Challenges Beyond Visual Analysis: An overview, 2016 23rd International Conference on Pattern Recognition (ICPR), 2016.
DOI : 10.1109/ICPR.2016.7899609
URL : https://hal.archives-ouvertes.fr/hal-01381144

V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, DAPs: Deep Action Proposals for Action Understanding, European Conference on Computer Vision, 2016.
DOI : 10.1007/978-3-319-10602-1_26

C. Feichtenhofer, A. Pinz, and R. Wildes, Spatiotemporal residual networks for video action recognition, Advances in Neural Information Processing Systems, pp.3468-3476, 2016.

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional Two-Stream Network Fusion for Video Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1933-1941, 2016.
DOI : 10.1109/CVPR.2016.213

B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars, Rank Pooling for Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.4, 2016.
DOI : 10.1109/TPAMI.2016.2558148

D. Fortun, P. Bouthemy, and C. Kervrann, Optical flow modeling and computation: A survey, Computer Vision and Image Understanding, vol.134, pp.1-21, 2015.
DOI : 10.1016/j.cviu.2015.02.008
URL : https://hal.archives-ouvertes.fr/hal-01104081

F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, Learning precise timing with lstm recurrent networks, JMLR, vol.3, pp.115-143, 2002.

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.759-768, 2015.
DOI : 10.1109/CVPR.2015.7298676

A. Grushin, D. D. Monner, J. A. Reggia, and A. Mishra, Robust human action recognition via long short-term memory, The 2013 International Joint Conference on Neural Networks (IJCNN), pp.1-8, 2013.
DOI : 10.1109/IJCNN.2013.6706797

F. Gu, M. Sridhar, A. Cohn, D. Hogg, F. Flórez-Revuelta et al., Weakly supervised activity analysis with spatio-temporal localisation, Neurocomputing, vol.216, 2016.
DOI : 10.1016/j.neucom.2016.08.032

S. Han, H. Mao, and W. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, Proc. International Conference on Learning Representations, 2016.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
DOI : 10.1109/CVPR.2016.90

Y. He, S. Shirakabe, Y. Satoh, and H. Kataoka, Human Action Recognition Without Human, Proc. European Conference on Computer Vision 2016 Workshops, pp.11-17
DOI : 10.1109/CVPR.2015.7298953
URL : http://arxiv.org/pdf/1608.07876

F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.961-970, 2015.
DOI : 10.1109/CVPR.2015.7298698

S. Hochreiter, Untersuchungen zu dynamischen neuronalen netzen. Diploma, p.91, 1991.

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

J. Huang, W. Zhou, H. Li, and W. Li, Sign Language Recognition using 3D convolutional neural networks, 2015 IEEE International Conference on Multimedia and Expo (ICME), pp.1-6, 2015.
DOI : 10.1109/ICME.2015.7177428

M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, A Hierarchical Deep Temporal Model for Group Activity Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.217

A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler, Learning human pose estimation features with convolutional networks, International Conference on Learning Representations, pp.1-14, 2014.

A. Jain, J. Tompson, Y. Lecun, and C. Bregler, MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation, pp.302-315, 2015.
DOI : 10.1007/978-3-319-16808-1_21

M. Jain, J. Van-gemert, and C. G. Snoek, University of amsterdam at thumos challenge 2014, ECCV THUMOS Challenge 2014, 2014.

M. Jain, J. C. Van-gemert, T. Mensink, and C. G. Snoek, Objects2action: Classifying and Localizing Actions without Any Video Example, 2015 IEEE International Conference on Computer Vision (ICCV)
DOI : 10.1109/ICCV.2015.521

M. Jain, J. C. Van-gemert, and C. G. Snoek, What do 15,000 object categories tell us about classifying and localizing actions?, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.46-55, 2015.
DOI : 10.1109/CVPR.2015.7298599

S. Ji, W. Xu, M. Yang, and K. Yu, 3D Convolutional Neural Networks for Human Action Recognition, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp.495-502, 2010.
DOI : 10.1109/TPAMI.2012.59

S. Ji, W. Xu, M. Yang, and K. Yu, 3D Convolutional Neural Networks for Human Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.1, pp.221-231, 2013.
DOI : 10.1109/TPAMI.2012.59

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, pp.675-678, 2014.
DOI : 10.1145/2647868.2654889

Y. Jiang, J. Liu, A. Zamir, I. Laptev, M. Piccardi et al., THUMOS challenge: Action recognition with a large number of classes, 2013.

V. John, A. Boyali, S. Mita, M. Imanishi, and N. Sanma, Deep Learning-Based Fast Hand Gesture Recognition Using Representative Frames, 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp.1-8, 2016.
DOI : 10.1109/DICTA.2016.7797030

J. Joo, W. Li, F. F. Steen, and S. Zhu, Visual Persuasion: Inferring Communicative Intents of Images, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.216-223, 2014.
DOI : 10.1109/CVPR.2014.35

B. Kang, S. Tripathi, and T. Q. Nguyen, Real-time sign language fingerspelling recognition using convolutional neural networks from depth map, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015.
DOI : 10.1109/ACPR.2015.7486481

S. Karaman, L. Seidenari, A. D. Bagdanov, and A. D. Bimbo, L1-regularized logistic regression stacking and transductive crf smoothing for action recognition in video, Results of the THUMOS 2013 Action Recognition Challenge with a Large Number of Classes, 2013.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.1725-1732, 2014.

T. Kerola, N. Inoue, and K. Shinoda, Cross-view human action recognition from depth maps using spectral graph sequences, Computer Vision and Image Understanding, vol.154, pp.108-126, 2017.
DOI : 10.1016/j.cviu.2016.10.004

O. Koller, H. Ney, and R. Bowden, Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3793-3802, 2016.
DOI : 10.1109/CVPR.2016.412

J. Konecny and M. Hagara, One-shot-learning gesture recognition using hog-hof features, JMLR, vol.15, pp.2513-2532, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in neural information processing systems, pp.1097-1105, 2012.
DOI : 10.1162/neco.2009.10-08-881

Y. Kuniyoshi, H. Inoue, and M. Inaba, Design and implementation of a system that generates assembly programs from visual recognition of human action sequences, Proceedings of the IEEE International Workshop on Intelligent Robots and Systems (IROS '90), Towards a New Frontier of Applications, pp.567-574, 1990.

G. Lev, G. Sadeh, B. Klein, and L. Wolf, RNN Fisher Vectors for Action Recognition and Image Annotation, European Conference on Computer Vision, pp.833-850, 2016.
DOI : 10.1109/ICCV.2015.521
URL : http://arxiv.org/pdf/1512.03958

S. Li, Z. Liu, and A. B. Chan, Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network, International Journal of Computer Vision, vol.35, issue.12, pp.19-36, 2015.
DOI : 10.1007/978-3-319-10590-1_53

S. Li, W. Zhang, and A. B. Chan, Maximum-margin structured learning with deep networks for 3d human pose estimation, ICCV, pp.2848-2856, 2015.

Y. Li, W. Li, V. Mahadevan, and N. Vasconcelos, VLAD3: Encoding Dynamics of Deep Features for Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1951-1960, 2016.
DOI : 10.1109/CVPR.2016.215

Y. Li, Q. Miao, K. Tian, Y. Fan, X. Xu et al., Large-scale gesture recognition with a fusion of RGB-D data based on C3D model, Proc. of International Conference on Pattern Recognition Workshops, 2016.

C. Liang, Y. Song, and Y. Zhang, Hand gesture recognition using view projection from point cloud, 2016 IEEE International Conference on Image Processing (ICIP), pp.4413-4417, 2016.
DOI : 10.1109/ICIP.2016.7533194

Z. Liang, G. Zhang, J. X. Huang, and Q. V. Hu, Deep learning for healthcare decision making with EMRs, 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.556-559, 2014.
DOI : 10.1109/BIBM.2014.6999219

H. Lin, M. Hsu, and W. Chen, Human hand gesture recognition using a convolution neural network, 2014 IEEE International Conference on Automation Science and Engineering (CASE), pp.1038-1043, 2015.
DOI : 10.1109/CoASE.2014.6899454

A. Liu, Y. Su, W. Nie, and M. Kankanhalli, Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.1, pp.102-114, 2017.
DOI : 10.1109/TPAMI.2016.2537337

J. Liu, A. Shahroudy, D. Xu, and G. Wang, Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition, European Conference on Computer Vision, pp.816-833
DOI : 10.1109/ISSNIP.2014.6827664

Z. Liu, C. Zhang, and Y. Tian, 3D-based Deep Convolutional Neural Network for action recognition with depth sequences, Image and Vision Computing, vol.55, pp.93-100, 2016.
DOI : 10.1016/j.imavis.2016.04.004

J. Luo, W. Wang, and H. Qi, Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps, 2013 IEEE International Conference on Computer Vision, pp.1809-1816, 2013.
DOI : 10.1109/ICCV.2013.227

B. Mahasseni and S. Todorovic, Regularizing long short term memory with 3d human-skeleton sequences for action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3054-3062, 2016.

E. Mansimov, N. Srivastava, and R. Salakhutdinov, Initialization strategies of spatiotemporal convolutional neural networks, 2015.

R. Marks, System and method for providing a real-time three-dimensional interactive environment, US Patent 8,072,470, 2011.

P. Mettes, J. C. Van-gemert, and C. G. Snoek, Spot On: Action Localization from Pointly-Supervised Proposals, European Conference on Computer Vision, pp.437-453
DOI : 10.1007/s11263-013-0636-x

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness et al., Human-level control through deep reinforcement learning, Nature, vol.518, issue.7540, pp.529-533, 2015.
DOI : 10.1038/nature14236

P. Molchanov, S. Gupta, K. Kim, and J. Kautz, Hand gesture recognition with 3D convolutional neural networks, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.1-7, 2015.
DOI : 10.1109/CVPRW.2015.7301342
URL : http://web4.cs.ucl.ac.uk/staff/j.kautz/publications/Gesture_HANDS15.pdf

P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree et al., Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.456

A. Montes, A. Salvador, and X. Giró-i-Nieto, Temporal activity detection in untrimmed videos with recurrent neural networks, 2016.

M. Hondori and M. Khademi, A review on technical and clinical impact of microsoft kinect on physical therapy and rehabilitation, Journal of Medical Engineering, 2014.

K. Nasrollahi, S. Escalera, P. Rasti, G. Anbarjafari, X. Baró et al., Deep learning based super-resolution for improved action recognition, 2015 International Conference on Image Processing Theory, Tools and Applications (IPTA), pp.67-72, 2015.
DOI : 10.1109/IPTA.2015.7367098

N. Neverova, C. Wolf, G. Paci, G. Sommavilla, G. W. Taylor et al., A Multi-scale Approach to Gesture Detection and Recognition, 2013 IEEE International Conference on Computer Vision Workshops, pp.484-491, 2013.
DOI : 10.1109/ICCVW.2013.69
URL : https://hal.archives-ouvertes.fr/hal-01339262

N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout, Multi-scale Deep Learning for Gesture Detection and Localization, ECCVW, pp.474-490, 2014.
DOI : 10.1007/978-3-319-16178-5_33
URL : https://hal.archives-ouvertes.fr/hal-01419792

N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout, ModDrop: Adaptive Multi-Modal Gesture Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.8, 2015.
DOI : 10.1109/TPAMI.2015.2461544
URL : https://hal.archives-ouvertes.fr/hal-01178733

J. Y. Ng, J. Choi, J. Neumann, and L. S. Davis, Actionflownet: Learning motion representation for action recognition. arXiv preprint, 2016.

B. Ni, Y. Pei, Z. Liang, L. Lin, and P. Moulin, Integrating multi-stage depth-induced contextual information for human action recognition and localization, In FG, pp.1-8, 2013.

B. Ni, X. Yang, and S. Gao, Progressively Parsing Interactional Objects for Fine Grained Action Detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1020-1028, 2016.
DOI : 10.1109/CVPR.2016.116

N. Nishida and H. Nakayama, Multimodal Gesture Recognition Using Multi-stream Recurrent Neural Network, In PSIVT, pp.682-694, 2016.
DOI : 10.1007/978-3-319-29451-3_54

S. Oh, A large-scale benchmark dataset for event recognition in surveillance video, CVPR 2011, pp.3153-3160, 2011.
DOI : 10.1109/CVPR.2011.5995586

E. Ohn-bar and M. M. Trivedi, Hand Gesture Recognition in Real Time for Automotive Interfaces: A Multimodal Vision-Based Approach and Evaluations, IEEE Transactions on Intelligent Transportation Systems, vol.15, issue.6, pp.2368-2377, 2014.
DOI : 10.1109/TITS.2014.2337331

F. J. Ordóñez and D. Roggen, Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition, Sensors, vol.6, issue.1, p.115, 2016.
DOI : 10.1007/s00779-013-0638-2

W. Ouyang, X. Chu, and X. Wang, Multi-source Deep Learning for Human Pose Estimation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.2337-2344, 2014.
DOI : 10.1109/CVPR.2014.299

O. K. Oyedotun and A. Khashman, Deep learning in vision-based static hand gesture recognition, Neural Computing and Applications, pp.1-11, 2016.
DOI : 10.12720/joace.3.1.40-45

E. Park, X. Han, T. L. Berg, and A. C. Berg, Combining multiple sources of knowledge in deep CNNs for action recognition, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp.1-8, 2016.
DOI : 10.1109/WACV.2016.7477589

X. Peng and C. Schmid, Encoding feature maps of cnns for action recognition, CVPR, THUMOS Challenge 2015 Workshop, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01236843

X. Peng and C. Schmid, Multi-region Two-Stream R-CNN for Action Detection, European Conference on Computer Vision, pp.744-759, 2016.
DOI : 10.1109/CVPR.2015.7298735
URL : https://hal.archives-ouvertes.fr/hal-01349107

X. Peng, L. Wang, Z. Cai, Y. Qiao, and Q. Peng, Hybrid super vector with improved dense trajectories for action recognition, ICCV Workshops, 2013.

X. Peng, C. Zou, Y. Qiao, and Q. Peng, Action Recognition with Stacked Fisher Vectors, European Conference on Computer Vision, pp.581-595, 2014.
DOI : 10.1007/978-3-319-10602-1_38

X. Peng, L. Wang, Z. Cai, and Y. Qiao, Action and Gesture Temporal Spotting with Super Vector Representation, pp.518-527, 2015.
DOI : 10.1007/978-3-319-16178-5_36

L. Pigou, A. V. Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre, Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video, International Journal of Computer Vision, vol.86, issue.11
DOI : 10.1109/CVPR.2015.7298935

Y. Poleg, A. Ephrat, S. Peleg, and C. Arora, Compact CNN for indexing egocentric videos, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp.1-9, 2016.
DOI : 10.1109/WACV.2016.7477708
URL : http://arxiv.org/pdf/1504.07469

Z. Qiu, Q. Li, T. Yao, T. Mei, and Y. Rui, Msr asia msm at thumos challenge 2015, CVPR workshop, 2015.

A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, Proc. International Conference on Learning Representations, 2016.

H. Rahmani and A. Mian, 3D Action Recognition from Novel Viewpoints, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1506-1515, 2016.
DOI : 10.1109/CVPR.2016.167

H. Rahmani, A. Mian, and M. Shah, Learning a Deep Model for Human Action Recognition from Novel Viewpoints, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
DOI : 10.1109/TPAMI.2017.2691768

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Advances in neural information processing systems, pp.91-99, 2015.
DOI : 10.1109/TPAMI.2016.2577031
URL : http://arxiv.org/pdf/1506.01497

N. Rhinehart and K. M. Kitani, Learning Action Maps of Large Environments via First-Person Vision, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.69

A. Richard and J. Gall, Temporal Action Detection Using a Statistical Language Model, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.341

H. Sagha, J. del R. Millán, and R. Chavarriaga, Detecting anomalies to improve classification performance in opportunistic sensor networks, 2011 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), pp.154-159, 2011.
DOI : 10.1109/PERCOMW.2011.5766860

H. Sagha, S. T. Digumarti, J. del R. Millán, R. Chavarriaga et al., Benchmarking classification techniques using the Opportunity human activity dataset, 2011 IEEE International Conference on Systems, Man, and Cybernetics, pp.36-40, 2011.
DOI : 10.1109/ICSMC.2011.6083628

S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint, 2016.
DOI : 10.5244/c.30.58
URL : http://arxiv.org/pdf/1608.01529

J. Scharcanski and M. E. Celebi, Computer vision techniques for the diagnosis of skin cancer, 2014.
DOI : 10.1007/978-3-642-39608-3

A. Shahroudy, J. Liu, T. Ng, and G. Wang, NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1010-1019, 2016.
DOI : 10.1109/CVPR.2016.115
URL : http://arxiv.org/pdf/1604.02808

A. Shahroudy, T. Ng, Y. Gong, and G. Wang, Deep multimodal feature analysis for action recognition in RGB+D videos. arXiv preprint, 2016.

L. Shao, L. Liu, and M. Yu, Kernelized Multiview Projection for Robust Action Recognition, International Journal of Computer Vision, vol.34, issue.3, pp.115-129, 2016.
DOI : 10.1145/1273496.1273646
URL : https://link.springer.com/content/pdf/10.1007%2Fs11263-015-0861-6.pdf

Z. Shou, D. Wang, and S. Chang, Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1049-1058, 2016.
DOI : 10.1109/CVPR.2016.119
URL : http://arxiv.org/pdf/1601.02129

Z. Shu, K. Yun, and D. Samaras, Action Detection with Improved Dense Trajectories and Sliding Window, pp.541-551
DOI : 10.1007/978-3-319-16178-5_38
URL : http://www3.cs.stonybrook.edu/%7Ekyun/papers/zhixin_kiwon_chalearnLAP2014.pdf

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, pp.568-576, 2014.

B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1961-1970, 2016.
DOI : 10.1109/CVPR.2016.216

S. Singh, C. Arora, and C. Jawahar, First Person Action Recognition Using Deep Learned Descriptors, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2620-2628, 2016.
DOI : 10.1109/CVPR.2016.287

K. Soomro, H. Idrees, and M. Shah, Action Localization in Videos through Context Walk, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.375

W. Sultani and M. Shah, Automatic action annotation in weakly labeled videos, Computer Vision and Image Understanding, vol.161
DOI : 10.1016/j.cviu.2017.05.005

L. Sun, K. Jia, D. Yeung, and B. E. Shi, Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4597-4605, 2015.
DOI : 10.1109/ICCV.2015.522
URL : http://arxiv.org/pdf/1510.00562

J. Tompson, M. Stein, Y. LeCun, and K. Perlin, Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks, ACM Transactions on Graphics, vol.33, issue.5, pp.1-169, 2014.
DOI : 10.1145/1531326.1531369

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4489-4497, 2015.
DOI : 10.1109/ICCV.2015.510
URL : http://arxiv.org/pdf/1412.0767

P. Turaga, A. Veeraraghavan, and R. Chellappa, Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2008.
DOI : 10.1109/CVPR.2008.4587733

J. R. Uijlings, K. E. Van-de-sande, T. Gevers, and A. W. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.104, issue.2, pp.154-171, 2013.
DOI : 10.1007/s11263-013-0620-5

G. Varol, I. Laptev, and C. Schmid, Long-term Temporal Convolutions for Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
DOI : 10.1109/TPAMI.2017.2712608
URL : https://hal.archives-ouvertes.fr/hal-01241518

V. Veeriah, N. Zhuang, and G. Qi, Differential Recurrent Neural Networks for Action Recognition, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4041-4049, 2015.
DOI : 10.1109/ICCV.2015.460

S. Vishwakarma and A. Agrawal, A survey on activity recognition and behavior understanding in video surveillance, The Visual Computer, vol.114, issue.12, pp.983-1009, 2013.
DOI : 10.1016/j.cviu.2009.11.005

C. Vondrick and D. Ramanan, Video annotation and tracking with active learning, NIPS, 2011.

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, Phoneme recognition using time-delay neural networks. Readings in speech recognition, pp.393-404, 1990.

H. Wang, D. Oneata, J. Verbeek, and C. Schmid, A Robust and Efficient Video Representation for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, pp.1-20, 2015.
DOI : 10.1109/ICCV.2013.442
URL : https://hal.archives-ouvertes.fr/hal-01145834

H. Wang, W. Wang, and L. Wang, How scenes imply actions in realistic videos? In ICIP, pp.1619-1623, 2016.
DOI : 10.1109/icip.2016.7532632

J. Wang, W. Wang, R. Wang, and W. Gao, Deep alternative neural network: Exploring contexts as early as possible for action recognition, Advances in Neural Information Processing Systems, pp.811-819, 2016.

L. Wang, Y. Qiao, and X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.4305-4314, 2015.
DOI : 10.1109/cvpr.2015.7299059
URL : http://arxiv.org/pdf/1505.04868

L. Wang, Z. Wang, Y. Xiong, and Y. Qiao, CUHK&SIAT submission for thumos15 action recognition challenge, THUMOS Action Recognition challenge, pp.1-3, 2015.

L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, Towards good practices for very deep two-stream convnets. arXiv preprint, 2015.
DOI : 10.1007/978-3-319-46484-8_2
URL : http://arxiv.org/pdf/1608.00859

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, European Conference on Computer Vision, pp.20-36, 2016.
DOI : 10.1109/CVPR.2016.219
URL : http://arxiv.org/pdf/1608.00859

P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang et al., Action Recognition From Depth Maps Using Deep Convolutional Neural Networks, IEEE Transactions on Human-Machine Systems, vol.46, issue.4, pp.498-509, 2016.
DOI : 10.1109/THMS.2015.2504550

P. Wang, W. Li, S. Liu, Y. Zhang, Z. Gao et al., Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks, 2016 23rd International Conference on Pattern Recognition (ICPR), 2016.
DOI : 10.1109/ICPR.2016.7899600
URL : http://arxiv.org/pdf/1608.06338

P. Wang, Q. Song, H. Han, and J. Cheng, Sequentially Supervised Long Short-Term Memory for Gesture Recognition, Cognitive Computation, vol.115, issue.3, pp.1-10, 2016.
DOI : 10.1007/s11263-015-0816-y

P. Wang, W. Li, S. Liu, Z. Gao, C. Tang et al., Large-scale isolated gesture recognition using convolutional neural networks. arXiv preprint, 2017.
DOI : 10.1109/icpr.2016.7899599
URL : http://arxiv.org/pdf/1701.01814

Y. Wang and M. Hoai, Improving Human Action Recognition by Non-action Classification, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2698-2707, 2016.
DOI : 10.1109/CVPR.2016.295
URL : http://arxiv.org/pdf/1604.06397

Y. Wang, J. Song, L. Wang, L. Van Gool, and O. Hilliges, Two-Stream SR-CNNs for Action Recognition in Videos, Proceedings of the British Machine Vision Conference 2016, 2016.
DOI : 10.5244/C.30.108

Z. Wang, L. Wang, W. Du, and Y. Qiao, Exploring Fisher vector and deep networks for action spotting, CVPRW, pp.10-14

P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Learning to track for spatio-temporal action localization, CoRR, abs/1506.01929, 2015.
DOI : 10.1109/iccv.2015.362
URL : https://hal.archives-ouvertes.fr/hal-01159941

P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Learning to Track for Spatio-Temporal Action Localization, 2015 IEEE International Conference on Computer Vision (ICCV), pp.3164-3172, 2015.
DOI : 10.1109/ICCV.2015.362
URL : https://hal.archives-ouvertes.fr/hal-01159941

P. A. Wilson and B. Lewandowska-Tomaszczyk, Affective Robotics: Modelling and Testing Cultural Prototypes, Cognitive Computation, vol.1, issue.1, pp.814-840, 2014.
DOI : 10.1007/978-3-540-78157-8_11

C. Wolf, E. Lombardi, J. Mille, O. Celiktutan, M. Jiu et al., Evaluation of video activity localizations integrating quality and quantity measurements, Computer Vision and Image Understanding, vol.127, pp.14-30, 2014.
DOI : 10.1016/j.cviu.2014.06.014
URL : https://hal.archives-ouvertes.fr/hal-01283866

D. Wu, L. Pigou, P. J. Kindermans, N. Le, L. Shao et al., Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.8, pp.1-1, 2016.
DOI : 10.1109/TPAMI.2016.2537340
URL : http://doi.org/10.1109/tpami.2016.2537340

J. Wu, J. Cheng, C. Zhao, and H. Lu, Fusing multi-modal features for gesture recognition, Proceedings of the 15th ACM on International conference on multimodal interaction, ICMI '13, pp.453-460, 2013.
DOI : 10.1145/2522848.2532589

J. Wu, P. Ishwar, and J. Konrad, Two-stream CNNs for gesture-based verification and identification: Learning user style, CVPRW, 2016.
DOI : 10.1007/978-3-319-61657-5_7

J. Wu, G. Wang, W. Yang, and X. Ji, Action recognition with joint attention on multi-level deep features, 2016.

Z. Wu, Y. Fu, Y. Jiang, and L. Sigal, Harnessing object and scene semantics for large-scale video understanding, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3112-3121, 2016.
DOI : 10.1109/cvpr.2016.339

X. Xu, T. M. Hospedales, and S. Gong, Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation, Proc. European Conference on Computer Vision, 2016.
DOI : 10.1007/978-3-642-15561-1_11
URL : http://arxiv.org/pdf/1611.08663

J. Yamato, J. Ohya, and K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.379-385, 1992.
DOI : 10.1109/CVPR.1992.223161

Y. Ye and Y. Tian, Embedding Sequential Information into Spatiotemporal Features for Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016.
DOI : 10.1109/CVPRW.2016.142

S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, End-to-End Learning of Action Detection from Frame Glimpses in Videos, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2678-2687, 2016.
DOI : 10.1109/CVPR.2016.293

D. Yu, A. Eversole, M. Seltzer, K. Yao, Z. Huang et al., An introduction to computational networks and the computational network toolkit, 2014.

J. Yu, K. Weng, G. Liang, and G. Xie, A vision-based robotic grasping system using deep learning for 3D object recognition and pose estimation, 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp.1175-1180, 2013.
DOI : 10.1109/ROBIO.2013.6739623

J. Yuan, B. Ni, X. Yang, and A. Kassim, Temporal Action Localization with Pyramid of Score Distribution Features, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.337

J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals et al., Beyond short snippets: Deep networks for video classification, CVPR, pp.4694-4702, 2015.

S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov, Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint, 2015.
DOI : 10.5244/c.29.60
URL : http://arxiv.org/pdf/1503.04144

B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, Real-Time Action Recognition with Enhanced Motion Vector CNNs, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2718-2726, 2016.
DOI : 10.1109/CVPR.2016.297
URL : http://arxiv.org/pdf/1604.07669

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, Learning deep features for scene recognition using places database, NIPS, pp.487-495
DOI : 10.1109/tpami.2017.2723009

T. Zhou, N. Li, X. Cheng, Q. Xu, L. Zhou et al., Learning semantic context feature-tree for action recognition via nearest neighbor fusion, Neurocomputing, vol.201, pp.1-11, 2016.
DOI : 10.1016/j.neucom.2016.04.007

Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian, Interaction part mining: A mid-level approach for fine-grained action recognition, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3323-3331, 2015.
DOI : 10.1109/CVPR.2015.7298953
URL : http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Zhou_Interaction_Part_Mining_2015_CVPR_paper.pdf

G. Zhu, L. Zhang, L. Mei, J. Shao, J. Song et al., Large-scale Isolated Gesture Recognition using pyramidal 3D convolutional networks, 2016 23rd International Conference on Pattern Recognition (ICPR), 2016.
DOI : 10.1109/ICPR.2016.7899601

W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao, A Key Volume Mining Deep Framework for Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1991-1999, 2016.
DOI : 10.1109/CVPR.2016.219

C. L. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, European Conference on Computer Vision, pp.391-405, 2014.
DOI : 10.1007/978-3-319-10602-1_26
URL : http://research.microsoft.com/en-us/um/people/larryz/ZitnickDollarECCV14edgeBoxes.pdf