J. Aggarwal and Q. Cai, Human motion analysis: A review, Computer Vision and Image Understanding, vol.73, issue.3, p.30, 1999.

I. Akhter and M. J. Black, Pose-conditioned joint angle limits for 3D human pose reconstruction, CVPR, p.22, 2015.

T. Alldieck, M. Kassubeck, B. Wandt, B. Rosenhahn, and M. Magnor, Optical flow-based 3D human motion estimation from monocular video, GCPR, p.26, 2017.

Y. Amit and D. Geman, Shape quantization and recognition with randomized trees, Neural Computation, vol.9, issue.7, p.19, 1997.

M. Andriluka, S. Roth, and B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, CVPR, p.19, 2009.

M. Andriluka, S. Roth, and B. Schiele, Discriminative appearance models for pictorial structures, International Journal of Computer Vision, vol.99, p.19, 2012.

M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, 2D human pose estimation: New benchmark and state of the art analysis, CVPR, vol.71, p.74, 2014.

D. Anguelov, Learning Models of Shape from 3D Range Data, p.27, 2005.

D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers et al., SCAPE: Shape completion and animation of people, SIGGRAPH, vol.26, p.65, 2005.

N. I. Badler, J. O'Rourke, and H. Tolzis, A human body modelling system for motion studies, IEEE, vol.11, p.18, 1979.

N. Badler, Temporal Scene Analysis: Conceptual Descriptions of Object Movements, vol.29, p.30, 1975.

A. Balan, L. Sigal, M. J. Black, J. Davis, and H. Haussecker, Detailed human shape and pose from images, CVPR, vol.26, p.64, 2007.

L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys, Motion capture of hands in action using discriminative salient points, ECCV, p.176, 2012.

N. Ballas, L. Yao, C. J. Pal, and A. C. Courville, Delving deeper into convolutional networks for learning video representations, ICLR, p.35, 2016.

F. Baradel, C. Wolf, and J. Mille, Pose-conditioned spatio-temporal attention for human action recognition. CoRR, abs/1703.10106, vol.125, p.139, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01593548

F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, Glimpse clouds: Human activity recognition from unstructured feature points, CVPR, vol.141, p.143, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01713109

I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis, Looking beyond appearances: Synthetic training data for deep CNNs in re-identification, Computer Vision and Image Understanding, vol.167, p.74, 2018.

A. Baumberg and D. Hogg, Efficient method for contour tracking using active shape models, Motion of Non-Rigid and Articulated Objects Workshop, p.31, 1994.

A. Baumberg and D. C. Hogg, Generating spatiotemporal models from examples, Image and Vision Computing, vol.14, p.31, 1996.

H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, Dynamic image networks for action recognition, CVPR, vol.35, p.100, 2016.

M. J. Black, Y. Yacoob, A. D. Jepson, and D. J. Fleet, Learning parameterized models of image motion, CVPR, p.32, 1997.

Blender: a 3D modelling and rendering package, vol.42, p.185.

A. J. Bobick, Movement, activity and action: the role of knowledge in the perception of motion, Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, vol.352, pp.1257-1265, 1997.

BodyNet project page, p.87.

F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero et al., Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image, ECCV, vol.64, p.80, 2016.

C. Bregler, Learning and recognizing human dynamics in video sequences, CVPR, p.32, 1997.

C. Bregler and J. Malik, Tracking people with twists and exponential maps, CVPR, p.31, 1998.

J. Bromley, I. Guyon, Y. Lecun, E. Säckinger, and R. Shah, Signature verification using a "Siamese" time delay neural network, NIPS, p.127, 1993.

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High accuracy optical flow estimation based on a theory for warping, ECCV, vol.101, p.103, 2004.

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, A naturalistic open source movie for optical flow evaluation, ECCV, p.74, 2012.

F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, CVPR, vol.154, p.155, 2015.

Y. Cai, L. Ge, J. Cai, and J. Yuan, Weakly-supervised 3D hand pose estimation from monocular RGB images, ECCV, vol.196, p.197, 2018.

K. Cao, Y. Rong, C. Li, X. Tang, and C. C. Loy, Pose-robust face recognition via deep residual equivariant mapping, CVPR, vol.127, p.134, 2018.

Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, CVPR, vol.21, p.62, 2017.

Carnegie Mellon MoCap Database, vol.43, p.184.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, CVPR, vol.35, p.128, 2017.

J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, Human pose estimation with iterative error feedback, CVPR, p.20, 2016.

C. Cédras and M. Shah, Motion-based recognition: a survey, Image and Vision Computing, vol.13, issue.2, p.29, 1995.

A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang et al., ShapeNet: An information-rich 3D model repository, vol.176, p.184, 2015.

J. Charles, T. Pfister, M. Everingham, and A. Zisserman, Automatic and efficient human pose estimation for sign language videos, International Journal of Computer Vision, p.19, 2013.

C. Chen and D. Ramanan, 3D human pose estimation = 2D pose estimation + matching, CVPR, p.25, 2017.

D. L. Chen and W. B. Dolan, Collecting highly parallel data for paraphrase evaluation, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol.1, p.154, 2011.

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, ICLR, p.24, 2015.

L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, Attention to scale: Scale-aware semantic image segmentation, CVPR, vol.25, p.46, 2016.

W. Chen, H. Wang, Y. Li, H. Su, Z. Wang et al., Synthesizing training images for boosting human 3D pose estimation, vol.3, p.74, 2016.

W. Chen, Z. Fu, D. Yang, and J. Deng, Single-image depth perception in the wild, NIPS, vol.25, p.47, 2016.

X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta et al., Microsoft COCO captions: Data collection and evaluation server, p.167, 2015.

X. Chen and A. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, NIPS, p.20, 2014.

X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun et al., Detect what you can: Detecting and representing objects using holistic models and body parts, CVPR, vol.24, p.25, 2014.

C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction, ECCV, p.197, 2016.

Chumpy.

E. Coumans, Bullet real-time physics simulation, 2013.

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, p.33, 2004.

L. Bourdev and J. Malik, Poselets: Body part detectors trained using 3D human pose annotations, ICCV, p.19, 2009.

N. Dalal, B. Triggs, and C. Schmid, Human detection using oriented histograms of flow and appearance, ECCV, p.34, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00548587

M. de La Gorce, D. J. Fleet, and N. Paragios, Model-based 3D hand pose estimation from monocular video, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, issue.9, p.173, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00856313

H. Deng, T. Birdal, and S. Ilic, PPFNet: Global context aware local features for robust 3D point matching, CVPR, p.63, 2018.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, CVPR, vol.97, p.151, 2009.

J. Deutscher, A. Blake, and I. Reid, Articulated body motion capture by annealed particle filtering, CVPR, p.31, 2000.

J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, Exploring nearest neighbor approaches for image captioning, p.168, 2015.

E. Dibra, H. Jain, C. Öztireli, R. Ziegler, and M. Gross, HS-Nets: Estimating human body shape from silhouettes with convolutional neural networks, vol.3, p.28, 2016.

E. Dibra, S. Melchior, T. Wolf, A. Balkis, A. C. Öztireli et al., Monocular RGB hand pose inference from unsupervised refinable nets, CVPR Workshops, vol.173, p.175, 2018.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, vol.34, p.100, 2015.

A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas et al., FlowNet: Learning optical flow with convolutional networks, ICCV, p.39, 2015.

Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui et al., Marker-less 3D human motion capture with monocular image sequence and height-maps, ECCV, p.40, 2016.

D. Dwibedi, J. Tompson, C. Lynch, and P. Sermanet, Learning actionable representations from visual observations, IROS, p.127, 2018.

A. Efros, A. Berg, G. Mori, and J. Malik, Recognizing action at a distance, ICCV, p.32, 2003.

M. Eichner, M. Marín-Jiménez, A. Zisserman, and V. Ferrari, 2D articulated human pose estimation and retrieval in (almost) unconstrained still images, International Journal of Computer Vision, vol.99, issue.2, p.19, 2012.

D. Eigen and R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, ICCV, vol.25, p.46, 2015.

D. Eigen, C. Puhrsch, and R. Fergus, Depth map prediction from a single image using a multi-scale deep network, NIPS, vol.25, p.48, 2014.

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results.

S. R. Fanello, C. Keskin, S. Izadi, P. Kohli, D. Kim et al., Learning to be a depth camera for close-range human capture and interaction, SIGGRAPH, p.39, 2014.

C. Farabet, C. Couprie, L. Najman, and Y. Lecun, Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, p.24, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00742077

A. Farhadi and M. Tabrizi, Learning to recognize activities from the wrong view point, ECCV, p.125, 2008.

G. Farnebäck, Two-frame motion estimation based on polynomial expansion, SCIA, vol.101, p.103, 2003.

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional two-stream network fusion for video action recognition, CVPR, vol.35, p.122, 2016.

T. Feix, J. Romero, H. Schmiedmayer, A. Dollar, and D. Kragic, The GRASP taxonomy of human grasp types, IEEE Transactions on Human-Machine Systems, p.203, 2016.

P. Felzenszwalb, D. McAllester, and D. Ramanan, A discriminatively trained, multiscale, deformable part model, CVPR, p.19, 2008.

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, p.19, 2010.

P. Felzenszwalb and D. Huttenlocher, Pictorial structures for object recognition, International Journal of Computer Vision, vol.61, p.18, 2005.

C. Fernández, P. Baiget, F. Roca, and J. Gonzàlez, Determining the best suited semantic events for cognitive surveillance, Expert Systems with Applications, vol.38, issue.4, pp.4068-4079, 2011.

B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars, Modeling video evolution for action recognition, CVPR, p.34, 2015.

C. Ferrari and J. F. Canny, Planning optimal grasps, ICRA, p.203, 1992.

V. Ferrari, M. Marín-Jiménez, and A. Zisserman, Progressive search space reduction for human pose estimation, CVPR, p.19, 2008.

V. Ferrari, M. Marín-Jiménez, and A. Zisserman, 2D human pose estimation in TV shows, Statistical and Geometrical Approaches to Visual Motion Analysis, p.154, 2009.

M. A. Fischler and R. A. Elschlager, The representation and matching of pictorial structures, IEEE Transactions on Computers, vol.C-22, issue.1, p.18, 1973.

D. A. Forsyth and M. M. Fleck, Body plans, CVPR, vol.18, 1997.

D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik, From lifestyle VLOGs to everyday interactions, CVPR, p.8, 2018.

A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, Virtual worlds as proxy for multiobject tracking analysis, CVPR, p.40, 2016.

G. Garcia-hernando, S. Yuan, S. Baek, and T. Kim, First-person hand action benchmark with RGB-D videos and 3D hand pose annotations, CVPR, vol.173, p.195, 2018.

D. M. Gavrila and L. S. Davis, Towards 3-D model-based tracking and recognition of human movement: a multi-view approach, Int. Workshop on Face and Gesture Recognition, p.31, 1995.

D. Gavrila, The visual analysis of human movement: A survey. Computer Vision and Image Understanding, vol.73, p.29, 1999.

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, Vision meets robotics: The KITTI dataset, International Journal of Robotics Research, vol.32, p.25, 2013.

M. F. Ghezelghieh, R. Kasturi, and S. Sarkar, Learning camera viewpoint using CNN to improve 3D body pose estimation, vol.3, p.74, 2016.

R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta, Learning a predictable and generative vector representation for objects, ECCV, p.63, 2016.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, vol.97, p.99, 2014.

G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, Using k-poselets for detecting people and localizing their keypoints, CVPR, p.19, 2014.

C. Goldfeder, M. T. Ciocarlie, H. Dang, and P. K. Allen, The Columbia grasp database, ICRA, vol.183, p.203, 2009.

K. Gong, X. Liang, X. Shen, and L. Lin, Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing, CVPR, p.25, 2017.

J. Gonzàlez, Human Sequence Evaluation: The Key-frame Approach, 2004.

A. Gorban, H. Idrees, Y. Jiang, A. Roshan Zamir, I. Laptev et al., THUMOS challenge: Action recognition with a large number of classes, vol.151, p.155, 2015.

L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, Actions as spacetime shapes, Transactions on Pattern Analysis and Machine Intelligence, vol.29, issue.12, p.152, 2007.

R. Green, Spherical harmonic lighting: The gritty details, Archives of the Game Developers Conference, vol.56, p.44, 2003.

T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry, AtlasNet: A Papier-Mâché approach to learning 3D surface generation, CVPR, vol.63, p.198, 2018.

T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry, 3D-CODED: 3D correspondences by deep deformation, ECCV, vol.179, p.199, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01830474

C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru et al., AVA: A video dataset of spatio-temporally localized atomic visual actions, CVPR, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01764300

P. Guan, A. Weiss, A. O. Balan, and M. J. Black, Estimating human shape and pose from a single image, ICCV, vol.26, p.64, 2009.

R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou et al., DenseReg: Fully convolutional dense shape regression in-the-wild, CVPR, p.66, 2017.

R. A. Güler, N. Neverova, and I. Kokkinos, DensePose: Dense human pose estimation in the wild, CVPR, vol.25, p.81, 2018.

A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham, 3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding, CVPR, p.138, 2014.

A. Gupta and L. S. Davis, Objects in action: An approach for combining action understanding and object perception, CVPR, p.154, 2007.

H. Hamer, K. Schindler, E. Koller-Meier, and L. Van Gool, Tracking a hand manipulating an object, ICCV, vol.172, p.176, 2009.

H. Hamer, J. Gall, T. Weise, and L. Van Gool, An object-dependent hand pose prior from sparse training data, CVPR, p.176, 2010.

K. Hara, H. Kataoka, and Y. Satoh, Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? CVPR, vol.35, p.141, 2018.

B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik, Simultaneous detection and segmentation, ECCV, p.24, 2014.

B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik, Hypercolumns for object segmentation and fine-grained localization, CVPR, p.24, 2015.

Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black et al., Learning joint reconstruction of hands and manipulated objects, CVPR, vol.11, p.13, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02429093

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, vol.130, p.202, 2015.

T. Heap and D. Hogg, Towards 3D hand tracking using a deformable model, International Conference on Automatic Face and Gesture Recognition, vol.172, p.175, 1996.

G. Hinton, Using relaxation to find a puppet, Artificial Intelligence and Simulation of Behaviour, vol.18, p.20, 1976.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, p.34, 1997.

D. Hogg, Model-based vision: a program to see a walking person, Image and Vision Computing, vol.1, issue.1, p.19, 1983.

S. Hongeng and R. Nevatia, Multi-agent event recognition, ICCV, 2001.

J. Hu, W. Zheng, J. Lai, and J. Zhang, Jointly learning heterogeneous features for RGB-D activity recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.11, p.141, 2017.

Y. Huang, F. Bogo, C. Lassner, A. Kanazawa, P. V. Gehler et al., Towards accurate marker-less human shape and pose estimation over time, vol.3, p.26, 2017.

E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, DeeperCut: A deeper, stronger, and faster multi-person pose estimation model, ECCV, vol.19, p.21, 2016.

S. Ioffe and D. Forsyth, Probabilistic methods for finding people, International Journal of Computer Vision, vol.43, issue.1, pp.45-68, 2001.

C. Ionescu, L. Fuxin, and C. Sminchisescu, Latent structured models for human pose estimation, ICCV, vol.48, p.53, 2011.

C. Ionescu, J. Carreira, and C. Sminchisescu, Iterated second-order label sensitive pooling for 3D human pose estimation, CVPR, vol.45, p.53, 2014.

C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, issue.7, p.86, 2014.

U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, Hand pose estimation via latent 2.5D heatmap regression, ECCV, vol.173, p.197, 2018.

M. Isard and A. Blake, CONDENSATION: conditional density propagation for visual tracking, International Journal of Computer Vision, vol.29, issue.1, p.32, 1998.

Y. Iwashita, A. Takamine, R. Kurazume, and M. S. Ryoo, First-person animal activity recognition from egocentric videos, ICPR, p.154, 2014.

A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos, Large pose 3D face reconstruction from a single image via direct volumetric CNN regression, ICCV, vol.65, p.67, 2017.

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, Towards understanding action recognition, ICCV, p.24, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00906902

S. Ji, W. Xu, M. Yang, and K. Yu, 3D convolutional neural networks for human action recognition, ICML, vol.99, p.100, 2010.

Y. Ji, F. Xu, Y. Yang, F. Shen, H. T. Shen et al., A large-scale RGB-D database for arbitrary-view human action recognition, ACMMM, vol.124, p.141, 2018.

S. Johnson and M. Everingham, Clustered pose and nonlinear appearance models for human pose estimation, BMVC, vol.20, p.74, 2010.

S. X. Ju, M. J. Black, and Y. Yacoob, Cardboard people: a parameterized model of articulated image motion, International Conference on Automatic Face and Gesture Recognition, p.32, 1996.

M. Kan, S. Shan, and X. Chen, Multi-view deep network for cross-view classification, CVPR, p.127, 2016.

A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, End-to-end recovery of human shape and pose, CVPR, vol.63, p.177, 2018.

A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, Learning category-specific mesh reconstruction from image collections, ECCV, vol.179, p.199, 2018.

V. Kantorov and I. Laptev, Efficient feature extraction, encoding, and classification for action recognition, CVPR, vol.101, p.103, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01058734

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, CVPR, vol.155, p.164, 2014.

H. Kato, Y. Ushiku, and T. Harada, Neural 3D mesh renderer, CVPR, vol.176, p.179, 2018.

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The Kinetics human action video dataset, vol.35, p.130, 2017.

Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, A new representation of skeleton sequences for 3D action recognition, CVPR, vol.125, p.139, 2017.

C. Keskin, F. Kıraç, Y. Kara, and L. Akarun, Hand pose estimation and hand shape classification using multi-layered randomized decision forests, ECCV, vol.172, p.175, 2012.

Kinect.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, vol.198, p.202, 2015.

A. Kläser, M. Marszałek, and C. Schmid, A spatio-temporal descriptor based on 3D-gradients, BMVC, p.33, 2008.

Y. Kong, Z. Ding, J. Li, and Y. Fu, Deeply learned view-invariant features for cross-view action recognition, IEEE Transactions on Image Processing, vol.26, issue.6, p.125, 2017.

Y. Kong and Y. Fu, Human action recognition and prediction: A survey. CoRR, abs/1806.11230, p.124, 2018.

I. Kostrikov and J. Gall, Depth sweep regression forests for estimating 3D human pose from images, BMVC, vol.64, p.87, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, vol.34, p.163, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: a large video database for human motion recognition, ICCV, vol.153, p.155, 2011.

H. Kuehne, A. B. Arslan, and T. Serre, The language of actions: Recovering the syntax and semantics of goal-directed human activities, CVPR, vol.152, p.154, 2014.

Z. Lan, M. Lin, X. Li, A. G. Hauptmann, R. et al., Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition, CVPR, p.116, 2015.

I. Laptev, On space-time interest points, International Journal of Computer Vision, vol.64, issue.2-3, p.33, 2005.

I. Laptev, Modeling and visual recognition of human actions and interactions. Habilitation à diriger des recherches en mathématiques et en informatique, Ecole normale supérieure, 2013.
URL : https://hal.archives-ouvertes.fr/tel-01064540

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, CVPR, vol.33, p.151, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00548659

C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black et al., Unite the people: Closing the loop between 3D and 2D human representations, CVPR, vol.85, p.87, 2017.

Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard et al., Backpropagation applied to handwritten zip code recognition, Neural Computation, vol.1, issue.4, p.99, 1989.

H. Lee and Z. Chen, Determination of 3D human body postures from a single view, vol.30, p.20, 1985.

I. Lenz, H. Lee, and A. Saxena, Deep learning for detecting robotic grasps, The International Journal of Robotics Research, p.183, 2015.

V. Leroy, J. Franco, and E. Boyer, Multi-view dynamic shape refinement using local temporal integration, ICCV, p.62, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01567758

T. Lewiner, H. Lopes, A. W. Vieira, and G. Tavares, Efficient implementation of marching cubes cases with topological guarantees, Journal of Graphics Tools, vol.8, issue.2, p.87, 2003.

S. Li and A. B. Chan, 3D human pose estimation from monocular images with deep convolutional neural network, ACCV, p.22, 2014.

J. Lin, Y. Wu, and T. S. Huang, Modeling the constraints of human hand motion, Proceedings of the Workshop on Human Motion, p.178, 2000.

T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick et al., Microsoft COCO: Common objects in context, ECCV, p.20, 2014.

F. Liu, C. Shen, and G. Lin, Deep convolutional neural fields for depth estimation from a single image, CVPR, vol.25, p.46, 2015.

J. Liu, G. Wang, P. Hu, L. Duan, and A. C. Kot, Global context-aware attention LSTM networks for 3D action recognition, CVPR, vol.125, p.139, 2017.

J. Liu, J. Luo, and M. Shah, Recognizing realistic actions from videos "in the wild", CVPR, vol.151, p.153, 2009.

J. Liu, B. Kuipers, and S. Savarese, Recognizing human actions by attributes, CVPR, p.34, 2011.

J. Liu, M. Shah, B. Kuipers, and S. Savarese, Cross-view action recognition via view knowledge transfer, CVPR, p.122, 2011.

J. Liu, A. Shahroudy, D. Xu, and G. Wang, Spatio-temporal LSTM with trust gates for 3D human action recognition, ECCV, vol.125, p.139, 2016.

M. Liu and J. Yuan, Recognizing human actions as the evolution of pose estimation maps, CVPR, vol.125, p.139, 2018.

M. Liu, H. Liu, and C. Chen, Enhanced skeleton visualization for view invariant human action recognition, Pattern Recognition, vol.68, p.139, 2017.

V. Lomonaco and D. Maltoni, CORe50: a new dataset and benchmark for continuous object recognition, Proceedings of the 1st Annual Conference on Robot Learning, Proceedings of Machine Learning Research, vol.192, p.204, 2017.

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, CVPR, p.24, 2015.

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, SMPL: A skinned multi-person linear model, SIGGRAPH Asia, vol.177, p.184, 2015.

M. M. Loper, N. Mahmood, and M. J. Black, MoSh: Motion and shape capture from sparse markers, SIGGRAPH Asia, vol.12, p.62, 2014.

D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, vol.60, p.33, 2004.

LTC project page, p.118.

A. Lucchi, Y. Li, X. Boix, K. Smith, and P. Fua, Are spatial and global constraints really necessary for segmentation? In ICCV, p.24, 2011.

Z. Luo, J. Hsieh, L. Jiang, J. C. Niebles, and L. Fei-fei, Graph distillation for action detection with privileged information, ECCV, vol.125, p.143, 2018.

D. C. Luvizon, D. Picard, and H. Tabia, 2D/3D pose estimation and action recognition using multitask deep learning, CVPR, vol.66, p.143, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01815703

J. MacCormick and M. Isard, Partitioned sampling, articulated objects, and interface-quality hand tracking, ECCV, p.172, 2000.

J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan et al., Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics, p.183, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01801048

M. Maire, S. X. Yu, and P. Perona, Object detection and segmentation from joint embedding of parts and pixels, ICCV, p.24, 2011.

J. Malik, A. Elhayek, F. Nunnari, K. Varanasi, K. Tamaddon et al., DeepHPS: End-to-end estimation of 3D hand pose and shape by learning from synthetic depth, vol.3, p.175, 2018.

E. Mansimov, N. Srivastava, and R. Salakhutdinov, Initialization strategies of spatio-temporal convolutional neural networks, p.35, 2015.

J. Marin, D. Vazquez, D. Geronimo, and A. M. Lopez, Learning appearance in virtual scenarios for pedestrian detection, CVPR, vol.8, p.40, 2010.

D. Marr and H. K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes, Proceedings of the Royal Society of London B, vol.18, p.31, 1978.

M. Marszałek, I. Laptev, and C. Schmid, Actions in context, CVPR, p.154, 2009.

J. Martinez, R. Hossain, J. Romero, and J. J. Little, A simple yet effective baseline for 3D human pose estimation, ICCV, vol.22, p.64, 2017.

F. Massa, B. Russell, and M. Aubry, Deep exemplar 2D-3D detection by adapting from real to rendered views, CVPR, p.127, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01800639

D. Maturana and S. Scherer, VoxNet: A 3D convolutional neural network for real-time object recognition, IROS, vol.63, p.176, 2015.

D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei et al., VNect: Real-time 3D human pose estimation with a single RGB camera, p.22, 2017.

A. T. Miller and P. K. Allen, GraspIt!: A versatile simulator for robotic grasping, IEEE Robotics & Automation Magazine, vol.11, p.203, 2004.

G. A. Miller, English verbs of motion: a case study in semantics and lexical memory, Coding Processes and Human Memory, vol.29, p.30, 1972.

P. Min, binvox.

T. B. Moeslund, A. Hilton, V. Krüger, and L. Sigal, Visual Analysis of Humans: Looking at People, p.19, 2013.

T. Möller and B. Trumbore, Fast, minimum storage ray-triangle intersection, J. Graph. Tools, p.181, 1997.

F. Moreno-noguer, 3D human pose estimation from a single image via distance matrix regression, CVPR, p.22, 2017.

F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas et al., Real-time hand tracking under occlusion from an egocentric RGB-D sensor, p.175, 2017.

F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar et al., GANerated hands for real-time 3D hand tracking from monocular RGB, CVPR, vol.173, p.197, 2018.

H. Nagel, From image sequences towards conceptual descriptions, Image and Vision Computing, vol.6, issue.2, pp.59-74, 1988.

B. Neumann and H. Novak, Event models for recognition and natural language description of events in real-world image sequences, IJCAI, p.29, 1983.

A. Newell, K. Yang, and J. Deng, Stacked hourglass networks for human pose estimation, ECCV, vol.84, p.92, 2016.

A. Newell, Z. Huang, and J. Deng, Associative embedding: End-to-end learning for joint detection and grouping, NIPS, vol.19, p.21, 2017.

J. Y.-H. Ng, J. Choi, J. Neumann, and L. S. Davis, ActionFlowNet: Learning motion representation for action recognition, WACV, p.35, 2018.

J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, CVPR, vol.34, p.116, 2015.

J. C. Niebles, H. Wang, and L. Fei-fei, Unsupervised learning of human action categories using spatial-temporal words, IJCV, vol.79, issue.3, p.97, 2008.

S. A. Niyogi and E. H. Adelson, Analyzing gait with spatiotemporal surfaces, Motion of Non-Rigid and Articulated Objects Workshop, vol.32, p.33, 1994.

J. Nocedal and S. J. Wright, Numerical Optimization, p.73, 2006.

F. S. Nooruddin and G. Turk, Simplification and repair of polygonal models using volumetric techniques, IEEE Transactions on Visualization and Computer Graphics, vol.9, issue.2, p.67, 2003.

ObMan project page.

S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen et al., A large-scale benchmark dataset for event recognition in surveillance video, CVPR, p.152, 2011.

I. Oikonomidis, N. Kyriazis, and A. A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, BMVC, p.172, 2011.

I. Oikonomidis, N. Kyriazis, and A. A. Argyros, Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints, ICCV, p.176, 2011.

I. Oikonomidis, N. Kyriazis, and A. A. Argyros, Tracking the articulated motion of two strongly interacting hands, CVPR, p.176, 2012.

R. Okada and S. Soatto, Relevant feature selection for human pose estimation and localization in cluttered images, ECCV, vol.8, p.40, 2008.

G. Oliveira, A. Valada, C. Bollen, W. Burgard, and T. Brox, Deep learning for human part discovery in images, ICRA, vol.46, p.52, 2016.

M. Omran, C. Lassner, G. Pons-Moll, P. V. Gehler, and B. Schiele, Neural body fitting: Unifying deep learning and model-based human pose and shape estimation, vol.3, p.28, 2018.

J. O'Rourke and N. I. Badler, Model-based image analysis of human motion using constraint propagation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.2, issue.6, p.19, 1980.

P. Panteleris, I. Oikonomidis, and A. Argyros, Using a single RGB frame for real time 3D hand pose estimation in the wild, WACV, vol.173, p.175, 2018.

G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman et al., Expressive body capture: 3D hands, face, and body from a single image, CVPR, vol.27, p.28, 2019.

G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, Coarse-to-fine volumetric prediction for single-image 3D human pose, CVPR, vol.64, p.70, 2017.

G. Pavlakos, X. Zhou, and K. Daniilidis, Ordinal depth supervision for 3D human pose estimation, CVPR, 2018.

G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis, Learning to estimate 3D human pose and shape from a single color image, CVPR, vol.28, p.177, 2018.

X. Peng, B. Sun, K. Ali, and K. Saenko, Learning deep object detectors from 3D models, ICCV, p.39, 2015.

A. Pentland and B. Horowitz, Recovery of nonrigid motion and structure, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.13, p.32, 1991.

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher kernel for large-scale image classification, ECCV, vol.33, p.163, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00548630

T. Pfister, Advancing Human Pose and Gesture Recognition, p.19, 2015.

T. Pham, N. Kyriazis, A. A. Argyros, and A. Kheddar, Hand-object contact force estimation from markerless visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.176, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01356138

H. Pirsiavash and D. Ramanan, Detecting activities of daily living in first-person camera views, CVPR, vol.154, p.155, 2012.

L. Pishchulin, A. Jain, C. Wojek, M. Andriluka, T. Thormählen et al., Learning people detection models from few training samples, CVPR, vol.8, p.40, 2011.

L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele, Articulated people detection and pose estimation: Reshaping the future, CVPR, vol.8, p.40, 2012.

L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka et al., DeepCut: Joint subset partition and labeling for multi person pose estimation, CVPR, vol.21, p.62, 2016.

G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black, Dyna: A model of dynamic human shape in motion, SIGGRAPH, p.44, 2015.

A. Popa, M. Zanfir, and C. Sminchisescu, Deep multitask architecture for integrated 2D and 3D human sensing, CVPR, vol.22, p.66, 2017.

PrimeSense.

W. Qiu, Generating human images and ground truth using computer graphics. Master's thesis, UCLA, p.40, 2016.

M. Rad, M. Oberweger, and V. Lepetit, Feature mapping for learning fast and accurate 3D pose inference from synthetic images, CVPR, vol.127, p.134, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02506574

H. Rahmani, A. Mian, and M. Shah, Learning a deep model for human action recognition from novel viewpoints, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.3, p.125, 2018.

H. Rahmani and A. Mian, Learning a non-linear knowledge transfer model for crossview action recognition, CVPR, vol.40, p.138, 2015.

H. Rahmani and A. Mian, 3D action recognition from novel viewpoints, CVPR, vol.40, p.41, 2016.

H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, Histogram of oriented principal components for cross-view action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.12, p.124, 2016.

V. Ramakrishna, T. Kanade, and Y. Sheikh, Reconstructing 3D human pose from 2D image landmarks, ECCV, p.22, 2012.

D. Ramanan, Learning to parse images of articulated bodies, NIPS, 2006.

D. Ramanan, D. A. Forsyth, and A. Zisserman, Strike a pose: Tracking people by finding stylized poses, CVPR, p.19, 2005.

J. M. Rehg and T. Kanade, Visual tracking of high DOF articulated structures: An application to human hand tracking, ECCV, vol.172, p.175, 1994.

H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei et al., EgoCap: Egocentric marker-less motion capture with two fisheye cameras, SIGGRAPH Asia, p.41, 2016.

G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger, OctNetFusion: Learning depth fusion from data, 3DV, vol.3, p.63, 2017.

G. Riegler, A. O. Ulusoy, and A. Geiger, OctNet: Learning deep 3D representations at high resolutions, CVPR, p.63, 2017.

K. Robinette, S. Blackwell, H. Daanen, M. Boehmer, S. Fleming et al., Civilian American and European Surface Anthropometry Resource (CAESAR), Final Report, vol.43, p.184, 2002.

M. D. Rodriguez, J. Ahmed, and M. Shah, Action MACH: A spatio-temporal maximum average correlation height filter for action recognition, CVPR, p.151, 2008.

G. Rogez and C. Schmid, MoCap-guided data augmentation for 3D pose estimation in the wild, NIPS, vol.40, p.87, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01389486

G. Rogez, M. Khademi, J. S. Supančič, J. M. M. Montiel, and D. Ramanan, 3D hand pose detection in egocentric RGB-D images, ECCV Workshop on Consumer Depth Cameras for Computer Vision, p.176, 2014.

G. Rogez, J. S. Supančič, and D. Ramanan, First-person pose recognition using egocentric workspaces, CVPR, p.176, 2015.

G. Rogez, J. S. Supančič, and D. Ramanan, Understanding everyday hands in action from RGB-D images, ICCV, p.176, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01237011

G. Rogez, P. Weinzaepfel, and C. Schmid, LCR-Net: Localization-classification-regression for human pose, CVPR, vol.64, p.87, 2017.

K. Rohr, Towards model-based recognition of human movements in image sequences, CVGIP: Image Understanding, vol.59, issue.1, p.31, 1994.

K. Rohr, Human movement analysis based on explicit motion models, Motion-Based Recognition, p.31, 1997.

A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal et al., Coherent multi-sentence video description with variable level of detail, Pattern Recognition, vol.152, p.154, 2014.

A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, A dataset for movie description, CVPR, vol.154, p.155, 2015.

M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, A database for fine grained activity detection of cooking activities, CVPR, vol.152, p.155, 2012.

J. Romero, H. Kjellström, and D. Kragic, Hands in action: real-time 3D reconstruction of hands in interaction with objects, ICRA, vol.176, p.177, 2010.

J. Romero, M. Loper, and M. J. Black, FlowCap: 2D human pose from optical flow, p.40, 2015.

J. Romero, D. Tzionas, and M. J. Black, Embodied hands: Modeling and capturing hands and bodies together, SIGGRAPH Asia, vol.36, p.184, 2017.

R. Ronfard, C. Schmid, and B. Triggs, Learning to parse pictures of people, ECCV, vol.18, p.20, 2002.
URL : https://hal.archives-ouvertes.fr/inria-00545109

O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, MICCAI, p.24, 2015.

A. Rozantsev, M. Salzmann, and P. Fua, Beyond sharing weights for deep domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.127, 2018.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision (IJCV), vol.115, issue.3, p.202, 2015.

M. S. Ryoo and J. K. Aggarwal, Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities, ICCV, p.154, 2009.

S. Sadanand and J. J. Corso, Action bank: A high-level representation of activity in video, CVPR, p.34, 2012.

A. Sahbani, S. El-khoury, and P. Bidaud, An overview of 3D object grasp synthesis algorithms, Robotics and Autonomous Systems, p.183, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00731127

G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

B. Sapp and B. Taskar, MODEC: Multimodal decomposable models for human pose estimation, CVPR, vol.20, p.38, 2013.

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: A local SVM approach, ICPR, vol.152, p.153, 2004.

P. Scovanner, S. Ali, and M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, ACM International Conference on Multimedia, p.33, 2007.

P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang et al., Time-contrastive networks: Self-supervised learning from video, ICRA, vol.127, p.137, 2018.

M. Shah and R. Jain, Motion-Based Recognition, p.29, 1997.

A. Shahroudy, J. Liu, T. Ng, and G. Wang, NTU RGB+D: A large scale dataset for 3D human activity analysis, CVPR, vol.123, p.139, 2016.

N. Shapovalova, C. Fernández, F. X. Roca, and J. Gonzàlez, Semantics of human behavior in image sequences, Computer Analysis of Human Behavior, pp.151-182, 2011.

J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio et al., Real-time human pose recognition in parts from a single depth image, CVPR, vol.41, p.175, 2011.

H. Sidenbladh and M. J. Black, Learning image statistics for Bayesian tracking, ICCV, p.31, 2001.

H. Sidenbladh, M. J. Black, and D. J. Fleet, Stochastic tracking of 3D human figures using 2D image motion, ECCV, p.31, 2000.

L. Sigal, A. Balan, and M. J. Black, HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion, International Journal of Computer Vision, vol.87, issue.1, p.23, 2010.

L. Sigal, A. Balan, and M. J. Black, Combined discriminative and generative articulated pose and non-rigid shape estimation, NIPS, p.26, 2008.

G. A. Sigurdsson, O. Russakovsky, A. Farhadi, I. Laptev, and A. Gupta, Much ado about time: Exhaustive annotation of temporal data, vol.157, p.158, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01431527

G. A. Sigurdsson, G. Varol, X. Wang, I. Laptev, A. Farhadi et al., Hollywood in homes: Crowdsourcing data collection for activity understanding, ECCV, p.13, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01418216

G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari, Actor and observer: Joint modeling of first and third-person videos, CVPR, vol.127, p.137, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01755547

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, Indoor segmentation and support inference from RGBD images, ECCV, p.25, 2012.

T. Simon, H. Joo, I. Matthews, and Y. Sheikh, Hand keypoint detection in single images using multiview bootstrapping, CVPR, vol.173, p.175, 2017.

E. P. Simoncelli and B. A. Olshausen, Natural image statistics and neural representation, Annual Review of Neuroscience, vol.24, p.159, 2001.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, p.163, 2015.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, vol.115, p.163, 2014.

C. Sminchisescu, A. Kanaujia, and D. Metaxas, Learning joint top-down and bottom-up processes for 3D visual inference, CVPR, p.40, 2006.

C. Sminchisescu and B. Triggs, Kinematic jump processes for monocular 3D human tracking, CVPR, p.31, 2003.
URL : https://hal.archives-ouvertes.fr/inria-00548223

Y. Song, L. Goncalves, and P. Perona, Unsupervised learning of human motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.25, issue.7, p.32, 2003.

K. Soomro, A. Roshan-zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, vol.153, p.155, 2012.

A. Spurr, J. Song, S. Park, and O. Hilliges, Cross-modal deep variational hand pose estimation, CVPR, p.173, 2018.

S. Sridhar, F. Mueller, M. Zollhoefer, D. Casas, A. Oulasvirta et al., Real-time joint tracking of a hand manipulating an object from RGB-D input, ECCV, vol.173, p.176, 2016.

B. Stenger, P. R. Mendonça, and R. Cipolla, Model-based 3D tracking of an articulated hand, CVPR, p.172, 2001.

H. Su, C. R. Qi, Y. Li, and L. J. Guibas, Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views, ICCV, p.39, 2015.

H. Su, H. Fan, and L. Guibas, A point set generation network for 3D object reconstruction from a single image, CVPR, vol.63, p.176, 2017.

H. Su, C. Qi, K. Mo, and L. Guibas, PointNet: Deep learning on point sets for 3D classification and segmentation, CVPR, p.63, 2017.

A. Subramaniam, M. Chatterjee, and A. Mittal, Deep neural networks with inexact matching for person re-identification, NIPS, p.127, 2016.

A. Toshev and C. Szegedy, DeepPose: Human pose estimation via deep neural networks, CVPR, vol.20, p.63, 2014.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, ICCV, vol.128, p.164, 2015.

D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun et al., A closer look at spatiotemporal convolutions for action recognition, CVPR, p.35, 2018.

A. Tsoli and A. Argyros, Joint 3D tracking of a deformable object in interaction with a hand, ECCV, vol.176, p.177, 2018.

K. Tuite, N. Snavely, D. Hsiao, N. Tabing, and Z. Popovic, PhotoCity: training experts at large-scale image acquisition through a competitive game, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, p.154, 2011.

S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, Multi-view supervision for single-view reconstruction via differentiable ray consistency, CVPR, p.68, 2017.

H. Tung, H. Tung, E. Yumer, and K. Fragkiadaki, Self-supervised learning of motion capture, NIPS, vol.63, p.79, 2017.

B. Tversky, J. Morrison, and J. Zacks, On bodies and events, The Imitative Mind, p.97, 2002.

D. Tzionas and J. Gall, 3D object reconstruction from hand-object interactions, ICCV, p.176, 2015.

D. Tzionas, L. Ballan, A. Srikantha, P. Aponte, M. Pollefeys et al., Capturing hands in action using discriminative salient points and physics simulation, International Journal of Computer Vision, vol.118, issue.2, p.187, 2016.

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, vol.9, p.159, 2008.

G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black et al., Learning from synthetic humans, CVPR, vol.84, p.184, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01505711

G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer et al., BodyNet: Volumetric inference of 3D human body shapes, ECCV, p.11, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01852169

G. Varol, I. Laptev, and C. Schmid, Long-term temporal convolutions for action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.6, p.128, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01241518

G. Varol, I. Laptev, and C. Schmid, On view-independent video representations for action recognition, p.11, 2019.

S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell et al., Sequence to sequence-video to text, ICCV, p.168, 2015.

N. N. Vo and J. Hays, Localizing and orienting street views using overhead imagery, ECCV, p.127, 2016.

T. von Marcard, B. Rosenhahn, M. J. Black, and G. Pons-Moll, Sparse inertial poser: Automatic 3D human pose estimation from sparse IMUs, Eurographics, p.62, 2017.

S. Wachter and H. Nagel, Tracking of persons in monocular image sequences, IEEE Nonrigid and Articulated Motion Workshop, p.31, 1997.

D. Wang, W. Ouyang, W. Li, and D. Xu, Dividing and aggregating network for multi-view action recognition, ECCV, vol.123, p.143, 2018.

H. Wang and C. Schmid, Action recognition with improved trajectories, ICCV, vol.33, p.164, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873267

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense trajectories and motion boundary descriptors for action recognition, International Journal of Computer Vision, vol.103, issue.1, p.34, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00725627

J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu, Cross-view action modeling, learning, and recognition, CVPR, vol.124, p.131, 2014.

L. Wang, Y. Qiao, and X. Tang, Action recognition with trajectory-pooled deepconvolutional descriptors, CVPR, vol.35, p.116, 2015.

L. Wang, Y. Qiao, and X. Tang, Motionlets: Mid-level 3D parts for human motion recognition, CVPR, p.34, 2013.

L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, Towards good practices for very deep two-stream convnets, vol.34, p.104, 2015.

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal segment networks: Towards good practices for deep action recognition, ECCV, vol.34, p.138, 2016.

N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu et al., Pixel2Mesh: Generating 3D mesh models from single RGB images, ECCV, vol.176, p.199, 2018.

P. Wang, Y. Liu, Y. Guo, C. Sun, X. Tong et al., O-CNN: Octree-based convolutional neural networks for 3D shape analysis, SIGGRAPH, p.63, 2017.

X. Wang, A. Farhadi, and A. Gupta, Actions ~ Transformations, CVPR, p.116, 2016.

X. Wang, R. Girshick, A. Gupta, and K. He, Non-local neural networks, CVPR, p.35, 2018.

Y. Wang, J. Min, J. Zhang, Y. Liu, F. Xu et al., Video-based hand manipulation capture through composite motion control, ACM Transactions on Graphics (TOG), vol.32, issue.4, p.176, 2013.

Y. Wang and M. Hebert, Learning to learn: Model regression networks for easy small sample learning, ECCV, p.127, 2016.

S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, Convolutional pose machines, CVPR, vol.62, p.63, 2016.

D. Weinland, E. Boyer, and R. Ronfard, Action recognition from arbitrary views using 3D exemplars, ICCV, p.124, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00544741

J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman et al., MarrNet: 3D shape reconstruction via 2.5D sketches, NIPS, p.176, 2017.

Y. Wu, J. Y. Lin, and T. S. Huang, Capturing natural hand articulation, ICCV, p.172, 2001.

Y. Yacoob and M. J. Black, Parameterized modeling and recognition of activities, ICCV, p.32, 1998.

X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision, NIPS, vol.63, p.68, 2016.

J. Yang, J. Franco, F. Hétroy-wheeler, and S. Wuhrer, Estimation of human body shape in motion with wide clothing, ECCV, p.62, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01344795

Y. Yang and D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, CVPR, p.19, 2011.

Y. Yang and D. Ramanan, Articulated human detection with flexible mixtures of parts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.12, p.19, 2013.

H. Yasin, U. Iqbal, B. Krüger, A. Weber, and J. Gall, A dual-source approach for 3D pose estimation from a single image, CVPR, vol.22, p.87, 2016.

K. M. Yi, E. Trulls-fortuny, V. Lepetit, and P. Fua, LIFT: Learned invariant feature transform, ECCV, p.127, 2016.

F. Yu, Y. Zhang, S. Song, A. Seff, J. Xiao et al., LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop, vol.45, p.185, 2015.

M. E. Yumer and N. J. Mitra, Learning semantic deformation flows with 3D convolutional networks, ECCV, p.63, 2016.

S. Zagoruyko and N. Komodakis, Learning to compare image patches via convolutional neural networks, CVPR, p.127, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01246261

M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, ECCV, p.118, 2014.

L. Zelnik-manor and M. Irani, Event-based analysis of video, CVPR, vol.32, p.33, 2001.

B. Zhang, L. Wang, Z. Wang, Y. Qiao, W. et al., Real-time action recognition with enhanced motion vector CNNs, CVPR, p.100, 2016.

J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu et al., 3D hand pose tracking and estimation using stereo matching, p.196, 2016.

P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue et al., View adaptive recurrent neural networks for high performance human action recognition from skeleton data, ICCV, vol.123, p.125, 2017.

J. Zheng and Z. Jiang, Learning view-invariant sparse representations for cross-view action recognition, ICCV, p.125, 2013.

J. Zheng, Z. Jiang, and R. Chellappa, Cross-view action recognition via transferable dictionary learning, IEEE Transactions on Image Processing, vol.25, issue.6, p.125, 2016.

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, Learning deep features for scene recognition using places database, NIPS, vol.97, p.151, 2014.

X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, Sparseness meets deepness: 3D human pose estimation from monocular video, CVPR, vol.22, p.40, 2016.

X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei, Deep kinematic pose regression, ECCV Workshop on Geometry Meets Deep Learning, p.22, 2016.

X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei, Towards 3D human pose estimation in the wild: A weakly-supervised approach, ICCV, vol.62, p.64, 2017.

J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, Action recognition with actons, ICCV, p.34, 2013.

R. Zhu, H. Kiani, C. Wang, and S. Lucey, Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image, ICCV, p.68, 2017.

C. Zimmermann and T. Brox, Learning to estimate 3D hand pose from single RGB images, ICCV, vol.173, p.197, 2017.

G. K. Zipf, The psycho-biology of language, p.159, 1935.

C. Zitnick and D. Parikh, Bringing semantics into focus using visual abstraction, CVPR, p.155, 2013.

M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection, ICCV, vol.125, p.143, 2017.