D. Bub and M. Masson, Gestural knowledge evoked by objects as part of conceptual representations, Aphasiology, vol.20, issue.9, pp.1112-1124, 2006.
DOI : 10.1080/02687030600741667

L. L. Chao and A. Martin, Representation of Manipulable Man-Made Objects in the Dorsal Stream, NeuroImage, vol.12, issue.4, pp.478-484, 2000.
DOI : 10.1006/nimg.2000.0635

Chao et al., Layout Estimation of Highly Cluttered Indoor Scenes Using Geometric and Semantic Cues, 2013.
DOI : 10.1007/978-3-642-41184-7_50

Choi et al., Understanding Indoor Scenes Using 3D Geometric Phrases, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.12

C. Chow and C. Liu, Approximating discrete probability distributions with dependence trees, IEEE Transactions on Information Theory, vol.14, pp.462-467, 1968.

Csurka et al., Visual categorization with bags of keypoints, WS-SLCV, ECCV, 2004.

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.177

URL : https://hal.archives-ouvertes.fr/inria-00548512

Dantone et al., Human Pose Estimation Using Body Parts Dependent Joint Regressors, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.391

Dean et al., Fast, Accurate Detection of 100,000 Object Classes on a Single Machine, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.237

Del Pero et al., Bayesian geometric modeling of indoor scenes, CVPR, 2012.

Del Pero et al., Sampling bedrooms, CVPR, 2011.

Delaitre et al., Scene Semantics from Long-Term Observation of People, 2012.
DOI : 10.1007/978-3-642-33783-3_21

URL : https://hal.archives-ouvertes.fr/hal-01060880

Delaitre et al., Recognizing human actions in still images: a study of bag-of-features and part-based representations, Proceedings of the British Machine Vision Conference 2010, 2010.
DOI : 10.5244/C.24.97

URL : https://hal.archives-ouvertes.fr/hal-01060885

Delaitre et al., Willow actions database, 2010.

Delaitre et al., Learning person-object interactions for action recognition in still images, NIPS, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00648156

C. Desai and D. Ramanan, Detecting Actions, Poses, and Objects with Relational Phraselets, ECCV, 2012.
DOI : 10.1007/978-3-642-33765-9_12

Desai et al., Discriminative models for static human-object interactions, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Workshops, 2010.
DOI : 10.1109/CVPRW.2010.5543176

Deutscher et al., Articulated body motion capture by annealed particle filtering, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662), 2000.
DOI : 10.1109/CVPR.2000.854758

Deutscher et al., Tracking through singularities and discontinuities by random sampling, Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
DOI : 10.1109/ICCV.1999.790409

Doersch et al., Mid-level visual element discovery as discriminative mode seeking, NIPS, 2013.

Doersch et al., What makes Paris look like Paris?, SIGGRAPH, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01053876

Dollár et al., Behavior Recognition via Sparse Spatio-Temporal Features, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
DOI : 10.1109/VSPETS.2005.1570899

A. Elgammal and C. Lee, Inferring 3D body pose from silhouettes using activity manifold learning, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., 2004.
DOI : 10.1109/CVPR.2004.1315230

Everingham et al., The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.73, issue.2, 2007.
DOI : 10.1007/s11263-009-0275-4

Everingham et al., The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results, 2010.

Everingham et al., The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, 2012.

Fathi et al., Learning to recognize objects in egocentric activities, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995444

L. Fei-Fei and L. Li, What, Where and Who? Telling the Story of an Image by Activity Classification, Scene Recognition and Object Categorization, Computer Vision, pp.157-171, 2010.
DOI : 10.1007/978-3-642-12848-6_6

L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.16

P. Felzenszwalb, Learning models for object recognition, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 2001.
DOI : 10.1109/CVPR.2001.990647

Felzenszwalb et al., Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, 2009.
DOI : 10.1109/TPAMI.2009.167

P. Felzenszwalb and D. Huttenlocher, Distance transforms of sampled functions, Cornell Computing and Information Science Technical Report TR2004-1963, 2004.

P. Felzenszwalb and D. Huttenlocher, Pictorial Structures for Object Recognition, International Journal of Computer Vision, vol.61, issue.1, 2005.
DOI : 10.1023/B:VISI.0000042934.15159.49

P. Felzenszwalb and D. P. Huttenlocher, Efficient matching of pictorial structures, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662), 2000.
DOI : 10.1109/CVPR.2000.854739

Felzenszwalb et al., A discriminatively trained, multiscale, deformable part model, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587597

Ferrari et al., Progressive search space reduction for human pose estimation, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587468

M. A. Fischler and R. A. Elschlager, The Representation and Matching of Pictorial Structures, IEEE Transactions on Computers, vol.22, issue.1, pp.67-92, 1973.
DOI : 10.1109/T-C.1973.223602

Fouhey et al., People watching: Human actions as a cue for single-view geometry, ECCV, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01060874

Fouhey et al., Data-Driven 3D Primitives for Single Image Understanding, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.421

Fouhey et al., Unfolding an Indoor Origami World, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_44

Y. Freund and R. Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences, vol.55, issue.1, pp.119-139, 1997.
DOI : 10.1006/jcss.1997.1504

Jabri et al., Detection and location of people in video images using adaptive fusion of color and edge information, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, 2000.
DOI : 10.1109/ICPR.2000.902997

Jhuang et al., A Biologically Inspired System for Action Recognition, 2007 IEEE 11th International Conference on Computer Vision, 2007.
DOI : 10.1109/ICCV.2007.4408988

S. Johnson, Leeds sports pose dataset, 2010.

S. Johnson and M. Everingham, Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation, Proceedings of the British Machine Vision Conference 2010, 2010.
DOI : 10.5244/C.24.12

S. Johnson and M. Everingham, Learning effective human pose estimation from inaccurate annotation, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995318

Kanaujia et al., Spectral Latent Variable Models for Perceptual Inference, 2007 IEEE 11th International Conference on Computer Vision, 2007.
DOI : 10.1109/ICCV.2007.4408845

Khan et al., Geometry Driven Semantic Labeling of Indoor Scenes, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_44

Kitani et al., Activity Forecasting, ECCV, 2012.
DOI : 10.1007/978-3-642-33765-9_15

Kjellstrom et al., Simultaneous Visual Recognition of Manipulation Actions and Manipulated Objects, ECCV, 2008.
DOI : 10.1007/978-3-540-88688-4_25

P. Kohli and P. Torr, Robust Higher Order Potentials for Enforcing Label Consistency, International Journal of Computer Vision, vol.24, issue.3, pp.302-324, 2009.
DOI : 10.1007/s11263-008-0202-0

Z. Kourtzi, "But still, it moves", Trends in Cognitive Sciences, vol.8, issue.2, pp.47-49, 2004.
DOI : 10.1016/j.tics.2003.12.001

Z. Kourtzi and N. Kanwisher, Activation in Human MT/MST by Static Images with Implied Motion, Journal of Cognitive Neuroscience, vol.252, issue.1, pp.48-55, 2000.
DOI : 10.1093/cercor/7.7.690

N. Krahnstoever and P. R. Mendonca, Bayesian autocalibration for surveillance, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, 2005.
DOI : 10.1109/ICCV.2005.44

Kuehne et al., HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126543

Mohan et al., Example-based object detection in images by components, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.23, issue.4, 2001.
DOI : 10.1109/34.917571

Moore et al., Exploiting human actions and object context for recognition tasks, Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
DOI : 10.1109/ICCV.1999.791201

Nelissen et al., Observing Others: Multiple Action Representation in the Frontal Lobe, Science, vol.310, issue.5746, pp.332-336, 2005.
DOI : 10.1126/science.1115593

Niebles et al., Unsupervised learning of human action categories using spatial-temporal words, 2008.

Ohta et al., An analysis system for scenes containing objects with substructures, Proceedings of the Fourth International Joint Conference on Pattern Recognitions, 1978.

A. Oliva and A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International Journal of Computer Vision, vol.42, issue.3, pp.145-175, 2001.
DOI : 10.1023/A:1011139631724

Oquab et al., Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.222

URL : https://hal.archives-ouvertes.fr/hal-00911179

Oren et al., Pedestrian detection using wavelet templates, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997.
DOI : 10.1109/CVPR.1997.609319

S. E. Palmer, Vision science: photons to phenomenology, 1999.

M. Pandey and S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126383

C. Papageorgiou and T. Poggio, A trainable system for object detection, 2000.

N. Payet and S. Todorovic, Scene shape from texture of objects, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995326

Peursum et al., Combining image regions and human activity for indirect object recognition in indoor wide-angle views, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, 2005.
DOI : 10.1109/ICCV.2005.57

Pishchulin et al., Poselet Conditioned Pictorial Structures, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.82

Pishchulin et al., Strong Appearance and Expressive Spatial Models for Human Pose Estimation, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.433

Prest et al., Weakly Supervised Learning of Interactions between Humans and Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.3, 2011.
DOI : 10.1109/TPAMI.2011.158

URL : https://hal.archives-ouvertes.fr/inria-00516477

A. Quattoni and A. Torralba, Recognizing indoor scenes, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206537

D. Ramanan, Learning to parse images of articulated bodies, NIPS, 2006.

Ramanan et al., Strike a Pose: Tracking People by Finding Stylized Poses, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.335

Rodriguez et al., Density-aware person detection and tracking in crowds, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126526

URL : https://hal.archives-ouvertes.fr/hal-00654266

Rother et al., What Can Casual Walkers Tell Us About A 3D Scene?, 2007 IEEE 11th International Conference on Computer Vision, 2007.
DOI : 10.1109/ICCV.2007.4409082

S. T. Roweis and L. K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science, vol.290, issue.5500, pp.2323-2326, 2000.
DOI : 10.1126/science.290.5500.2323

URL : http://astro.temple.edu/~msobel/courses_files/saulmds.pdf

Russakovsky et al., ImageNet: Large scale visual recognition challenge 2014, 2014.

B. Sapp and B. Taskar, MODEC: Multimodal Decomposable Models for Human Pose Estimation, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.471

Sapp et al., Cascaded Models for Articulated Pose Estimation, ECCV, 2010.
DOI : 10.1007/978-3-642-15552-9_30

Satkin et al., Data-Driven Scene Understanding from 3D Models, Proceedings of the British Machine Vision Conference 2012, 2012.
DOI : 10.5244/C.26.128

B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, 2002.

Schuldt et al., Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., 2004.
DOI : 10.1109/ICPR.2004.1334462

Schwing et al., Efficient structured prediction for 3D indoor scene understanding, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6248006

Shotton et al., TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation, ECCV, 2006.
DOI : 10.1007/11744023_1

Sidenbladh et al., Stochastic Tracking of 3D Human Figures Using 2D Image Motion, ECCV, 2000.
DOI : 10.1007/3-540-45053-X_45

Sigal et al., Tracking loose-limbed people, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., 2004.
DOI : 10.1109/CVPR.2004.1315063

Silberman et al., Instance Segmentation of Indoor Scenes Using a Coverage Loss, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_40

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

Singh et al., Unsupervised Discovery of Mid-Level Discriminative Patches, ECCV, 2012.
DOI : 10.1007/978-3-642-33709-3_6

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, 2003.
DOI : 10.1109/ICCV.2003.1238663

Soomro et al., UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.

Stark et al., Functional Object Class Detection Based on Learned Affordance Cues, ICVS, 2008.
DOI : 10.1007/978-3-540-79547-6_42

C. Stauffer and W. Grimson, Adaptive background mixture models for real-time tracking, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), 1999.
DOI : 10.1109/CVPR.1999.784637

J. Sun and J. Ponce, Learning Discriminative Part Detectors for Image Classification and Cosegmentation, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.422

URL : https://hal.archives-ouvertes.fr/hal-00932380

Tenenbaum et al., A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science, vol.290, issue.5500, pp.2319-2323, 2000.
DOI : 10.1126/science.290.5500.2319

J. Tighe and S. Lazebnik, Superparsing, ECCV, 2010.
DOI : 10.1007/s11263-012-0574-z

Tompson et al., Joint training of a convolutional network and a graphical model for human pose estimation, 2014.

Wang et al., Unsupervised Discovery of Action Classes, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2 (CVPR'06), 2006.
DOI : 10.1109/CVPR.2006.321

Wong et al., Learning Motion Categories using both Semantic and Structural Information, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383332

Y. Yang and D. Ramanan, Articulated pose estimation using flexible mixtures of parts, CVPR, 2011.

Y. Yang and D. Ramanan, Articulated human detection with flexible mixtures of parts, 2013.

B. Yao and L. Fei-Fei, Grouplet: A structured image representation for recognizing human and object interactions, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5540234

B. Yao and L. Fei-Fei, Modeling mutual context of object and human pose in human-object interaction activities, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5540235

Yao et al., Human action recognition by learning bases of action attributes and parts, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126386

Yao et al., Combining randomization and discrimination for fine-grained image categorization, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995368

C. Yu and T. Joachims, Learning structural SVMs with latent variables, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553523

J. Yuen and A. Torralba, A Data-Driven Approach for Event Prediction, ECCV, 2010.
DOI : 10.1007/978-3-642-15552-9_51

Zhang et al., Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, International Journal of Computer Vision, vol.36, issue.1, 2007.
DOI : 10.1007/s11263-006-9794-4

URL : https://hal.archives-ouvertes.fr/inria-00548574

Y. Zhao and S. Zhu, Image parsing with stochastic scene grammar, NIPS, 2011.