Achanta et al., SLIC Superpixels Compared to State-of-the-art Superpixel Methods, PAMI, 2012.

Alayrac et al., Unsupervised learning from Narrated Instruction Videos, 2016.

Alayrac et al., Learning from narrated instruction videos, PAMI, 2017.

Alayrac et al., Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs, 2016.

Alayrac et al., Joint Discovery of Object States and Manipulation Actions, ICCV, 2017.

Andriluka et al., Pictorial structures revisited: People detection and articulated pose estimation, CVPR, 2009.

F. Bach and Z. Harchaoui, DIFFRAC: A discriminative and flexible framework for clustering, NIPS, 2007.

Beck et al., The Cyclic Block Conditional Gradient Method for Convex Optimization Problems, SIOPT, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01491541

A. Beck and S. Shtern, Linearly convergent away-step conditional gradient for non-strongly convex functions, 2015.

C. M. Bishop, Pattern recognition and machine learning, 2006.

Bojanowski et al., Finding Actors and Actions in Movies, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00904991

Bojanowski et al., Weakly Supervised Action Labeling in Videos Under Ordering Constraints, ECCV, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01053967

Bojanowski et al., Weakly-supervised alignment of video with text, ICCV, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01154523

Y. Boykov and V. Kolmogorov, An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision, PAMI, 2004.

Brady et al., Detecting changes in real-world objects: The relationship between visual long-term memory and change blindness, Communicative and Integrative Biology, 2006.

Branson et al., Efficient Large-Scale Structured Learning, CVPR, 2013.

Braun et al., Lazifying Conditional Gradient Algorithms, ICML, 2017.

C. Dünner, T. Parnell, and M. Jaggi, Efficient use of limited-memory accelerators for linear learning on heterogeneous systems, NIPS, 2017.

J. Carreira and A. Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR, 2017.

N. Chambers and D. Jurafsky, Unsupervised Learning of Narrative Event Chains, ACL, 2008.

Chari et al., On Pairwise Costs for Network Flow Multi-Object Tracking, CVPR, 2015.

X. Chen and A. Yuille, Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations, 2014.

Chéron et al., A Flexible Model for Training Action Localization with Varying Levels of Supervision, 2018.

Cimpoi et al., Deep Filter Banks for Texture Recognition and Segmentation, CVPR, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01263622

Cour et al., Learning from ambiguously labeled images, CVPR, 2009.
DOI : 10.1109/cvprw.2009.5206667

URL : http://www.cis.upenn.edu/~taskar/pubs/cvpr09.pdf

Csiba et al., Stochastic Dual Coordinate Ascent with Adaptive Probabilities, ICML, 2015.

D. Perekrestenko, V. Cevher, and M. Jaggi, Faster coordinate descent via adaptive importance sampling, AISTATS, 2017.

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR, 2005.
DOI : 10.1109/cvpr.2005.177

URL : https://hal.archives-ouvertes.fr/inria-00548512

Damen et al., Scaling Egocentric Vision: The EPIC-KITCHENS Dataset, 2018.

Damen et al., You-Do, I-Learn: Discovering Task Relevant Objects and their Modes of Interaction from Multi-User Egocentric Video, 2014.

d'Aspremont et al., An optimal affine invariant smooth minimization algorithm, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00907547

Delaitre et al., Scene semantics from long-term observation of people, 2012.

Delaitre et al., Learning person-object interactions for action recognition in still images, NIPS, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00648156

Deng et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR, 2009.

Desai et al., Discriminative models for static human-object interactions, CVPR Workshops, 2010.

Doersch et al., What Makes Paris Look like Paris?, SIGGRAPH, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01248528

Duan et al., Discovering localized attributes for fine-grained recognition, CVPR, 2012.

Duchenne et al., Automatic annotation of human actions in video, ICCV, 2009.

Everingham et al., The Pascal Visual Object Classes (VOC) Challenge, IJCV, 2010.

Everingham et al., "Hello! My name is... Buffy" - Automatic Naming of Characters in TV Video, BMVC, 2006.

Farhadi et al., Describing Objects by their Attributes, CVPR, 2009.

A. Fathi and J. M. Rehg, Modeling Actions through State Changes, CVPR, 2013.

C. Fellbaum, WordNet: An Electronic Lexical Database, 1998.

Felzenszwalb et al., Object Detection with Discriminatively Trained Part-Based Models, PAMI, 2010.

P. F. Felzenszwalb and D. P. Huttenlocher, Efficient matching of pictorial structures, CVPR, 2000.

P. F. Felzenszwalb and D. P. Huttenlocher, Pictorial structures for object recognition, IJCV, 2005.

Fernando et al., Modeling Video Evolution For Action Recognition, CVPR, 2015.

M. A. Fischler and R. A. Elschlager, The Representation and Matching of Pictorial Structures, IEEE Transactions on Computers, 1973.

W. Förstner and E. Gülch, A fast operator for detection and precise location of distinct points, corners and centres of circular features, ISPRS, 1987.

Fouhey et al., From Lifestyle Vlogs to Everyday Interactions, arXiv, 2017.

Fouhey et al., People Watching: Human Actions as a Cue for Single View Geometry, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01066257

V. Franc, FASOLE: Fast Algorithm for Structured Output LEarning, ECML PKDD, 2014.

M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly, 1956.

Frermann et al., A hierarchical Bayesian model for unsupervised induction of script knowledge, 2014.

D. Garber and O. Meshi, Linear-Memory and Decomposition-Invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes, NIPS, 2016.

R. Girshick, Fast R-CNN, ICCV, 2015.

Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, 2014.

R. Girshick, Deformable part models, 2013.

Goodfellow et al., Generative Adversarial Networks, NIPS, 2014.

Goyal et al., The something something video database for learning and evaluating visual common sense, 2017.

Guillaumin et al., ImageNet Auto-Annotation with Segmentation Propagation, IJCV, 2014.

Gupta et al., Observing human-object interactions: Using spatial and functional compatibility for recognition, 2009.

Gupta et al., Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos, CVPR, 2009.

C. Harris and M. Stephens, A combined corner and edge detector, BMVC, 1988.

Hastie et al., The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2009.

He et al., Mask R-CNN, ICCV, 2017.

He et al., Deep Residual Learning for Image Recognition, CVPR, 2016.

D. G. Higgins and P. M. Sharp, Clustal: A package for performing multiple sequence alignment on a microcomputer, Gene, 1988.

A. J. Hoffman, On Approximate Solutions of Systems of Linear Inequalities, Journal of Research of the National Bureau of Standards, 1952.

C. A. Holloway, An extension of the Frank and Wolfe method of feasible directions, Mathematical Programming, 1974.

Hsu et al., Random Design Analysis of Ridge Regression, Foundations of Computational Mathematics, 2014.

Huang et al., Connectionist Temporal Modeling for Weakly Supervised Action Labeling, ECCV, 2016.

Huang et al., Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos, CVPR, 2017.

Huang et al., Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Video, CVPR, 2018.

Isola et al., Discovering States and Transformations in Image Collections, CVPR, 2015.

M. Jaggi, Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, ICML, 2013.

Jain et al., Representing videos using mid-level discriminative patches, CVPR, 2013.

Jegelka et al., Reflection methods for user-friendly submodular optimization, NIPS, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00905258

Joachims et al., Cutting-plane training of structural SVMs, Machine Learning, 2009.

S. Johnson and M. Everingham, Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation, BMVC, 2010.

Joo et al., Panoptic Studio: A Massively Multiview System for Social Motion Capture, ICCV, 2015.

Joulin et al., Efficient Image and Video Co-localization with Frank-Wolfe Algorithm, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_17

URL : http://ai.stanford.edu/%7Ekdtang/papers/eccv14-vidcoloc.pdf

Joulin et al., Discriminative Clustering for Image Co-segmentation, 2010.
DOI : 10.1109/cvpr.2010.5539868

URL : http://www.di.ens.fr/%7Efbach/cosegmentation_cvpr2010.pdf

Joulin et al., Multi-class cosegmentation, CVPR, 2012.
DOI : 10.1109/cvpr.2012.6247719

URL : https://hal.archives-ouvertes.fr/hal-00717448

Kerdreux et al., Frank-Wolfe with Subsampling Oracle, ICML, 2018.

Kjellström et al., Visual object-action recognition: Inferring object affordances from human demonstration, CVIU, 2011.

Kolesnikov et al., Closed-form approximate CRF training for scalable image segmentation, 2014.
DOI : 10.1007/978-3-319-10578-9_36

URL : http://groups.inf.ed.ac.uk/calvin/Publications/kolesnikov14eccv.pdf

P. Krähenbühl and V. Koltun, Geodesic Object Proposals, ECCV, 2014.

Kuehne et al., HMDB: a large video database for human motion recognition, 2011.
DOI : 10.1109/iccv.2011.6126543

URL : http://dspace.mit.edu/bitstream/1721.1/69981/1/Poggio-HMDB.pdf

S. Lacoste-Julien and M. Jaggi, On the Global Linear Convergence of Frank-Wolfe Optimization Variants, NIPS, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01248675

Lacoste-Julien et al., Block-Coordinate Frank-Wolfe Optimization for Structural SVMs, ICML, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00720158

S. Lacoste-Julien, Convergence Rate of Frank-Wolfe for Non-Convex Objectives, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01415335

Lafferty et al., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML, 2001.

Laptev et al., Learning realistic human actions from movies, CVPR, 2008.
DOI : 10.1109/cvpr.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

I. Laptev, On space-time interest points, IJCV, 2005.
DOI : 10.1007/s11263-005-1838-7

URL : http://kth.diva-portal.org/smash/get/diva2:442088/FULLTEXT01

Leblond et al., SEARNN: Training RNNs with Global-Local Losses, International Conference on Learning Representations (ICLR), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01665263

Lee et al., Multiple sequence alignment using partial order graphs, Bioinformatics, 2002.
DOI : 10.1093/bioinformatics/18.3.452

URL : https://academic.oup.com/bioinformatics/article-pdf/18/3/452/648375/180452.pdf

Lempitsky et al., A Pylon Model for Semantic Segmentation, NIPS, 2011.

T. Liao, Clustering of time series data, a survey, Pattern Recognition, 2014.

Lin et al., Microsoft COCO: Common objects in context, 2014.

S. P. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory, 1982.

Loiola et al., A Survey for the Quadratic Assignment Problem, EJOR, 2007.

D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, IJCV, 2004.

Malmaud et al., What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision, 2015.

Marcus et al., Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics, 1993.

de Marneffe et al., Generating typed dependency parses from phrase structure parses, LREC, 2006.

Marszalek et al., Actions in context, CVPR, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00548645

Miech et al., Learning from Video and Text via Large-Scale Discriminative Clustering, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01569540

Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, NIPS, 2013.

G. A. Miller, WordNet: A Lexical Database for English, Communications of the ACM, 1995.

Mitchell et al., Finding the Point of a Polyhedron Closest to the Origin, SIAM Journal on Control, 1974.

Mittal et al., Hand detection using multiple proposals, BMVC, 2011.

K. P. Murphy, Machine learning: a probabilistic perspective, 2012.

Naim et al., Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments, 2015.

Ñanculef et al., A novel Frank-Wolfe algorithm. Analysis and applications to large-scale SVM training, Information Sciences, 2014.

Needell et al., Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, NIPS, 2014.

Y. Nesterov, Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems, SIAM Journal on Optimization, 2012.

Y. Nesterov, Introductory Lectures on Convex Programming Volume I: Basic course, 1998.

Nguyen et al., Human detection from images and videos: A survey, Pattern Recognition, 2016.

Niebles et al., Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, ECCV, 2010.

Niebles et al., Unsupervised learning of human action categories using spatial-temporal words, IJCV, 2008.

N. Okazaki, CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007.

A. Osokin and P. Kohli, Perceptually Inspired Layout-aware Losses for Image Segmentation, ECCV, 2014.

Papandreou et al., Towards Accurate Multi-person Pose Estimation in the Wild, CVPR, 2017.

D. Parikh and K. Grauman, Relative Attributes, ICCV, 2011.

Parkhi et al., Deep Face Recognition, BMVC, 2015.

Partial order alignment code for Multiple Sequence Alignment.

Patterson et al., The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding, IJCV, 2014.

Peyre et al., Weakly-supervised learning of visual relations, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01576035

H. Pirsiavash and D. Ramanan, Detecting activities of daily living in first-person camera views, CVPR, 2012.

J. C. Platt, Fast Training of Support Vector Machines using Sequential Minimal Optimization, Advances in Kernel Methods - Support Vector Learning, 1999.

Potapov et al., Category-specific video summarization, ECCV, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01022967

Prest et al., Weakly Supervised Learning of Interactions between Humans and Objects, PAMI, 2012.
URL : https://hal.archives-ouvertes.fr/inria-00516477

R. Le Priol, A. Piché, and S. Lacoste-Julien, Adaptive Stochastic Dual Coordinate Ascent for Conditional Random Fields, UAI, 2018.

Ramanathan et al., Linking people with "their" names using coreference resolution, ECCV, 2014.

M. Raptis and L. Sigal, Poselet Key-framing: A Model for Human Activity Recognition, CVPR, 2013.

Ratliff et al., (Online) Subgradient Methods for Structured Prediction, AISTATS, 2007.

Regneri et al., Learning Script Knowledge with Web Experiments, ACL, 2010.

Regneri et al., Grounding Action Descriptions in Videos, TACL, 2013.

Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015.

Richard et al., Weakly Supervised Action Learning with RNN based Fine-to-coarse Modeling, CVPR, 2017.

Rohrbach et al., A dataset for movie description, CVPR, 2015.

Rohrbach et al., Script Data for Attribute-Based Recognition of Composite Activities, ECCV, 2012.

E. F. Tjong Kim Sang and S. Buchholz, Introduction to the CoNLL-2000 shared task: Chunking, 2000.

F. Sener and A. Yao, Unsupervised Learning and Segmentation of Complex Activities from Video, CVPR, 2018.

Sener et al., Unsupervised Semantic Parsing of Video Collections, ICCV, 2015.

F. Sha and F. Pereira, Shallow Parsing with Conditional Random Fields, NAACL, 2003.

Shah et al., A Multi-Plane Block-Coordinate Frank-Wolfe Algorithm for Training Structural SVMs with a Costly max-Oracle, CVPR, 2015.

Sigurdsson et al., Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01418216

Silver et al., Mastering the game of Go without human knowledge, 2017.

Simon et al., Hand Keypoint Detection in Single Images using Multiview Bootstrapping, CVPR, 2017.

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

Singh et al., Unsupervised Discovery of Mid-level Discriminative Patches, ECCV, 2012.

Sivic et al., "Who are you?" - Learning person specific classifiers from video, CVPR, 2009.

Soomro et al., UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.

Sun et al., Ranking domain-specific highlights by analyzing edited videos, ECCV, 2014.

R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed., 1998.

Tapaswi et al., MovieQA: Understanding Stories in Movies through Question-Answering, CVPR, 2016.

B. Taskar, Learning structured prediction models: A large margin approach, Doctoral dissertation, Stanford University, 2004.

Taskar et al., Max-Margin Markov Networks, 2003.

Taylor et al., Efficient and Precise Interactive Hand Tracking through Joint, Continuous Optimization of Pose and Correspondences, 2016.

A. Toshev and C. Szegedy, DeepPose: Human Pose Estimation via Deep Neural Networks, CVPR, 2014.

Tran et al., Learning spatiotemporal features with 3D convolutional networks, 2015.

Tsochantaridis et al., Large margin methods for structured and interdependent output variables, 2005.

Venugopalan et al., Sequence to sequence - video to text, ICCV, 2015.

Venugopalan et al., Translating Videos to Natural Language Using Deep Recurrent Neural Networks, NAACL, 2015.

Vinyals et al., Show and tell: A neural image caption generator, 2014.

P. Viola and M. Jones, Rapid Object Detection using a Boosted Cascade of Simple Features, CVPR, 2001.

A. J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory, vol.13, no.2, pp.260-269, 1967.

M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning, 2008.

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873267

Wang et al., Action recognition by dense trajectories, CVPR, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00583818

L. Wang and T. Jiang, On the complexity of multiple sequence alignment, Journal of Computational Biology, 1994.

Wang et al., Actions ~ Transformations, CVPR, 2016.

Wang et al., Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms, 2014.

Wei et al., Convolutional Pose Machines, CVPR, 2016.
DOI : 10.1109/cvpr.2016.511

URL : http://arxiv.org/pdf/1602.00134

P. Wolfe, Convergence Theory in Nonlinear Programming, Integer and Nonlinear Programming, 1970.

Xu et al., Maximum Margin Clustering, NIPS, 2004.

W. Yang and G. Toderici, Discriminative tag learning on YouTube videos with latent sub-tags, CVPR, 2011.
DOI : 10.1109/cvpr.2011.5995402

URL : http://www.sfu.ca/%7Ewya16/cvpr2011_sub_tag_draft.pdf

Y. Yang and D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, CVPR, 2011.
DOI : 10.1109/cvpr.2011.5995741

B. Yao and L. Fei-Fei, Grouplet: A structured image representation for recognizing human and object interactions, CVPR, 2010.
DOI : 10.1109/cvpr.2010.5540234

Yao et al., Human action recognition by learning bases of action attributes and parts, 2011.
DOI : 10.1109/iccv.2011.6126386

URL : http://people.csail.mit.edu/khosla/papers/iccv2011_yao.pdf

Yao et al., Describing videos by exploiting temporal structure, ICCV, 2015.
DOI : 10.1109/iccv.2015.512

URL : http://arxiv.org/pdf/1502.08029

C. Zhang and Z. Zhang, A Survey of Recent Advances in Face Detection, 2010.

P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, ICML, 2015.

Zhou et al., Towards Automatic Learning of Procedures from Web Instructional Videos, AAAI, 2018.