M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, pp.1395-1402, 2005.
DOI : 10.1109/ICCV.2005.28

L. Bo, X. Ren, and D. Fox, Hierarchical matching pursuit for image classification: Architecture and fast algorithms, Advances in Neural Information Processing Systems, pp.6-39, 2011.

P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid et al., Finding Actors and Actions in Movies, 2013 IEEE International Conference on Computer Vision, p.90, 2013.
DOI : 10.1109/ICCV.2013.283
URL : https://hal.archives-ouvertes.fr/hal-00904991

P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce et al., Weakly Supervised Action Labeling in Videos under Ordering Constraints, Proceedings of the European Conference on Computer Vision, pp.628-643, 2014.
DOI : 10.1007/978-3-319-10602-1_41
URL : https://hal.archives-ouvertes.fr/hal-01053967

K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf et al., Integrating structured biological data by Kernel Maximum Mean Discrepancy, Bioinformatics, p.76, 2006.
DOI : 10.1093/bioinformatics/btl242
URL : https://academic.oup.com/bioinformatics/article-pdf/22/14/e49/616383/btl242.pdf

K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint, p.42, 2016.
DOI : 10.1109/cvpr.2017.18

K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan, Domain separation networks, Advances in Neural Information Processing Systems, pp.343-351, 2016.

M. Brand, Shadow puppetry, Proceedings of the Seventh IEEE International Conference on Computer Vision, pp.1237-1244, 1999.
DOI : 10.1109/ICCV.1999.790422

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High Accuracy Optical Flow Estimation Based on a Theory for Warping, Proceedings of the European Conference on Computer Vision, p.127, 2004.
DOI : 10.1007/978-3-540-24673-2_3

T. Brox and J. Malik, Object Segmentation by Long Term Analysis of Point Trajectories, Proceedings of the European Conference on Computer Vision, p.66, 2010.
DOI : 10.1007/978-3-642-15555-0_21

W. Chen, C. Xiong, R. Xu, and J. J. Corso, Actionness Ranking with Lattice Conditional Ordinal Random Fields, 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.121, 2014.
DOI : 10.1109/CVPR.2014.101
URL : http://web.eecs.umich.edu/%7Ejjcorso/pubs/jcorso_CVPR2014_actionness.pdf

N. Cherniavsky, I. Laptev, J. Sivic, and A. Zisserman, Semi-supervised Learning of Facial Attributes in Video, Proceedings of the European Conference on Computer Vision, pp.43-56, 2010.
DOI : 10.1007/978-3-642-35749-7_4

S. Chopra, S. Balakrishnan, and R. Gopalan, Dlid: Deep learning for domain adaptation by interpolating between domains, ICML 2013, Workshop on Representation Learning, p.38, 2013.

R. G. Cinbis, J. Verbeek, and C. Schmid, Multi-fold MIL Training for Weakly Supervised Object Localization, 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.148, 2014.
DOI : 10.1109/CVPR.2014.309
URL : https://hal.archives-ouvertes.fr/hal-00975746

R. G. Cinbis, J. Verbeek, and C. Schmid, Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.1, p.89, 2016.
DOI : 10.1109/TPAMI.2016.2535231
URL : https://hal.archives-ouvertes.fr/hal-01123482

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, Workshop on statistical learning in computer vision, ECCV, pp.1-2, 2004.

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.25-26, 2005.
DOI : 10.1109/CVPR.2005.177
URL : https://hal.archives-ouvertes.fr/inria-00548512

H. Daumé III, Frustratingly easy domain adaptation. arXiv Preprint, p.36, 2009.

L. Del Pero, S. Ricco, R. Sukthankar, and V. Ferrari, Behavior discovery and alignment of articulated object classes from unstructured video, International Journal of Computer Vision, pp.1-23, 2016.

L. Del Pero, S. Ricco, R. Sukthankar, and V. Ferrari, Discovering the physical parts of an articulated object class from multiple videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, Scalable Object Detection Using Deep Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.22, 2014.
DOI : 10.1109/CVPR.2014.276
URL : http://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Erhan_Scalable_Object_Detection_2014_CVPR_paper.pdf

V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, DAPs: Deep Action Proposals for Action Understanding, Proceedings of the European Conference on Computer Vision, pp.768-784, 2016.
DOI : 10.1007/978-3-319-10602-1_26

V. Escorcia, J. C. Niebles, and B. Ghanem, On the relationship between visual attributes and convolutional networks, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.90, 2015.
DOI : 10.1109/CVPR.2015.7298730

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.88, issue.2, p.96, 2010.
DOI : 10.1007/s11263-009-0275-4

M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.73, issue.2, pp.61-62, 2007.
DOI : 10.1371/journal.pcbi.0040027

A. Farhadi and M. K. Tabrizi, Learning to Recognize Activities from the Wrong View Point, Proceedings of the European Conference on Computer Vision, pp.154-166, 2008.
DOI : 10.1145/1273496.1273637

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, pp.64-73, 2010.
DOI : 10.1109/TPAMI.2009.167

P. F. Felzenszwalb and D. P. Huttenlocher, Pictorial Structures for Object Recognition, International Journal of Computer Vision, vol.61, issue.1, pp.55-79, 2005.
DOI : 10.1023/B:VISI.0000042934.15159.49

B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, Unsupervised Visual Domain Adaptation Using Subspace Alignment, 2013 IEEE International Conference on Computer Vision, pp.2960-2967, 2013.
DOI : 10.1109/ICCV.2013.368
URL : https://hal.archives-ouvertes.fr/hal-00869417

R. Filipovych and E. Ribeiro, Recognizing primitive interactions by exploring actor-object states, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-7, 2008.
DOI : 10.1109/CVPR.2008.4587726
URL : http://cs.fit.edu/~eribeiro/papers/FilipovychRibeiro_cvpr2008b.pdf

R. Filipovych and E. Ribeiro, Robust sequence alignment for actor-object interaction recognition: Discovering actor-object states, Computer Vision and Image Understanding, vol.115, issue.2, pp.177-193, 2011.
DOI : 10.1016/j.cviu.2010.11.012

A. Gaidon, Z. Harchaoui, and C. Schmid, Temporal Localization of Actions with Actoms, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.11, pp.2782-2795, 2013.
DOI : 10.1109/TPAMI.2013.65
URL : https://hal.archives-ouvertes.fr/hal-00687312

A. Gaidon and E. Vig, Online Domain Adaptation for Multi-Object Tracking, Proceedings of the British Machine Vision Conference 2015, p.61, 2015.
DOI : 10.5244/C.29.3

A. Gaidon, G. Zen, and J. A. Rodriguez-Serrano, Self-learning camera: Autonomous adaptation of object detectors to unlabeled video streams. arXiv preprint, p.61, 2014.

Y. Ganin and V. Lempitsky, Unsupervised domain adaptation by backpropagation, International Conference on Machine Learning, pp.1180-1189, 2015.

J. C. van Gemert, M. Jain, E. Gati, and C. G. Snoek, APT: Action localization proposals from dense trajectories, Proceedings of the British Machine Vision Conference 2015, p.121, 2015.
DOI : 10.5244/C.29.177

M. Ghifary, W. B. Kleijn, and M. Zhang, Domain Adaptive Neural Networks for Object Recognition, Pacific Rim International Conference on Artificial Intelligence, pp.898-904, 2014.
DOI : 10.1007/978-3-319-13560-1_76
URL : http://arxiv.org/pdf/1409.6041.pdf

M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation, Proceedings of the European Conference on Computer Vision, pp.597-613, 2016.
DOI : 10.2307/1912526

S. Gidaris and N. Komodakis, Object Detection via a Multi-region and Semantic Segmentation-Aware CNN Model, 2015 IEEE International Conference on Computer Vision (ICCV), pp.1134-1142, 2015.
DOI : 10.1109/ICCV.2015.135

R. Girshick, Fast R-CNN, 2015 IEEE International Conference on Computer Vision (ICCV), pp.27-28, 2015.
DOI : 10.1109/ICCV.2015.169

R. Girshick, Fast R-CNN. https://github.com/rbgirshick/fast-rcnn, p.27, 2015.
DOI : 10.1109/iccv.2015.169

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.61-63, 2014.
DOI : 10.1109/CVPR.2014.81
URL : http://arxiv.org/pdf/1311.2524

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.65, 2014.
DOI : 10.1109/CVPR.2014.81
URL : http://arxiv.org/pdf/1311.2524

R. B. Girshick, P. F. Felzenszwalb, and D. McAllester, Discriminatively trained deformable part models, release 5, pp.61-65, 2012.

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.121-127, 2015.
DOI : 10.1109/CVPR.2015.7298676

B. Gong, Y. Shi, F. Sha, and K. Grauman, Geodesic flow kernel for unsupervised domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2066-2073, 2012.

A. Gonzalez-Garcia, D. Modolo, and V. Ferrari, Do semantic parts emerge in convolutional neural networks? arXiv Preprint, p.23, 2016.
DOI : 10.1007/s11263-017-1048-0

A. Gonzalez-Garcia, D. Modolo, and V. Ferrari, Objects as context for part detection, 2017.

A. Gonzalez-Garcia, A. Vezhnevets, and V. Ferrari, An active search strategy for efficient object class detection, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3022-3031, 2015.
DOI : 10.1109/CVPR.2015.7298921
URL : http://arxiv.org/abs/1412.3709

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., Generative adversarial nets, Advances in Neural Information Processing Systems, pp.2672-2680, 2014.

R. Gopalan, R. Li, and R. Chellappa, Domain adaptation for object recognition: An unsupervised approach, 2011 International Conference on Computer Vision, pp.999-1006, 2011.
DOI : 10.1109/ICCV.2011.6126344
URL : http://www.umiacs.umd.edu/~raghuram/Publications/2011_ICCV_DomainAdaptation.pdf

R. Gopalan, R. Li, and R. Chellappa, Unsupervised Adaptation Across Domain Shifts by Generating Intermediate Data Representations, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.56, 2014.
DOI : 10.1109/TPAMI.2013.249
URL : http://www.research.att.com/export/sites/att_labs/techdocs/TD_101340.pdf

M. Grundmann, V. Kwatra, M. Han, and I. Essa, Efficient hierarchical graph-based video segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.100, 2010.
DOI : 10.1109/cvpr.2010.5539893
URL : http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36247.pdf

C. Gu, J. J. Lim, P. Arbeláez, and J. Malik, Recognition using regions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1030-1037, 2009.

A. Gupta, A. Kembhavi, and L. S. Davis, Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.31, issue.10, p.90, 2009.
DOI : 10.1109/TPAMI.2009.83

S. Gupta, P. Arbeláez, R. Girshick, and J. Malik, Aligning 3D models to RGB-D images of cluttered scenes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.4731-4740, 2015.

S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, Learning Rich Features from RGB-D Images for Object Detection and Segmentation, Proceedings of the European Conference on Computer Vision, pp.345-360, 2014.
DOI : 10.1007/978-3-319-10584-0_23
URL : http://arxiv.org/pdf/1407.5736

S. Gupta, J. Hoffman, and J. Malik, Cross Modal Distillation for Supervision Transfer, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2827-2836, 2016.
DOI : 10.1109/CVPR.2016.309

G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra et al., Weakly Supervised Learning of Object Segmentations from Web-Scale Video, ECCV, Workshops and Demonstrations, pp.198-208, 2012.
DOI : 10.1007/978-3-642-33863-2_20
URL : http://www.cs.cmu.edu/~rahuls/pub/eccv2012wk-cp-rahuls.pdf

K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, issue.9, pp.1904-1916, 2015.
DOI : 10.1109/TPAMI.2015.2389824

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
DOI : 10.1109/CVPR.2016.90
URL : http://arxiv.org/pdf/1512.03385

F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.149, 2015.
DOI : 10.1109/CVPR.2015.7298698

P. Henderson and V. Ferrari, End-to-end training of object class detectors for mean average precision. arXiv Preprint, p.22, 2016.

M. Hoai, L. Torresani, F. De la Torre, and C. Rother, Learning discriminative localization from weakly labeled data, Pattern Recognition, vol.47, issue.3, pp.1523-1534, 2014.
DOI : 10.1016/j.patcog.2013.09.028

J. Hoffman, S. Gupta, and T. Darrell, Learning with Side Information through Modality Hallucination, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.826-834, 2016.
DOI : 10.1109/CVPR.2016.96

J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko, Efficient learning of domain-invariant image representations, International Conference on Learning Representations, p.37, 2013.

J. Hoffman, E. Rodner, J. Donahue, B. Kulis, and K. Saenko, Asymmetric and Category Invariant Feature Transformations for Domain Adaptation, International Journal of Computer Vision, vol.39, issue.12, p.75, 2014.
DOI : 10.1109/TPAMI.2009.151

J. Hoffman, E. Tzeng, J. Donahue, Y. Jia, K. Saenko et al., One-shot learning of supervised deep convolutional models, arXiv Preprint, p.39, 2014.

D. Hogg, Model-based vision: a program to see a walking person, Image and Vision Computing, vol.1, issue.1, pp.5-20, 1983.
DOI : 10.1016/0262-8856(83)90003-3

R. Hou, C. Chen, and M. Shah, Tube convolutional neural network (T-CNN) for action detection in videos. arXiv Preprint, p.49, 2017.
DOI : 10.1109/iccv.2017.620

W. Hu, T. Tan, L. Wang, and S. Maybank, A Survey on Visual Surveillance of Object Motion and Behaviors, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), vol.34, issue.3, p.14, 2004.
DOI : 10.1109/TSMCC.2004.829274

J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara et al., Speed/accuracy trade-offs for modern convolutional object detectors. arXiv Preprint, p.123, 2016.
DOI : 10.1109/cvpr.2017.351

L. Huang, Y. Yang, Y. Deng, and Y. Yu, DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint, 2015.

Y. Huang, J. Oramas, T. Tuytelaars, and L. Van Gool, Do motion boundaries improve semantic segmentation?, p.147, 2016.
DOI : 10.1002/lnc3.357

M. Jain, J. van Gemert, H. Jégou, P. Bouthemy, and C. G. Snoek, Action Localization with Tubelets from Motion, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.740-747, 2014.
DOI : 10.1109/CVPR.2014.100
URL : https://hal.archives-ouvertes.fr/hal-00996844

R. Jain, D. Militzer, and H. Nagel, Separating non-stationary from stationary scene components in a sequence of real world TV-images, 1977.

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, Towards Understanding Action Recognition, 2013 IEEE International Conference on Computer Vision, pp.52-129, 2013.
DOI : 10.1109/ICCV.2013.396
URL : https://hal.archives-ouvertes.fr/hal-00906902

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, Towards Understanding Action Recognition, 2013 IEEE International Conference on Computer Vision, pp.3192-3199, 2013.
DOI : 10.1109/ICCV.2013.396
URL : https://hal.archives-ouvertes.fr/hal-00906902

Y. Jia, Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, p.64, 2013.
DOI : 10.1145/2647868.2654889

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, p.76, 2014.
DOI : 10.1145/2647868.2654889

J. Jiang, A literature survey on domain adaptation of statistical classifiers, 2008.

V. Kalogeiton, C. Schmid, and V. Ferrari, Analysing Domain Shift Factors between Videos and Images for Object Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence. xi, xii, pp.51-56, 2016.
DOI : 10.1109/TPAMI.2016.2551239
URL : https://hal.archives-ouvertes.fr/hal-01281069

V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, Action Tubelet Detector for Spatio-Temporal Action Localization, 2017 IEEE International Conference on Computer Vision (ICCV), p.118, 2017.
DOI : 10.1109/ICCV.2017.472
URL : https://hal.archives-ouvertes.fr/hal-01519812

V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, Joint Learning of Object and Action Detectors, 2017 IEEE International Conference on Computer Vision (ICCV), p.86, 2017.
DOI : 10.1109/ICCV.2017.219
URL : https://hal.archives-ouvertes.fr/hal-01575804

M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, Multi-View Discriminant Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.1, pp.188-194, 2016.
DOI : 10.1109/TPAMI.2015.2435740
URL : http://figment.cse.usf.edu/~sfefilat/data/papers/WeBT4.3.pdf

K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan et al., Object Detection in Videos with Tubelet Proposal Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.50, 2017.
DOI : 10.1109/CVPR.2017.101

K. Kang, H. Li, J. Yan, X. Zeng, B. Yang et al., T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.50, 2016.
DOI : 10.1109/TCSVT.2017.2736553

K. Kang, W. Ouyang, H. Li, and X. Wang, Object Detection from Video Tubelets with Convolutional Neural Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.89, 2016.
DOI : 10.1109/CVPR.2016.95
URL : http://arxiv.org/pdf/1604.04053

V. Kantorov, M. Oquab, M. Cho, and I. Laptev, ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization, Proceedings of the European Conference on Computer Vision, pp.350-365, 2016.
DOI : 10.1007/s11263-009-0275-4
URL : https://hal.archives-ouvertes.fr/hal-01421772

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.86, 2014.
DOI : 10.1109/CVPR.2014.223
URL : http://www.cs.cmu.edu/~rahuls/pub/cvpr2014-deepvideo-rahuls.pdf

G. Kim, L. Sigal, and E. P. Xing, Joint summarization of large sets of web images and videos for storyline reconstruction, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.56, 2014.

A. Kläser, M. Marszałek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, pp.275-276, 2008.
DOI : 10.5244/C.22.99
URL : https://hal.archives-ouvertes.fr/inria-00514853

A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman, Human Focused Action Localization in Video, SGA 2010-International Workshop on Sign, Gesture, and Activity, ECCV 2010 Workshops, pp.219-233, 2010.
DOI : 10.1007/978-3-642-35749-7_17
URL : https://hal.archives-ouvertes.fr/inria-00514845

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems. xii, pp.65-89, 2012.
DOI : 10.1162/neco.2009.10-08-881
URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

T. Kroeger, R. Timofte, D. Dai, and L. Van Gool, Fast Optical Flow Using Dense Inverse Search, Proceedings of the European Conference on Computer Vision, p.122, 2016.
DOI : 10.1109/CVPR.2015.7298704
URL : http://arxiv.org/pdf/1603.03590

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, pp.2556-2563, 2011.
DOI : 10.1109/ICCV.2011.6126543
URL : http://cbcl.mit.edu/publications/ps/Kuehne_etal_iccv11.pdf

B. Kulis, K. Saenko, and T. Darrell, What you saw is not what you get: Domain adaptation using asymmetric kernel transforms, CVPR 2011, pp.1785-1792, 2011.
DOI : 10.1109/CVPR.2011.5995702
URL : http://people.ee.duke.edu/~lcarin/cvpr_adapt.pdf

K. K. Singh, F. Xiao, and Y. J. Lee, Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3548-3556, 2016.
DOI : 10.1109/CVPR.2016.386
URL : http://arxiv.org/pdf/1604.05766

W. Kuo, B. Hariharan, and J. Malik, DeepBox: Learning Objectness with Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.2479-2487, 2015.
DOI : 10.1109/ICCV.2015.285
URL : http://arxiv.org/pdf/1505.02146

R. Li and T. Zickler, Discriminative virtual views for cross-view action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2855-2862, 2012.

J. Dai, Y. Li, K. He, and J. Sun, R-FCN: Object detection via region-based fully convolutional networks, Advances in Neural Information Processing Systems, pp.379-387, 2016.

Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, Revisiting batch normalization for practical domain adaptation. arXiv Preprint, p.38, 2016.

Z. Li, E. Gavves, M. Jain, and C. G. Snoek, VideoLSTM convolves, attends and flows for action recognition, arXiv Preprint, p.121, 2016.
DOI : 10.1016/j.cviu.2017.10.011

X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin et al., Towards Computational Baby Learning: A Weakly-Supervised Approach for Object Detection, 2015 IEEE International Conference on Computer Vision (ICCV), pp.999-1007, 2015.
DOI : 10.1109/ICCV.2015.120

M. Lin, Q. Chen, and S. Yan, Network in network, International Conference on Learning Representations, p.19, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00737767

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, Proceedings of the European Conference on Computer Vision, p.141, 2014.
DOI : 10.1007/978-3-319-10602-1_48
URL : http://arxiv.org/pdf/1405.0312.pdf

J. Liu, B. Kuipers, and S. Savarese, Recognizing human actions by attributes, CVPR 2011, p.99, 2011.
DOI : 10.1109/CVPR.2011.5995353
URL : http://web.eecs.umich.edu/~kuipers/papers/Liu-cvpr-11_action_attributes.pdf

M. Liu and O. Tuzel, Coupled generative adversarial networks, Advances in Neural Information Processing Systems, pp.469-477, 2016.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed et al., SSD: Single Shot MultiBox Detector, Proceedings of the European Conference on Computer Vision. xiii, pp.31-86, 2016.
DOI : 10.1109/CVPR.2008.4587597
URL : http://arxiv.org/pdf/1512.02325

M. Long, Y. Cao, J. Wang, and M. I. Jordan, Learning transferable features with deep adaptation networks, International Conference on Machine Learning, pp.97-105, 2015.
DOI : 10.1109/tkde.2016.2554549

M. Long, J. Wang, and M. I. Jordan, Deep transfer learning with joint adaptation networks. arXiv Preprint, p.40, 2016.
DOI : 10.1109/iccv.2013.274
URL : http://learn.tsinghua.edu.cn:8080/2011310560/publications/joint-iccv14.pdf

M. Long, H. Zhu, J. Wang, and M. I. Jordan, Unsupervised domain adaptation with residual transfer networks, Advances in Neural Information Processing Systems, pp.136-144, 2016.

D. G. Lowe, Object recognition from local scale-invariant features, Proceedings of the Seventh IEEE International Conference on Computer Vision, pp.1150-1157, 1999.
DOI : 10.1109/ICCV.1999.790410
URL : http://www-inst.cs.berkeley.edu/~cs294-6/fa06/papers/LoweD_Object recognition from local scale-invariant features.pdf

C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, Visual Relationship Detection with Language Priors, Proceedings of the European Conference on Computer Vision, pp.90-104, 2016.
DOI : 10.1023/B:VISI.0000029664.99615.94

S. Ma, J. Zhang, N. Ikizler-Cinbis, and S. Sclaroff, Action Recognition and Localization by Hierarchical Space-Time Segments, 2013 IEEE International Conference on Computer Vision, p.89, 2013.
DOI : 10.1109/ICCV.2013.341
URL : http://cs-people.bu.edu/shugaoma/STSegments/iccv2013_preprint_shugao.pdf

T. Malisiewicz, A. Gupta, and A. Efros, Ensemble of exemplar-SVMs for object detection and beyond, 2011 International Conference on Computer Vision, p.89, 2011.
DOI : 10.1109/ICCV.2011.6126229

S. Manen, M. Guillaumin, and L. Van Gool, Prime Object Proposals with Randomized Prim's Algorithm, 2013 IEEE International Conference on Computer Vision, pp.2536-2543, 2013.
DOI : 10.1109/ICCV.2013.315

S. Manen, M. Gygli, D. Dai, and L. Van Gool, PathTrack: Fast trajectory annotation with path supervision. arXiv preprint, 2017.
DOI : 10.1109/iccv.2017.40

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang et al., Deep captioning with multimodal recurrent neural networks (m-RNN), International Conference on Learning Representations, p.90, 2015.

M. M. Puscas, E. Sangineto, D. Culibrk, and N. Sebe, Unsupervised Tube Extraction Using Transductive Learning and Dense Trajectories, 2015 IEEE International Conference on Computer Vision (ICCV), pp.1653-1661, 2015.
DOI : 10.1109/ICCV.2015.193

W. N. Martin and J. Aggarwal, Dynamic scene analysis: The study of moving images, 1977.
DOI : 10.21236/ADA042124

F. Massa, B. C. Russell, and M. Aubry, Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.6024-6033, 2016.
DOI : 10.1109/CVPR.2016.648
URL : http://arxiv.org/pdf/1512.02497

S. Mathe and C. Sminchisescu, Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition, Proceedings of the European Conference on Computer Vision, pp.842-856, 2012.
DOI : 10.1007/978-3-642-33709-3_60
URL : http://sminchisescu.ins.uni-bonn.de/papers/ms12eccv.pdf

P. Matikainen, M. Hebert, and R. Sukthankar, Representing Pairwise Spatial and Temporal Relations for Action Recognition, Computer Vision - ECCV, issue.8, pp.508-521, 2010.
DOI : 10.1007/978-3-642-15549-9_37
URL : http://www.ri.cmu.edu/pub_files/2010/9/eccv2010pyry.pdf

R. Messing, C. Pal, and H. Kautz, Activity recognition using the velocity histories of tracked keypoints, 2009 IEEE 12th International Conference on Computer Vision, pp.104-111, 2009.
DOI : 10.1109/ICCV.2009.5459154

P. Mettes, J. C. van Gemert, and C. G. Snoek, Spot On: Action Localization from Pointly-Supervised Proposals, Proceedings of the European Conference on Computer Vision, pp.437-453, 2016.
DOI : 10.1007/s11263-013-0636-x

I. Misra, A. Shrivastava, and M. Hebert, Watch and learn: Semi-supervised learning of object detectors from videos, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3593-3602, 2015.
DOI : 10.1109/CVPR.2015.7298982

A. Mittal, A. Raj, V. P. Namboodiri, and T. Tuytelaars, Unsupervised domain adaptation in the wild: Dealing with asymmetric label sets. arXiv preprint, 2016.

D. Modolo and V. Ferrari, Learning semantic part-based models from Google Images. arXiv Preprint, p.147, 2016.
DOI : 10.1109/tpami.2017.2724029

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, International Conference on Machine Learning, pp.689-696, 2011.

H. V. Nguyen, H. T. Ho, V. M. Patel, and R. Chellappa, DASH-N: Joint Hierarchical Domain Adaptation and Feature Learning, IEEE Transactions on Image Processing, vol.24, issue.12, pp.5479-5491, 2015.
DOI : 10.1109/TIP.2015.2479405

J. C. Niebles, C. Chen, and L. Fei-Fei, Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, Proceedings of the European Conference on Computer Vision, pp.392-405, 2010.
DOI : 10.1007/978-3-642-15552-9_29

H. Noh, S. Hong, and B. Han, Learning Deconvolution Network for Semantic Segmentation, 2015 IEEE International Conference on Computer Vision (ICCV), p.21, 2015.
DOI : 10.1109/ICCV.2015.178

S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen et al., A large-scale benchmark dataset for event recognition in surveillance video, CVPR 2011, pp.3153-3160, 2011.
DOI : 10.1109/CVPR.2011.5995586

S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen et al., A large-scale benchmark dataset for event recognition in surveillance video, CVPR 2011, p.14, 2011.
DOI : 10.1109/CVPR.2011.5995586

G. L. Oliveira, W. Burgard, and T. Brox, Efficient deep models for monocular road segmentation, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.4885-4891, 2016.
DOI : 10.1109/IROS.2016.7759717

B. Ommer, T. Mader, and J. M. Buhmann, Seeing the Objects Behind the Dots: Recognition in Videos from a Moving Camera, International Journal of Computer Vision, vol.3, issue.5, pp.57-71, 2009.
DOI : 10.1007/s11263-009-0211-7

D. Oneata, J. Revaud, J. Verbeek, and C. Schmid, Spatio-temporal Object Detection Proposals, Proceedings of the European Conference on Computer Vision, p.121, 2014.
DOI : 10.1007/978-3-319-10578-9_48
URL : https://hal.archives-ouvertes.fr/hal-01021902

D. Oneata, J. Verbeek, and C. Schmid, Efficient Action Localization with Approximately Normalized Fisher Vectors, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.2545-2552, 2014.
DOI : 10.1109/CVPR.2014.326
URL : https://hal.archives-ouvertes.fr/hal-00979594

W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo et al., DeepID-Net: Deformable deep convolutional neural networks for object detection, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2403-2412, 2015.
DOI : 10.1109/CVPR.2015.7298854
URL : http://www.ee.cuhk.edu.hk/%7Exgwang/papers/deepIDNetCVPR15.pdf

S. J. Pan and Q. Yang, A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering, vol.22, issue.10, p.56, 2010.
DOI : 10.1109/TKDE.2009.191

M. Pandey and S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, 2011 International Conference on Computer Vision, p.89, 2011.
DOI : 10.1109/ICCV.2011.6126383
URL : http://www.cs.unc.edu/~lazebnik/publications/megha_iccv2011.pdf

D. P. Papadopoulos, A. D. Clarke, F. Keller, and V. Ferrari, Training Object Class Detectors from Eye Tracking Data, Proceedings of the European Conference on Computer Vision, pp.361-376, 2014.
DOI : 10.1007/978-3-319-10602-1_24
URL : http://groups.inf.ed.ac.uk/calvin/Publications/papadopouloseccv14.pdf

D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari, We Don't Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.854-863, 2016.
DOI : 10.1109/CVPR.2016.99
URL : http://arxiv.org/pdf/1602.08405

D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari, Training Object Class Detectors with Click Supervision, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.148, 2017.
DOI : 10.1109/CVPR.2017.27

A. Papazoglou, L. Del Pero, and V. Ferrari, Discovering object aspects from video, Image and Vision Computing, vol.52, pp.206-217, 2016.
DOI : 10.1016/j.imavis.2016.04.014

A. Papazoglou, L. Del Pero, and V. Ferrari, Video Temporal Alignment for Object Viewpoint, Proceedings of the Asian Conference on Computer Vision, pp.273-288, 2016.
DOI : 10.1109/CVPR.2012.6248065

A. Papazoglou and V. Ferrari, Fast Object Segmentation in Unconstrained Video, 2013 IEEE International Conference on Computer Vision, p.67, 2013.
DOI : 10.1109/ICCV.2013.223

X. Peng and C. Schmid, Multi-region Two-Stream R-CNN for Action Detection, Proceedings of the European Conference on Computer Vision, pp.138-139, 2016.
DOI : 10.1109/CVPR.2015.7298735
URL : https://hal.archives-ouvertes.fr/hal-01349107

F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross et al., A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.724-732, 2016.
DOI : 10.1109/CVPR.2016.85

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, Proceedings of the European Conference on Computer Vision, pp.143-156, 2010.
DOI : 10.1007/978-3-642-15561-1_11
URL : https://hal.archives-ouvertes.fr/inria-00548630

J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383172

P. J. Phillips and A. J. O'Toole, Comparison of human and computer performance across face recognition experiments, Image and Vision Computing, vol.32, issue.1, pp.74-85, 2014.
DOI : 10.1016/j.imavis.2013.12.002

P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár, Learning to Refine Object Segments, Proceedings of the European Conference on Computer Vision, pp.90-102, 2016.
DOI : 10.5244/C.30.15
URL : https://infoscience.epfl.ch/record/224543/files/Pinheiro_ECCV_2016.pdf

H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, Globally-optimal greedy algorithms for tracking a variable number of objects, CVPR 2011, pp.1201-1208, 2011.
DOI : 10.1109/CVPR.2011.5995604

A. Prest, V. Ferrari, and C. Schmid, Explicit Modeling of Human-Object Interactions in Realistic Videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.4, pp.835-848, 2013.
DOI : 10.1109/TPAMI.2012.175
URL : https://hal.archives-ouvertes.fr/inria-00626929

A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, Learning object class detectors from weakly annotated video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.62-66, 2012.
DOI : 10.1109/CVPR.2012.6248065
URL : https://hal.archives-ouvertes.fr/hal-00695940

A. Prest, C. Schmid, and V. Ferrari, Weakly Supervised Learning of Interactions between Humans and Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.3, pp.601-614, 2012.
DOI : 10.1109/TPAMI.2011.158
URL : https://hal.archives-ouvertes.fr/inria-00516477

N. Quadrianto and C. H. Lampert, Learning multi-view neighborhood preserving projections, International Conference on Machine Learning, pp.425-432, 2011.

A. Raj, V. P. Namboodiri, and T. Tuytelaars, Subspace Alignment Based Domain Adaptation for RCNN Detector, Proceedings of the British Machine Vision Conference 2015, p.39, 2015.
DOI : 10.5244/C.29.166
URL : http://arxiv.org/pdf/1507.05578

D. Ramanan, D. A. Forsyth, and K. Barnard, Building models of animals from video, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.28, issue.8, pp.1319-1334, 2006.
DOI : 10.1109/TPAMI.2006.155

M. Raptis, I. Kokkinos, and S. Soatto, Discovering discriminative action parts from mid-level video representations, 2012 IEEE Conference on Computer Vision and Pattern Recognition, p.89, 2012.
DOI : 10.1109/CVPR.2012.6247807
URL : https://hal.archives-ouvertes.fr/hal-00918807

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.779-788, 2016.
DOI : 10.1109/CVPR.2016.91

J. Redmon and A. Farhadi, YOLO9000: Better, faster, stronger. arXiv Preprint, p.22, 2016.
DOI : 10.1109/cvpr.2017.690

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Advances in Neural Information Processing Systems, pp.91-94, 2015.
DOI : 10.1109/TPAMI.2016.2577031

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.6, p.29, 2015.
DOI : 10.1109/TPAMI.2016.2577031

S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, Object Detection Networks on Convolutional Feature Maps, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.7, p.22, 2016.
DOI : 10.1109/TPAMI.2016.2601099

M. D. Rodriguez, J. Ahmed, and M. Shah, Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008 IEEE Conference on Computer Vision and Pattern Recognition, p.86, 2008.
DOI : 10.1109/CVPR.2008.4587727
URL : http://longwood.cs.ucf.edu/~vision/papers/cvpr2008/7.pdf

G. Rogez, P. Weinzaepfel, and C. Schmid, LCR-Net: Localization-classification-regression for human pose, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.23, 2017.
DOI : 10.1109/cvpr.2017.134

K. Rohr, Towards model-based recognition of human movements in image sequences, CVGIP: Image Understanding, vol.59, issue.1, pp.94-115, 1994.

A. Rozantsev, M. Salzmann, and P. Fua, Beyond sharing weights for deep domain adaptation. arXiv Preprint, p.43, 2016.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, vol.1010, issue.1, p.57, 2015.
DOI : 10.1007/978-3-642-15555-0_11
URL : http://dspace.mit.edu/bitstream/1721.1/104944/1/11263_2015_Article_816.pdf

M. A. Sadeghi and A. Farhadi, Recognition using visual phrases, CVPR 2011, p.90, 2011.
DOI : 10.1109/CVPR.2011.5995711
URL : http://www.cs.rit.edu/%7Erlc/Courses/ImageUnderstanding/Papers/Current/visualPhrases.pdf

K. Saenko, B. Kulis, M. Fritz, and T. Darrell, Adapting Visual Category Models to New Domains, Proceedings of the European Conference on Computer Vision, pp.213-226, 2010.
DOI : 10.1007/978-3-642-15561-1_16
URL : http://www1.icsi.berkeley.edu/~saenko/saenko_eccv_2010.pdf

S. Saha, G. Singh, and F. Cuzzolin, AMTnet: Action-Micro-Tube Regression by End-to-end Trainable Deep Architecture, 2017 IEEE International Conference on Computer Vision (ICCV), p.49, 2017.
DOI : 10.1109/ICCV.2017.473

S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos, Proceedings of the British Machine Vision Conference 2016, pp.122-127, 2016.
DOI : 10.5244/C.30.58

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image Classification with the Fisher Vector: Theory and Practice, International Journal of Computer Vision, vol.73, issue.2, pp.222-245, 2013.
DOI : 10.1007/s11263-006-9794-4

S. Satkin and M. Hebert, Modeling the Temporal Extent of Actions, Proceedings of the European Conference on Computer Vision, pp.536-548, 2010.
DOI : 10.1007/978-3-642-15549-9_39

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., pp.32-36, 2004.
DOI : 10.1109/ICPR.2004.1334462
URL : http://www.nada.kth.se/%7Ecaputo/publik/icpr04actions.pdf

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., OverFeat: Integrated recognition, localization and detection using convolutional networks, International Conference on Learning Representations, p.22, 2014.

A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, Generalized Multiview Analysis: A discriminative latent space, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.2160-2167, 2012.
DOI : 10.1109/CVPR.2012.6247923

P. Sharma, C. Huang, and R. Nevatia, Unsupervised incremental learning for improved object detection in a video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.3298-3305, 2012.
DOI : 10.1109/CVPR.2012.6248067

P. Sharma and R. Nevatia, Efficient Detector Adaptation for Object Detection in a Video, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.44-56, 2013.
DOI : 10.1109/CVPR.2013.418

M. Shi, H. Caesar, and V. Ferrari, Weakly Supervised Object Localization Using Things and Stuff Transfer, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.366

M. Shi and V. Ferrari, Weakly Supervised Object Localization Using Size Estimates, Proceedings of the European Conference on Computer Vision, pp.105-121, 2016.
DOI : 10.1007/978-3-642-33786-4_5

X. Shu, G. Qi, J. Tang, and J. Wang, Weakly-Shared Deep Transfer Networks for Heterogeneous-Domain Knowledge Propagation, Proceedings of the 23rd ACM international conference on Multimedia, MM '15, pp.35-44, 2015.
DOI : 10.1145/2647868.2654914

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems, pp.95-105, 2014.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations, pp.95-130, 2015.

G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin, Online real time multiple spatiotemporal action localisation and prediction on a single platform, arXiv Preprint, pp.127-137, 2017.
DOI : 10.1109/iccv.2017.393

H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui et al., On learning to localize objects with minimal supervision, International Conference on Machine Learning, p.23, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00996849

S. Song and J. Xiao, Sliding Shapes for 3D Object Detection in Depth Images, Proceedings of the European Conference on Computer Vision, pp.634-651, 2014.
DOI : 10.1007/978-3-319-10599-4_41

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, CRCV-TR-12-01, pp.5-6, 2012.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol.15, issue.1, pp.1929-1958, 2014.

H. Su, J. Deng, and L. Fei-Fei, Crowdsourcing annotations for visual object detection, Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, p.148, 2012.

B. Sun and K. Saenko, Deep CORAL: Correlation alignment for deep domain adaptation, Computer Vision – ECCV 2016 Workshops, pp.443-450, 2016.

C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv Preprint, 2016.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298594

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.1701-1708, 2014.
DOI : 10.1109/CVPR.2014.220

K. Tang, V. Ramanathan, L. Fei-Fei, and D. Koller, Shifting weights: Adapting object detectors from image to video, Advances in Neural Information Processing Systems, pp.42-44, 2012.

K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei, Discriminative Segment Annotation in Weakly Labeled Video, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.56-59, 2013.
DOI : 10.1109/CVPR.2013.321
URL : http://www.cs.cmu.edu/~rahuls/pub/cvpr2013-crane-rahuls.pdf

Y. Tian, R. Sukthankar, and M. Shah, Spatiotemporal Deformable Part Models for Action Detection, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.2642-2649, 2013.
DOI : 10.1109/CVPR.2013.341
URL : http://www.cs.cmu.edu/~rahuls/pub/cvpr2013-sdpm-rahuls.pdf

P. Tokmakov, K. Alahari, and C. Schmid, Weakly-Supervised Semantic Segmentation Using Motion Cues, Proceedings of the European Conference on Computer Vision, pp.388-404, 2016.
DOI : 10.1109/TPAMI.2012.120
URL : https://hal.archives-ouvertes.fr/hal-01292794

P. Tokmakov, K. Alahari, and C. Schmid, Learning Motion Patterns in Videos, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.23, 2017.
DOI : 10.1109/CVPR.2017.64
URL : https://hal.archives-ouvertes.fr/hal-01427480

P. Tokmakov, K. Alahari, and C. Schmid, Learning Video Object Segmentation with Visual Memory, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.480
URL : https://hal.archives-ouvertes.fr/hal-01511145

A. Torralba and A. A. Efros, Unbiased look at dataset bias, CVPR 2011, pp.75-161, 2011.
DOI : 10.1109/CVPR.2011.5995347
URL : http://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf

D. Tran and J. Yuan, Optimal spatio-temporal path discovery for video event detection, CVPR 2011, pp.3321-3328, 2011.
DOI : 10.1109/CVPR.2011.5995416

D. Tran and J. Yuan, Max-margin structured output regression for spatiotemporal action localization, Advances in Neural Information Processing Systems, pp.350-358, 2012.

E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, Simultaneous Deep Transfer Across Domains and Tasks, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4068-4076, 2015.
DOI : 10.1109/ICCV.2015.463

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, Adversarial discriminative domain adaptation. arXiv preprint, pp.42-113, 2017.
DOI : 10.1109/cvpr.2017.316

E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, Deep domain confusion: Maximizing for domain invariance. arXiv Preprint, p.40, 2014.

J. R. Uijlings, K. E. Van-de-sande, T. Gevers, and A. W. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.57, issue.1, pp.27-47, 2013.
DOI : 10.1023/B:VISI.0000013087.49260.fb

L. Van-der-maaten and G. Hinton, Visualizing data using t-SNE, p.76, 2008.

S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell et al., Sequence to Sequence -- Video to Text, 2015 IEEE International Conference on Computer Vision (ICCV), p.87, 2015.
DOI : 10.1109/ICCV.2015.515
URL : http://arxiv.org/pdf/1505.00487

A. Vezhnevets and V. Ferrari, Object localization in ImageNet by looking out of the window, Proceedings of the British Machine Vision Conference 2015, p.23, 2015.
DOI : 10.5244/C.29.27

P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, Extracting and composing robust features with denoising autoencoders, Proceedings of the 25th international conference on Machine learning, ICML '08, pp.1096-1103, 2008.
DOI : 10.1145/1390156.1390294

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.90, 2015.
DOI : 10.1109/CVPR.2015.7298935
URL : http://arxiv.org/pdf/1411.4555

P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, p.89, 2001.
DOI : 10.1109/CVPR.2001.990517

A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory, vol.13, issue.2, pp.260-269, 1967.
DOI : 10.1109/TIT.1967.1054010

M. Volpi and V. Ferrari, Semantic segmentation of urban scenes by learning local class interactions, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.1-9, 2015.
DOI : 10.1109/CVPRW.2015.7301377

A. Wang, J. Lu, J. Cai, T. Cham, and G. Wang, Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition, IEEE Transactions on Multimedia, vol.17, issue.11, pp.1887-1898, 2015.
DOI : 10.1109/TMM.2015.2476655

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense Trajectories and Motion Boundary Descriptors for Action Recognition, International Journal of Computer Vision, vol.73, issue.2, pp.60-79, 2013.
DOI : 10.1007/s11263-006-9794-4
URL : https://hal.archives-ouvertes.fr/hal-00725627

H. Wang, D. Oneata, J. Verbeek, and C. Schmid, A Robust and Efficient Video Representation for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, p.89, 2015.
DOI : 10.1109/ICCV.2013.442
URL : https://hal.archives-ouvertes.fr/hal-01145834

L. Wang, Y. Qiao, and X. Tang, Video Action Detection with Relational Dynamic-Poselets, Proceedings of the European Conference on Computer Vision, pp.565-580, 2014.
DOI : 10.1007/978-3-319-10602-1_37

L. Wang, Y. Qiao, X. Tang, and L. Van Gool, Actionness Estimation Using Hybrid Fully Convolutional Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.142, 2016.
DOI : 10.1109/CVPR.2016.296

X. Wang and A. Gupta, Unsupervised Learning of Visual Representations Using Videos, 2015 IEEE International Conference on Computer Vision (ICCV), pp.2794-2802, 2015.
DOI : 10.1109/ICCV.2015.320

X. Wang, G. Hua, and T. X. Han, Detection by detections: Non-parametric detector adaptation for a video, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.350-357, 2012.

P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Learning to track for spatiotemporal action localization, Proceedings of the International Conference on Computer Vision, pp.127-129, 2015.
DOI : 10.1109/iccv.2015.362
URL : https://hal.archives-ouvertes.fr/hal-01159941

P. Weinzaepfel, X. Martin, and C. Schmid, Human action localization with sparse spatial supervision, p.150, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01317558

C. Xu and J. J. Corso, Evaluation of super-voxel methods for early video processing, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1202-1209, 2012.

C. Xu and J. J. Corso, Actor-action semantic segmentation with grouping process models, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.90, 2016.
DOI : 10.1109/cvpr.2016.336
URL : http://arxiv.org/pdf/1512.09041

C. Xu, S. Hsieh, C. Xiong, and J. J. Corso, Can humans fly? Action understanding with multiple classes of actors, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.88-90, 2015.
DOI : 10.1109/CVPR.2015.7298839

J. Xu, S. Ramos, D. Vázquez, and A. M. Lopez, Incremental Domain Adaptation of Deformable Part-based Models, Proceedings of the British Machine Vision Conference 2014, pp.2367-2380, 2014.
DOI : 10.5244/C.28.120

J. Yamato, J. Ohya, and K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.379-385, 1992.
DOI : 10.1109/cvpr.1992.223161

J. Yang, R. Yan, and A. G. Hauptmann, Adapting SVM Classifiers to Data with Shifted Distributions, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pp.69-76, 2007.
DOI : 10.1109/ICDMW.2007.37
URL : http://repository.cmu.edu/cgi/viewcontent.cgi?article=1943&context=compsci

J. Yang, R. Yan, and A. G. Hauptmann, Cross-domain video concept detection using adaptive SVMs, Proceedings of the 15th international conference on Multimedia, MULTIMEDIA '07, pp.188-197, 2007.
DOI : 10.1145/1291233.1291276

B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas et al., Human action recognition by learning bases of action attributes and parts, 2011 International Conference on Computer Vision, p.90, 2011.
DOI : 10.1109/ICCV.2011.6126386

L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal et al., Describing Videos by Exploiting Temporal Structure, 2015 IEEE International Conference on Computer Vision (ICCV), p.87, 2015.
DOI : 10.1109/ICCV.2015.512

D. Yoo, S. Park, J. Lee, A. S. Paek, S. Kweon et al., AttentionNet: Aggregating Weak Directions for Accurate Object Detection, 2015 IEEE International Conference on Computer Vision (ICCV), pp.2659-2667, 2015.
DOI : 10.1109/ICCV.2015.305
URL : http://arxiv.org/pdf/1506.07704

G. Yu and J. Yuan, Fast action proposals for human action detection and search, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.121, 2015.
DOI : 10.1109/CVPR.2015.7298735

J. Yuan, Z. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2442-2449, 2009.

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, Proceedings of the European Conference on Computer Vision, pp.818-833, 2014.
DOI : 10.1007/978-3-319-10590-1_53
URL : http://cs.nyu.edu/%7Efergus/papers/zeilerECCV2014.pdf

K. Zhang, W. Chao, F. Sha, and K. Grauman, Summary Transfer: Exemplar-Based Subset Selection for Video Summarization, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1059-1067, 2016.
DOI : 10.1109/CVPR.2016.120
URL : http://arxiv.org/pdf/1603.03369

X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, Flow-guided feature aggregation for video object detection. arXiv Preprint, p.51, 2017.
DOI : 10.1109/iccv.2017.52

X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, Deep Feature Flow for Video Recognition, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.50, 2017.
DOI : 10.1109/CVPR.2017.441

C. L. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, Proceedings of the European Conference on Computer Vision, pp.391-405, 2014.
DOI : 10.1007/978-3-319-10602-1_26
URL : http://research.microsoft.com/en-us/um/people/larryz/ZitnickDollarECCV14edgeBoxes.pdf

M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection, p.148, 2017.
DOI : 10.1109/iccv.2017.316