P. Agrawal, J. Carreira, and J. Malik, Learning to see by moving, Proceedings of the IEEE International Conference on Computer Vision, p.77, 2015.

A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks, ECCV, p.59, 2008.

B. Alexe, T. Deselaers, and V. Ferrari, What is an object?, Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, vol.41, p.75, 2010.

B. Alexe, T. Deselaers, and V. Ferrari, Measuring the objectness of image windows, IEEE transactions on pattern analysis and machine intelligence, vol.34, p.41, 2012.

R. Arandjelović and A. Zisserman, Look, listen and learn, ICCV, p.70, 2017.

P. Arbeláez, J. Pont-tuset, J. T. Barron, F. Marques, and J. Malik, Multiscale combinatorial grouping, CVPR, p.65, 2014.

M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein GAN, vol.151, p.261, 2017.

S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, Generalization and equilibrium in generative adversarial nets (gans), vol.169, p.180, 2017.

Y. Aytar and A. Zisserman, Tabula rasa: Model transfer for object category detection, ICCV, p.59, 2011.

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on neural networks, vol.5, issue.2, p.54, 1994.

H. Bilen and A. Vedaldi, Weakly supervised deep detection networks, CVPR, vol.76, p.130, 2016.

M. Blaschko, A. Vedaldi, and A. Zisserman, Simultaneous object detection and ranking with weak supervision, NIPS, p.74, 2010.

P. Bojanowski and A. Joulin, Unsupervised Learning by Predicting Noise, p.179, 2017.

L. Bottou, Online algorithms and stochastic approximations, Online Learning and Neural Networks, vol.206, p.207, 1998.

L. Bottou, From machine learning to machine reasoning, 2011.

L. Bottou and Y. Lecun, Sn: A simulator for connectionist models, Proceedings of NeuroNimes 88, p.45, 1988.

L. Bottou, Stochastic gradient learning in neural networks, Proceedings of Neuro-Nîmes, vol.91, p.207, 1991.

Y. Boureau, F. Bach, Y. Lecun, and J. Ponce, Learning mid-level features for recognition, CVPR, p.58, 2010.

T. Brox, L. Bourdev, S. Maji, and J. Malik, Object segmentation by alignment of poselet activations to image contours, CVPR, p.72, 2011.

J. Bruna and S. Mallat, Invariant scattering convolution networks, IEEE PAMI, vol.35, issue.8, p.179, 2013.

J. B. Burns, R. S. Weiss, and E. M. Riseman, View variation of point-set and line-segment features, IEEE PAMI, vol.15, issue.1, p.29, 1993.

M. Campbell, A. J. Hoane, and F. Hsu, Deep Blue, Artificial intelligence, vol.134, issue.1-2, p.60, 2002.

J. Canny, A computational approach to edge detection, IEEE PAMI, issue.6, p.29, 1986.

J. Carreira and A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, Proc. BMVC, vol.111, p.121, 2014.

K. Chellapilla, S. Puri, and P. Simard, High performance convolutional neural networks for document processing, Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00112631

X. Chen, A. Shrivastava, and A. Gupta, Neil: Extracting visual knowledge from web data, ICCV, 2013.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever et al., Infogan: Interpretable representation learning by information maximizing generative adversarial nets, Advances in Neural Information Processing Systems, p.153, 2016.

X. Chen and A. L. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, NIPS, p.67, 2014.

S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran et al., Efficient primitives for deep learning, vol.60, p.242, 2014.

L. Chizat, G. Peyré, B. Schmitzer, and F. Vialard, Scaling Algorithms for Unbalanced Transport Problems, p.261, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01434914

O. Chum and A. Zisserman, An exemplar model for learning object classes, CVPR, p.74, 2007.

R. G. Cinbis, J. Verbeek, and C. Schmid, Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning, IEEE PAMI, vol.39, issue.1, p.75, 2017.

R. Collobert, K. Kavukcuoglu, and C. Farabet, Torch7: A matlab-like environment for machine learning, BigLearn, NIPS Workshop, vol.25, p.154, 2011.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu et al., Natural language processing (almost) from scratch, JMLR, vol.12, p.102, 2011.

D. Crandall and D. Huttenlocher, Weakly supervised learning of part-based spatial models for visual object recognition, ECCV, p.74, 2006.

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, ECCV Workshop, vol.34, p.111, 2004.

M. Cuturi, Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances, NIPS, p.261, 2013.

N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR, vol.39, p.87, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548512

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR, vol.87, p.88, 2009.

E. L. Denton, S. Chintala, and R. Fergus, Deep generative image models using a Laplacian pyramid of adversarial networks, NIPS, vol.81, p.141, 2015.

T. Deselaers, B. Alexe, and V. Ferrari, Localizing objects while learning their appearance, ECCV, p.75, 2010.

S. Divvala, A. Farhadi, and C. Guestrin, Learning everything about anything: Webly-supervised visual concept learning, CVPR, p.73, 2014.

C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, What makes paris look like paris?, ACM Transactions on Graphics (TOG), vol.31, issue.4, p.73, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01248528

C. Doersch, A. Gupta, and A. A. Efros, Unsupervised visual representation learning by context prediction, ICCV, vol.78, p.179, 2015.

J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang et al., Decaf: A deep convolutional activation feature for generic visual recognition, ICML, vol.58, p.107, 2014.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, JMLR, vol.12, p.247, 2011.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, JMLR, vol.12, p.62, 2011.

R. O. Duda and P. E. Hart, Use of the Hough transformation to detect lines and curves in pictures, Communications of the ACM, vol.15, issue.1, p.29, 1972.

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The pascal visual object classes (VOC) challenge, IJCV, vol.88, issue.2, p.193, 2010.

C. Farabet, C. Couprie, L. Najman, and Y. Lecun, Learning hierarchical features for scene labeling, IEEE PAMI, vol.88, p.97, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00742077

A. Farhadi, M. K. Tabrizi, I. Endres, and D. Forsyth, A latent model of discriminative aspect, ICCV, p.59, 2009.

L. Fei-fei, R. Fergus, and P. Perona, Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories, vol.106, p.51, 2007.
URL : https://hal.archives-ouvertes.fr/hal-02053466

P. Felzenszwalb, D. Mcallester, and D. Ramanan, A discriminatively trained, multiscale, deformable part model, CVPR, vol.93, p.111, 2008.

P. Felzenszwalb, R. Girshick, D. Mcallester, and D. Ramanan, Object detection with discriminatively trained part based models, IEEE PAMI, vol.32, issue.9, p.118, 2010.

P. F. Felzenszwalb and D. P. Huttenlocher, Efficient graph-based image segmentation, IJCV, vol.59, issue.2, p.42, 2004.

R. Fergus, P. Perona, and A. Zisserman, Object class recognition by unsupervised scale-invariant learning, CVPR, p.74, 2003.

R. Fergus, P. Perona, and A. Zisserman, Object class recognition by unsupervised scale-invariant learning, CVPR, vol.2, p.58, 2003.

M. A. Fischler and R. A. Elschlager, The representation and matching of pictorial structures, IEEE Transactions on computers, vol.100, issue.1, p.37, 1973.

J. Foulds and E. Frank, A review of multi-instance learning assumptions, The Knowledge Engineering Review, vol.25, p.76, 2010.

K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik, Learning visual predictive models of physics for playing billiards, ICLR, p.154, 2016.

K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological cybernetics, vol.36, issue.4, p.87, 1980.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle et al., Domain-adversarial training of neural networks, JMLR, vol.17, issue.59, p.174, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01624607

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, vol.127, p.193, 2014.

R. Girshick, Fast R-CNN, ICCV, p.63, 2015.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS, vol.54, p.244, 2010.

X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, International Conference on Artificial Intelligence and Statistics, vol.54, p.57, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00752497

I. Goodfellow, J. Pouget-abadie, M. Mirza, B. Xu, D. Warde-farley et al., Generative adversarial nets, NIPS, vol.173, p.178, 2014.

K. Grauman and T. Darrell, The pyramid match kernel: Discriminative classification with sets of image features, Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol.2, p.35, 2005.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, A kernel two-sample test, JMLR, vol.143, p.147, 2012.

A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, A kernel method for the two-sample-problem, Advances in neural information processing systems, p.163, 2007.

G. Griffin, A. Holub, and P. Perona, Caltech-256 object category dataset, CalTech, vol.87, 2007.

S. Gross and M. Wilber, Training and investigating residual nets, 2016.

M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation, CVPR, p.73, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00439276

A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. Mccormac et al., gvnn: Neural network library for geometric computer vision, ECCV Workshop on Geometry Meets Deep Learning, p.26, 2016.

B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, Simultaneous detection and segmentation, ECCV, p.65, 2014.

H. Harzallah, F. Jurie, and C. Schmid, Combining efficient object localization and image classification, CVPR, p.111, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00439516

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, vol.61, p.147, 2016.

D. O. Hebb, The organization of behavior: A neuropsychological theory, p.43, 1949.

M. Hejrati and D. Ramanan, Analyzing 3d objects in cluttered images, NIPS, p.73, 2012.

G. E. Hinton, Learning multiple layers of representation, Trends in cognitive sciences, vol.11, issue.10, p.88, 2007.

G. E. Hinton and R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, vol.313, issue.5786, p.88, 2006.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol.9, issue.8, p.248, 1997.

G. B. Huang, M. Ramesh, T. Berg, and E. Learned-miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, vol.146, p.254, 2007.

D. H. Hubel and T. N. Wiesel, Receptive fields of single neurones in the cat's striate cortex, Journal of Physiology, vol.148, p.87, 1959.

D. P. Huttenlocher, Object recognition using alignment, ICCV, p.29, 1987.

A. Hyvarinen, J. Karhunen, and E. Oja, Independent component analysis, vol.257, p.258, 2001.

S. Ioffe and D. A. Forsyth, Probabilistic methods for finding people, IJCV, vol.43, issue.1, p.38, 2001.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, vol.61, p.251, 2015.

P. Isola, J. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with conditional adversarial networks, vol.136, p.143, 2017.

M. Jaderberg, K. Simonyan, and A. Zisserman, Spatial transformer networks, NIPS, p.26, 2015.

D. Jayaraman and K. Grauman, Learning image representations tied to ego-motion, Proceedings of the IEEE International Conference on Computer Vision, p.77, 2015.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe: Convolutional architecture for fast feature embedding, Proceedings of the 22nd ACM international conference on Multimedia, vol.58, p.107, 2014.

J. Jiang and C. Zhai, Instance weighting for domain adaptation in NLP, ACL, p.89, 2007.

W. Jitkrittum, Z. Szabo, K. Chwialkowski, and A. Gretton, Interpretable distribution features with maximum testing power, NIPS, p.147, 2016.

J. Johnson, A. Alahi, and L. Fei-fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, ECCV, p.149, 2016.

J. Johnson, A. Karpathy, and L. Fei-fei, Densecap: Fully convolutional localization networks for dense captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.26, 2016.

A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache, Learning visual features from large weakly supervised data, European Conference on Computer Vision, p.73, 2016.

M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, Blocks that shout: Distinctive parts for scene classification, CVPR, p.58, 2013.

V. Kantorov, M. Oquab, M. Cho, and I. Laptev, Contextlocnet: Context-aware deep network models for weakly supervised localization, ECCV, vol.25, p.130, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01421772

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, CVPR, p.70, 2014.

K. Kavukcuoglu, R. Fergus, and Y. Lecun, Learning invariant features through topographic filter maps, CVPR, vol.52, p.53, 2009.

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The kinetics human action video dataset, p.70, 2017.

J. D. Keeler, D. E. Rumelhart, and W. K. Leow, Integrated segmentation and recognition of hand-printed numerals, NIPS, p.112, 1991.

A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba, Undoing the damage of dataset bias, ECCV, p.59, 2012.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, p.149, 2015.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, vol.62, p.247, 2015.

D. P. Kingma and M. Welling, Auto-encoding variational bayes, vol.135, p.141, 2013.

J. J. Koenderink and A. J. van Doorn, Representation of local geometry in the visual system, Biological cybernetics, vol.55, issue.6, p.31, 1987.

D. Kotzias, M. Denil, P. Blunsom, and N. de Freitas, Deep multi-instance transfer learning, NIPS Deep Learning Workshop, p.76, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, NIPS, vol.247, p.250, 2012.

A. Krizhevsky, Learning multiple layers of features from tiny images, vol.55, p.56, 2009.

J. Lafond, N. Vasilache, and L. Bottou, Diagonal rescaling for neural networks, 2017.

K. J. Lang and G. E. Hinton, A time delay neural network architecture for speech recognition, vol.45, p.111, 1988.

K. J. Lang, A. H. Waibel, and G. E. Hinton, A time-delay neural network architecture for isolated word recognition, Neural networks, vol.3, issue.1, p.112, 1990.

A. Lavin and S. Gray, Fast algorithms for convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.242, 2016.

C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for Fortran usage, ACM Transactions on Mathematical Software (TOMS), vol.5, issue.3, p.237, 1979.

S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, CVPR, vol.35, p.87, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00548585

Q. Le, W. Zou, S. Yeung, and A. Ng, Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis, CVPR, p.58, 2011.

Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen et al., Building high-level features using large scale unsupervised learning, ICML, vol.53, p.92, 2012.

Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard et al., Backpropagation applied to handwritten zip code recognition, Neural Computation, vol.1, issue.4, p.111, 1989.

Y. Lecun, L. Bottou, and Y. Bengio, Reading checks with graph transformer networks, International Conference on Acoustics, Speech, and Signal Processing, vol.1, p.63, 1997.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, PIEEE, vol.86, issue.11, p.87, 1998.

Y. Lecun, L. Bottou, G. Orr, and K. Muller, Efficient backprop, Neural Networks: Tricks of the trade, vol.245, p.251, 1998.

Y. Lecun, F. J. Huang, and L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, CVPR, vol.2, p.87, 2004.

C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham et al., Photo-realistic single image super-resolution using a generative adversarial network, p.142, 2016.

E. L. Lehmann and J. P. Romano, Testing statistical hypotheses, p.144, 2006.

A. Lerer, S. Gross, and R. Fergus, Learning physical intuition of block towers by example, p.154, 2016.

T. Leung and J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, International journal of computer vision, vol.43, issue.1, p.135, 2001.

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, CVPR, vol.65, p.130, 2015.

D. Lopez-paz, From dependence to causation, vol.136, p.153, 2016.

D. Lopez-paz, K. Muandet, B. Schölkopf, and I. Tolstikhin, Towards a learning theory of cause-effect inference, ICML, p.161, 2015.

D. Lopez-paz and M. Oquab, Revisiting classifier two-sample tests, vol.25, p.146, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01862834

D. Lopez-paz, R. Nishihara, S. Chintala, B. Schölkopf, and L. Bottou, Discovering causal signals in images, vol.152, p.178, 2017.

D. Lowe, Distinctive image features from scale-invariant keypoints, IJCV, vol.60, issue.2, pp.91-110, 2004.

D. G. Lowe, Object recognition from local scale-invariant features, ICCV, vol.2, p.31, 1999.

D. G. Lowe, Distinctive image features from scale-invariant keypoints, IJCV, vol.60, issue.2, p.32, 2004.

D. Marr and H. K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes, Proceedings of the Royal Society of London B: Biological Sciences, vol.200, issue.1140, p.37, 1978.

M. Marszalek, C. Schmid, H. Harzallah, and J. van de Weijer, Learning object representations for visual object class recognition, Visual Recognition Challenge workshop, ICCV, vol.99, p.100, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00548669

M. Mathieu, M. Henaff, and Y. Lecun, Fast training of convolutional networks through ffts, ICLR, p.242, 2014.

L. Metz, B. Poole, D. Pfau, and J. Sohl-dickstein, Unrolled generative adversarial networks, vol.147, p.157, 2017.

K. Mikolajczyk and C. Schmid, An affine invariant interest point detector, p.34, 2002.
URL : https://hal.archives-ouvertes.fr/inria-00548252

M. Mirza and S. Osindero, Conditional generative adversarial nets, vol.141, p.160, 2014.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou et al., Playing atari with deep reinforcement learning, p.60, 2013.

J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf, Distinguishing cause from effect using observational data: methods and benchmarks, JMLR, vol.152, p.180, 2016.

R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi, Newtonian scene understanding: Unfolding the dynamics of objects in static images, CVPR, p.153, 2016.

H. Murase and S. K. Nayar, Visual learning and recognition of 3-d objects from appearance, IJCV, vol.14, issue.1, p.30, 1995.

Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, vol.27, p.245, 1983.

A. Newell, K. Yang, and J. Deng, Stacked hourglass networks for human pose estimation, ECCV, vol.68, p.69, 2016.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, Proceedings of the 28th international conference on machine learning (ICML-11), p.77, 2011.

S. Nowozin, B. Cseke, and R. Tomioka, f-GAN: Training generative neural samplers using variational divergence minimization, NIPS, p.163, 2016.

A. Odena, V. Dumoulin, and C. Olah, Deconvolution and checkerboard artifacts, vol.147, p.148, 2016.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, CVPR, vol.25, p.111, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00911179

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free?-weakly-supervised learning with convolutional neural networks, CVPR, p.25, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01015140

V. Ordonez, G. Kulkarni, and T. Berg, Im2text: Describing images using 1 million captioned photographs, NIPS, vol.72, p.73, 2011.

R. Osadchy, M. Miller, and Y. Lecun, Synergistic face detection and pose estimation with energy-based model, NIPS, vol.87, 2005.

S. Pan and Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, vol.22, issue.10, p.59, 2010.

M. Pandey and S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, ICCV, p.75, 2011.

G. Papandreou, I. Kokkinos, and P. Savalle, Untangling Local and Global Deformations in Deep Convolutional Networks for Image Classification and Sliding Window Detection, CVPR, p.117, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01109289

D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, Context encoders: Feature learning by inpainting, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol.81, p.82, 2016.

J. Pearl, Causality, p.158, 2009.

F. Perronnin, J. Sánchez, and T. Mensink, Improving the fisher kernel for large-scale image classification, ECCV, vol.87, p.111, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00548630

T. Pfister, J. Charles, and A. Zisserman, Flowing convnets for human pose estimation in videos, ICCV, p.67, 2015.

P. O. Pinheiro, R. Collobert, and P. Dollár, Learning to segment object candidates, NIPS, vol.66, p.130, 2015.

H. Pirsiavash and D. Ramanan, Detecting activities of daily living in first-person camera views, CVPR, p.89, 2012.

B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, vol.4, issue.5, p.245, 1964.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes: The Art of Scientific Computing

A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, Learning object class detectors from weakly annotated video, CVPR, p.73, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00695940

A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, ICLR, vol.156, p.254, 2016.

V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, Pose machines: Articulated pose estimation via inference machines, ECCV, p.68, 2014.

M. Ranzato, F. J. Huang, Y. Boureau, and Y. Lecun, Unsupervised learning of invariant feature hierarchies with applications to object recognition, CVPR, p.52, 2007.

M. Ranzato, Y. Boureau, and Y. Lecun, Sparse feature learning for deep belief networks, NIPS, p.52, 2007.

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, CVPR DeepVision workshop, vol.58, p.111, 2014.

S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele et al., Learning what and where to draw, Advances in Neural Information Processing Systems, p.26, 2016.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, vol.63, p.64, 2015.

X. Ren and D. Ramanan, Histograms of sparse codes for object detection, CVPR, p.58, 2013.

L. G. Roberts, Machine perception of three-dimensional solids, p.28, 1963.

F. Rosenblatt, The perceptron: A perceiving and recognizing automaton, Project PARA, vol.43, p.87, 1957.

C. Rothwell, A. Zisserman, D. Forsyth, and J. Mundy, Canonical frames for planar object recognition, ECCV, p.29, 1992.

Y. Rubner, C. Tomasi, and L. J. Guibas, The earth mover's distance as a metric for image retrieval, IJCV, vol.40, issue.2, p.166, 2000.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, vol.323, issue.6088, p.250, 1986.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, vol.56, p.192

K. Saenko, B. Kulis, M. Fritz, and T. Darrell, Adapting visual category models to new domains, ECCV, vol.59, p.89, 2010.

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford et al., Improved techniques for training GANs, NIPS, vol.142, p.149, 2016.

T. Schaul, S. Zhang, and Y. Lecun, No more pesky learning rates, ICML, p.61, 2013.

B. Schiele and J. L. Crowley, Object recognition using multidimensional receptive field histograms, European Conference on Computer Vision, p.30, 1996.
URL : https://hal.archives-ouvertes.fr/tel-00004962

C. Schmid and R. Mohr, Local grayvalue invariants for image retrieval, IEEE PAMI, vol.19, issue.5, p.32, 1997.
URL : https://hal.archives-ouvertes.fr/inria-00548358

B. Schmitzer, Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems, p.261, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01385251

B. Schölkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, p.34, 2002.

B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang et al., On causal and anticausal learning, ICML, p.152, 2012.

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., Overfeat: Integrated recognition, localization and detection using convolutional networks, ICLR, vol.63, p.130, 2014.

A. Shrivastava and A. Gupta, Building part-based object detectors via 3d geometry, ICCV, 2013.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre et al., Mastering the game of go with deep neural networks and tree search, Nature, vol.529, issue.7587, p.60, 2016.

P. Simard, D. Steinkraus, and J. C. Platt, Best practices for convolutional neural networks applied to visual document analysis, ICDAR, vol.3, pp.958-962, 2003.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, vol.70, p.71, 2014.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, p.61, 2014.

S. Singh, A. Gupta, and A. A. Efros, Unsupervised discovery of mid-level discriminative patches, ECCV, p.58, 2012.

L. Sirovich and M. Kirby, Low-dimensional procedure for the characterization of human faces, Journal of the Optical Society of America A, vol.4, issue.3, p.30, 1987.

J. Sivic and A. Zisserman, Video Google: A text retrieval approach to object matching in videos, ICCV, vol.32, p.135, 2003.

J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, Discovering object categories in image collections, ICCV, p.135, 2005.

H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui et al., On learning to localize objects with minimal supervision, ICML, p.75, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00996849

Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, Contextualizing object detection and classification, CVPR, vol.99, p.121, 2011.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, JMLR, vol.15, issue.1, pp.1929-1958, 2014.

C. Sun, M. Paluri, R. Collobert, R. Nevatia, and L. Bourdev, Pronet: Learning to propose object-specific boxes for cascaded neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.76, 2016.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, International conference on machine learning, p.246, 2013.

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, vol.1, p.133, 1998.

M. J. Swain and D. H. Ballard, Color indexing, International journal of computer vision, vol.7, issue.1, p.30, 1991.

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, p.80, 2013.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, CVPR, p.61, 2015.

G. W. Taylor, R. Fergus, Y. Lecun, and C. Bregler, Convolutional learning of spatiotemporal features, ECCV, p.58, 2010.

L. Theis, A. van den Oord, and M. Bethge, A note on the evaluation of generative models, ICLR, vol.142, p.262, 2016.

T. Tieleman and G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning, vol.4, p.247, 2012.

T. Tommasi, F. Orabona, and B. Caputo, Safety in numbers: Learning categories from few examples with multi model knowledge transfer, CVPR, p.59, 2010.

A. Torralba and A. A. Efros, Unbiased look at dataset bias, CVPR, vol.88, p.190, 2011.

A. Toshev and C. Szegedy, Deeppose: Human pose estimation via deep neural networks, CVPR, vol.66, p.111, 2014.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning spatiotemporal features with 3d convolutional networks, ICCV, p.70, 2015.

M. Turk and A. Pentland, Eigenfaces for recognition, Journal of cognitive neuroscience, vol.3, issue.1, p.30, 1991.

R. Vaillant, C. Monrocq, and Y. Lecun, Original approach for the localisation of objects in images, IEE Proc on Vision, Image, and Signal Processing, vol.141, issue.4, pp.245-250, 1994.

K. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders, Segmentation as Selective Search for Object Recognition, ICCV, vol.63, p.129, 2011.

V. Vapnik and A. Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Measures of Complexity, pp.11-30, 2015.

C. Villani, Optimal transport: old and new, p.171, 2008.

P. Viola, J. Platt, and C. Zhang, Multiple instance boosting for object detection, NIPS, p.76, 2005.

C. Wang, W. Ren, K. Huang, and T. Tan, Weakly supervised object localization with latent category learning

H. Wang and C. Schmid, Action recognition with improved trajectories, ICCV, p.69, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873267

H. Wang, A. Kläser, C. Schmid, and C. Liu, Action recognition by dense trajectories, CVPR, p.69, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00583818

X. Wang and A. Gupta, Unsupervised learning of visual representations using videos, ICCV, vol.78, p.179, 2015.

S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, Convolutional pose machines, CVPR, p.68, 2016.

Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong et al., CNN: Single-label to multi-label, p.121, 2014.

I. Weiss, Projective invariants of shapes, CVPR, p.29, 1988.

J. Winn and N. Jojic, Locus: Learning object classes with unsupervised segmentation, ICCV, p.74, 2005.

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi et al., Google's neural machine translation system: Bridging the gap between human and machine translation, vol.60, p.248, 2016.

P. Yadollahpour, D. Batra, and G. Shakhnarovich, Discriminative re-ranking of diverse segmentations, CVPR, 2013.

S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan et al., Generalized hierarchical matching for sub-category aware object classification, Visual Recognition Challenge workshop, vol.99, p.101, 2012.

Y. Yang and D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, CVPR, vol.66, p.111, 2011.

F. Yu, Y. Zhang, S. Song, A. Seff, J. Xiao et al., Construction of a large-scale image dataset using deep learning with humans in the loop, vol.146, p.254, 2015.

M. Zeiler, G. Taylor, and R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning, ICCV, vol.11, p.92

M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, ECCV, vol.243, p.250, 2014.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, p.247, 2017.

J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, Local features and kernels for classification of texture and object categories: a comprehensive study, IJCV, vol.73, issue.2, pp.213-238, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00548574

R. Zhang, P. Isola, and A. A. Efros, Colorful image colorization, European Conference on Computer Vision, vol.78, p.79, 2016.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Object Detectors Emerge in Deep Scene CNNs, vol.129, p.130, 2015.

J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, Generative visual manipulation on the natural image manifold, ECCV, vol.141, p.280, 2016.