F. Bach and Z. Harchaoui, DIFFRAC: a discriminative and flexible framework for clustering, NIPS, 2007.

C. F. Baker, C. J. Fillmore, and J. B. Lowe, The Berkeley FrameNet project, COLING-ACL, 1998.

K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei et al., Matching words and pictures, J. Machine Learning Research, issue 2, 2003.

T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White et al., Names and faces in the news, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), 2004.
DOI : 10.1109/CVPR.2004.1315253

T. Cour, B. Sapp, C. Jordan, and B. Taskar, Learning from ambiguously labeled images, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206667

D. Das, A. F. Martins, and N. A. Smith, An exact dual decomposition algorithm for shallow semantic parsing with constraints, *SEM, 2012.

O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459279

M. Everingham, J. Sivic, and A. Zisserman, "Hello! My name is... Buffy" -- Automatic Naming of Characters in TV Video, Proceedings of the British Machine Vision Conference 2006, 2006.
DOI : 10.5244/C.20.92

A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian et al., Every Picture Tells a Story: Generating Sentences from Images, ECCV, 2010.
DOI : 10.1007/978-3-642-15561-1_2

M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459266

URL : https://hal.archives-ouvertes.fr/inria-00439276

Y. Guo and D. Schuurmans, Convex relaxations of latent variable training, NIPS, 2007.

A. Gupta and L. Davis, Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers, ECCV, 2008.
DOI : 10.1007/978-3-540-88682-2_3

A. Gupta, P. Srinivasan, J. Shi, and L. Davis, Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206492

G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, 2007.

A. Joulin, F. Bach, and J. Ponce, Multi-class cosegmentation, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6247719

URL : https://hal.archives-ouvertes.fr/hal-00717448

A. Joulin, F. R. Bach, and J. Ponce, Discriminative clustering for image co-segmentation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5539868

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

J. Luo, B. Caputo, and V. Ferrari, Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation, NIPS, 2009.

M. Marszalek, I. Laptev, and C. Schmid, Actions in context, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206557

URL : https://hal.archives-ouvertes.fr/inria-00548645

V. Ordonez, G. Kulkarni, and T. Berg, Im2Text: Describing images using 1 million captioned photographs, NIPS, 2011.

J. Sivic, M. Everingham, and A. Zisserman, "Who are you?" -- Learning person specific classifiers from video, CVPR, 2009.
DOI : 10.1109/cvpr.2009.5206513

URL : https://hal.archives-ouvertes.fr/hal-01110678

M. Tapaswi, M. Bäuml, and R. Stiefelhagen, "Knock! Knock! Who is it?" Probabilistic person identification in TV-series, CVPR, 2012.

S. Vijayanarasimhan and K. Grauman, Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587632

Y. Wang and G. Mori, A discriminative latent model of image region and object tag correspondence, NIPS, 2010.

X. Zhu and D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, CVPR, 2012.