A. Agrawal, D. Batra, and D. Parikh, Analyzing the Behavior of Visual Question Answering Models, EMNLP, p.36, 2016.

A. Agrawal, A. Kembhavi, D. Batra, and D. Parikh, C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset, p.37, 2017.

H. Agrawal, K. Desai, X. Chen, R. Jain, D. Batra et al., Novel Object Captioning at Scale, p.37, 2018.

J. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic et al., Unsupervised learning from Narrated Instruction Videos, CVPR, p.54, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01171193

B. Alexe, T. Deselaers, and V. Ferrari, Measuring the Objectness of Image Windows, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.21, 2012.

P. Anderson, B. Fernando, M. Johnson, and S. Gould, SPICE: Semantic Propositional Image Caption Evaluation, ECCV, vol.30, p.36, 2016.

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR, vol.33, p.52, 2018.

J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Deep Compositional Question Answering with Neural Module Networks, CVPR, vol.35, p.67, 2016.

J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Learning to Compose Neural Networks for Question Answering, HLT-NAACL, p.35, 2016.

G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, Deep Canonical Correlation Analysis, ICML, p.31, 2013.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, ICCV, vol.26, p.27, 2015.

Y. Atzmon, J. Berant, V. Kezami, A. Globerson, and G. Chechik, Learning to generalize to new compositions in image understanding, p.95, 2016.

Y. Aytar and A. Zisserman, Tabula Rasa: Model Transfer for Object Category Detection, ICCV, p.96, 2011.

F. R. Bach and Z. Harchaoui, DIFFRAC: a discriminative and flexible framework for clustering, NIPS, vol.16, p.73, 2007.

D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, p.31, 2015.

S. Banerjee and A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, p.36, 2005.

T. Bansal, A. Neelakantan, and A. Mccallum, Relnet: End-to-end modeling of entities & relations, p.94, 2017.

C. Barnes, F. Zhang, L. Lou, X. Wu, and S. Hu, Patchtable: efficient patch queries for large datasets and applications, ACM Trans. Graph, p.62, 2015.

P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu, Interaction networks for learning about objects, relations and physics, NIPS, vol.35, p.94, 2016.

H. Ben-younes, R. Cadène, M. Cord, and N. Thome, MUTAN: Multimodal Tucker Fusion for Visual Question Answering, ICCV, p.31, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02073637

H. Ben-younes, R. Cadene, N. Thome, and M. Cord, BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection, AAAI, p.31, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02073644

P. Bénard, F. Cole, M. Kass, I. Mordatch, J. Hegarty et al., Stylizing animation by example, ACM Trans. Graph, p.62, 2013.

H. Bilen and A. Vedaldi, Weakly Supervised Deep Detection Networks, CVPR, vol.53, p.68, 2016.

H. Bilen, M. Pedersoli, and T. Tuytelaars, Weakly supervised object detection with convex clustering, CVPR, p.53, 2015.

M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, ICCV, p.39, 2005.

P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid et al., Finding actors and actions in movies, ICCV, p.54, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00904991

P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce et al., Weakly supervised action labeling in videos under ordering constraints, ECCV, vol.54, p.69, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01053967

P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev et al., Weakly-supervised alignment of video with text, ICCV, vol.54, p.87, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01154523

A. Bordes, X. Glorot, J. Weston, and Y. Bengio, A Semantic Matching Energy Function for Learning with Multi-Relational Data, Machine Learning, vol.49, p.50, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00835282

J. S. Bowers, On the Biological Plausibility of Grandmother Cells: Implications for Neural Network Theories in Psychology and Neuroscience, Psychological Review, issue.7, 2009.

R. Cadene, H. Ben-younes, N. Thome, and M. Cord, MUREL: Multimodal Relational Reasoning for Visual Question Answering, CVPR, p.35, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02073649

Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields, CVPR, p.42, 2017.

A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka et al., Toward an Architecture for Never-Ending Language Learning, AAAI, p.46, 2010.

M. Caron, P. Bojanowski, A. Joulin, and M. Douze, Deep Clustering for Unsupervised Learning of Visual Features, ECCV, vol.87, 2018.

A. Chang, W. Monroe, M. Savva, C. Potts, and C. D. Manning, Text to 3D Scene Generation with Rich Lexical Grounding, ACL, vol.35, p.67, 2015.

A. X. Chang, M. Savva, and C. D. Manning, Learning Spatial Knowledge for Text to 3D Scene Generation, EMNLP, p.35, 2014.

Y. Chao, Z. Wang, Y. He, J. Wang, and J. Deng, HICO: A benchmark for recognizing human-object interactions in images, ICCV, vol.40, p.106, 2015.

Y. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng, Learning to Detect Human-Object Interactions, WACV, vol.40, p.110, 2018.

X. Chen and A. Gupta, Webly Supervised Learning of Convolutional Networks, ICCV, p.54, 2015.

X. Chen, A. Shrivastava, and A. Gupta, NEIL: Extracting Visual Knowledge from Web Data, ICCV, vol.55, p.67, 2013.

L. Cheng, S. V. Vishwanathan, and X. Zhang, Consistent Image Analogies using Semi-supervised Learning, CVPR, p.62, 2008.

G. Chéron, I. Laptev, and C. Schmid, P-CNN: Pose-based CNN Features for Action Recognition, ICCV, p.131, 2015.

G. Chéron, J. Alayrac, I. Laptev, and C. Schmid, A flexible model for training action localization with varying levels of supervision, NIPS, p.137, 2018.

M. Cho, S. Kwak, C. Schmid, and J. Ponce, Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals, CVPR, p.55, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01110036

M. J. Choi, A. Torralba, and A. S. Willsky, Context models and out-of-context objects, Pattern Recognition Letters, vol.68, p.69, 2012.

G. Christie, A. Laddha, A. Agrawal, S. Antol, Y. Goyal et al., Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes, EMNLP, p.139, 2016.

R. G. Cinbis, J. J. Verbeek, and C. Schmid, Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01123482

C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning, 1995.

B. Dai, Y. Zhang, and D. Lin, Detecting Visual Relationships with Deep Relational Networks, CVPR, vol.42, p.95, 2017.

N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548512

A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra, Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? CVIU, p.37, 2017.

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav et al., Visual Dialog, CVPR, vol.27, p.28, 2017.

V. Delaitre, J. Sivic, and I. Laptev, Learning person-object interactions for action recognition in still images, NIPS, vol.67, p.131, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00648156

V. Delaitre, I. Laptev, and J. Sivic, Recognizing human actions in still images: a study of bag-of-features and part-based representations, BMVC, vol.44, p.45, 2010.
URL : https://hal.archives-ouvertes.fr/hal-01060885

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, CVPR, vol.42, p.54, 2009.

J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy et al., Large-Scale Object Classification Using Label Relation Graphs, ECCV, vol.57, p.124, 2014.

C. Desai and D. Ramanan, Detecting Actions, Poses, and Objects with Relational Phraselets, ECCV, p.41, 2012.

C. Desai, D. Ramanan, and C. Fowlkes, Discriminative models for static human-object interactions, CVPR Workshops (SMiCV), vol.44, p.67, 2010.

C. Desai, D. Ramanan, and C. C. Fowlkes, Discriminative Models for Multi-Class Object Layout, International Journal of Computer Vision, issue.23, 2011.

T. G. Dietterich, R. H. Lathrop, and T. Lozano-pérez, Solving the Multiple Instance Problem with Axis-Parallel Rectangles, Artificial Intelligence, p.53, 1997.

G. Dinu and M. Baroni, How to make words with vectors: Phrase generation in distributional semantics, ACL, p.51, 2014.

S. K. Divvala, A. Farhadi, and C. Guestrin, Learning Everything about Anything: Webly-Supervised Visual Concept Learning, CVPR, vol.50, p.95, 2014.

M. Elhoseiny, S. Cohen, W. Chang, B. Price, A. Elgammal et al., Scalable fact learning in images, AAAI, vol.49, p.68, 2016.

D. Elliott and F. Keller, Image Description using Visual Dependency Representations, EMNLP, p.31, 2013.

M. Engilberge, L. Chevallier, P. Pérez, and M. Cord, Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization, CVPR, p.37, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02171857

F. Faghri, D. J. Fleet, J. Kiros, and S. Fidler, VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, BMVC, p.31, 2017.

A. Faktor and M. Irani, "Clustering by Composition" - Unsupervised Discovery of Image Categories, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.55, 2012.

H. Fang, S. Gupta, F. N. Iandola, R. Srivastava, L. Deng et al., From Captions to Visual Concepts and Back, CVPR, vol.32, p.68, 2015.

A. Farhadi, I. Endres, and D. Hoiem, Attribute-Centric Recognition for Cross-category Generalization, CVPR, p.56, 2010.

A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth, Describing Objects by their Attributes, CVPR, vol.56, p.131, 2009.

A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian et al., Every picture tells a story: Generating sentences from images, ECCV, p.26, 2010.

P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.21, 2010.

Y. Freund and R. E. Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences, p.21, 1997.

A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean et al., DeViSE: A Deep Visual-Semantic Embedding Model, NIPS, vol.59, p.69, 2013.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP, p.31, 2016.

C. Galleguillos, A. Rabinovich, and S. Belongie, Object Categorization using Cooccurrence, Location and Appearance, CVPR, vol.23, p.68, 2008.

C. Gao, Y. Zou, and J. Huang, ICAN: Instance-Centric Attention Network for Human-Object Interaction Detection, BMVC, vol.42, p.116, 2018.

H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang et al., Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question, NIPS, p.27, 2015.

D. Geman, S. Geman, N. Hallonquist, and L. Younes, Visual turing test for computer vision systems, Proceedings of the National Academy of Sciences of the United States of America, p.27, 2015.

D. Gentner, Structure Mapping: A Theoretical Framework for Analogy, Cognitive Science, p.60, 1983.

D. Gentner, Analogical Reasoning, Psychology Of, vol.60, p.61, 2003.

D. Gentner and L. A. Smith, Analogical Reasoning. Encyclopedia of Human Behavior, p.61, 2012.

R. Girshick, Fast R-CNN, ICCV, vol.75, p.82, 2015.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, vol.22, p.77, 2014.

R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, K. He et al., Detectron, p.108, 2018.

G. Gkioxari, R. B. Girshick, and J. Malik, Contextual Action Recognition with R*CNN, ICCV, p.53, 2015.

G. Gkioxari, R. Girshick, P. Dollár, and K. He, Detecting and Recognizing Human-Object Interactions, CVPR, vol.42, p.116, 2018.

Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics, International Journal of Computer Vision, p.31, 2014.

I. J. Goodfellow, J. Pouget-abadie, M. Mirza, B. Xu, D. Warde-farley et al., Generative Adversarial Nets. In NIPS, p.137, 2014.

Y. Goyal, T. Khot, D. Summers-stay, D. Batra, and D. Parikh, Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, CVPR, p.37, 2017.

J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai et al., Scene graph generation with external knowledge and image reconstruction, CVPR, p.47, 2019.

R. A. Güler, N. Neverova, and I. Kokkinos, DensePose: Dense Human Pose Estimation in the Wild, CVPR, p.42, 2018.

G. Guo and A. Lai, A survey on still image based human action recognition, Pattern Recognition, p.39, 2014.

A. Gupta and L. S. Davis, Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers, ECCV, p.67, 2008.

A. Gupta, A. Kembhavi, and L. S. Davis, Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, p.67, 2009.

S. Gupta and J. Malik, Visual Semantic Role Labeling, vol.40, p.110, 2015.

D. R. Hardoon, S. Szedmák, and J. Shawe-taylor, Canonical Correlation Analysis: An Overview with Application to Learning Methods, Neural Computation, p.31, 2004.

A. Harel, D. J. Kravitz, and C. I. Baker, Deconstructing Visual Scenes in Cortex: Gradients of Object and Spatial Layout Information, Cerebral cortex, issue.7, 2013.

H. Harzallah, F. Jurie, and C. Schmid, Combining efficient object localization and image classification, ICCV, p.21, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00439516

K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.22, 2014.

K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, Mask R-CNN, ICCV, vol.42, p.43, 2017.

L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele et al., Generating Visual Explanations, ECCV, p.37, 2016.

L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko et al., Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data, CVPR, vol.37, p.69, 2016.

L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach, Women Also Snowboard: Overcoming Bias in Captioning Models, ECCV, p.36, 2018.

A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. Salesin, Image analogies, SIGGRAPH, p.61, 2001.

R. Hinami and S. Satoh, Discriminative Learning of Open-Vocabulary Object Retrieval and Localization by Negative Phrase Augmentation, EMNLP, p.136, 2018.

M. Hodosh, P. Young, and J. Hockenmaier, Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, Journal of Artificial Intelligence Research, issue.26, 2013.

D. Hoiem, A. A. Efros, and M. Hebert, Putting Objects in Perspective, CVPR, vol.43, p.132, 2006.

K. Holyoak, The Pragmatics of Analogical Transfer. The Psychology of Learning and Motivation, p.60, 1985.

H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, Relation Networks for Object Detection, CVPR, p.24, 2018.

H. Hu, I. Misra, and L. Van-der-maaten, Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding, p.37, 2019.

J. Hu, W. Zheng, J. Lai, S. Gong, and T. Xiang, Recognising Human-Object Interaction via Exemplar Based Modelling, ICCV, p.48, 2013.

R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko et al., Natural Language Object Retrieval, CVPR, vol.29, p.67, 2016.

R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko, Modeling Relationships in Referential Expressions with Compositional Modular Networks, CVPR, vol.34, p.135, 2017.

D. A. Hudson and C. D. Manning, GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, CVPR, p.37, 2019.

Z. Hung, A. Mallya, and S. Lazebnik, Union Visual Translation Embedding for Visual Relationship Detection and Scene Graph Generation, p.50, 2019.

S. J. Hwang, S. N. Ravi, Z. Tao, H. J. Kim, M. D. Collins et al., Tensorize, Factorize and Regularize: Robust Visual Relationship Learning, CVPR, vol.50, p.96, 2018.

S. J. Hwang, K. Grauman, and F. Sha, Analogy-preserving Semantic Embedding for Visual Object Categorization, ICML, p.62, 2013.

N. Ikizler, R. G. Cinbis, S. Pehlivan, and P. Duygulu, Recognizing actions from still images, ICPR, vol.39, p.41, 2008.

L. Itti, C. Koch, and E. Niebur, A Model of Saliency-Based Visual Attention for Rapid Scene Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.32, 1998.

H. Izadinia, F. Sadeghi, S. K. Divvala, Y. Choi, and A. Farhadi, Segment-phrase table for semantic segmentation, visual entailment and paraphrasing, ICCV, p.95, 2015.

A. Jabri, A. Joulin, and L. Van-der-maaten, Revisiting Visual Question Answering Baselines, ECCV, p.36, 2016.

R. Jenatton, N. L. Roux, A. Bordes, and G. R. Obozinski, A latent factor model for highly multi-relational data, NIPS, vol.49, p.94, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00776335

J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma et al., Image Retrieval using Scene Graphs, CVPR, vol.68, p.95, 2015.

J. Johnson, A. Karpathy, and L. Fei-fei, DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CVPR, vol.67, p.142, 2016.

J. Johnson, B. Hariharan, L. Van-der-maaten, L. Fei-fei, C. L. Zitnick et al., Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, CVPR, p.35, 2017.

J. Johnson, B. Hariharan, L. Van-der-maaten, J. Hoffman, L. Fei-fei et al., Inferring and Executing Programs for Visual Reasoning, ICCV, vol.35, p.36, 2017.

J. Johnson, A. Gupta, and L. Fei-fei, Image Generation from Scene Graphs, CVPR, vol.30, p.137, 2018.

A. Joulin, F. Bach, and J. Ponce, Discriminative clustering for image cosegmentation, CVPR, p.54, 2010.

A. Joulin, K. Tang, and L. Fei-fei, Efficient image and video co-localization with frank-wolfe algorithm, ECCV, p.69, 2014.

D. Kaiser, T. Stein, and M. V. Peelen, Object grouping based on real-world regularities facilitates perception by reducing competitive interactions in visual cortex, Proceedings of the National Academy of Sciences of the United States of America, 2014.

V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, Joint Learning of Object and Action Detectors, ICCV, p.137, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01575804

V. Kantorov, M. Oquab, M. Cho, and I. Laptev, ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization, ECCV, p.53, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01421772

A. Karpathy and L. Fei-fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR, vol.32, p.95, 2015.

A. Karpathy, A. Joulin, and L. Fei-fei, Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS, vol.95, p.99, 2014.

K. Kato, Y. Li, and A. Gupta, Compositional Learning for Human Object Interaction, ECCV, vol.58, p.137, 2018.

S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, ReferItGame: Referring to Objects in Photographs of Natural Scenes, EMNLP, vol.29, p.67, 2014.

J. Kim, S. Lee, D. Kwak, M. Heo, J. Kim et al., Multimodal Residual Learning for Visual QA, NIPS, p.31, 2016.

J. G. Kim and I. Biederman, Where Do Objects Become Scenes? Cerebral Cortex, 2010.

D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, ICLR, p.109, 2015.

T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, ICLR, p.94, 2016.

R. Kiros, R. Salakhutdinov, and R. S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, p.31, 2014.

B. E. Klein, G. Lev, G. Sadeh, and L. Wolf, Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation, CVPR, p.31, 2015.

A. Kolesnikov, C. H. Lampert, and V. Ferrari, Detecting Visual Relationships Using Box Attention, p.25, 2018.

C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler, What Are You Talking About? Text-to-Image Coreference, CVPR, vol.35, p.139, 2014.

S. Kottur, J. M. Moura, D. Parikh, D. Batra, and M. Rohrbach, Visual Coreference Resolution in Visual Dialog using Neural Module Networks, ECCV, p.139, 2018.

J. Krause, H. Jin, J. Yang, and L. Fei-fei, Fine-grained recognition without part annotations, CVPR, p.131, 2015.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, International Journal of Computer Vision, vol.83, p.141, 2016.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, p.22, 2012.

G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li et al., Understanding and generating simple image descriptions, CVPR, vol.26, p.31, 2011.

A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin et al., The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale, vol.41, p.137, 2018.

C. H. Lampert, H. Nickisch, and S. Harmeling, Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer, CVPR, vol.56, p.57, 2009.

T. Lan, M. Raptis, L. Sigal, and G. Mori, From Subcategories to Visual Composites: A Multi-Level Framework for Object Detection, ICCV, 2013.

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, CVPR, p.39, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00548659

A. Lazaridou, E. Bruni, and M. Baroni, Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world, ACL, vol.59, p.69, 2014.

K. Lee, X. Chen, G. Hua, H. Hu, and X. He, Stacked Cross Attention for Image-Text Matching, ECCV, vol.33, p.134, 2018.

W. G. Lehnert, A conceptual theory of question answering. International Joint Conferences on Artificial Intelligence Organization, p.27, 1977.

C. Li, D. Parikh, and T. Chen, Extracting Adaptive Contextual Cues from Unlabeled Regions, ICCV, 2011.

C. Li, D. Parikh, and T. Chen, Automatic Discovery of Groups of Objects for Scene Understanding, CVPR, vol.23, p.68, 2012.

S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, Composing Simple Image Descriptions using Web-scale N-grams, CoNLL, p.31, 2011.

Y. Li, W. Ouyang, and X. Wang, ViP-CNN: A Visual Phrase Reasoning Convolutional Neural Network for Visual Relationship Detection, CVPR, vol.42, p.95, 2017.

Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, Scene Graph Generation from Objects, Phrases and Region Captions, ICCV, p.45, 2017.

X. Liang, L. Lee, and E. P. Xing, Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection, CVPR, p.45, 2017.

J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang, Visual Attribute Transfer through Deep Image Analogy, ACM Transactions on Graphics, p.62, 2017.

W. Liao, S. Lin, B. Rosenhahn, and M. Y. Yang, Natural Language Guided Visual Relationship Detection, p.46, 2017.

D. Lin, S. Fidler, C. Kong, and R. Urtasun, Visual Semantic Search: Retrieving Videos via Complex Textual Queries, CVPR, p.35, 2014.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, ECCV, vol.75, p.107, 2014.

T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan et al., Feature Pyramid Networks for Object Detection, CVPR, vol.22, p.108, 2017.

Z. Lin, M. Feng, C. N. Santos, M. Yu, B. Xiang et al., A Structured Self-Attentive Sentence Embedding, ICLR, p.135, 2017.

F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, iVQA: Inverse Visual Question Answering, CVPR, p.37, 2018.

S. Liu, J. Feng, J. Han, S. Yan, Y. Sun et al., Context
URL : https://hal.archives-ouvertes.fr/hal-00962015

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed et al., Single Shot MultiBox Detector, ECCV, p.22, 2016.

D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, 2004.

C. Lu, R. Krishna, M. Bernstein, and L. Fei-fei, Visual Relationship Detection with Language Priors, ECCV, vol.142, p.143, 2016.

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical Question-Image Co-Attention for Visual Question Answering, NIPS, p.32, 2016.

J. Lu, J. Yang, D. Batra, and D. Parikh, Neural Baby Talk, CVPR, vol.33, p.52, 2018.

P. Lu, L. Ji, W. Zhang, N. Duan, M. Zhou et al., Learning Visual Relation Facts with Semantic Attention for Visual Question Answering, KDD, p.35, 2018.

S. Maji, L. Bourdev, and J. Malik, Action Recognition from a Distributed Representation of Pose and Appearance, CVPR, vol.43, p.131, 2011.

T. Malisiewicz and A. A. Efros, Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships, NIPS, p.42, 2009.

A. Mallya and S. Lazebnik, Learning models for actions and person-object interactions with transfer to question answering, ECCV, p.52, 2016.

C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard et al., The Stanford CoreNLP Natural Language Processing Toolkit, ACL, vol.51, p.134, 2014.

J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille et al., Generation and comprehension of unambiguous object descriptions, CVPR, vol.29, p.67, 2016.

K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, CVPR, p.28, 2019.

D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, p.25, 1982.

A. Miech, J. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic, Learning from Video and Text via Large-Scale Discriminative Clustering, ICCV, vol.54, p.75, 2017.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed Representations of Words and Phrases and Their Compositionality, NIPS, vol.46, p.109, 2013.

G. A. Miller, WORDNET: A Lexical Database for English, Communications of the ACM, p.57, 1992.

I. Misra, A. Gupta, and M. Hebert, From Red Wine to Red Tomato: Composition with Context, CVPR, vol.50, p.96, 2017.

J. Mitchell and M. Lapata, Vector-based models of semantic composition, ACL, p.51, 2008.

M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos et al., Generating Image Descriptions From Computer Vision Detections, EACL, p.31, 2012.

D. Movshovitz-Attias and W. W. Cohen, KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts, ACL, vol.46, p.67, 2015.

V. K. Nagaraja, V. I. Morariu, and L. S. Davis, Modeling Context Between Objects for Referring Expression Understanding, ECCV, vol.34, p.53, 2016.

A. Newell and J. Deng, Pixels to Graphs by Associative Embedding, NIPS, p.45, 2017.

M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, A Review of Relational Machine Learning for Knowledge Graphs, Proceedings of the IEEE, p.49, 2015.

J. C. Niebles, H. Wang, and L. Fei-fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words, BMVC, vol.39, p.55, 2006.

W. Norcliffe-brown, E. Vafeias, and S. Parisot, Learning Conditioned Graph Structures for Interpretable Visual Question Answering, NIPS, p.35, 2018.

A. Oliva, Visual Scene Perception. Encyclopaedia of Perception, p.25, 2009.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free? -Weakly-supervised learning with convolutional neural networks, CVPR, vol.53, p.68, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01015140

V. Ordonez, G. Kulkarni, and T. L. Berg, Im2Text: Describing Images Using 1 Million Captioned Photographs, NIPS, p.26, 2011.

A. Osokin, J. Alayrac, I. Lukasewitz, P. K. Dokania, and S. Lacoste-julien, Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs, ICML, vol.54, p.75, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01323727

M. Pandey and S. Lazebnik, Scene Recognition and Weakly Supervised Object Localization with Deformable Part-Based Models, ICCV, p.131, 2011.

D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele et al., Multimodal explanations: Justifying decisions and pointing to the evidence, CVPR, p.37, 2018.

J. Peyre, I. Laptev, C. Schmid, and J. Sivic, Weakly-Supervised Learning of Visual Relations, ICCV, vol.108, p.113, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01576035

J. Peyre, I. Laptev, C. Schmid, and J. Sivic, Detecting Unseen Visual Relations Using Analogies, ICCV, p.17, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01975760

F. Plesse, A. Ginsca, B. Delezoide, and F. J. Prêteux, Visual Relationship Detection Based on Guided Proposals and Semantic Knowledge Distillation, ICME, 2018.

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier et al., Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, ICCV, vol.29, p.95, 2015.

B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik, Phrase Localization and Visual Relationship Detection with Comprehensive Linguistic Cues, ICCV, vol.29, p.95, 2017.

B. A. Plummer, K. J. Shih, Y. Li, K. Xu, S. Lazebnik et al., Open-vocabulary Phrase Detection, p.136, 2019.

A. Prest, C. Schmid, and V. Ferrari, Weakly supervised learning of interactions between humans and objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.43, p.67, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00516477

S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu, Learning Human-Object Interactions by Graph Parsing Neural Networks, ECCV, vol.42, p.110, 2018.

K. Raja, I. Laptev, P. Pérez, and L. Oisel, Joint pose estimation and action recognition in image graphs, ICIP, p.41, 2011.
URL : https://hal.archives-ouvertes.fr/hal-01063329

V. Ramanathan, C. Li, J. Deng, W. Han, Z. Li et al., Learning Semantic Relationships for Better Action Retrieval in Images, CVPR, vol.57, p.124, 2015.

J. Redmon and A. Farhadi, YOLO9000: Better, Faster, Stronger, CVPR, 2017.

J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, CVPR, p.22, 2016.

S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee, Deep Visual Analogy-Making, NIPS, vol.17, p.96, 2015.

S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele et al., Learning What and Where to Draw, NIPS, p.137, 2016.

S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele et al., Generative Adversarial Text to Image Synthesis, ICML, p.137, 2016.

M. Ren, R. Kiros, and R. S. Zemel, Exploring Models and Data for Image Question Answering, NIPS, p.27, 2015.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, vol.65, p.108, 2015.

A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, Grounding of textual phrases in images by reconstruction, ECCV, vol.33, p.67, 2016.

A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, Object Hallucination in Image Captioning, EMNLP, vol.36, p.37, 2018.

M. R. Ronchi and P. Perona, Describing Common Human Visual Actions in Images, BMVC, vol.40, p.107, 2015.

G. Rosenthal, A. Shamir, and L. Sigal, Learn How to Choose: Independent Detectors Versus Composite Visual Phrases, p.49, 2017.

B. H. Ross, Distinguishing types of superficial similarities: Different effects on the access and use of earlier problems, Journal of Experimental Psychology: Learning, Memory, and Cognition, p.60, 1989.

S. Sabour, N. Frosst, and G. E. Hinton, Dynamic Routing Between Capsules, NIPS, p.131, 2017.

F. Sadeghi, S. K. Divvala, and A. Farhadi, VisKE: Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases, CVPR, vol.49, p.95, 2015.

F. Sadeghi, C. L. Zitnick, and A. Farhadi, Visalogy: Answering Visual Analogy Questions, NIPS, vol.62, p.96, 2015.

M. A. Sadeghi and A. Farhadi, Recognition using visual phrases, CVPR, vol.141, p.142, 2011.

A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu et al., A Simple Neural Network Module for Relational Reasoning, NIPS, vol.35, p.94, 2017.

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, The Graph Neural Network Model, IEEE Transactions on Neural Networks, p.35, 2009.

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, ICPR, p.39, 2004.

M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, Signal Processing, p.32, 1997.

S. Schuster, R. Krishna, A. X. Chang, L. Fei-fei, and C. D. Manning, Generating semantically precise scene graphs from textual descriptions for improved image retrieval, VL@EMNLP, p.52, 2015.

X. Shang, T. Ren, J. Guo, H. Zhang, and T. Chua, Video Visual Relation Detection, ACM International Conference on Multimedia, p.137, 2017.

G. Sharma, F. Jurie, and C. Schmid, Expanded Parts Model for Human Attribute and Action Recognition in Still Images, CVPR, p.131, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00816144

L. Shen, S. Yeung, and J. Hoffman, Scaling Human-Object Interaction Recognition through Zero-Shot Learning, WACV, vol.42, p.95, 2018.

M. Simon and E. Rodner, Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks, ICCV, p.131, 2015.

S. Singh, A. Gupta, and A. A. Efros, Unsupervised Discovery of Mid-Level Discriminative Patches, ECCV, 2012.

J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, Discovering object categories in image collections, ICCV, p.55, 2005.

R. Socher, D. Chen, C. D. Manning, and A. Ng, Reasoning with neural tensor networks for knowledge base completion, NIPS, vol.49, p.67, 2013.

R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, Zero-shot learning through cross-modal transfer, NIPS, vol.59, p.69, 2013.

H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui et al., On learning to localize objects with minimal supervision, ICML, p.52, 2014.

R. Speer and C. Havasi, ConceptNet 5: A Large Semantic Network for Relational Knowledge, p.47, 2013.

D. Stansbury, T. Naselaris, and J. L. Gallant, Natural Scene Statistics Account for the Representation of Scene Categories in Human Visual Cortex, Neuron, issue.7, 2013.

T. Stein, D. Kaiser, and M. Peelen, Interobject grouping facilitates visual awareness, Journal of Vision, issue.6, 2015.

F. M. Suchanek, G. Kasneci, and G. Weikum, Yago: a core of semantic knowledge, WWW, p.46, 2007.
URL : https://hal.archives-ouvertes.fr/hal-01472497

I. Sutskever, R. R. Salakhutdinov, and J. B. Tenenbaum, Modelling Relational Data using Bayesian Clustered Tensor Factorization, NIPS, p.49, 2009.

D. Teney, L. Liu, and A. van den Hengel, Graph-Structured Representations for Visual Question Answering, CVPR, vol.30, p.35, 2017.

C. Thurau and V. Hlavác, Pose primitive based human action recognition in videos or still images, CVPR, p.41, 2008.

J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.21, p.22, 2013.

L. van der Maaten and G. Hinton, Visualizing Data using t-SNE, Journal of Machine Learning Research, vol.122, p.128, 2008.

R. Vedantam, C. L. Zitnick, and D. Parikh, CIDEr: Consensus-based image description evaluation, CVPR, p.36, 2015.

R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik, Context-Aware Captions from Context-Agnostic Supervision, CVPR, p.34, 2017.

I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, Order-embeddings of images and language, ICLR, p.58, 2016.

S. Venugopalan, L. A. Hendricks, M. Rohrbach, R. Mooney, T. Darrell et al., Captioning Images with Diverse Objects, CVPR, vol.37, p.69, 2017.

L. Wang, Y. Li, and S. Lazebnik, Learning Deep Structure-Preserving Image-Text Embeddings, CVPR, vol.98, p.99, 2016.

M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng, Structured Matching for Phrase Localization, ECCV, p.34, 2016.

P. Wang, Q. Wu, C. Shen, A. R. Dick, and A. van den Hengel, Explicit knowledge-based reasoning for visual question answering, IJCAI, p.28, 2017.

P. Wang, Q. Wu, C. Shen, A. R. Dick, and A. van den Hengel, FVQA: Fact-Based Visual Question Answering, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.28, 2018.

X. Wang, D. F. Fouhey, and A. Gupta, Designing deep networks for surface normal estimation, CVPR, p.43, 2015.

Y. Wang, H. Jiang, M. S. Drew, Z. Li, and G. Mori, Unsupervised Discovery of Action Classes, CVPR, p.55, 2006.

S. Woo, D. Kim, D. Cho, and I. S. Kweon, LinkNet: Relational Embedding for Scene Graph, NIPS, p.45, 2018.

Q. Wu, P. Wang, C. Shen, A. R. Dick, and A. van den Hengel, Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources, CVPR, p.28, 2016.

Q. Wu, D. Teney, P. Wang, C. Shen, A. R. Dick et al., Visual Question Answering: A Survey of Methods and Datasets, Computer Vision and Image Understanding, p.33, 2017.

Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein et al., Latent Embeddings for Zero-shot Classification, CVPR, p.69, 2016.

F. Xiao, L. Sigal, and Y. J. Lee, Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures, CVPR, p.34, 2017.

N. Xie, F. Lai, D. Doran, and A. Kadav, Visual Entailment Task for Visually-Grounded Language Learning, p.37, 2018.

D. Xu, Y. Zhu, C. B. Choy, and L. Fei-fei, Scene Graph Generation by Iterative Message Passing, CVPR, vol.30, p.46, 2017.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML, p.32, 2015.

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, Maximum Margin Clustering, NIPS, p.54, 2004.

F. Yan and K. Mikolajczyk, Deep correlation for matching images and text, CVPR, p.31, 2015.

J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, Graph R-CNN for scene graph generation, ECCV, vol.24, p.45, 2018.

S. Yang, L. Bo, J. Wang, and L. G. Shapiro, Unsupervised Template Learning for Fine-Grained Object Recognition, NIPS, p.131, 2012.

X. Yang, H. Zhang, and J. Cai, Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features, ECCV, p.42, 2018.

Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, Stacked Attention Networks for Image Question Answering, CVPR, p.32, 2016.

B. Yao and L. Fei-fei, Grouplet: A structured image representation for recognizing human and object interactions, CVPR, vol.41, p.67, 2010.

B. Yao and L. Fei-fei, Modeling mutual context of object and human pose in human-object interaction activities, CVPR, vol.39, p.131, 2010.

B. Yao and L. Fei-fei, Action Recognition with Exemplar Based 2.5D Graph Matching, ECCV, p.41, 2012.

B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas et al., Human Action Recognition by Learning Bases of Action Attributes and Parts, ICCV, p.67, 2011.

T. Yao, Y. Pan, Y. Li, and T. Mei, Exploring Visual Relationship for Image Captioning, ECCV, vol.30, p.35, 2018.

M. Yatskar, V. Ordonez, and A. Farhadi, Stating the Obvious: Extracting Visual Common Sense Knowledge, NAACL, p.67, 2016.

G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang et al., Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition, ECCV, p.46, 2018.

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, Modeling Context in Referring Expressions, ECCV, vol.34, p.53, 2016.

L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu et al., MAttNet: Modular Attention Network for Referring Expression Comprehension, CVPR, p.35, 2018.

R. Yu, A. Li, V. I. Morariu, and L. S. Davis, Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation, ICCV, vol.46, p.95, 2017.

S. Zagoruyko, A. Lerer, T. Lin, P. O. Pinheiro, S. Gross et al., A MultiPath Network for Object Detection, BMVC, p.65, 2016.

R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, Neural Motifs: Scene Graph Parsing with Global Context, CVPR, p.42, 2018.

R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, From Recognition to Cognition: Visual Commonsense Reasoning, CVPR, p.37, 2019.

H. Zhang, Z. Kyaw, S. Chang, and T. Chua, Visual Translation Embedding Network for Visual Relation Detection, CVPR, vol.43, p.96, 2017.

H. Zhang, Z. Kyaw, J. Yu, and S. Chang, PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN, ICCV, p.53, 2017.

J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. M. Elgammal, Relationship Proposal Networks, CVPR, vol.24, p.132, 2017.

J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal et al., Large-Scale Visual Relationship Understanding, AAAI, vol.40, p.96, 2019.

J. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks, ICCV, vol.43, p.138, 2017.

Y. Zhu, A. Fathi, and L. Fei-fei, Reasoning about object affordances in a knowledge base representation, ECCV, vol.47, p.67, 2014.

Y. Zhu, O. Groth, M. S. Bernstein, and L. Fei-fei, Visual7W: Grounded Question Answering in Images, CVPR, vol.27, p.32, 2016.

B. Zhuang, L. Liu, C. Shen, and I. Reid, Towards Context-aware Interaction Recognition for Visual Relationship Detection, ICCV, vol.43, p.95, 2017.

B. Zhuang, Q. Wu, C. Shen, I. Reid, and A. van den Hengel, HCVRD: a benchmark for large-scale Human-Centered Visual Relationship Detection, AAAI, vol.13, p.40, 2018.

C. L. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, ECCV, 2014.

C. L. Zitnick, D. Parikh, and L. Vanderwende, Learning the visual interpretation of sentences, ICCV, p.35, 2013.