R. Vedantam, C. L. Zitnick, and D. Parikh, CIDEr: Consensus-based image description evaluation, pp.4566-4575, 2015.

O. Vinyals et al., Show and tell: A neural image caption generator, pp.3156-3164, 2015.

N. Sawant, J. Li, and J. Z. Wang, Automatic image semantic interpretation using social action and tagging data, Multimedia Tools & Applications, vol.51, pp.213-246, 2011.

H. Ma, J. Zhu, M. R. Lyu, and I. King, Bridging the Semantic Gap Between Image Contents and Tags, IEEE Transactions on Multimedia, vol.12, pp.462-473, 2010.

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, vol.8689, pp.818-833, 2014.

L. Hollink, S. Little, and J. Hunter, Evaluating the application of semantic inferencing rules to image annotation, International Conference on Knowledge Capture ACM, pp.91-98, 2005.

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.39, pp.664-676, 2017.

K. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, International Conference on Machine Learning, pp.2048-2057, 2015.

O. Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, vol.115, pp.211-252, 2015.

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint arXiv:1409.1556, 2014.

X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, Guiding the Long-Short Term Memory Model for Image Caption Generation, IEEE International Conference on Computer Vision, pp.2407-2415, 2016.

K. Papineni et al., BLEU: A Method for Automatic Evaluation of Machine Translation, Proc. ACL, 2002.