Neural machine translation by jointly learning to align and translate, ICLR, 2015. ,
Scheduled sampling for sequence prediction with recurrent neural networks, 2015. ,
Weakly Supervised Deep Detection Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.311
URL : http://arxiv.org/abs/1511.02853
Microsoft coco captions: Data collection and evaluation server, 2015. ,
Attention-based models for speech recognition, NIPS, 2015. ,
Empirical evaluation of gated recurrent neural networks on sequence modeling, NIPS Deep Learning Workshop, 2014. ,
Multi-fold MIL Training for Weakly Supervised Object Localization, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014. ,
DOI : 10.1109/CVPR.2014.309
URL : https://hal.archives-ouvertes.fr/hal-00975746
Imagenet: A large-scale hierarchical image database, CVPR, 2009. ,
Meteor Universal: Language Specific Translation Evaluation for Any Target Language, Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014. ,
DOI : 10.3115/v1/W14-3348
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.675.6117
Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015. ,
DOI : 10.1109/cvpr.2015.7298878
URL : http://arxiv.org/abs/1411.4389
From captions to visual concepts and back, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7298754
URL : http://arxiv.org/abs/1411.4952
DRAW: A recurrent neural network for image generation, ICML, 2015. ,
Spatial pyramid pooling in deep convolutional networks for visual recognition, ECCV, 2014. ,
Long Short-Term Memory, Neural Computation, vol.4, issue.8, pp.1735-1780, 1997. ,
DOI : 10.1016/0893-6080(88)90007-X
Aligning where to see and what to tell: image caption with region-based attention and scene factorization DenseCap: Fully convolutional localization networks for dense captioning, CVPR, 2016. ,
Deep visual-semantic alignments for generating image descriptions, CVPR, 2015. ,
DOI : 10.1109/cvpr.2015.7298932
URL : http://arxiv.org/abs/1412.2306
Adam: A method for stochastic optimization, ICLR, 2015. ,
Multimodal neural language models, ICML, 2014. ,
Zoneout: Regularizing RNNs by randomly preserving hidden activations ,
Microsoft COCO: Common Objects in Context, ECCV, 2014. ,
DOI : 10.1007/978-3-319-10602-1_48
URL : http://arxiv.org/abs/1405.0312
SSD: Single Shot MultiBox Detector, ECCV, 2016. ,
DOI : 10.1007/978-3-642-33712-3_25
URL : http://arxiv.org/abs/1512.02325
Deep captioning with multimodal recurrent neural networks (m-RNN) ICLR, 2015. ,
Linguistic regularities in continuous space word representations, NAACL-HLT, 2013. ,
BLEU, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics , ACL '02, 2002. ,
DOI : 10.3115/1073083.1073135
Sequence level training with recurrent neural networks, 2016. ,
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015. ,
DOI : 10.1109/TPAMI.2016.2577031
URL : http://arxiv.org/abs/1506.01497
Grounding of Textual Phrases in Images by Reconstruction, ECCV, 2016. ,
DOI : 10.1007/978-3-319-46448-0_49
Objectcentric spatial pooling for image classification, ECCV, 2012. ,
DOI : 10.1007/978-3-642-33709-3_1
Very deep convolutional networks for large-scale image recognition, 2015. ,
Sequence to sequence learning with neural networks, 2014. ,
Selective Search for Object Recognition, International Journal of Computer Vision, vol.57, issue.1, pp.154-171, 2013. ,
DOI : 10.1007/s11263-013-0620-5
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.361.3382
CIDEr: Consensus-based image description evaluation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7299087
URL : http://arxiv.org/abs/1411.5726
Show and tell: A neural image caption generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7298935
URL : http://arxiv.org/abs/1411.4555
Show, attend and tell: Neural image caption generation with visual attention, ICML, 2015. ,
Encode, review, and decode: Reviewer module for caption generation, NIPS, 2016. ,
Describing Videos by Exploiting Temporal Structure, 2015 IEEE International Conference on Computer Vision (ICCV), 2015. ,
DOI : 10.1109/ICCV.2015.512
URL : http://arxiv.org/abs/1502.08029
Every moment counts: Dense detailed labeling of actions in complex videos ,
Image Captioning with Semantic Attention, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.503
URL : http://arxiv.org/abs/1603.03925
Edge Boxes: Locating Object Proposals from Edges, ECCV, 2014. ,
DOI : 10.1007/978-3-319-10602-1_26
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.453.5208