Neural machine translation by jointly learning to align and translate, ICLR, 2015. ,
Scheduled sampling for sequence prediction with recurrent neural networks, 2015. ,
Weakly Supervised Deep Detection Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.311
URL : http://arxiv.org/abs/1511.02853
Attention-based models for speech recognition, NIPS, 2015. ,
Empirical evaluation of gated recurrent neural networks on sequence modeling, NIPS Deep Learning Workshop, 2014. ,
Multi-fold MIL Training for Weakly Supervised Object Localization, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014. ,
DOI : 10.1109/CVPR.2014.309
URL : https://hal.archives-ouvertes.fr/hal-00975746
Imagenet: A large-scale hierarchical image database, CVPR, 2009. ,
Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015. ,
DOI : 10.1109/tpami.2016.2599174
URL : http://arxiv.org/abs/1411.4389
From captions to visual concepts and back, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7298754
URL : http://arxiv.org/abs/1411.4952
DRAW: A recurrent neural network for image generation, ICML, 2015. ,
Spatial pyramid pooling in deep convolutional networks for visual recognition, ECCV, 2014. ,
DOI : 10.1007/978-3-319-10578-9_23
URL : http://arxiv.org/abs/1406.4729
Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997. ,
DOI : 10.1162/neco.1997.9.8.1735
Aligning where to see and what to tell: image caption with region-based attention and scene factorization, 2015. ,
DenseCap: Fully Convolutional Localization Networks for Dense Captioning, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.494
URL : http://arxiv.org/pdf/1511.07571
Deep visual-semantic alignments for generating image descriptions, CVPR, 2015. ,
DOI : 10.1109/tpami.2016.2598339
URL : http://arxiv.org/abs/1412.2306
Adam: A method for stochastic optimization, ICLR, 2015. ,
Multimodal neural language models, ICML, 2014. ,
Microsoft COCO: Common objects in context, ECCV, 2014. ,
Attention correctness in neural image captioning, AAAI, 2017. ,
SSD: Single shot multibox detector, ECCV, 2016. ,
Deep captioning with multimodal recurrent neural networks (m-RNN), ICLR, 2015. ,
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, ICCV, 2015. ,
DOI : 10.1007/s11263-016-0965-7
URL : http://arxiv.org/abs/1505.04870
Sequence level training with recurrent neural networks, 2016. ,
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015. ,
DOI : 10.1109/TPAMI.2016.2577031
URL : http://arxiv.org/abs/1506.01497
Grounding of Textual Phrases in Images by Reconstruction, ECCV, 2016. ,
DOI : 10.1007/978-3-319-46448-0_49
Object-centric spatial pooling for image classification, ECCV, 2012. ,
DOI : 10.1007/978-3-642-33709-3_1
Very deep convolutional networks for large-scale image recognition, 2015. ,
Sequence to sequence learning with neural networks, 2014. ,
Selective Search for Object Recognition, International Journal of Computer Vision, vol.104, issue.2, pp.154-171, 2013. ,
DOI : 10.1007/s11263-013-0620-5
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.361.3382
Show and tell: A neural image caption generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7298935
URL : http://arxiv.org/abs/1411.4555
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.29
URL : http://arxiv.org/abs/1506.01144
Show, attend and tell: Neural image caption generation with visual attention, ICML, 2015. ,
Encode, review, and decode: Reviewer module for caption generation, NIPS, 2016. ,
Describing Videos by Exploiting Temporal Structure, 2015 IEEE International Conference on Computer Vision (ICCV), 2015. ,
DOI : 10.1109/ICCV.2015.512
URL : http://arxiv.org/abs/1502.08029
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos, International Journal of Computer Vision, vol.126, issue.2-4, pp.375-389, 2018. ,
DOI : 10.1007/s11263-017-1013-y
Image Captioning with Semantic Attention, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.503
URL : http://arxiv.org/pdf/1603.03925
Edge Boxes: Locating Object Proposals from Edges, ECCV, 2014. ,
DOI : 10.1007/978-3-319-10602-1_26
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.453.5208
BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, 2002. ,
DOI : 10.3115/1073083.1073135
CIDEr: Consensus-based image description evaluation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7299087
URL : http://arxiv.org/abs/1411.5726

Generated caption:
-A surfer is riding a big wave in the ocean

Generated caption:
-A man is doing a trick on a skateboard
Ground-truth captions:
-a man flying through the air while riding a skateboard.
-A skateboarder is high in the air separated from his board in the break of a ramp

Generated caption:
-A man standing in a living room playing a video game
Ground-truth captions:
-A man standing in a living room holding a Nintendo Wii game controller.
-Man standing in a family room with a game controller in his hand.
-A man is standing and playing video games.
-A man is standing in a living room with a Wii controller.
-A middle aged man is playing with a WII game

Generated caption:
-A baseball player swinging a bat at a ball
Ground-truth captions:
-A batter holds the bat behind his head for a powerful swing.
-A man holding a bat at a professional baseball game.
-a baseball player in the batters box at a game
-A man is at bat during a professional baseball game.
-There are spectators watching a baseball game going on

Generated caption:
-A vase filled with flowers on a table
Ground-truth captions:
-A counter containing two silver vases with colorful flowers.
-A base with yellow pink and orange daisies in it.
-There are some flowers in decorative silver vases
-Colorful flowers in a metal vase sitting on a mirror ledge.
-Flowers in a paper that is silver and standing.

Generated caption:
-A bathroom with a toilet sink and bathtub
Ground-truth captions:
-A tiled bathroom containing a vanity sink, toilet and bathtub.
-a bath room with a toilet a bath tub and a sink
-There is a bathroom with a toilet and tub.
-A white toilet and bath in a room.
-A white bathtub sitting next to a toilet

Generated caption:
-A man in a suit and tie wearing a hat
Ground-truth captions:
-A man wearing a hat and a gray jacket.
-A man with a suit and tie and a fedora on.
-A man wears a suit with a blue shirt and a multicolored tie.
-A man in a suit and tie wearing a hat.
-a man that is in a suite and a tie

Generated caption:
-A cat is standing next to a bicycle
Ground-truth captions:
-A cat standing next to a bike parked against a wall.
-A ca walking with its tail straight up .
-A kitten is walking next to a parked bike inside.
-A small cat is walking behind a bike.
-A baby tabby cat walking behind a bicycle leaning against a wall

Generated caption:
-A large airplane flying through a blue sky
Ground-truth captions:
-A picture of a plane is flying in the air.
-An airplane with two propellor engines flying in the sky.
-A propeller plane flying through a blue sky.
-A small airplane is flying through a clear blue sky.
-A propeller plane flying through a blue sky.

Generated caption:
-A couple of kids laying on a bed
Ground-truth captions:
-A custom cake featuring a fisherman for a man's 65th birthday.
-A birthday cake made to look like a man on a pier fishing.
-a fathers birthday cake with a pond and sheep on the hill
-The birthday cake is in the shape of a hill with a fisherman sitting on it.
-The birthday cake for a 65 year old.

Generated caption:
-A large body of water with boats in the background
Ground-truth captions:
-Two boats that are sitting in the water.
-Those boats are waiting by the pier in the water.
-A large city is on the water with boats.
-City skyline as seen beyond waterway docking area.
-A large body of water filled with boats next to a tall building.

Generated caption:
-A man holding a piece of cake on a plate
Ground-truth captions:
-A man is sitting down with a piece of chocolate cake in front of him with a fork in his hand.
-a person at a table with a large piece of cake
-This man is holding a fork to eat a piece of chocolate cake.
-The man sitting at a table with a large slice of cake.
-a man sitting at a table with a piece of cake and holding a fork

Generated caption:
-A man standing in a kitchen preparing food
Ground-truth captions:
-A cook in a restaurant kitchen putting chopped vegetables in a bow.
-At a restaurant's kitchen, a gentleman wearing sanitary gloves prepares a salad.
-A man in a blue shirt is putting food into a bowl.
-The man is in the kitchen preparing a meal.
-A man in blue shirt preparing food in a kitchen

Generated caption:
-A herd of sheep grazing on a lush green field
Ground-truth captions:
-A herd of animals traveling down a country road surrounded by a lush green landscape
-Jogger running past a flock of sheep in a rural area
-A man walking beside sheep on a country road.
-Jogger running past a flock of white sheep.
-A runner wearing spandex has come across a large herd traveling down the dirt road.

Generated caption:
-A woman is eating a piece of pizza
Ground-truth captions:
-A woman wearing a hat bites into a pastry.
-A girl eating a donut out of a bag
-a woman eating a powdered sugar pastry out of a bag
-A woman eating a pastry at a coffee shop.
-The woman in the hat is taking a bite of a pastry