Areas of Attention for Image Captioning

Marco Pedersoli 1 Thomas Lucas 1 Cordelia Schmid 1 Jakob Verbeek 1
1 Thoth - Apprentissage de modèles à partir de données massives
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann
Abstract: We propose "Areas of Attention", a novel attention-based model for automatic image caption generation. Our approach models the interplay between the state of the RNN, image region descriptors, and word embedding vectors through three pairwise interactions. It allows caption words to be associated with local visual appearances rather than with descriptors of the entire scene, which enables better generalization to complex scenes not seen during training. Our model is agnostic to the type of attention areas, and we instantiate it using regions based on CNN activation grids, object proposals, and spatial transformer networks. Our results show that all components of our model contribute to obtaining state-of-the-art performance on the MSCOCO dataset. In addition, our results indicate that attention areas are correctly associated with meaningful latent semantic structure in the generated captions.
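As a rough illustration of the "three pairwise interactions" mentioned in the abstract, the sketch below scores every (word, region) pair against an RNN state using three bilinear terms and normalizes the scores jointly. The abstract does not give the exact formulation, so the bilinear form, the joint softmax, the toy dimensions, and all names (`bilinear`, `joint_word_region_probs`, `W_wh`, `W_wr`, `W_rh`) are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def bilinear(x, M, y):
    """Return x^T M y for plain Python lists."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def joint_word_region_probs(h, words, regions, W_wh, W_wr, W_rh):
    """Joint distribution over (word, region) pairs given RNN state h.

    Assumed score: s(w, r, h) = w^T W_wh h + w^T W_wr r + r^T W_rh h
    """
    scores = [[bilinear(w, W_wh, h) + bilinear(w, W_wr, r) + bilinear(r, W_rh, h)
               for r in regions] for w in words]
    m = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - m) for s in row] for row in scores]
    z = sum(sum(row) for row in exps)
    return [[e / z for e in row] for row in exps]

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for learned parameters or descriptors."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_w, d_r, d_h = 4, 3, 5        # toy embedding, region, and state sizes (hypothetical)
n_words, n_regions = 6, 2      # toy vocabulary size and number of attention areas

words = rand_matrix(n_words, d_w, rng)      # word embedding vectors
regions = rand_matrix(n_regions, d_r, rng)  # image region descriptors
h = [rng.uniform(-1, 1) for _ in range(d_h)]  # RNN state
W_wh = rand_matrix(d_w, d_h, rng)  # word-state interaction
W_wr = rand_matrix(d_w, d_r, rng)  # word-region interaction
W_rh = rand_matrix(d_r, d_h, rng)  # region-state interaction

joint = joint_word_region_probs(h, words, regions, W_wh, W_wr, W_rh)
# Marginalizing over regions gives a word distribution;
# marginalizing over words gives attention weights over regions.
p_word = [sum(row) for row in joint]
p_region = [sum(joint[i][j] for i in range(n_words)) for j in range(n_regions)]
```

Under these assumptions, the word-region term is what ties each candidate word to a specific local area rather than to a single descriptor of the whole scene.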
Document type:
Preprint, Working paper
Contributor: Thoth Team
Submitted on: Friday, January 6, 2017 - 16:42:08
Last modified on: Thursday, January 12, 2017 - 13:32:15
Document(s) archived on: Friday, April 7, 2017 - 17:58:33


Files produced by the author(s)


  • HAL Id: hal-01428963, version 1



Marco Pedersoli, Thomas Lucas, Cordelia Schmid, Jakob Verbeek. Areas of Attention for Image Captioning. 2016. <hal-01428963>


