M. Abadi, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow

S. P. Abney, Parsing by chunks, In: Principle-based parsing, pp.257-278, 1991.

G. Awad, J. Fiscus, M. Michel, D. Joy, W. Kraaij et al., Trecvid 2016: Evaluating video search, video event detection, localization, and hyperlinking, Proceedings of TRECVID, 2016.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow et al., Theano: new features and speed improvements, Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

H. Bay, T. Tuytelaars, and L. Van Gool, SURF: Speeded up robust features, Computer Vision - ECCV 2006, pp.404-417, 2006.
DOI : 10.1007/11744023_32

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.157-166, 1994.
DOI : 10.1109/72.279181

J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu et al., Theano: a CPU and GPU math expression compiler, Proceedings of the Python for scientific computing conference (SciPy), p.3, 2010.

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet Allocation, J. Mach. Learn. Res, vol.3, pp.993-1022, 2003.

R. Bois, V. Vukotić, A. Simon, R. Sicre, C. Raymond et al., Exploiting Multimodality in Video Hyperlinking to Improve Target Diversity, International Conference on Multimedia Modeling, pp.185-197, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01498130

R. Bois, A. Simon, R. Sicre, G. Gravier, and P. Sébillot, IRISA at TRECVid2015: Leveraging Multimodal LDA for Video Hyperlinking, Proc. of TRECVID, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01403726

H. Bonneau-maynard, S. Rosset, C. Ayache, A. Kuhn, and D. Mostefa, Semantic Annotation of the French Media Dialog Corpus, 2005.

M. Campr and K. Ježek, Comparing Semantic Models for Evaluating Automatic Document Summarization, Text, Speech, and Dialogue, 2015.

M. E. Celebi and K. Aydin, Unsupervised Learning Algorithms, 2016.

M. Cha, Y. Gwon, and H. T. Kung, Multimodal sparse representation learning and applications, CoRR abs/1511.06238, 2015.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever et al., InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, Advances in Neural Information Processing Systems, pp.2172-2180, 2016.

K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1179
URL : https://hal.archives-ouvertes.fr/hal-01433235

F. Chollet, Xception: Deep Learning with Depthwise Separable Convolutions, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2017.195

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.

D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-smith et al., Expanding the scope of the ATIS task, Proceedings of the workshop on Human Language Technology , HLT '94, pp.43-48, 1994.
DOI : 10.3115/1075812.1075823

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.886-893, 2005.
DOI : 10.1109/CVPR.2005.177
URL : https://hal.archives-ouvertes.fr/inria-00548512

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-End Factor Analysis for Speaker Verification, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.4, pp.788-798, 2011.
DOI : 10.1109/TASL.2010.2064307

M. Dinarelli, V. Vukotic, and C. Raymond, Label-Dependency Coding in Simple Recurrent Networks for Spoken Language Understanding, Interspeech 2017, 2017.
DOI : 10.21437/Interspeech.2017-1480
URL : https://hal.archives-ouvertes.fr/hal-01553830

C. Dong, C. C. Loy, K. He, and X. Tang, Learning a Deep Convolutional Network for Image Super-Resolution, European Conference on Computer Vision, pp.184-199, 2014.
DOI : 10.1007/978-3-319-10593-2_13

J. L. Elman, Finding structure in time, Cognitive Science, vol.14, pp.179-211, 1990.

M. Eskevich, M. Larson, R. Aly, S. Sabetghadam, G. J. F. Jones et al., Multimodal Video-to-Video Linking: Turning to the Crowd for Insight and Evaluation, Proc. of the 23rd International Conference on Multimedia Modeling, 2017.

M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen et al., The Search and Hyperlinking Task at MediaEval, 2014.

F. Feng, X. Wang, and R. Li, Cross-modal retrieval with correspondence autoencoder, ACM Intl. Conf. on Multimedia, pp.7-16, 2014.
DOI : 10.1145/2647868.2654902

D. Fouhey and C. Zitnick, Predicting Object Dynamics in Scenes, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.2019-2026, 2014.
DOI : 10.1109/CVPR.2014.260

L. Gatys, A. Ecker, and M. Bethge, A Neural Algorithm of Artistic Style, Journal of Vision, vol.16, issue.12, 2015.
DOI : 10.1167/16.12.326

J. Gauvain, L. Lamel, and G. Adda, The LIMSI Broadcast News transcription system, Speech Communication, vol.37, issue.1-2, pp.89-108, 2002.
DOI : 10.1016/S0167-6393(01)00061-9
URL : https://hal.archives-ouvertes.fr/hal-01434493

F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation, vol.12, issue.10, pp.2451-2471, 2000.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., Generative adversarial nets, Advances in Neural Information Processing Systems, pp.2672-2680, 2014.

C. Guinaudeau, A. R. Simon, G. Gravier, and P. Sébillot, HITS and IRISA at MediaEval 2013: Search and hyperlinking task, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00906249

S. Hahn, P. Lehnen, C. Raymond, and H. Ney, A Comparison of Various Methods for Concept Tagging for Spoken Language Understanding
URL : https://hal.archives-ouvertes.fr/hal-01321122

S. Hahn, M. Dinarelli, C. Raymond, F. Lefèvre, P. Lehnen et al., Comparing Stochastic Approaches to Spoken Language Understanding in Multiple Languages, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.6, pp.1569-1583, 2010.
DOI : 10.1109/TASL.2010.2093520
URL : https://hal.archives-ouvertes.fr/hal-00746965

Y. He and S. Young, Semantic processing using the Hidden Vector State model, Computer Speech & Language, vol.19, issue.1, pp.85-106, 2005.
DOI : 10.1016/j.csl.2004.03.001

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

S. Hong, T. You, S. Kwak, and B. Han, Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network, pp.597-606, 2015.

D. Huang and K. Kitani, Action-Reaction: Forecasting the Dynamics of Human Interaction, pp.489-504, 2014.
DOI : 10.1007/978-3-319-10584-0_32

D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic, Generating images with recurrent adversarial networks, 2016.

H. Jégou, M. Douze, C. Schmid, and P. Pérez, Aggregating local descriptors into a compact image representation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3304-3311, 2010.
DOI : 10.1109/CVPR.2010.5540039

L. Jiang, S. Yu, D. Meng, Y. Yang, T. Mitamura et al., Fast and Accurate Content-based Semantic Search in 100M Internet Videos, Proceedings of the 23rd ACM international conference on Multimedia, MM '15, pp.49-58, 2015.

J. Johnson, A. Alahi, and L. Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, CoRR, 2016.

M. I. Jordan, Serial order: A parallel distributed processing approach, In: Advances in Psychology, vol.121, pp.471-495, 1997.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, Activity forecasting, pp.201-214, 2012.

H. Koppula and A. Saxena, Anticipating human activities using object affordances for reactive robotic response, pp.14-29, 2016.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM, vol.60, issue.6, pp.1097-1105, 2012.
DOI : 10.1145/3065386
URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

T. Kudo and Y. Matsumoto, Chunking with Support Vector Machines, Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. NAACL '01, pp.1-8, 2001.

G. Kurata, B. Xiang, B. Zhou, and M. Yu, Leveraging Sentence-level Information with Encoder LSTM for Natural Language Understanding, 2016.

J. D. Lafferty, A. Mccallum, and F. C. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, International Conference on Machine Learning, pp.282-289, 2001.

D. Lahat, T. Adali, and C. Jutten, Multimodal Data Fusion: An Overview of Methods, Challenges, and Prospects, Proceedings of the IEEE, pp.1449-1477, 2015.
DOI : 10.1109/JPROC.2015.2460697
URL : https://hal.archives-ouvertes.fr/hal-01179853

T. Lan, T. Chen, and S. Savarese, A Hierarchical Representation for Future Action Prediction, pp.689-704, 2014.
DOI : 10.1007/978-3-319-10578-9_45

A. Laurent, N. Camelin, and C. Raymond, Boosting bonsai trees for efficient features combination : application to speaker role identification, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01025171

T. Lavergne, O. Cappé, and F. Yvon, Practical Very Large Scale CRFs, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pp.504-513, 2010.

Q. V. Le and T. Mikolov, Distributed Representations of Sentences and Documents, In: ICML, vol.14, pp.1188-1196, 2014.

R. Lebret, J. Legrand, and R. Collobert, Is Deep Learning Really Necessary for Word Embeddings? Tech. rep, 2013.

Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes et al., Comparison of learning algorithms for handwritten digit recognition, In: International conference on artificial neural networks, vol.60, pp.53-60, 1995.

C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham et al., Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2017.19

C. Liu, J. Yuen, and A. Torralba, SIFT Flow: Dense Correspondence Across Scenes and Its Applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, issue.5, pp.978-994, 2011.
DOI : 10.1007/978-3-319-23048-1_2

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3431-3440, 2015.
DOI : 10.1109/CVPR.2015.7298965

D. G. Lowe, Object recognition from local scale-invariant features, In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol.2, pp.1150-1157, 1999.

H. Lu, Y. Liou, H. Lee, and L. Lee, Semantic Retrieval of Personal Photos Using a Deep Autoencoder Fusing Visual Features with Speech Annotations Represented as Word/Paragraph Vectors, Annual Conf. of the Intl. Speech Communication Association, 2015.

A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, Adversarial autoencoders, 2015.

G. Mesnil, X. He, L. Deng, and Y. Bengio, Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding, 14th Annual Conference of the International Speech Communication Association, pp.3771-3775, 2013.

K. Mikolajczyk and C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.27, issue.10, pp.1615-1630, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548227

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, 2013.

M. Mirza and S. Osindero, Conditional generative adversarial nets, 2014.

R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi, Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2016.383

A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune, Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2017.374

A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, Pixel Recurrent Neural Networks, CoRR, 2016.

A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves et al., Conditional image generation with PixelCNN decoders, CoRR, 2016.

D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, Context Encoders: Feature Learning by Inpainting, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2536-2544, 2016.
DOI : 10.1109/CVPR.2016.278

J. Pennington, R. Socher, and C. D. Manning, GloVe: Global Vectors for Word Representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.1532-1543, 2014.
DOI : 10.3115/v1/D14-1162

G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez, Invertible Conditional GANs for image editing, 2016.

F. Perronnin and C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383266

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, Computer Vision - ECCV 2010, pp.143-156, 2010.
DOI : 10.1007/978-3-642-15561-1_11
URL : https://hal.archives-ouvertes.fr/inria-00548630

S. Pintea and J. van Gemert, Making a Case for Learning Motion Representations with Phase, 2016.

A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, CoRR, 2015.

M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert et al., Video (language) modeling: a baseline for generative models of natural videos, CoRR, 2014.

C. Raymond and G. Riccardi, Generative and Discriminative Algorithms for Spoken Language Understanding, In: InterSpeech. Antwerp, Belgium, pp.1605-1608, 2007.

S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele et al., Generative adversarial text to image synthesis, Proceedings of The 33rd International Conference on Machine Learning, 2016.

M. Ruder, A. Dosovitskiy, and T. Brox, Artistic Style Transfer for Videos, CoRR, 2016.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV), vol.115, issue.3, pp.211-252, 2015.

M. Saito and E. Matsumoto, Temporal Generative Adversarial Nets, 2016.

G. Salton and M. J. McGill, Introduction to modern information retrieval, 1986.

R. E. Schapire and Y. Singer, BoosTexter: A boosting-based system for text categorization, Machine Learning, vol.39, pp.135-168, 2000.

C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), pp.32-36, 2004.
DOI : 10.1109/ICPR.2004.1334462

M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol.45, issue.11, pp.2673-2681, 1997.
DOI : 10.1109/78.650093

A. Sharif-razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.806-813, 2014.

A. Simon, Semantic structuring of video collections from speech: segmentation and hyperlinking, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01253678

A. Simon, R. Sicre, R. Bois, G. Gravier, and P. Sébillot, IRISA at TRECVid2015: Leveraging Multimodal LDA for Video Hyperlinking, Proc. of TRECVID, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01403726

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

W. De Smet and M. Moens, Cross-language linking of news stories on the web using interlingual topic modelling, Proc. of ACM Workshop on Social Web Search and Mining, 2009.

R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, Proceedings of the conference on empirical methods in natural language processing, pp.151-161, 2011.

J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for simplicity: The all convolutional net, arXiv:1412.6806, 2014.

M. Steyvers and T. Griffiths, Probabilistic Topic Models, pp.424-440, 2007.
DOI : 10.4324/9780203936399.ch21

I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, In: Advances in Neural Information Processing Systems, pp.3104-3112, 2014.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1-9, 2015.
DOI : 10.1109/CVPR.2015.7298594

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception Architecture for Computer Vision, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2818-2826, 2016.
DOI : 10.1109/CVPR.2016.308

M. Tatarchenko, A. Dosovitskiy, and T. Brox, Multi-view 3D Models from Single Images with a Convolutional Network, pp.322-337, 2016.

T. Tommasi, T. Tuytelaars, and B. Caputo, A Testbed for Cross-Dataset Analysis, arXiv:1402.5923, 2014.
DOI : 10.1007/978-3-319-16199-0_2

G. Tur, D. Hakkani-tur, and L. Heck, What is left to be understood in ATIS?, 2010 IEEE Spoken Language Technology Workshop, pp.19-24, 2010.
DOI : 10.1109/SLT.2010.5700816

O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever et al., Grammar as a foreign language, Advances in Neural Information Processing Systems, pp.2755-2763, 2015.

C. Vondrick, H. Pirsiavash, and A. Torralba, Anticipating the future by watching unlabeled video, CoRR, 2015.

C. Vondrick, H. Pirsiavash, and A. Torralba, Generating videos with scene dynamics, pp.613-621, 2016.

V. Vukotic, C. Raymond, and G. Gravier, A Step Beyond Local Observations with a Dialog Aware Bidirectional GRU Network for Spoken Language Understanding, Interspeech 2016, 2016.
DOI : 10.21437/Interspeech.2016-1301
URL : https://hal.archives-ouvertes.fr/hal-01351733

V. Vukotić, C. Raymond, and G. Gravier, Bidirectional Joint Representation Learning with Symmetrical Deep Neural Networks for Multimodal and Crossmodal Applications, Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp.343-346, 2016.

V. Vukotic, C. Raymond, and G. Gravier, Generative Adversarial Networks for Multimodal Representation Learning in Video Hyperlinking, Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval , ICMR '17, 2017.
DOI : 10.1145/2983563.2983567
URL : https://hal.archives-ouvertes.fr/hal-01522419

V. Vukotic, C. Raymond, and G. Gravier, Is it time to switch to Word Embedding and Recurrent Neural Networks for Spoken Language Understanding, In: InterSpeech, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01196915

V. Vukotić, C. Raymond, and G. Gravier, Multimodal and crossmodal representation learning from textual and visual features with bidirectional deep neural networks for video hyperlinking, Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion, pp.37-44, 2016.

V. Vukotić, S. Pintea, C. Raymond, G. Gravier, and J. van Gemert, One-Step Time-Dependent Future Video Frame Prediction with a Convolutional Encoder-Decoder Neural Network, 19th International Conference on Image Analysis and Processing (ICIAP), 2017.

J. Walker, A. Gupta, and M. Hebert, Dense Optical Flow Prediction from a Static Image, 2015 IEEE International Conference on Computer Vision (ICCV), pp.2443-2451, 2015.
DOI : 10.1109/ICCV.2015.281

J. Walker, A. Gupta, and M. Hebert, Patch to the Future: Unsupervised Visual Prediction, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.3302-3309, 2014.
DOI : 10.1109/CVPR.2014.416

J. Walker, C. Doersch, A. Gupta, and M. Hebert, An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders, pp.835-851, 2016.

X. Wang and A. Gupta, Generative Image Modeling Using Style and Structure Adversarial Networks, European Conference on Computer Vision, pp.318-335, 2016.

J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer et al., Towards AI-complete question answering: A set of prerequisite toy tasks, 2015.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, International Conference on Machine Learning, pp.2048-2057, 2015.

K. Yao, G. Zweig, M. Hwang, Y. Shi, and D. Yu, Recurrent Neural Networks for Language Understanding, 2013.

K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig et al., Spoken language understanding using long short-term memory neural networks, 2014 IEEE Spoken Language Technology Workshop (SLT), 2014.
DOI : 10.1109/SLT.2014.7078572

G. Ye, Y. Li, H. Xu, D. Liu, and S. Chang, EventNet, Proceedings of the 23rd ACM international conference on Multimedia, MM '15, pp.471-480, 2015.

R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do, Semantic Image Inpainting with Perceptual and Contextual Losses, 2016.

J. Yuen and A. Torralba, A Data-Driven Approach for Event Prediction, pp.707-720, 2010.
DOI : 10.1007/978-3-642-15552-9_51

H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang et al., StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, 2016.

F. Zhao, J. Feng, J. Zhao, W. Yang, and S. Yan, Robust LSTM-Autoencoders for Face De-Occlusion in the Wild, IEEE Transactions on Image Processing, vol.27, issue.2, 2016.
DOI : 10.1109/TIP.2017.2771408

Z. Yang, R. Salakhutdinov, and W. Cohen, Multi-Task Cross-Lingual Sequence Tagging from Scratch

V. Vukotić, C. Raymond, and G. Gravier, A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking, IEEE MultiMedia Special Issue: Vision and Language Integration Meets Multimedia Fusion, 2018.

V. Vukotić, S. Pintea, C. Raymond, G. Gravier, and J. C. van Gemert, One-Step Time-Dependent Future Video Frame Prediction with a Convolutional Encoder-Decoder Neural Network, Intl. Conf. on Image Analysis and Processing, 2017.

M. Dinarelli, V. Vukotić, and C. Raymond, Label-Dependency Coding in Simple Recurrent Networks for Spoken Language Understanding, Interspeech 2017, 2017.
DOI : 10.21437/Interspeech.2017-1480
URL : https://hal.archives-ouvertes.fr/hal-01553830

V. Vukotić, C. Raymond, and G. Gravier, Generative Adversarial Networks for Multimodal Representation Learning in Video Hyperlinking, ACM International Conference on Multimedia Retrieval, 2017.

R. Bois, V. Vukotić, A. Simon, R. Sicre, C. Raymond et al., Exploiting Multimodality in Video Hyperlinking to Improve Target Diversity, International Conference on Multimedia Modeling, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01498130

V. Vukotić, S. Pintea, C. Raymond, G. Gravier, and J. van Gemert, One-Step Time-Dependent Future Video Frame Prediction with a Convolutional Encoder-Decoder Neural Network, Netherlands Conference on Computer Vision, 2016.

V. Vukotić, C. Raymond, and G. Gravier, A step beyond local observations with a dialog aware bidirectional GRU network for Spoken Language Understanding, Annual Conf. of the Intl. Speech Communication Association (Interspeech), 2016.

V. Vukotić, C. Raymond, and G. Gravier, Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking, ACM Multimedia 2016 Workshop: Vision and Language Integration Meets Multimedia Fusion, 2016.

V. Vukotić, C. Raymond, and G. Gravier, Bidirectional Joint Representation Learning with Symmetrical Deep Neural Networks for Multimodal and Crossmodal Applications, ACM International Conference on Multimedia Retrieval, 2016.

V. Vukotić, C. Raymond, and G. Gravier, Is it time to switch to Word Embedding and Recurrent Neural Networks for Spoken Language Understanding, Annual Conf. of the Intl. Speech Communication Association (Interspeech), 2015.