R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

M. Mathieu, C. Couprie, and Y. LeCun, Deep multi-scale video prediction beyond mean square error, In: ICLR, 2016.

M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert et al., Video (language) modeling: a baseline for generative models of natural videos, arXiv:1412.6604, 2014.

N. Srivastava, E. Mansimov, and R. Salakhutdinov, Unsupervised learning of video representations using LSTMs, In: ICML, 2015.

S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua, Long-term planning by short-term prediction, 2016.

S. Shalev-Shwartz and A. Shashua, On the sample complexity of end-to-end training vs. semantic abstraction training, 2016.

P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun, Predicting Deeper into the Future of Semantic Segmentation, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.77

URL : https://hal.archives-ouvertes.fr/hal-01494296

I. Kokkinos, UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.579

R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, Decomposing motion and content for natural video sequence prediction, In: ICLR, 2017.

R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin et al., Learning to generate long-term future via hierarchical prediction, In: ICML, 2017.

J. Walker, C. Doersch, A. Gupta, and M. Hebert, An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders, In: ECCV, 2016.
DOI : 10.1007/978-3-642-15552-9_51

URL : http://arxiv.org/pdf/1606.07873

T. Lan, T. C. Chen, and S. Savarese, A Hierarchical Representation for Future Action Prediction, In: ECCV, 2014.
DOI : 10.1007/978-3-319-10578-9_45

URL : http://cvgl.stanford.edu/papers/lan_eccv14.pdf

K. Kitani, B. Ziebart, J. Bagnell, and M. Hebert, Activity forecasting, In: ECCV, 2012.

N. Lee, W. Choi, P. Vernaza, C. Choy, P. Torr et al., DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.233

URL : http://arxiv.org/pdf/1704.04394

A. Dosovitskiy and V. Koltun, Learning to act by predicting the future, In: ICLR, 2017.

C. Vondrick, H. Pirsiavash, and A. Torralba, Anticipating the future by watching unlabeled video, In: CVPR, 2016.

X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin et al., Predicting scene parsing and motion dynamics in the future, In: NIPS, 2017.

B. Romera-Paredes and P. Torr, Recurrent Instance Segmentation, In: ECCV, 2016.
DOI : 10.5244/C.29.CVPPP.1

URL : http://arxiv.org/pdf/1511.08250

M. Bai and R. Urtasun, Deep Watershed Transform for Instance Segmentation, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.305

URL : http://arxiv.org/pdf/1611.08303

P. Pinheiro, T. Y. Lin, R. Collobert, and P. Dollár, Learning to Refine Object Segments, In: ECCV, 2016.
DOI : 10.5244/C.30.15

URL : http://arxiv.org/pdf/1603.08695

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.6, 2017.
DOI : 10.1109/TPAMI.2016.2577031

URL : http://arxiv.org/pdf/1506.01497

T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan et al., Feature Pyramid Networks for Object Detection, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.106

URL : http://arxiv.org/pdf/1612.03144

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler et al., The Cityscapes Dataset for Semantic Urban Scene Understanding, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.350

URL : http://arxiv.org/pdf/1604.01685

T. Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick et al., Microsoft COCO: Common Objects in Context, In: ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_48

URL : http://arxiv.org/pdf/1405.0312.pdf

A. Yang, J. Wright, Y. Ma, and S. Sastry, Unsupervised segmentation of natural images via lossy data compression, Computer Vision and Image Understanding, vol.110, issue.2, pp.212-225, 2008.
DOI : 10.1016/j.cviu.2007.07.005

C. Pantofaru and M. Hebert, A comparison of image segmentation algorithms, 2005.

D. Martin, C. Fowlkes, D. Tal, and J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, 2001.
DOI : 10.1109/ICCV.2001.937655

M. Meilă, Comparing clusterings: An axiomatic view, In: ICML, 2005.

F. Yu and V. Koltun, Multi-scale context aggregation by dilated convolutions, In: ICLR, 2016.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., Generative adversarial nets, In: NIPS, 2014.

D. Kingma and M. Welling, Auto-encoding variational Bayes, In: ICLR, 2014.

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298676

Q. Chen and V. Koltun, Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.509

URL : http://arxiv.org/pdf/1604.03513