T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High accuracy optical flow estimation based on a theory for warping, ECCV, 2004.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, CVPR, 2017.

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. PAMI, vol.40, issue.4, 2018.

A. Diba, A. M. Pazandeh, and L. Van Gool, Efficient two-stream motion and appearance 3D CNNs for video classification, ECCV workshop, 2016.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015.

A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas et al., Flownet: Learning optical flow with convolutional networks, ICCV, 2015.

L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong et al., End-to-end learning of motion representation for video understanding, CVPR, 2018.

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional two-stream network fusion for video action recognition, CVPR, 2016.

N. Garcia, P. Morerio, and V. Murino, Modality distillation with multiple stream networks for action recognition, ECCV, 2018.

R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal et al., The something something video database for learning and evaluating visual common sense, ICCV, 2017.

K. Hara, H. Kataoka, and Y. Satoh, Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?, CVPR, 2018.

K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, ICCV, 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, 2016.

G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, NIPS workshop, 2014.

J. Hoffman, S. Gupta, and T. Darrell, Learning with side information through modality hallucination, CVPR, 2016.

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy et al., Flownet 2.0: Evolution of optical flow estimation with deep networks, CVPR, 2017.

V. Kantorov and I. Laptev, Efficient feature extraction, encoding and classification for action recognition, CVPR, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01058734

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The Kinetics human action video dataset, arXiv preprint arXiv:1705.06950, 2017.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, ICCV, 2011.

M. Lee and S. Lee, Motion feature network: Fixed motion filter for action recognition, ECCV, 2018.

Y. Li, Y. Li, and N. Vasconcelos, RESOUND: Towards action recognition without representation bias, ECCV, 2018.

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, CVPR, 2015.

D. Lopez-paz, L. Bottou, B. Schölkopf, and V. Vapnik, Unifying distillation and privileged information, ICLR, 2016.

Z. Luo, J. Hsieh, L. Jiang, J. C. Niebles, and L. Fei-Fei, Graph distillation for action detection with privileged modalities, ECCV, 2018.

J. Ng, J. Choi, J. Neumann, and L. S. Davis, ActionFlowNet: Learning motion representation for action recognition, WACV, 2018.

A. Ranjan and M. J. Black, Optical flow estimation using a spatial pyramid network, CVPR, 2017.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, 2015.

J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Epicflow: Edge-preserving interpolation of correspondences for optical flow, CVPR, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01097477

L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger et al., On the integration of optical flow and action recognition, GCPR, 2018.

A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: An astounding baseline for recognition, CVPR workshops, 2014.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human action classes from videos in the wild, arXiv preprint arXiv:1212.0402, 2012.

D. Sun, X. Yang, M. Liu, and J. Kautz, PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume, CVPR, 2018.

S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang, Optical flow guided feature: A fast and robust motion representation for video action recognition, CVPR, 2018.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, CVPR, 2015.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, ICCV, 2015.

D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun et al., A closer look at spatiotemporal convolutions for action recognition, CVPR, 2018.

V. Vapnik and R. Izmailov, Learning using privileged information: Similarity control and knowledge transfer, JMLR, vol.16, 2015.

G. Varol, I. Laptev, and C. Schmid, Long-term temporal convolutions for action recognition, IEEE Trans. PAMI, vol.40, issue.6, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01241518

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal segment networks: Towards good practices for deep action recognition, ECCV, 2016.

X. Wang, R. Girshick, A. Gupta, and K. He, Non-local neural networks, CVPR, 2018.

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, Aggregated residual transformations for deep neural networks, CVPR, 2017.

S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification, ECCV, 2018.

C. Zach, T. Pock, and H. Bischof, A duality based approach for realtime TV-L1 optical flow, Joint Pattern Recognition Symposium (DAGM), 2007.

M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, ECCV, 2014.

B. Zhou, A. Andonian, A. Oliva, and A. Torralba, Temporal relational reasoning in videos, ECCV, 2018.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, CVPR, 2016.

Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann, Hidden two-stream convolutional networks for action recognition, ACCV, 2018.