S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici et al., YouTube-8M: A large-scale video classification benchmark, arXiv preprint, 2016.

R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, NetVLAD: CNN architecture for weakly supervised place recognition, CVPR, 2016.
DOI : 10.1109/cvpr.2016.572

URL : https://hal.archives-ouvertes.fr/hal-01242052

M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, Sequential Deep Learning for Human Action Recognition, Human Behavior Understanding, pp.29-39, 2011.
DOI : 10.1007/978-3-642-25446-8_4

URL : https://hal.archives-ouvertes.fr/hal-01354493

B. Fernando, E. Gavves, J. Oramas M., A. Ghodrati, and T. Tuytelaars, Modeling video evolution for action recognition, CVPR, 2015.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, CVPR, 2017.

K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, arXiv preprint, 2014.

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, ECCV Workshop, 2004.

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, Language modeling with gated convolutional networks, arXiv preprint, 2016.

C. R. de Souza, A. Gaidon, E. Vig, and A. M. López, Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition, ECCV, 2016.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, arXiv preprint, 2014.
DOI : 10.1109/tpami.2016.2599174

URL : http://arxiv.org/abs/1411.4389

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional Two-Stream Network Fusion for Video Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.213

URL : http://arxiv.org/abs/1604.06573

Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, Compact Bilinear Pooling, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.41

URL : http://arxiv.org/abs/1511.06062

R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell, ActionVLAD: Learning spatio-temporal aggregation for action classification, CVPR, 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90

URL : http://arxiv.org/abs/1512.03385

S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen et al., CNN architectures for large-scale audio classification, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
DOI : 10.1109/ICASSP.2017.7952132

URL : http://arxiv.org/abs/1609.09430

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, 1997.
DOI : 10.1162/neco.1997.9.8.1735

M. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, A Hierarchical Deep Temporal Model for Group Activity Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.217

URL : http://arxiv.org/abs/1511.06040

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint, 2015.

H. Jegou, M. Douze, C. Schmid, and P. Perez, Aggregating local descriptors into a compact image representation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5540039

URL : https://hal.archives-ouvertes.fr/inria-00548637

M. I. Jordan and R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm, Neural Computation, issue.2, 1994.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.1725-1732, 2014.
DOI : 10.1109/CVPR.2014.223

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.471.3312

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.299.205

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

G. Lev, G. Sadeh, B. Klein, and L. Wolf, RNN Fisher Vectors for Action Recognition and Image Annotation, ECCV, 2016.

URL : http://arxiv.org/abs/1512.03958

A. Miech, LOUPE tensorflow toolbox for learnable pooling module, https://github.com/antoine77340/LOUPE, 2017.

X. Peng, L. Wang, Y. Qiao, and Q. Peng, Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics, ECCV, 2014.
DOI : 10.1007/978-3-319-10578-9_43

X. Peng, C. Zou, Y. Qiao, and Q. Peng, Action Recognition with Stacked Fisher Vectors, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_38

F. Perronnin and C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383266

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.7388

F. Perronnin and D. Larlus, Fisher vectors meet Neural Networks: A hybrid classification architecture, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298998

J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, Lost in quantization: Improving particular object retrieval in large scale image databases, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587635

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.9621

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition (ICPR), 2004.
DOI : 10.1109/ICPR.2004.1334462

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, pp.568-576, 2014.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, 2003.
DOI : 10.1109/ICCV.2003.1238663

V. Sydorov, M. Sakurada, and C. H. Lampert, Deep Fisher Kernels -- End to End Learning of the Fisher Kernel GMM Parameters, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.182

C. Szegedy, S. Ioffe, and V. Vanhoucke, Inception-v4, Inception-ResNet and the impact of residual connections on learning, arXiv preprint, 2016.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.510

URL : http://arxiv.org/abs/1412.0767

G. Varol, I. Laptev, and C. Schmid, Long-term Temporal Convolutions for Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
DOI : 10.1109/TPAMI.2017.2712608

URL : https://hal.archives-ouvertes.fr/hal-01241518

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

L. Wang, Y. Qiao, and X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4305-4314, 2015.
DOI : 10.1109/CVPR.2015.7299059

URL : http://arxiv.org/abs/1505.04868

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV, 2016.
DOI : 10.1007/978-3-319-46484-8_2

URL : http://arxiv.org/abs/1608.00859

Z. Xu, Y. Yang, and A. G. Hauptmann, A discriminative CNN video representation for event detection, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298789

URL : http://arxiv.org/abs/1411.4006

J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals et al., Beyond short snippets: Deep networks for video classification, CVPR, 2015.