S. Buch, V. Escorcia, B. Ghanem, L. Fei-Fei, and J. Niebles, End-to-end, single-stream temporal action detection in untrimmed videos, 2017.

F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, 2015.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, 2017.

T. Chen, I. Goodfellow, and J. Shlens, Net2Net: Accelerating learning via knowledge transfer, 2015.

Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, A^2-Nets: Double attention networks, 2018.

X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Q. Chen, Temporal context network for activity localization in videos, 2017.

T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence, 1997.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, 2015.

V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, DAPs: Deep action proposals for action understanding, 2016.

C. Feichtenhofer, H. Fan, J. Malik, and K. He, SlowFast networks for video recognition, 2018.

J. Gao, Z. Yang, and R. Nevatia, Cascaded boundary regression for temporal action detection, 2017.

J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia, TURN TAP: Temporal unit regression network for temporal action proposals, 2017.

G. Gkioxari and J. Malik, Finding action tubes, 2015.

A. Gudi, N. van Rosmalen, M. Loog, and J. van Gemert, Object-extent pooling for weakly supervised single-shot localization, 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, 2014.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, 2016.

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, 2017.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

S. Ji, W. Xu, M. Yang, and K. Yu, 3D convolutional neural networks for human action recognition, IEEE Transactions, 2013.

Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev et al., THUMOS challenge: Action recognition with a large number of classes, 2014.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, 2014.

T. Lin, X. Zhao, and Z. Shou, Single shot temporal action detection, Proceedings of the 2017 ACM on Multimedia Conference, 2017.

T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang, BSN: Boundary sensitive network for temporal action proposal generation, 2018.

S. Ma, L. Sigal, and S. Sclaroff, Learning activity progression in LSTMs for activity detection and early detection, 2016.

P. Nguyen, T. Liu, G. Prasad, and B. Han, Weakly supervised action localization by sparse temporal pooling network, 2018.

S. Paul, S. Roy, and A. K. Roy-Chowdhury, W-TALC: Weakly-supervised temporal activity localization and classification, 2018.

L. Sevilla-Lara, Y. Liao, F. Guney, V. Jampani, A. Geiger et al., On the integration of optical flow and action recognition, 2017.

B. Seybold, D. Ross, J. Deng, R. Sukthankar, S. Vijayanarasimhan et al., Rethinking the Faster R-CNN architecture for temporal action localization, 2018.

Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. F. Chang, CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos, 2017.

Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S. F. Chang, AutoLoc: Weakly-supervised temporal action localization in untrimmed videos, 2018.

Z. Shou, D. Wang, and S. F. Chang, Temporal action localization in untrimmed videos via multi-stage CNNs, 2016.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems, 2014.

K. K. Singh and Y. J. Lee, Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization, 2017.

C. Sun, S. Shetty, R. Sukthankar, and R. Nevatia, Temporal localization of fine-grained actions in videos by domain transfer from web images, Proceedings of the 23rd ACM International Conference on Multimedia, 2015.

S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang, Optical flow guided feature: A fast and robust motion representation for video action recognition, 2018.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, ICML, 2013.

C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, AAAI, vol. 4, p. 12, 2017.

L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for weakly supervised action recognition and detection, 2017.

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal segment networks: Towards good practices for deep action recognition, 2016.

X. Wang, R. Girshick, A. Gupta, and K. He, Non-local neural networks, CVPR, 2018.

H. Xu, A. Das, and K. Saenko, R-C3D: Region convolutional 3D network for temporal activity detection, 2017.

J. Yuan, B. Ni, X. Yang, and A. A. Kassim, Temporal action localization with pyramid of score distribution features, 2016.

Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang et al., Temporal action detection with structured segment networks, ICCV, 2017.