J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom et al., OpenTuner: An Extensible Framework for Program Autotuning, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14), pp.303-316, 2014.

S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu, An adaptive performance modeling tool for GPU architectures, ACM SIGPLAN Notices, vol.45, pp.105-114, 2010.

M. Baskaran, J. Ramanujam, and P. Sadayappan, Automatic C-to-CUDA code generation for affine programs, Compiler Construction, pp.244-263, 2010.
DOI : 10.1007/978-3-642-11970-5_14

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, vol.43, pp.101-113, 2008.
DOI : 10.1145/1379022.1375595

M. Christen, O. Schenk, and H. Burkhart, PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures, Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS '11), pp.676-687, 2011.
DOI : 10.1109/ipdps.2011.70

CUDA occupancy calculator

L. Dagum and R. Menon, OpenMP: an industry standard API for shared-memory programming, IEEE computational science and engineering, vol.5, pp.46-55, 1998.
DOI : 10.1109/99.660313

J. Dongarra, Report on the Sunway TaihuLight system, www.netlib.org, 2016.

T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege, Hybrid Hexagonal/Classical Tiling for GPUs, Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14), vol.66, 2014.
DOI : 10.1145/2581122.2544160

URL : https://hal.archives-ouvertes.fr/hal-00911177

T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan et al., Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles, Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, pp.24-31, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00786812

J. Holewinski, L. Pouchet, and P. Sadayappan, High-performance Code Generation for Stencil Computations on GPU Architectures, Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12), pp.311-320, 2012.
DOI : 10.1145/2304576.2304619

URL : http://www.cs.ucla.edu/%7Epouchet/doc/ics-article.12.pdf

C. Hong, A. Sukumaran-Rajam, J. Kim, P. S. Rawat, S. Krishnamoorthy et al., GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis, 2018.
DOI : 10.1145/3192366.3192397

URL : https://hal.archives-ouvertes.fr/hal-01955475

S. Hong and H. Kim, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, ACM SIGARCH Computer Architecture News, vol.37, pp.152-163, 2009.
DOI : 10.1145/1555754.1555775

URL : http://www.cc.gatech.edu/%7Ehyesoon/hong_isca09.pdf

H. Jia, Y. Zhang, G. Long, J. Xu, S. Yan et al., GPURoofline: a model for guiding performance optimizations on GPUs, European Conference on Parallel Processing, pp.920-932, 2012.
DOI : 10.1007/978-3-642-32820-6_90

J. Lai and A. Seznec, Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs, Code Generation and Optimization (CGO), pp.1-10, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00789958

S. Lee, J. S. Meredith, and J. S. Vetter, COMPASS: A Framework for Automated Performance Modeling and Prediction, ACM International Conference on Supercomputing (ICS '15), 2015.

S. Lee and J. Vetter, OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing, HPDC '14: Proceedings of the ACM Symposium on High-Performance Parallel and Distributed Computing, 2014.
DOI : 10.1109/waccpd.2014.7

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski, Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters, Proceedings of the 2010 IEEE International Conference on Cluster Computing, pp.207-216, 2010.
DOI : 10.1109/cluster.2010.26

W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal, Optimizing tensor contraction expressions for hybrid CPU-GPU execution, Cluster computing, vol.16, pp.131-155, 2013.
DOI : 10.1007/s10586-011-0179-2

A. Magni, C. Dubach, and M. O'Boyle, Automatic optimization of thread-coarsening for graphics processors, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pp.455-466, 2014.
DOI : 10.1145/2628071.2628087

URL : https://www.pure.ed.ac.uk/ws/files/19958629/magni14pact.pdf

A. Magni, C. Dubach, and M. F. P. O'Boyle, A large-scale cross-architecture evaluation of thread-coarsening, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.11, 2013.
DOI : 10.1145/2503210.2503268

Whitepaper: NVIDIA Tesla K100, 2012.

Whitepaper: NVIDIA Tesla P100, 2016.

S. Wienke, P. Springer, C. Terboven, and D. an Mey, OpenACC: first experiences with real-world applications, Euro-Par, pp.859-870, 2012.

S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, Commun. ACM, vol.52, pp.65-76, 2009.

S. Xu, Y. Xu, W. Xue, X. Shen, X. Huang et al., Taming the "Monster": Overcoming program optimization challenges on SW26010 through precise performance modeling, Parallel and Distributed Processing Symposium (IPDPS), 2018.

X. Zhang, G. Tan, S. Xue, J. Li, K. Zhou et al., Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.31-43, 2017.

Y. Zhang and J. Owens, A quantitative performance analysis model for GPU architectures, 2011 IEEE 17th International Symposium on High Performance Computer Architecture, pp.382-393, 2011.

K. Zhou, G. Tan, X. Zhang, C. Wang, and N. Sun, A performance analysis framework for exploiting GPU microarchitectural capability, Proceedings of the International Conference on Supercomputing, ACM, p.15, 2017.