J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom et al., OpenTuner: An Extensible Framework for Program Autotuning, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14), pp.303-316, 2014.

S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu, An adaptive performance modeling tool for GPU architectures, ACM SIGPLAN Notices, vol.45, pp.105-114, 2010.

M. Baskaran, J. Ramanujam, and P. Sadayappan, Automatic C-to-CUDA code generation for affine programs, Compiler Construction, pp.244-263, 2010.
DOI : 10.1007/978-3-642-11970-5_14

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, vol.43, pp.101-113, 2008.
DOI : 10.1145/1379022.1375595

M. Christen, O. Schenk, and H. Burkhart, PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures, Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS '11), pp.676-687, 2011.
DOI : 10.1109/ipdps.2011.70

CUDA occupancy calculator

L. Dagum and R. Menon, OpenMP: an industry standard API for shared-memory programming, IEEE computational science and engineering, vol.5, pp.46-55, 1998.
DOI : 10.1109/99.660313

J. Dongarra, Report on the Sunway TaihuLight system, www.netlib.org, 2016.

T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege, Hybrid Hexagonal/Classical Tiling for GPUs, Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14), vol.66, 2014.
DOI : 10.1145/2581122.2544160
URL : https://hal.archives-ouvertes.fr/hal-00911177

T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan et al., Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles, Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, pp.24-31, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00786812

J. Holewinski, L. Pouchet, and P. Sadayappan, High-performance Code Generation for Stencil Computations on GPU Architectures, Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12), pp.311-320, 2012.
DOI : 10.1145/2304576.2304619
URL : http://www.cs.ucla.edu/%7Epouchet/doc/ics-article.12.pdf

C. Hong, A. Sukumaran-Rajam, J. Kim, P. S. Rawat, S. Krishnamoorthy et al., GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis, 2018.
DOI : 10.1145/3192366.3192397
URL : https://hal.archives-ouvertes.fr/hal-01955475

S. Hong and H. Kim, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, ACM SIGARCH Computer Architecture News, vol.37, pp.152-163, 2009.
DOI : 10.1145/1555754.1555775
URL : http://www.cc.gatech.edu/%7Ehyesoon/hong_isca09.pdf

H. Jia, Y. Zhang, G. Long, J. Xu, S. Yan et al., GPURoofline: a model for guiding performance optimizations on GPUs, European Conference on Parallel Processing, pp.920-932, 2012.
DOI : 10.1007/978-3-642-32820-6_90

J. Lai and A. Seznec, Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs, Code Generation and Optimization (CGO), pp.1-10, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00789958

S. Lee, J. S. Meredith, and J. S. Vetter, COMPASS: A Framework for Automated Performance Modeling and Prediction, ACM International Conference on Supercomputing (ICS '15), 2015.

S. Lee and J. Vetter, OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing, HPDC '14: Proceedings of the ACM Symposium on High-Performance Parallel and Distributed Computing, 2014.
DOI : 10.1109/waccpd.2014.7

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski, Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters, Proceedings of the 2010 IEEE International Conference on Cluster Computing, pp.207-216, 2010.
DOI : 10.1109/cluster.2010.26

W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal, Optimizing tensor contraction expressions for hybrid CPU-GPU execution, Cluster computing, vol.16, pp.131-155, 2013.
DOI : 10.1007/s10586-011-0179-2

A. Magni, C. Dubach, and M. O'Boyle, Automatic optimization of thread-coarsening for graphics processors, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pp.455-466, 2014.
DOI : 10.1145/2628071.2628087
URL : https://www.pure.ed.ac.uk/ws/files/19958629/magni14pact.pdf

A. Magni, C. Dubach, and M. F. P. O'Boyle, A large-scale cross-architecture evaluation of thread-coarsening, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.11, 2013.
DOI : 10.1145/2503210.2503268

Nervana maxas

NVIDIA, CUDA Binary Utilities, 2018.

M. Papadopoulou, M. Sadooghi-Alvandi, and H. Wong, Micro-benchmarking the GT200 GPU, 2009.

M. Ravishankar, P. Micikevicius, and V. Grover, Fusing Convolution Kernels Through Tiling, Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pp.43-48, 2015.

P. Rawat, C. Hong, M. Ravishankar, V. Grover, L. Pouchet et al., Resource Conscious Reuse-Driven Tiling for GPUs, International Conference on Parallel Architectures and Compilation Techniques, pp.99-111, 2016.

T. G. Rogers, M. O'Connor, and T. M. Aamodt, Cache-conscious wavefront scheduling, Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp.72-83, 2012.

I. Shavitt and R. J. Bartlett, Many-body methods in chemistry and physics: MBPT and coupled-cluster theory, 2009.

J. Sim, A. Dasgupta, H. Kim, and R. Vuduc, A performance analysis framework for identifying potential benefits in GPGPU applications, ACM SIGPLAN Notices, vol.47, pp.11-22, 2012.

S. Unkule, C. Shaltz, and A. Qasem, Automatic restructuring of GPU kernels for exploiting inter-thread data locality, International Conference on Compiler Construction, pp.21-40, 2012.

M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma et al., NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations, Computer Physics Communications, vol.181, pp.1477-1489, 2010.

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gomez, C. Tenllado et al., Polyhedral parallel code generation for CUDA, ACM Transactions on Architecture and Code Optimization (TACO), vol.9, p.54, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00786677

M. N. Wegman and F. K. Zadeck, Constant propagation with conditional branches, ACM Transactions on Programming Languages and Systems, vol.13, pp.181-210, 1991.

NVIDIA Tesla K100, Whitepaper, 2012.

NVIDIA Tesla P100, Whitepaper, 2016.

S. Wienke, P. Springer, C. Terboven, and D. an Mey, OpenACC-first experiences with real-world applications, Euro-Par, pp.859-870, 2012.

S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, Commun. ACM, vol.52, pp.65-76, 2009.

S. Xu, Y. Xu, W. Xue, X. Shen, X. Huang et al., Taming the "Monster": Overcoming program optimization challenges on SW26010 through precise performance modeling, Parallel and Distributed Processing Symposium (IPDPS), 2018.

X. Zhang, G. Tan, S. Xue, J. Li, K. Zhou et al., Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.31-43, 2017.

Y. Zhang and J. Owens, A quantitative performance analysis model for GPU architectures, 2011 IEEE 17th International Symposium on High Performance Computer Architecture, pp.382-393, 2011.

K. Zhou, G. Tan, X. Zhang, C. Wang, and N. Sun, A performance analysis framework for exploiting GPU microarchitectural capability, Proceedings of the International Conference on Supercomputing. ACM, 15, 2017.