A. Aho, M. Lam, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools, 2007.

A. V. Aho and S. C. Johnson, Optimal Code Generation for Expression Trees, Proceedings of Seventh Annual ACM Symposium on Theory of Computing (STOC '75), pp.207-217, 1975.
DOI : 10.1145/800116.803770

A. V. Aho, S. C. Johnson, and J. D. Ullman, Code Generation for Expressions with Common Subexpressions, J. ACM, vol.24, pp.146-160, 1977.
DOI : 10.1145/321992.322001

A. W. Appel and K. J. Supowit, Generalization of the Sethi-Ullman Algorithm for Register Allocation, Softw. Pract. Exper, vol.17, issue.6, pp.417-421, 1987.

V. Bandishti, I. Pananilath, and U. Bondhugula, Tiling Stencil Computations to Maximize Parallelism, Proceedings of the International Conference on High Performance Computing, 2012.
DOI : 10.1109/sc.2012.107

P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker et al., Compiler-Directed Transformation for Higher-Order Stencils, Parallel and Distributed Processing Symposium (IPDPS), pp.313-323, 2015.
DOI : 10.1109/ipdps.2015.103

URL : https://cloudfront.escholarship.org/dist/prd/content/qt2vh6s0wb/qt2vh6s0wb.pdf?t=ooy3al

D. A. Berson, R. Gupta, and M. L. Soffa, Integrated Instruction Scheduling and Register Allocation Techniques, Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC '98), pp.247-262, 1999.
DOI : 10.1007/3-540-48319-5_16

URL : http://www.cs.pitt.edu/~soffa/research/Comp/lcpc98.ps

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, A Practical Automatic Polyhedral Parallelizer and Locality Optimizer, Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), pp.101-113, 2008.
DOI : 10.1145/1375581.1375595

URL : http://www.cse.ohio-state.edu/~bondhugu/publications/uday-pldi08.pdf

P. Briggs, K. D. Cooper, and L. Torczon, Rematerialization, Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (PLDI '92), pp.311-321, 1992.
DOI : 10.1145/143103.143143

P. Briggs, K. D. Cooper, and L. Torczon, Improvements to Graph Coloring Register Allocation, ACM Trans. Program. Lang. Syst, vol.16, issue.3, pp.428-455, 1994.
DOI : 10.1145/177492.177575

URL : http://www.cs.rice.edu/~grosul/612s01/toplas94.pdf

G. J. Chaitin, Register Allocation & Spilling via Graph Coloring, Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN '82), pp.98-105, 1982.
DOI : 10.1145/872726.806984

J. M. Codina, J. Sanchez, and A. Gonzalez, A unified modulo scheduling and register allocation technique for clustered processors, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques, pp.175-184, 2001.
DOI : 10.1109/pact.2001.953298

URL : http://upcommons.upc.edu/bitstream/2117/101361/1/00953298.pdf

Q. Colombet, B. Boissinot, P. Brisk, S. Hack, and F. Rastello, Graph-coloring and treescan register allocation using repairing, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pp.45-54, 2011.
DOI : 10.1145/2038698.2038708

URL : http://www1.cs.ucr.edu/faculty/philip/papers/conferences/cases11/cases11-treescan.pdf

R. de la Cruz, M. Araya-Polo, and J. M. Cela, Introducing the Semi-stencil Algorithm, Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I (PPAM'09), pp.496-506, 2010.

S. J. Deitz, B. L. Chamberlain, and L. Snyder, Eliminating Redundancies in Sum-of-product Array Computations, Proceedings of the 15th International Conference on Supercomputing (ICS '01), pp.65-77, 2001.

Ł. Domagała, D. van Amstel, F. Rastello, and P. Sadayappan, Register Allocation and Promotion Through Combined Instruction Scheduling and Loop Unrolling, Proceedings of the 25th International Conference on Compiler Construction, pp.143-151, 2016.

ExaCT: Center for Exascale Simulation of Combustion in Turbulence: Proxy App Software, 2013.

M. Frigo and S. G. Johnson, The Design and Implementation of FFTW3, Proc. IEEE, vol.93, issue.2, pp.216-231, 2005.

K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Trans. Math. Softw, vol.34, 2008.

R. Govindarajan, H. Yang, C. Zhang, J. N. Amaral, and G. R. Gao, Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs, Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS '01), pp.26-33, 2001.

T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege, Hybrid Hexagonal/Classical Tiling for GPUs, Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14), vol.66, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00911177

T. Gysi, T. Grosser, and T. Hoefler, MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures, Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15), pp.177-186, 2015.

M. Hall, J. Chame, C. Chen, J. Shin, G. Rudy et al., Loop Transformation Recipes for Code Generation and Auto-tuning, Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing (LCPC'09), pp.50-64, 2010.

A. B. Hayes, L. Li, D. Chavarría-Miranda, S. L. Song, and E. Z. Zhang, Orion: A Framework for GPU Occupancy Tuning, Proceedings of the 17th International Middleware Conference (Middleware '16), vol.18, p.13, 2016.

T. Henretty, R. Veras, F. Franchetti, L. Pouchet, J. Ramanujam et al., A Stencil Compiler for Short-vector SIMD Architectures, Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13), pp.13-24, 2013.

HPGMG: High-Performance Geometric Multigrid, 2016.

H. Jia-wei and H. T. Kung, I/O Complexity: The Red-blue Pebble Game, Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC '81), pp.326-333, 1981.

M. Jin, H. Fu, Z. Lv, and G. Yang, Libra: An Automated Code Generation and Tuning Framework for Register-limited Stencils on GPUs, Proceedings of the ACM International Conference on Computing Frontiers (CF '16), pp.92-99, 2016.

D. R. Koes and S. C. Goldstein, A Global Progressive Register Allocator, Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '06), pp.204-215, 2006.

S. Kral, F. Franchetti, J. Lorenz, C. W. Ueberhuber, and P. Wurzinger, FFT Compiler Techniques, Compiler Construction: 13th International Conference, pp.217-231, 2004.

C. Lattner and V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04), p.75, 2004.

A. Li, S. L. Song, A. Kumar, E. Z. Zhang, D. Chavarría-Miranda et al., Critical points based register-concurrency autotuning for GPUs, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp.1273-1278, 2016.

G.-Y. Lueh, T. Gross, and A.-R. Adl-Tabatabai, Fusion-based Register Allocation, ACM Trans. Program. Lang. Syst, vol.22, issue.3, pp.431-470, 2000.

H. Mössenböck and M. Pfeiffer, Linear Scan Register Allocation in the Context of SSA Form and Register Constraints, pp.229-246, 2002.

R. Motwani, K. V. Palem, V. Sarkar, and S. Reyen, Combining Register Allocation and Instruction Scheduling, 1995.

R. T. Mullapudi, V. Vasista, and U. Bondhugula, PolyMage: Automatic Optimization for Image Processing Pipelines, Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), pp.429-443, 2015.

C. Norris and L. L. Pollock, A scheduler-sensitive global register allocator, Supercomputing '93. Proceedings, pp.804-813, 1993.

NVIDIA CUDA Compiler Driver NVCC, docs.nvidia.com/cuda/cuda-compiler-driver-nvcc, 2017.

NVIDIA Profiler, 2017.

S. S. Pinter, Register Allocation with Instruction Scheduling, Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI '93), pp.248-257, 1993.

M. Poletto and V. Sarkar, Linear Scan Register Allocation, ACM Trans. Program. Lang. Syst, vol.21, pp.895-913, 1999.

F. M. Q. Pereira and J. Palsberg, Register Allocation by Puzzle Solving, Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), pp.216-226, 2008.

M. Ravishankar, J. Holewinski, and V. Grover, Forma: A DSL for Image Processing Applications to Target GPUs and Multi-core CPUs, Proc. 8th Workshop on General Purpose Processing Using GPUs, pp.109-120, 2015.

P. S. Rawat, C. Hong, M. Ravishankar, V. Grover, L. Pouchet et al., Resource Conscious Reuse-Driven Tiling for GPUs, Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT '16), pp.99-111, 2016.

H. Rong, Tree Register Allocation, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp.67-77, 2009.

V. Sarkar and R. Barik, Extended Linear Scan: An Alternate Foundation for Global Register Allocation, Proceedings of the 16th International Conference on Compiler Construction (CC'07), Springer-Verlag, pp.141-155, 2007.

R. Sethi and J. D. Ullman, The Generation of Optimal Code for Arithmetic Expressions, J. ACM, vol.17, pp.715-728, 1970.

M. D. Smith, N. Ramsey, and G. Holloway, A Generalized Algorithm for Graph-coloring Register Allocation, Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI '04), pp.277-288, 2004.

R. M. Stallman and the GCC Developer Community, Using The GNU Compiler Collection: A GNU Manual For GCC, 2009.

K. Stock, M. Kong, T. Grosser, L. Pouchet, F. Rastello et al., A Framework for Enhancing Data Reuse via Associative Reordering, Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14), pp.65-76, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01016093

Seismic Wave Modelling (SW4), Computational Infrastructure for Geodynamics, 2014.

S. Touati and C. Eisenbeis, Early Periodic Register Allocation on ILP Processors, Parallel Processing Letters, vol.14, issue.2, pp.287-313, 2004.
URL : https://hal.archives-ouvertes.fr/hal-00130623

S. Unkule, C. Shaltz, and A. Qasem, Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality, Proceedings of the 21st International Conference on Compiler Construction (CC'12), pp.21-40, 2012.

M. Wahib and N. Maruyama, Scalable Kernel Fusion for Memory-bound GPU Applications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp.191-202, 2014.

M. Wahib and N. Maruyama, Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15), pp.259-270, 2015.

J. Wang, A. Krall, M. A. Ertl, and C. Eisenbeis, Software Pipelining with Register Allocation and Spilling, Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO 27), pp.95-99, 1994.

J. Wu, A. Belevich, E. Bendersky, M. Heffernan, C. Leary et al., gpucc: An Open-source GPGPU Compiler, Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO '16), pp.105-116, 2016.

X. Xie, Y. Liang, X. Li, Y. Wu, G. Sun et al., Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs, Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48), pp.395-406, 2015.

J. Xue, On Tiling as a Loop Transformation, Parallel Processing Letters, vol.7, pp.409-424, 1997.