N. E. Abel, P. P. Budnik, D. J. Kuck, Y. Muraoka, R. S. Northcote et al., TRANQUIL: a language for an array processing computer, AFIPS. ACM, pp.57-73, 1969.

A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, 2006.

A. Aiken and D. Gay, Barrier inference, POPL, pp.342-354, 1998.

A. W. Appel, SSA is functional programming, SIGPLAN Notices, vol.33, pp.17-20, 1998.

S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu, An adaptive performance modeling tool for GPU architectures, pp.105-114, 2010.

L. A. Belady, A study of replacement algorithms for a virtual storage computer, IBM Systems Journal, vol.5, pp.78-101, 1966.

G. Blelloch and S. Chatterjee, VCODE: A data-parallel intermediate language, FMPC. ACM, pp.471-480, 1990.

P. Boudier and G. Sellers, Memory system on Fusion APUs, AMD Fusion Developer Summit. AMD, 2011.

L. Bougé and J. Levaire, Control structures for data-parallel SIMD languages: semantics and implementation, Future Generation Computer Systems, vol.8, pp.363-378, 1992.

W. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh et al., The Illiac IV system, Proceedings of the IEEE, vol.60, no.4, pp.369-388, 1972.

P. Briggs, K. D. Cooper, and L. Torczon, Rematerialization, PLDI. ACM, pp.311-321, 1992.

K. Brockmann and R. Wanka, Efficient oblivious parallel sorting on the MasPar MP-1, ICSS, vol.1, p.200.

Z. Budimlic, K. D. Cooper, T. J. Harvey, K. Kennedy, T. S. Oberg et al., Fast copy coalescing and live-range identification, PLDI. ACM, pp.25-32, 2002.

B. Jang, D. Schaa, P. Mistry, and D. Kaeli, Static memory access pattern analysis on a massively parallel GPU, SAAHPC.

S. Carrillo, J. Siegel, and X. Li, A control-structure splitting optimization for GPGPU, Computing Frontiers, pp.147-150, 2009.

D. Cederman and P. Tsigas, GPU-quicksort: A practical quicksort algorithm for graphics processors, Journal of Experimental Algorithmics, vol.14, pp.4-24, 2009.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, pp.44-54, 2009.

J. Choi, R. Cytron, and J. Ferrante, Automatic construction of sparse data flow evaluation graphs, POPL. ACM, pp.55-66, 1991.

S. Collange, D. Defour, and Y. Zhang, Dynamic detection of uniform and affine vectors in GPGPU computations, HPPC, pp.46-55, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00396719

P. Cousot and N. Halbwachs, Automatic discovery of linear restraints among variables of a program, POPL. ACM, pp.84-96, 1978.

B. Coutinho, D. Sampaio, F. M. Pereira, and W. Meira Jr., Divergence analysis and optimizations, PACT. IEEE, pp.320-329, 2011.

B. Coutinho, D. Sampaio, F. M. Pereira, and W. Meira Jr., Profiling divergences in GPU applications, Concurrency and Computation: Practice and Experience, vol.25, pp.775-789, 2013.

R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, Efficiently computing static single assignment form and the control dependence graph, TOPLAS, vol.13, pp.451-490, 1991.

F. Darema, D. A. George, V. A. Norton, and G. F. Pfister, A single-program-multiple-data computational model for EPEX/FORTRAN, Parallel Computing, vol.7, pp.11-24, 1988.

G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, Ocelot, a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems, PACT. IEEE, pp.354-364, 2010.

C. A. Farrell and D. H. Kieronska, Formal specification of parallel SIMD execution, Theo. Comp. Science, vol.169, pp.39-65, 1996.

J. Ferrante, K. J. Ottenstein, and J. D. Warren, The program dependence graph and its use in optimization, TOPLAS, vol.9, pp.319-349, 1987.

M. Flynn, Some computer organizations and their effectiveness, IEEE Trans. Comput., vol.C-21, 1972.

W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, Dynamic warp formation and scheduling for efficient GPU control flow, pp.407-420, 2007.

M. Garland and D. B. Kirk, Understanding throughput-oriented architectures, Commun. ACM, vol.53, pp.58-66, 2010.

C. Gou and G. Gaydadjiev, Addressing GPU on-chip shared memory bank conflicts using elastic pipeline, International Journal of Parallel Programming, vol.41, pp.400-429, 2013.

V. Grover, B. Joannes, M. Aarts, and M. Murphy, Variance analysis for translating CUDA code for execution by a general purpose processor, 2009.

S. Hack and G. Goos, Optimal register allocation for SSA-form programs in polynomial time, Information Processing Letters, vol.98, pp.150-155, 2006.

T. D. Han and T. S. Abdelrahman, Reducing branch divergence in GPU programs, GPGPU-4. ACM, vol.3, pp.1-3, 2011.

R. Karrenberg and S. Hack, Whole-function vectorization, CGO. IEEE, pp.141-150, 2011.

R. Karrenberg and S. Hack, Improving performance of OpenCL on CPUs, CC, pp.1-20, 2012.

R. Keryell, P. Materat, and N. Paris, POMP, or how to design a massively parallel machine with small developments, PARLE, pp.83-100, 1991.
URL : https://hal.archives-ouvertes.fr/hal-01166357

S. Kung, K. S. Arun, R. J. Gal-Ezer, and D. V. Bhaskar-Rao, Wavefront array processor: Language, architecture, and applications, IEEE Trans. Comput., vol.31, pp.1054-1066, 1982.

A. Lashgar and A. Baniasadi, Performance in GPU architectures: Potentials and distances, WDDD. IEEE, pp.75-81, 2011.

D. H. Lawrie, T. Layman, D. Baer, and J. M. Randal, Glypnir: a programming language for Illiac IV, Commun. ACM, vol.18, pp.157-164, 1975.

S. Lee, S. Min, and R. Eigenmann, OpenMP to GPGPU: a compiler framework for automatic translation and optimization, PPoPP. ACM, pp.101-110, 2009.

Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart et al., Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators, ISCA. ACM, pp.129-140, 2011.

Y. Lee, R. Krashinsky, V. Grover, S. W. Keckler, and K. Asanovic, Convergence and scalarization for data-parallel architectures, CGO. ACM, pp.1-11, 2013.

R. Leissa, S. Hack, and I. Wald, Extending a C-like language for portable SIMD programming, PPoPP. ACM, pp.65-74, 2012.

J. Meng, D. Tarjan, and K. Skadron, Dynamic warp subdivision for integrated branch and memory divergence tolerance, ISCA. ACM, pp.235-246, 2010.

A. Miné, The octagon abstract domain, Higher Order Symbol. Comput., vol.19, pp.31-100, 2006.

S. Mu, X. Zhang, N. Zhang, J. Lu, Y. S. Deng et al., IP routing processing with graphic processors, pp.93-98, 2010.

J. Nickolls and W. J. Dally, The GPU computing era, IEEE Micro, vol.30, pp.56-69, 2010.

J. Nickolls and D. Kirk, Computer Organization and Design, (Patterson and Hennessy) 4th Ed, 2009.

F. Nielson, H. R. Nielson, and C. Hankin, Principles of program analysis, 2005.

K. J. Ottenstein, R. A. Ballance, and A. B. Maccabe, The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages, PLDI. ACM, pp.257-271, 1990.

F. M. Pereira, , 2011.

R. H. Perrot, A language for array and vector processors, TOPLAS, vol.1, pp.177-195, 1979.

M. Pharr and W. R. Mark, ISPC: a SPMD compiler for high-performance CPU programming, 2012.

M. Poletto and V. Sarkar, Linear scan register allocation, TOPLAS, vol.21, pp.895-913, 1999.

T. Prabhu, S. Ramalingam, M. Might, and M. Hall, EigenCFA: Accelerating flow analysis with GPUs, 2011.

S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk et al., Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, pp.73-82, 2008.

B. Saha, X. Zhou, H. Chen, Y. Gao, S. Yan et al., Programming model for a heterogeneous x86 platform, PLDI. ACM, pp.431-440, 2009.

M. Samadi, A. Hormati, M. Mehrara, and S. Mahlke, Adaptive input-aware compilation for graphics engines, 2012.

D. Sampaio, R. Martins, S. Collange, and F. M. Pereira, Divergence analysis with affine constraints, SBAC-PAD, pp.137-146, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00650235

D. N. Sampaio, E. Gedeon, F. M. Pereira, and S. Collange, Spill code placement for SIMD machines, SBLP. SBC, pp.12-26, 2012.

E. F. de O. Sandes and A. C. M. A. de Melo, CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences, PPoPP. ACM, pp.137-146, 2010.

B. Scholz, C. Zhang, and C. Cifuentes, User-input dependence analysis via graph reachability, 2008.

J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy et al., Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs, pp.111-119, 2010.

J. A. Stratton, C. Rodrigues, I. Sun, N. Obeid, L. Chang et al., The parboil report, 2012.

P. Tu and D. Padua, Efficient building and placing of gating functions, PLDI. ACM, pp.47-55, 1995.

M. Weiser, Program slicing, ICSE. IEEE, pp.439-449, 1981.

Y. Yang, P. Xiang, J. Kong, and H. Zhou, A GPGPU compiler for memory optimization and parallelism management, PLDI. ACM, pp.86-97, 2010.

E. Z. Zhang, Y. Jiang, Z. Guo, and X. Shen, Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping, ICS. ACM, pp.115-126, 2010.

E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen, On-the-fly elimination of dynamic irregularities for GPU computing, ASPLOS. ACM, pp.369-380, 2011.

Y. Zhang and J. D. Owens, A quantitative performance analysis model for GPU architectures, HPCA. ACM, pp.382-393, 2011.