F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, Tarazu, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.61-74, 2012.
DOI : 10.1145/2189750.2150984

C. Akel, Y. Kashnikov, P. De-oliveira-castro, and W. Jalby, Is Source-Code Isolation Viable for Performance Characterization?, 2013 42nd International Conference on Parallel Processing, pp.977-984, 2013.
DOI : 10.1109/ICPP.2013.116

URL : https://hal.archives-ouvertes.fr/hal-00952290

M. Amini, Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators, 2012.
URL : https://hal.archives-ouvertes.fr/pastel-00958033

M. Amini, C. Ancourt, F. Coelho, B. Creusillet, S. Guelton et al., PIPS is not (just) polyhedral software adding GPU code generation in PIPS, First International Workshop on Polyhedral Compilation Techniques (IMPACT 2011) in conjunction with CGO 2011, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00744312

M. Amini, B. Creusillet, S. Even, R. Keryell, O. Goubier et al., Par4all: From convex array regions to heterogeneous computing, IMPACT 2012: Second International Workshop on Polyhedral Compilation Techniques HiPEAC 2012, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00744733

M. Arora, S. Nath, S. Mazumdar, S. B. Baden, and D. M. Tullsen, Redefining the role of the CPU in the era of CPU-GPU integration, IEEE Micro, issue.6, pp.324-340, 2012.

C. Augonnet, S. Thibault, and R. Namyst, Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures, 3rd Workshop on Highly Parallel Processing on a Chip, 2009.
DOI : 10.1007/978-3-642-14122-5_9

URL : https://hal.archives-ouvertes.fr/inria-00421333

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A unified platform for task scheduling on heterogeneous multicore architectures, EuroPar 2009, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

S. Baghdadi, A. Größlinger, and A. Cohen, Putting Automatic Polyhedral Compilation for GPGPU to Work, Proceedings of the 15th Workshop on Compilers for Parallel Computers (CPC'10), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00551517

L. Bagnères and C. Bastoul, Switchable Scheduling for Runtime Adaptation of Optimization, Euro-Par 2014 Parallel Processing, pp.222-233, 2014.
DOI : 10.1007/978-3-319-09873-9_19

M. M. Baskaran, J. Ramanujam, and P. Sadayappan, Automatic C-to-CUDA code generation for affine programs, Proceedings of the 19th Joint European Conference on Theory and Practice of Software, International Conference on Compiler Construction, CC'10/ETAPS'10, pp.244-263, 2010.

C. Bastoul, Chunky ANalyzer for Dependences in Loops. http://icps. u-strasbg, 2008.

C. Bastoul, Extracting polyhedral representation from high level languages, 2008.

C. Bastoul, Code generation in the polyhedral model is easier than you think, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004., pp.7-16, 2004.
DOI : 10.1109/PACT.2004.1342537

URL : https://hal.archives-ouvertes.fr/hal-00017260

C. Bastoul, Improving data locality in static control programs, PhD thesis, 2004.

C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam, Putting Polyhedral Loop Transformations to Work, LCPC'16 International Workshop on Languages and Compilers for Parallel Computers, pp.209-225, 2003.
DOI : 10.1007/978-3-540-24644-2_14

URL : https://hal.archives-ouvertes.fr/inria-00071681

N. Bell and J. Hoberock, THRUST: a productivity-oriented library for CUDA, GPU Computing Gems, vol.7, 2011.
DOI : 10.1016/B978-0-12-811986-0.00033-9

M. E. Belviranli, L. N. Bhuyan, and R. Gupta, A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures, ACM Trans. Archit. Code Optim., vol.9, issue.4, pp.57:1-57:20, 2013.

F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou, Iterative compilation in a non-linear optimisation space, Workshop on Profile and Feedback-Directed Compilation, 1998.
URL : https://hal.archives-ouvertes.fr/inria-00475919

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, PLDI '08, pp.101-113, 2008.

U. K. R. Bondhugula, Effective automatic parallelization and locality optimization using the polyhedral model, 2008.

M. Boyer, K. Skadron, S. Che, and N. Jayasena, Load balancing in a changing world, Proceedings of the ACM International Conference on Computing Frontiers, CF '13, pp.1-21, 2013.
DOI : 10.1145/2482767.2482794

I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian et al., Brook for GPUs, ACM Transactions on Graphics, vol.23, issue.3, pp.777-786, 2004.
DOI : 10.1145/1015706.1015800

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.44-54, 2009.
DOI : 10.1109/IISWC.2009.5306797

L. Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao, Dynamic load balancing on single- and multi-GPU systems, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp.1-12, 2010.
DOI : 10.1109/IPDPS.2010.5470413

M. Cintra and D. R. Llanos, Toward efficient and robust software speculative parallelization on multiprocessors, ACM SIGPLAN Notices, pp.13-24, 2003.

A. Cohen, S. Girbal, and O. Temam, A Polyhedral Approach to Ease the Composition of Program Transformations, Euro-Par 2004 Parallel Processing, pp.292-303, 2004.
DOI : 10.1007/978-3-540-27866-5_38

URL : https://hal.archives-ouvertes.fr/hal-01257301

M. Cole, Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming, Parallel Computing, vol.30, issue.3, pp.389-406, 2004.
DOI : 10.1016/j.parco.2003.12.002

D. Couroussé, V. Lomüller, and H. Charles, Introduction to Dynamic Code Generation: An Experiment with Matrix Multiplication for the STHORM Platform, pp.103-122, 2014.
DOI : 10.1007/978-1-4614-8800-2_6

H. Cui, L. Wang, J. Xue, Y. Yang, and X. Feng, Automatic Library Generation for BLAS3 on GPUs, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.255-265, 2011.
DOI : 10.1109/IPDPS.2011.33

L. Dagum and R. Menon, OpenMP: an industry standard API for shared-memory programming, IEEE Computational Science and Engineering, vol.5, issue.1, pp.46-55, 1998.
DOI : 10.1109/99.660313

H. Deleau, C. Jaillet, and M. Krajecki, GPU4SAT: solving the SAT problem on GPU, PARA 2008 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, 2008.

P. Di and J. Xue, Model-Driven Tile Size Selection for DOACROSS Loops on GPUs, Euro-Par 2011 Parallel Processing, pp.401-412, 2011.
DOI : 10.1007/978-3-642-23397-5_40

G. Diamos, A. Kerr, and M. Kesavan, Translating GPU binaries to tiered SIMD architectures with ocelot, 2009.

G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems, Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pp.353-364, 2010.

R. Dolbeau, S. Bihan, and F. Bodin, HMPP: A hybrid multi-core parallel programming environment, Workshop on General Purpose Processing on Graphics Processing Units, 2007.

A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell et al., OmpSs: a proposal for programming heterogeneous multi-core architectures, Parallel Processing Letters, pp.173-193, 2011.

J. Fang, A. L. Varbanescu, and H. Sips, A Comprehensive Performance Comparison of CUDA and OpenCL, 2011 International Conference on Parallel Processing, pp.216-225, 2011.
DOI : 10.1109/ICPP.2011.45

P. Feautrier, Array expansion, Proceedings of the 2nd International Conference on Supercomputing, ICS '88, pp.429-441, 1988.
URL : https://hal.archives-ouvertes.fr/hal-01099746

P. Feautrier, Dataflow analysis of array and scalar references, International Journal of Parallel Programming, vol.24, issue.4, 1991.
DOI : 10.1007/BF01407931

P. Feautrier, Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time, International Journal of Parallel Programming, vol.2, issue.4, 1992.
DOI : 10.1007/BF01379404

P. Feautrier, Toward automatic partitioning of arrays on distributed memory computers, Proceedings of the 7th international conference on Supercomputing, ICS '93, pp.175-184, 1993.
DOI : 10.1145/165939.165968

G. Fursin, A. Cohen, M. O'Boyle, and O. Temam, A Practical Method for Quickly Evaluating Program Optimizations, Proceedings of the International Conference on High Performance Embedded Architectures & Compilers, pp.29-46, 2005.
DOI : 10.1007/11587514_4

URL : https://hal.archives-ouvertes.fr/inria-00001054

G. Fursin, R. Miceli, A. Lokhmotov, M. Gerndt, M. Baboulin et al., Collective Mind: Towards Practical and Collaborative Auto-Tuning, Scientific Programming, pp.309-329, 2014.
DOI : 10.1155/2014/797348

URL : https://hal.archives-ouvertes.fr/hal-01054763

M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick et al., Parallel Computing Experiences with CUDA, IEEE Micro, vol.28, issue.4, pp.13-27, 2008.
DOI : 10.1109/MM.2008.57

P. Gepner and M. F. Kowalik, Multi-Core Processors: New Way to Achieve High System Performance, International Symposium on Parallel Computing in Electrical Engineering (PARELEC'06), pp.9-13, 2006.
DOI : 10.1109/PARELEC.2006.54

S. Ghosh, T. Liao, H. Calandra, and B. M. Chapman, Experiences with OpenMP, PGI, HMPP and OpenACC Directives on ISO/TTI Kernels, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp.691-700, 2012.
DOI : 10.1109/SC.Companion.2012.95

N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli, High performance discrete Fourier transforms on graphics processors, Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008.

D. Grewe and M. F. P. O'Boyle, A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL, Compiler Construction, pp.286-305, 2011.
DOI : 10.1007/978-3-540-92990-1_4

T. Grosser, A. Größlinger, and C. Lengauer, Polly - performing polyhedral optimizations on a low-level intermediate representation, Parallel Processing Letters, 2012.

D. Hefenbrock, J. Oberg, N. T. N. Thanh, R. Kastner, and S. B. Baden, Accelerating Viola-Jones Face Detection to FPGA-Level Using GPUs, 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pp.11-18, 2010.
DOI : 10.1109/FCCM.2010.12

J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, 2011.

S. Hong and H. Kim, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, ACM SIGARCH Computer Architecture News, vol.37, issue.3, pp.152-163, 2009.
DOI : 10.1145/1555815.1555775

D. Horn, Chapter 36 Stream reduction operations for GPGPU applications, 2005.

F. Irigoin, P. Jouvelot, and R. Triolet, Semantical interprocedural parallelization: An overview of the PIPS project, Proceedings of the 5th international conference on Supercomputing, pp.244-251, 1991.
URL : https://hal.archives-ouvertes.fr/hal-00984684

I. Lane, J. Kim, and J. Chong, HYDRA: a hybrid CPU/GPU speech recognition engine for real-time LVCSR, GPU Technology Conference, 2013.

A. Jimborean, Adapting the polytope model for dynamic and speculative parallelization
URL : https://hal.archives-ouvertes.fr/tel-00733850

A. Jimborean, P. Clauss, J. Dollinger, V. Loechner, and J. Caamaño, Dynamic and Speculative Polyhedral Parallelization Using Compiler-Generated Skeletons, International Journal of Parallel Programming, vol.30, issue.3, pp.529-545, 2014.
DOI : 10.1007/s10766-013-0259-4

URL : https://hal.archives-ouvertes.fr/hal-00825738

A. Jimborean, P. Clauss, B. Pradelle, L. Mastrangelo, and V. Loechner, Adapting the polyhedral model as a framework for efficient speculative parallelization, PPoPP '12, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00664353

A. Jimborean, L. Mastrangelo, V. Loechner, and P. Clauss, VMAD: An Advanced Dynamic Program Analysis and Instrumentation Framework
DOI : 10.1007/978-3-642-28652-0_12

T. A. Johnson, R. Eigenmann, and T. N. Vijaykumar, Speculative thread decomposition through empirical optimization, Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, 2007.
DOI : 10.1145/1229428.1229474

K. Karimi, N. G. Dickson, and F. Hamze, A performance comparison of CUDA and OpenCL, arXiv preprint, 2010.

T. Karras and T. Aila, Fast parallel construction of high-quality bounding volume hierarchies, Proceedings of the 5th High-Performance Graphics Conference, HPG '13, pp.89-99, 2013.
DOI : 10.1145/2492045.2492055

M. A. Khan, H. Charles, and D. Barthou, Improving performance of optimized kernels through fast instantiations of templates, Concurr. Comput.: Pract. Exper., vol.21, issue.1, 2009.

M. Kicherer, F. Nowak, R. Buchty, and W. Karl, Seamlessly portable applications, ACM Transactions on Architecture and Code Optimization, vol.8, issue.4, pp.42:1-42:20, 2012.
DOI : 10.1145/2086696.2086721

H. Kim, N. P. Johnson, J. W. Lee, S. A. Mahlke, and D. I. August, Automatic speculative DOALL for clusters, Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, 2012.
DOI : 10.1145/2259016.2259029

J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, SnuCL, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.341-352, 2012.
DOI : 10.1145/2304576.2304623

T. Komoda, S. Miwa, H. Nakamura, and N. Maruyama, Integrating Multi-GPU Execution in an OpenACC Compiler, 2013 42nd International Conference on Parallel Processing, 2013.
DOI : 10.1109/ICPP.2013.35

C. Lauterbach, Q. Mo, and D. Manocha, gProximity: Hierarchical GPU-based Operations for Collision and Distance Queries, Computer Graphics Forum, vol.2, issue.4, pp.419-428, 2010.
DOI : 10.1111/j.1467-8659.2009.01611.x

J. Lee, M. Samadi, Y. Park, and S. Mahlke, Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pp.245-256, 2013.

S. Lee, T. A. Johnson, and R. Eigenmann, Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation, Languages and Compilers for Parallel Computing, pp.539-553, 2004.
DOI : 10.1007/978-3-540-24644-2_35

V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim et al., Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, ACM SIGARCH Computer Architecture News, vol.38, pp.451-460, 2010.

A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford et al., A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction, Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pp.51-61, 2010.
DOI : 10.1145/1735688.1735698

URL : https://hal.archives-ouvertes.fr/inria-00551084

J. Levon and P. Elie, Oprofile: A system profiler for linux, 2004.

C. Li, F. Gava, and G. Hains, Implementation of Data-Parallel Skeletons: A Case Study Using a Coarse-Grained Hierarchical Model, 2012 11th International Symposium on Parallel and Distributed Computing, pp.26-33, 2012.
DOI : 10.1109/ISPDC.2012.12

Y. Li, J. Dongarra, and S. Tomov, A Note on Auto-tuning GEMM for GPUs, Computational Science - ICCS 2009, pp.884-892, 2009.
DOI : 10.1007/978-3-642-01970-8_89

W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss et al., POSH, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '06, pp.158-167, 2006.
DOI : 10.1145/1122971.1122997

Y. Liu, E. Z. Zhang, and X. Shen, A cross-input adaptive framework for GPU program optimizations, Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, pp.1-10, 2009.

V. Loechner, PolyLib: A library for manipulating parameterized polyhedra, 1999.

V. Lomüller and H. Charles, Speculative runtime parallelization of loop nests: Towards greater scope and efficiency, 17th Workshop on Compilers for Parallel Computing, 2013.

C. Luk, S. Hong, and H. Kim, Qilin, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Micro-42, pp.45-55, 2009.
DOI : 10.1145/1669112.1669121

J. Meng, A. Vitali, K. Morozov, V. Kumaran, T. D. Vishwanath et al., GROPHECY, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-14, 2011.
DOI : 10.1145/2063384.2063402

Q. Meyer, F. Schonfeld, M. Stamminger, and R. Wanka, 3-SAT on CUDA: Towards a massively parallel SAT solver, 2010 International Conference on High Performance Computing & Simulation, pp.306-313, 2010.
DOI : 10.1109/HPCS.2010.5547116

C. Moore, Data processing in Exascale-class computing systems, The Salishan Conference on High Speed Computing, 2011.

N. Nethercote and J. Seward, Valgrind: A framework for heavyweight dynamic binary instrumentation, Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pp.89-100, 2007.

F. Noël, L. Hornof, C. Consel, and J. L. Lawall, Automatic, template-based run-time specialization: implementation and experimental study, Proceedings of the 1998 International Conference on Computer Languages (Cat. No.98CB36225), 1998.
DOI : 10.1109/ICCL.1998.674164

C. Nugteren and H. Corporaal, The boat hull model, ACM SIGPLAN Notices, vol.47, issue.8, pp.291-292, 2012.
DOI : 10.1145/2370036.2145859

C. Nugteren and H. Corporaal, Introducing 'Bones', Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pp.1-10, 2012.
DOI : 10.1145/2159430.2159431

NVIDIA, CUDA: performance of applications

NVIDIA, CUDA: Compute Unified Device Architecture, 2007.

R. Westermann and P. Kipfer, Chapter 46 Improved GPU sorting, 2005.

D. A. Padua and M. J. Wolfe, Advanced compiler optimizations for supercomputers, Commun. ACM, vol.29, issue.12, pp.1184-1201, 1986.

K. Palem and A. Lingamneni, What to do about the end of Moore's law, probably!, Proceedings of the 49th Annual Design Automation Conference, DAC '12, pp.924-929, 2012.
DOI : 10.1145/2228360.2228525

R. A. Patel, Y. Zhang, J. Mak, A. Davidson, and J. D. Owens, Parallel lossless data compression on the GPU, 2012 Innovative Parallel Computing (InPar), pp.1-9, 2012.
DOI : 10.1109/InPar.2012.6339599

M. Peres, Reverse engineering power management on Nvidia GPUs-anatomy of an autonomic-ready system, ECRTS, Operating Systems Platforms for Embedded Real-Time applications 2013, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00853849

J. Planas, R. M. Badia, E. Ayguade, and J. Labarta, Self-Adaptive OmpSs Tasks in Heterogeneous Environments, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp.138-149, 2013.
DOI : 10.1109/IPDPS.2013.53

S. Pop, A. Cohen, C. Bastoul, S. Girbal, G. Silber et al., GRAPHITE: Loop optimizations based on the polyhedral model for GCC, 2006.
URL : https://hal.archives-ouvertes.fr/hal-01257284

L. Pouchet, FM: the Fourier-Motzkin library

L. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos, Iterative optimization in the polyhedral model, ACM SIGPLAN Notices, vol.43, issue.6, pp.90-100, 2008.
DOI : 10.1145/1379022.1375594

URL : https://hal.archives-ouvertes.fr/inria-00419974

L. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam et al., Loop transformations, ACM SIGPLAN Notices, vol.46, issue.1, pp.549-562, 2011.
DOI : 10.1145/1925844.1926449

URL : https://hal.archives-ouvertes.fr/hal-01257283

M. K. Prabhu and K. Olukotun, Using thread-level speculation to simplify manual parallelization, PPoPP '03, 2003.

B. Pradelle, P. Clauss, and V. Loechner, Adaptive runtime selection of parallel schedules in the polytope model, 19th High Performance Computing Symposium - HPC 2011, 2011.

B. Pradelle, A. Ketterlin, and P. Clauss, Polyhedral parallelization of binary code, ACM Transactions on Architecture and Code Optimization, vol.8, issue.4, article 39, 2012.

D. Quinlan, ROSE: Compiler support for object-oriented frameworks, Parallel Processing Letters, pp.215-226, 2000.

E. Raman, R. Rangan, and D. I. August, Spice, Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, CGO '08, pp.175-184, 2008.
DOI : 10.1145/1356058.1356082

L. Rauchwerger and D. A. Padua, The LRPD test: Speculative runtime parallelization of loops with privatization and reduction parallelization, IEEE Transactions on Parallel and Distributed Systems, vol.10, issue.2, pp.160-180, 1999.

G. Ruetsch and P. Micikevicius, Optimizing matrix transpose in CUDA, 2010.

S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk et al., Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP '08, pp.73-82, 2008.
DOI : 10.1145/1345206.1345220

S. Ryoo, C. I. Rodrigues, S. S. Stone et al., Program optimization space pruning for a multithreaded GPU, Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, CGO '08, pp.195-204, 2008.
DOI : 10.1145/1356058.1356084

M. Samadi, A. Hormati, J. Lee, and S. Mahlke, Paragon, Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pp.64-73, 2012.
DOI : 10.1145/2159430.2159438

A. Schrijver, Theory of linear and integer programming, 1986.

E. Schweitz, R. Lethin, A. Leung, and B. Meister, R-stream: A parametric high level compiler, HPEC, 2006.

K. Shirahata, H. Sato, and S. Matsuoka, Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters, 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp.733-740, 2010.
DOI : 10.1109/CloudCom.2010.55

J. Sim, A. Dasgupta, H. Kim, and R. Vuduc, A performance analysis framework for identifying potential benefits in GPGPU applications, Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pp.11-22, 2012.

F. Smith, D. Grossman, G. Morrisett, L. Hornof, and T. Jim, Compiling for template-based run-time code generation, Journal of Functional Programming, vol.13, issue.3, 2003.
DOI : 10.1017/S095679680200463X

A. Sukumaran-Rajam, L. E. Campostrini, J. M. Martinez Caamaño, and P. Clauss, Speculative runtime parallelization of loop nests: Towards greater scope and efficiency, HIPS + LSPP, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01155172

W. Sun and R. Ricci, Augmenting operating systems with the GPU. arXiv preprint, 2013.

H. Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal

S. Tavarageri, L. Pouchet, J. Ramanujam, A. Rountev, and P. Sadayappan, Dynamic selection of tile sizes, 2011 18th International Conference on High Performance Computing, pp.1-10, 2011.
DOI : 10.1109/HiPC.2011.6152742

C. J. Thompson, S. Hahn, and M. Oskin, Using modern graphics architectures for general-purpose computing: a framework and analysis, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings., pp.306-317, 2002.
DOI : 10.1109/MICRO.2002.1176259

C. Tian, M. Feng, and R. Gupta, Speculative parallelization using state separation and multiple value prediction, Proceedings of the 2010 international symposium on Memory management, ISMM '10, 2010.
DOI : 10.1145/1806651.1806663

K. H. Tsoi and W. Luk, Axel: A heterogeneous cluster with FPGAs and GPUs, Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '10, pp.115-124, 2010.

S. Venkatasubramanian and R. W. Vuduc, Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems, Proceedings of the 23rd international conference on Supercomputing, ICS '09, pp.244-255, 2009.
DOI : 10.1145/1542275.1542312

S. Verdoolaege, isl: An Integer Set Library for the Polyhedral Model, Mathematical Software - ICMS 2010, pp.299-302, 2010.
DOI : 10.1007/978-3-642-15582-6_49

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado et al., Polyhedral parallel code generation for CUDA, ACM Transactions on Architecture and Code Optimization, vol.9, issue.4, pp.54:1-54:23, 2013.
DOI : 10.1145/2400682.2400713

URL : https://hal.archives-ouvertes.fr/hal-00786677

S. Verdoolaege and T. Grosser, Polyhedral extraction tool, International Workshop on Polyhedral Compilation Techniques (IMPACT'12), 2012.

S. Verdoolaege, R. Seghir, K. Beyls, V. Loechner, and M. Bruynooghe, Counting Integer Points in Parametric Polytopes Using Barvinok's Rational Functions, Algorithmica, vol.48, issue.1, pp.37-66, 2007.
DOI : 10.1007/s00453-006-1231-0

V. Volkov, Better performance at lower occupancy, GPU Technology Conference, 2010.

V. Volkov and J. Demmel, LU, QR and Cholesky factorizations using vector capabilities of GPUs, EECS Department, pp.2008-2057, 2008.

V. Volkov and J. W. Demmel, Benchmarking GPUs to tune dense linear algebra, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2008.
DOI : 10.1109/SC.2008.5214359

S. Wienke, D. an Mey, and M. S. Müller, Accelerators for Technical Computing: Is It Worth the Pain? A TCO Perspective, 2013.
DOI : 10.1007/978-3-642-38750-0_25

S. Wienke, P. Springer, C. Terboven, and D. an Mey, OpenACC - First Experiences with Real-World Applications, Proceedings of the 18th International Conference on Parallel Processing, Euro-Par'12, pp.859-870, 2012.
DOI : 10.1007/978-3-642-32820-6_85

N. Wirth, A plea for lean software, Computer, vol.28, issue.2, pp.64-68, 1995.
DOI : 10.1109/2.348001