M. Guest, The scientific case for hpc in europe, vol.9, p.10, 2012.

J. D. Owens, A survey of general-purpose computation on graphics hardware, Eurographics 2005, State of the Art Reports, p.9, 2005.

J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio et al., The international exascale software project roadmap, Int. J. High Perform. Comput. Appl, vol.25, issue.1, p.12, 2011.

Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, p.13, 2012.

PGAS: Partitioned Global Address Space, p.14

G. Almási, P. Hargrove, I. Gabriel, and T. Zheng, Upc collectives library 2.0, Fifth Conference on Partitioned Global Address Space Programming Models, p.14, 2011.

R. W. Numrich and J. Reid, Co-array Fortran for parallel programming, SIGPLAN Fortran Forum, vol.17, issue.2, p.14, 1998.

K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit et al., Titanium: A high-performance java dialect, ACM, p.14, 1998.

B. L. Chamberlain, D. Callahan, and H. P. Zima, Parallel programmability and the chapel language, The International Journal of High Performance Computing Applications, vol.21, issue.3, p.14, 2007.

OpenACC Directives for Accelerators, p.14

OpenCL - the open standard for parallel programming of heterogeneous systems, p.15

J. Nickolls, I. Buck, M. Garland, and K. Skadron, Scalable parallel programming with cuda, Queue, vol.6, issue.2, p.16, 2008.

J. L. Lions, Ariane 5, flight 501, report of the inquiry board, p.16, 1996.

IEEE standard glossary of software engineering terminology, IEEE Std, vol.610, p.16, 1990.

S. Abbaspour Asadollah, H. Hansson, D. Sundmark, and S. Eldh, Towards classification of concurrency bugs based on observable properties, Proceedings of the First International Workshop on Complex faUlts and Failures in LargE Software Systems, COUFLESS '15, p.17, 2015.

G. Gopalakrishnan, P. D. Hovland, C. Iancu, S. Krishnamoorthy, I. Laguna et al., , p.18, 2017.

DDT debugger, p.19

G. J. Holzmann, The model checker spin, IEEE Transactions on Software Engineering, vol.23, p.19, 1997.

X. Qian, K. Sen, P. Hargrove, and C. Iancu, Sreplay: Deterministic subgroup replay for one-sided communication, p.19, 2016.

Q. Gao, F. Qin, and D. K. Panda, Dmtracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements, SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, p.19, 2007.

C. Lattner and V. Adve, LLVM: A compilation framework for lifelong program analysis and transformation, CGO, vol.19, p.125, 2004.

P. Huchant, M. Counilh, and D. Barthou, Automatic OpenCL task adaptation for heterogeneous architectures, Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing, vol.9833, p.131, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01419366

P. Huchant, D. Barthou, and M. Counilh, Adaptive Partitioning for Iterated Sequences of Irregular OpenCL Kernels, SBAC-PAD, vol.21, p.131, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01888216

P. Huchant, E. Saillard, D. Barthou, H. Brunie, and P. Carribault, Parcoach extension for a full-interprocedural collectives verification, 2018 IEEE/ACM 2nd International Workshop on Software Correctness for HPC Applications (Correctness), vol.21, p.134, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01937316

D. Barthou, P. Huchant, E. Saillard, and P. Carribault, Multi-valued expression analysis for collective checking, Proceedings of the 25th International Conference on Euro-Par 2019: Parallel Processing, vol.21, p.134, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02390025

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.26, p.133, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00384363

J. Lee, M. T. Tran, T. Odajima, T. Boku, and M. Sato, An extension of xcalablemp pgas lanaguage for multi-node gpu clusters, Euro-Par 2011: Parallel Processing Workshops, p.25, 2012.

S. Henry, A. Denis, D. Barthou, M. Counilh, and R. Namyst, Toward OpenCL Automatic Multi-Device Support, Proceedings of the 20th International Euro-Par Conference on Parallel Processing, vol.8632, p.25, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01005765

E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Proceedings of the 15th International Euro-Par Conference on Parallel Processing, p.25, 2009.

T. Gautier, J. Lima, N. Maillard, and B. Raffin, XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures, 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), p.25, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00799904

A. M. Aji, A. J. Peña, P. Balaji, and W. Feng, MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL, Parallel Computing, vol.58, p.25, 2016.

C. Luk, S. Hong, and H. Kim, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, Symp. on Microarchitecture, MICRO 42, vol.26, p.89, 2009.

P. Li, E. Brunet, F. Trahay, C. Parrot, G. Thomas et al., Automatic OpenCL Code Generation for Multi-device Heterogeneous Architectures, International Conference on Parallel Processing, vol.26, p.90, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01275482

Z. Wang, D. Grewe, and M. F. O'Boyle, Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems, ACM Trans. Archit. Code Optim, vol.11, issue.4, p.26, 2014.

S. Lee and R. Eigenmann, OpenMPC: Extended OpenMP Programming and Tuning for GPUs, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, p.26, 2010.

J. Kim, H. Kim, J. H. Lee, and J. Lee, Achieving a single compute device image in OpenCL for multiple GPUs, Principles and practice of Parallel Prog., PPoPP '11, vol.26, p.71, 2011.

J. Lee, M. Samadi, Y. Park, and S. Mahlke, SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration, ACM Trans. Comput. Syst, vol.33, issue.3, p.89, 2015.

J. Lee, M. Samadi, and S. Mahlke, Orchestrating Multiple Data-Parallel Kernels on Multiple Devices, Parallel Arch. and Compilation Techniques, vol.26, p.133, 2015.

B. Pérez, J. L. Bosque, and R. Beivide, Simplifying Programming and Load Balancing of Data Parallel Applications on Heterogeneous Systems, Proceedings of the 9th Annual Workshop on GPGPU, GPGPU '16, vol.26, p.90, 2016.

F. Zhang, B. Wu, J. Zhai, B. He, and W. Chen, FinePar: Irregularity-Aware Fine-Grained Workload Partitioning on Integrated Architectures, Proceedings of the 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), vol.26, p.91, 2017.

R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, An efficient method of computing static single assignment form, Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '89, vol.52, p.101, 1989.

C. Oat, J. Barczak, and J. Shopf, Efficient spatial binning on the gpu, vol.38, p.59, 2008.

S. Seo, G. Jo, and J. Lee, Performance Characterization of the NAS Parallel Benchmarks in OpenCL, Workload Characterization, vol.38, p.78, 2011.

N. Nakasato, G. Ogiya, Y. Miki, M. Mori, and K. Nomoto, Astrophysical Particle Simulations on Heterogeneous CPU-GPU Systems, CoRR, abs/1206.1199, vol.40, p.78, 2012.

B. Creusillet and F. Irigoin, Interprocedural array region analyses, International Journal of Parallel Programming, vol.24, issue.6, p.50, 1996.
URL : https://hal.archives-ouvertes.fr/hal-00752611

D. Frenkel, Understanding molecular simulation : from algorithms to applications, p.58, 2002.

B. Pérez, E. Stafford, J. L. Bosque, and R. Beivide, Energy efficiency of load balancing for data-parallel applications in heterogeneous systems, The Journal of Supercomputing, vol.73, issue.1, p.70, 2017.

R. Nozal, B. Pérez, and J. Bosque, Towards co-execution of massive data-parallel OpenCL kernels on CPU and Intel Xeon Phi, Proceedings of the 17th International Conference on Computational and Mathematical Methods in Science and Engineering, p.70, 2017.

K. Kofler, I. Grasso, B. Cosenza, and T. Fahringer, An Automatic Input-sensitive Approach for Heterogeneous Task Partitioning, Intl Conf. on Supercomputing, p.89, 2013.

R. Sakai, F. Ino, and K. Hagihara, Towards Automating Multi-dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System, Fourth International Symposium on Computing and Networking (CANDAR), p.71, 2016.

J. Kim, S. Seo, J. Lee, J. Nah, G. Jo et al., SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters, ACM Intl Conf. on Supercomputing, ICS '12, p.71, 2012.

P. Pandit and R. Govindarajan, Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices, Code Generation and Optimization, vol.72, p.90, 2014.

J. Lee, M. Samadi, and S. Mahlke, VAST: The illusion of a large memory space for GPUs, 23rd Intl Conf. on Parallel Architecture and Compilation Techniques (PACT), p.72, 2014.

C. G. Broyden, A class of methods for solving nonlinear simultaneous equations, Mathematics of Computation, vol.19, p.77, 1965.

AMD, ATI Stream Software Development Kit (SDK) v2.1, p.78, 2010.

S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, Auto-tuning a high-level language targeted to GPU codes, 2012 Innovative Parallel Computing (InPar), p.78, 2012.

A. Magni, C. Dubach, and M. O'Boyle, Automatic Optimization of Thread-coarsening for Graphics Processors, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, p.88, 2014.

D. Grewe and M. F. P. O'Boyle, A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL, Intl Conf. on Compiler Construction, p.88, 2011.

S. Seo, J. Lee, G. Jo, and J. Lee, Automatic OpenCL work-group size selection for multicore CPUs, Parallel Arch. and Compilation Techniques, p.89, 2013.

C. Lin, C. Hsieh, H. Chang, and P. Hsiung, Efficient Workload Balancing on Heterogeneous GPUs using Mixed-Integer Non-Linear Programming, Journal of Applied Research and Technology, vol.12, issue.6, p.89, 2014.

Z. Zhong, V. Rychkov, and A. Lastovetsky, Data Partitioning on Multicore and Multi-GPU Platforms Using Functional Performance Models, IEEE Transactions on Computers, vol.64, issue.9, p.89, 2015.

M. Boyer, K. Skadron, S. Che, and N. Jayasena, Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability, Computing Frontiers Conf., p.90, 2013.

A. Navarro, F. Corbera, A. Rodriguez, A. Vilches, and R. Asenjo, Heterogeneous parallel_for Template for CPU-GPU Chips, International Journal of Parallel Programming, p.90, 2018.

B. Perez, E. Stafford, J. Bosque, R. Beivide, S. Mateo et al., Extending OmpSs for OpenCL Kernel Co-Execution in Heterogeneous Systems, Proceedings of the 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), p.90, 2017.

J. A. Martínez, E. M. Garzón, A. Plaza, and I. García, Automatic tuning of iterative computation on heterogeneous multiprocessors with adithe, J. Supercomput, vol.58, issue.2, p.90, 2011.

E. M. Garzón, J. J. Moreno, and J. A. Martínez, An approach to optimise the energy efficiency of iterative computation on integrated gpu-cpu systems, The Journal of Supercomputing, vol.73, p.90, 2016.

J. Shen, A. L. Varbanescu, H. Sips, M. Arntzen, and D. Simons, Glinda: a framework for accelerating imbalanced applications on heterogeneous platforms, Computing Frontiers Conf, p.91, 2013.

J. Shen, A. Varbanescu, P. Zou, Y. Lu, and H. Sips, Improving performance by matching imbalanced workloads with heterogeneous platforms, p.91, 2014.

J. Shen, A. L. Varbanescu, Y. Lu, P. Zou, and H. Sips, Workload partitioning for accelerating applications on heterogeneous platforms, IEEE Transactions on Parallel and Distributed Systems, vol.27, issue.9, p.91, 2016.

Y. Cho, F. Negele, S. Park, B. Egger, and T. R. Gross, On-the-fly workload partitioning for integrated cpu/gpu architectures, Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, PACT '18, vol.21, p.91, 2018.

Y. Zhang and E. Duesterwald, Barrier matching for programs with textually unaligned barriers, Proc. ACM SIGPLAN Symp. on Principles and Practice of Parallel Prog., PPoPP, vol.100, p.125, 2007.

E. Saillard, P. Carribault, and D. Barthou, Parcoach: Combining static and dynamic validation of mpi collective communications, The International Journal of High Performance Computing Applications, vol.28, issue.4, p.134, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01078762

E. Saillard, P. Carribault, and D. Barthou, Static Validation of Barriers and Worksharing Constructs in OpenMP Applications, International Workshop on OpenMP, vol.100, p.134, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01078759

E. Saillard, H. Brunie, P. Carribault, and D. Barthou, PARCOACH Extension for Hybrid Applications with Interprocedural Analysis, 9th International Workshop on Parallel Tools for High Performance Computing, vol.100, p.134, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01420655

J. Jaeger, E. Saillard, P. Carribault, and D. Barthou, Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH, EuroMPI, vol.100, p.134, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01252321

Y. Lin, Static nonconcurrency analysis of openmp programs, OpenMP Shared Memory Parallel Programming, p.101, 2008.

V. Springel, The cosmological simulation code gadget-2, Monthly Notices of the Royal Astronomical Society, vol.364, p.126, 2005.

A. J. Ropelewski, H. B. Nicholas Jr., and R. R. Mendez, MPI-PHYLIP: Parallelizing Computationally Intensive Phylogenetic Analysis Routines for the Analysis of Large Protein Families, PLOS ONE, vol.5, issue.11, p.126, 2010.

C. Amg, , vol.111, p.126, 2013.

High-Performance Linpack benchmark, vol.111, p.126, 2016.

M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring et al., Improving Performance via Mini-applications, vol.111, p.126, 2009.

NERSC IOR, vol.111, p.126, 2016.

A. Aiken and D. Gay, Barrier inference, Proc. ACM SIGPLAN-SIGACT Symp. on Principles of Prog. Lang., POPL, p.117, 1998.

B. Scholz, C. Zhang, and C. Cifuentes, User-input dependence analysis via graph reachability, Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, p.120, 2008.

Y. Sui and J. Xue, SVF: Interprocedural Static Value-flow Analysis in LLVM, Proc. Int. Conf. on Comp. Construction, CC, vol.120, p.155, 2016.

B. Hardekopf and C. Lin, Flow-sensitive Pointer Analysis for Millions of Lines of Code, Proc. Symp. on Code Generation and Optimization, CGO, p.120, 2011.

Y. Sui, D. Ye, and J. Xue, Detecting Memory Leaks Statically with Full-Sparse Value-Flow Analysis, IEEE Trans. Softw. Eng, vol.40, issue.2, p.120, 2014.

S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel et al., FlowDroid: Precise Context, Flow, Field, Object-sensitive and Lifecycle-aware Taint Analysis for Android Apps, SIGPLAN Not, vol.49, issue.6, p.124, 2014.

O. Tripp, M. Pistoia, S. J. Fink, M. Sridharan, and O. Weisman, TAJ: Effective Taint Analysis of Web Applications, Proc. ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, PLDI, p.124, 2009.

U. Shankar, K. Talwar, J. S. Foster, and D. Wagner, Detecting Format String Vulnerabilities with Type Qualifiers, Proceedings of the 10th Conference on USENIX Security Symposium, vol.10, p.124, 2001.

D. E. Denning and P. J. Denning, Certification of Programs for Secure Information Flow, Commun. ACM, vol.20, issue.7, p.124, 1977.

N. Heintze and J. G. Riecke, The SLam Calculus: Programming with Secrecy and Integrity, Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL'98, p.124, 1998.

H. Yin, D. Song, M. Egele, C. Kruegel, and E. Kirda, Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis, Proc. ACM Conf. on Comp. and Communications Security, CCS, p.124, 2007.

A. Sabelfeld and A. C. Myers, Language-based Information-flow Security, IEEE J. Sel. Areas Commun, vol.21, issue.1, p.124, 2006.

I. Laguna and M. Schulz, Pinpointing Scale-dependent Integer Overflow Bugs in Large-scale Parallel Applications, Proc. Conf. for High Perf. Comp., Networking, Storage and Analysis, vol.19, p.124, 2016.

D. Ye, Y. Sui, and J. Xue, Accelerating Dynamic Detection of Uses of Undefined Values with Static Value-Flow Analysis, Proc. IEEE/ACM Symp. on Code Generation and Optimization, CGO, vol.154, p.124, 2014.

P. Feautrier, Dataflow Analysis of Array and Scalar References, International Journal of Parallel Programming, vol.20, issue.1, p.124, 1991.

A. Slowinska and H. Bos, Pointless Tainting?: Evaluating the Practicality of Pointer Tainting, Proc. ACM European Conf. on Comp. Systems, EuroSys'09, p.124, 2009.

T. Bao, Y. Zheng, Z. Lin, X. Zhang, and D. Xu, Strict Control Dependence and Its Effect on Dynamic Information Flow Analyses, Proc. Symp. on Software Testing and Analysis, ISSTA'10, p.124, 2010.

C. Cifuentes and B. Scholz, Parfait: Designing a Scalable Bug Checker, Proc. Workshop on Static Analysis, SAW'08, vol.124, p.125, 2008.

S. F. Siegel and T. K. Zirkel, Automatic Formal Verification of MPI-based Parallel Programs, Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, p.125, 2011.

P. Ohly and W. Krotz-vogel, Automated MPI Correctness Checking: What if there was a magic option?, Proceedings of the 8th LCI International Conference on High-Performance Clustered Computing, p.125, 2007.

A. Vo, S. Aananthakrishnan, G. Gopalakrishnan, B. R. de Supinski, M. Schulz, and G. Bronevetsky, A Scalable and Distributed Dynamic Formal Verifier for MPI Programs, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, p.125, 2010.

T. Hilbrich, B. R. de Supinski, F. Hänsel, M. S. Müller et al., Runtime MPI Collective Checking with Tree-based Overlay Networks, Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, p.125, 2013.

T. Hilbrich, B. R. de Supinski, M. Schulz, and M. S. Müller, A Graph Based Approach for MPI Deadlock Detection, Proceedings of the 23rd International Conference on Supercomputing, ICS '09, p.125, 2009.

D. C. Arnold, D. H. Ahn, B. R. de Supinski, G. L. Lee, B. P. Miller et al., Stack Trace Analysis for Large Scale Debugging, IEEE International Parallel and Distributed Processing Symposium, p.125, 2007.

J. L. Träff and J. Worringen, Verifying collective MPI calls, PVM/MPI, p.125, 2004.

H. Ma, S. R. Diersen, L. Wang, C. Liao, D. Quinlan et al., Symbolic Analysis of Concurrency Errors in OpenMP Programs, In PARCO, p.125, 2013.

Y. Zhang, E. Duesterwald, and G. R. Gao, Concurrency Analysis for Shared Memory Programs with Textually Unaligned Barriers, Languages and Compilers for Parallel Computing, p.125, 2008.

The GNU Compiler Collection, p.125

The Intel Compiler, p.125

U. Banerjee, B. Bliss, Z. Ma, and P. Petersen, Unraveling data race detection in the intel thread checker, Int'l. Symp. on Computer Architecture, ISCA, p.125, 2008.

C. Terboven, Comparing Intel Thread Checker and Sun Thread Analyzer, Advances in Parallel Computing, vol.15, p.125, 2007.

Intel Inspector XE, p.125, 2017.

NVIDIA, CUDA-MEMCHECK, p.126, 2017.

J. Price and S. Mcintosh-smith, Oclgrind: An Extensible OpenCL Device Simulator, Proceedings of the 3rd International Workshop on OpenCL, IWOCL'15, vol.12, p.126, 2015.

A. Betts, N. Chong, A. Donaldson, S. Qadeer, and P. Thomson, GPUVerify: A Verifier for GPU Kernels, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'12, p.126, 2012.

P. Collingbourne, C. Cadar, and P. H. Kelly, Symbolic Testing of OpenCL Code, Hardware and Software: Verification and Testing: 7th International Haifa Verification Conference, pp.203-218

A. Ebnenasir, UPC-SPIN: a framework for the model checking of UPC programs, Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models, p.126, 2011.

J. Coyle, I. Roy, M. Kraeva, and G. R. Luecke, UPC-CHECK: a scalable tool for detecting run-time errors in Unified Parallel C, Computer Science -Research and Development, vol.28, issue.2, p.126, 2013.

S. Williams, Implementation and optimization of miniGMG-a compact geometric multigrid benchmark, p.126, 2014.

H. Jin, R. Hood, and P. Mehrotra, A practical study of UPC using the NAS Parallel Benchmarks, PGAS, p.126, 2009.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, IISWC, p.126, 2009.
