V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th International Symposium on Computer Architecture, pp.248-259, 2000.

E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner et al., Task-Based FMM for Multicore Architectures, SIAM Journal on Scientific Computing, vol.36, issue.1, pp.66-93, 2014.
DOI : 10.1137/130915662

URL : https://hal.archives-ouvertes.fr/hal-00807368

J. A. Ang, R. F. Barrett, R. E. Benner, D. Burke, C. Chan et al., Abstract Machine Models and Proxy Architectures for Exascale Computing, 2014 Hardware-Software Co-Design for High Performance Computing, 2014.
DOI : 10.1109/Co-HPC.2014.4

B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel et al., The PERCS High-Performance Interconnect, 2010 18th IEEE Symposium on High Performance Interconnects, 2010.
DOI : 10.1109/HOTI.2010.16

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, issue.4, pp.187-198, 2011.
DOI : 10.1002/cpe.1631

URL : https://hal.archives-ouvertes.fr/inria-00384363

M. Baldauf, O. Fuhrer, M. J. Kurowski, G. de Morsier, M. Muellner et al., The COSMO priority project 'conservative dynamical core' final report, 2013.

Barcelona Supercomputing Center, The OmpSs Programming Model

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall et al., Cilk: An Efficient Multithreaded Runtime System, Proceedings of PPoPP '95, 1995.
DOI : 10.1006/jpdc.1996.0107

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

E. G. Boman, K. D. Devine, V. J. Leung, S. Rajamanickam, L. A. Riesen et al., Zoltan2: Next generation combinatorial toolkit, 2012.

S. Borkar, Thousand core chips, Proceedings of the 44th annual conference on Design automation, DAC '07, pp.746-749, 2007.
DOI : 10.1145/1278480.1278667

S. Borkar and A. A. Chien, The future of microprocessors, Communications of the ACM, vol.54, issue.5, pp.67-77, 2011.
DOI : 10.1145/1941487.1941507

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar et al., Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp.1432-1441, 2011.
DOI : 10.1109/IPDPS.2011.299

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault et al., PaRSEC: Exploiting Heterogeneity to Enhance Scalability, Computing in Science & Engineering, vol.15, issue.6, pp.36-45, 2013.
DOI : 10.1109/MCSE.2013.98

P. J. Braam, The Lustre storage architecture, 2003.

M. S. Campobasso and M. B. Giles, Effects of Flow Instabilities on the Linear Analysis of Turbomachinery Aeroelasticity, Journal of Propulsion and Power, vol.19, issue.2, pp.250-259, 2003.
DOI : 10.2514/2.6106

N. Capit, G. D. Costa, Y. Georgiou, G. Huard, C. Martin et al., A batch scheduler with high level components, CCGrid 2005: IEEE International Symposium on Cluster Computing and the Grid, pp.776-783, 2005.
DOI : 10.1109/CCGRID.2005.1558641

URL : https://hal.archives-ouvertes.fr/hal-00005106

P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, PVFS: A parallel file system for Linux clusters, Proceedings of the 4th Annual Linux Showcase and Conference, pp.317-327, 2000.

B. Catanzaro, S. Kamil, Y. Lee, K. Asanović, J. Demmel et al., SEJITS: Getting productivity and performance with selective embedded JIT specialization, 2009.

B. C. Catanzaro, M. Garland, and K. Keutzer, Copperhead: compiling an embedded data parallel language, Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2011, pp.47-56, 2011.

E. Chan, F. G. Van Zee, P. Bientinesi, E. S. Quintana-Ortí, G. Quintana-Ortí et al., SuperMatrix, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, 2008.
DOI : 10.1145/1345206.1345227

A. Charara, H. Ltaief, D. Gratadour, D. Keyes, A. Sevin et al., Pipelining Computational Stages of the Tomographic Reconstructor for Multi-Object Adaptive Optics on a Multi-GPU System, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014.
DOI : 10.1109/SC.2014.27

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra et al., X10, ACM SIGPLAN Notices, vol.40, issue.10, pp.519-538, 2005.
DOI : 10.1145/1103845.1094852

URL : https://hal.archives-ouvertes.fr/in2p3-00166974

H. Chen, W. Chen, J. Huang, B. Robert, and H. Kuhn, MPIPP, Proceedings of the 20th annual international conference on Supercomputing, ICS '06, pp.353-360, 2006.
DOI : 10.1145/1183401.1183451

S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki et al., Scheduling threads for constructive cache sharing on CMPs, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '07, 2007.
DOI : 10.1145/1248377.1248396

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

P. Colella, D. T. Graves, D. Modiano, D. B. Serafini, and B. Van-straalen, Chombo software package for AMR applications, 2000.

O. Standards and C., OpenMP 4.0 Application Program Interface, 2013.

Y. Cui, E. Poyraz, J. Zhou, S. Callaghan, P. Maechling et al., Accelerating CyberShake Calculations on the XE6/XK7 Platform of Blue Waters, 2013 Extreme Scaling Workshop (XSW 2013), pp.8-17, 2013.
DOI : 10.1109/XSW.2013.6

M. Deveci, S. Rajamanickam, V. J. Leung, K. Pedretti et al., Exploiting Geometric Partitioning in Task Mapping for Parallel Computers, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.15

S. Donfack, L. Grigori, W. D. Gropp, and V. Kale, Hybrid Static/dynamic Scheduling for Already Optimized Dense Matrix Factorization, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.496-507, 2012.
DOI : 10.1109/IPDPS.2012.53

URL : https://hal.archives-ouvertes.fr/inria-00631348

M. Dorier, G. Antoniu, R. Ross, D. Kimpe, and S. Ibrahim, CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.27

URL : https://hal.archives-ouvertes.fr/hal-00916091

H. C. Edwards, C. R. Trott, and D. Sunderland, Kokkos: Enabling manycore performance portability through polymorphic memory access patterns, Journal of Parallel and Distributed Computing, 2014.

S. Erdweg, T. Rendel, C. Kästner, and K. Ostermann, SugarJ, ACM SIGPLAN Notices, vol.46, issue.10, pp.391-406, 2011.
DOI : 10.1145/2076021.2048099

F. Pellegrini, Scotch and LibScotch 5.1 User's Guide. ScAlApplix project, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00410332

O. Fuhrer, M. Bianco, I. Bey, and C. Schär, Grid tools: Towards a library for hardware oblivious implementation of stencil based codes

K. Fürlinger, C. Glass, J. Gracia, A. Knüpfer, J. Tao et al., DASH: Data Structures and Algorithms with Support for Hierarchical Locality, Euro-Par Workshops, 2014.
DOI : 10.1007/978-3-319-14313-2_46

M. Garland, M. Kudlur, and Y. Zheng, Designing a unified programming model for heterogeneous machines, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
DOI : 10.1109/SC.2012.48

M. Geimer, F. Wolf, B. J. Wylie, E. Ábrahám, D. Becker et al., The Scalasca performance toolset architecture, Concurrency and Computation: Practice and Experience, pp.702-719, 2010.

B. Goglin, Managing the topology of heterogeneous cluster nodes with hardware locality (hwloc), 2014 International Conference on High Performance Computing & Simulation (HPCS), 2014.
DOI : 10.1109/HPCSim.2014.6903671

URL : https://hal.archives-ouvertes.fr/hal-00985096

B. Goglin, J. Hursey, and J. M. Squyres, Netloc: Towards a Comprehensive View of the HPC System Topology, 2014 43rd International Conference on Parallel Processing Workshops, 2014.
DOI : 10.1109/ICPPW.2014.38

URL : https://hal.archives-ouvertes.fr/hal-01010599

R. L. Henderson, Job scheduling under the Portable Batch System, Job Scheduling Strategies for Parallel Processing, pp.279-294, 1995.

B. Hess, C. Kutzner, D. van der Spoel, and E. Lindahl, GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation, Journal of Chemical Theory and Computation, vol.4, issue.3, pp.435-447, 2008.
DOI : 10.1021/ct700301q

T. Hoefler, E. Jeannot, and G. Mercier, Chapter 5: An overview of process mapping techniques and algorithms in high-performance computing, High Performance Computing on Complex Environments, pp.65-84, 2014.

T. Hoefler and M. Snir, Generic topology mapping strategies for large-scale parallel architectures, Proceedings of the international conference on Supercomputing, ICS '11, pp.75-84, 2011.
DOI : 10.1145/1995896.1995909

P. Hudak, Building domain-specific embedded languages, ACM Computing Surveys, vol.28, 1996.

hwloc, Portable Hardware Locality

E. Jeannot and G. Mercier, Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures, Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, pp.199-210, 2010.
DOI : 10.1007/978-3-642-15291-7_20

URL : https://hal.archives-ouvertes.fr/inria-00544346

E. Jeannot, G. Mercier, and F. Tessier, Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques, IEEE Transactions on Parallel and Distributed Systems, vol.25, issue.4, pp.993-1002, 2014.
DOI : 10.1109/TPDS.2013.104

URL : https://hal.archives-ouvertes.fr/hal-00803548

L. V. Kale and S. Krishnan, Charm++: A portable concurrent object oriented system based on C++, Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA '93, pp.91-108, 1993.

A. Kamil and K. Yelick, Hierarchical Computation in the SPMD Programming Model, The 26th International Workshop on Languages and Compilers for Parallel Computing, 2013.
DOI : 10.1007/978-3-319-09967-5_1

G. Karypis, K. Schloegel, and V. Kumar, ParMETIS: Parallel graph partitioning and sparse matrix ordering library, 2003.

A. YarKhan, J. Kurzak, and J. Dongarra, QUARK Users' Guide: QUeueing And Runtime for Kernels, 2011.

J. Kim, W. J. Dally, S. Scott, and D. Abts, Technology-driven, highly-scalable dragonfly topology, 35th International Symposium on Computer Architecture, ISCA '08, pp.77-88, 2008.
DOI : 10.1109/isca.2008.19

A. Klöckner, N. Pinto, Y. Lee, B. C. Catanzaro, P. Ivanov et al., PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation, Parallel Computing, vol.38, issue.3, pp.157-174, 2012.
DOI : 10.1016/j.parco.2011.09.001

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson et al., ExaScale computing study: Technology challenges in achieving exascale systems, 2008.

P. M. Kogge and J. Shalf, Exascale computing trends: Adjusting to the "new normal" for computer architecture, Computing in Science & Engineering, vol.15, issue.6, pp.16-26, 2013.

W. Kramer, Is petascale completely done? What should we do now? Joint-Lab on Petascale Computing Workshop

X. Lapillonne and O. Fuhrer, Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science, Parallel Processing Letters, vol.24, issue.01, 2014.
DOI : 10.1142/S0129626414500030

J. Li, W.-K. Liao, A. Choudhary, R. Ross, R. Thakur et al., Parallel netCDF, Proceedings of the 2003 ACM/IEEE conference on Supercomputing, SC '03, 2003.
DOI : 10.1145/1048935.1050189

H. Ltaief and R. Yokota, Data-Driven Execution of Fast Multipole Methods. CoRR, abs, 1203.

R. Membarth, F. Hannig, J. Teich, M. Körner, and W. Eckert, Generating device-specific GPU code for local operators in medical imaging, Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.569-581, 2012.

R. Membarth, F. Hannig, J. Teich, and H. Köstler, Towards Domain-Specific Computing for Stencil Codes in HPC, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp.1133-1138, 2012.
DOI : 10.1109/SC.Companion.2012.136

Q. Meng and M. Berzins, Scalable large-scale fluid-structure interaction solvers in the Uintah framework via hybrid task-based parallelism algorithms, Concurrency and Computation: Practice and Experience, vol.90, issue.3, pp.1388-1407, 2014.
DOI : 10.1002/cpe.3099

J. Nakashima, S. Nakatani, and K. Taura, Design and implementation of a customizable work stealing scheduler, Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers, ROSS '13, 2013.
DOI : 10.1145/2491661.2481433

S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins, OpenMP task scheduling strategies for multicore NUMA systems, International Journal of High Performance Computing Applications, vol.26, issue.2, pp.110-124, 2012.

OpenMP Architecture Review Board, OpenMP Application Program Interface

C. Osuna, O. Fuhrer, T. Gysi, and M. Bianco, STELLA: A domain-specific language for stencil methods on structured grids, Poster Presentation at the Platform for Advanced Scientific Computing (PASC) Conference

S. Páll, M. J. Abraham, C. Kutzner, B. Hess, and E. Lindahl, Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS, Lecture Notes in Computer Science, in press, 2014.
DOI : 10.1007/978-3-319-15976-8_1

M. Pericàs, K. Taura, and S. Matsuoka, Scalable analysis of multicore data reuse and sharing, Proceedings of the 28th ACM international conference on Supercomputing, ICS '14, 2014.
DOI : 10.1145/2597652.2597674

B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg et al., Efficient Task Placement and Routing in Dragonfly Networks, Proceedings of the 23rd ACM International Symposium on High-Performance Parallel and Distributed Computing, 2014.

S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar et al., GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit, Bioinformatics, vol.29, issue.7, pp.845-854, 2013.

S. Ramos and T. Hoefler, Modeling communication in cache-coherent SMP systems, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.97-108, 2013.
DOI : 10.1145/2493123.2462916

F. Rathgeber, G. R. Markall, L. Mitchell, N. Loriant, D. A. Ham et al., PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp.1116-1123, 2012.
DOI : 10.1109/SC.Companion.2012.134

E. Rodrigues, F. Madruga, P. Navaux, and J. Panetta, Multicore Aware Process Mapping and its Impact on Communication Overhead of Parallel Applications, Proceedings of the IEEE Symp. on Comp. and Comm, pp.811-817, 2009.

T. Rompf and M. Odersky, Lightweight modular staging, Communications of the ACM, vol.55, issue.6, pp.121-130, 2012.
DOI : 10.1145/2184319.2184345

F. P. Russell, M. R. Mellor, P. H. Kelly, and O. Beckmann, DESOLA: An active linear algebra library using delayed evaluation and runtime code generation, Science of Computer Programming, vol.76, issue.4, pp.227-242, 2011.
DOI : 10.1016/j.scico.2008.06.002

F. Schmuck and R. Haskin, GPFS: A shared-disk file system for large computing clusters, First USENIX Conference on File and Storage Technologies (FAST'02), 2002.

J. Shalf, S. S. Dosanjh, and J. Morrison, Exascale Computing Technology Challenges, International Meeting on High Performance Computing for Computational Science, pp.1-25, 2010.
DOI : 10.1109/MM.2009.5

M. Showerman, J. Enos, J. Fullop, P. Cassella, N. Naksinehaboon et al., Large scale system monitoring and analysis on Blue Waters using OVIS, Proceedings of the 2014 Cray User's Group, 2014.

R. Thakur, W. Gropp, and E. Lusk, On implementing MPI-IO portably and with high performance, Proceedings of the sixth workshop on I/O in parallel and distributed systems, IOPADS '99, pp.23-32, 1999.
DOI : 10.1145/301816.301826

D. Unat, C. Chan, W. Zhang, J. Bell, and J. Shalf, Tiling as a durable abstraction for parallelism and data locality, Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, 2013.

F. G. Van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Ortí, and G. Quintana-Ortí, The libflame Library for Dense Matrix Computations, IEEE Des. Test, vol.11, issue.6, pp.56-63, 2009.

T. L. Veldhuizen and D. Gannon, Active libraries: Rethinking the roles of compilers and libraries, CoRR, math, 1998.

B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller et al., Scalable performance of the Panasas parallel file system, Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST), pp.17-33, 2008.

M. Wimmer, D. Cederman, J. L. Träff, and P. Tsigas, Work-stealing with configurable scheduling strategies, Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pp.315-316, 2013.

Y. Yan, J. Zhao, Y. Guo, and V. Sarkar, Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement, Proceedings of the 22nd International Workshop on Languages and Compilers for Parallel Computing, 2009.
DOI : 10.1007/978-3-642-13374-9_12

K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit et al., Titanium: A high-performance Java dialect, Workshop on Java for High-Performance Network Computing, 1998.

Q. Yi and D. J. Quinlan, Applying Loop Optimizations to Object-Oriented Abstractions Through General Classification of Array Semantics, Lecture Notes in Computer Science, vol.3602, pp.253-267, 2004.
DOI : 10.1007/11532378_19

A. B. Yoo, M. A. Jette, and M. Grondona, SLURM: Simple Linux utility for resource management, Job Scheduling Strategies for Parallel Processing, pp.44-60, 2003.

Y. Zheng, A. Kamil, M. Driscoll, H. Shan, and K. Yelick, UPC++: A PGAS Extension for C++, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.115

S. Zhou, LSF: Load sharing in large heterogeneous distributed systems, I Workshop on Cluster Computing, 1992.