W. Feng and T. Scogland, Green500 list, URL: https://www.green500.org/lists, 2017.

. Computer, URL: http://www.aics.riken, 2017.

A. Agelastos, B. A. Allan, and J. M. Brandt, The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications, pp.154-165, 2014.

E. Agullo, C. Augonnet, J. Dongarra et al., LU Factorization for Accelerator-based Systems, pp.217-224, 2011.

E. Agullo, C. Augonnet, J. Dongarra et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, IPDPS. IEEE, vol.20, pp.932-943, 2011.

C. Albing, Characterizing Node Orderings for Improved Performance, pp.1-6, 2015.

S. Ashby, P. Beckman, and J. Chen, Opportunities and Challenges of Exascale Computing, URL: https://science.energy.gov, Tech. rep. U.S. Department of Energy, 2010.

C. Augonnet, S. Thibault, and R. Namyst, Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures, Euro-Par Workshops. Lecture Notes in Computer Science, vol.6043, pp.56-65, 2009.

C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, issue.2, pp.187-198, 2011.

E. Bampis, F. Guinand, and D. Trystram, Some models for scheduling parallel programs with communication delays, Discrete Applied Mathematics, vol.72, issue.1-2, pp.5-24, 1997.

A. Bhatele, K. Mohror, S. H. Langer, K. E. Isaacs et al., There Goes the Neighborhood: Performance Degradation due to Nearby Jobs, 2013.

J. Błażewicz et al., Handbook on Scheduling: From Theory to Applications, International Handbooks on Information Systems, 2007.

R. Bleuse, T. Gautier, J. V. F. Lima et al., Scheduling Data Flow Program in XKaapi: A New Affinity Based Algorithm for Heterogeneous Architectures, Euro-Par, 2014.

R. Bleuse, S. Kedad-Sidhoum, F. Monna, G. Mounié et al., Scheduling independent tasks on multi-cores with GPU accelerators, Concurrency and Computation: Practice and Experience, vol.27, pp.1625-1638, 2015.

R. Bleuse, S. Hunold, and S. Kedad-Sidhoum, Scheduling Independent Moldable Tasks on Multi-Cores with GPUs, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.9, pp.2689-2702, 2017.

G. Bosilca, A. Bouteiller, and A. Danalis, DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, vol.38, issue.1, pp.37-51, 2012.

M. Bougeret, P.-F. Dutot, K. Jansen, C. Otte, and D. Trystram, A Fast 5/2-Approximation Algorithm for Hierarchical Scheduling, Euro-Par. Lecture Notes in Computer Science, vol.6271, issue.1, pp.157-167, 2010.

A. Boukerche, J. M. Correa, A. C. M. A. de Melo et al., A Hardware Accelerator for the Fast Retrieval of DIALIGN Biological Sequence Alignments in Linear Space, IEEE Transactions on Computers, vol.59, issue.6, pp.808-821, 2010.

M. Bougeret, P.-F. Dutot, K. Jansen, C. Robenek, and D. Trystram, Approximation Algorithms for Multiple Strip Packing and Scheduling Parallel Jobs in Platforms, Discrete Mathematics, Algorithms and Applications, pp.553-586, 2011.

R. P. Brent, The Parallel Evaluation of General Arithmetic Expressions, Journal of the ACM, vol.21, issue.2, pp.201-206, 1974.

P. Brucker, Scheduling Algorithms, Fifth Edition, 2007.

J. Bueno, J. Planas, and A. Duran, Productive Programming of GPU Clusters with OmpSs, pp.557-568, 2012.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, pp.38-53, 2009.

V. Bonifaci et al., Scheduling Unrelated Machines of Few Different Types, URL: https

P. H. Carns, K. Harms, and W. E. Allcock, Understanding and Improving Computational Science Storage Access through Continuous Characterization, In: ACM Transactions on Storage, vol.7, issue.81, p.77, 2011.

. Aragon, Considering Time in Designing Large-Scale Systems for Scientific Computing, pp.1533-1545, 2016.

E. G. Coffman Jr., M. R. Garey, D. S. Johnson et al., Performance Bounds for Level-Oriented Two-Dimensional Packing Algorithms, In: SIAM Journal on Computing, vol.9, issue.4, pp.808-826, 1980.

D. E. Culler, R. M. Karp, D. A. Patterson et al., LogP: Towards a Realistic Model of Parallel Computation, pp.1-12, 1993.

L. Chen, D. Ye, and G. Zhang, Online Scheduling on a CPU-GPU Cluster, In: TAMC. Lecture Notes in Computer Science, vol.7876, pp.1-9, 2013.

M. Deveci, S. Rajamanickam, and V. J. Leung, Exploiting Geometric Partitioning in Task Mapping for Parallel Computers, pp.27-36, 2014.

P.-F. Dutot, G. Mounié, and D. Trystram, Scheduling Parallel Tasks: Approximation Algorithms, In: Handbook of Scheduling: Algorithms, Models, and Performance Analysis, Computer & Information Science Series. Chapman and Hall/CRC, 2004.

J. Dongarra, P. H. Beckman, T. Moore et al., The International Exascale Software Project roadmap, International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011.

M. Dorier, S. Ibrahim, G. Antoniu, and R. B. Ross, Using Formal Grammars to Predict I/O Behaviors in HPC: The Omnisc'IO Approach, IEEE Transactions on Parallel and Distributed Systems, vol.27, issue.8, pp.2435-2449, 2016.

M. Drozdowski, Scheduling for Parallel Processing, Computer Communications and Networks, 2009.

A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, Fast parallel sorting under LogP: experience with the CM-5, IEEE Transactions on Parallel and Distributed Systems, pp.791-805, 1996.
DOI : 10.1109/71.532111

R. T. Evans, J. C. Browne, and W. L. Barth, Understanding Application and System Performance Through System-Wide Monitoring, IPDPS Workshops. IEEE, pp.1702-1710, 2016.

J. Enos, G. H. Bauer, and R. Brunner, Topology-Aware Job Scheduling Strategies for Torus Networks, In: Cray User Group, URL: https://cug.org/proceedings, pp.74-77, 2014.

L. Eyraud-Dubois, Théorie et pratique de l'ordonnancement d'applications sur les systèmes distribués, 2006.

L. Fan, F. Zhang, G. Wang, and Z. Liu, An effective approximation algorithm for the Malleable Parallel Task Scheduling problem, In: Journal of Parallel and Distributed Computing, vol.72, issue.5, pp.693-704, 2012.

D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, Theory and Practice in Parallel Job Scheduling, In: JSSPP. Lecture Notes in Computer Science, vol.1291, pp.1-34, 1997.

J. V. F. Lima, T. Gautier, N. Maillard, and V. Danjean, Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs, pp.75-82, 2012.

D. K. Friesen, Tighter Bounds for LPT Scheduling on Uniform Processors, In: SIAM Journal on Computing, vol.16, issue.3, pp.554-560, 1987.

S. Fortune and J. Wyllie, Parallelism in Random Access Machines, pp.114-118, 1978.

A. Gainaru, G. Aupy, and A. Benoit, Scheduling the I/O of HPC Applications Under Congestion, IPDPS. IEEE, pp.1013-1022, 2015.

T. Gautier, J. V. F. Lima, N. Maillard et al., XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures, pp.1299-1308, 2013.

T. Gautier, X. Besseron, and L. Pigeon, KAAPI: A thread scheduling runtime system for data flow computations on cluster of multiprocessors, In: PASCO. ACM, pp.15-23, 2007.

Y. Georgiou, E. Jeannot, G. Mercier, and A. Villiermet, Topology-aware Resource Management for HPC Applications, pp.1-17, 2017.

Y. Georgiou, Contributions for Resource and Job Management in High Performance Computing, URL: https, 2010.

J. Gergov, Algorithms for Compile-Time Memory Optimization, URL: https, In: SODA. ACM/SIAM, pp.907-908, 1999.

M. R. Garey and R. L. Graham, Bounds for Multiprocessor Scheduling with Resource Constraints, In: SIAM Journal on Computing, vol.4, issue.2, 1975.

M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, 1979.

R. L. Graham, E. L. Lawler, J. K. Lenstra et al., Optimization and Approximation in Deterministic Sequencing and Scheduling: a Survey, Annals of Discrete Mathematics, vol.5, pp.287-326, 1979.

R. L. Graham, Bounds on Multiprocessing Timing Anomalies, In: SIAM Journal on Applied Mathematics, vol.17, issue.2, pp.416-429, 1969.

S. Hunold and A. Carpen-Amarie, Reproducible MPI Benchmarking is Still Not as Easy as You Think, IEEE Transactions on Parallel and Distributed Systems, vol.27, issue.12, pp.3617-3630, 2016.

E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard, Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations, Euro-Par. Lecture Notes in Computer Science, vol.6272, issue.2, pp.235-246, 2010.

D. Hilbert, Ueber die stetige Abbildung einer Linie auf ein Flächenstück, Mathematische Annalen, vol.38, issue.3, pp.459-460, 1891.

D. S. Hochbaum and D. B. Shmoys, Using Dual Approximation Algorithms for Scheduling Problems: Theoretical and Practical Results, Journal of the ACM, vol.34, issue.1, pp.144-162, 1987.

D. S. Hochbaum and D. B. Shmoys, A Polynomial Approximation Scheme for Scheduling on Uniform Processors: Using the Dual Approximation Approach, In: SIAM Journal on Computing, vol.17, issue.3, pp.539-551, 1988.

F. Isaila, J. Carretero, and R. B. Ross, CLARISSE: A Middleware for Data-Staging Coordination and Control on Large-Scale HPC Platforms, pp.346-355, 2016.

C. Imreh, Scheduling Problems on Two Sets of Identical Machines, pp.277-294, 2003.

N. Jain, A. Bhatele, X. Ni, T. Gamblin, and L. V. Kalé, Partitioning Low-diameter Networks to Eliminate Inter-job Interference, pp.439-448, 2017.

K. Jansen and L. Porkolab, Linear-time Approximation Schemes for Scheduling Malleable Parallel Tasks, URL: https, In: SODA. ACM/SIAM, pp.490-498, 1999.

G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler, Cost-Effective Diameter-Two Topologies: Analysis and Evaluation.

V. W. Lee et al., Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, 2010.

V. J. Leung, E. M. Arkin, M. A. Bender et al., Processor Allocation on Cplant: Achieving General Processor Locality Using One-Dimensional Allocation Strategies, pp.296-304, 2002.

J. K. Lenstra, D. B. Shmoys, and É. Tardos, Approximation Algorithms for Scheduling Unrelated Parallel Machines, In: Mathematical Programming, vol.46, issue.1, pp.259-271, 1990.

W. Ludwig and P. Tiwari, Scheduling Malleable and Nonmalleable Parallel Tasks, URL: https, pp.167-176, 1994.

G. Lucarelli, F. Mendonça, D. Trystram, and F. Wagner, Contiguity and Locality in Backfilling Scheduling, pp.586-595, 2015.

F. Monna, Scheduling for new computing platforms with GPUs, URL: https, 2014.

G. M. Morton, A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing, URL: https, Tech. rep. IBM Ltd, p.72, 1966.

G. Mounié, C. Rapine, and D. Trystram, A 3/2-Approximation Algorithm for Scheduling Independent Monotonic Malleable Tasks, pp.401-412, 2007.

F. Pinel, B. Dorronsoro et al., Solving very large instances of the scheduling of independent tasks problem on the GPU.

J. A. Pascual, J. Miguel-Alonso, and J. A. Lozano, Application-aware metrics for partition selection in cube-shaped topologies, Parallel Computing, vol.40, pp.129-139, 2014.

J. C. Phillips, J. E. Stone, and K. Schulten, Adapting a message-driven parallel application to GPU-accelerated clusters, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
DOI : 10.1109/SC.2008.5214716
URL : http://mc.stanford.edu/cgi-bin/images/8/8a/SC08_NAMD.pdf

G. Raravi and V. Nélis, A PTAS for Assigning Sporadic Tasks on Two-type Heterogeneous Multiprocessors, pp.117-126, 2012.

F. Song, H. Ltaief, B. Hadri, and J. Dongarra, Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems, pp.1-11, 2010.

D. B. Shmoys and É. Tardos, An approximation algorithm for the generalized assignment problem, In: Mathematical Programming, vol.62, issue.1, pp.461-474, 1993.

F. Song, S. Tomov, and J. Dongarra, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems, ICS. ACM, pp.365-376, 2012.

A. Steinberg, A Strip-Packing Algorithm with Absolute Performance Bound 2, In: SIAM Journal on Computing, vol.26, issue.2, pp.401-409, 1997.

E. V. Shchepin and N. Vakhania, An optimal rounding gives a better approximation for scheduling unrelated machines, Operations Research Letters, vol.33, issue.2, pp.127-133, 2005.

C. Stein and J. Wein, On the existence of schedules that are near-optimal for both makespan and total weighted completion time, Operations Research Letters, vol.21, issue.3, pp.115-122, 1997.

S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, pp.232-240, 2010.

F. Tessier, P. Malakar, V. Vishwanath et al., Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers, In: COMHPC@SC. IEEE.

H. Topcuoglu, S. Hariri, and M.-Y. Wu, Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.260-274, 2002.

O. Tuncer, V. J. Leung, and A. K. Coskun, PaCMap: Topology Mapping of Unstructured Communication Patterns onto Non-contiguous Allocations, ICS. ACM, pp.37-46, 2015.

J. Turek, J. L. Wolf, and P. S. Yu, Approximate Algorithms for Scheduling Parallelizable Tasks, pp.323-332, 1992.

L. G. Valiant, A Bridging Model for Parallel Computation, Communications of the ACM, vol.33, issue.8, 1990.

A. YarKhan, J. Kurzak, and J. Dongarra, QUARK Users' Guide: QUeueing And Runtime for Kernels, Tech. rep. ICL-UT-11-02, p.30, 2011.
