J. Dongarra, J. Kurzak, P. Luszczek, and S. Tomov, Dense Linear Algebra on Accelerated Multicore Hardware, High- Performance Scientific Computing, pp.123-146
DOI : 10.1007/978-1-4471-2437-5_5

M. Horton, S. Tomov, and J. Dongarra, A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures, 2011 Symposium on Application Accelerators in High-Performance Computing, pp.150-158, 2011.
DOI : 10.1109/SAAHPC.2011.18

F. Song and J. Dongarra, A scalable framework for heterogeneous GPUbased clusters, Proc. of ACM SPAA, pp.91-100

S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6
DOI : 10.1016/j.parco.2009.12.005

J. Bueno, J. Planas, A. Duran, R. M. Badia, X. Martorell et al., Productive Programming of GPU Clusters with OmpSs, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.557-568, 2012.
DOI : 10.1109/IPDPS.2012.58

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, issue.4, pp.187-198, 2011.
DOI : 10.1002/cpe.1631

URL : https://hal.archives-ouvertes.fr/inria-00384363

C. Augonnet, J. Clet-ortega, S. Thibault, and R. Namyst, Data-Aware Task Scheduling on Multi-accelerator Based Platforms, 2010 IEEE 16th International Conference on Parallel and Distributed Systems, pp.291-298, 2010.
DOI : 10.1109/ICPADS.2010.129

URL : https://hal.archives-ouvertes.fr/inria-00523937

H. Topcuoglu, S. Hariri, and M. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.260-274, 2002.
DOI : 10.1109/71.993206

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier et al., DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, vol.38, issue.12

E. Hermann, B. Raffin, F. C. Faure, T. Gautier, and J. Allard, Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations, Proc. of the 2010 Euro-Par, pp.235-246, 0723.
DOI : 10.1007/978-3-642-15291-7_23

URL : https://hal.archives-ouvertes.fr/inria-00502448

T. Gautier, J. V. Lima, N. Maillard, and B. Raffin, XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp.1299-1308, 2013.
DOI : 10.1109/IPDPS.2013.66

URL : https://hal.archives-ouvertes.fr/hal-00799904

J. V. Lima, T. Gautier, N. Maillard, and V. Danjean, Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, pp.75-82
DOI : 10.1109/SBAC-PAD.2012.28

URL : https://hal.archives-ouvertes.fr/hal-00735470

M. Frigo, C. E. Leiserson, and K. H. Randall, The implementation of the cilk-5 multithreaded language, SIGPLAN Not, pp.212-223, 1998.

U. A. Acar, G. E. Blelloch, and R. D. Blumofe, The data locality of work stealing, Proc. of the ACM SPAA, pp.1-12, 2000.

P. Jetley, L. Wesolowski, F. Gioachin, L. V. Kalé, and T. R. Quinn, Scaling Hierarchical N-body Simulations on GPU Clusters, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.
DOI : 10.1109/SC.2010.49

R. Vasudevan, S. S. Vadhiyar, and L. V. Kalé, G-Charm, Proceedings of the 27th international ACM conference on International conference on supercomputing, ICS '13, pp.349-358
DOI : 10.1145/2464996.2465444

T. Gautier, X. Besseron, and L. Pigeon, KAAPI, Proceedings of the 2007 international workshop on Parallel symbolic computation, PASCO '07, pp.15-23, 2007.
DOI : 10.1145/1278177.1278182

URL : https://hal.archives-ouvertes.fr/hal-00647474

E. Ayguadé, R. Badia, F. Igual, J. Labarta, R. Mayo et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Proc. of the, pp.851-862, 2009.
DOI : 10.1109/TPDS.2003.1214317

R. M. Badia, J. R. Herrero, J. Labarta, J. M. Pérez, E. S. Quintana-ortí et al., Parallelizing dense and banded linear algebra libraries using SMPSs, Concurrency and Computation: Practice and Experience, vol.14, issue.7, pp.2438-245618, 2009.
DOI : 10.1002/cpe.1463

J. Planas, R. Badia, E. Ayguade, and J. Labarta, Self-Adaptive OmpSs Tasks in Heterogeneous Environments, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp.138-149, 2013.
DOI : 10.1109/IPDPS.2013.53

J. Bueno, X. Martorell, R. M. Badia, E. Ayguadé, and J. Labarta, Implementing OmpSs support for regions of data in architectures with multiple address spaces, Proceedings of the 27th international ACM conference on International conference on supercomputing, ICS '13, pp.359-368
DOI : 10.1145/2464996.2465017

J. Toss and T. Gautier, A New Programming Paradigm for GPGPU, Proc. of the 2012 Euro-Par, pp.895-907, 2012.
DOI : 10.1007/978-3-642-32820-6_88

URL : https://hal.archives-ouvertes.fr/hal-00796257

F. Broquedis, T. Gautier, and V. Danjean, libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms, pp.102-115, 2012.
DOI : 10.1007/978-3-642-30961-8_8

URL : https://hal.archives-ouvertes.fr/hal-00796253

A. Yarkhan, J. Kurzak, and J. Dongarra, Quark users' guide: Queueing and runtime for kernels, 2011.

F. Galilee, G. Cavalheiro, J. Roch, and M. Doreille, Athapascan-1: Online building data flow graph in a parallel language, Proc. of the 1998 PACT, pp.88-95, 1998.

G. Quintana-ortí, F. D. Igual, E. S. Quintana-ortí, and R. A. Van-de-geijn, Solving dense linear systems on platforms with multiple hardware accelerators, SIGPLAN Not, pp.121-130, 2009.

R. D. Blumofe and C. E. Leiserson, Space-Efficient Scheduling of Multithreaded Computations, SIAM Journal on Computing, vol.27, issue.1, pp.202-229, 1998.
DOI : 10.1137/S0097539793259471

Y. Guo, J. Zhao, V. Cave, and V. Sarkar, SLAW: A scalable locality-aware adaptive work-stealing scheduler, Proc. of the 24th IEEE IPDPS, pp.1-12, 2010.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, issue.1, pp.38-53, 2009.
DOI : 10.1016/j.parco.2008.10.002