J. V. Lima, T. Gautier, N. Maillard, and V. Danjean, Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, pp.75-82, 2012.
DOI : 10.1109/SBAC-PAD.2012.28
URL : https://hal.archives-ouvertes.fr/hal-00735470

T. Gautier, J. V. Lima, N. Maillard, and B. Raffin, XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013.
DOI : 10.1109/IPDPS.2013.66
URL : https://hal.archives-ouvertes.fr/hal-00799904

S. Lee and J. S. Vetter, Early Evaluation of Directive-Based GPU Programming Models for Productive Exascale Computing, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12, pp.23:1-23:11, 2012.

J. Eisenlohr, D. E. Hudak, K. Tomko, and T. C. Prince, Dense Linear Algebra Factorization in OpenMP and Cilk Plus on Intel's MIC Architecture: Development Experiences and Performance Analysis

V. B. Labarta, Prototype programming environment in Booster Node, deliverable D5.1, EU DEEP project (Dynamical Exascale Entry Platform), Tech. Rep, vol.02, pp.7-2011

T. Cramer, D. Schmidl, M. Klemm, and D. Mey, OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison, Proceedings of the Many-core Applications Research Community (MARC) Symposium at RWTH Aachen University, pp.38-44, 2012.

S. J. Pennycook, C. J. Hughes, M. Smelyanskiy, and S. A. Jarvis, Exploring SIMD for Molecular Dynamics, Using Intel Xeon Processors and Intel Xeon Phi Coprocessors, Proc. of the 27th IEEE IPDPS, 2013.

C. J. Newburn, S. Dmitriev, R. Narayanaswamy, J. Wiegert, R. Murty et al., Offload Compiler Runtime for the Intel® Xeon Phi Coprocessor, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013.
DOI : 10.1109/IPDPSW.2013.251

A. Heinecke, K. Vaidyanathan, M. Smelyanskiy, A. Kobotov, R. Dubtsov et al., Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel(R) Xeon Phi(TM) Coprocessor, Proc. of the 27th IEEE IPDPS, 2013.

E. A. Lee, The Problem with Threads, Computer, vol.39, issue.5, pp.33-42, 2006.
DOI : 10.1109/MC.2006.180

R. D. Blumofe and C. E. Leiserson, Space-Efficient Scheduling of Multithreaded Computations, SIAM Journal on Computing, vol.27, issue.1, pp.202-229, 1998.
DOI : 10.1137/S0097539793259471

A. Robison, M. Voss, and A. Kukanov, Optimization via Reflection on Work Stealing in TBB, 2008 IEEE International Symposium on Parallel and Distributed Processing, pp.1-8, 2008.
DOI : 10.1109/IPDPS.2008.4536188

J. Kurzak, H. Ltaief, J. Dongarra, and R. M. Badia, Scheduling dense linear algebra operations on multicore processors, Concurrency and Computation: Practice and Experience, vol.22, issue.1, pp.15-44, 2010.
DOI : 10.1002/cpe.1467

A. Duran, R. Ferrer, E. Ayguadé, R. M. Badia, and J. Labarta, A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks, International Journal of Parallel Programming, vol.37, issue.3, pp.292-305, 2009.
DOI : 10.1007/s10766-009-0101-1

E. Hermann, B. Raffin, F. C. Faure, T. Gautier, and J. Allard, Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations, Proc. of Euro-Par, pp.235-246, 2010.
DOI : 10.1007/978-3-642-15291-7_23
URL : https://hal.archives-ouvertes.fr/inria-00502448

J. Bueno, J. Planas, A. Duran, R. M. Badia, X. Martorell et al., Productive Programming of GPU Clusters with OmpSs, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.58

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

T. Gautier, X. Besseron, and L. Pigeon, KAAPI: A Thread Scheduling Runtime System for Data Flow Computations on Cluster of Multi-Processors, Proceedings of the 2007 international workshop on Parallel symbolic computation, PASCO '07, 2007.
DOI : 10.1145/1278177.1278182
URL : https://hal.archives-ouvertes.fr/hal-00647474

M. Frigo, C. E. Leiserson, and K. H. Randall, The implementation of the Cilk-5 multithreaded language, Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, ser. PLDI '98, pp.212-223, 1998.

E. Ayguadé, R. Badia, F. Igual, J. Labarta, R. Mayo et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Proc. of Euro-Par, pp.851-862, 2009.

F. Galilée, J. Roch, G. G. Cavalheiro, and M. Doreille, Athapascan-1: On-line building data flow graph in a parallel language, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192), pp.88-95, 1998.
DOI : 10.1109/PACT.1998.727176

F. Broquedis, T. Gautier, and V. Danjean, libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms, Proceedings of the 8th international conference on OpenMP in a Heterogeneous World, ser. IWOMP'12, pp.102-115, 2012.
DOI : 10.1007/978-3-642-30961-8_8
URL : https://hal.archives-ouvertes.fr/hal-00796253

A. Yarkhan, J. Kurzak, and J. Dongarra, QUARK Users' Guide: QUeueing And Runtime for Kernels, 2011.

M. Tchiboukdjian, N. Gast, and D. Trystram, Decentralized list scheduling, Annals of Operations Research, vol.18, issue.2, pp.1-23, 2012.
DOI : 10.1007/s10479-012-1149-7
URL : https://hal.archives-ouvertes.fr/hal-00796248

Takaken, Source code for N queens problem

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs, GPU Computing Gems, Wen-mei W. Hwu, Ed., 2010.
URL : https://hal.archives-ouvertes.fr/inria-00547847

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, issue.1, pp.38-53, 2009.
DOI : 10.1016/j.parco.2008.10.002

B. Dumitrescu, M. Doreille, J. Roch, and D. Trystram, Two-dimensional block partitionings for the parallel sparse Cholesky factorization, Numerical Algorithms, vol.16, issue.1, pp.17-38, 1997.
DOI : 10.1023/A:1019122726788
URL : https://hal.archives-ouvertes.fr/inria-00073533

S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010.
DOI : 10.1016/j.parco.2009.12.005

J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir et al., MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors, 2013.

M. Durand, F. Broquedis, T. Gautier, and B. Raffin, An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines, pp.141-155, 2013.
DOI : 10.1007/978-3-642-40698-0_11
URL : https://hal.archives-ouvertes.fr/hal-00867438