Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, pp.75-82, 2012.
DOI : 10.1109/SBAC-PAD.2012.28
URL : https://hal.archives-ouvertes.fr/hal-00735470
XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013.
DOI : 10.1109/IPDPS.2013.66
URL : https://hal.archives-ouvertes.fr/hal-00799904
Early Evaluation of Directive-Based GPU Programming Models for Productive Exascale Computing, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12, pp.23:1-23:11, 2012.
Dense Linear Algebra Factorization in OpenMP and Cilk Plus on Intel's MIC Architecture: Development Experiences and Performance Analysis
Prototype programming environment in Booster Node, deliverable D5.1, EU DEEP project Dynamical Exascale Entry Platform, Tech. Rep, vol.02, pp.7-2011
OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison, Proceedings of the Many-core Applications Research Community (MARC) Symposium at RWTH Aachen University, pp.38-44, 2012.
Exploring SIMD for Molecular Dynamics, Using Intel Xeon Processors and Intel Xeon Phi Coprocessors, Proc. of the 27th IEEE IPDPS, 2013.
Offload Compiler Runtime for the Intel® Xeon Phi Coprocessor, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013.
DOI : 10.1109/IPDPSW.2013.251
Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel(R) Xeon Phi(TM) Coprocessor, Proc. of the 27th IEEE IPDPS, 2013.
The Problem with Threads, Computer, vol.39, issue.5, pp.33-42, 2006.
DOI : 10.1109/MC.2006.180
Space-Efficient Scheduling of Multithreaded Computations, SIAM Journal on Computing, vol.27, issue.1, pp.202-229, 1998.
DOI : 10.1137/S0097539793259471
Optimization via Reflection on Work Stealing in TBB, 2008 IEEE International Symposium on Parallel and Distributed Processing, pp.1-8, 2008.
DOI : 10.1109/IPDPS.2008.4536188
Scheduling dense linear algebra operations on multicore processors, Concurrency and Computation: Practice and Experience, vol.35, issue.2, pp.15-44, 2010.
DOI : 10.1145/1377612.1377615
A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks, International Journal of Parallel Programming, vol.26, issue.6, pp.292-305, 2009.
DOI : 10.1007/s10766-009-0101-1
Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations, Proc. of Euro-Par, pp.235-246, 2010.
DOI : 10.1007/978-3-642-15291-7_23
URL : https://hal.archives-ouvertes.fr/inria-00502448
Productive Programming of GPU Clusters with OmpSs, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.58
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363
KAAPI, Proceedings of the 2007 international workshop on Parallel symbolic computation, PASCO '07, 2007.
DOI : 10.1145/1278177.1278182
URL : https://hal.archives-ouvertes.fr/hal-00647474
The implementation of the Cilk-5 multithreaded language, Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, ser. PLDI '98, pp.212-223, 1998.
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Proc. of Euro-Par, pp.851-862, 2009.
DOI : 10.1109/TPDS.2003.1214317
Athapascan-1: On-line building data flow graph in a parallel language, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192), pp.88-95, 1998.
DOI : 10.1109/PACT.1998.727176
libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms, Proceedings of the 8th international conference on OpenMP in a Heterogeneous World, ser. IWOMP'12, pp.102-115, 2012.
DOI : 10.1007/978-3-642-30961-8_8
URL : https://hal.archives-ouvertes.fr/hal-00796253
QUARK Users' Guide: QUeueing And Runtime for Kernels, 2011.
Decentralized list scheduling, Annals of Operations Research, vol.18, issue.2, pp.1-23, 2012.
DOI : 10.1007/s10479-012-1149-7
URL : https://hal.archives-ouvertes.fr/hal-00796248
Source code for n queens problem
Faster, Cheaper, Better: A Hybridization Methodology to Develop Linear Algebra Software for GPUs, GPU Computing Gems, Wen-mei W. Hwu, Ed., 2010.
URL : https://hal.archives-ouvertes.fr/inria-00547847
A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, issue.1, pp.38-53, 2009.
DOI : 10.1016/j.parco.2008.10.002
Two-dimensional block partitionings for the parallel sparse Cholesky factorization, Numerical Algorithms, vol.16, issue.1, pp.17-38, 1997.
DOI : 10.1023/A:1019122726788
URL : https://hal.archives-ouvertes.fr/inria-00073533
Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010.
DOI : 10.1016/j.parco.2009.12.005
MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors, 2013.
An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines, pp.141-155, 2013.
DOI : 10.1007/978-3-642-40698-0_11
URL : https://hal.archives-ouvertes.fr/hal-00867438