E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak et al., Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, Journal of Physics: Conference Series, vol.180
DOI : 10.1088/1742-6596/180/1/012037

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou et al., LU factorization for accelerator-based systems, 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pp.217-224, 2011.
DOI : 10.1109/AICCSA.2011.6126599
URL : https://hal.archives-ouvertes.fr/hal-00654193

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.932-943, 2011.
DOI : 10.1109/IPDPS.2011.90
URL : https://hal.archives-ouvertes.fr/inria-00547614

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs, GPU Computing Gems, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00547847

M. Ament, G. Knittel, D. Weiskopf, and W. Straßer, A Parallel Preconditioned Conjugate Gradient Solver for the Poisson Problem on a Multi-GPU Platform, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.583-592, 2010.
DOI : 10.1109/PDP.2010.51

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, pp.187-198, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Euro-Par, pp.851-862, 2009.
DOI : 10.1109/TPDS.2003.1214317

R. M. Badia, J. R. Herrero, J. Labarta, J. M. Pérez, E. S. Quintana-ortí et al., Parallelizing dense and banded linear algebra libraries using SMPSs, Concurrency and Computation: Practice and Experience, pp.2438-2456, 2009.
DOI : 10.1002/cpe.1463
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.3457

J. Bahi, R. Couturier, and L. Z. Khodja, Parallel Sparse Linear Solver GMRES for GPU Clusters with Compression of Exchanged Data, HeteroPar'11, 9-th Int. Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms, 2011.
DOI : 10.1007/978-3-642-29737-3_52

N. Bell and M. Garland, Efficient sparse matrix-vector multiplication on CUDA, 2008.

N. Bell and M. Garland, Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, 2009.
DOI : 10.1145/1654059.1654078

P. Bellens, J. M. Pérez, F. Cabarcas, A. Ramírez, R. M. Badia et al., CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy, Scientific Programming, pp.77-95, 2009.
DOI : 10.1155/2009/561672

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar et al., Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp.1432-1441, 2011.
DOI : 10.1109/IPDPS.2011.299

G. Bosilca, A. Bouteiller, A. Danalis, T. Hérault, P. Lemarinier et al., DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, vol.38, issue.1-2, pp.37-51, 2012.
DOI : 10.1016/j.parco.2011.10.003

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, Parallel tiled QR factorization for multicore architectures, Concurrency and Computation: Practice and Experience, pp.1573-1590, 2008.

A. Cevahir, A. Nukada, and S. Matsuoka, Fast Conjugate Gradients with Multiple GPUs, pp.893-903
DOI : 10.1007/978-3-642-01970-8_90

A. Cevahir, A. Nukada, and S. Matsuoka, High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning, Computer Science - Research and Development, vol.33, issue.10-11, pp.83-91, 2010.
DOI : 10.1007/s00450-010-0112-6

J. W. Choi, A. Singh, and R. W. Vuduc, Model-driven autotuning of sparse matrix-vector multiply on GPUs, 2010.

R. Couturier and S. Domas, Sparse systems solving on GPUs with GMRES, The Journal of Supercomputing, vol.59, issue.3, pp.1504-1516, 2012.
DOI : 10.1007/s11227-011-0562-z
URL : https://hal.archives-ouvertes.fr/hal-00644456

S. Georgescu and H. Okuda, Conjugate gradients on multiple GPUs, International Journal for Numerical Methods in Fluids, vol.227, issue.10-12, p.1254, 2010.
DOI : 10.1002/fld.2462

P. Ghysels, T. J. Ashby, K. Meerbergen, and W. Vanroose, Hiding Global Communication Latency in the GMRES Algorithm on Massively Parallel Machines, SIAM Journal on Scientific Computing, vol.35, issue.1
DOI : 10.1137/12086563X

P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing, vol.40, issue.7, 2012.
DOI : 10.1016/j.parco.2013.06.001

A. Haidar, On the parallel scalability of hybrid linear solvers for large 3D problems, 2008.
URL : https://hal.archives-ouvertes.fr/tel-00347948

T. D. R. Hartley, E. Saule, and Ü. V. Çatalyürek, Improving performance of adaptive component-based dataflow middleware, Parallel Computing, vol.38, issue.6-7, p.289, 2012.

E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard, Multi-GPU and multi-CPU parallelization for interactive physics simulations, Euro-Par, pp.235-246, 2010.

L. V. Kalé and S. Krishnan, CHARM++: A portable concurrent object oriented system based on C++, OOPSLA, pp.91-108, 1993.

D. M. Kunzman and L. V. Kalé, Programming heterogeneous clusters with accelerators using object-based programming, Scientific Programming, vol.19, issue.1, pp.47-62, 2011.

J. Kurzak, H. Ltaief, J. Dongarra, and R. M. Badia, Scheduling dense linear algebra operations on multicore processors, Concurrency and Computation: Practice and Experience, pp.15-44, 2010.

R. Li, H. Klie, H. Sudan, and Y. Saad, Towards Realistic Reservoir Simulations on Manycore Platforms, SPE Journal, pp.1-23, 2010.

C. Luk, S. Hong, and H. Kim, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Micro-42, pp.45-55, 2009.
DOI : 10.1145/1669112.1669121

A. Monakov and A. Avetisyan, Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs, SAMOS, pp.289-297, 2009.
DOI : 10.1007/978-3-540-75444-2_37

A. Monakov, A. Lokhmotov, and A. Avetisyan, Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures, Lecture Notes in Computer Science, vol.5952, pp.111-125, 2010.
DOI : 10.1007/978-3-642-11515-8_10

A. Munshi, The OpenCL Specification, Khronos OpenCL Working Group, version 1.1, revision 44, 2011.

T. Oberhuber, A. Suzuki, and J. Vacata, New row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA, 2010.

G. Quintana-ortí, E. S. Quintana-ortí, E. Chan, F. G. Van-zee, and R. A. Van-de-geijn, Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures, 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), 2008.
DOI : 10.1109/PDP.2008.37

G. Quintana-ortí, F. D. Igual, E. S. Quintana-ortí, and R. A. Van-de-geijn, Solving dense linear algebra problems on platforms with multiple hardware accelerators, FLAME Working Notes, p.32, 2008.

G. Quintana-ortí, F. D. Igual, E. S. Quintana-ortí, and R. A. Van-de-geijn, Solving dense linear systems on platforms with multiple hardware accelerators, ACM SIGPLAN Notices, vol.44, issue.4, pp.121-130, 2009.
DOI : 10.1145/1594835.1504196

J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism, 2007.

Y. Saad, Iterative Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, 2003.
DOI : 10.1137/1.9780898718003

H. Topcuoglu, S. Hariri, and M. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.260-274, 2002.
DOI : 10.1109/71.993206

F. G. Van-zee, E. Chan, R. A. Van-de-geijn, E. S. Quintana-ortí, G. Quintana-ortí et al., The libflame Library for Dense Matrix Computations, Computing in Science and Engineering, vol.11, issue.6, pp.56-63, 2009.

M. Verschoor and A. C. Jalba, Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs, Parallel Computing, vol.38, issue.10-11, pp.552-575, 2012.
DOI : 10.1016/j.parco.2012.07.002

M. Wang, H. Klie, M. Parashar, and H. Sudan, Solving Sparse Linear Systems on NVIDIA Tesla GPUs, pp.864-873
DOI : 10.1007/978-3-642-01970-8_87

W. A. Wiggers, V. Bakker, A. B. Kokkeler, and G. J. Smit, Implementing the conjugate gradient algorithm on multi-core systems, 2007 International Symposium on System-on-Chip, pp.1-4, 2007.
DOI : 10.1109/ISSOC.2007.4427436

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC '07, pp.1-38, 2007.