Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, Journal of Physics: Conference Series, vol.180 ,
DOI : 10.1088/1742-6596/180/1/012037
LU factorization for accelerator-based systems, 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pp.217-224, 2011. ,
DOI : 10.1109/AICCSA.2011.6126599
URL : https://hal.archives-ouvertes.fr/hal-00654193
QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.932-943, 2011. ,
DOI : 10.1109/IPDPS.2011.90
URL : https://hal.archives-ouvertes.fr/inria-00547614
Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs, GPU Computing Gems, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00547847
A Parallel Preconditioned Conjugate Gradient Solver for the Poisson Problem on a Multi-GPU Platform, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.583-592, 2010. ,
DOI : 10.1109/PDP.2010.51
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, pp.187-198, 2009. ,
URL : https://hal.archives-ouvertes.fr/inria-00384363
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Euro-Par, pp.851-862, 2009. ,
DOI : 10.1109/TPDS.2003.1214317
Parallelizing dense and banded linear algebra libraries using SMPSs, Concurrency and Computation: Practice and Experience, pp.2438-2456, 2009. ,
DOI : 10.1002/cpe.1463
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.3457
Parallel Sparse Linear Solver GMRES for GPU Clusters with Compression of Exchanged Data, HeteroPar'11, 9-th Int. Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms, 2011. ,
DOI : 10.1007/978-3-642-29737-3_52
Efficient sparse matrix-vector multiplication on CUDA, 2008. ,
Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, 2009. ,
DOI : 10.1145/1654059.1654078
CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy, Scientific Programming, pp.77-95, 2009. ,
DOI : 10.1155/2009/561672
Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp.1432-1441, 2011. ,
DOI : 10.1109/IPDPS.2011.299
DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, vol.38, issue.1-2, pp.37-51, 2012. ,
DOI : 10.1016/j.parco.2011.10.003
Parallel tiled QR factorization for multicore architectures. Concurrency and Computation: Practice and Experience, pp.1573-1590, 2008. ,
Fast Conjugate Gradients with Multiple GPUs, pp.893-903 ,
DOI : 10.1007/978-3-642-01970-8_90
High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning, Computer Science - Research and Development, vol.33, issue.10-11, pp.83-91, 2010. ,
DOI : 10.1007/s00450-010-0112-6
Model-driven autotuning of sparse matrix-vector multiply on GPUs, 2010. ,
Sparse systems solving on GPUs with GMRES, The Journal of Supercomputing, vol.59, issue.3, pp.1504-1516, 2012. ,
DOI : 10.1007/s11227-011-0562-z
URL : https://hal.archives-ouvertes.fr/hal-00644456
Conjugate gradients on multiple GPUs, International Journal for Numerical Methods in Fluids, vol.227, issue.10-12, pp.10-121254, 2010. ,
DOI : 10.1002/fld.2462
Hiding Global Communication Latency in the GMRES Algorithm on Massively Parallel Machines, SIAM Journal on Scientific Computing, vol.35, issue.1 ,
DOI : 10.1137/12086563X
Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing, vol.40, issue.7, 2012. ,
DOI : 10.1016/j.parco.2013.06.001
On the parallel scalability of hybrid linear solvers for large 3D problems, 2008. ,
URL : https://hal.archives-ouvertes.fr/tel-00347948
Improving performance of adaptive component-based dataflow middleware, Parallel Computing, vol.38, pp.6-7289, 2012. ,
parallelization for interactive physics simulations, Euro-Par, pp.235-246, 2010. ,
CHARM++: A portable concurrent object oriented system based on C++, OOPSLA, pp.91-108, 1993. ,
Programming heterogeneous clusters with accelerators using object-based programming, Scientific Programming, vol.19, issue.1, pp.47-62, 2011. ,
Scheduling dense linear algebra operations on multicore processors. Concurrency and Computation: Practice and Experience, pp.15-44, 2010. ,
Towards Realistic Reservoir Simulations on Manycore Platforms, SPE Journal, pp.1-23, 2010. ,
Qilin, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Micro-42, pp.45-55, 2009. ,
DOI : 10.1145/1669112.1669121
Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs, SAMOS, pp.289-297, 2009. ,
DOI : 10.1007/978-3-540-75444-2_37
Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures, Lecture Notes in Computer Science, vol.5952, pp.111-125, 2010. ,
DOI : 10.1007/978-3-642-11515-8_10
The OpenCL Specification, Khronos OpenCL Working Group, version 1.1, revision 44, 2011. ,
New row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA, 1012. ,
Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures, 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), 2008. ,
DOI : 10.1109/PDP.2008.37
Solving dense linear algebra problems on platforms with multiple hardware accelerators, FLAME Working Notes, p.32, 2008. ,
Solving dense linear systems on platforms with multiple hardware accelerators, ACM SIGPLAN Notices, vol.44, issue.4, pp.121-130, 2009. ,
DOI : 10.1145/1594835.1504196
Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism, 2007. ,
Iterative Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, 2003. ,
DOI : 10.1137/1.9780898718003
Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.260-274, 2002. ,
DOI : 10.1109/71.993206
The libflame Library for Dense Matrix Computations, Computing in Science and Engineering, vol.11, issue.6, pp.56-63, 2009. ,
Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs, Parallel Computing, vol.38, issue.10-11, pp.552-575, 2012. ,
DOI : 10.1016/j.parco.2012.07.002
Solving Sparse Linear Systems on NVIDIA Tesla GPUs, pp.864-873 ,
DOI : 10.1007/978-3-642-01970-8_87
Implementing the conjugate gradient algorithm on multi-core systems, 2007 International Symposium on System-on-Chip, pp.1-4, 2007. ,
DOI : 10.1109/ISSOC.2007.4427436
Optimization of sparse matrix-vector multiplication on emerging multicore platforms, Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC '07, pp.1-38, 2007. ,