A fundamental turn toward concurrency in software, Dr. Dobb's Journal, vol.30, issue.3, 2005.
FFTW: an adaptive software architecture for the FFT, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181), pp.1381-1384, 1998.
DOI : 10.1109/ICASSP.1998.681704
Model-driven autotuning of sparse matrix-vector multiply on GPUs, Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), 2010.
PetaBricks: A language and compiler for algorithmic choice, ACM SIGPLAN Conference on Programming Language Design and Implementation, 2009.
Autotuning multigrid with PetaBricks, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, 2009.
DOI : 10.1145/1654059.1654065
Automated empirical optimizations of software and the ATLAS project, Parallel Computing, vol.27, issue.1-2, pp.3-35, 2001.
Benchmarking GPUs to tune dense linear algebra, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2008.
DOI : 10.1109/SC.2008.5214359
Dense linear algebra solvers for multicore with GPU accelerators, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010.
DOI : 10.1109/IPDPSW.2010.5470941
Programming matrix algorithms-by-blocks for thread-level parallelism, ACM Transactions on Mathematical Software, vol.36, issue.3, 2009.
DOI : 10.1145/1527286.1527288
A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, issue.1, pp.38-53, 2009.
DOI : 10.1016/j.parco.2008.10.002
Math Kernel Library (MKL), http://www.intel.com/software/products
Comparative study of one-sided factorizations with multiple software packages on multi-core hardware, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, 2009.
DOI : 10.1145/1654059.1654080
Graph Theory: An Algorithmic Approach, 1975.
Achieving accurate and context-sensitive timing for code optimization, Software: Practice and Experience, pp.1621-1642, 2008.
Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM Journal on Scientific Computing, vol.34, issue.1, 2008.
DOI : 10.1137/080731992
URL : https://hal.archives-ouvertes.fr/hal-00870930
Tile QR factorization with parallel panel processing for multicore architectures, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
DOI : 10.1109/IPDPS.2010.5470443
URL : https://hal.archives-ouvertes.fr/inria-00548899
QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, 2011.
DOI : 10.1109/IPDPS.2011.90
URL : https://hal.archives-ouvertes.fr/inria-00547614
DAGuE: A generic distributed DAG engine for high performance computing, 2010.