E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner et al., Task-Based FMM for Multicore Architectures, SIAM Journal on Scientific Computing, vol.36, issue.1, pp.66-93, 2014.
DOI : 10.1137/130915662

URL : https://hal.archives-ouvertes.fr/hal-00807368

L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, Journal of Computational Physics, vol.73, issue.2, pp.325-348, 1987.
DOI : 10.1016/0021-9991(87)90140-9

L. Greengard and V. Rokhlin, A new version of the Fast Multipole Method for the Laplace equation in three dimensions, Acta Numerica, vol.6, pp.229-269, 1997.
DOI : 10.1017/S0962492900002725

L. Greengard and W. D. Gropp, A parallel version of the fast multipole method, Computers & Mathematics with Applications, vol.20, issue.7, pp.63-71, 1990.
DOI : 10.1016/0898-1221(90)90349-O

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, and R. Vuduc, Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp.1-15, 2010.
DOI : 10.1109/IPDPS.2010.5470415

F. A. Cruz, M. G. Knepley, and L. A. Barba, PetFMM: A dynamically load-balancing parallel fast multipole library, International Journal for Numerical Methods in Engineering, vol.85, issue.4, pp.403-428, 2011.
DOI : 10.1002/nme.2972

E. Darve, C. Cecka, and T. Takahashi, The fast multipole method on parallel clusters, multicore processors, and graphics processing units, Comptes Rendus Mécanique, vol.339, issue.2-3, pp.185-193, 2011.
DOI : 10.1016/j.crme.2010.12.005

R. Yokota, T. Narumi, R. Sakamaki, S. Kameoka, S. Obi et al., Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence, Computer Physics Communications, vol.180, issue.11, pp.2066-2078, 2009.
DOI : 10.1016/j.cpc.2009.06.009

N. A. Gumerov and R. Duraiswami, Fast multipole methods on graphics processors, Journal of Computational Physics, vol.227, issue.18, pp.8290-8313, 2008.

T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori et al., 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence, pp.62:1-62:12, 2009.

T. Takahashi, C. Cecka, W. Fong, and E. Darve, Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units, International Journal for Numerical Methods in Engineering, vol.89, issue.1, pp.105-133, 2012.
DOI : 10.1002/nme.3240

Q. Hu, N. A. Gumerov, and R. Duraiswami, Scalable fast multipole methods on distributed heterogeneous architectures, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-36, 2011.
DOI : 10.1145/2063384.2063432

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

I. Lashuk, A. Chandramowlishwaran, H. Langston, T. Nguyen, R. Sampath et al., A massively parallel adaptive fast multipole method on heterogeneous architectures, Proceedings of the 2009 ACM/IEEE Conference on Supercomputing, pp.1-11, 2009.

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., Faster, Cheaper, Better – a Hybridization Methodology to Develop Linear Algebra Software for GPUs, GPU Computing Gems, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00547847

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou et al., LU factorization for accelerator-based systems, 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pp.217-224, 2011.
DOI : 10.1109/AICCSA.2011.6126599

URL : https://hal.archives-ouvertes.fr/hal-00654193

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.932-943, 2011.
DOI : 10.1109/IPDPS.2011.90

URL : https://hal.archives-ouvertes.fr/inria-00547614

G. Quintana-Ortí, F. D. Igual, E. S. Quintana-Ortí, and R. A. van de Geijn, Solving dense linear systems on platforms with multiple hardware accelerators, ACM SIGPLAN Notices, vol.44, issue.4, pp.121-130, 2009.
DOI : 10.1145/1594835.1504196

G. Quintana-Ortí, E. S. Quintana-Ortí, E. Chan, F. G. Van Zee, and R. A. van de Geijn, Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures, 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), 2008.
DOI : 10.1109/PDP.2008.37

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, Parallel tiled QR factorization for multicore architectures, Concurrency and Computation: Practice and Experience, pp.1573-1590, 2008.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar et al., Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp.1432-1441, 2011.
DOI : 10.1109/IPDPS.2011.299

F. G. Van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Ortí, and G. Quintana-Ortí, The libflame Library for Dense Matrix Computations, Computing in Science and Engineering, vol.11, issue.6, pp.56-63, 2009.

X. Lacoste, M. Faverge, P. Ramet, S. Thibault, and G. Bosilca, Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014.
DOI : 10.1109/IPDPSW.2014.9

URL : https://hal.archives-ouvertes.fr/hal-00925017

E. Agullo, A. Buttari, A. Guermouche, and F. Lopez, Multifrontal QR Factorization for Multicore Architectures over Runtime Systems, Euro-Par 2013 Parallel Processing, pp.521-532, 2013.
DOI : 10.1007/978-3-642-40047-6_53

URL : https://hal.archives-ouvertes.fr/hal-01220611

E. Agullo, L. Giraud, A. Guermouche, S. Nakov, and J. Roman, Task-based Conjugate-Gradient for multi-GPUs platforms, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00767368

B. Lizé, G. Sylvand, E. Agullo, and S. Thibault, A task-based H-matrix solver for acoustic and electromagnetic problems on multicore architectures, SciCADE, the International Conference on Scientific Computation and Differential Equations, 2013.

H. Ltaief and R. Yokota, Data-driven execution of fast multipole methods, Concurrency and Computation: Practice and Experience, vol.26, issue.11
DOI : 10.1002/cpe.3132

R. Kriemann, H-LU Factorization on Many-Core Systems, Preprint 5, Max-Planck-Institut für Mathematik in den Naturwissenschaften, Leipzig, 2014.

A. Duran, J. M. Perez, E. Ayguadé, R. M. Badia, and J. Labarta, Extending the OpenMP Tasking Model to Allow Dependent Tasks, OpenMP in a New Era of Parallelism, 4th International Workshop, pp.111-122, 2008.
DOI : 10.1007/978-3-540-79561-2_10

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, pp.187-198, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault et al., PaRSEC: A programming paradigm exploiting heterogeneity for enhancing scalability, Computing in Science and Engineering, vol.99, issue.1, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00930217

A. Yarkhan, J. Kurzak, and J. Dongarra, QUARK users' guide: QUeueing And Runtime for Kernels, 2011.

E. Chan, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn, SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '07, pp.116-125, 2007.
DOI : 10.1145/1248377.1248397

T. Takahashi, C. Cecka, and E. Darve, Optimization of the parallel black-box fast multipole method on CUDA, 2012 Innovative Parallel Computing (InPar), pp.1-14, 2012.
DOI : 10.1109/InPar.2012.6339607

W. Fong and E. Darve, The black-box fast multipole method, Journal of Computational Physics, vol.228, issue.23, pp.8712-8725, 2009.
DOI : 10.1016/j.jcp.2009.08.031

M. Messner, M. Schanz, and E. Darve, Fast directional multilevel summation for oscillatory kernels based on Chebyshev interpolation, Journal of Computational Physics, vol.231, issue.4, pp.1175-1196, 2012.
DOI : 10.1016/j.jcp.2011.09.027

M. Messner, B. Bramas, O. Coulaud, and E. Darve, Optimized M2L Kernels for the Chebyshev Interpolation based Fast Multipole Method, arXiv e-prints, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00746089

G. M. Morton, A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing, International Business Machines Company, 1966.

K. Kennedy and J. R. Allen, Optimizing Compilers for Modern Architectures: A Dependence-based Approach, 2002.

G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao et al., Fast implementation of DGEMM on Fermi GPU, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, p.35, 2011.
DOI : 10.1145/2063384.2063431

J. Choi, A. Chandramowlishwaran, K. Madduri, and R. Vuduc, A CPU, Proceedings of Workshop on General Purpose Processing Using GPUs, GPGPU-7, pp.64-64, 2014.
DOI : 10.1145/2588768.2576787

K. Nabors, F. Korsmeyer, F. Leighton, and J. White, Preconditioned, Adaptive, Multipole-Accelerated Iterative Methods for Three-Dimensional First-Kind Integral Equations of Potential Theory, SIAM Journal on Scientific Computing, vol.15, issue.3, pp.713-735, 1994.
DOI : 10.1137/0915046

C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault, StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators, The 19th European MPI Users' Group Meeting, LNCS, vol.7490, 2012.