L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, Journal of Computational Physics, vol.73, issue.2, pp.325-348, 1987.

L. Greengard and W. D. Gropp, A parallel version of the fast multipole method, Computers & Mathematics with Applications, vol.20, issue.7, pp.63-71, 1990.
DOI : 10.1016/0898-1221(90)90349-O

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, and R. Vuduc, Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp.1-15, 2010.
DOI : 10.1109/IPDPS.2010.5470415

F. A. Cruz, M. G. Knepley, and L. A. Barba, PetFMM-A dynamically load-balancing parallel fast multipole library, International Journal for Numerical Methods in Engineering, vol.85, issue.4, pp.403-428, 2011.
DOI : 10.1002/nme.2972

E. Darve, C. Cecka, and T. Takahashi, The fast multipole method on parallel clusters, multicore processors, and graphics processing units, Comptes Rendus Mécanique, vol.339, issue.2-3, pp.185-193, 2011.
DOI : 10.1016/j.crme.2010.12.005

R. Yokota, T. Narumi, R. Sakamaki, S. Kameoka, S. Obi et al., Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence, Computer Physics Communications, vol.180, issue.11, pp.2066-2078, 2009.
DOI : 10.1016/j.cpc.2009.06.009

N. A. Gumerov and R. Duraiswami, Fast multipole methods on graphics processors, Journal of Computational Physics, vol.227, issue.18, pp.8290-8313, 2008.
DOI : 10.1016/j.jcp.2008.05.023
URL : http://drum.lib.umd.edu/bitstream/1903/7549/1/paper_gpu_fmm_revised_final.pdf

T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori et al., 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence, pp.62:1-62:12, 2009.

T. Takahashi, C. Cecka, W. Fong, E. Darve, Q. Hu et al., Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11, pp.105-133, 2011.
DOI : 10.1002/nme.3240

I. Lashuk, A. Chandramowlishwaran, H. Langston, T. Nguyen, R. Sampath et al., A massively parallel adaptive fast-multipole method on heterogeneous architectures, Proceedings of the 2009 ACM/IEEE conference on Supercomputing, pp.1-11, 2009.

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs, GPU Computing Gems, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00547847

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou et al., LU factorization for accelerator-based systems, 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pp.217-224, 2011.
DOI : 10.1109/AICCSA.2011.6126599
URL : https://hal.archives-ouvertes.fr/hal-00654193

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.932-943, 2011.
DOI : 10.1109/IPDPS.2011.90
URL : https://hal.archives-ouvertes.fr/inria-00547614

G. Quintana-Ortí, F. D. Igual, E. S. Quintana-Ortí, and R. A. van de Geijn, Solving dense linear systems on platforms with multiple hardware accelerators, ACM SIGPLAN Notices, vol.44, issue.4, pp.121-130, 2009.
DOI : 10.1145/1594835.1504196

G. Quintana-Ortí, E. S. Quintana-Ortí, E. Chan, F. G. Van Zee, and R. A. van de Geijn, Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures, 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), 2008.
DOI : 10.1109/PDP.2008.37

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, Parallel tiled QR factorization for multicore architectures, Concurrency and Computation: Practice and Experience, pp.1573-1590, 2008.

J. Kurzak, H. Ltaief, J. Dongarra, and R. M. Badia, Scheduling dense linear algebra operations on multicore processors, Concurrency and Computation: Practice and Experience, pp.15-44, 2010.
DOI : 10.1145/1377612.1377615
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.177.3294

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar et al., Flexible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA, IPDPS Workshops. IEEE, pp.1432-1441, 2011.

PLASMA users' guide, parallel linear algebra software for multicore architectures, 2009.

Inria, MAGMA users' guide, version 0.2, 2009.

F. G. Van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Ortí, and G. Quintana-Ortí, The libflame library for dense matrix computations, Computing in Science and Engineering, vol.11, issue.6, pp.56-63, 2009.

H. Ltaief and R. Yokota, Data-driven execution of fast multipole methods, Concurrency and Computation: Practice and Experience, vol.26, issue.11, 2014.
DOI : 10.1002/cpe.3132

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

W. Fong and E. Darve, The black-box fast multipole method, Journal of Computational Physics, vol.228, issue.23, pp.8712-8725, 2009.
DOI : 10.1016/j.jcp.2009.08.031

L. Ying, G. Biros, and D. Zorin, A kernel-independent adaptive fast multipole algorithm in two and three dimensions, Journal of Computational Physics, vol.196, issue.2, pp.591-626, 2004.
DOI : 10.1016/j.jcp.2003.11.021

M. Messner, M. Schanz, and E. Darve, Fast directional multilevel summation for oscillatory kernels based on Chebyshev interpolation, Journal of Computational Physics, vol.231, issue.4, pp.1175-1196, 2012.
DOI : 10.1016/j.jcp.2011.09.027

E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp.851-862, 2009.
DOI : 10.1109/TPDS.2003.1214317

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier et al., DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, vol.38, issue.1-2, pp.37-51, 2012.
DOI : 10.1016/j.parco.2011.10.003

G. F. Diamos and S. Yalamanchili, Harmony, Proceedings of the 17th international symposium on High performance distributed computing, HPDC '08, pp.197-200, 2008.
DOI : 10.1145/1383422.1383447

K. Fatahalian, T. Knight, M. Houston, M. Erez, D. Horn et al., Sequoia: Programming the memory hierarchy, ACM/IEEE SC'06 Conference, 2006.

Scaling hierarchical N-body simulations on GPU clusters, SC'10 USB Key.

M. S. Warren and J. K. Salmon, A parallel hashed Oct-Tree N-body algorithm, Proceedings of the 1993 ACM/IEEE conference on Supercomputing, Supercomputing '93, pp.12-21, 1993.
DOI : 10.1145/169627.169640

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, pp.187-198, 2009.