B. Acun, A. Gupta, N. Jain, A. Langer, H. Menon et al., Parallel programming with migratable objects: Charm++ in practice, SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp.647-658, 2014.

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs, GPU Computing Gems, vol.2, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00547847

E. Agullo, O. Aumage, M. Faverge, N. Furmento, F. Pruvost et al., Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model, IEEE Transactions on Parallel and Distributed Systems, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01618526

C. Augonnet, S. Thibault, R. Namyst, and P. A. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, CCPE - Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

O. Aumage, E. Brunet, N. Furmento, and R. Namyst, NewMadeleine: a Fast Communication Scheduling Engine for High Performance Networks, Workshop on Communication Architecture for Clusters, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00122723

M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, Legion: Expressing locality and independence with logical regions, SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2012.

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Luszczek et al., Dense linear algebra on distributed heterogeneous hardware with a symbolic DAG approach, Scalable Computing and Communications: Theory and Practice, pp.699-735, 2013.

A. Denis, Scalability of the NewMadeleine Communication Library for Large Numbers of MPI Point-to-Point Requests, CCGrid 2019 - 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02103700

J. Dongarra, Architecture-Aware Algorithms for Scalable Performance and Resilience on Heterogeneous Architectures, 2013.

E. Jeannot, Automatic multithreaded parallel program generation for message passing multiprocessors using parameterized task graphs, International Conference 'Parallel Computing', 2001.
URL : https://hal.archives-ouvertes.fr/inria-00100489

H. Kaiser, M. Brodowicz, and T. Sterling, ParalleX: an advanced parallel execution model for scaling-impaired applications, 2009 International Conference on Parallel Processing Workshops, pp.394-401, 2009.


J. Pješivac-Grbović, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel et al., Performance analysis of MPI collective operations, Cluster Computing, vol.10, issue.2, pp.127-143, 2007.

P. Sanders, J. Speck, and J. L. Träff, Two-tree algorithms for full bandwidth broadcast, reduction and scan, Parallel Computing, vol.35, issue.12, pp.581-594, 2009.


E. Tejedor, M. Farreras, D. Grove, R. M. Badia, G. Almasi et al., A high-productivity task-based programming model for clusters, Concurrency and Computation: Practice and Experience, vol.24, pp.2421-2448, 2012.

J. L. Träff and A. Ripke, Optimal broadcast for fully connected processor-node networks, Journal of Parallel and Distributed Computing, vol.68, issue.7, pp.887-901, 2008.

U. Wickramasinghe and A. Lumsdaine, A survey of methods for collective communication optimization and tuning, 2016.