A. Szalay, A. Bunn, J. Gray, I. Foster, and I. Raicu, The importance of data locality in distributed computing applications, NSF Workflow Workshop, 2006.

M. Steckermeier and F. Bellosa, Using locality information in userlevel scheduling, p.91058

S. Moreaud and B. Goglin, Impact of NUMA Effects on High-Speed Networking with Multi-Opteron Machines, Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems, pp.24-29, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00175747

F. Song, S. Moore, and J. Dongarra, Feedback-directed thread scheduling with memory considerations, Proceedings of the 16th international symposium on High performance distributed computing, HPDC '07, pp.97-106, 2007.
DOI : 10.1145/1272366.1272380

S. Kim, D. Chandra, and Y. Solihin, Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture, Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT '04)

F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.180-186, 2010.
DOI : 10.1109/PDP.2010.67

URL : https://hal.archives-ouvertes.fr/inria-00429889

J. Hursey and J. M. Squyres, Advancing application process affinity experimentation, Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, pp.163-168, 2013.
DOI : 10.1145/2488551.2488603

T. Ma, G. Bosilca, A. Bouteiller, and J. J. Dongarra, Locality and Topology Aware Intra-node Communication among Multicore CPUs, Proceedings of the 17th European MPI Users Group Conference, pp.265-274, 2010.
DOI : 10.1007/978-3-642-15646-5_28

S. Moreaud, B. Goglin, and R. Namyst, Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access, in Recent Advances in the Message Passing Interface, The 17th European MPI Users' Group Meeting, ser. Lecture Notes in Computer Science, 2010.

A. E. Eichenberger, C. Terboven, M. Wong, and D. Mey, The Design of OpenMP Thread Affinity, OpenMP in a Heterogeneous World - 8th International Workshop on OpenMP, 2012.
DOI : 10.1007/978-3-642-30961-8_2

J. Treibig, G. Hager, and G. Wellein, LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments, 2010 39th International Conference on Parallel Processing Workshops, pp.207-216, 2010.
DOI : 10.1109/ICPPW.2010.38

URL : http://arxiv.org/abs/1004.4431

F. Song, S. Moore, and J. Dongarra, Analytical modeling and optimization for affinity based thread scheduling on multicore systems, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289173

B. Putigny, B. Goglin, and D. Barthou, A benchmark-based performance model for memory-bound HPC applications, 2014 International Conference on High Performance Computing & Simulation (HPCS), pp.943-950, 2014.
DOI : 10.1109/HPCSim.2014.6903790

URL : https://hal.archives-ouvertes.fr/hal-00985598

D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser et al., LogP: Towards a Realistic Model of Parallel Computation, Principles and Practice of Parallel Programming, pp.1-12, 1993.

H. Chen, W. Chen, J. Huang, B. Robert, and H. Kuhn, MPIPP, Proceedings of the 20th annual international conference on Supercomputing, ICS '06, pp.353-360, 2006.
DOI : 10.1145/1183401.1183451

H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter, Versatile, scalable, and accurate simulation of distributed applications and platforms, Journal of Parallel and Distributed Computing, vol.74, issue.10, pp.2899-2917, 2014.
DOI : 10.1016/j.jpdc.2014.06.008

URL : https://hal.archives-ouvertes.fr/hal-01017319

E. Rodrigues, F. Madruga, P. Navaux, and J. Panetta, Multi-core aware process mapping and its impact on communication overhead of parallel applications, 2009 IEEE Symposium on Computers and Communications, pp.811-817, 2009.
DOI : 10.1109/ISCC.2009.5202271

J. González-Domínguez, G. L. Taboada, B. B. Fraguela, M. J. Martín, and J. Touriño, Automatic mapping of parallel applications on multicore architectures using the Servet benchmark suite, Computers & Electrical Engineering, vol.38, issue.2, pp.258-269, 2012.
DOI : 10.1016/j.compeleceng.2011.12.007

F. Broquedis, N. Furmento, B. Goglin, P. Wacrenier, and R. Namyst, ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures, International Journal of Parallel Programming, vol.38, issue.5-6, pp.418-439, 2010.
DOI : 10.1007/s10766-010-0136-3

URL : https://hal.archives-ouvertes.fr/inria-00496295

E. Jeannot, G. Mercier, and F. Tessier, Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques, IEEE Transactions on Parallel and Distributed Systems, vol.25, issue.4, pp.993-1002, 2014.
DOI : 10.1109/TPDS.2013.104

URL : https://hal.archives-ouvertes.fr/hal-00803548

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier et al., DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, Extensions for Next-Generation Parallel Programming Models, pp.37-51, 2012.
DOI : 10.1016/j.parco.2011.10.003

B. Goglin, Managing the topology of heterogeneous cluster nodes with hardware locality (hwloc), 2014 International Conference on High Performance Computing & Simulation (HPCS), pp.74-81, 2014.
DOI : 10.1109/HPCSim.2014.6903671

URL : https://hal.archives-ouvertes.fr/hal-00985096

B. Goglin, Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications, Proceedings of the Second International Symposium on Memory Systems, MEMSYS '16, pp.30-39, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01330194