Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, 1994.

D. Buntinas, B. Goglin, D. Goodell, G. Mercier, and S. Moreaud, Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis, 2009 International Conference on Parallel Processing, pp.462-469, 2009.
DOI : 10.1109/ICPP.2009.22

URL : https://hal.archives-ouvertes.fr/inria-00390064

S. Moreaud, B. Goglin, D. Goodell, and R. Namyst, Optimizing MPI communication within large multicore nodes with kernel assistance, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010.
DOI : 10.1109/IPDPSW.2010.5470849

URL : https://hal.archives-ouvertes.fr/inria-00451471

T. Ma, G. Bosilca, A. Bouteiller, and J. J. Dongarra, Locality and Topology Aware Intra-node Communication among Multicore CPUs, Proceedings of the 17th European MPI Users Group Conference, 2010.
DOI : 10.1007/978-3-642-15646-5_28

T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. M. Squyres et al., Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs, 2011 International Conference on Parallel Processing, 2011.
DOI : 10.1109/ICPP.2011.29

URL : https://hal.archives-ouvertes.fr/inria-00602877

OpenMP Architecture Review Board, The OpenMP API Specification for Parallel Programming.

P. Balaji, D. Buntinas, D. Goodell, W. Gropp, and R. Thakur, Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming, International Journal of High Performance Computing Applications, vol.24, issue.1, pp.49-57, 2010.
DOI : 10.1177/1094342009360206

E. Jeannot and G. Mercier, Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures, Proceedings of the 16th International Euro-Par Conference, 2010.
DOI : 10.1007/978-3-642-15291-7_20

URL : https://hal.archives-ouvertes.fr/inria-00544346

E. Jeannot and G. Mercier, Improving MPI Applications Performance on Multicore Clusters with Rank Reordering, Recent Advances in the Message Passing Interface: Proceedings of the 18th European MPI Users' Group Meeting, Lecture Notes in Computer Science, 2011.

N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz et al., Myrinet: a gigabit-per-second local area network, IEEE Micro, vol.15, issue.1, pp.29-36, 1995.
DOI : 10.1109/40.342015

M. Koop, W. Huang, K. Gopalakrishnan, and D. K. Panda, Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand, 2008 16th IEEE Symposium on High Performance Interconnects, 2008.
DOI : 10.1109/HOTI.2008.26

P. Geoffray, L. Prylli, and B. Tourancheau, BIP-SMP: High Performance Message Passing over a Cluster of Commodity SMPs, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, 1999.

B. Goglin, High Throughput Intra-Node MPI Communication with Open-MX, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp.173-180, 2009.
DOI : 10.1109/PDP.2009.20

URL : https://hal.archives-ouvertes.fr/inria-00331209

Myricom, Inc., Myrinet Express (MX): A High Performance, Low-Level, Message-Passing Interface for Myrinet, 2006.

D. Buntinas, G. Mercier, and W. Gropp, Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem, Parallel Computing, vol.33, issue.9, pp.634-644, 2007.
DOI : 10.1016/j.parco.2007.06.003

URL : https://hal.archives-ouvertes.fr/hal-00344327

D. Buntinas, G. Mercier, and W. Gropp, Data Transfers between Processes in an SMP System: Performance Study and Application to MPI, 2006 International Conference on Parallel Processing, pp.487-496, 2006.

P. Lai, S. Sur, and D. K. Panda, Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems, Proceedings of the International Supercomputing Conference (ISC'10), 2010.
DOI : 10.1007/s00450-010-0115-3

R. Thakur, Improving the Performance of Collective Operations in MPICH, Proceedings of the 10th European PVM/MPI Users Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2003), pp.257-267, 2003.
DOI : 10.1007/978-3-540-39924-7_38

J. M. Squyres and A. Lumsdaine, The Component Architecture of Open MPI: Enabling Third-Party Collective Algorithms, Proceedings of the 18th ACM International Conference on Supercomputing, Workshop on Component Models and Systems for Grid Applications, pp.167-185, 2004.
DOI : 10.1007/0-387-23352-0_11

A. Grover and C. Leech, Accelerating Network Receive Processing (Intel I/O Acceleration Technology), Proceedings of the Linux Symposium (OLS 2005), pp.281-288, 2005.

R. Huggahalli, R. Iyer, and S. Tetrick, Direct Cache Access for High Bandwidth Network I/O, ACM SIGARCH Computer Architecture News, vol.33, issue.2, pp.50-59, 2005.
DOI : 10.1145/1080695.1069976

K. Vaidyanathan, L. Chai, W. Huang, and D. K. Panda, Efficient asynchronous memory copy operations on multi-core systems and I/OAT, 2007 IEEE International Conference on Cluster Computing, pp.159-168, 2007.
DOI : 10.1109/CLUSTR.2007.4629228

F. Broquedis, J. Clet-ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.180-186, 2010.
DOI : 10.1109/PDP.2010.67

URL : https://hal.archives-ouvertes.fr/inria-00429889

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter et al., The NAS Parallel Benchmarks, International Journal of High Performance Computing Applications, vol.5, issue.3, pp.63-73, 1991.
DOI : 10.1177/109434209100500306

A. Plaat, H. E. Bal, R. F. Hofman, and T. Kielmann, Sensitivity of parallel applications to large differences in bandwidth and latency in two-layer interconnects, Future Generation Computer Systems, vol.17, issue.6, pp.769-782, 2001.
DOI : 10.1016/S0167-739X(00)00103-5

T. Ma, A. Bouteiller, G. Bosilca, and J. J. Dongarra, Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW, Proceedings of the 18th European MPI Users Group Conference, 2011.
DOI : 10.1007/978-3-642-24449-0_28

R. Brightwell, T. Hudson, and K. Pedretti, SMARTMAP: Operating system support for efficient data sharing among processes on a multi-core processor, SC '08: International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
DOI : 10.1109/SC.2008.5218881

H. Jin, S. Sur, L. Chai, and D. K. Panda, Lightweight kernel-level primitives for high-performance MPI intra-node communication over multi-core systems, 2007 IEEE International Conference on Cluster Computing, 2007.
DOI : 10.1109/CLUSTR.2007.4629263

L. Chai, P. Lai, H. Jin, and D. K. Panda, Designing an Efficient Kernel-Level and User-Level Hybrid Approach for MPI Intra-Node Communication on Multi-Core Systems, 2008 37th International Conference on Parallel Processing, 2008.
DOI : 10.1109/ICPP.2008.16

H. Jin, S. Sur, L. Chai, and D. K. Panda, LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster, Proceedings of the IEEE International Conference on Parallel Processing (ICPP-2005), 2005.

C. Yeoh, Cross Memory Attach, Linux kernel patch, 2010.

T. Ma, T. Herault, G. Bosilca, and J. J. Dongarra, Process Distance-Aware Adaptive MPI Collective Communications, 2011 IEEE International Conference on Cluster Computing, 2011.
DOI : 10.1109/CLUSTER.2011.30

T. Ma, G. Bosilca, A. Bouteiller, and J. J. Dongarra, HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.91