F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.180-186, 2010.
DOI : 10.1109/PDP.2010.67
URL : https://hal.archives-ouvertes.fr/inria-00429889

J. Xue and X. Vera, Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior, IEEE Transactions on Computers, vol.53, issue.5, pp.547-566, 2004.

C. Cascaval and D. A. Padua, Estimating cache misses and locality using stack distances, Intl. conf. on Supercomputing, pp.150-159, 2003.

D. Andrade, B. B. Fraguela, and R. Doallo, Accurate prediction of the behavior of multithreaded applications in shared caches, Parallel Computing, vol.39, issue.1, pp.36-57, 2013.
DOI : 10.1016/j.parco.2012.11.003

J. Lee, H. Wu, M. Ravichandran, and N. Clark, Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications, Intl. Symp. on Computer architecture, pp.270-279, 2010.

M. Papamarcos and J. Patel, A low-overhead coherence solution for multiprocessors with private cache memories, ACM SIGARCH Computer Architecture News, vol.12, issue.3, pp.348-354, 1984.
DOI : 10.1145/773453.808204

E. Zhang, Y. Jiang, and X. Shen, Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?, ACM SIGPLAN Notices, vol.45, issue.5, pp.203-212, 2010.
DOI : 10.1145/1837853.1693482

B. Putigny, Mbench: memory benchmarking framework for multicores

J. McCalpin, Memory Bandwidth and Machine Balance in Current High Performance Computers, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pp.19-25, 1995.

A. Pesterev, N. Zeldovich, and R. T. Morris, Locating cache performance bottlenecks using data profiling, Proceedings of the 5th European conference on Computer systems, EuroSys '10, pp.335-348, 2010.
DOI : 10.1145/1755913.1755947

Y. Saad, Iterative methods for sparse linear systems, SIAM, 2007.
DOI : 10.1137/1.9780898718003

D. Barthou, O. Brand-Foissac, R. Dolbeau, G. Grosdidier, C. Eisenbeis et al., Automated Code Generation for Lattice Quantum Chromodynamics and beyond, Journal of Physics: Conference Series, vol.510, issue.1, p.012005, 2014.
DOI : 10.1088/1742-6596/510/1/012005
URL : https://hal.archives-ouvertes.fr/hal-00926513

C. Hughes, V. Pai, P. Ranganathan, and S. Adve, Rsim: simulating shared-memory multiprocessors with ILP processors, Computer, vol.35, issue.2, pp.40-49, 2002.
DOI : 10.1109/2.982915
URL : https://scholarship.rice.edu/bitstream/1911/19959/1/Hug2002Feb1RsimSimula.PDF

R. Covington, S. Dwarkadas, J. R. Jump, J. B. Sinclair, and S. Madala, The Efficient Simulation of Parallel Computer Systems, Intl. J. in Comp. Simulation, pp.31-58, 1991.

N. Nethercote and J. Seward, Valgrind: A Program Supervision Framework, 2003.

P. Mucci, S. Browne, C. Deane, and G. Ho, PAPI: A Portable Interface to Hardware Performance Counters, Proc. of the Department of Defense HPCMP Users Group Conference, pp.7-10, 1999.

S. Jarp, R. Jurga, and A. Nowak, Perfmon2: a leap forward in performance monitoring, Journal of Physics: Conference Series, vol.119, issue.4, p.042017, 2008.
DOI : 10.1088/1742-6596/119/4/042017

S. Williams, A. Waterman, and D. Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, vol.52, issue.4, pp.65-76, 2009.
DOI : 10.1145/1498765.1498785

A. Ilic, F. Pratas, and L. Sousa, Cache-aware Roofline model: Upgrading the loft, IEEE Computer Architecture Letters, vol.13, issue.1, 2013.
DOI : 10.1109/L-CA.2013.6

J. Treibig, G. Hager, and G. Wellein, Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering, Intl. conf. on Parallel processing, ser. Euro-Par'12, pp.451-460, 2013.

J. McCalpin, STREAM: Sustainable Memory Bandwidth in High Performance Computers, 1991.