hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.180-186, 2010. ,
DOI : 10.1109/PDP.2010.67
URL : https://hal.archives-ouvertes.fr/inria-00429889
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior, IEEE Transactions on, vol.53, issue.5, pp.547-566, 2004. ,
Estimating cache misses and locality using stack distances, Intl. conf. on Supercomputing, pp.150-159, 2003. ,
Accurate prediction of the behavior of multithreaded applications in shared caches, Parallel Computing, vol.39, issue.1, pp.36-57, 2013. ,
DOI : 10.1016/j.parco.2012.11.003
Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications, Intl. Symp. on Computer architecture, pp.270-279, 2010. ,
A low-overhead coherence solution for multiprocessors with private cache memories, ACM SIGARCH Computer Architecture News, vol.12, issue.3, pp.348-354, 1984. ,
DOI : 10.1145/773453.808204
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?, ACM SIGPLAN Notices, vol.45, issue.5, pp.203-212, 2010. ,
DOI : 10.1145/1837853.1693482
Mbench: memory benchmarking framework for multicores ,
Memory Bandwidth and Machine Balance in Current High Performance Computers, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pp.19-25, 1995. ,
Locating cache performance bottlenecks using data profiling, Proceedings of the 5th European conference on Computer systems, EuroSys '10, pp.335-348, 2010. ,
DOI : 10.1145/1755913.1755947
Iterative methods for sparse linear systems, SIAM, 2007. ,
DOI : 10.1137/1.9780898718003
Automated Code Generation for Lattice Quantum Chromodynamics and beyond, Journal of Physics: Conference Series, vol.510, 1401. ,
DOI : 10.1088/1742-6596/510/1/012005
URL : https://hal.archives-ouvertes.fr/hal-00926513
Rsim: simulating shared-memory multiprocessors with ILP processors, Computer, vol.35, issue.2, pp.40-49, 2002. ,
DOI : 10.1109/2.982915
URL : https://scholarship.rice.edu/bitstream/1911/19959/1/Hug2002Feb1RsimSimula.PDF
The Efficient Simulation of Parallel Computer Systems, " in Intl, J. in Comp. Simulation, pp.31-58, 1991. ,
Valgrind: A Program Supervision Framework, 2003. ,
PAPI: A Portable Interface to Hardware Performance Counters, Proc. of the Department of Defense HPCMP Users Group Conference, pp.7-10, 1999. ,
Perfmon2: a leap forward in performance monitoring, Journal of Physics: Conference Series, vol.119, issue.4, p.42017, 2008. ,
DOI : 10.1088/1742-6596/119/4/042017
Roofline, Communications of the ACM, vol.52, issue.4, pp.65-76, 2009. ,
DOI : 10.1145/1498765.1498785
Cache-aware Roofline model: Upgrading the loft, IEEE Computer Architecture Letters, vol.13, issue.1, 2013. ,
DOI : 10.1109/L-CA.2013.6
Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering, " in Intl. conf. on Parallel processing, ser. Euro-Par'12, pp.451-460, 2013. ,
STREAM: Sustainable Memory Bandwidth in High Performance Computers, 1991. ,