A. Abel, F. Benz, J. Doerfert, B. Dörr, S. Hahn et al., Impact of resource sharing on performance and performance prediction: A survey, Proceedings of the 24th international conference on Concurrency Theory, pp.25-43, 2013.

C. Mendis, A. Renda, S. Amarasinghe, and M. Carbin, Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks, Proceedings of the 36th International Conference on Machine Learning (ICML), pp.4505-4515, 2019.

G. M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Spring Joint Computer Conference, pp.483-485, 1967.

M. D. Hill and M. R. Marty, Amdahl's law in the multicore era, Computer, vol.41, issue.7, pp.33-38, 2008.

L. Yavits, A. Morad, and R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, vol.40, issue.1, pp.1-16, 2014.

N. J. Gunther, A simple capacity model of massively parallel transaction systems, 19th International Computer Measurement Group Conference, pp.1035-1035, 1993.

N. J. Gunther, A general theory of computational scalability based on rational functions, 2008.

N. J. Gunther, S. Subramanyam, and S. Parvu, A Methodology for Optimizing Multithreaded System Scalability on Multicores, Programming Multicore and Many-core Computing Systems, pp.363-384, 2017.

S. Ristov and M. Gusev, Superlinear speedup for matrix multiplication, Proceedings of the 34th International Conference on Information Technology Interfaces, pp.499-504, 2012.

S. Ristov, R. Prodan, M. Gusev, and K. Skala, Superlinear speedup in HPC systems: Why and when?, Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, pp.889-898, 2016.

K. Didier, D. Potop-Butucaru, G. Iooss, A. Cohen, J. Souyris et al., Efficient parallelization of large-scale hard real-time applications, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01810176

S. Hammond, C. Vaughan, and C. Hughes, Evaluating the Intel Skylake Xeon processor for HPC workloads, 2018 International Conference on High Performance Computing & Simulation (HPCS), pp.342-349, 2018.

R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister et al., Memory hierarchies, pipelines, and buses for future architectures in time-critical embedded systems, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.28, issue.7, pp.966-978, 2009.

T. Lundqvist and P. Stenstrom, Timing anomalies in dynamically scheduled microprocessors, Proceedings of the 20th IEEE Real-Time Systems Symposium, pp.12-21, 1999.

J. Reineke, B. Wachter, S. Thesing, R. Wilhelm, I. Polian et al., A definition and classification of timing anomalies, 6th International Workshop on Worst-Case Execution Time Analysis (WCET'06). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2006.

A. Schranzhofer, J. Chen, and L. Thiele, Timing analysis for TDMA arbitration in resource sharing systems, 16th IEEE Real-Time and Embedded Technology and Applications Symposium, pp.215-224, 2010.

R. Pellizzoni, A. Schranzhofer, J. Chen, M. Caccamo, and L. Thiele, Worst case delay analysis for memory interference in multicore systems, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), pp.741-746, 2010.

S. Hahn, J. Reineke, and R. Wilhelm, Towards compositionality in execution time analysis: definition and challenges, ACM SIGBED Review, vol.12, issue.1, pp.28-36, 2015.

P. Radojković, S. Girbal, A. Grasset, E. Quiñones, S. Yehia et al., On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments, ACM Transactions on Architecture and Code Optimization (TACO), vol.8, issue.4, p.34, 2012.

C. D. Antonopoulos, D. S. Nikolopoulos, and T. S. Papatheodorou, Realistic Workload Scheduling Policies for Taming the Memory Bandwidth Bottleneck of SMPs, Proceedings of the 11th International Conference on High Performance Computing, HiPC'04, pp.286-296, 2004.

R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn, Using OS Observations to Improve Performance in Multicore Systems, IEEE Micro, vol.28, issue.3, pp.54-66, 2008.

S. Zhuravlev, S. Blagodurov, and A. Fedorova, Addressing Shared Resource Contention in Multicore Processors via Scheduling, SIGARCH Computer Architecture News, vol.38, issue.1, pp.129-142, 2010.

M. Bhadauria and S. A. McKee, An approach to resource-aware co-scheduling for CMPs, Proceedings of the 24th ACM International Conference on Supercomputing, pp.189-199, 2010.

H. Sasaki, T. Tanimoto, K. Inoue, and H. Nakamura, Scalability-Based Manycore Partitioning, Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pp.107-116, 2012.

S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova, A Case for NUMA-Aware Contention Management on Multicore Systems, Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, USENIXATC'11, 2011.

A. Snavely and D. M. Tullsen, Symbiotic Jobscheduling for a Simultaneous Multithreaded Processor, Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pp.234-244, 2000.

J. Feliu, J. Sahuquillo, S. Petit, and J. Duato, L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pp.123-132, 2013.

S. Eyerman and L. Eeckhout, Probabilistic Job Symbiosis Modeling for SMT Processor Scheduling, Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pp.91-102, 2010.

T. Mytkowicz, A. Diwan, M. Hauswirth, and P. Sweeney, Producing wrong data without doing anything obviously wrong!, ACM SIGARCH Computer Architecture News, vol.37, issue.1, pp.265-276, 2009.

A. Mazouz and D. Barthou, Study of variations of native program execution times on multi-core architectures, 2010 International Conference on Complex, Intelligent and Software Intensive Systems, pp.919-924, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00643731

J. L. Gustafson, Reevaluating Amdahl's law, Communications of the ACM, vol.31, issue.5, pp.532-533, 1988.

WikiChip, Skylake (server) - Microarchitectures - Intel, p.23, 2019.

E. Rotem, Intel architecture, code name Skylake deep dive: A new architecture to manage power performance and energy efficiency, Presentation at Intel Developer Forum (IDF15), 2015.

A. Kumar and M. Trivedi, Intel Xeon scalable processor architecture deep dive, Presentation at Intel Press Workshops, 2017.

WikiChip, Xeon Gold 6130 - Intel, p.27, 2019.

Intel, Intel® Xeon® Processor Scalable Family Specification Update, Reference Number 336065, 2019.

M. Gottschlag and F. Bellosa, Mechanism to Mitigate AVX-Induced Frequency Reduction, 2018.

S. Touati, J. Worms, and S. Briais, The Speedup-Test: a statistical methodology for programme speedup analysis and computation, Concurrency and Computation: Practice and Experience, vol.25, pp.1410-1426, 2013.

A. Mazouz, S. Touati, and D. Barthou, Analysing the variability of OpenMP programs performances on multicore architectures, Fourth workshop on programmability issues for heterogeneous multicores (MULTIPROG-2011), 2011.
URL : https://hal.archives-ouvertes.fr/inria-00637957

OpenMP Architecture Review Board, OpenMP Application Programming Interface, 2015.

A. Mazouz and D. Barthou, Performance evaluation and analysis of thread pinning strategies on multi-core platforms: Case study of SPEC OMP applications on Intel architectures, 2011 International Conference on High Performance Computing & Simulation, pp.273-279, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00636845

T. Klug, M. Ott, J. Weidendorfer, and C. Trinitis, autopin - automated optimization of thread-to-core pinning on multicore systems, Transactions on High-Performance Embedded Architectures and Compilers III, pp.219-235, 2011.

G. Almaless and F. Wajsburt, On the scalability of image and signal processing parallel applications on emerging cc-NUMA many-cores, Proceedings of the 2012 Conference on Design and Architectures for Signal and Image Processing, pp.1-8, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00742963

Ying H., Murat Efe Guney, and Shane S., Tips to Measure the Performance of Matrix Multiplication Using Intel MKL, p.21, 2017.

D. Terpstra, H. Jagode, H. You, and J. Dongarra, Collecting performance data with PAPI-C, Tools for High Performance Computing, pp.157-173, 2009.

Intel, Intel® 64 and IA-32 Architectures Performance Monitoring Events, Reference Number 335279, 2017.

H. Wong, Intel Ivy Bridge Cache Replacement Policy, p.26, 2013.

A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. Emer, High performance cache replacement using re-reference interval prediction (RRIP), ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.60-71, 2010.

D. A. Patterson and J. L. Hennessy, Computer Organization and Design MIPS Edition: The Hardware/Software Interface, ISBN-13: 978-0124077263, 2013.

J. Sztrik, Basic Queueing Theory, GlobeEdit, ISBN-13: 978-3639734713, 2016.

J. Charles, P. Jassi, N. S. Ananth, A. Sadat, and A. Fedorova, Evaluation of the Intel® Core™ i7 Turbo Boost feature, 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.188-197, 2009.

M. Annavaram, E. Grochowski, and J. Shen, Mitigating Amdahl's law through EPI throttling, ACM SIGARCH Computer Architecture News, vol.33, issue.2, pp.298-309, 2005.