, Illustration of runtimes of parallel applications
, 14 3 Thread mapping schemes, m = 32 on two NUMA nodes, p.20
,
40 12 Scheme of our model for shared bandwidth resources ,
Impact of resource sharing on performance and performance prediction: A survey, Proceedings of the 24th international conference on Concurrency Theory, pp.25-43, 2013. ,
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks, Proceedings of the 36th International Conference on Machine Learning (ICML), pp.4505-4515, 2019. ,
Validity of the single processor approach to achieving large scale computing capabilities, Spring Joint Computer Conference, pp.483-485, 1967. ,
Amdahl's law in the multicore era, Computer, vol.41, issue.7, pp.33-38, 2008. ,
The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, vol.40, issue.1, pp.1-16, 2014. ,
A simple capacity model of massively parallel transaction systems, 19th International Computer Measurement Group Conference, pp.1035-1035, 1993. ,
, A general theory of computational scalability based on rational functions, 2008.
A Methodology for Optimizing Multithreaded System Scalability on Multicores, Programming Multicore and Many-core Computing Systems, pp.363-384, 2017. ,
Superlinear speedup for matrix multiplication, Proceedings of the 34th International Conference on Information Technology Interfaces, pp.499-504, 2012. ,
Superlinear speedup in HPC systems: Why and when?, Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, pp.889-898, 2016. ,
Efficient parallelization of large-scale hard real-time applications, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01810176
Evaluating the Intel Skylake Xeon processor for HPC workloads, 2018 International Conference on High Performance Computing & Simulation (HPCS), pp.342-349, 2018. ,
Memory hierarchies, pipelines, and buses for future architectures in time-critical embedded systems, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.28, issue.7, pp.966-978, 2009. ,
Timing anomalies in dynamically scheduled microprocessors, Proceedings of the 20th IEEE Real-Time Systems Symposium, pp.12-21, 1999. ,
A definition and classification of timing anomalies, 6th International Workshop on Worst-Case Execution Time Analysis (WCET'06). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2006. ,
Timing analysis for TDMA arbitration in resource sharing systems, 16th IEEE Real-Time and Embedded Technology and Applications Symposium, pp.215-224, 2010. ,
Worst case delay analysis for memory interference in multicore systems, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), pp.741-746, 2010. ,
Towards compositionality in execution time analysis: definition and challenges, ACM SIGBED Review, vol.12, issue.1, pp.28-36, 2015. ,
On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments, ACM Transactions on Architecture and Code Optimization (TACO), vol.8, issue.4, p.34, 2012. ,
Realistic Workload Scheduling Policies for Taming the Memory Bandwidth Bottleneck of SMPs, Proceedings of the 11th International Conference on High Performance Computing, HiPC'04, pp.286-296, 2004. ,
Using OS Observations to Improve Performance in Multicore Systems, IEEE Micro, vol.28, issue.3, pp.54-66, 2008. ,
Addressing Shared Resource Contention in Multicore Processors via Scheduling, SIGARCH Computer Architecture News, vol.38, issue.1, pp.129-142, 2010. ,
An approach to resource-aware co-scheduling for CMPs, Proceedings of the 24th ACM International Conference on Supercomputing, pp.189-199, 2010. ,
Scalability-Based Manycore Partitioning, Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pp.107-116, 2012. ,
A Case for NUMA-Aware Contention Management on Multicore Systems, Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, USENIXATC'11, 2011. ,
Symbiotic Jobscheduling for a Simultaneous Multithreaded Processor, Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pp.234-244, 2000. ,
L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pp.123-132, 2013. ,
Probabilistic Job Symbiosis Modeling for SMT Processor Scheduling, Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pp.91-102, 2010. ,
Producing wrong data without doing anything obviously wrong!, ACM SIGARCH Computer Architecture News, vol.37, issue.1, pp.265-276, 2009. ,
Study of variations of native program execution times on multi-core architectures, 2010 International Conference on Complex, Intelligent and Software Intensive Systems, pp.919-924, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00643731
Reevaluating Amdahl's law, Communications of the ACM, vol.31, issue.5, pp.532-533, 1988. ,
Skylake (server) -Microarchitectures -Intel, p.23, 2019. ,
Intel architecture, code name Skylake deep dive: A new architecture to manage power performance and energy efficiency, Presentation at Intel Developer Forum (IDF15), 2015. ,
Intel Xeon scalable processor architecture deep dive, Presentation at Intel Press Workshops, 2017. ,
Xeon Gold 6130 -Intel, p.27, 2019. ,
, Intel® Xeon® Processor Scalable Family Specification Update, Reference Number, pp.336065-336075, 2019.
Mechanism to Mitigate AVX-Induced Frequency Reduction, 2018. ,
The Speedup-Test: a statistical methodology for programme speedup analysis and computation. Concurrency and computation: practice and experience, vol.25, pp.1410-1426, 2013. ,
Analysing the variability of OpenMP programs performances on multicore architectures, Fourth workshop on programmability issues for heterogeneous multicores (MULTIPROG-2011), 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00637957
, OpenMP Architecture Review Board. OpenMP Application Programming Interface, 2015.
Performance evaluation and analysis of thread pinning strategies on multi-core platforms: Case study of spec omp applications on intel architectures, 2011 International Conference on High Performance Computing & Simulation, pp.273-279, 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00636845
autopin -automated optimization of thread-to-core pinning on multicore systems, Transactions on high-performance embedded architectures and compilers III, pp.219-235, 2011. ,
On the scalability of image and signal processing parallel applications on emerging cc-NUMA many-cores, Proceedings of the 2012 Conference on Design and Architectures for Signal and Image Processing, pp.1-8, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00742963
Tips to Measure the Performance of Matrix Multiplication Using Intel MKL, p.21, 2017. ,
Collecting performance data with PAPI-C, Tools for High Performance Computing, pp.157-173, 2009. ,
, Intel® 64 and IA32 Architectures Performance Monitoring Events, pp.335279-335280, 2017.
Intel Ivy Bridge Cache Replacement Policy, p.26, 2013. ,
High performance cache replacement using re-reference interval prediction (RRIP), ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.60-71, 2010. ,
Computer Organization and Design MIPS Edition: The Hardware/Software Interface, vol.13, pp.978-0124077263, 2013. ,
Basic Queueing Theory. GlobeEdit, vol.13, pp.978-3639734713, 2016. ,
Evaluation of the Intel® Core? i7 Turbo Boost feature, 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.188-197, 2009. ,
Mitigating Amdahl's law through EPI throttling, ACM SIGARCH Computer Architecture News, vol.33, issue.2, pp.298-309, 2005. ,