Skip to Main content Skip to Navigation

Revisiting Wide Superscalar Microarchitecture

Andrea Mondelli 1
1 PACAP - Pushing Architecture and Compilation for Application Performance
Inria Rennes – Bretagne Atlantique , IRISA-D3 - ARCHITECTURE
Abstract : For several decades, the clock frequency of general purpose processors was growing thanks to faster transistors and microarchitectures with deeper pipelines. However, about 10 years ago, technology hit leakage power and temperature walls. Since then, the clock frequency of high-end processors did not increase. Instead of increasing the clock frequency, processor makers integrated more cores on a single chip, enlarged the cache hierarchy and improved energy efficiency. Putting more cores on a single chip has increased the total chip throughput and benefits some applications with thread-level parallelism. However, most applications have low thread-level parallelism. So having more cores is not sufficient. It is important also to accelerate individual threads. Moreover, reducing the energy consumption has become a major objective when designing a high-performance microarchitecture. Some microarchitecture features have been introduced in superscalar cores mainly for reducing energy. An example of such feature is the loop buffer, which is now implemented in several superscalar microarchitectures. The purpose of a loop buffer is to save energy in the core's front-end (instruction cache, branch predictor, decoder, etc.) when executing a loop with a body small enough to fit in the loop buffer. If the clock frequency remains constant, the only possibility left for higher single-thread performance in future processors is to exploit more ILP. Certain microarchitecture improvements (e.g., better branch predictor) simultaneously improve performance and energy efficiency. However, in general, exploiting more ILP has a cost in silicon area, energy consumption, design effort, etc. Therefore, the microarchitecture is modified slowly, incrementally, taking advantage of technology scaling. And indeed, processor makers have made continuous efforts to exploit more, with better branch predictors, better data prefetchers, larger instruction windows, more physical registers, and so forth. In this thesis, we try to depict what future superscalar cores may look like in 10 years and explore the possibility of exploiting loop behaviors to reduce energy consumption beyond the front-end. Some propositions have been published for loop accelerators or for unconventional superscalar core back-ends. I argue that the instruction window and the issue width can be augmented by combining clustering and register write specialization A major difference with past research on clustered microarchitecture is that I assume wide issue clusters, whereas past research mostly focused on narrow issue clusters. Going from narrow issue to wide issue clusters is not just a quantitative change, it has a qualitative impact on the clustering problem, in particular on the steering policy. We propose, in the second part of this thesis, two independent and orthogonal energy optimizations exploiting loops. The first optimization detects redundant micro-ops producing the same result on every iteration and removes these micro-ops completely. The second optimization focuses on the energy consumed by load micro-ops, detecting situations where a load does not need to access the store queue or does not need to access the level-1 data cache.
Document type :
Complete list of metadata

Cited literature [136 references]  Display  Hide  Download
Contributor : Abes Star :  Contact
Submitted on : Thursday, January 18, 2018 - 2:30:08 PM
Last modification on : Wednesday, November 3, 2021 - 8:15:50 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01597752, version 2


Andrea Mondelli. Revisiting Wide Superscalar Microarchitecture. Hardware Architecture [cs.AR]. Université Rennes 1, 2017. English. ⟨NNT : 2017REN1S054⟩. ⟨tel-01597752v2⟩



Les métriques sont temporairement indisponibles