Hierarchical interconnects for on-chip clustering, Proceedings 16th International Parallel and Distributed Processing Symposium, p.173, 2002. ,
DOI : 10.1109/IPDPS.2002.1015559
Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of 27th International Symposium on Computer Architecture, pp.248-259, 2000. ,
Software Optimization Guide for AMD Family 16h Processors, 2012. ,
Automatic program transformations for virtual memory computers, Proc. Nat. Computer Conf, pp.969-975, 1979. ,
The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling, IBM Journal of Research and Development, vol.11, issue.1, pp.8-24, 1967. ,
DOI : 10.1147/rd.111.0008
Exploiting the replication cache to improve performance for multiple-issue microprocessors, ACM SIGARCH Computer Architecture News, vol.33, issue.3, pp.63-71, 2005. ,
DOI : 10.1145/1101868.1101880
Evolution of thread-level parallelism in desktop applications, ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.302-313, 2010. ,
DOI : 10.1145/1816038.1816000
Energy-efficient instruction set architecture for CMOS microprocessors, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, pp.298-305, 1995. ,
DOI : 10.1109/HICSS.1995.375384
Instruction buffering to reduce power in processors for signal processing, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp.417-424, 1997. ,
DOI : 10.1109/92.645068
Using dynamic cache management techniques to reduce energy in a high-performance processor, Proceedings of the 1999 international symposium on Low power electronics and design, ISLPED '99, pp.64-69, 1999. ,
DOI : 10.1145/313817.313856
Improving dynamic cluster assignment for clustered trace cache processors, Proceedings of the 30th Annual International Symposium on Computer Architecture, ISCA '03, pp.264-274, 2003. ,
William Yoder, and the TRIPS Team. Scaling to the end of silicon with EDGE architectures, Computer, vol.37, issue.7, pp.44-55, 2004. ,
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors, Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO-33, pp.337-347, 2000. ,
DOI : 10.1145/360128.360165
Design challenges of technology scaling, IEEE Micro, vol.19, issue.4, pp.23-29, 1999. ,
DOI : 10.1109/40.782564
Decomposing the load-store queue by function for power reduction and scalability, IBM Journal of Research and Development, vol.50, issue.2.3, pp.287-297, 2006. ,
DOI : 10.1147/rd.502.0287
L1 Data Cache Power Reduction Using a Forwarding Predictor, Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation, pp.116-125, 2010. ,
URL : http://oa.upm.es/9392/1/INVE_MEM_2010_87639.pdf
A software-hardware hybrid steering mechanism for clustered microarchitectures, IEEE International Symposium on Parallel and Distributed Processing, pp.1-12, 2008. ,
Memory Dependence Prediction Using Store Sets, Proceedings of the 25th Annual International Symposium on Computer Architecture, ISCA '98, pp.142-153, 1998. ,
Reducing the complexity of the issue logic, Proceedings of the 15th international conference on Supercomputing, ICS '01, pp.312-320, 2001. ,
DOI : 10.1145/377792.377854
VEAL, 35th International Symposium on Computer Architecture (ISCA), pp.389-400, 2008. ,
DOI : 10.1145/1394608.1382155
Memory ordering: A value-based approach, Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA '04, p.90, 2004. ,
Improving the energy efficiency of big cores, Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA '14, pp.493-504, 2014. ,
Multiported bypass cache in a bypass network, 1999. ,
The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Software Engineering Best Practices), 2005. ,
DOI : 10.1109/9780471749127
A cost-effective clustered architecture, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425), p.160, 1999. ,
DOI : 10.1109/PACT.1999.807517
URL : http://upcommons.upc.edu/bitstream/2117/100821/1/00807517.pdf
Dynamic cluster assignment mechanisms, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), pp.133-142, 2000. ,
DOI : 10.1109/HPCA.2000.824345
URL : http://upcommons.upc.edu/bitstream/2117/100590/1/00824345.pdf
Dynamic code partitioning for clustered architectures, International Journal of Parallel Programming, vol.29, issue.1, pp.59-79, 2001. ,
DOI : 10.1023/A:1026483904675
A comparative survey of load speculation architectures, 2000. ,
Runtime predictability of loops, Proceedings of the IEEE International Workshop on Workload Characterization, WWC '01, pp.91-98, 2001. ,
Investigating cache energy and latency break-even points in high performance processors, ACM SIGARCH Computer Architecture News, vol.35, issue.4, pp.13-20, 2007. ,
DOI : 10.1145/1327312.1327316
Instruction Assignment for Clustered VLIW DSP Compilers: A New Approach, 1998. ,
Design of ion-implanted MOSFET's with very small physical dimensions, IEEE Journal of Solid-State Circuits, vol.9, issue.5, pp.256-268, 1974. ,
Dark silicon and the end of multicore scaling, Proceedings of the 38th Annual International Symposium on Computer Architecture, pp.365-376, 2011. ,
Inexpensive performance using the Am29000, Microprocessors and Microsystems, vol.14, issue.6, pp.397-406, 1990. ,
DOI : 10.1016/0141-9331(90)90112-9
The multicluster architecture: reducing cycle time through partitioning, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.149-159, 1997. ,
DOI : 10.1109/MICRO.1997.645806
URL : http://www.cs.utexas.edu/users/dburger/teaching/spring99/cs395t/papers/18_multicluster.ps
Issue logic for a 600-MHz out-of-order execution microprocessor, IEEE Journal of Solid-State Circuits, vol.33, issue.5, pp.707-712, 1998. ,
DOI : 10.1109/4.668985
The TigerSHARC DSP architecture, IEEE Micro, vol.20, issue.1, pp.66-76, 2000. ,
DOI : 10.1109/40.820055
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp.173-181, 1998. ,
DOI : 10.1109/MICRO.1998.742779
Focusing processor policies via critical-path prediction, Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA '01, pp.74-85, 2001. ,
DOI : 10.1145/384285.379253
URL : http://www.cs.berkeley.edu/~bodik/research/isca01a.ps
ARB: a hardware mechanism for dynamic reordering of memory references, IEEE Transactions on Computers, vol.45, issue.5, pp.552-571, 1996. ,
DOI : 10.1109/12.509907
40-Entry unified out-of-order scheduler and integer execution unit for the AMD Bulldozer x86-64 core, 2011 IEEE International Solid-State Circuits Conference, pp.80-82, 2011. ,
Energy efficient cache organizations for superscalar processors, Power-Driven Microarchitecture Workshop In Conjunction With ISCA98 in Barcelona, 1998. ,
Cache organizations for clustered microarchitectures, Proceedings of the 3rd workshop on Memory performance issues in conjunction with the 31st international symposium on computer architecture, WMPI '04, pp.46-55, 2004. ,
DOI : 10.1145/1054943.1054950
Processor Microarchitecture: An Implementation Perspective ,
A high-speed dynamic instruction scheduling scheme for superscalar processors, Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pp.225-236, 2001. ,
DOI : 10.1109/micro.2001.991121
Exploiting Fixed Programs in Embedded Systems: A Loop Cache Example, IEEE Computer Architecture Letters, vol.1, issue.1, pp.2-2, 2002. ,
DOI : 10.1109/L-CA.2002.4
URL : http://www.cs.virginia.edu/~tcca/2002/gordonross_jan02.ps
LPA: A First Approach to the Loop Processor Architecture, High Performance Embedded Architectures and Compilers, pp.273-287, 2008. ,
DOI : 10.1007/978-3-540-77560-7_19
Intel's P6 uses decoupled superscalar design, Microprocessor Report, vol.9, issue.2, pp.9-15, 1995. ,
Revolver: Processor architecture for power efficient loop execution, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pp.591-602, 2014. ,
DOI : 10.1109/HPCA.2014.6835968
The microarchitecture of the Pentium 4 processor, Intel Technology Journal, 2001. ,
Scheduling reusable instructions for power reduction, Proceedings Design, Automation and Test in Europe Conference and Exhibition, pp.148-153, 2004. ,
DOI : 10.1109/DATE.2004.1268841
URL : http://www.cse.psu.edu/~mdl/paper/date04_607_hu.pdf
Locality vs. Criticality, Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA '01, pp.132-143, 2001. ,
DOI : 10.1145/379240.379258
Instruction pre-processing in trace processors, Proceedings Fifth International Symposium on High-Performance Computer Architecture, pp.125-129, 1999. ,
DOI : 10.1109/HPCA.1999.744347
URL : http://www.ece.wisc.edu/~jes/papers/hpca99.jacobson.pdf
The Alpha 21264 microprocessor, IEEE Micro, vol.19, issue.2, pp.24-36, 1999. ,
DOI : 10.1109/40.755465
The filter cache: an energy efficient memory structure, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.184-193, 1997. ,
DOI : 10.1109/MICRO.1997.645809
URL : http://www.ece.northwestern.edu/~rjoseph/ece510-fall2005/papers/kin97filter.pdf
Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors, International Symposium on Code Generation and Optimization, pp.27-38, 2004. ,
Dynamic Characteristics of Loops, IEEE Transactions on Computers, vol.33, issue.2, pp.125-132, 1984. ,
DOI : 10.1109/TC.1984.1676404
A first-order superscalar processor model, Proceedings. 31st Annual International Symposium on Computer Architecture, pp.338-349, 2004. ,
DOI : 10.1145/1028176.1006729
Inter-core prefetching for multicore processors using migrating helper threads, Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pp.393-404, 2011. ,
DOI : 10.1145/1950365.1950411
URL : http://cseweb.ucsd.edu/users/swanson/papers/ASPLOS2011Prefetching.pdf
Exploring the Design of the Cortex-A15 Processor. https://www.arm.com/files, AT-Exploring_the_Design_of_the_Cortex-A15.pdf, 2011. ,
Pin, ACM SIGPLAN Notices, vol.40, issue.6, pp.190-200, 2005. ,
DOI : 10.1145/1064978.1065034
Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor, Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pp.93-104, 2005. ,
The multiflow trace scheduling compiler, The Journal of Supercomputing, vol.34, issue.1, pp.51-142, 1993. ,
DOI : 10.1109/2.19820
URL : http://www.eecg.toronto.edu/~tsa/crgpapers/lowney92multiflow.pdf
Instruction fetch energy reduction using loop caches for embedded applications with small tight loops, Proceedings of the 1999 international symposium on Low power electronics and design, ISLPED '99, pp.267-269, 1999. ,
DOI : 10.1145/313817.313944
Value locality and load value prediction, Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VII, pp.138-147, 1996. ,
Will physical scalability sabotage performance gains?, Computer, vol.30, issue.9, pp.37-39, 1997. ,
Dynamic speculation and synchronization of data dependences, Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pp.181-193, 1997. ,
DOI : 10.1145/384286.264189
URL : https://minds.wisconsin.edu/bitstream/handle/1793/9468/file_1.pdf?sequence=1
Tartan: Evaluating spatial computation for whole program execution, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp.163-174, 2006. ,
Compiler Support for Software Prefetching, 1998. ,
Architectural Support for Data-Driven Execution, ACM Transactions on Architecture and Code Optimization, vol.11, issue.4, pp.52:1-52:25, 2015. ,
DOI : 10.1109/ICPP.2008.74
Design and evaluation of a compiler algorithm for prefetching, Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp.62-73, 1992. ,
An exploration of instruction fetch requirement in out-of-order superscalar processors, International Journal of Parallel Programming, vol.29, issue.1, pp.35-58, 2001. ,
DOI : 10.1023/A:1026431920605
An innovative low-power high-performance programmable signal processor for digital communications, IBM Journal of Research and Development, vol.47, issue.2.3, pp.299-326, 2003. ,
DOI : 10.1147/rd.472.0299
URL : http://www.research.ibm.com/journal/rd/472/moreno.pdf
Exploring the potential of heterogeneous Von Neumann/Dataflow execution models, Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pp.298-310, 2015. ,
NVIDIA Tegra 4 family CPU architecture, 2013. ,
Reducing data cache energy consumption via cached load/store queue, Proceedings of the 2003 international symposium on Low power electronics and design, ISLPED '03, pp.252-257, 2003. ,
DOI : 10.1145/871506.871569
URL : http://www.cecs.uci.edu/conference_proceedings/islped_2003/nicolaescu_reducing.pdf
Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp.308-315, 1998. ,
DOI : 10.1109/MICRO.1998.742792
Maximum Performance Computing with Dataflow Engines, Computing in Science & Engineering, vol.14, issue.4, pp.98-103, 2012. ,
DOI : 10.1109/MCSE.2012.78
Complexity-Effective Superscalar Processors, 1998. ,
DOI : 10.1145/384286.264201
Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading, IEEE International Solid-State Circuits Conference. Digest of Technical Papers, pp.334-472, 2002. ,
Critical issues regarding the trace cache fetch mechanism, 1997. ,
Reducing wire delay penalty through value prediction, Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, pp.317-326, 2000. ,
DOI : 10.1145/360128.360163
URL : http://upcommons.upc.edu/bitstream/2117/101126/1/00898081.pdf
Low power microarchitecture with instruction reuse, Proceedings of the 2008 conference on Computing frontiers, CF '08, pp.149-158, 2008. ,
DOI : 10.1145/1366230.1366259
Complexity-effective superscalar processors, Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pp.206-218, 1997. ,
DOI : 10.1145/384286.264201
URL : https://minds.wisconsin.edu/bitstream/handle/1793/11224/file_1.pdf?sequence=1
Reducing design complexity of the load/store queue, Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-36, p.411, 2003. ,
DOI : 10.1109/MICRO.2003.1253245
Reducing instruction fetch energy with backwards branch control information and buffering, Proceedings of the 2003 international symposium on Low power electronics and design, ISLPED '03, pp.322-325, 2003. ,
DOI : 10.1145/871506.871586
Trace cache: a low latency approach to high bandwidth instruction fetching, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, pp.24-34, 1996. ,
DOI : 10.1109/MICRO.1996.566447
URL : http://www.cs.utah.edu/classes/cs7810-rajeev/papers/rotenberg96.pdf
The Inhibition of Potential Parallelism by Conditional Jumps, IEEE Transactions on Computers, vol.21, issue.12, pp.1405-1411, 1972. ,
DOI : 10.1109/T-C.1972.223514
An empirical study of decentralized ILP execution models, Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VIII, pp.272-281, 1998. ,
Virtually split cache: An efficient mechanism to distribute instructions and data ,
A Complexity-effective microprocessor design with decoupled dispatch queues and prefetching, Parallel Computing, vol.35, issue.5, pp.255-268, 2009. ,
Trace processors, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.138-148, 1997. ,
DOI : 10.1109/MICRO.1997.645805
Trace processors, Proceedings of 30th Annual International Symposium on Microarchitecture, 1999. ,
DOI : 10.1109/MICRO.1997.645805
On high-bandwidth data cache design for multi-issue processors, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.46-56, 1997. ,
DOI : 10.1109/MICRO.1997.645796
URL : http://www.eecs.umich.edu/~jrivers/MICRO-30.ps.gz
"Jaguar" AMD's next generation low power x86 core, 2012 IEEE Hot Chips 24 Symposium (HCS), pp.1-20, 2012. ,
DOI : 10.1109/HOTCHIPS.2012.7476479
Reducing rename logic complexity for high-speed and low-power front-end architectures, IEEE Transactions on Computers, vol.55, issue.6, pp.672-685, 2006. ,
DOI : 10.1109/TC.2006.88
Cache designs for energy efficiency, Twenty-Eighth Annual Hawaii International Conference on System Sciences, pp.306-315, 1995. ,
Scalable hardware memory disambiguation for high ILP processors, International Symposium on Microarchitecture (MICRO), 2003. ,
Design tradeoffs for the Alpha EV8 conditional branch predictor, 29th Annual International Symposium on Computer Architecture, pp.295-306, 2002. ,
DOI : 10.1145/545214.545249
URL : http://courses.ece.uiuc.edu/ece512/papers/seznec.2002.isca.pdf
Modulo scheduling for a fully-distributed clustered VLIW architecture, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33, pp.124-133, 2000. ,
The design space of register renaming techniques, IEEE Micro, vol.20, issue.5, pp.70-83, 2000. ,
DOI : 10.1109/40.877952
Architectural Specialization for Inter-Iteration Loop Dependence Patterns, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp.583-595, 2014. ,
DOI : 10.1109/MICRO.2014.31
IBM POWER7 multicore server processor, IBM Journal of Research and Development, vol.55, issue.3, pp.1:1-1:29, 2011. ,
DOI : 10.1147/JRD.2011.2127330
Hardware/Software Helper Thread Prefetching on Heterogeneous Many Cores, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pp.214-221, 2014. ,
DOI : 10.1109/SBAC-PAD.2014.39
URL : https://hal.archives-ouvertes.fr/hal-01087752
Load latency tolerance in dynamically scheduled processors, Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 31, pp.148-159, 1998. ,
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06), pp.273-284, 2006. ,
DOI : 10.1109/MICRO.2006.26
URL : http://www-static.cc.gatech.edu/~loh/Papers/micro2006-fnf.pdf
A case for (partially) TAgged GEometric history length branch prediction, Journal of Instruction Level Parallelism, 2006. ,
A study of branch prediction strategies, 25 years of the international symposia on Computer architecture (selected papers) , ISCA '98, pp.135-148, 1981. ,
DOI : 10.1145/285930.285980
Retrospective: implementing precise interrupts in pipelined processors, 25 years of the international symposia on Computer architecture (selected papers) , ISCA '98, p.42, 1998. ,
DOI : 10.1145/285930.285948
Micro-operation cache, Proceedings of the 2001 international symposium on Low power electronics and design, ISLPED '01, pp.801-811, 2003. ,
DOI : 10.1145/383082.383085
Scalable Store-Load Forwarding via Store Queue Index Prediction, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp.159-170, 2005. ,
Exploiting idle floating-point resources for integer execution, Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pp.118-129, 1998. ,
DOI : 10.1145/277650.277709
URL : http://www.ece.wisc.edu/~jes/papers/pldi98.sastry.ps
Dynamic instruction reuse, The 24th Annual International Symposium on Computer Architecture, pp.194-205, 1997. ,
DOI : 10.1145/264107.264200
URL : https://minds.wisconsin.edu/bitstream/handle/1793/9470/file_1.pdf?sequence=1
Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings., pp.383-394, 2002. ,
DOI : 10.1109/MICRO.2002.1176265
URL : ftp://ftp.irisa.fr:/local/caps/WSRS.pdf
Instruction issue logic for high-performance, interruptable pipelined processors, Proceedings of the 14th Annual International Symposium on Computer Architecture, ISCA '87, pp.27-34, 1987. ,
DOI : 10.1145/30350.30354
IBM POWER8 processor core microarchitecture, IBM Journal of Research and Development, vol.59, issue.1, pp.1-2, 2015. ,
DOI : 10.1147/JRD.2014.2376112
The performance potential of data dependence speculation and collapsing, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, pp.238-247, 1996. ,
DOI : 10.1109/MICRO.1996.566465
A Criticality Analysis of Clustering in Superscalar Processors, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp.55-66, 2005. ,
DOI : 10.1109/MICRO.2005.6
Design of a predictive filter cache for energy savings in high performance processor architectures, ICCD International Conference on Computer Design, pp.68-73, 2001. ,
CortexSuite: A Synthetic Brain Benchmark Suite, IISWC, 2014. ,
DOI : 10.1109/iiswc.2014.6983043
URL : http://cseweb.ucsd.edu/%7Embtaylor/papers/iiswc_2014_cortexsuite_thomas.pdf
Parallel operation in the Control Data 6600, Proceedings of the Fall Joint Computer Conference, Part II: Very High Speed Computer Systems, AFIPS '64 (Fall, part II), pp.33-40, 1964. ,
Dynamic prediction of critical path instructions, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, pp.185-195, 2001. ,
DOI : 10.1109/HPCA.2001.903262
Execution cache-based microarchitecture for power-efficient superscalar processors, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp.14-26, 2005. ,
DOI : 10.1109/TVLSI.2004.840406
URL : http://www.ece.cmu.edu/~dianam/journals/tvlsi05-2.pdf
An efficient algorithm for exploiting multiple arithmetic units, IBM J. Res. Dev, vol.11, issue.1, pp.25-33, 1967. ,
Achieving Out-of-Order performance with almost In-Order complexity, Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pp.3-12, 2008. ,
Reducing power in high-performance microprocessors, Proceedings of the 35th annual conference on Design automation conference, DAC '98, pp.732-737, 1998. ,
DOI : 10.1145/277044.277227
URL : http://herkules.informatik.tu-chemnitz.de/proceedings/dac-98/sun_sgi/../pdffiles/44_2.pdf
SD-VBS: The San Diego Vision Benchmark Suite, Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.55-64, 2009. ,
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences, Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pp.1-12, 1997. ,
DOI : 10.1145/264107.264119
The MIPS R10000 superscalar microprocessor, IEEE Micro, vol.16, issue.2, pp.28-41, 1996. ,
Speculation techniques for improving load related instruction scheduling, Proceedings of the 26th Annual International Symposium on Computer Architecture, ISCA '99, pp.42-53, 1999. ,
DOI : 10.1109/isca.1999.765938
URL : http://home.austin.rr.com/yoaz/ISCA99A.pdf
Power-efficient instruction delivery through trace reuse, Proceedings of the 15th international conference on Parallel architectures and compilation techniques, PACT '06, pp.192-201, 2006. ,
DOI : 10.1145/1152154.1152185
URL : http://www.cs.virginia.edu/~pact2006/program/pact2006/pact29_yang8.pdf
Inherently lower-power high-performance superscalar architectures, IEEE Transactions on Computers, vol.50, issue.3, pp.268-285, 2001. ,
DOI : 10.1109/12.910816
Simulating a multi-core x86_64 architecture with hardware ISA extension supporting a data-flow execution model, 2014 2nd International Conference on Artificial Intelligence, Modelling and Simulation, pp.264-269, 2014. ,
Dataflow support in x86_64 multicore architectures through small hardware extensions, Digital System Design (DSD), 2015 Euromicro Conference on Digital System Design, pp.526-529, 2015. ,
Revisiting Clustered Microarchitecture for Future Superscalar Cores, ACM Transactions on Architecture and Code Optimization, vol.12, issue.3, pp.28:1-28:22, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01193178
OmniScriptum GmbH & Co. KG, 2014. ,
3.8 IPC gain over the baseline for a 4-cluster back-end. Each cluster can issue and execute 2 micro-ops per cycle. Only SPEC INT averages are shown, p.62
Top graph (a): impact of the loop buffer size. Bottom graph (b): impact of using a Loop Reuse Table, with MaxBody = 128, p.73