A. Aggarwal and M. Franklin, Hierarchical interconnects for on-chip clustering, Proceedings 16th International Parallel and Distributed Processing Symposium, p.173, 2002.
DOI : 10.1109/IPDPS.2002.1015559

V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th International Symposium on Computer Architecture, pp.248-259, 2000.

AMD, Software Optimization Guide for AMD Family 16h Processors, 2012.

W. Abu-Sufah, Automatic program transformations for virtual memory computers, Proc. Nat. Computer Conf, pp.969-975, 1979.

D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo, The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling, IBM Journal of Research and Development, vol.11, issue.1, pp.8-24, 1967.
DOI : 10.1147/rd.111.0008

B. Allu and W. Zhang, Exploiting the replication cache to improve performance for multiple-issue microprocessors, ACM SIGARCH Computer Architecture News, vol.33, issue.3, pp.63-71, 2005.
DOI : 10.1145/1101868.1101880

G. Blake, R. G. Dreslinski, T. Mudge, and K. Flautner, Evolution of thread-level parallelism in desktop applications, ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.302-313, 2010.
DOI : 10.1145/1816038.1816000

J. Bunda, D. Fussell, and W. C. Athas, Energy-efficient instruction set architecture for CMOS microprocessors, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, pp.298-305, 1995.
DOI : 10.1109/HICSS.1995.375384

R. S. Bajwa, M. Hiraki, H. Kojima, D. J. Gorny, K. Nitta et al., Instruction buffering to reduce power in processors for signal processing, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp.417-424, 1997.
DOI : 10.1109/92.645068

N. Bellas, I. Hajj, and C. Polychronopoulos, Using dynamic cache management techniques to reduce energy in a high-performance processor, Proceedings of the 1999 International Symposium on Low Power Electronics and Design, ISLPED '99, pp.64-69, 1999.
DOI : 10.1145/313817.313856

R. Bhargava and L. K. John, Improving dynamic cluster assignment for clustered trace cache processors, Proceedings of the 30th Annual International Symposium on Computer Architecture, ISCA '03, pp.264-274, 2003.

D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John et al., Scaling to the end of silicon with EDGE architectures, Computer, vol.37, issue.7, pp.44-55, 2004.

A. Baniasadi and A. Moshovos, Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors, Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pp.337-347, 2000.
DOI : 10.1145/360128.360165

S. Borkar, Design challenges of technology scaling, IEEE Micro, vol.19, issue.4, pp.23-29, 1999.
DOI : 10.1109/40.782564

L. Baugh and C. Zilles, Decomposing the load-store queue by function for power reduction and scalability, IBM Journal of Research and Development, vol.50, issue.2.3, pp.287-297, 2006.
DOI : 10.1147/rd.502.0287

P. Carazo, R. Apolloni, F. Castro, D. Chaver, L. Pinuel et al., L1 Data Cache Power Reduction Using a Forwarding Predictor, Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation, pp.116-125, 2010.
DOI : 10.1109/ISCA.1998.694768
URL : http://oa.upm.es/9392/1/INVE_MEM_2010_87639.pdf

Q. Cai, J. M. Codina, J. González, and A. González, A software-hardware hybrid steering mechanism for clustered microarchitectures, IEEE International Symposium on Parallel and Distributed Processing, pp.1-12, 2008.

G. Z. Chrysos and J. S. Emer, Memory Dependence Prediction Using Store Sets, Proceedings of the 25th Annual International Symposium on Computer Architecture, ISCA '98, pp.142-153, 1998.

R. Canal and A. González, Reducing the complexity of the issue logic, Proceedings of the 15th international conference on Supercomputing , ICS '01, pp.312-320, 2001.
DOI : 10.1145/377792.377854

N. Clark, A. Hormati, and S. Mahlke, VEAL: Virtualized execution accelerator for loops, 35th International Symposium on Computer Architecture (ISCA), pp.389-400, 2008.
DOI : 10.1145/1394608.1382155

H. W. Cain and M. H. Lipasti, Memory ordering: A value-based approach, Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA '04, p.90, 2004.

K. Czechowski, V. W. Lee, E. Grochowski, R. Ronen, R. Singhal et al., Improving the energy efficiency of big cores, Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA '14, pp.493-504, 2014.

Curtis, R. J. Murray, and H. Opie, Multiported bypass cache in a bypass network, 1999.

R. P. Colwell, The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Software Engineering "Best Practices"), 2005.
DOI : 10.1109/9780471749127

R. Canal, J. M. Parcerisa, and A. González, A cost-effective clustered architecture, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425), p.160, 1999.
DOI : 10.1109/PACT.1999.807517
URL : http://upcommons.upc.edu/bitstream/2117/100821/1/00807517.pdf

R. Canal, J. M. Parcerisa, and A. González, Dynamic cluster assignment mechanisms, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), pp.133-142, 2000.
DOI : 10.1109/HPCA.2000.824345
URL : http://upcommons.upc.edu/bitstream/2117/100590/1/00824345.pdf

R. Canal, J. M. Parcerisa, and A. González, Dynamic code partitioning for clustered architectures, International Journal of Parallel Programming, vol.29, issue.1, pp.59-79, 2001.
DOI : 10.1023/A:1026483904675

B. Calder and G. Reinman, A comparative survey of load speculation architectures, 2000.

M. R. de Alba and D. R. Kaeli, Runtime predictability of loops, Proceedings of the Workshop on Workload Characterization, WWC '01, pp.91-98, 2001.

K. J. Deris and A. Baniasadi, Investigating cache energy and latency break-even points in high performance processors, ACM SIGARCH Computer Architecture News, vol.35, issue.4, pp.13-20, 2007.
DOI : 10.1145/1327312.1327316

G. Desoli, Instruction Assignment for Clustered VLIW DSP Compilers: A New Approach, 1998.

R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous et al., Design of ion-implanted MOSFET's with very small physical dimensions, IEEE Journal of Solid-State Circuits, vol.9, issue.5, pp.256-268, 1974.

H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, Dark silicon and the end of multicore scaling, Proceedings of the 38th Annual International Symposium on Computer Architecture, pp.365-376, 2011.

D. J. Everitt, Inexpensive performance using the Am29000, Microprocessors and Microsystems, vol.14, issue.6, pp.397-406, 1990.
DOI : 10.1016/0141-9331(90)90112-9

K. I. Farkas, P. Chow, N. P. Jouppi, and Z. Vranesic, The multicluster architecture: reducing cycle time through partitioning, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.149-159, 1997.
DOI : 10.1109/MICRO.1997.645806
URL : http://www.cs.utexas.edu/users/dburger/teaching/spring99/cs395t/papers/18_multicluster.ps

J. A. Farrell and T. C. Fischer, Issue logic for a 600-MHz out-of-order execution microprocessor, IEEE Journal of Solid-State Circuits, vol.33, issue.5, pp.707-712, 1998.
DOI : 10.1109/4.668985

J. Fridman and Z. Greenfield, The TigerSHARC DSP architecture, IEEE Micro, vol.20, issue.1, pp.66-76, 2000.
DOI : 10.1109/40.820055

D. H. Friendly, S. J. Patel, and Y. N. Patt, Putting the fill unit to work: dynamic optimizations for trace cache microprocessors, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp.173-181, 1998.
DOI : 10.1109/MICRO.1998.742779

B. Fields, S. Rubin, and R. Bodík, Focusing processor policies via critical-path prediction, Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA '01, pp.74-85, 2001.
DOI : 10.1145/384285.379253
URL : http://www.cs.berkeley.edu/~bodik/research/isca01a.ps

M. Franklin and G. S. Sohi, ARB: a hardware mechanism for dynamic reordering of memory references, IEEE Transactions on Computers, vol.45, issue.5, pp.552-571, 1996.
DOI : 10.1109/12.509907

M. Golden, S. Arekapudi, and J. Vinh, 40-Entry unified out-of-order scheduler and integer execution unit for the AMD Bulldozer x86-64 core, 2011 IEEE International Solid-State Circuits Conference, pp.80-82, 2011.

K. Ghose and M. B. Kamble, Energy efficient cache organizations for superscalar processors, Power-Driven Microarchitecture Workshop, in conjunction with ISCA '98, Barcelona, 1998.

J. González, F. Latorre, and A. González, Cache organizations for clustered microarchitectures, Proceedings of the 3rd Workshop on Memory Performance Issues, in conjunction with the 31st International Symposium on Computer Architecture, WMPI '04, pp.46-55, 2004.
DOI : 10.1145/1054943.1054950

A. González, F. Latorre, and G. Magklis, Processor Microarchitecture: An Implementation Perspective

M. Goshima, K. Nishino, T. Kitamura, Y. Nakashima, S. Tomita et al., A high-speed dynamic instruction scheduling scheme for superscalar processors, Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pp.225-236, 2001.
DOI : 10.1109/micro.2001.991121

A. Gordon-Ross, S. Cotterell, and F. Vahid, Exploiting Fixed Programs in Embedded Systems: A Loop Cache Example, IEEE Computer Architecture Letters, vol.1, issue.1, pp.2-2, 2002.
DOI : 10.1109/L-CA.2002.4
URL : http://www.cs.virginia.edu/~tcca/2002/gordonross_jan02.ps

A. García, O. J. Santana, E. Fernández, P. Medina, and M. Valero, LPA: A First Approach to the Loop Processor Architecture, High Performance Embedded Architectures and Compilers, pp.273-287, 2008.
DOI : 10.1007/978-3-540-77560-7_19

L. Gwennap, Intel's P6 uses decoupled superscalar design, Microprocessor Report, vol.9, issue.2, pp.9-15, 1995.

M. Hayenga, V. R. K. Naresh, and M. H. Lipasti, Revolver: Processor architecture for power efficient loop execution, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pp.591-602, 2014.
DOI : 10.1109/HPCA.2014.6835968

G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean et al., The microarchitecture of the Pentium 4 processor, Intel Technology Journal, 2001.

J. S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. J. Irwin, Scheduling reusable instructions for power reduction, Proceedings Design, Automation and Test in Europe Conference and Exhibition, pp.148-153, 2004.
DOI : 10.1109/DATE.2004.1268841
URL : http://www.cse.psu.edu/~mdl/paper/date04_607_hu.pdf

S. T. Srinivasan, R. D.-C. Ju, A. R. Lebeck, and C. Wilkerson, Locality vs. Criticality, Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA '01, pp.132-143, 2001.
DOI : 10.1145/379240.379258

Q. Jacobson and J. E. Smith, Instruction pre-processing in trace processors, Proceedings Fifth International Symposium on High-Performance Computer Architecture, pp.125-129, 1999.
DOI : 10.1109/HPCA.1999.744347
URL : http://www.ece.wisc.edu/~jes/papers/hpca99.jacobson.pdf

R. E. Kessler, The Alpha 21264 microprocessor, IEEE Micro, vol.19, issue.2, pp.24-36, 1999.
DOI : 10.1109/40.755465

J. Kin, M. Gupta, and W. H. Mangione-Smith, The filter cache: an energy efficient memory structure, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.184-193, 1997.
DOI : 10.1109/MICRO.1997.645809
URL : http://www.ece.northwestern.edu/~rjoseph/ece510-fall2005/papers/kin97filter.pdf

D. Kim, S. S. Liao, P. H. Wang, J. del Cuvillo, X. Tian et al., Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors, International Symposium on Code Generation and Optimization, pp.27-38, 2004.

M. Kobayashi, Dynamic Characteristics of Loops, IEEE Transactions on Computers, vol.33, issue.2, pp.125-132, 1984.
DOI : 10.1109/TC.1984.1676404

T. S. Karkhanis and J. E. Smith, A first-order superscalar processor model, Proceedings. 31st Annual International Symposium on Computer Architecture, pp.338-349, 2004.
DOI : 10.1145/1028176.1006729

M. Kamruzzaman, S. Swanson, and D. M. Tullsen, Inter-core prefetching for multicore processors using migrating helper threads, Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pp.393-404, 2011.
DOI : 10.1145/1950365.1950411
URL : http://cseweb.ucsd.edu/users/swanson/papers/ASPLOS2011Prefetching.pdf

T. Lanier, Exploring the Design of the Cortex A15 Processor, https://www.arm.com/files, AT-Exploring_the_Design_of_the_Cortex-A15.pdf, 2011.

C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser et al., Pin: building customized program analysis tools with dynamic instrumentation, ACM SIGPLAN Notices, vol.40, issue.6, pp.190-200, 2005.
DOI : 10.1145/1064978.1065034

J. Lu, A. Das, W.-C. Hsu, K. Nguyen, and S. G. Abraham, Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor, Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pp.93-104, 2005.

P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein et al., The Multiflow trace scheduling compiler, The Journal of Supercomputing, vol.34, issue.1, pp.51-142, 1993.
DOI : 10.1109/2.19820
URL : http://www.eecg.toronto.edu/~tsa/crgpapers/lowney92multiflow.pdf

L. H. Lee, B. Moyer, and J. Arends, Instruction fetch energy reduction using loop caches for embedded applications with small tight loops, Proceedings of the 1999 International Symposium on Low Power Electronics and Design, ISLPED '99, pp.267-269, 1999.
DOI : 10.1145/313817.313944

M. H. Lipasti, C. B. Wilkerson, and J. P. Shen, Value locality and load value prediction, Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VII, pp.138-147, 1996.

D. Matzke, Will physical scalability sabotage performance gains?, Computer, vol.30, issue.9, pp.37-39, 1997.

A. Moshovos, S. E. Breach, T. N. Vijaykumar, and G. S. Sohi, Dynamic speculation and synchronization of data dependences, Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pp.181-193, 1997.
DOI : 10.1145/384286.264189
URL : https://minds.wisconsin.edu/bitstream/handle/1793/9468/file_1.pdf?sequence=1

M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, S. C. Goldstein et al., Tartan: Evaluating spatial computation for whole program execution, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp.163-174, 2006.

N. McIntosh, Compiler Support for Software Prefetching, 1998.

G. Matheou and P. Evripidou, Architectural Support for Data-Driven Execution, ACM Transactions on Architecture and Code Optimization, vol.11, issue.4, pp.52:1-52:25, 2015.
DOI : 10.1109/ICPP.2008.74

T. C. Mowry, M. S. Lam, and A. Gupta, Design and evaluation of a compiler algorithm for prefetching, Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp.62-73, 1992.

P. Michaud, A. Seznec, and S. Jourdan, An exploration of instruction fetch requirement in out-of-order superscalar processors, International Journal of Parallel Programming, vol.29, issue.1, pp.35-58, 2001.
DOI : 10.1023/A:1026431920605

J. H. Moreno, V. Zyuban, U. Shvadron, F. D. Neeser, J. H. Derby et al., An innovative low-power high-performance programmable signal processor for digital communications, IBM Journal of Research and Development, vol.47, issue.2.3, pp.299-326, 2003.
DOI : 10.1147/rd.472.0299
URL : http://www.research.ibm.com/journal/rd/472/moreno.pdf

T. Nowatzki, V. Gangadhar, and K. Sankaralingam, Exploring the potential of heterogeneous Von Neumann/Dataflow execution models, Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp.298-310, 2015.

NVIDIA, NVIDIA Tegra 4 family CPU architecture, 2013.

D. Nicolaescu, A. Veidenbaum, and A. Nicolau, Reducing data cache energy consumption via cached load/store queue, Proceedings of the 2003 International Symposium on Low Power Electronics and Design, ISLPED '03, pp.252-257, 2003.
DOI : 10.1145/871506.871569
URL : http://www.cecs.uci.edu/conference_proceedings/islped_2003/nicolaescu_reducing.pdf

E. Özer, S. Banerjia, and T. M. Conte, Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp.308-315, 1998.
DOI : 10.1109/MICRO.1998.742792

O. Pell and V. Averbukh, Maximum Performance Computing with Dataflow Engines, Computing in Science & Engineering, vol.14, issue.4, pp.98-103, 2012.
DOI : 10.1109/MCSE.2012.78

S. Palacharla, Complexity-Effective Superscalar Processors, 1998.
DOI : 10.1145/384286.264201

R. P. Preston, R. W. Badeau, D. W. Bailey, S. L. Bell, L. L. Biro et al., Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading, IEEE International Solid-State Circuits Conference. Digest of Technical Papers, pp.334-472, 2002.

S. J. Patel, D. H. Friendly, and Y. N. Patt, Critical issues regarding the trace cache fetch mechanism, 1997.

J. Parcerisa and A. González, Reducing wire delay penalty through value prediction, Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, pp.317-326, 2000.
DOI : 10.1145/360128.360163
URL : http://upcommons.upc.edu/bitstream/2117/101126/1/00898081.pdf

F. Pratas, G. Gaydadjiev, M. Berekovic, L. Sousa, and S. Kaxiras, Low power microarchitecture with instruction reuse, Proceedings of the 2008 conference on Computing frontiers , CF '08, pp.149-158, 2008.
DOI : 10.1145/1366230.1366259

S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-effective superscalar processors, Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pp.206-218, 1997.
DOI : 10.1145/384286.264201
URL : https://minds.wisconsin.edu/bitstream/handle/1793/11224/file_1.pdf?sequence=1

I. Park, C. L. Ooi, and T. N. Vijaykumar, Reducing design complexity of the load/store queue, Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), p.411, 2003.
DOI : 10.1109/MICRO.2003.1253245

J. A. Rivers, S. Asaad, J.-D. Wellman, and J. H. Moreno, Reducing instruction fetch energy with backwards branch control information and buffering, Proceedings of the 2003 International Symposium on Low Power Electronics and Design, ISLPED '03, pp.322-325, 2003.
DOI : 10.1145/871506.871586

E. Rotenberg, S. Bennett, and J. E. Smith, Trace cache: a low latency approach to high bandwidth instruction fetching, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, pp.24-34, 1996.
DOI : 10.1109/MICRO.1996.566447
URL : http://www.cs.utah.edu/classes/cs7810-rajeev/papers/rotenberg96.pdf

E. M. Riseman and C. C. Foster, The Inhibition of Potential Parallelism by Conditional Jumps, IEEE Transactions on Computers, vol.21, issue.12, pp.1405-1411, 1972.
DOI : 10.1109/T-C.1972.223514

N. Ranganathan and M. Franklin, An empirical study of decentralized ILP execution models, Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VIII, pp.272-281, 1998.

D. Rolán, B. B. Fraguela, and R. Doallo, Virtually split cache: An efficient mechanism to distribute instructions and data

W. W. Ro and J.-L. Gaudiot, A complexity-effective microprocessor design with decoupled dispatch queues and prefetching, Parallel Computing, vol.35, issue.5, pp.255-268, 2009.

E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, Trace processors, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.138-148, 1997.
DOI : 10.1109/MICRO.1997.645805

E. Rotenberg, Trace processors, Proceedings of 30th Annual International Symposium on Microarchitecture, 1999.
DOI : 10.1109/MICRO.1997.645805

J. A. Rivers, G. S. Tyson, E. S. Davidson, and T. M. Austin, On high-bandwidth data cache design for multi-issue processors, Proceedings of 30th Annual International Symposium on Microarchitecture, pp.46-56, 1997.
DOI : 10.1109/MICRO.1997.645796
URL : http://www.eecs.umich.edu/~jrivers/MICRO-30.ps.gz

J. Rupley, "Jaguar": AMD's next generation low power x86 core, 2012 IEEE Hot Chips 24 Symposium (HCS), pp.1-20, 2012.
DOI : 10.1109/HOTCHIPS.2012.7476479

R. Sangireddy, Reducing rename logic complexity for high-speed and low-power front-end architectures, IEEE Transactions on Computers, vol.55, issue.6, pp.672-685, 2006.
DOI : 10.1109/TC.2006.88

C.-L. Su and A. M. Despain, Cache designs for energy efficiency, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, pp.306-315, 1995.

S. Sethumadhavan, R. Desikan, D. Burger, C. R. Moore, and S. W. Keckler, Scalable hardware memory disambiguation for high ILP processors, International Symposium on Microarchitecture (MICRO), 2003.

A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, Design tradeoffs for the Alpha EV8 conditional branch predictor, 29th Annual International Symposium on Computer Architecture, pp.295-306, 2002.
DOI : 10.1145/545214.545249
URL : http://courses.ece.uiuc.edu/ece512/papers/seznec.2002.isca.pdf

J. Sánchez and A. González, Modulo scheduling for a fully-distributed clustered VLIW architecture, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33, pp.124-133, 2000.

D. Sima, The design space of register renaming techniques, IEEE Micro, vol.20, issue.5, pp.70-83, 2000.
DOI : 10.1109/40.877952

S. Srinath, B. Ilbeyi, M. Tan, G. Liu, Z. Zhang, and C. Batten, Architectural Specialization for Inter-Iteration Loop Dependence Patterns, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp.583-595, 2014.
DOI : 10.1109/MICRO.2014.31

B. Sinharoy, R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni et al., IBM POWER7 multicore server processor, IBM Journal of Research and Development, vol.55, issue.3, pp.1:1-1:29, 2011.
DOI : 10.1147/JRD.2011.2127330

B. N. Swamy, A. Ketterlin, and A. Seznec, Hardware/Software Helper Thread Prefetching on Heterogeneous Many Cores, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pp.214-221, 2014.
DOI : 10.1109/SBAC-PAD.2014.39
URL : https://hal.archives-ouvertes.fr/hal-01087752

S. T. Srinivasan and A. R. Lebeck, Load latency tolerance in dynamically scheduled processors, Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 31, pp.148-159, 1998.

S. Subramaniam and G. H. Loh, Fire-and-Forget: Load/Store Scheduling with No Store Queue at All, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06), pp.273-284, 2006.
DOI : 10.1109/MICRO.2006.26
URL : http://www-static.cc.gatech.edu/~loh/Papers/micro2006-fnf.pdf

A. Seznec and P. Michaud, A case for (partially) TAgged GEometric history length branch prediction, Journal of Instruction Level Parallelism, 2006.

J. E. Smith, A study of branch prediction strategies, 25 years of the international symposia on Computer architecture (selected papers) , ISCA '98, pp.135-148, 1981.
DOI : 10.1145/285930.285980

J. E. Smith, Retrospective: implementing precise interrupts in pipelined processors, 25 years of the international symposia on Computer architecture (selected papers) , ISCA '98, p.42, 1998.
DOI : 10.1145/285930.285948

B. Solomon, A. Mendelson, R. Ronen, D. Orenstien, and Y. Almog, Micro-operation cache, Proceedings of the 2001 International Symposium on Low Power Electronics and Design, ISLPED '01, pp.801-811, 2003.
DOI : 10.1145/383082.383085

T. Sha, M. M. K. Martin, and A. Roth, Scalable Store-Load Forwarding via Store Queue Index Prediction, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp.159-170, 2005.

S. S. Sastry, S. Palacharla, and J. E. Smith, Exploiting idle floating-point resources for integer execution, Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pp.118-129, 1998.
DOI : 10.1145/277650.277709
URL : http://www.ece.wisc.edu/~jes/papers/pldi98.sastry.ps

A. Sodani and G. S. Sohi, Dynamic instruction reuse, The 24th Annual International Symposium on Computer Architecture, pp.194-205, 1997.
DOI : 10.1145/264107.264200
URL : https://minds.wisconsin.edu/bitstream/handle/1793/9470/file_1.pdf?sequence=1

A. Seznec, E. Toullec, and O. Rochecouste, Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings., pp.383-394, 2002.
DOI : 10.1109/MICRO.2002.1176265
URL : ftp://ftp.irisa.fr:/local/caps/WSRS.pdf

G. S. Sohi and S. Vajapeyam, Instruction issue logic for high-performance, interruptable pipelined processors, Proceedings of the 14th Annual International Symposium on Computer Architecture, ISCA '87, pp.27-34, 1987.
DOI : 10.1145/30350.30354

B. Sinharoy, J. A. Van Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra et al., IBM POWER8 processor core microarchitecture, IBM Journal of Research and Development, vol.59, issue.1, pp.1-2, 2015.
DOI : 10.1147/JRD.2014.2376112

Y. Sazeides, S. Vassiliadis, and J. E. Smith, The performance potential of data dependence speculation and collapsing, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, pp.238-247, 1996.
DOI : 10.1109/MICRO.1996.566465

P. Salverda and C. Zilles, A Criticality Analysis of Clustering in Superscalar Processors, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp.55-66, 2005.
DOI : 10.1109/MICRO.2005.6

W. Tang, R. Gupta, and A. Nicolau, Design of a predictive filter cache for energy savings in high performance processor architectures, ICCD International Conference on Computer Design, pp.68-73, 2001.

S. Thomas, C. Gohkale, E. Tanuwidjaja, T. Chong, D. Lau et al., CortexSuite: A Synthetic Brain Benchmark Suite, IISWC, 2014.
DOI : 10.1109/iiswc.2014.6983043
URL : http://cseweb.ucsd.edu/%7Embtaylor/papers/iiswc_2014_cortexsuite_thomas.pdf

J. E. Thornton, Parallel operation in the Control Data 6600, Proceedings of the Fall Joint Computer Conference, Part II: Very High Speed Computer Systems, AFIPS '64 (Fall, part II), pp.33-40, 1964.

E. Tune, D. Liang, D. M. Tullsen, and B. Calder, Dynamic prediction of critical path instructions, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, pp.185-195, 2001.
DOI : 10.1109/HPCA.2001.903262

E. Talpes and D. Marculescu, Execution cache-based microarchitecture for power-efficient superscalar processors, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp.14-26, 2005.
DOI : 10.1109/TVLSI.2004.840406
URL : http://www.ece.cmu.edu/~dianam/journals/tvlsi05-2.pdf

R. M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM J. Res. Dev, vol.11, issue.1, pp.25-33, 1967.

F. Tseng and Y. N. Patt, Achieving Out-of-Order performance with almost In-Order complexity, Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pp.3-12, 2008.

V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel et al., Reducing power in high-performance microprocessors, Proceedings of the 35th Annual Design Automation Conference, DAC '98, pp.732-737, 1998.
DOI : 10.1145/277044.277227
URL : http://herkules.informatik.tu-chemnitz.de/proceedings/dac-98/sun_sgi/../pdffiles/44_2.pdf

S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie et al., SD-VBS: The San Diego Vision Benchmark Suite, Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.55-64, 2009.

S. Vajapeyam and T. Mitra, Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences, Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pp.1-12, 1997.
DOI : 10.1145/264107.264119

K. C. Yeager, The MIPS R10000 superscalar microprocessor, IEEE Micro, vol.16, issue.2, pp.28-41, 1996.

A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, Speculation techniques for improving load related instruction scheduling, Proceedings of the 26th Annual International Symposium on Computer Architecture, ISCA '99, pp.42-53, 1999.
DOI : 10.1109/isca.1999.765938
URL : http://home.austin.rr.com/yoaz/ISCA99A.pdf

C. Yang and A. Orailoglu, Power-efficient instruction delivery through trace reuse, Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT '06, pp.192-201, 2006.
DOI : 10.1145/1152154.1152185
URL : http://www.cs.virginia.edu/~pact2006/program/pact2006/pact29_yang8.pdf

V. V. Zyuban and P. M. Kogge, Inherently lower-power high-performance superscalar architectures, IEEE Transactions on Computers, vol.50, issue.3, pp.268-285, 2001.
DOI : 10.1109/12.910816

N. Ho, A. Portero, M. Solinas, A. Scionti et al., Simulating a multi-core x86_64 architecture with hardware ISA extension supporting a data-flow execution model, 2014 2nd International Conference on Artificial Intelligence, Modelling and Simulation, pp.264-269, 2014.

A. Mondelli, N. Ho, A. Scionti, M. Solinas et al., Dataflow support in x86_64 multicore architectures through small hardware extensions, 2015 Euromicro Conference on Digital System Design (DSD), pp.526-529, 2015.

P. Michaud, A. Mondelli, and A. Seznec, Revisiting Clustered Microarchitecture for Future Superscalar Cores, ACM Transactions on Architecture and Code Optimization, vol.12, issue.3, pp.28:1-28:22, 2015.
DOI : 10.1109/12.910816
URL : https://hal.archives-ouvertes.fr/hal-01193178

A. Mondelli, Analisi e valutazione di schemi di replicazione per memorie cache, OmniScriptum GmbH & Co. KG, 2014.