C. Bienia, S. Kumar, J. Singh, and K. Li, The PARSEC benchmark suite: Characterization and architectural implications, PACT. ACM, pp.72-81, 2008.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, IEEE Workload Characterization Symposium, vol.0, pp.44-54, 2009.

A. ElTantawy and T. M. Aamodt, MIMD Synchronization on SIMT Architectures, 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016.

R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago et al., Tarantula: a vector extension to the alpha architecture, 29th Annual International Symposium on Computer Architecture. IEEE, pp.281-292, 2002.

R. Espasa, M. Valero, and J. E. Smith, Out-of-order vector architectures, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, pp.160-170, 1997.

W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, ACM Transactions on Architecture and Code Optimization (TACO), vol.6, p.7, 2009.

A. Haidar, A. Abdelfattah, S. Tomov, and J. Dongarra, High-performance Cholesky factorization for GPU-only execution, Proceedings of the General Purpose GPUs, pp.42-52, 2017.

S. Hily and A. Seznec, Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading, Proceedings of the Fifth International Symposium on High-Performance Computer Architecture. IEEE, pp.64-67, 1999.
URL : https://hal.archives-ouvertes.fr/inria-00073298

Intel Corporation, Intel 64 and IA-32 architectures optimization reference manual, 2017.

S. Kalathingal, S. Collange, B. N. Swamy, and A. Seznec, DITVA: Dynamic Inter-Thread Vectorization Architecture, J. Parallel and Distrib. Comput, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01655904

R. Karrenberg and S. Hack, Whole-function vectorization, CGO. IEEE, pp.141-150, 2011.

J. Kim, S. Jiang, C. Torng, M. Wang, S. Srinath et al., Using intra-core loop-task accelerators to improve the productivity and performance of task-based parallel programs, Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, pp.759-773, 2017.

Y. Lee, V. Grover, R. Krashinsky, M. Stephenson, S. W. Keckler et al., Exploring the design space of SPMD divergence management on data-parallel architectures, 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, pp.101-113, 2014.

S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen et al., McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures, 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp.469-480, 2009.

J. E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, vol.28, pp.39-55, 2008.

G. Long, D. Franklin, S. Biswas, P. Ortiz, J. Oberg et al., Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors, Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.337-348, 2010.

C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser et al., Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, SIGPLAN Not, vol.40, pp.190-200, 2005.

D. S. McFarlin, C. Tucker, and C. Zilles, Discerning the Dominant Out-of-Order Performance Advantage: Is It Speculation or Dynamism?, Proc. of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13), pp.241-252, 2013.

M. McKeown, J. Balkind, and D. Wentzlaff, Execution Drafting: Energy Efficiency Through Computation Deduplication, Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), pp.432-444, 2014.

T. Milanez, S. Collange, F. M. Pereira, W. Meira, and R. Ferreira, Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads, Parallel Comput, vol.40, pp.548-558, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01087054

S. Mittal and J. S. Vetter, A Survey of CPU-GPU Heterogeneous Computing Techniques, ACM Comput. Surv, vol.47, 2015.

J. Nickolls and W. J. Dally, The GPU Computing Era, IEEE Micro, vol.30, pp.56-69, 2010.

A. Pajuelo, A. González, and M. Valero, Speculative dynamic vectorization, 29th Annual International Symposium on Computer Architecture. IEEE, pp.271-280, 2002.

A. Pajuelo, A. González, and M. Valero, Control-flow independence reuse via dynamic vectorization, 19th IEEE International Parallel and Distributed Processing Symposium, p.10, 2005.

M. Pharr and W. R. Mark, ispc: A SPMD compiler for high-performance CPU programming, Innovative Parallel Computing (InPar), 2012.

N. Prémillieu and A. Seznec, SYRANT: SYmmetric Resource Allocation on Not-taken and Taken Paths, ACM Transactions on Architecture and Code Optimization (TACO) -HIPEAC Papers, vol.8, p.4, 2012.

N. Prémillieu and A. Seznec, Efficient Out-of-Order Execution of Guarded ISAs, ACM Transactions on Architecture and Code Optimization, vol.11, pp.1-21, 2014.

E. Safi, A. Moshovos, and A. Veneris, Two-Stage, Pipelined Register Renaming, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.19, pp.1926-1931, 2011.

A. Seznec, A New Case for the TAGE Branch Predictor, Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), pp.117-127, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00639193

F. M. Sleiman and T. F. Wenisch, Efficiently scaling out-of-order cores for simultaneous multithreading, ACM SIGARCH Computer Architecture News, vol.44, pp.431-443, 2016.

N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole et al., The ARM Scalable Vector Extension, IEEE Micro, vol.37, issue.2, pp.26-39, 2017.

S. Vajapeyam, P. J. Joseph, and T. Mitra, Dynamic Vectorization: A Mechanism for Exploiting Far-Flung ILP in Ordinary Programs, 26th International Symposium on Computer Architecture, pp.16-27, 1999.

P. H. Wang, H. Wang, R. M. Kling, K. Ramakrishnan, and J. P. Shen, Register renaming and scheduling for dynamic execution of predicated code, International Symposium on High-Performance Computer Architecture (HPCA), pp.15-25, 2001.

H. Wong and T. M. Aamodt, The Performance Potential for Single Application Heterogeneous Systems, 8th Workshop on Duplicating, Deconstructing, and Debunking, 2009.