A. Al-Dujaili, F. Deragisch, A. Hagiescu, and W. Wong, Guppy: A GPU-like soft-core processor, 2012 International Conference on Field-Programmable Technology, pp.57-60, 2012.
DOI : 10.1109/FPT.2012.6412112

K. Andryc, M. Merchant, and R. Tessier, FlexGrip: A soft GPGPU for FPGAs, 2013 International Conference on Field-Programmable Technology (FPT), pp.230-237, 2013.
DOI : 10.1109/FPT.2013.6718358

A. Bakhoda, G. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator, 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pp.163-174, 2009.
DOI : 10.1109/ISPASS.2009.4919648

URL : http://www.stuffedcow.net/files/gpgpusim.ispass09.pdf

R. Balasubramanian, V. Gangadhar, Z. Guo, C. Ho, C. Joseph et al., Enabling GPGPU Low-Level Hardware Explorations with MIAOW, ACM Transactions on Architecture and Code Optimization, vol.12, issue.2, 2015.

N. Brunie, S. Collange, and G. Diamos, Simultaneous Branch and Warp Interweaving for Sustained GPU Performance, 39th Annual International Symposium on Computer Architecture (ISCA), Portland, OR, United States, pp.49-60, 2012.
DOI : 10.1145/2366231.2337166

URL : https://hal.archives-ouvertes.fr/ensl-00649650

J. Bush, P. Dexter, T. N. Miller, and A. Carpenter, Nyami: a synthesizable GPU architectural model for general-purpose and graphics-specific workloads, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.173-182, 2015.
DOI : 10.1109/ISPASS.2015.7095803

G. Chrysos, Intel® Xeon Phi™ Coprocessor - the Architecture, Intel Whitepaper, 2014.

S. Collange, Stack-less SIMT reconvergence at low cost, 2011.

S. Collange and N. Brunie, Path list traversal: a new class of SIMT flow tracking mechanisms, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01533085

S. Collange, M. Daumas, D. Defour, and D. Parello, Barra: a Parallel Functional Simulator for GPGPU, IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp.351-360, 2010.

A. ElTantawy and T. M. Aamodt, MIMD synchronization on SIMT architectures, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
DOI : 10.1109/MICRO.2016.7783714

X. Gong, R. Ubal, and D. Kaeli, Multi2Sim Kepler: A detailed architectural GPU simulator, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.269-278, 2017.
DOI : 10.1109/ISPASS.2017.7975298

J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, 2011.

S. Kalathingal, S. Collange, B. N. Swamy, and A. Seznec, Dynamic Inter-Thread Vectorization Architecture: Extracting DLP from TLP, 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2016.
DOI : 10.1109/SBAC-PAD.2016.11

URL : https://hal.archives-ouvertes.fr/hal-01356202

H. Kim, J. Lee, N. B. Lakshminarayana, J. Sim et al., MacSim: A CPU-GPU heterogeneous simulation framework user guide, 2012.

J. Kingyens and J. G. Steffan, A GPU-inspired soft processor for high-throughput acceleration, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pp.1-8, 2010.
DOI : 10.1109/IPDPSW.2010.5470679

C. Kozyrakis and D. Patterson, Overcoming the limitations of conventional vector processors, ACM SIGARCH Computer Architecture News, vol.31, issue.2, pp.399-409, 2003.
DOI : 10.1145/871656.859664

Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart et al., Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators, ACM Transactions on Computer Systems (TOCS), vol.31, issue.6, 2013.

Y. Lee, A. Waterman, R. Avizienis, H. Cook, C. Sun et al., A 45nm 1.3 GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators, European Solid State Circuits Conference (ESSCIRC), 2014.

J. E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, vol.28, issue.2, pp.39-55, 2008.
DOI : 10.1109/MM.2008.31

S. Liu, J. E. Lindholm, M. Y. Siu et al., Operand collector architecture, US Patent 7,834,881, 2010.

J. Meng and K. Skadron, A reconfigurable simulator for large-scale heterogeneous multicore architectures, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.119-120, 2011.
DOI : 10.1109/ispass.2011.5762722

URL : http://www.cs.virginia.edu/%7Eskadron/Papers/ispass_mv5_abstract.pdf

J. Meng, D. Tarjan, and K. Skadron, Dynamic warp subdivision for integrated branch and memory divergence tolerance, 2010.
DOI : 10.1145/1815961.1815992

J. Nickolls and W. J. Dally, The GPU Computing Era, IEEE Micro, vol.30, issue.2, pp.56-69, 2010.
DOI : 10.1109/MM.2010.41

A. Severance, J. Edwards, H. Omidian, and G. Lemieux, Soft vector processors with streaming pipelines, Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays, FPGA '14, pp.117-126, 2014.
DOI : 10.1145/2554688.2554774

URL : http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2014.pdf

A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović, The RISC-V Instruction Set Manual, 2014.
DOI : 10.1109/hotchips.2013.7478332