S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu, An adaptive performance modeling tool for GPU architectures, ACM SIGPLAN Notices, vol.45, issue.5, pp.105-114, 2010.
DOI : 10.1145/1837853.1693470

A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator, 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pp.163-174, 2009.
DOI : 10.1109/ISPASS.2009.4919648
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.507.8371

S. Collange, M. Daumas, D. Defour, and D. Parello, Barra: A Parallel Functional Simulator for GPGPU, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp.351-360, 2010.
DOI : 10.1109/MASCOTS.2010.43

S. Hong and H. Kim, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, ACM SIGARCH Computer Architecture News, vol.37, issue.3, pp.152-163, 2009.
DOI : 10.1145/1555815.1555775

K. Z. Ibrahim and F. Bodin, Efficient SIMDization and Data Management of the Lattice QCD Computation on the Cell Broadband Engine, Scientific Programming, vol.17, issue.1-2, pp.153-172, 2009.
DOI : 10.1155/2009/634756

Y. Kim and A. Shrivastava, CuMAPz: a tool to analyze memory access patterns in CUDA, Proceedings of the 48th Design Automation Conference, DAC '11, pp.128-133, 2011.
DOI : 10.1145/2024724.2024754

G. Ruetsch and P. Micikevicius, Optimizing matrix transpose in CUDA, 2009.

S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng et al., Program optimization space pruning for a multithreaded GPU, Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, CGO '08, pp.195-204, 2008.
DOI : 10.1145/1356058.1356084

H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, Demystifying GPU microarchitecture through microbenchmarking, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pp.235-246, 2010.
DOI : 10.1109/ISPASS.2010.5452013
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.189.5309

Y. Zhang and J. D. Owens, A quantitative performance analysis model for GPU architectures, 2011 IEEE 17th International Symposium on High Performance Computer Architecture, pp.382-393, 2011.
DOI : 10.1109/HPCA.2011.5749745