I. Sung, J. A. Stratton, and W. W. Hwu, DL: A data layout transformation system for heterogeneous computing, 2012 Innovative Parallel Computing (InPar), pp.513-522, 2012.
DOI : 10.1109/InPar.2012.6339606

B. Jang, D. Schaa, P. Mistry, and D. Kaeli, Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures, IEEE Transactions on Parallel and Distributed Systems, 2011.
DOI : 10.1109/TPDS.2010.107

S. Che, J. W. Sheaffer, and K. Skadron, Dymaxion: Optimizing memory access patterns for heterogeneous systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.13-13, 2011.
DOI : 10.1145/2063384.2063401

L. Karlsson, Blocked in-place transposition with application to storage format conversion, 2009.

F. Gustavson, L. Karlsson, and B. Kågström, Parallel and cache-efficient in-place matrix storage format conversion, ACM Transactions on Mathematical Software.

G. Ruetsch and P. Micikevicius, Optimizing matrix transpose in CUDA, 2009.

A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, Analyzing CUDA Workloads Using a Detailed GPGPU Simulator, ISPASS 2009: IEEE International Symposium on Performance Analysis of Systems and Software, pp.163-174, 2009.

GPU Computing SDK, Available: https://developer.nvidia.com/gpu-computing-sdk

Parboil Benchmarks, Available