DL: A data layout transformation system for heterogeneous computing, 2012 Innovative Parallel Computing (InPar), pp.513-522, 2012.
DOI : 10.1109/InPar.2012.6339606
Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures, IEEE Transactions on Parallel and Distributed Systems, 2011.
DOI : 10.1109/TPDS.2010.107
Dymaxion, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.13-13, 2011.
DOI : 10.1145/2063384.2063401
Blocked in-place transposition with application to storage format conversion, 2009.
Kågström. Parallel and cache-efficient in-place matrix storage format conversion, ACM Transactions on Mathematical Software.
Optimizing matrix transpose in CUDA, 2009.
Analyzing CUDA Workloads Using a Detailed GPGPU Simulator, ISPASS 2009: IEEE International Symposium on Performance Analysis of Systems and Software, pp.163-174, 2009.
Available: https://developer.nvidia.com/gpu-computing-sdk
Parboil Benchmarks, Available