OpenTuner: An Extensible Framework for Program Autotuning, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14), pp.303-316, 2014. ,
An adaptive performance modeling tool for GPU architectures, ACM Sigplan Notices, vol.45, pp.105-114, 2010. ,
Automatic C-to-CUDA code generation for affine programs, Compiler Construction, pp.244-263, 2010. ,
DOI : 10.1007/978-3-642-11970-5_14
A practical automatic polyhedral parallelizer and locality optimizer, In ACM SIGPLAN Notices, vol.43, pp.101-113, 2008. ,
DOI : 10.1145/1379022.1375595
PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures, Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS '11), pp.676-687, 2011. ,
DOI : 10.1109/ipdps.2011.70
, CUDA occupancy calculator
OpenMP: an industry standard API for shared-memory programming, IEEE computational science and engineering, vol.5, pp.46-55, 1998. ,
DOI : 10.1109/99.660313
Report on the sunway taihulight system. PDF). www. netlib. org, 2016. ,
Hybrid Hexagonal/Classical Tiling for GPUs, Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14), vol.66, 2014. ,
DOI : 10.1145/2581122.2544160
URL : https://hal.archives-ouvertes.fr/hal-00911177
Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles, Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, pp.24-31, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00786812
High-performance Code Generation for Stencil Computations on GPU Architectures, Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12), pp.311-320, 2012. ,
DOI : 10.1145/2304576.2304619
URL : http://www.cs.ucla.edu/%7Epouchet/doc/ics-article.12.pdf
GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis, 2018. ,
DOI : 10.1145/3192366.3192397
URL : https://hal.archives-ouvertes.fr/hal-01955475
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, In ACM SIGARCH Computer Architecture News, vol.37, pp.152-163, 2009. ,
DOI : 10.1145/1555754.1555775
URL : http://www.cc.gatech.edu/%7Ehyesoon/hong_isca09.pdf
GPURoofline: a model for guiding performance optimizations on GPUs, European Conference on Parallel Processing, pp.920-932, 2012. ,
DOI : 10.1007/978-3-642-32820-6_90
Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs, Code Generation and Optimization (CGO), pp.1-10, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00789958
COMPASS: A Framework for Automated Performance Modeling and Prediction, ACM International Conference on Supercomputing (ICS15, 2015. ,
OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing, HPDC '14: Proceedings of the ACM Symposium on High-Performance Parallel and Distributed Computing, 2014. ,
DOI : 10.1109/waccpd.2014.7
Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters, Proceedings of the 2010 IEEE International Conference on Cluster Computing, pp.207-216, 2010. ,
DOI : 10.1109/cluster.2010.26
Optimizing tensor contraction expressions for hybrid CPU-GPU execution, Cluster computing, vol.16, pp.131-155, 2013. ,
DOI : 10.1007/s10586-011-0179-2
Automatic optimization of thread-coarsening for graphics processors, Proceedings of the 23rd international conference on Parallel architectures and compilation, pp.455-466, 2014. ,
DOI : 10.1145/2628071.2628087
URL : https://www.pure.ed.ac.uk/ws/files/19958629/magni14pact.pdf
A large-scale cross-architecture evaluation of threadcoarsening, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.11, 2013. ,
DOI : 10.1145/2503210.2503268
, Nervana maxas
, CUDA Binary Utilities, 2018.
Micro-benchmarking the GT200 GPU, 2009. ,
Fusing Convolution Kernels Through Tiling, Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pp.43-48, 2015. ,
Resource Conscious Reuse-Driven Tiling for GPUs, International Conference on Parallel Architectures and Compilation Techniques, pp.99-111, 2016. ,
Cache-conscious wavefront scheduling, Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp.72-83, 2012. ,
, Many-body methods in chemistry and physics: MBPT and coupled-cluster theory, 2009.
A performance analysis framework for identifying potential benefits in GPGPU applications, In ACM SIGPLAN Notices, vol.47, pp.11-22, 2012. ,
Automatic restructuring of GPU kernels for exploiting inter-thread data locality, International Conference on Compiler Construction, pp.21-40, 2012. ,
NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations, Computer Physics Communications, vol.181, pp.1477-1489, 2010. ,
Polyhedral parallel code generation for CUDA, ACM Transactions on Architecture and Code Optimization (TACO), vol.9, p.54, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00786677
Constant propagation with conditional branches, ACM Transactions on Programming Languages and Systems, vol.13, pp.181-210, 1991. ,
, Manuscript submitted to ACM GPU Code Optimization using Abstract Kernel
, Whitepaper 2012. NVIDIA Tesla K100
, Whitepaper 2016. NVIDIA Tesla P100
OpenACC-first experiences with real-world applications, Euro-Par, pp.859-870, 2012. ,
Roofline: an insightful visual performance model for multicore architectures, Commun. ACM, vol.52, pp.65-76, 2009. ,
Taming the "Monster": Overcoming program optimization challenges on SW26010 through precise performance modeling, Parallel and Distributed Processing Symposium (IPDPS), 2018. ,
Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.31-43, 2017. ,
A quantitative performance analysis model for GPU architectures, 2011 IEEE 17th International Symposium on High Performance Computer Architecture, pp.382-393, 2011. ,
A performance analysis framework for exploiting GPU microarchitectural capability, Proceedings of the International Conference on Supercomputing. ACM, 15, 2017. ,