Conference papers

XKBlas: a High Performance Implementation of BLAS-3 Kernels on Multi-GPU Server

Abstract: Over the last ten years, GPUs have dominated the market in terms of computing performance per unit of power, and numerous research works have provided GPU-accelerated implementations of the Basic Linear Algebra Subprograms (BLAS). Several software libraries have been developed to exploit the performance of systems with accelerators, but the achieved performance may be far from the platform's peak performance. This paper presents XKBlas, which aims to improve the performance of BLAS-3 kernels on multi-GPU systems. At the low level, we model computation as a set of tasks accessing data on different resources. At the high level, the API design favors non-blocking calls as a uniform concept to overlap latency, even for fine-grained computation. Unit benchmarks of BLAS-3 kernels showed that XKBlas outperformed most implementations, even when the overhead of dynamic task creation and scheduling is included. XKBlas outperformed BLAS implementations such as cuBLAS-XT, PaRSEC, BLASX and Chameleon/StarPU.
Contributor : Thierry Gautier
Submitted on : Tuesday, September 28, 2021 - 10:21:26 AM
Last modification on : Friday, January 7, 2022 - 11:08:14 AM
Long-term archiving on : Wednesday, December 29, 2021 - 6:15:44 PM


Files produced by the author(s)




Thierry Gautier, Joao Vicente Ferreira Lima. XKBlas: a High Performance Implementation of BLAS-3 Kernels on Multi-GPU Server. PDP 2020 - 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Mar 2020, Västerås, Sweden. pp.1-8, ⟨10.1109/PDP50117.2020.00008⟩. ⟨hal-03121583⟩


