A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)

Bérenger Bramas

doi:10.7717/peerj-cs.769

Journal article, PeerJ Computer Science, Year: 2021

Bérenger Bramas

Role: Author
PersonId: 739336
IdHAL: berenger-bramas
ORCID: 0000-0003-0281-9709
IdRef: 192518178
Compilation pour les Architectures MUlti-coeurS

Laboratoire des sciences de l'ingénieur, de l'informatique et de l'imagerie

Abstract

How developers implement their algorithms, and how those implementations behave on modern CPUs, is governed by the design and organization of those CPUs. The vectorization units (SIMD) are among the few parts of a CPU that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations, and each generation pushed developers to adapt and improve previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE's interface differs from the x86 extensions in several respects: it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE raises new challenges in adapting algorithms, including those that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach only needs an array of size O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.
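The hybrid structure described in the abstract can be illustrated with a minimal scalar sketch. This is not the paper's SVE implementation: it uses no vector intrinsics or predicates, and the names `hybrid_sort`, `bitonic_small`, and `kSmall`, as well as the threshold of 16 elements and the three-way `std::partition` step, are assumptions made for illustration. It only shows the overall shape: partition large ranges around a pivot, finish small ranges with a bitonic network, and keep pending ranges on an explicit stack that stays at O(log N) entries by always deferring the larger side.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical threshold below which the bitonic network is used.
constexpr std::size_t kSmall = 16;
static_assert((kSmall & (kSmall - 1)) == 0, "bitonic network needs a power of two");

// Sort n <= kSmall values with a fixed-size bitonic network.
// Unused slots are padded with the maximum int so they sink to the end.
inline void bitonic_small(int* data, std::size_t n) {
    assert(n <= kSmall);
    std::array<int, kSmall> buf;
    buf.fill(std::numeric_limits<int>::max());
    std::copy(data, data + n, buf.begin());
    for (std::size_t k = 2; k <= kSmall; k *= 2) {        // size of merged blocks
        for (std::size_t j = k / 2; j > 0; j /= 2) {      // compare distance
            for (std::size_t i = 0; i < kSmall; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((buf[i] > buf[partner]) == ascending) {
                        std::swap(buf[i], buf[partner]);
                    }
                }
            }
        }
    }
    std::copy(buf.begin(), buf.begin() + static_cast<std::ptrdiff_t>(n), data);
}

// Quicksort-style partitioning with an explicit stack of pending ranges.
inline void hybrid_sort(std::vector<int>& v) {
    using Range = std::pair<std::size_t, std::size_t>;    // half-open [lo, hi)
    std::vector<Range> pending;
    pending.emplace_back(0, v.size());
    while (!pending.empty()) {
        std::size_t lo = pending.back().first;
        std::size_t hi = pending.back().second;
        pending.pop_back();
        while (hi - lo > kSmall) {
            const int pivot = v[lo + (hi - lo) / 2];
            const auto first = v.begin();
            const auto b = first + static_cast<std::ptrdiff_t>(lo);
            const auto e = first + static_cast<std::ptrdiff_t>(hi);
            // Three-way split: [< pivot | == pivot | > pivot].
            const auto m1 = std::partition(b, e, [pivot](int x) { return x < pivot; });
            const auto m2 = std::partition(m1, e, [pivot](int x) { return !(pivot < x); });
            const std::size_t left_end = static_cast<std::size_t>(m1 - first);
            const std::size_t right_begin = static_cast<std::size_t>(m2 - first);
            // Defer the larger side, continue with the smaller one.
            if (left_end - lo < hi - right_begin) {
                pending.emplace_back(right_begin, hi);
                hi = left_end;
            } else {
                pending.emplace_back(lo, left_end);
                lo = right_begin;
            }
        }
        bitonic_small(v.data() + lo, hi - lo);
    }
}
```

Calling `hybrid_sort(values)` on a `std::vector<int>` sorts it in ascending order. In a vectorized version, the compare-and-swap steps of the network and the partitioning loop are the parts that would be expressed with SIMD instructions. The sketch leaves that out so it can run on any platform.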

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

peerj-cs-769.pdf (521.46 KB)

Origin: Publication funded by an institution

https://inria.hal.science/hal-03227631

Submitted on: Friday, November 19, 2021, 13:33:05

Last modified on: Thursday, April 11, 2024, 13:08:14

Dates and versions

hal-03227631, version 1 (17-05-2021)

hal-03227631, version 2 (19-11-2021)

Identifiers

HAL Id: hal-03227631, version 2
DOI: 10.7717/peerj-cs.769

Cite

Bérenger Bramas. A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE). PeerJ Computer Science, 2021, ⟨10.7717/peerj-cs.769⟩. ⟨hal-03227631v2⟩

Collections

INSERM UNIV-RENNES1 CNRS INRIA ENGEES IRISA INSA-STRASBOURG INRIA2 INC-CNRS UR1-MATH-STIC UR1-UFR-ISTIC SITE-ALSACE UNIV-RENNES INSA-GROUPE UR1-MATH-NUM

223 views

1127 downloads
