Autotuning Convolutions is Easier Than You Think

Nicolas Tollenaere; Guillaume Iooss; Stéphane Pouget; Hugo Brunie; Christophe Guillon; Albert Cohen; P. Sadayappan; Fabrice Rastello

doi:10.1145/3570641

Journal article, ACM Transactions on Architecture and Code Optimization, Year: 2022


Nicolas Tollenaere

Role: Author

Compiler Optimization and Run-time Systems

Guillaume Iooss

Role: Author

Compiler Optimization and Run-time Systems

Stéphane Pouget

Role: Author

University of California [Los Angeles]

Hugo Brunie

Role: Author

Compiler Optimization and Run-time Systems

Christophe Guillon

Role: Author
PersonId: 174697
IdHAL: christophe-guillon
ORCID: 0000-0001-6378-3303
IdRef: 177999357

Compiler Optimization and Run-time Systems

Albert Cohen

Role: Author

Google France

P. Sadayappan

Role: Author

University of Utah

Fabrice Rastello

Role: Author
PersonId: 2883
IdHAL: fabrice-rastello
IdRef: 149155727

Compiler Optimization and Run-time Systems

Abstract

A wide range of scientific and machine learning applications depend on highly optimized implementations of tensor computations. Exploiting the full capacity of a given processor architecture remains a challenging task, due to the complexity of the microarchitectural features that come into play when seeking near-peak performance. Among the state-of-the-art loop transformation techniques for performance optimization, AutoScheduler [34] tends to outperform other systems. It often yields higher performance than vendor libraries, but requires a large number of runs to converge and involves a complex training environment. In this paper, we define a structured configuration space that enables much faster convergence to high-performance code versions, using only random sampling of candidates. We focus on two-dimensional convolutions on CPUs. Compared to state-of-the-art libraries, our structured search space enables higher performance for typical tensor shapes encountered in convolution stages in deep learning pipelines. Compared to auto-tuning code generators like AutoScheduler, it prunes the search space while increasing the density of efficient implementations. We analyze the impact on convergence speed and performance distribution, on two Intel x86 processors and one ARM AArch64 processor. We match or outperform the performance of the state-of-the-art oneDNN library and TVM’s AutoScheduler, while reducing the autotuning effort by at least an order of magnitude.
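The idea of drawing random candidates from a structured configuration space can be illustrated with a small sketch. This is not the authors' code generator or their actual search space, which the abstract does not describe. It assumes, purely for illustration, that a candidate is a set of tile sizes for the output dimensions of a stride-1, unpadded 2D convolution, and that the only structural constraint is that each tile size divides its dimension exactly. All names (`divisors`, `sample_config`, `conv2d_tiled`, `random_search`) are hypothetical.

```python
import random
import time


def divisors(n):
    """Return all positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def sample_config(extents, rng):
    """Draw one candidate: a tile size per dimension that divides it exactly.

    The divisor constraint is an assumed stand-in for a "structured" space:
    it rules out partial tiles and keeps the number of candidates small.
    """
    return {name: rng.choice(divisors(size)) for name, size in extents.items()}


def conv2d_tiled(x, w, cfg):
    """Direct 2D convolution (stride 1, no padding) with tiled output loops.

    x is H x W x C and w is R x S x C x K, both as nested lists.
    cfg maps "P", "Q", "K" to tile sizes for output rows, columns, channels.
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    R, S, K = len(w), len(w[0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1
    out = [[[0.0] * K for _ in range(Q)] for _ in range(P)]
    tp, tq, tk = cfg["P"], cfg["Q"], cfg["K"]
    for p0 in range(0, P, tp):
        for q0 in range(0, Q, tq):
            for k0 in range(0, K, tk):
                # Inner loops visit one tile of the output.
                for p in range(p0, p0 + tp):
                    for q in range(q0, q0 + tq):
                        for k in range(k0, k0 + tk):
                            acc = 0.0
                            for r in range(R):
                                for s in range(S):
                                    for c in range(C):
                                        acc += x[p + r][q + s][c] * w[r][s][c][k]
                            out[p][q][k] = acc
    return out


def random_search(x, w, n_samples=8, seed=0):
    """Time n_samples random candidates and return the fastest one."""
    rng = random.Random(seed)
    extents = {
        "P": len(x) - len(w) + 1,
        "Q": len(x[0]) - len(w[0]) + 1,
        "K": len(w[0][0][0]),
    }
    best_cfg, best_time = None, float("inf")
    for _ in range(n_samples):
        cfg = sample_config(extents, rng)
        start = time.perf_counter()
        conv2d_tiled(x, w, cfg)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time
```

In pure Python the timing differences between tile sizes are dominated by interpreter overhead, so the sketch only shows the shape of the search loop (sample, measure, keep the best), not realistic performance behavior.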

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Contributor: Guillaume Iooss

https://inria.hal.science/hal-03844272

Submitted on: Tuesday, November 8, 2022, 16:07:21

Last modified on: Thursday, April 4, 2024, 21:38:55

Dates and versions

hal-03844272, version 1 (08-11-2022)

License

Attribution

Identifiers

HAL Id: hal-03844272, version 1
DOI: 10.1145/3570641

Cite

Nicolas Tollenaere, Guillaume Iooss, Stéphane Pouget, Hugo Brunie, Christophe Guillon, et al. Autotuning Convolutions is Easier Than You Think. ACM Transactions on Architecture and Code Optimization, 2022, pp.1-23. ⟨10.1145/3570641⟩. ⟨hal-03844272⟩


Collections

UGA CNRS INRIA LIG LIG_SRCPR INRIA2 LIG-SRCPR-CORSE MIAI ANR LIG_SIDCH
