Improving load/store queues usage in scientific computing

Abstract: Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial for increasing instruction-level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address-bit comparators; thus, modern microprocessors implement simplified, imprecise mechanisms that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors, such as the Alpha 21264, Power 4 and Itanium 2. We demonstrate that, despite all the advanced features of these processors, memory address disambiguation mechanisms can cause significant performance loss: even if data are located in low cache levels and enough ILP exists, execution can be up to 21 times slower if no care is taken over the order in which independent memory addresses are accessed. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al. (1998), S. Sethumadhavan et al. (2003), I. Park et al. (2003), A. Yoaz et al. (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. This solution is based on classical (and robust) ld/st vectorization. Our experiments highlight the effectiveness of this method on BLAS 1 codes, which are representative of vector scientific loops.
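As a rough illustration of the idea of grouping memory operations, the sketch below rewrites a BLAS 1 style daxpy loop so that, within each block of four elements, all loads are issued before any store. This is not the authors' actual transformation, which is not reproduced on this page: the function names, the block size of 4, and the use of `restrict` (which asserts that `x` and `y` do not overlap) are assumptions made for the example. An optimizing compiler may also reorder these accesses, so the source order is only a hint of the resulting instruction order.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline daxpy: y[i] += a * x[i].
 * In program order, loads and stores alternate on every iteration. */
void daxpy_interleaved(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hypothetical "grouped" variant: for each block of 4 elements, perform
 * all loads first, then all stores. The block size is an assumption. */
void daxpy_grouped(size_t n, double a,
                   const double *restrict x, double *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Load phase: eight independent loads, no store in between. */
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];

        /* Store phase: four stores, no load in between. */
        y[i]     = y0 + a * x0;
        y[i + 1] = y1 + a * x1;
        y[i + 2] = y2 + a * x2;
        y[i + 3] = y3 + a * x3;
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Both functions compute the same result; only the order in which the loads and stores appear differs, and that ordering is what the abstract identifies as performance-critical.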
Document type:
Conference paper
International Conference on Parallel Processing (ICPP 2004), Aug 2004, Montréal, Canada. IEEE, pp.38-45, 2004. 〈10.1109/ICPP.2004.1327902〉
Contributor: Sid Touati
Submitted on: Monday, October 31, 2011 - 15:14:33
Last modified on: Thursday, January 11, 2018 - 06:21:30
Document(s) archived on: Wednesday, February 1, 2012 - 02:22:36






Christophe Lemuet, William Jalby, Sid Touati. Improving load/store queues usage in scientific computing. International Conference on Parallel Processing (ICPP 2004), Aug 2004, Montréal, Canada. IEEE, pp.38-45, 2004. 〈10.1109/ICPP.2004.1327902〉. 〈inria-00637256〉


