Communication Avoiding Gaussian Elimination - Inria - Institut national de recherche en sciences et technologies du numérique Accéder directement au contenu
Rapport (Rapport De Recherche) Année : 2008

Communication Avoiding Gaussian Elimination

Laura Grigori
James W. Demmel
  • Fonction : Auteur
  • PersonId : 849003
H. Xiang
  • Fonction : Auteur

Résumé

This paper presents CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional (2D) cyclic layout. The algorithm is based on a new pivoting strategy, referred to as ca-pivoting, that is shown to be stable in practice. The ca-pivoting strategy leads to a significant decrease in the number of messages exchanged during the factorization of a block-column relatively to conventional algorithms, and thus CALU overcomes the latency bottleneck of the LU factorization as in current implementations like ScaLAPACK and HPL. The experimental part of this paper focuses on the evaluation of the performance of CALU on two computational systems, an IBM POWER 5 system with 888 compute processors distributed among 111 compute nodes, and a Cray XT4 system with 9660 dual-core AMD Opteron processors. We compare CALU with ScaLAPACK PDGETRF routine that computes the LU factorization. Our experiments show that CALU leads to a reduction in the parallel time of the LU factorization. The gain depends on the size of the matrices and on the characteristics of the computer architecture. In particular the effect is found to be significant in the cases when the latency time is an important factor of the overall time, as for example when a small matrix is executed on large number of processors. The factorization of a block-column, referred to as TSLU, reaches a performance of 215 GFLOPs/s on 64 processors of the IBM POWER 5 system, and a performance of 240 GFLOPs/s on 64 processors of the Cray XT4 system. It represents 44% and 36% of the theoretical peak performances on these systems. TSLU outperforms the corresponding routine PDGETF2 from ScaLAPACK up to a factor of 4.37 on the IBM POWER 5 system and up to a factor of 5.58 on the Cray XT4 system. On square matrices of order 10000, CALU outperforms PDGETRF by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4. It represents 40% and 23% of the peak performance on these systems. The best improvement obtained by CALU is a speedup of 2.29 on IBM POWER 5 and a speedup of 1.81 on Cray XT4.
Fichier principal
Vignette du fichier
RR-6523.pdf (345.44 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)

Dates et versions

inria-00277901 , version 1 (07-05-2008)
inria-00277901 , version 2 (13-05-2008)

Identifiants

  • HAL Id : inria-00277901 , version 2

Citer

Laura Grigori, James W. Demmel, H. Xiang. Communication Avoiding Gaussian Elimination. [Research Report] RR-6523, INRIA. 2008. ⟨inria-00277901v2⟩
278 Consultations
429 Téléchargements

Partager

Gmail Facebook X LinkedIn More