hal-00716478, version 1

High performance checksum computation for fault-tolerant MPI over InfiniBand

Alexandre Denis (preferred contact author, http://runtime.bordeaux.inria.fr/adenis/) 1, 2; François Trahay (http://www-public.it-sudparis.eu/~trahay_f/) a, 3; Yutaka Ishikawa b, 4

The 19th European MPI Users' Group Meeting (EuroMPI 2012), volume 7490 (2012)

Abstract: As the number of nodes in clusters grows, so does the probability of failures and unusual events. In this paper, we present checksum mechanisms to detect data corruption. We study the impact of checksums on network communication performance and propose a mechanism to amortize their cost on InfiniBand. We have implemented our mechanisms in the NEWMADELEINE communication library. Our evaluation shows that our message-integrity mechanisms do not noticeably affect application performance, which is an improvement over state-of-the-art MPI implementations.
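
To illustrate the general idea of checksum-based corruption detection mentioned in the abstract, here is a minimal sketch in Python. It is not the paper's mechanism and not the NEWMADELEINE API: the choice of CRC-32, the 4-byte trailer layout, and the names `pack_message`, `unpack_message` and `flip_bit` are all assumptions made for illustration. The sender appends a checksum to the payload, and the receiver recomputes it and rejects the message on mismatch.

```python
import struct
import zlib

TRAILER_SIZE = 4  # assumed layout: 4-byte CRC-32 appended after the payload


def pack_message(payload: bytes) -> bytes:
    """Sender side (hypothetical helper): append a CRC-32 trailer to the payload."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("!I", checksum)


def unpack_message(frame: bytes) -> bytes:
    """Receiver side (hypothetical helper): verify the trailer, return the payload.

    Raises ValueError if the frame is too short or the checksum does not match.
    """
    if len(frame) < TRAILER_SIZE:
        raise ValueError("frame too short to hold a checksum")
    payload, trailer = frame[:-TRAILER_SIZE], frame[-TRAILER_SIZE:]
    (expected,) = struct.unpack("!I", trailer)
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("checksum mismatch: data corrupted in transit")
    return payload


def flip_bit(frame: bytes, index: int) -> bytes:
    """Simulate corruption by flipping the lowest bit of one byte."""
    corrupted = bytearray(frame)
    corrupted[index] ^= 0x01
    return bytes(corrupted)


if __name__ == "__main__":
    frame = pack_message(b"hello MPI")
    print(unpack_message(frame))  # intact frame: payload is returned
    try:
        unpack_message(flip_bit(frame, 0))
    except ValueError as err:
        print("rejected:", err)
```

In this sketch the checksum costs one extra pass over the payload on each side; the abstract's point is that such a cost can be amortized on InfiniBand, which this example does not attempt to show.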

  • a –  Télécom & Management SudParis
  • b –  University of Tokyo
  • 1:  RUNTIME (INRIA Bordeaux - Sud-Ouest)
  • INRIA – CNRS : UMR5800 – Université Sciences et Technologies - Bordeaux I – École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)
  • 2:  Laboratoire Bordelais de Recherche en Informatique (LaBRI)
  • CNRS : UMR5800 – Université Sciences et Technologies - Bordeaux I – École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB) – Université Victor Segalen - Bordeaux II
  • 3:  Département Informatique (INF)
  • Institut Mines-Télécom – Télécom SudParis
  • 4:  Computer Science Department (CST)
  • University of Tokyo
  • Collaboration : Grid'5000
  • Domain : Computer Science/Networking and Telecommunication
 
  • oai:hal.inria.fr:hal-00716478
  • Submitted on: Tuesday, 10 July 2012 16:09:23
  • Updated on: Thursday, 6 September 2012 11:58:23