Transparent Message-Passing Parallel Applications Checkpointing in Kerrighed
Résumé
Nowadays, clusters are widely used to execute scientific applications. These applications are often message-passing parallel applications with long execution time. Since the number of nodes in clusters is growing, the probability of a node failure during the execution of an application increases and the application execution time may be greater than the cluster mean time between failures (MTBF). To avoid restarting application from the beginning, some fault tolerant mechanisms such as checkpoint/restart are needed. Currently, checkpoint/restart mechanisms are either implemented directly in the application source code by applications programmers or are integrated in communication environments such as MPI or PVM. We propose in this paper a new approach in which checkpoint/restart mechanisms for parallel applications are implemented in a cluster single system image operating system. While this kernel level approach is more complex to implement than other approaches, it is more general because it does not require any modification, compilation or relinking of the applications whatever the communication environment they rely on. Our approach has been implemented in single system image operating system based on. Performance results are presented in this paper.