Conference papers

Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

Pierre Riteau (1), Adrien Lebre (2, 3), Christine Morin (1)
(1) PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
(2) ASCOLA - Aspect and composition languages
LINA - Laboratoire d'Informatique de Nantes Atlantique, Département informatique - EMN, Inria Rennes – Bretagne Atlantique
Abstract : Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate, which makes fault tolerance mechanisms such as process checkpoint/restart a required technology for exploiting clusters effectively. Most process checkpoint/restart implementations handle only volatile state and ignore the persistent state of applications, which can lead to inconsistent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be integrated with a large number of file systems. To avoid the performance issues of stable storage that relies on synchronous replication mechanisms, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. Initial evaluations of our implementation in the kDFS distributed file system show that our proposal has a negligible performance impact.

Contributor : Adrien Lebre
Submitted on : Friday, October 16, 2009 - 11:51:43 AM
Last modification on : Wednesday, April 27, 2022 - 3:47:23 AM




Pierre Riteau, Adrien Lebre, Christine Morin. Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems. 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09), May 2009, Shanghai, China. ⟨10.1109/CCGRID.2009.29⟩. ⟨inria-00424542⟩


