Smooth and Efficient Integration of High-Availability in a Parallel Single Level Store System
Abstract
A parallel single level store (PSLS) system integrates a shared virtual memory and a parallel file system, providing programmers with a global address space that includes both memory and file data. Implemented in a cluster, a PSLS system is therefore an attractive platform for long-running parallel applications, combining the natural shared-memory programming model with a large and efficient file system. However, the need to tolerate failures in such a system increases with the size of the applications. In this paper, we present the smooth integration of high-availability support based on backward error recovery into a parallel single level store system. Our system tolerates multiple transient failures, a single permanent failure, and power failures affecting the whole cluster, without requiring any specific hardware. To this end, our highly available PSLS system relies on a high degree of integration (and reuse) between the high-availability support and the standard support. We focus on the management of the parallel file system at checkpointing and recovery time, and especially on mirror management. A prototype integrating our high-availability support has been implemented, and we present performance results in the paper.