Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal

Bogdan Nicolae 1, *
* Corresponding author
1 Exascale Systems
DRL - IBM Research Ireland
Abstract: With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of long-running, tightly coupled applications, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes a novel collective memory contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes (regardless of whether they run on the same node or on different nodes). We show both how to achieve such a global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in such a way that it minimizes I/O bottlenecks. Large-scale experiments show a significant reduction of storage space consumption and performance overhead compared to several state-of-the-art approaches, both in synthetic benchmarks and for a real-life high performance computing application.
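As a rough illustration of the idea summarized in the abstract, the sketch below deduplicates fixed-size memory pages across several process images by content fingerprint, so that each distinct page is stored once. It is not the scheme proposed in the paper: the page size, the use of SHA-1, the function names (split_pages, deduplicate, restore) and the in-memory dictionary standing in for storage are all assumptions made for this example, and the scalable collective protocol and I/O-aware data layout are not modeled.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical, not from the paper)


def split_pages(memory, page_size=PAGE_SIZE):
    """Split a memory image into fixed-size pages (the last one may be shorter)."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


def deduplicate(process_images):
    """Deduplicate pages across all processes.

    Returns (store, layouts): store maps a page fingerprint to the single
    saved copy of that page; layouts maps each process rank to the ordered
    list of fingerprints needed to rebuild its memory image.
    """
    store = {}
    layouts = {}
    for rank, image in process_images.items():
        layout = []
        for page in split_pages(image):
            fingerprint = hashlib.sha1(page).hexdigest()
            store.setdefault(fingerprint, page)  # keep only the first copy
            layout.append(fingerprint)
        layouts[rank] = layout
    return store, layouts


def restore(store, layout):
    """Rebuild one process image from the shared page store."""
    return b"".join(store[fingerprint] for fingerprint in layout)


# Two hypothetical processes that share one identical page.
shared = b"\x00" * PAGE_SIZE
images = {0: shared + b"A" * PAGE_SIZE, 1: shared + b"B" * PAGE_SIZE}
store, layouts = deduplicate(images)
```

In this toy setup the two images hold four pages in total, but only three distinct pages end up in the store, and each image can still be rebuilt from its own fingerprint list.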
Document type:
Conference paper
IPDPS '13: The 27th IEEE International Parallel and Distributed Processing Symposium, May 2013, Boston, United States. pp.19-28, 2013, 〈10.1109/IPDPS.2013.14〉


https://hal.inria.fr/hal-00781532
Contributor: Bogdan Nicolae
Submitted on: Sunday, June 2, 2013 - 02:11:34
Last modified on: Monday, April 4, 2016 - 09:49:56
Document(s) archived on: Tuesday, September 3, 2013 - 10:40:34

File

paper.pdf
Files produced by the author(s)

Citation

Bogdan Nicolae. Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal. IPDPS '13: The 27th IEEE International Parallel and Distributed Processing Symposium, May 2013, Boston, United States. pp.19-28, 2013, 〈10.1109/IPDPS.2013.14〉. 〈hal-00781532v2〉

Metrics

Record views

263

File downloads

156