AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing

Bogdan Nicolae 1, * Franck Cappello 1, 2, 3
* Corresponding author
3 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, making a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead that checkpointing imposes on the application, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we hypothesize that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends to optimize the order in which memory pages are flushed to stable storage. Large-scale experiments show up to 60% improvement compared to state-of-the-art checkpointing approaches, achievable with an extra memory requirement of less than 5% of the total application memory.
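The ordering idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm or code: the function name `flush_order`, its parameters, and the specific priority rule (pages the application is writing right now first, then pages ranked by when they were first written in the previous iteration, then pages with no history) are all assumptions made for illustration.

```python
def flush_order(dirty_pages, past_first_write_rank, pending_writes=()):
    """Return dirty page ids in the order a background flusher might write them.

    All names here are hypothetical. The rule is an assumed simplification:
      1. pages the application is trying to write right now (current pattern),
      2. pages in the order they were first written during the previous
         checkpoint interval (past pattern, 0 = written earliest),
      3. pages with no recorded history.
    """
    pending = set(pending_writes)
    no_history = len(past_first_write_rank)  # sorts after every known rank

    def priority(page):
        return (
            0 if page in pending else 1,                   # current accesses first
            past_first_write_rank.get(page, no_history),   # then past first-write order
            page,                                          # deterministic tie-break
        )

    return sorted(set(dirty_pages), key=priority)


# Example: page 4 is being written now; pages 3, 1, 5 were first written
# in that order during the previous iteration; page 2 has no history.
order = flush_order([1, 2, 3, 4, 5], {3: 0, 1: 1, 5: 2}, pending_writes=[4])
```

Flushing pages in this order aims to have each page safely on stable storage before the application's first write to it, so that the write does not have to wait for the checkpoint.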
Document type:
Conference paper
HPDC '13: 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing, Jun 2013, New York, United States. pp.155-166, 2013, 〈10.1145/2462902.2462918〉

Cited literature [27 references]

https://hal.inria.fr/hal-00809847
Contributor: Bogdan Nicolae
Submitted on: Wednesday, April 10, 2013 - 00:57:11
Last modified on: Thursday, April 5, 2018 - 12:30:12
Document(s) archived on: Thursday, July 11, 2013 - 04:11:09

File

paper.pdf
Files produced by the author(s)

Citation

Bogdan Nicolae, Franck Cappello. AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing. HPDC '13: 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing, Jun 2013, New York, United States. pp.155-166, 2013, 〈10.1145/2462902.2462918〉. 〈hal-00809847〉
