Impact of over-decomposition on coordinated checkpoint/rollback protocol

Xavier Besseron; Thierry Gautier

doi:10.1007/978-3-642-29740-3_36

Conference paper, Year: 2011


Xavier Besseron

Role: Author

PrograMming and scheduling design fOr Applications in Interactive Simulation

Ohio State University [Columbus]

Thierry Gautier

Role: Author

PrograMming and scheduling design fOr Applications in Interactive Simulation

Abstract

Failure-free execution will become rare on future exascale computers, so fault tolerance is now an active field of research. In this paper, we study how decomposing an application into much more parallelism than the physical parallelism affects the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure without the need for spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant factors of over-decomposition. With over-decomposition, restarting execution on the remaining nodes after failures shows very good performance compared to the classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42 %. We also consider a partial restart protocol that reduces the amount of lost work in case of failure by tracking the task dependencies inside processes. In some cases, and thanks to over-decomposition, this partial restart time can represent only 54 % of the global restart time.
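The load-balancing argument in the abstract can be illustrated with a toy model. The sketch below is not the authors' protocol or experimental setup. It assumes equal-cost, indivisible tasks, no communication or migration cost, and a hypothetical function name `time_after_failure`, and it only shows why finer-grained tasks spread more evenly over the surviving nodes.

```python
def time_after_failure(nodes: int, failed: int, factor: int) -> float:
    """Idealized time per step on the surviving nodes, in units of the
    failure-free time per step.

    Toy assumptions (not taken from the paper): the work is split into
    nodes * factor equal tasks, tasks cannot be split further, and the
    busiest surviving node determines the step time.
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    tasks = nodes * factor
    # Integer ceiling: number of tasks on the busiest surviving node.
    max_tasks_per_node = -(-tasks // survivors)
    # Before the failure each node ran `factor` tasks in one time unit.
    return max_tasks_per_node / factor


if __name__ == "__main__":
    for factor in (1, 2, 10, 100):
        slowdown = time_after_failure(nodes=100, failed=1, factor=factor)
        print(f"over-decomposition factor {factor:>3}: {slowdown:.2f}x")
```

In this toy model, losing one node out of 100 doubles the step time when there is one task per node, whereas a factor of 100 brings the slowdown close to the ideal 100/99. Real gains depend on task costs, dependencies, and runtime overhead, which the model ignores.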

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Contributor: Grégory Mounié

https://inria.hal.science/hal-00796863

Submitted on: Tuesday, March 5, 2013, 11:06:19

Last modified on: Thursday, April 4, 2024, 21:15:31

Dates and versions

hal-00796863 , version 1 (05-03-2013)

Identifiers

HAL Id : hal-00796863 , version 1
DOI : 10.1007/978-3-642-29740-3_36

Cite

Xavier Besseron, Thierry Gautier. Impact of over-decomposition on coordinated checkpoint/rollback protocol. Workshop on Resiliency in High-Performance Computing, 17-th International European Conference On Parallel and Distributed Computing, 2011, Bordeaux, France. ⟨10.1007/978-3-642-29740-3_36⟩. ⟨hal-00796863⟩


Collections

UGA CNRS INRIA LIG LIG_SRCPR LIG_SRCPR_MOAIS INRIA2 LIG_SIDCH
