HydEE: Towards a Hierarchical Rollback-Recovery Protocol for Exascale Machines. On Exploiting Send-Determinism in Rollback-Recovery Protocols

Abstract : The move towards exascale supercomputers requires new fault tolerance solutions. Existing rollback-recovery protocols are not suited to parallel message-passing applications at this scale. To deal with very large applications and high failure rates, a protocol should confine the consequences of a failure to a small subset of the processes, while providing good failure-free performance and logging a limited amount of data, especially in memory. To meet these needs, we propose HydEE, a hierarchical rollback-recovery protocol that combines coordinated checkpointing and message logging. HydEE leverages the send-determinism of scientific parallel applications to tolerate multiple failures without relying on stable storage. Our experiments show that, for most applications, saving less than 15% of the message payload in memory is enough to limit the rollbacks after a failure to less than 15% of the processes.
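
The trade-off described in the abstract (how much message payload is logged versus how many processes roll back) can be illustrated with a small sketch. This is not the HydEE implementation. It assumes that processes are partitioned into clusters, that a cluster checkpoints and rolls back as a unit, and that only messages crossing cluster boundaries are logged in memory. The partition, the traffic figures, and the function names `logged_fraction` and `rollback_fraction` are hypothetical and chosen only for illustration.

```python
def logged_fraction(traffic, cluster_of):
    """Fraction of payload bytes that cross a cluster boundary.

    traffic: dict mapping (sender, receiver) -> bytes sent (hypothetical input)
    cluster_of: dict mapping process rank -> cluster id (assumed partition)
    """
    total = sum(traffic.values())
    logged = sum(
        nbytes
        for (src, dst), nbytes in traffic.items()
        if cluster_of[src] != cluster_of[dst]  # inter-cluster message: logged
    )
    return logged / total if total else 0.0


def rollback_fraction(failed, cluster_of):
    """Fraction of processes restarted, assuming only the cluster of the
    failed process rolls back."""
    failed_cluster = cluster_of[failed]
    members = [p for p, c in cluster_of.items() if c == failed_cluster]
    return len(members) / len(cluster_of)


# Toy input: 8 processes, 4 clusters of 2 processes each.
cluster_of = {p: p // 2 for p in range(8)}

traffic = {}
for p in range(8):
    traffic[(p, p ^ 1)] = 900          # heavy traffic to the cluster partner
    traffic[(p, (p + 2) % 8)] = 100    # light traffic to another cluster

print(logged_fraction(traffic, cluster_of))  # share of payload to log
print(rollback_fraction(3, cluster_of))      # share of processes restarted
```

The figures this toy input produces are artifacts of the invented traffic pattern and say nothing about the measurements reported in the paper. The sketch only shows why a partition that keeps most communication inside clusters reduces both quantities at once.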
Document type :
Journal articles

https://hal.inria.fr/hal-01952884
Contributor : Amina Guermouche
Submitted on : Wednesday, December 12, 2018 - 3:44:32 PM
Last modification on : Wednesday, June 12, 2019 - 1:34:38 AM
Long-term archiving on : Wednesday, March 13, 2019 - 2:30:44 PM

Identifiers

  • HAL Id : hal-01952884, version 1

Citation

Thomas Ropars, Amina Guermouche, Franck Cappello. HydEE : Vers un protocole de recouvrement arrière hiérarchique pour les machines exascales De l'exploitation du déterminisme des émissions dans les protocoles de recouvrement arrière. Techniques et sciences informatiques, 2012, 31 (8-10), pp.1049-1078. ⟨hal-01952884⟩
