Optimal checkpointing period with replicated execution on heterogeneous platforms, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082847
Toward exascale resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.
Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00668016
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.
Optimization of multi-level checkpoint model for large scale HPC applications, 2014.
Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel Distributed Systems, vol.28, issue.1, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01263879
The case for modular redundancy in large-scale high performance computing systems, 2009.
Evaluating the Viability of Process Replication Reliability for Exascale Systems, SC'11, 2011.
Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.
VolpexMPI: An MPI library for execution of parallel applications on volatile nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.
Shadow computing: An energy-aware fault tolerant computing model, Int. Conf. on Computing, Networking and Communications (ICNC), pp.73-77, 2014.
Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010.
A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Trans. Parallel Dist. Syst, vol.12, issue.4, pp.346-362, 2001.
Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
Addressing failures in exascale computing, Int. J. High Perform. Comput. Appl, vol.28, issue.2, pp.129-173, 2014.
Using Replication and Checkpointing for Reliable Task Management in Computational Grids, SC'10, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.
Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.