Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale, Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS '17, 2017.
DOI : 10.1147/rd.401.0003
URL : https://hal.archives-ouvertes.fr/hal-01494678
Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.39
URL : https://hal.archives-ouvertes.fr/hal-01354886
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
DOI : 10.1109/SNAPI.2010.10
URL : https://hal.archives-ouvertes.fr/hal-00908447
Qualitative Performance Analysis for Large-Scale Scientific Workflows, 2008.
Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.23, issue.4, p.2014
DOI : 10.1515/9781400882618-003
URL : http://institute.lanl.gov/resilience/docs/Toward%20Exascale%20Resilience.pdf
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015.
DOI : 10.1016/j.future.2015.04.003
URL : https://hal.archives-ouvertes.fr/hal-01199752
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456
Programming Models and Development Software for a Space-Based Many-Core Processor, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology, pp.95-102, 2011.
DOI : 10.1109/SMC-IT.2011.29
Scientific Workflows and Provenance: Introduction and Research Opportunities, Datenbank-Spektrum, vol.54, issue.4, pp.193-203, 2012.
DOI : 10.1007/978-3-642-17819-1_23
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016
Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, 2016.
DOI : 10.1109/TPDS.2016.2546248
URL : https://hal.archives-ouvertes.fr/hal-01353871
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56
URL : http://moss.csc.ncsu.edu/%7Emueller/ftp/pub/mueller/papers/icdcs12.pdf
Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525
The case for modular redundancy in large-scale high performance computing systems, PDCN. IASTED, 2009.
Redundant Execution of HPC Applications with MR-MPI, Parallel and Distributed Computing and Networks / 720: Software Engineering, 2011.
DOI : 10.2316/P.2011.719-031
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443
ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.
DOI : 10.1016/j.procs.2012.04.018
Fault Tolerance and Recovery of Scientific Workflows on Computational Grids, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp.777-782, 2008.
DOI : 10.1109/CCGRID.2008.79
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133
DOI : 10.1007/978-3-540-30218-6_19
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200
Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs, VLSI Design, vol.1800, issue.3, 2013.
DOI : 10.1109/12.773794
ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266
Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527
URL : http://www.cs.utk.edu/~plank/papers/CS-97-380.ps.Z
Supporting highly-decoupled thread-level redundancy for parallel programs, 14th Int. Conf. on High-Performance Computer Architecture (HPCA), pp.393-404, 2008.
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
DOI : 10.1088/1742-6596/78/1/012022
Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, issue.6, pp.198-202, 1998.
DOI : 10.1049/ip-sen:19982440
Does partial replication pay off?, FTXS, 2012.
Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015.
Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp.452-457, 2017.
DOI : 10.1109/CCGRID.2017.40
Workflow Systems for Science: Concepts and Tools, ISRN Software Engineering, vol.37, issue.1, 2013.
DOI : 10.1155/2013/404525
URL : https://doi.org/10.1155/2013/404525
On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039
A case for two-level distributed recovery schemes, ACM SIGMETRICS Performance Evaluation Review, vol.23, issue.1, pp.64-73, 1995.
DOI : 10.1145/223586.223596
Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115
Thread-level redundancy fault tolerant CMP based on relaxed input replication, ICCIT. IEEE, 2011.
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE Int. Conf. on Cluster Computing, pp.93-103, 2004.