FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, pp.403-421, 2015. ,
DOI : 10.1177/1094342014532297
URL : http://arxiv.org/abs/1312.2674
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.1-11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015. ,
DOI : 10.1016/j.future.2015.04.003
URL : https://hal.archives-ouvertes.fr/hal-01199752
Adaptive optimal checkpoint interval and its impact on system's overall quality in soft real-time applications, Proceedings of the 2009 ACM symposium on Applied Computing, SAC '09, pp.1015-1020, 2009. ,
DOI : 10.1145/1529282.1529506
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006. ,
DOI : 10.1016/j.future.2004.11.016
Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.907-918, 2014. ,
DOI : 10.1109/SC.2014.79
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp.1181-1190, 2014. ,
DOI : 10.1109/IPDPS.2014.122
Fast error-bounded lossy hpc data compression with sz [12] A. Feinberg. An 83,000-processor supercomputer can only match 1% of your brain Available at http://gizmodo.com/ an-83-000-processor-supercomputer-only-matched-one, IEEE 30th International Parallel and Distributed Processing Symposium, 2013. ,
Detection and correction of silent data corruption for large-scale high-performance computing, Proceedings of Supercomputing, SC '12, pp.1-78, 2012. ,
Low-overhead diskless checkpoint for hybrid computing systems, 2010 International Conference on High Performance Computing, pp.1-10, 2010. ,
DOI : 10.1109/HIPC.2010.5713163
Distributed Diskless Checkpoint for Large Scale Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.63-72, 2010. ,
DOI : 10.1109/CCGRID.2010.40
Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013. ,
DOI : 10.1109/TC.2012.17
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010. ,
DOI : 10.1109/ICPP.2010.80
The design and implementation of a multi-level content-addressable checkpoint file system, 2012 19th International Conference on High Performance Computing, pp.1-10, 2012. ,
DOI : 10.1109/HiPC.2012.6507514
Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads, PLoS ONE, vol.54, issue.8, p.2014 ,
DOI : 10.1371/journal.pone.0104591.t002
An optimal checkpoint/restart model for a large scale high performance computing system, IEEE International Symposium on Parallel and Distributed Processing, 2008. ,
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010. ,
DOI : 10.1109/SC.2010.18
A case for two-level distributed recovery schemes, Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '95/PERFORMANCE '95, pp.64-73, 1995. ,
Replication-Based Fault Tolerance for MPI Applications, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.7, pp.997-1010, 2009. ,
DOI : 10.1109/TPDS.2008.172
A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications, Proceedings of the 14th International Conference on High Performance Computing (HiPC'07), pp.257-268, 2007. ,
DOI : 10.1007/978-3-540-77220-0_26
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012. ,
DOI : 10.1109/DSNW.2012.6264677