An 83,000-processor supercomputer can only match 1% of your brain 2013, available at http://gizmodo.com/ an-83-000-processor-supercomputer-only-matched-one-perc-1045026757 ,
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015. ,
DOI : 10.1016/j.future.2015.04.003
URL : https://hal.archives-ouvertes.fr/hal-01199752
Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC '12, 2012. ,
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, pp.403-421, 2015. ,
DOI : 10.1177/1094342014532297
A Scalable Asynchronous Replicationbased Strategy for Fault Tolerant MPI Applications, Proc. HiPC'07, pp.257-268, 2007. ,
Replication-Based Fault Tolerance for MPI Applications, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.7, pp.997-1010, 2009. ,
DOI : 10.1109/TPDS.2008.172
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014. ,
DOI : 10.1109/IPDPS.2014.122
Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014. ,
DOI : 10.1109/SC.2014.79
Low-overhead diskless checkpoint for hybrid computing systems, 2010 International Conference on High Performance Computing, 2010. ,
DOI : 10.1109/HIPC.2010.5713163
Distributed Diskless Checkpoint for Large Scale Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010. ,
DOI : 10.1109/CCGRID.2010.40
Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads, PLoS ONE, vol.54, issue.8, 2014. ,
DOI : 10.1371/journal.pone.0104591.t002
Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proc. SC'10, 2010. ,
DOI : 10.2172/984082
FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216
The design and implementation of a multi-level content-addressable checkpoint file system, 2012 19th International Conference on High Performance Computing, 2012. ,
DOI : 10.1109/HiPC.2012.6507514
A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012. ,
DOI : 10.1109/DSNW.2012.6264677
Fast Error-Bounded Lossy HPC Data Compression with SZ, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. ,
DOI : 10.1109/IPDPS.2016.11
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, Scientific Programming, pp.173-184, 1996. ,
DOI : 10.1155/1996/483083
Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, 2016. ,
DOI : 10.1109/TPDS.2016.2546248
URL : https://hal.archives-ouvertes.fr/hal-01263879
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006. ,
DOI : 10.1016/j.future.2004.11.016
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, 2010. ,
DOI : 10.1109/ICPP.2010.80
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
Adaptive optimal checkpoint interval and its impact on system's overall quality in soft real-time applications, Proceedings of the 2009 ACM symposium on Applied Computing, SAC '09, 2009. ,
DOI : 10.1145/1529282.1529506
An optimal checkpoint/restart model for a large scale high performance computing system, Proc. IPDPS. IEEE, 2008. ,
A case for two-level distributed recovery schemes, Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '95/PERFORMANCE '95, pp.64-73, 1995. ,
Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013. ,
DOI : 10.1109/TC.2012.17