Statistics for Big Data For Dummies. For Dummies, 2015.
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313
Reducing Waste in Extreme Scale Systems through Introspective Analysis, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.212-221, 2016.
DOI : 10.1109/IPDPS.2016.100
FTI: High performance Fault Tolerance Interface for hybrid systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011.
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-01298430
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, vol.26, issue.17, pp.2772-2791, 2014.
DOI : 10.1109/SNAPI.2010.10
URL : https://hal.archives-ouvertes.fr/hal-00696154
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.7694
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016
Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, p.77, 2012.
DOI : 10.1109/SC.2012.57
Optimum checkpoints with age dependent failures, Acta Informatica, vol.27, issue.6, pp.519-531, 1990.
DOI : 10.1007/BF00277388
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.37-44, 2015.
DOI : 10.1109/DSN.2015.52
Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063444
Making cloud intermediate data fault-tolerant, Proceedings of the 1st ACM symposium on Cloud computing, SoCC '10.
DOI : 10.1145/1807128.1807160
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.186.5558
The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems, Cluster Computing and the Grid, IEEE International Symposium on, pp.398-407, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788866
An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS'08, 2008.
Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pp.1-11, 2010.
Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, pp.364-372, 2012.
DOI : 10.1109/CLUSTER.2012.82
A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.
Cascading failover of a data management application for shared disk file systems in loosely coupled node clusters, US Patent 6,990,606, 2006.
Statistics for chemical and process engineers: a modern approach, 2015.
DOI : 10.1007/978-3-319-21509-9
Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.25-36, 2014.
DOI : 10.1109/DSN.2014.101
Failure history, 2017.
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Cluster Computing, IEEE International Conference on, pp.93-103, 2004.