G. Aupy, A. Benoit, T. Hérault, and Y. Robert, Optimal Checkpointing Period: Time vs. Energy, PMBS 2013, the 4th Int. Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, 2013.
DOI : 10.1007/978-3-319-10214-6_10

URL : https://hal.archives-ouvertes.fr/hal-00926199

G. Aupy, A. Benoit, P. Renaud-Goud, and Y. Robert, Energy-aware checkpointing of divisible tasks with soft or hard deadlines, 2013 International Green Computing Conference Proceedings, 2013.
DOI : 10.1109/IGCC.2013.6604467

URL : https://hal.archives-ouvertes.fr/hal-00857244

N. Bansal, T. Kimbrel, and K. Pruhs, Speed scaling to manage energy and temperature, Journal of the ACM, vol.54, issue.1, article 3, pp.1-39, 2007.
DOI : 10.1145/1206035.1206038

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Which Verification for Soft Error Detection?, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 2015.
DOI : 10.1109/HiPC.2015.26

URL : https://hal.archives-ouvertes.fr/hal-01252382

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.39

URL : https://hal.archives-ouvertes.fr/hal-01354886

A. Benoit, Y. Robert, and S. K. Raina, Efficient checkpoint/verification patterns, International Journal of High Performance Computing Applications, 2015.
DOI : 10.1177/1094342015594531

URL : https://hal.archives-ouvertes.fr/ensl-01252342

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical timestepping schemes, International Journal of High Performance Computing Applications, 2014.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

A. Cavelan, S. K. Raina, Y. Robert, and H. Sun, Assessing the Impact of Partial Verifications against Silent Data Corruptions, 2015 44th International Conference on Parallel Processing, 2015.
DOI : 10.1109/ICPP.2015.53

URL : https://hal.archives-ouvertes.fr/hal-01253493

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, PPoPP, 2013.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12, p.78, 2012.

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

R. Melhem, D. Mossé, and E. Elnozahy, The interplay of power management and fault recovery in real-time systems, IEEE Transactions on Computers, vol.53, issue.2, pp.217-231, 2004.
DOI : 10.1109/TC.2004.1261830

E. Meneses, O. Sarood, and L. V. Kalé, Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012.
DOI : 10.1109/SBAC-PAD.2012.12

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proc. SC'10, pp.1-11, 2010.

F. Quaglia, A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.4, pp.346-362, 2001.
DOI : 10.1109/71.920586

N. B. Rizvandi, A. Y. Zomaya, Y. C. Lee, A. J. Boloori, and J. Taheri, Multiple Frequency Selection in DVFS-Enabled Processors to Minimize Energy Consumption, Energy-Efficient Distributed Computing Systems, 2012.
DOI : 10.1002/9781118342015.ch17

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

F. Yao, A. Demers, and S. Shenker, A scheduling model for reduced CPU energy, Proceedings of IEEE 36th Annual Foundations of Computer Science, 1995.
DOI : 10.1109/SFCS.1995.492493

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115