I. Assayad, A. Girault, and H. Kalla, Tradeoff exploration between reliability, power consumption, and execution time for embedded systems, International Journal on Software Tools for Technology Transfer, vol.7, issue.3, pp.229-245, 2013.
DOI : 10.1007/s10009-012-0263-9

URL : https://hal.archives-ouvertes.fr/hal-00923926

G. Aupy, A. Benoit, T. Hérault, Y. Robert, F. Vivien, and D. Zaidouni, On the combination of silent error detection and checkpointing, Proceedings of the 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), pp.11-20, 2013.

G. Aupy, A. Benoit, and Y. Robert, Energy-aware scheduling under reliability and makespan constraints, 2012 19th International Conference on High Performance Computing, pp.1-10, 2012.
DOI : 10.1109/HiPC.2012.6507482

URL : https://hal.archives-ouvertes.fr/hal-00763384

N. Bansal, T. Kimbrel, and K. Pruhs, Speed scaling to manage energy and temperature, Journal of the ACM, vol.54, issue.1, pp.1-3, 2007.
DOI : 10.1145/1206035.1206038

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical timestepping schemes, The International Journal of High Performance Computing Applications.
DOI : 10.1177/1094342014532297

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp.167-176, 2013.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

A. Das, A. Kumar, B. Veeravalli, C. Bolchini, and A. Miele, Combined DVFS and Mapping Exploration for Lifetime and Soft-error Susceptibility Improvement in MPSoCs, Proceedings of the Conference on Design, pp.1-61, 2014.

A. Dixit and A. Wood, The impact of new technology on soft error rates, 2011 International Reliability Physics Symposium, pp.1-5, 2011.
DOI : 10.1109/IRPS.2011.5784522

N. El-Sayed, I. A. Stefanovici, G. Amvrosiadis, A. A. Hwang, and B. Schroeder, Temperature Management in Data Centers: Why Some (Might) Like It Hot, SIGMETRICS Perform. Eval. Rev., vol.40, pp.1-163, 2012.

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012.
DOI : 10.1109/ICDCS.2012.56

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, pp.375-408, 2002.

W. Feng, Making a Case for Efficient Supercomputing, pp.54-64, 2003.

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), article 78, 2012.

M. A. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, 2011.

C. Hsu and W. Feng, A Power-Aware Run-Time System for High-Performance Computing, Proceedings of the ACM, pp.1-9, 2005.

K. Huang and J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, pp.49-56, 2013.
DOI : 10.1145/2465813.2465821

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM, pp.1-11, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266

T. J. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

M. K. Patterson, The effect of data center temperature on energy efficiency, 2008 11th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems, pp.1167-1174, 2008.
DOI : 10.1109/ITHERM.2008.4544393

N. Babaii Rizvandi, A. Y. Zomaya, Y. C. Lee, A. J. Boloori, and J. Taheri, Multiple Frequency Selection in DVFS-Enabled Processors to Minimize Energy Consumption, Energy-Efficient Distributed Computing Systems, 2012.

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

O. Sarood, E. Meneses, and L. V. Kalé, A 'cool' way of improving the reliability of HPC machines, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pp.58:1-58:12, 2013.
DOI : 10.1145/2503210.2503228

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

S. Toueg and Ö. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

F. Yao, A. Demers, and S. Shenker, A scheduling model for reduced CPU energy, Proceedings of IEEE 36th Annual Foundations of Computer Science, p.374, 1995.
DOI : 10.1109/SFCS.1995.492493

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

B. Zhao, H. Aydin, and D. Zhu, Reliability-aware Dynamic Voltage Scaling for energy-constrained real-time embedded systems, 2008 IEEE International Conference on Computer Design, pp.633-639, 2008.
DOI : 10.1109/ICCD.2008.4751927

D. Zhu, R. Melhem, and D. Mosse, The Effects of Energy Management on Reliability in Real-time Embedded Systems, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp.35-40, 2004.

J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O'Gorman et al., Accelerated testing for cosmic soft-error rate, IBM Journal of Research and Development, vol.40, issue.1, pp.1-51, 1996.
DOI : 10.1147/rd.401.0051

J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.
DOI : 10.1109/4.658626