I. Assayad, A. Girault, and H. Kalla, Tradeoff Exploration between Reliability, Power Consumption, and Execution Time, Proceedings of Computer Safety, Reliability and Security Conference (SAFECOMP), 2011.

URL : https://hal.archives-ouvertes.fr/hal-00655478

G. Aupy, Source code and data for tri-criteria scheduling, http://gaupy.org/tri-criteria-scheduling

G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela et al., Complexity and Approximation, 1999.
DOI : 10.1007/978-3-642-58412-1

URL : https://hal.archives-ouvertes.fr/hal-00906941

H. Aydin and Q. Yang, Energy-aware partitioning for multiprocessor real-time systems, Proceedings International Parallel and Distributed Processing Symposium, pp.113-121, 2003.
DOI : 10.1109/IPDPS.2003.1213225

M. Baleani, A. Ferrari, L. Mangeruca, A. Sangiovanni-Vincentelli, M. Peri et al., Fault-tolerant platforms for automotive safety-critical applications, Proc. of Int. Conf. on Compilers, Architectures and Synthesis for Embedded Systems, pp.170-177, 2003.

N. Bansal, T. Kimbrel, and K. Pruhs, Speed scaling to manage energy and temperature, Journal of the ACM, vol.54, issue.1, pp.1-39, 2007.
DOI : 10.1145/1206035.1206038

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.550.7426

E. Beigne, F. Clermidy, J. Durupt, H. Lhermet, S. Miermont et al., An asynchronous power aware and adaptive NoC based circuit, Proceedings of the Symposium on VLSI Circuits, pp.190-191, 2008.

E. Beigne, F. Clermidy, S. Miermont, Y. Thonnart, A. Valentian et al., A Localized Power Control mixing hopping and Super Cut-Off techniques within a GALS NoC, 2008 IEEE International Conference on Integrated Circuit Design and Technology and Tutorial, 2008.
DOI : 10.1109/ICICDT.2008.4567241

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 2015.
DOI : 10.1177/1094342014532297

K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally et al., Exascale computing study: Technology challenges in achieving exascale systems, Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep., 2008.

V. Bharadwaj, T. G. Robertazzi, and D. Ghose, Scheduling Divisible Loads in Parallel and Distributed Systems, 1996.

R. Biswas, M. Aftosmis, C. Kiris, and B. Shen, Petascale computing: Impact on future NASA missions, pp.29-46, 2007.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
DOI : 10.1002/cpe.3173

URL : https://hal.archives-ouvertes.fr/hal-00696154

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

M. Bouguerra, T. Gautier, D. Trystram, and J. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, International Conference on Parallel Processing and Applied Mathematics (PPAM), ser. LNCS, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22

URL : https://hal.archives-ouvertes.fr/hal-00788926

M. Bouguerra, D. Trystram, and F. Wagner, Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012.
DOI : 10.1109/TC.2012.57

URL : https://hal.archives-ouvertes.fr/hal-00788101

M. Bouguerra, A. Gainaru, L. Gomez, F. Cappello, S. Matsuoka et al., Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp.501-512, 2013.
DOI : 10.1109/IPDPS.2013.74

A. Bouteiller, P. Lemarinier, K. Krawezik, and F. Cappello, Coordinated checkpoint versus message log for fault tolerant MPI, Proceedings IEEE International Conference on Cluster Computing CLUSTR-03, pp.242-250, 2003.
DOI : 10.1109/CLUSTR.2003.1253321

S. Boyd and L. Vandenberghe, Convex Optimization, 2004.

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

G. Buttazzo, G. Lipari, L. Abeni, and M. Caccamo, Soft Real-Time Systems: Predictability vs. Efficiency, Springer Series in Computer Science, 2005.

F. Cappello, H. Casanova, and Y. Robert, Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011.
DOI : 10.1142/S0129626411000126

URL : https://hal.archives-ouvertes.fr/hal-00945068

F. Cappello, A. Geist, B. Gropp, L. V. Kalé, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi et al., Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001.
DOI : 10.1147/rd.452.0311

A. P. Chandrakasan and A. Sinha, Jouletrack: A web based tool for software energy profiling, Design Automation Conference, pp.220-225, 2001.

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems (TOCS), pp.63-75, 1985.
DOI : 10.1145/214451.214456

G. Chen, K. Malkowski, M. Kandemir, and P. Raghavan, Reducing Power with Performance Constraints for Parallel Sparse Applications, 19th IEEE International Parallel and Distributed Processing Symposium, p.8, 2005.
DOI : 10.1109/IPDPS.2005.378

J. Chen and T. Kuo, Multiprocessor energy-efficient scheduling for real-time tasks, Proceedings of International Conference on Parallel Processing, pp.13-20, 2005.

J. Chen and C. Kuo, Energy-Efficient Scheduling for Real-Time Systems on Dynamic Voltage Scaling (DVS) Platforms, 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2007), pp.28-38, 2007.
DOI : 10.1109/RTCSA.2007.37

D. Cordeiro, G. Mounié, S. Perarnau, D. Trystram, J. Vincent et al., Random graph generation for scheduling simulations, Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, 2010.
DOI : 10.4108/ICST.SIMUTOOLS2010.8667

URL : https://hal.archives-ouvertes.fr/hal-00471255

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, 2009.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

V. Degalahal, L. Li, V. Narayanan, M. Kandemir, and M. J. Irwin, Soft errors issues in low-power caches, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.13, issue.10, pp.1157-1166, 2005.
DOI : 10.1109/TVLSI.2005.859474

M. E. Diouri, O. Gluck, L. Lefèvre, and F. Cappello, Energy considerations in checkpointing and fault tolerance protocols, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012.
DOI : 10.1109/DSNW.2012.6264670

URL : https://hal.archives-ouvertes.fr/hal-00748006

M. E. Diouri, O. Gluck, L. Lefèvre, and F. Cappello, Ecofit: A framework to estimate energy consumption of fault tolerance protocols for HPC applications, Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pp.522-529, 2013.

J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

J. Dongarra, T. Hérault, and Y. Robert, Revisiting the Double Checkpointing Algorithm, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013.
DOI : 10.1109/IPDPSW.2013.11

URL : https://hal.archives-ouvertes.fr/hal-00925168

M. Drozdowski, Divisible load, in Scheduling for Parallel Processing, ser. Computer Communications and Networks, pp.301-365, 2009.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proceedings of the ACM/IEEE conference on SuperComputing (SC), 2012.

E. W. Fulp, G. A. Fink, and J. N. Haack, Predicting computer system failures using support vector machines, Proceedings of the First USENIX conference on Analysis of system logs. USENIX Association, 2008.

A. Gainaru, F. Cappello, and W. Kramer, Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.107

A. Gainaru, F. Cappello, W. Kramer, and M. Snir, Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
DOI : 10.1109/SC.2012.57

M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, 1990.

R. Ge, X. Feng, and K. W. Cameron, Performance-constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters, ACM/IEEE SC 2005 Conference (SC'05), p.34, 2005.
DOI : 10.1109/SC.2005.57

E. Gelenbe, On the Optimum Checkpoint Interval, Journal of the ACM, vol.26, issue.2, pp.259-270, 1979.
DOI : 10.1145/322123.322131

E. Gelenbe and D. Derochette, Performance of rollback recovery systems under intermittent failures, Communications of the ACM, vol.21, issue.6, pp.493-499, 1978.
DOI : 10.1145/359511.359531

E. Gelenbe and M. Hernández, Optimum checkpoints with age dependent failures, Acta Informatica, vol.27, issue.6, pp.519-531, 1990.
DOI : 10.1007/BF00277388

A. Girault, E. Saule, and D. Trystram, Reliability versus performance for critical applications, Journal of Parallel and Distributed Computing, vol.69, issue.3, pp.326-336, 2009.
DOI : 10.1016/j.jpdc.2008.11.002

URL : https://hal.archives-ouvertes.fr/hal-00753169

R. Gonzalez and M. Horowitz, Energy dissipation in general purpose microprocessors, IEEE Journal of Solid-State Circuits, vol.31, issue.9, pp.1277-1284, 1996.
DOI : 10.1109/4.535411

P. Grosse, Y. Durand, and P. Feautrier, Methods for power optimization in SOC-based data flow systems, ACM Transactions on Design Automation of Electronic Systems, vol.14, issue.3, pp.1-38, 2009.
DOI : 10.1145/1529255.1529260

A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello, Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.989-1000, 2011.
DOI : 10.1109/IPDPS.2011.95

URL : https://hal.archives-ouvertes.fr/hal-01121937

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, SIGMETRICS Perf. Eval. Rev, vol.30, issue.1, 2002.
DOI : 10.1145/511399.511362

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063444

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Sandia National Laboratories, 2011.

J. Hong, S. Kim, Y. Cho, H. Yeom, and T. Park, On the choice of checkpoint interval using memory usage profile and adaptive time series analysis, Proceedings of the Pacific Rim International Symposium on Dependable Computing (PRDC), 2001.

Y. Hotta, M. Sato, H. Kimura, S. Matsuoka, T. Boku et al., Profile-based optimization of power performance by using dynamic voltage scaling on a PC cluster, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006.
DOI : 10.1109/IPDPS.2006.1639597

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.33, issue.6, pp.518-528, 1984.

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

R. Jejurikar, C. Pereira, and R. Gupta, Leakage aware dynamic voltage scaling for real-time embedded systems, Proceedings of the 41st annual conference on Design automation, DAC '04, pp.275-280, 2004.
DOI : 10.1145/996566.996650

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010.
DOI : 10.1109/ICPP.2010.80

H. Kawaguchi, G. Zhang, S. Lee, and T. Sakurai, An LSI for VDD-Hopping and MPEG4 system based on the chip, Proceedings of the International Symposium on Circuits and Systems (ISCAS), 2001.

O. Kella and W. Stadje, Superposition of renewal processes and an application to multi-server queues, Statistics & Probability Letters, vol.76, issue.17, pp.1914-1924, 2006.
DOI : 10.1016/j.spl.2006.04.041

K. H. Kim, R. Buyya, and J. Kim, Power Aware Scheduling of Bag-of-Tasks Applications with Deadline Constraints on DVS-enabled Clusters, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), pp.541-548, 2007.
DOI : 10.1109/CCGRID.2007.85

P. Kogge and J. Shalf, Exascale computing trends: Adjusting to the "new normal" in computer architecture, 2013.

N. Kolettis and N. D. Fulton, Software rejuvenation: Analysis, module and applications, International Symposium on Fault-Tolerant Computing (FTCS), p.381, 1995.

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.398-407, 2010.
DOI : 10.1109/CCGRID.2010.71

URL : https://hal.archives-ouvertes.fr/inria-00433523

K. Lahiri, A. Raghunathan, S. Dey, and D. Panigrahi, Battery-driven system design: a new frontier in low power design, Proceedings of ASP-DAC/VLSI Design 2002. 7th Asia and South Pacific Design Automation Conference and 15th International Conference on VLSI Design, pp.261-267, 2002.
DOI : 10.1109/ASPDAC.2002.994932

P. Langen and B. Juurlink, Leakage-Aware Multiprocessor Scheduling, Journal of Signal Processing Systems, vol.74, issue.8, pp.73-88, 2009.
DOI : 10.1007/s11265-008-0176-8

Y. Li, Z. Lan, P. Gujrati, and X. Sun, Fault-aware runtime strategies for high-performance computing, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.4, pp.460-473, 2009.

Y. Liang, Y. Zhang, H. Xiong, and R. K. Sahoo, Failure Prediction in IBM BlueGene/L Event Logs, Seventh IEEE International Conference on Data Mining (ICDM 2007), pp.583-588, 2007.
DOI : 10.1109/ICDM.2007.46

Y. Ling, J. Mi, and X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Transactions on Computers, pp.699-708, 2001.

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2008.

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013.
DOI : 10.1145/2465813.2465821

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

R. Melhem, D. Mosse, and E. Elnozahy, The interplay of power management and fault recovery in real-time systems, IEEE Transactions on Computers, vol.53, issue.2, 2003.
DOI : 10.1109/TC.2004.1261830

E. Meneses, O. Sarood, and L. V. Kalé, Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012.
DOI : 10.1109/SBAC-PAD.2012.12

S. Miermont, P. Vivet, and M. Renaudin, A power supply selector for energy- and area-efficient local dynamic voltage scaling, in Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation, pp.556-565, 2007.

M. P. Mills, The internet begins with coal, Environment and Climate News, 1999.

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
DOI : 10.1017/CBO9780511813603

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE conference on SuperComputing (SC), pp.1-11, 2010.

X. Ni, E. Meneses, and L. V. Kalé, Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, 2012.
DOI : 10.1109/CLUSTER.2012.82

J. Nocedal and S. J. Wright, Numerical Optimization, 2006.
DOI : 10.1007/b98874

T. Okuma, H. Yasuura, and T. Ishihara, Software energy reduction techniques for variable-voltage processors, IEEE Design & Test of Computers, vol.18, issue.2, pp.31-41, 2001.
DOI : 10.1109/54.914613

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962

A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam, Fault-aware job scheduling for BlueGene/L systems, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp.64-73, 2004.
DOI : 10.1109/ipdps.2004.1302991

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

J. S. Plank and M. G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol.61, issue.11, p.1590, 2001.
DOI : 10.1006/jpdc.2001.1757

P. Pop, K. H. Poulsen, V. Izosimov, and P. Eles, Scheduling and voltage scaling for energy/reliability trade-offs in fault-tolerant time-triggered embedded systems, Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis, CODES+ISSS '07, pp.233-238, 2007.
DOI : 10.1145/1289816.1289873

R. B. Prathipati, Energy efficient scheduling techniques for real-time embedded systems, 2004.

K. Pruhs, R. van Stee, and P. Uthaisombut, Speed scaling of tasks with precedence constraints, Theory of Computing Systems, pp.67-80, 2008.

R. Rajachandrasekar, A. Moody, K. Mohror, and D. K. Panda, A 1 PB/s file system to checkpoint three million MPI tasks, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.143-154, 2013.
DOI : 10.1145/2493123.2462908

V. J. Rayward-Smith, F. W. Burton, and G. J. Janacek, Scheduling parallel programs assuming preallocation, 1995.

Y. Robert, F. Vivien, and D. Zaidouni, On the complexity of scheduling checkpoints for computational workflows, in Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), 2012.

S. M. Ross, Introduction to Probability Models, Tenth Edition, 2009.

W. Rudin, Principles of Mathematical Analysis, International Series in Pure and Applied Mathematics, 1976.

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

V. Sarkar, Exascale software study: Software challenges in extreme scale systems, 2009.

K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski et al., Design and modeling of a non-blocking checkpointing system, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
DOI : 10.1109/SC.2012.46

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proceedings of the International Conference on Dependable Systems and Networks (DSN), pp.249-258, 2006.

J. Shalf, S. Dosanjh, and J. Morrison, Exascale Computing Technology Challenges, International Conference on High Performance Computing for Computational Science (VECPAR), pp.1-25, 2011.
DOI : 10.1109/MM.2009.5

S. M. Shatz and J. Wang, Models and algorithms for reliability-oriented task-allocation in redundant distributed-computer systems, IEEE Transactions on Reliability, vol.38, issue.1, pp.16-27, 1989.
DOI : 10.1109/24.24570

K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy et al., Temperature-aware microarchitecture, ACM Transactions on Architecture and Code Optimization, vol.1, issue.1, pp.94-125, 2004.
DOI : 10.1145/980152.980157

J. A. Stankovic, K. Ramamritham, and M. Spuri, Deadline Scheduling for Real-Time Systems: EDF and Related Algorithms, 1998.
DOI : 10.1007/978-1-4615-5535-3

S. Toueg and O. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67

L. Wang, G. von Laszewski, J. Dayal, and F. Wang, Towards Energy Aware Scheduling for Precedence Constrained Parallel Tasks in a Cluster with DVFS, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.368-377, 2010.
DOI : 10.1109/CCGRID.2010.19

Y. Wang, P. Chung, I. Lin, and W. K. Fuchs, Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems, IEEE Transactions on Parallel and Distributed Systems, vol.6, issue.5, pp.546-554, 1995.
DOI : 10.1109/71.382324

J. Wingstrom, Overcoming The Difficulties Created by the Volatile Nature of Desktop Grids Through Understanding, Prediction and Redundancy, 2009.

L. Yang and L. Man, On-Line and Off-Line DVS for Fixed Priority with Preemption Threshold Scheduling, 2009 International Conference on Embedded Software and Systems, pp.273-280, 2009.
DOI : 10.1109/ICESS.2009.50

F. Yao, A. Demers, and S. Shenker, A scheduling model for reduced CPU energy, Proceedings of IEEE 36th Annual Foundations of Computer Science, p.374, 1995.
DOI : 10.1109/SFCS.1995.492493

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, Practical online failure prediction for BlueGene/P: Period-based vs event-driven, Proceedings of the International Conference on Dependable Systems and Networks Workshops, pp.259-264, 2011.
DOI : 10.1109/dsnw.2011.5958823

Y. Zhang and K. Chakrabarty, Energy-aware adaptive checkpointing in embedded real-time systems, 2003 Design, Automation and Test in Europe Conference and Exhibition, p.10918, 2003.
DOI : 10.1109/DATE.2003.1253723

Y. Zhang, X. S. Hu, and D. Z. Chen, Task scheduling and voltage selection for energy minimization, Proceedings of the 39th conference on Design automation, DAC '02, pp.183-188, 2002.
DOI : 10.1145/513918.513966

G. Zheng, X. Ni, and L. Kalé, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Cluster Computing, 2004.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, A practical failure prediction with location and lead time for BlueGene/P, Proceedings of the International Conference on Dependable Systems and Networks Workshops, pp.15-22, 2010.

D. Zhu, Reliability-aware dynamic energy management in dependable embedded real-time systems, Real-Time and Embedded Technology and Applications Symposium, pp.397-407, 2006.
DOI : 10.1145/1880050.1880062

D. Zhu and H. Aydin, Energy management for real-time embedded systems with reliability requirements, Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design, ICCAD '06, pp.528-534, 2006.
DOI : 10.1145/1233501.1233608

D. Zhu, R. Melhem, and D. Mossé, The effects of energy management on reliability in real-time embedded systems, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp.35-40, 2004.

Publications

Articles in international refereed journals

G. Aupy, A. Benoit, F. Dufossé, and Y. Robert, Reclaiming the energy of a schedule: models and algorithms, Concurrency and Computation: Practice and Experience, pp.1505-1523, 2013.
DOI : 10.1002/cpe.2889

URL : https://hal.archives-ouvertes.fr/inria-00584944

G. Aupy, A. Benoit, J. Matthieu, and Y. Robert, Power-aware replica placement in tree networks with multiple servers per client, Sustainable Computing: Informatics and Systems, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01059365

G. Aupy and O. Bournez, On the number of binary-minded individuals required to compute 1

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010

URL : https://hal.archives-ouvertes.fr/hal-00788313

Articles in international refereed conferences

L. Atkins, G. Aupy, D. Cole, and K. Pruhs, Speed Scaling to Manage Temperature, Theory and Practice of Algorithms in (Computer) Systems (TAPAS), pp.9-20, 2011.
DOI : 10.1007/978-3-642-19754-3_4

URL : https://hal.archives-ouvertes.fr/hal-00786200

G. Aupy, A. Benoit, and Y. Robert, Energy-aware scheduling under reliability and makespan constraints, 2012 19th International Conference on High Performance Computing, 2012.
DOI : 10.1109/HiPC.2012.6507482

URL : https://hal.archives-ouvertes.fr/hal-00763384

G. Aupy, A. Benoit, F. Dufossé, and Y. Robert, Brief announcement, Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pp.135-136, 2011.
DOI : 10.1145/1989493.1989512

URL : https://hal.archives-ouvertes.fr/hal-00857268

G. Aupy, A. Benoit, T. Hérault, Y. Robert, and J. Dongarra, Optimal Checkpointing Period: Time vs. Energy, Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), ser. LNCS, 2013.
DOI : 10.1007/978-3-319-10214-6_10

URL : https://hal.archives-ouvertes.fr/hal-00926199

G. Aupy, A. Benoit, T. Hérault, Y. Robert, F. Vivien et al., On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, 2013.
DOI : 10.1109/PRDC.2013.10

URL : https://hal.archives-ouvertes.fr/hal-00836871

G. Aupy, A. Benoit, J. Matthieu, and Y. Robert, Power-aware replica placement in tree networks with multiple servers per client, Proceedings of Euro-Par: Parallel Processing, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01059365

G. Aupy, A. Benoit, R. Melhem, P. Renaud-goud, and Y. Robert, Energy-aware checkpointing of divisible tasks with soft or hard deadlines, 2013 International Green Computing Conference Proceedings, pp.1-8, 2013.
DOI : 10.1109/IGCC.2013.6604467

URL : https://hal.archives-ouvertes.fr/hal-00857244

G. Aupy, M. Faverge, Y. Robert, J. Kurzak, P. Luszczek et al., Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC, Workshop on Productivity and Performance (PROPER), 2013.
DOI : 10.1007/978-3-642-54420-0_64

URL : https://hal.archives-ouvertes.fr/hal-00879248

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing strategies with prediction windows, Proceedings of the Pacific Rim International Symposium on Dependable Computing (PRDC).

G. Aupy et al., Approximation algorithms for energy, reliability and makespan optimization problems, 2012.

G. Aupy, A. Benoit, F. Dufossé, and Y. Robert, Reclaiming the energy of a schedule: models and algorithms, Concurrency and Computation: Practice and Experience, vol.24, issue.9, 2011.
DOI : 10.1002/cpe.2889

URL : https://hal.archives-ouvertes.fr/inria-00584944

G. Aupy, A. Benoit, J. Matthieu, and Y. Robert, Power-aware replica placement in tree networks with multiple servers per client, INRIA, Research report RR-8474, 2014.

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Impact of fault prediction on checkpointing strategies, INRIA, Tech. Rep. RR-8023, 2012. This report is rendered obsolete by RR-8237 and RR-8239, which cover its entire content more precisely.

G. Aupy, M. Shantharam, A. Benoit, Y. Robert, and P. Raghavan, Co-scheduling algorithms for high-throughput workload execution, Journal of Scheduling, vol.23, issue.2, 2013.
DOI : 10.1109/DATE.2012.6176641

URL : https://hal.archives-ouvertes.fr/hal-01252366

A. Gainaru, G. Aupy, A. Benoit, F. Cappello, Y. Robert et al., Scheduling the I/O of HPC applications under congestion, INRIA, Research report RR-8519, 2014.