R. A. Ashraf, S. Hukerikar, and C. Engelmann, Shrink or substitute: Handling process failures in HPC systems using in-situ recovery, 2018.

W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Post-failure recovery of MPI communication capability, The International Journal of High Performance Computing Applications, vol.27, issue.3, pp.244-254, 2013.
DOI : 10.1177/1094342013488238

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.23, issue.4, p.2014

URL : http://institute.lanl.gov/resilience/docs/Toward%20Exascale%20Resilience.pdf

W. Cirne and F. Berman, Using Moldability to Improve the Performance of Supercomputer Jobs, Journal of Parallel and Distributed Computing, vol.62, issue.10, pp.1571-1601, 2002.
DOI : 10.1016/S0743-7315(02)91869-1

URL : http://walfredo.dsc.ufpb.br/publications/thesis.pdf

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

P. Du and A. Bouteiller, Algorithm-based fault tolerance for dense matrix factorizations, PPoPP, pp.225-234, 2012.
DOI : 10.1145/2145816.2145845

URL : http://icl.cs.utk.edu/news_pub/submissions/lawn253.pdf

P. Dutot, G. Mounié, and D. Trystram, Scheduling parallel tasks: approximation algorithms, Handbook of Scheduling: Algorithms, Models, and Performance Analysis, 2004.

A. Fang, H. Fujita, and A. A. Chien, Towards Understanding Post-recovery Efficiency for Shrinking and Non-shrinking Recovery, Euro-Par 2015: Parallel Processing Workshops, pp.656-668, 2015.
DOI : 10.1007/978-3-319-27308-2_53

Y. Guo, W. Bland, P. Balaji, and X. Zhou, Fault tolerant MapReduce-MPI for HPC clusters, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pp.34:1-34:12, 2015.
DOI : 10.1145/2807591.2807617

URL : http://dl.acm.org/ft_gateway.cfm?id=2807617&type=pdf

S. Gupta, T. Patel, C. Engelmann, and D. Tiwari, Failures in large scale systems, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '17, pp.44:1-44:12, 2017.

A. Hori, K. Yoshinaga, T. Herault, A. Bouteiller, G. Bosilca et al., Sliding Substitution of Failed Nodes, Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI '15, pp.1-14, 2015.
DOI : 10.1145/2802658.2802670

URL : http://dl.acm.org/ft_gateway.cfm?id=2802670&type=pdf

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.

J. E. Moreira and V. K. Naik, Dynamic resource management on distributed systems using reconfigurable applications, IBM Journal of Research and Development, vol.41, issue.3, pp.303-330, 1997.
DOI : 10.1147/rd.413.0303

URL : http://www.research.ibm.com/drms/papers/RC20890.ps

S. Prabhakaran, Dynamic Resource Management and Job Scheduling for High Performance Computing, 2016.

R. Sudarsan and C. J. Ribbens, Design and performance of a scheduling framework for resizable parallel applications, Parallel Computing, vol.36, issue.1, pp.48-64, 2010.
DOI : 10.1016/j.parco.2009.12.010

R. Sudarsan, C. J. Ribbens, and D. Farkas, Dynamic Resizing of Parallel Scientific Simulations: A Case Study Using LAMMPS
DOI : 10.1007/978-3-642-01970-8_18

K. Yamamoto et al., The K computer Operations: Experiences and Statistics, Int. Conf. Computational Science (ICCS), pp.576-585, 2014.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115