K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011.
DOI : 10.1145/2063384.2063443

L. Bautista-Gomez, N. Maruyama, D. Komatitsch, S. Tsuboi, F. Cappello et al., FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427

URL : https://hal.archives-ouvertes.fr/hal-00721216

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.
DOI : 10.1109/SC.2010.18

A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proceedings of the 17th international conference on Parallel processing (Euro-Par'11), pp.51-64, 2011.
DOI : 10.1007/978-3-642-23397-5_6

E. Meneses, C. L. Mendes, and L. V. Kale, Team-Based Message Logging: Preliminary Results, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.110

URL : http://charm.cs.illinois.edu/newPapers/10-02/paper.pdf

T. Ropars, T. Martsinkevich, A. Guermouche, A. Schiper, and F. Cappello, SPBC, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503271

URL : https://hal.archives-ouvertes.fr/hal-01121951

M. Snir and R. W. Wisniewski, Addressing failures in exascale computing, International Journal of High Performance Computing Applications, vol.28, issue.2, pp.129-173, 2014.
DOI : 10.1177/1094342014522573

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962

F. Cappello, A. Geist, S. Kale, B. Kramer, and M. Snir, Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.1-28, 2009.
DOI : 10.1177/1094342009347767

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

T. Davies and Z. Chen, Correcting soft errors online in LU factorization, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.167-178, 2013.
DOI : 10.1145/2493123.2462920

A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello, Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications, 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS2011), 2011.

A. Guermouche, T. Ropars, M. Snir, and F. Cappello, HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.111

URL : https://hal.archives-ouvertes.fr/hal-01121941

R. Riesen, K. Ferreira, D. Da Silva, P. Lemarinier, D. Arnold et al., Alleviating scalability issues of checkpointing protocols, IEEE/ACM SuperComputing 2012 (SC'12), pp.1-18, 2012.

Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing, Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS'13), 2013.

M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, vol.28, issue.2, 2012.
DOI : 10.1177/1094342013505348

URL : https://hal.archives-ouvertes.fr/hal-00881463

A. Lefray, T. Ropars, and A. Schiper, Replication for send-deterministic MPI HPC applications, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013.
DOI : 10.1145/2465813.2465819

URL : https://hal.archives-ouvertes.fr/hal-01121949

J. Stearley, K. Ferreira, D. Robinson, J. Laros, K. Pedretti et al., Does partial replication pay off?, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012.
DOI : 10.1109/DSNW.2012.6264669

C. George and S. Vadhiyar, Fault Tolerance on Large Scale Systems using Adaptive Process Replication, IEEE Transactions on Computers, vol.64, issue.8, 2014.
DOI : 10.1109/TC.2014.2360536

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, IEEE/ACM, pp.78:1-78:12, 2012.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pp.7:1-7:12, 2013.
DOI : 10.1145/2503210.2503266

L. Bautista-Gomez and F. Cappello, Detecting Silent Data Corruption Through Data Dynamic Monitoring for Scientific Applications, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'14), pp.381-382, 2014.

E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin et al., A Proposal for Task Parallelism in OpenMP, Proceedings of the 3rd International Workshop on OpenMP: A Practical Programming Model for the Multi-Core Era, pp.1-12, 2008.
DOI : 10.1007/978-3-540-69303-1_1

J. R. Allen, Dependence Analysis for Subscripted Variables and Its Application to Program Transformations, 1983.

A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell et al., OmpSs: A Proposal for Programming Heterogeneous Multi-Core Architectures, Parallel Processing Letters, vol.21, issue.02, pp.173-193, 2011.
DOI : 10.1142/S0129626411000151