Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011. ,
DOI : 10.1145/2063384.2063443
FTI, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011. ,
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010. ,
DOI : 10.1109/SC.2010.18
Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proceedings of the 17th international conference on Parallel processing (Euro-Par'11), pp.51-64, 2011. ,
DOI : 10.1007/978-3-642-23397-5_6
Team-Based Message Logging: Preliminary Results, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010. ,
DOI : 10.1109/CCGRID.2010.110
URL : http://charm.cs.illinois.edu/newPapers/10-02/paper.pdf
SPBC, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013. ,
DOI : 10.1145/2503210.2503271
URL : https://hal.archives-ouvertes.fr/hal-01121951
Addressing failures in exascale computing, International Journal of High Performance Computing Applications, vol.28, issue.2, pp.129-173, 2014. ,
DOI : 10.1177/1094342014522573
Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007. ,
DOI : 10.1109/MSST.2007.4367962
Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.1-28, 2009. ,
DOI : 10.1177/1094342009347767
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Correcting soft errors online in LU factorization, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.167-178, 2013. ,
DOI : 10.1145/2493123.2462920
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications, 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS2011), 2011. ,
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012. ,
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941
Alleviating scalability issues of checkpointing protocols, IEEE/ACM SuperComputing 2012 (SC'12), pp.1-18, 2012. ,
Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing, Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS'13), 2013. ,
Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, vol.28, issue.2, 2014. ,
DOI : 10.1177/1094342013505348
URL : https://hal.archives-ouvertes.fr/hal-00881463
Replication for send-deterministic MPI HPC applications, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013. ,
DOI : 10.1145/2465813.2465819
URL : https://hal.archives-ouvertes.fr/hal-01121949
Does partial replication pay off?, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012. ,
DOI : 10.1109/DSNW.2012.6264669
Fault Tolerance on Large Scale Systems using Adaptive Process Replication, IEEE Transactions on Computers, vol.64, issue.8, 2014. ,
DOI : 10.1109/TC.2014.2360536
Detection and correction of silent data corruption for large-scale high-performance computing, IEEE/ACM, pp.78:1-78:12, 2012. ,
ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pp.7:1-7:12, 2013. ,
DOI : 10.1145/2503210.2503266
Detecting Silent Data Corruption Through Data Dynamic Monitoring for Scientific Applications, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'14), pp.381-382, 2014. ,
A Proposal for Task Parallelism in OpenMP, Proceedings of the 3rd International Workshop on OpenMP: A Practical Programming Model for the Multi-Core Era, pp.1-12, 2008. ,
DOI : 10.1007/978-3-540-69303-1_1
Dependence Analysis for Subscripted Variables and Its Application to Program Transformations, 1983. ,
OmpSs: A Proposal for Programming Heterogeneous Multi-Core Architectures, Parallel Processing Letters, vol.21, issue.02, pp.173-193, 2011. ,
DOI : 10.1142/S0129626411000151