Towards resilient parallel linear Krylov solvers: recover-restart strategies, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00843992
On resiliency in some parallel eigensolvers, Research Report, vol.8625
Message logging: pessimistic, optimistic, causal, and optimal, IEEE Transactions on Software Engineering, vol.24, issue.2, pp.149-159, 1998.
DOI : 10.1109/32.666828
Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, vol.32, issue.2, pp.136-156, 2006.
DOI : 10.1016/j.parco.2005.07.004
URL : https://hal.archives-ouvertes.fr/hal-00358623
A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988.
DOI : 10.1109/12.9736
DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design, Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 32, pp.196-207, 1999.
A message system supporting fault tolerance, ACM SIGOPS Operating Systems Review, vol.17, issue.5, pp.90-99, 1983.
DOI : 10.1145/773379.806617
Fine-Grained Multithreading for the Multifrontal $QR$ Factorization of Sparse Matrices, SIAM Journal on Scientific Computing, vol.35, issue.4, pp.323-345, 2013.
DOI : 10.1137/110846427
URL : https://hal.archives-ouvertes.fr/hal-01122471
Preventive migration vs. preventive checkpointing for extreme scale supercomputers, Parallel Processing Letters, pp.111-132, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00945068
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, ACM SIGPLAN Notices, vol.48, pp.167-176, 2013.
The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, pp.1:1-1:25, 2011.
A Survey of Rollback-recovery Protocols in Message-passing Systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
The performance of consistent checkpointing, Proceedings of the 11th Symposium on Reliable Distributed Systems, pp.39-47, 1992.
DOI : 10.1109/RELDIS.1992.235144
On the Complexity of Sparse $QR$ and $LU$ Factorization of Finite-Element Matrices, SIAM Journal on Scientific and Statistical Computing, vol.9, issue.5, pp.849-861, 1988.
DOI : 10.1137/0909057
Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing, Parallel Computing, vol.49, pp.117-135
DOI : 10.1016/j.parco.2015.07.003
Fault-tolerant high-performance matrix multiplication: theory and practice, Proceedings International Conference on Dependable Systems and Networks, pp.47-56, 2001.
DOI : 10.1109/DSN.2001.941390
Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, vol.49, issue.6, pp.409-436, 1952.
DOI : 10.6028/jres.049.044
Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.33, pp.518-528, 1984.
Resilience for multigrid software at the extreme scale, 1506.
Recent Advances and New Avenues in Hardware-Level Reliability Support, IEEE Micro, vol.25, issue.6, pp.2518-2547, 2005.
DOI : 10.1109/MM.2005.119
Sender-based message logging, 1987.
Recovery Patterns for Iterative Methods in a Parallel Unstable Environment, SIAM Journal on Scientific Computing, vol.30, issue.1, pp.102-116, 2007.
DOI : 10.1137/040620394
CATCH: compiler-assisted techniques for checkpointing, FTCS-20, Digest of Papers, 20th International Symposium on Fault-Tolerant Computing, pp.74-81, 1990.
An optimal checkpoint/restart model for a large scale high performance computing system, Parallel and Distributed Processing, pp.1-910, 2008.
Dealing with faults in HPC systems, PMAA-International Workshop on Parallel Matrix Algorithms and Applications, 2014.
Error detection by duplicated instructions in superscalar processors, IEEE Transactions on Reliability, vol.51, issue.1, pp.63-75, 2002.
Solution of Sparse Indefinite Systems of Linear Equations, SIAM Journal on Numerical Analysis, vol.12, issue.4, pp.617-629, 1975.
DOI : 10.1137/0712047
Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing, Journal of Parallel and Distributed Computing, vol.43, issue.2, pp.125-138, 1997.
DOI : 10.1006/jpdc.1997.1336
An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance, 1997.
ICKP: a consistent checkpointer for multicomputers, IEEE Parallel & Distributed Technology: Systems & Applications, vol.2, issue.2, pp.62-67
Reliability Analysis in HPC clusters, Proceedings of the High Availability and Performance Computing Workshop, 2006.
A Flexible Inner-Outer Preconditioned GMRES Algorithm, SIAM Journal on Scientific Computing, vol.14, issue.2, pp.461-469, 1993.
DOI : 10.1137/0914028
Iterative Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, 2003.
DOI : 10.1137/1.9780898718003
GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems, SIAM Journal on Scientific and Statistical Computing, vol.7, issue.3, pp.856-869, 1986.
DOI : 10.1137/0907058
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance, 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
DOI : 10.1109/IPDPS.2005.157
Reduced Triple Modular redundancy for built-in self-repair in VLIW-processors, Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2007, pp.21-26, 2007.
DOI : 10.1109/SPA.2007.5903294
Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations, 1996.
Toward Local Failure Local Recovery Resilience Model using MPI-ULFM, Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA '14, pp.51-51
DOI : 10.1145/2642769.2642774
Transient-fault recovery using simultaneous multithreading, Proceedings of the 29th Annual International Symposium on Computer Architecture, pp.87-98, 2002.
Probabilistic logics and the synthesis of reliable organisms from unreliable components, Automata Studies, pp.43-98, 1956.
Hybrid full/incremental checkpoint/restart for MPI jobs in HPC environments, 2009.
A fault tolerant approach to microprocessor design, Proceedings International Conference on Dependable Systems and Networks, pp.411-420, 2001.
DOI : 10.1109/DSN.2001.941425
A statistical distribution function of wide applicability, Journal of Applied Mechanics, vol.18, pp.293-297, 1951.