Logging : Pessimistic, Optimistic, Causal, and Optimal, IEEE Transactions on Software Engineering, vol.24, issue.2, pp.149-159, 1998. ,
DOI : 10.1109/32.666828
Analysis of CommunicationInduced Checkpointing, FTCS '99 : Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, p.242, 1999. ,
NERSC-6 Workload Analysis and Benchmark Selection Process, 2008. ,
DOI : 10.2172/938789
URL : https://digital.library.unt.edu/ark:/67531/metadc896561/m2/1/high_res_d/938789.pdf
, The Landscape of Parallel Computing Research : A View from Berkeley, 2006.
, The Sequoia Benchmarks, 2009.
, The NAS parallel benchmarks-summary and preliminary results », Supercomputing '91 : Proceedings of the 1991 ACM/IEEE conference on Supercomputing, pp.158-165, 1991.
« Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach, Proceedings of the Seventh Symposium on Reliable Distributed Systems, pp.3-12, 1988. ,
« Scalable causal message logging for wide-area environments, Concurrency and Computation : Practice and Experience, vol.15, issue.10, pp.873-889, 2003. ,
DOI : 10.1007/3-540-44681-8_120
Message System Supporting Fault Tolerance, SIGOPS Operating Systems Review, vol.17, issue.5, pp.90-99, 1983. ,
DOI : 10.1145/800217.806617
« Redesigning the Message Logging Model for High Performance, Concurrency and Computation : Practice and Experience, vol.22, pp.2196-2211, 2010. ,
« Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Euro-Par 2011 Parallel Processing, vol.6853, pp.51-64, 2011. ,
, Correlated Set Coordination in Fault Tolerant Message Logging Protocols », pp.51-64, 2011.
« Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery, IEEE International Conference on Cluster Computing, 2009. ,
Distributed Domino Effect Free Recovery Algorithm, IEEE International Symposium on Reliability, Distributed Softwares, and Databases, pp.207-215, 1984. ,
« Fault Tolerance in Petascale/ Exascale Systems : Current Knowledge, Challenges and Research Opportunities, International Journal of High Performance Computing Applications, vol.23, pp.212-226, 2009. ,
, On Communication Determinism in Parallel HPC Applications », 19th International Conference on Computer Communications and Networks (ICCCN 2010), 2010.
Snapshots : Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985. ,
« A model for predicting the optimum checkpoint interval for restart dumps, Proceedings of the 2003 international conference on Computational science, ICCS'03, pp.3-12, 2003. ,
« A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, pp.303-312, 2006. ,
« How to Recover Efficiently and Asynchronously when Optimism Fails, International Conference on Distributed Computing systems, pp.108-115, 1996. ,
System Resilience at Extreme Scale, DARPA, 2008. ,
Survey of Rollback-Recovery Protocols in Message-Passing Systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002. ,
Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit, IEEE Transactions on Computers, vol.41, issue.5, pp.526-531, 1992. ,
« Group-based Coordinated Checkpointing for MPI : A Case Study on InfiniBand, Proceedings of the 2007 International Conference on Parallel Processing, ICPP '07, p.47, 2007. ,
« Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications, 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS2011), Anchorage, 2011. ,
Recovery Scheme for Cluster Federations Using Senderbased Message Logging, Journal of Computing and Information Technology, vol.19, issue.2, pp.127-139, 2011. ,
« Domino-effect free crash recovery for concurrent failures in cluster federation, Proceedings of the 3rd international conference on Advances in grid and pervasive computing, GPC'08, pp.4-17, 2008. ,
Group-Based Checkpoint/Restart for LargeScale Message-Passing Systems », 22nd IEEE International Parallel and Distributed Processing Symposium, 2008. ,
Based Message Logging, The 17th Annual International Symposium on Fault-Tolerant Computing, pp.14-19, 1987. ,
Exascale Computing Study : Technology Challenges in Achieving Exascale Systems, 2008. ,
« A 14.6 billion degrees of freedom, 5 teraflops, 2.5 terabyte earthquake simulation on the Earth Simulator, pp.4-11, 2003. ,
Checkpointing and Rollback-Recovery for Distributed Systems, ACM Fall joint computer conference, ACM, vol.86, pp.1150-1158, 1986. ,
Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM, vol.21, pp.558-565, 1978. ,
Crash Recovery in a Distributed Data Storage System, 1979. ,
, Team-based Message Logging : Preliminary Results », 3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, 2010.
, Message Passing Interface Forum, « MPI : A Message-Passing Interface Standard, 1995.
Checkpointing for Parallel Applications in Cluster Federations, Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGRID'04), pp.773-782, 2004. ,
URL : https://hal.archives-ouvertes.fr/inria-00000990
Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pp.1-11, 2010. ,
« Using Time to Improve the Performance of Coordinated Checkpointing, Proceedings of the International Computer Performance & Dependability Symposium, pp.282-291, 1996. ,
, deling the Impact of Checkpoints on Next-Generation Systems », MSST '07 : Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, pp.30-46, 2007.
, System structure for software fault tolerance », Proceedings of the international conference on Reliable software, pp.437-449, 1975.
On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications, Euro-Par, issue.1, pp.567-578, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-00786558
, Workshop on Resiliency in High Performance Computing, p.5, 2008.
Optimistic Message Logging for Reliable Execution of MPI Applications, 15th International Euro-Par Conference, pp.615-626, 2009. ,
URL : https://hal.archives-ouvertes.fr/inria-00424002
Optimistic and Distributed Message Logging for MessagePassing Applications, Concurrency and Computation : Practice and Experience, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-00727470
« Efficient Distributed Recovery Using Message Logging, Proceedings of the eighth annual ACM Symposium on Principles of distributed computing (PODC '89), pp.223-238, 1989. ,
« Minimizing Timestamp Size for Completely Asynchronous Optimistic Recovery with Minimal Rollback, Proceedings of the 15th Symposium on Reliable Distributed Systems (SRDS '96), p.66, 1996. ,
, Distributed Systems, vol.3, pp.204-226, 1985.
« Error Recovery in Multicomputers Using Global Checkpoints, International Conference on Parallel Processing, pp.32-41, 1984. ,
, Distributed Recovery Units : An Approach for Hybrid and Adaptive Distributed Recovery, 1993.
, www.top500.org, « TOP500 List of Worlds Supercomputers, 2011.
« Trading Off Logging Overhead and Coordinating Overhead to Achieve Efficient Rollback Recovery, Concurrency and Computation : Practice and Experience, vol.21, pp.819-853, 2009. ,