Conference papers

HPC-SFI: System-Level Fault Injection for High Performance Computing Systems

Abstract : Resilience, or fault tolerance, has become a key challenge for large-scale parallel systems. To ensure the reliability of high performance computing (HPC) systems, various techniques have been proposed, such as hardware-level fault tolerance, checkpointing, replication, and algorithm-based fault tolerance. Many software systems also monitor and handle system failures, e.g. the management and job-scheduling systems of HPC systems. Evaluating the effectiveness of these systems requires a tool that can inject failures into an HPC system. This paper proposes HPC-SFI, a system-level fault injection tool for HPC systems. HPC-SFI can generate three kinds of system failures in an HPC system: in-node faults, failures in the interconnection network, and failures of the storage/parallel file system. In addition, HPC-SFI can inject system faults pseudo-randomly according to pre-defined parameters and probabilities. Preliminary experimental results demonstrate the effectiveness of the tool.

Contributor : Hal Ifip
Submitted on : Thursday, September 5, 2019 - 1:31:31 PM
Last modification on : Thursday, September 5, 2019 - 1:35:32 PM
Long-term archiving on : Thursday, February 6, 2020 - 6:10:36 AM




Distributed under a Creative Commons Attribution 4.0 International License



Yanqi Wang, Qi Zhang, Yi Liu, Depei Qian. HPC-SFI: System-Level Fault Injection for High Performance Computing Systems. 15th IFIP International Conference on Network and Parallel Computing (NPC), Nov 2018, Muroran, Japan. pp.103-113, ⟨10.1007/978-3-030-05677-3_9⟩. ⟨hal-02279558⟩



