Modeling and Tolerating Heterogeneous Failures on Large Parallel Systems

Eric Heien; Derrick Kondo; Ana Gainaru; Dan Lapine; Bill Kramer; Franck Cappello

Conference paper. Year: 2011

Eric Heien

Role: Author
PersonId : 883116

Middleware efficiently scalable

Derrick Kondo

Role: Author

Middleware efficiently scalable

Ana Gainaru

Role: Author

Department of Computer Science [UIUC]

Dan Lapine

Role: Author

Department of Computer Science [UIUC]

Bill Kramer

Role: Author

Department of Computer Science [UIUC]

Franck Cappello

Role: Author

Global parallel and distributed computing

Department of Computer Science [UIUC]

Abstract

As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components of such systems (such as memory, disk, or network) can have different failure rates. Prior work assumes that failures affect all applications equally, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time. To this end, we study five years of system logs from a production high-performance computing system and model hardware failures involving processors, memory, storage, and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates than general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.
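
The idea of tuning checkpointing to an application-specific failure model can be illustrated with a minimal sketch. It is not the authors' method: it assumes independent, exponentially distributed component failures (so rates simply add) and uses Young's first-order approximation for the checkpoint interval, which is a stand-in for whatever models the paper fits. All names, rates, and the checkpoint cost below are hypothetical.

```python
import math


def combined_failure_rate(component_rates, used_components):
    """Sum the failure rates (failures/hour) of the components an application uses.

    Assumption: failures are independent and exponentially distributed,
    so the rates of the components in use add up.
    """
    return sum(component_rates[name] for name in used_components)


def young_checkpoint_interval(checkpoint_cost_hours, failure_rate_per_hour):
    """Young's first-order approximation: interval = sqrt(2 * C * MTBF)."""
    mtbf = 1.0 / failure_rate_per_hour
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf)


# Hypothetical per-component failure rates (failures per hour); not from the paper.
rates = {"cpu": 0.001, "memory": 0.002, "storage": 0.004, "network": 0.003}

# General model: every component failure is assumed to affect the application.
general_rate = combined_failure_rate(rates, rates.keys())

# Tailored model: a hypothetical compute-bound application that only
# depends on processors and memory.
tailored_rate = combined_failure_rate(rates, ["cpu", "memory"])

checkpoint_cost = 0.1  # hypothetical cost of one checkpoint, in hours

general_interval = young_checkpoint_interval(checkpoint_cost, general_rate)
tailored_interval = young_checkpoint_interval(checkpoint_cost, tailored_rate)

print(f"general model:  checkpoint every {general_interval:.2f} h")
print(f"tailored model: checkpoint every {tailored_interval:.2f} h")
```

Under these assumptions, the tailored model sees a lower failure rate than the general one, so the application can checkpoint less often and spend less time on checkpoint overhead.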

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Contributor: Arnaud Legrand

https://inria.hal.science/hal-00788786

Submitted on: Friday, February 15, 2013, 11:16:24

Last modified on: Thursday, April 4, 2024, 21:15:25

Dates and versions

hal-00788786 , version 1 (15-02-2013)

Identifiers

HAL Id : hal-00788786 , version 1

Cite

Eric Heien, Derrick Kondo, Ana Gainaru, Dan Lapine, Bill Kramer, et al. Modeling and Tolerating Heterogeneous Failures on Large Parallel Systems. IEEE/ACM Supercomputing Conference (SC), 2011, Seattle, United States. pp.1-11. ⟨hal-00788786⟩

Collections

EC-PARIS UNIV-RENNES1 UNIV-LILLE3 UGA CNRS INRIA IRISA LIG UMR8623 INRIA2 UR1-MATH-STIC UNIV-PARIS-SACLAY UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM LIG_SIDCH

180 views

0 downloads
