The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems

Bahman Javadi; Derrick Kondo; Alexandru Iosup; Dick Epema

doi:10.1016/j.jpdc.2013.04.002

Journal article, Journal of Parallel and Distributed Computing, Year: 2013


Bahman Javadi

Role: Author

School of Computing, Engineering and Mathematics [Sydney]

Derrick Kondo

Role: Author

Middleware efficiently scalable

Alexandru Iosup

Role: Author

Parallel and Distributed Group

Delft University of Technology

Dick Epema

Role: Author

Parallel and Distributed Group

Abstract

With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA), an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular of the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss the use of the FTA for various current and future purposes. Second, after applying the toolbox to nine failure traces collected from distributed systems used in various application domains (e.g., HPC, Internet operation, and various online applications), we present a comparative analysis of failures in various distributed systems. Our analysis presents various statistical insights and typical statistical modeling results for the availability of individual resources in various distributed systems. The analysis results underline the need for public availability of trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can result in different conclusions for failure modeling and job scheduling in distributed systems. Our results for different interpretations show evidence that there may be a need for further revisiting existing failure-aware algorithms, when applied for general rather than for domain-specific distributed systems.

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Contributor: Arnaud Legrand

https://inria.hal.science/hal-00925098

Submitted on: Tuesday, January 7, 2014, 15:21:03

Last modified on: Thursday, April 4, 2024, 21:14:40

Dates and versions

hal-00925098, version 1 (07-01-2014)

Identifiers

HAL Id: hal-00925098, version 1
DOI: 10.1016/j.jpdc.2013.04.002

Cite

Bahman Javadi, Derrick Kondo, Alexandru Iosup, Dick Epema. The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems. Journal of Parallel and Distributed Computing, 2013, 73 (8), pp. 1208-1223. ⟨10.1016/j.jpdc.2013.04.002⟩. ⟨hal-00925098⟩


Collections

UGA CNRS INRIA LIG GRID5000 INRIA2 SILECS LIG_SIDCH

