Conference Papers, Year: 2015

Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead

Bogdan Nicolae

Abstract

Dumping large amounts of related data simultaneously to local storage devices instead of a parallel file system is a frequent I/O pattern of HPC applications running at large scale. Since local storage resources are prone to failures and have limited potential to serve multiple requests in parallel, techniques such as replication are often used to enable resilience and high availability. However, replication introduces overhead, both in terms of the network traffic necessary to distribute replicas and in terms of extra storage space requirements. To reduce this overhead, state-of-the-art techniques often apply redundancy elimination (e.g., compression or deduplication) before replication, ignoring the natural redundancy that is already present. By contrast, this paper proposes a novel scheme that treats redundancy elimination and replication as a single co-optimized phase: remotely duplicated data is detected and directly leveraged to maintain a desired replication factor by keeping only as many replicas as needed and adding more if necessary. In this context, we introduce a series of high-performance algorithms specifically designed to operate under tight and controllable constraints at large scale. We present how this idea can be leveraged in practice and demonstrate its viability for two real-life HPC applications.
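
To make the co-optimization idea concrete, the following is a minimal, centralized Python sketch and not the paper's algorithms, which are designed to run in a distributed fashion at large scale. It assumes each node's output is already split into chunks and that identical chunks can be recognized by a content hash. The names `fingerprint` and `plan_replication`, the SHA-256 choice, and the least-loaded placement rule are all illustrative assumptions. For a replication factor `k`, a chunk that already exists on `k` or more nodes needs no network transfer at all, and only chunks with fewer natural copies are sent to additional nodes.

```python
import hashlib
from collections import defaultdict


def fingerprint(chunk: bytes) -> str:
    """Content hash used to recognize identical chunks (assumed: SHA-256)."""
    return hashlib.sha256(chunk).hexdigest()


def plan_replication(node_chunks, k):
    """Plan which chunk copies to keep and which to create.

    node_chunks: dict mapping a node id to the list of byte chunks it dumps.
    k: desired replication factor (number of distinct nodes per chunk).
    Returns (keep, send): keep maps node id -> set of fingerprints stored
    there; send is a list of (source, destination, fingerprint) transfers.
    """
    nodes = sorted(node_chunks)
    if k < 1 or k > len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")

    # Step 1: detect natural redundancy, i.e. which nodes already hold each chunk.
    holders = defaultdict(list)
    for node in nodes:
        for chunk in node_chunks[node]:
            fp = fingerprint(chunk)
            if node not in holders[fp]:
                holders[fp].append(node)

    keep = {node: set() for node in nodes}
    load = {node: 0 for node in nodes}
    send = []

    # Step 2: keep at most k natural copies and add copies only where missing.
    for fp, owners in holders.items():
        kept = owners[:k]  # surplus natural copies are simply not stored
        for node in kept:
            keep[node].add(fp)
            load[node] += 1
        missing = k - len(kept)
        if missing > 0:
            # Hypothetical placement rule: prefer the least loaded other nodes.
            candidates = sorted(
                (n for n in nodes if n not in kept),
                key=lambda n: (load[n], n),
            )
            for dst in candidates[:missing]:
                send.append((kept[0], dst, fp))
                keep[dst].add(fp)
                load[dst] += 1
    return keep, send
```

With three nodes holding the chunks `A, B`, `A, C`, and `A`, and `k = 2`, this plan transfers only `B` and `C`, because `A` is already present on enough nodes. Replicating every chunk of every node to one partner without looking at content would instead transfer five chunks.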
Origin : Files produced by the author(s)

Dates and versions

hal-01115700, version 1 (11-02-2015)

Cite

Bogdan Nicolae. Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead. IPDPS '15: 29th IEEE International Parallel and Distributed Processing Symposium, May 2015, Hyderabad, India. ⟨10.1109/IPDPS.2015.82⟩. ⟨hal-01115700⟩