Repair Time in Distributed Storage Systems

Abstract: In this paper, we analyze a highly distributed backup storage system realized by means of nano datacenters (NaDa). NaDa have recently been proposed as a way to mitigate the growing energy, bandwidth and device costs of traditional data centers, following the popularity of cloud computing. These service provider-controlled peer-to-peer systems take advantage of resources already committed to always-on set-top boxes, the fact that they do not incur heat dissipation costs, and their proximity to users. In such systems, redundancy is introduced to preserve the data in case of peer failures or departures. To ensure long-term fault tolerance, the storage system must have a self-repairing service that continuously reconstructs the redundancy fragments that are lost. The speed of this reconstruction process is crucial for data survival. It is mainly determined by how much bandwidth, a critical resource of such systems, is available. In the literature, the reconstruction times are modeled as independent (e.g., Poissonian, deterministic, or more generally following any distribution). In practice, however, numerous reconstructions start at the same time (when the system detects that a peer has failed). Consequently, they are correlated with each other, because concurrent reconstructions compete for the same bandwidth. This correlation negatively impacts the efficiency of bandwidth utilization and hence the repair time. We propose a new analytical framework that takes this correlation into account when estimating the repair time and the probability of data loss. Specifically, we introduce a queuing model in which reconstructions are served by peers at a rate that depends on the available bandwidth. We show that the load is unbalanced among peers (young peers inherently store less data than old ones). This leads us to introduce a correcting factor on the repair rate of the system.
The proposed models and schemes are validated by mathematical analysis, an extensive set of simulations, and experimentation on the GRID5000 test-bed platform. This new model allows system designers to choose system parameters more accurately as a function of their target data durability.
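The contrast the abstract draws between independent and correlated reconstructions can be illustrated with a toy calculation. The sketch below is not the paper's model: the function names, the random assignment of reconstructions to peers, the first-in-first-out service order, and all parameter values are assumptions chosen only for illustration. It compares the completion times obtained when every reconstruction is treated as if it had a peer's bandwidth to itself with those obtained when reconstructions triggered by the same failure queue at the peers that serve them.

```python
import random


def repair_times_independent(n_blocks, block_mb, bw_mb_per_s):
    """Independent model (illustrative): every reconstruction is assumed to
    run alone at full bandwidth, so each one takes block_mb / bw_mb_per_s."""
    return [block_mb / bw_mb_per_s] * n_blocks


def repair_times_queued(n_blocks, n_peers, block_mb, bw_mb_per_s, rng):
    """Queued model (illustrative): all reconstructions start at the same
    time, each is assigned to a uniformly random peer, and each peer serves
    its own queue one block at a time (FIFO)."""
    service_time = block_mb / bw_mb_per_s
    queue_len = [0] * n_peers
    finish_times = []
    for _ in range(n_blocks):
        peer = rng.randrange(n_peers)
        queue_len[peer] += 1
        # The k-th block queued at a peer finishes after k service times.
        finish_times.append(queue_len[peer] * service_time)
    return finish_times


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    independent = repair_times_independent(100, 2.0, 0.5)
    queued = repair_times_queued(100, 10, 2.0, 0.5, rng)
    print("independent, last block done after (s):", max(independent))
    print("queued, last block done after (s):", max(queued))
```

In this toy setting the last reconstruction finishes later under queuing whenever some peer receives more than one block, and uneven queue lengths lengthen it further. That is only a rough analogue of the bandwidth competition and load imbalance the paper analyzes.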
Document type :
Conference papers

https://hal.inria.fr/hal-00866058
Contributor : Frédéric Giroire
Submitted on : Wednesday, September 25, 2013 - 5:23:42 PM
Last modification on : Monday, November 5, 2018 - 3:36:03 PM
Long-term archiving on : Friday, April 7, 2017 - 2:46:28 AM

Citation

Frédéric Giroire, Sandeep Kumar Gupta, Remigiusz Modrzejewski, Julian Monteiro, Stéphane Perennes. Repair Time in Distributed Storage Systems. 6th International Conference on Data Management in Cloud, Grid and P2P Systems (Globe 2013), Aug 2013, Prague, Czech Republic. pp.99-110, ⟨10.1007/978-3-642-40053-7_9⟩. ⟨hal-00866058⟩
