Lower Bounds for the Duration of Decommission Operations with Relaxed Fault Tolerance in Replication-based Distributed Storage Systems

Abstract : Efficient resource utilization is a major concern for large-scale computer platforms. One method used to lower energy consumption and operational cost is to reduce the amount of idle resources. This can be achieved by using malleability, namely, the possibility for resource managers to dynamically increase or decrease the amount of resources of jobs while they are running. Decommissioning (removing from the cluster) the idle nodes as soon as possible allows the resource manager to quickly reallocate the nodes to other jobs. Challenges appear when such nodes host part of a distributed storage system. Indeed, distributed storage systems need to transfer large amounts of data during decommission in order to ensure data availability and a constant level of fault tolerance. In this paper, we explore the possibility of relaxing the level of fault tolerance during the decommission in order to reduce the amount of data transfers needed before nodes are released, and thus return nodes to the resource manager faster. We quantify theoretically how much time and resources are saved by such a fast decommission strategy compared with a standard decommission. We establish lower bounds for the duration of the different phases of a fast decommission. We use the lower bounds to estimate when the fast decommission would be useful to reduce the usage of core-hours. We implement a prototype of a fast decommission mechanism. Using the prototype, we validate the lower bounds on the duration of the operation and confirm the findings about the core-hour usage.

https://hal.archives-ouvertes.fr/hal-01943964
Contributor : Nathanaël Cheriere
Submitted on : Thursday, December 6, 2018 - 11:21:04 AM
Last modification on : Thursday, February 7, 2019 - 4:11:17 PM

File

Report.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01943964, version 2

Citation

Nathanaël Cheriere, Matthieu Dorier, Gabriel Antoniu. Lower Bounds for the Duration of Decommission Operations with Relaxed Fault Tolerance in Replication-based Distributed Storage Systems. [Research Report] RR-9229, Inria Rennes - Bretagne Atlantique. 2018, pp.1-28. ⟨hal-01943964v2⟩
