Global Resource Management for High Availability and Performance in a DSM-based Cluster
Abstract
High availability and performance are two desirable properties for the execution of long-running parallel scientific applications on clusters based on software distributed shared memory (DSM). Global resource management in the operating system is one way to achieve these properties. To illustrate this approach, we present a system that integrates a page-based shared virtual memory and a parallel file system for the global management of memory and disk resources. The main design issues are fault tolerance and the optimization of disk accesses in the context of a single-level storage system.