Operating System Support for a Network of Workstations

Trivikram Bhat

Department of Computer Science and Engineering

University of Texas at Arlington

Arlington, TX 76019

April 23, 1999

Abstract

Recent developments in the field of networking and communications have led to the popularity and growth of networks of workstations (NOWs). This growth has been driven by advances in low-latency, high-bandwidth switches and high-speed LANs, along with a narrowing of the performance gap between workstations and supercomputers.

This paper focuses on an operating system layer built on top of a network of workstations that provides features such as transparent remote execution, dynamic load balancing and backward compatibility for existing applications. It examines GLUnix (Global Layer Unix), developed for a network of workstations by a group at the University of California at Berkeley as part of the Berkeley NOW project, and describes its architecture, implementation and features.

The paper also discusses the important issues concerning the distribution of resources across the network. Proportional-share scheduling is used to distribute the available resources among the clients. This is a fair-share approach that does not rely on client priorities and aims at effective scheduling of mixed workloads, i.e. interactive, sequential and parallel jobs, so that every user receives a fair share of the resources.

1. Introduction

The paper begins with an introduction and then explains the need for a NOW. This is followed by the goals, the architecture and the abstractions of GLUnix. It then describes the GLUnix cluster and some other implementation details, and finally discusses the concept of proportional-share scheduling.

Historically, three basic computing models have supported the different classes of applications: workstations were used for low-demand interactive jobs; supercomputers and mainframes for high-demand batch processing; and massively parallel processors (MPPs) for parallel computation. Workstations were rarely used for large-scale parallel computation, but this is now changing. Due to recent developments in networking, it has become much easier and faster to access data on another system than to read it from a local disk. This has led to the development of networks of workstations.

The primary challenge lies in developing the system software needed to manage such a network. The software should ideally support transparent remote execution; interactive, sequential and parallel jobs; and dynamic load balancing. The Berkeley NOW project began with a search for a system that provided these features, but no existing system offered a complete solution. The systems available at the time included LSF [1] and Utopia [2], which supported remote execution; PVM [3], which supported parallel jobs; and Condor [4] and Sprite [5], which could dynamically balance the load. GLUnix was designed to incorporate all of the features listed above. As it currently stands, GLUnix runs on a 100-node cluster of Sun UltraSparcs, maintains most of the Unix I/O semantics, balances load at startup and tolerates single-node faults. The discussion centers on GLUnix but applies to any system layer trying to exploit network resources.

The paper also discusses how such a system layer manages network resources and distributes them effectively among the different nodes. Pooling resources has the advantage that unused memory, processors and disks can be used by resource-intensive applications. The scheduling technique used here is based on fairness rather than on priorities. In this distributed environment many different jobs compete for resources, and a basic proportional-share scheduler alone is not sufficient; it must be extended to handle interactive, sequential and parallel jobs.

The overall idea is that there are a number of resources and a number of clients. Each client is given a certain number of tickets, and resources are allocated in proportion to those tickets. Every client also has an interval called its stride, which is inversely proportional to its ticket allocation. Each client may in turn have a number of users, and currencies allow the client to distribute its tickets among those users; the users then manage the distribution of their resources to individual processes.
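To make the ticket and stride mechanism concrete, the following C fragment is a minimal sketch of proportional-share (stride) scheduling. The client names, ticket counts and the stride constant are illustrative assumptions, not values or code taken from GLUnix: each quantum goes to the client with the smallest pass value, whose pass then advances by its stride, so over time the quanta received are proportional to the tickets held.

#include <stdio.h>

#define STRIDE1  (1 << 20)   /* large constant: stride = STRIDE1 / tickets */
#define NCLIENTS 3
#define QUANTA   100

struct client {
    const char *name;
    int  tickets;   /* share of the resource this client is entitled to */
    long stride;    /* interval between selections: STRIDE1 / tickets    */
    long pass;      /* virtual time index; the lowest pass runs next     */
    int  quanta;    /* quanta actually received (for reporting)          */
};

int main(void)
{
    struct client c[NCLIENTS] = {
        { "A", 100 }, { "B", 50 }, { "C", 25 }
    };

    for (int i = 0; i < NCLIENTS; i++) {
        c[i].stride = STRIDE1 / c[i].tickets;
        c[i].pass   = c[i].stride;
        c[i].quanta = 0;
    }

    /* Allocate QUANTA time quanta: each quantum goes to the client with
     * the minimum pass value, whose pass then advances by its stride.  */
    for (int q = 0; q < QUANTA; q++) {
        int min = 0;
        for (int i = 1; i < NCLIENTS; i++)
            if (c[i].pass < c[min].pass)
                min = i;
        c[min].quanta++;
        c[min].pass += c[min].stride;
    }

    /* Quanta received end up proportional to ticket counts (100:50:25). */
    for (int i = 0; i < NCLIENTS; i++)
        printf("client %s: %d tickets -> %d quanta\n",
               c[i].name, c[i].tickets, c[i].quanta);
    return 0;
}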

In the next sections we discuss the goals of GLUnix, the architectural alternatives that were considered in developing it, and the reasons behind the chosen architecture. We also look at stride scheduling in detail and at its extensions for interactive, sequential and parallel jobs.

2. Goals of GLUnix

2.1 History

The primary goal of the Berkeley NOW project was to support a mixture of parallel and sequential workloads. An early simulation study [6] of such workloads was carried out on a 32-node CM-5 at Los Alamos National Laboratory and a 70-node workstation cluster at U.C. Berkeley.

The main goals of the Berkeley NOW project were:

  • Transparent Remote Execution: Applications should not be aware that they are being executed on remote nodes. In GLUnix, a user may start a process on a particular node in the cluster, and the master may execute that process on another node of the cluster; the user remains unaware of this.
  • Load Balancing: The system should balance the load by finding the best location to execute each job. It should also migrate jobs dynamically to new nodes depending on the load.
  • Portability: The system should be easily portable to different architectures and operating systems.
  • Scheduling of Parallel Jobs: The system should efficiently schedule parallel jobs whose processes can execute concurrently. In GLUnix, every process of an N-way parallel program is given the same network process identifier (NPID), but each one has a different virtual node number (VNN). This is discussed in detail later.
  • Binary Compatibility: The system should meet all of these requirements without requiring existing applications to be modified or re-linked; it should execute existing applications with little or no modification.
  • Availability: The system should be available to users all the time, even when a node crashes; that is, it should tolerate faults.

2.2 Architecture of GLUnix

There were three options that were considered in designing the initial architecture for GLUnix.

They were:

  1. Build a new operating system.
  2. Modify an existing operating system.
  3. Build a new layer on top of an existing operating system.

Each of these options was carefully considered, and it was decided that a new layer would be built on top of an existing operating system.

The first option, building a new operating system from scratch, was eliminated because developing a new system is too time consuming and because past efforts of this kind had failed. Such systems are also difficult to port, as they often fail to keep up with advancing hardware platforms.

The second option, modifying an existing operating system, had several benefits:

  • The implementation need not be constrained to services available at the user level.
  • If the existing system did not provide a sufficient level of abstraction, it was easy to modify it.
  • Existing applications would not have to be modified; they could be run directly on the existing operating system.

The second option was nevertheless rejected for the following reasons:

  • Changing the system kernel has a dramatic impact on the portability of the system.
  • It takes time to obtain the source of an existing operating system and to learn, modify and debug it.
  • Modifying an existing operating system also limits the ability to distribute it.

The solution was to build a global runtime environment at the user level. The choice of a user-level rather than a kernel-level implementation was based on the fact that a user-level system is easier to implement and debug. The drawback is that a user-level implementation depends on the functionality provided by the underlying operating system.

The system was implemented across two domains, the Unix domain and the GLUnix domain. This allowed a slow and steady transition from traditional Unix to the new GLUnix domain. Applications were not run in the GLUnix domain by default; a special request had to be made to run an application there.

2.3 High-Level Architecture

A number of options were also considered for the high-level architecture, the two main ones being a centralized design and a peer-to-peer design. As with every choice the team faced, each had strengths and weaknesses, and the decision had to weigh several factors. Peer-to-peer architectures have the advantage of improved performance and fault tolerance, while centralized designs are easier to implement and debug. Since the target cluster size was 100 nodes, a centralized master architecture was chosen as sufficient for the purpose.

2.4 Internal Architecture

Here too there were two options that were considered:

  • Multi-threaded architecture
  • Event-driven architecture

Multi-threaded architectures provide concurrency, which helps performance, and they make blocking operations easy to handle. Event-driven architectures, on the other hand, generally cannot block, but they are easier to develop and reason about, and race conditions can be avoided. The event-driven architecture was chosen for GLUnix.

GLUnix provides the following features:

  • Remote execution without complete transparency.
  • Load balancing without migration.
  • Co-scheduling of parallel jobs.
  • Execution of existing binaries.
  • Tolerance of single-node failures.

In the next section we shall look into the GLUnix abstractions. This will give us some idea of how transparent remote execution is achieved in GLUnix.

3. GLUnix Abstractions

GLUnix provides globally unique network process identifiers (NPIDs). These are 32 bits long and are used to identify both sequential and parallel jobs. In an N-way parallel program consisting of N processes, each process has the same NPID, which identifies the program a particular process belongs to. To distinguish the processes within the program, GLUnix assigns each one a different virtual node number (VNN) at execution time. Thus, each process is identified by a pair <NPID, VNN>.
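As a rough illustration of this naming scheme (not GLUnix source code; the type widths other than the 32-bit NPID are assumptions), a process identity can be represented as a pair in which the NPID is shared by every process of a job and the VNN distinguishes them:

#include <stdint.h>
#include <stdio.h>

/* Illustrative types only: GLUnix NPIDs are 32-bit, cluster-wide unique. */
typedef uint32_t npid_t;   /* identifies the job (shared by all N processes) */
typedef uint16_t vnn_t;    /* identifies one process within the job          */

struct glunix_proc_id {
    npid_t npid;
    vnn_t  vnn;
};

/* Two processes belong to the same parallel program iff their NPIDs match. */
static int same_program(struct glunix_proc_id a, struct glunix_proc_id b)
{
    return a.npid == b.npid;
}

int main(void)
{
    /* A 3-way parallel job: one NPID, three distinct VNNs. */
    struct glunix_proc_id job[3] = {
        { 4711, 0 }, { 4711, 1 }, { 4711, 2 }
    };

    for (int i = 0; i < 3; i++)
        printf("<NPID=%u, VNN=%u> same program as VNN 0: %d\n",
               (unsigned)job[i].npid, (unsigned)job[i].vnn,
               same_program(job[i], job[0]));
    return 0;
}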

GLUnix also supports the Single Program Multiple Data (SPMD) model of parallel programming. This allows multiple copies of the same program to run on multiple nodes of the cluster.
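A hedged sketch of the SPMD style follows. The my_vnn() and num_procs() helpers are hypothetical stand-ins for whatever calls the GLUnix library actually provides to report a process's VNN and the job's degree of parallelism; here they read environment variables so the example runs on its own. Every copy executes the same code but works on its own slice of the data.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for library calls reporting a process's virtual
 * node number and the degree of parallelism; faked with environment
 * variables so the sketch is self-contained. */
static int my_vnn(void)    { char *s = getenv("VNN");   return s ? atoi(s) : 0; }
static int num_procs(void) { char *s = getenv("NPROC"); return s ? atoi(s) : 1; }

#define N 1000

int main(void)
{
    int vnn = my_vnn(), nproc = num_procs();

    /* Every copy of the program runs this same code (Single Program),
     * but each one works on its own slice of the data (Multiple Data). */
    long local_sum = 0;
    for (int i = vnn; i < N; i += nproc)
        local_sum += i;

    printf("process %d of %d: partial sum = %ld\n", vnn, nproc, local_sum);
    /* A real SPMD program would now combine the partial results,
     * e.g. with a message-passing reduction. */
    return 0;
}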

4. GLUnix Implementation

4.1 GLUnix Cluster

In building GLUnix, two primary assumptions were made: a shared file system and a homogeneous cluster. The GLUnix cluster uses an NFS-mounted file system for this purpose. A single cluster can include SPARCstation 10s, 20s and UltraSPARCs running Solaris 2.3, but it cannot include both SPARC and Intel x86 machines. The cluster consists of the following three components:

  • GLUnix Master – It stores and maintains the cluster state and is responsible for resource allocation, deciding where particular user jobs should be executed. It maintains connections with the startup processes and with the daemons located on the other nodes of the cluster.
  • GLUnix Daemon – It is located on each node other than the one running the master. It is responsible for collecting local load information, i.e. the load of the node on which it resides, and sending that information to the master.
  • GLUnix Library – It implements the GLUnix API, providing a way for applications to request services.

As an example, consider a cluster in which the master is located on the Shakespeare node and daemons run on the Hamlet, Macbeth, Romeo and Othello nodes. A user on Hamlet has started job A, a parallel job consisting of three processes placed on Macbeth, Romeo and Othello. Another user on Macbeth has started a sequential job B, which the master has placed on Othello. In this way the master selects lightly loaded nodes in the cluster on which to place jobs.

4.2 Internal Structure of the Master, Daemon and Library

The master, the daemon and the library all share the same internal structure but behave in different ways. The primary function of each module is described below and is the same in all three. GLUnix events consist of both messages and signals, with the main event loop located in the Comm module. When a message arrives at a node, the Msg module unpacks it into a local data structure and invokes the appropriate message handler; the Msg module is also responsible for packing outgoing messages, which are sent by the Comm module. Signals are handled by the event loop, which invokes the appropriate event handler.
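The sketch below shows a single-threaded, event-driven loop of this general shape; it is an assumed structure for illustration, not GLUnix code. One file descriptor stands in for the Comm module's sockets, a single handle_message() stands in for the Msg module's handler dispatch, and a signal becomes an event processed by the same loop.

#include <stdio.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;
static void on_signal(int sig) { (void)sig; got_signal = 1; }

/* Stands in for the Msg module: "unpack" the message and act on it. */
static void handle_message(const char *buf, ssize_t len)
{
    printf("handler: %.*s", (int)len, buf);
}

int main(void)
{
    signal(SIGINT, on_signal);

    for (;;) {
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(STDIN_FILENO, &fds);      /* stands in for the comm sockets */

        if (select(STDIN_FILENO + 1, &fds, NULL, NULL, NULL) < 0) {
            if (got_signal) {            /* signal delivered: treat as event */
                printf("event loop: signal event handled\n");
                got_signal = 0;
                continue;
            }
            break;
        }

        if (FD_ISSET(STDIN_FILENO, &fds)) {
            char buf[256];
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0)
                break;                   /* peer closed: leave the loop      */
            handle_message(buf, n);      /* handlers must not block          */
        }
    }
    return 0;
}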

Given below is the description of each of the modules that are part of the internal structure.

Comm – It is implemented using the Berkeley Stream Sockets over TCP/IP. It is designed to be easily replaced by faster communication primitives, such as Active Messages [7].

Msg – It performs message packing and unpacking and also invokes handlers for incoming messages.

Ndb – It is the node database that keeps information regarding the states of the different nodes in the cluster.

Pdb – It is the process database that keeps track of the currently executing processes.

Debug – It supports multiple debug levels.

Timer – It schedules periodic events.

Sched – It implements the co-scheduling policy algorithm.

Nmgt – It is responsible for node management functions like collecting data from the various nodes that are in the cluster. It also keeps track of nodes that have failed.

Rexec – It performs the job of remote execution by placing the jobs of the users on different nodes.

Part – It manages the GLUnix partitions and reservations.

GLUnix handles faults in the following way:

The master maintains a connection with the startup process and with the daemon on each of the nodes. The Nmgt component of the master checks for broken connections to any of the daemons or startup processes. When the master identifies a broken connection to a daemon, it consults its Pdb for the list of active processes on the failed node. A sequential process is simply marked as killed; for a parallel process, all of the program's processes on the other nodes are marked as killed as well.
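A rough sketch of that cleanup step is shown below. The process-database layout and the handle_node_failure() helper are hypothetical and only mirror the behaviour just described: sequential processes on the failed node are marked killed, and for a parallel process every process sharing its NPID is marked killed as well.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum { MAXPROCS = 64 };

/* Hypothetical process-database entry kept by the master. */
struct pdb_entry {
    unsigned npid;      /* job identifier shared by all processes of a job */
    unsigned vnn;       /* which process of the job this is                */
    int      node;      /* node the process was placed on                  */
    bool     parallel;  /* part of an N-way parallel program?              */
    bool     killed;
};

static struct pdb_entry pdb[MAXPROCS];
static size_t pdb_count;

/* Called by node management (Nmgt) when the connection to `node` breaks. */
static void handle_node_failure(int node)
{
    for (size_t i = 0; i < pdb_count; i++) {
        if (pdb[i].node != node || pdb[i].killed)
            continue;

        if (!pdb[i].parallel) {
            pdb[i].killed = true;              /* sequential: kill just it */
            continue;
        }

        /* Parallel: mark every process sharing this NPID as killed,
         * whichever node it runs on. */
        for (size_t j = 0; j < pdb_count; j++)
            if (pdb[j].npid == pdb[i].npid)
                pdb[j].killed = true;
    }
}

int main(void)
{
    /* Job 1: 3-way parallel across nodes 2,3,4.  Job 2: sequential on node 2. */
    struct pdb_entry init[] = {
        { 1, 0, 2, true,  false }, { 1, 1, 3, true,  false },
        { 1, 2, 4, true,  false }, { 2, 0, 2, false, false },
    };
    pdb_count = sizeof init / sizeof init[0];
    for (size_t i = 0; i < pdb_count; i++) pdb[i] = init[i];

    handle_node_failure(3);    /* node 3's daemon connection is lost */

    for (size_t i = 0; i < pdb_count; i++)
        printf("npid %u vnn %u on node %d: %s\n", pdb[i].npid, pdb[i].vnn,
               pdb[i].node, pdb[i].killed ? "killed" : "running");
    return 0;
}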

Next we discuss transparent remote execution. Users run GLUnix programs from the command line. The user starts a process on one of the nodes, and this becomes the GLUnix startup process. It is linked against the GLUnix library, which provides the application programming interface, and sends a request to the master for remote execution. The master selects the N most lightly loaded nodes in the system to execute the job. The Rexec module of the master registers the new process with the Pdb and then sends an execution request to the Rexec module of each selected daemon. The daemon's Rexec module performs a fork() followed by an exec(). The daemons inform the master of any failures by returning an error code.
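The final step on the daemon side, forking a child, exec()ing the requested binary and reporting failure with an error code, can be sketched roughly as follows. This is not the actual Rexec code; the reply path to the master is simplified to a return value, and the daemon here simply waits for the child so the example is self-contained.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

/* Illustrative daemon-side remote execution: fork a child, exec the
 * requested program, and report an error code if anything fails.  A real
 * daemon would send the status back to the master over its socket. */
static int rexec_local(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return errno;                  /* fork failed: report to master */

    if (pid == 0) {                    /* child: become the user's job  */
        execvp(argv[0], argv);
        _exit(127);                    /* exec failed                   */
    }

    /* Parent (daemon): in GLUnix the daemon keeps running and watches the
     * child; here we simply wait so the sketch is self-contained. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return errno;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 127)
        return ENOENT;                 /* could not exec the binary     */
    return 0;
}

int main(void)
{
    char *job[] = { "echo", "hello from a remote node", NULL };
    int err = rexec_local(job);
    if (err)
        fprintf(stderr, "rexec failed: error %d\n", err);
    return err ? 1 : 0;
}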

5. GLUnix Usage

GLUnix has been in daily use by the Computer Science Division at the University of California at Berkeley since 1995, initially on about 35 Sparcstations. By 1996 the number of nodes had increased to 100, and by 1997 GLUnix had 130 users and had run over one million jobs.

GLUnix is used as a parallel compilation server and for parallel computation. It has also proved extremely useful for testing and system administration; for example, it was used for parallel stress testing of the Myrinet network, which revealed bugs that could not be reproduced using traditional Unix functionality such as rsh.

The parallel NOW-sort application holds the record for the fastest disk-to-disk sort. The record was set on the Datamation benchmark, an old industry standard that requires sorting one million 100-byte records as fast as possible. A 32-node NOW cluster sorted this data in 2.41 seconds, beating the previous record by more than a full second.

6. GLUnix Reservations

Initially GLUnix had no support for partitions or reservations. As the system became more heavily used, the need for a reservation system became apparent. At first, reservations were made by sending e-mail to a user distribution list, but as the number of users increased this became impractical.

A new reservation system was developed that works as follows:

The system associates three lists with each machine: aliases, users and owners. The alias list designates the set of names identifying each machine; the user list identifies the users who currently have permission to run applications on that machine; and the owner list specifies the individuals who have permission to reserve that machine. When a user wants to run an application, they can provide an alias to indicate the set of nodes on which the program should run; GLUnix then ensures that the program runs only on nodes for which the user has permission. When making a reservation, the user specifies the list of nodes on which the program should run, the time of the reservation and a name for the reservation. When the reservation time arrives, GLUnix filters the set of requested nodes by each machine's owner list, preventing users from reserving machines for which they do not have permission. GLUnix then assigns the reservation name as an alias to each selected machine, which allows users to run their applications only on the nodes they have reserved.
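A rough sketch of the owner-list filtering step is given below. The machine records, owner lists and may_reserve() helper are assumptions made for illustration, not the GLUnix data structures: each requested node is kept only if the requesting user appears on that node's owner list, and a granted node would then have the reservation name added as an alias.

#include <stdio.h>
#include <string.h>

#define MAXOWNERS 4

/* Hypothetical per-machine record with its owner list. */
struct machine {
    const char *name;
    const char *owners[MAXOWNERS];   /* users allowed to reserve this node */
};

static int may_reserve(const struct machine *m, const char *user)
{
    for (int i = 0; i < MAXOWNERS && m->owners[i]; i++)
        if (strcmp(m->owners[i], user) == 0)
            return 1;
    return 0;
}

int main(void)
{
    struct machine cluster[] = {
        { "hamlet",  { "alice", "bob" } },
        { "macbeth", { "bob"          } },
        { "romeo",   { "alice"        } },
    };
    const char *user  = "alice";
    const char *alias = "sort-run";      /* name of the reservation */

    /* Filter the requested nodes by owner list; granted nodes would then
     * have the reservation name added as an alias. */
    for (size_t i = 0; i < sizeof cluster / sizeof cluster[0]; i++) {
        if (may_reserve(&cluster[i], user))
            printf("reserve %s: alias '%s' added for %s\n",
                   cluster[i].name, alias, user);
        else
            printf("reserve %s: denied (user %s not an owner)\n",
                   cluster[i].name, user);
    }
    return 0;
}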