Abstract-This paper presents an introduction to the Cluster of Workstations (COW) paradigm by considering the example of the Beowulf Parallel Workstation, an instantiation of that paradigm. The paper presents the features expected of a COW and shows how Beowulf follows the cluster paradigm through its hardware and software architecture. Along the way, we place Beowulf among alternative technologies such as Massively Parallel Processors (MPP) and Networks of Workstations (NOW). In the final part of the paper, we discuss the future directions of Beowulf and some other research projects based on the COW principle, and we conclude with the limitations of Beowulf and the effect of current networking trends on the Beowulf system.
TABLE OF CONTENTS
- PHILOSOPHY OF BEOWULF
(WHY BEOWULF?)
- CLUSTERING TECHNOLOGY
- TAXONOMY OF PARALLEL COMPUTERS
- BEOWULF ARCHITECTURE
- FUTURE DIRECTIONS
- CONCLUSIONS
- REFERENCES
1. PHILOSOPHY OF BEOWULF
The Earth and space sciences (ESS) project is a research project within the High Performance Computing and Communications (HPCC) program of NASA [1]. One of the goals of the ESS project is to determine the applicability of massively parallel computers to the problems faced by the Earth and space sciences community. The first Beowulf was built with the intention of handling the large data sets associated with ESS applications [2].
In more exact terms, the goal was a “Gigaflops Scientific Workstation” that could provide an alternative computing medium to high-end workstations, symmetric multiprocessors, and scalable distributed-memory systems. Mass-market commodity microprocessors have shown a rapid increase in performance, and there is a significant pricing disparity between PCs and scientific workstations. These two factors provided sufficient grounds for substantial gains in performance to cost by harnessing PC technology in parallel ensembles to provide high-end capability for scientific and engineering applications. Toward this end, NASA initiated the Beowulf program to apply these low-cost system configurations to the computational requirements of the Earth and space sciences.
At Supercomputing ’96, NASA and DOE demonstrated clusters costing less than $50,000 that achieved performance greater than one Gflop/s [1].
2. CLUSTERING TECHNOLOGY
Simply put, a cluster is just many computers connected by a dedicated network. As defined in [3], it is a group of computers working together to share resources or workload. A cluster usually includes some form of hardware or software integration that handles the sharing automatically. Note, though, that the cluster still has to be configured at installation time and whenever changes are made to it over time.
There are all kinds of clusters, from disk-sharing clusters to fully redundant, fault-tolerant operating systems and hardware. Thus, the word cluster denotes a whole family of technologies under a common name. To give a general idea of the types of clusters available, consider the following categories:
- A group of servers that balance the processing load or user load by using a central server or router that assigns the load to different servers (a minimal dispatch sketch follows this list).
- An initial or central server can determine the load of the other servers and send each new request to the least loaded server.
- Alternatively, the assignment can be based on user preference information or on the request type.
- A group of servers that act as a central system, joining together individual resources, in whole or in part, for use by clients.
- The individual resources could be ordered in some structure as a single virtual resource – for example the Network File System (NFS).
- The individual resources are pooled in no particular order and assigned jobs as they become available – for example a printer pool or modem pool.
- A group of servers that execute the exact same application at the same time in parallel across the servers.
- This is done primarily in fault-tolerant, redundant, or replicated systems to make sure that correct functions are executed as required.
- Beowulf is an example of this category.
- A group of servers that execute parts of the same application across the servers to make the computing faster.
- This is parallel or distributed computing in its pure form, and it goes well beyond simple clustering.
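To make the first category above concrete, the following C fragment is a minimal, hypothetical sketch of least-loaded dispatch (the server count and load counters are invented for illustration): the dispatcher tracks an outstanding-request count per back-end server and forwards each new request to the server with the lowest count.

#include <stdio.h>

#define NUM_SERVERS 4

/* Hypothetical per-server outstanding-request counters. */
static int load[NUM_SERVERS];

/* Pick the least loaded server and charge the new request to it. */
static int dispatch_request(void)
{
    int best = 0;
    for (int i = 1; i < NUM_SERVERS; i++)
        if (load[i] < load[best])
            best = i;
    load[best]++;               /* request now assigned to this server */
    return best;
}

int main(void)
{
    for (int r = 0; r < 10; r++)
        printf("request %d -> server %d\n", r, dispatch_request());
    return 0;
}

A real load balancer would decrement the counter when a request completes, or poll the servers for their actual load, as in the first sub-point above.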
Cluster-based distributed computing v/s true distributed computing: The difference between true distributed computing environments and cluster-based distributed computing lies in how the distributed servers are interfaced together. If the servers present a completely seamless environment, where individual node identity is not an issue for the programmer or the administrator, you have a true distributed computing environment. A set of machines that share a distributed space but still have to be managed or identified individually is a type of cluster. A parallel database partitioned across several machines is usually considered a cluster.
3. TAXONOMY OF PARALLEL COMPUTERS
Parallel processor technology can be partitioned into three types:
- Massively Parallel Processors (MPP)
Ex. nCube, Cray, etc
- Cluster of Workstations (COW)
Ex. Beowulf
- Network of Workstations (NOW)
Ex. Berkeley NOW
Massively Parallel Processors (MPP):
- MPPs are larger.
Typical high-end MPPs are offered in configurations of 6 to 2,048 processors, with peak performance above 2.4 TFLOPS, support for large parallel workloads with up to 4 TB of central memory, bisection bandwidth in excess of 122 GB/s, and I/O bandwidth of up to 128 GB/s. They offer the greatest amount of power for parallel applications.
- MPPs have the lowest-latency interconnect networks.
- Programmers are still required to worry about locality, load balancing, granularity, and communication overheads in order to obtain the best performance.
Cluster of Workstations (COW) v/s MPP:
- COWs are relatively small.
An example system may contain around 16 processors. Reported performance ranges from 1.25 Gflop/s for NASA’s Beowulf to 10.9 Gflop/s for the Caltech Beowulf.
- Interconnect network latency is higher than that of an MPP.
- Programs that do not require fine-grained computation and communication can usually be ported and run effectively on Beowulf clusters.
A small note on the importance of latency in the interconnect network: fine granularity, in parallel jargon, means that individual tasks are relatively small in terms of code size and execution time. The smaller the granularity, the greater the potential for parallelism and speed-up, but also the greater the overheads of synchronization and communication (data dependency). Beowulf’s network latency does not support such fine-grained parallelism. Recent advances in high-speed networking affect this latency parameter; we return to this later in the paper.
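As a rough, hedged illustration (the figures are round numbers, not Beowulf measurements): if a job with total compute time T is split over P processors and exchanges M small messages, its running time is approximately T/P + M*L, where L is the per-message network latency. With commodity Ethernet latencies on the order of 100 microseconds, a fine-grained job exchanging 100,000 small messages pays roughly 10 seconds of pure latency, which can easily rival the computation itself, whereas an MPP interconnect with single-digit-microsecond latency pays correspondingly less. This is why Beowulf favors coarse-grained decompositions.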
Network of Workstations:
Programming a NOW is usually an attempt to harvest unused cycles on an already installed base of workstations. Programs in this environment require algorithms that are extremely tolerant of load-balancing problems and large communication latencies. The load-balancing problem arises because a NOW depends on unused cycles being available on the workstations that form it. If the algorithm assumes that a fixed number of cycles is always available, and all the workstations in the NOW are too busy to contribute idle cycles, the algorithm will fail unless it is tolerant of load-balancing problems. There is also unpredictability in network latency, as the network load is not determined solely by the application being run on the cluster. Because the interconnection network is visible to the outside world, a portion of the network traffic may be generated by systems outside the NOW. Thus, an algorithm that is not tolerant of large communication latencies will fail.
COW v/s NOW:
- The nodes in a COW are dedicated to the cluster. This has the important ramification that the performance of individual nodes is not subject to external factors such as whatever applications happen to be running on them. Note that in a NOW, the number of idle cycles contributed by a workstation depends on what its user is running at that particular time. Dedicated nodes therefore ease load-balancing problems.
- The interconnection network of a cluster is isolated from the outside network. This reduces the unpredictability of network latency and strengthens system security, since the only authentication needed between processors is for system integrity.
- In a COW, operating system parameters can be tuned to achieve better throughput for coarse-grain jobs. This is not possible in a NOW, where each workstation has an interactive user who, of course, expects good interactive response.
- A COW provides a global process ID. This enables a process on one node to send signals to another process, all within the user domain.
4. BEOWULF ARCHITECTURE
There are two constraints on the Beowulf workstation architecture:
- It must use exclusively commodity hardware to avoid dependence on a single vendor.
- The cost of the workstation, populated with disk and memory, should be no more than a high-performance scientific workstation, approximately $50,000 [2].
With this in mind, Beowulf clusters (FIGURE 1) have been assembled around every generation of commodity CPU since the 1994 introduction of the 100 MHz Intel DX4 processor. The first Beowulf was built with DX4 processors and 10 Mbps Ethernet. In late 1997, a cluster was built using 16 200 MHz P6 processors connected by Fast Ethernet and a Fast Ethernet switch. The current price/performance point in desktop architectures, Intel’s 200 MHz Pentium Pro CPU, is incorporated into the new generation of Beowulf clusters. The networking equipment has evolved from simple 10 Mbps Ethernet to Fast Ethernet to various forms of switched Ethernet.
We will now discuss the features expected of Beowulf and the general hardware and software architecture used to achieve this end.
Features expected of Beowulf: The features expected of Beowulf, to make it the intended “Gigaflops Parallel Workstation”, are as follows:
- High interconnect-network bandwidth and low latency inter-processor communication.
- High aggregate-bandwidth disk subsystems
- High floating-point performance
How is Beowulf built to achieve the above features? Read on for the answer.
High Network Bandwidth: Beowulf achieves high bandwidth by using a multi-channel interconnection network. In general, the interconnection network can be one of two types:
- A Fast Ethernet switched network
Refer to Figure 2
The maximum achievable speed here is 100 Mbps. With the advent of Gigabit Ethernet, gigabit links may replace the 100 Mbps lines.
- A Crossbar Switch Architecture
Refer to Figure 3
Crossbar switches are designed so that, for non-overlapping connections, the switch acts as a set of point-to-point links. Because of the switched architecture, broadcast is largely eliminated and network traffic due to broadcast is reduced, freeing up more bandwidth. As switches get faster and more intelligence is added to them, they may come to dominate the future of networking.
High Aggregate Disk Bandwidth: Why do we need high aggregate disk bandwidth? The answer: existing systems follow the model of a shared file server accessed through a common LAN. In this case, the same data is accessed repeatedly during a working session because the typical workstation does not have the disk capacity to hold all of the requisite data. The result is long latencies to file servers, tedious response cycles, and a burden on shared resources. Thus performance suffers.
Coarse-grain parallelism implies that the program can follow the SPMD (Single Program, Multiple Data) model, in which the same program acts on multiple data sets and the results are merged later if required. If we have disk bandwidth at each node, the aggregate disk bandwidth improves response times, reduces latencies, and reduces the load on the network, but only for coarse-grained jobs. If the job is fine grained, then because of the higher network latency in a cluster, the program will perform poorly.
Beowulf achieves high aggregate disk bandwidth by placing disks on every node, and in doing so also provides a large amount of distributed disk space.
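As a hedged illustration of this SPMD, local-disk style of computation (a minimal sketch, not code from the Beowulf distribution; the data path is hypothetical): every node runs the same MPI program, reads its own slice of the data set from its local disk, and only the small partial results cross the interconnect.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double x, local_sum = 0.0, global_sum = 0.0;
    char path[64];
    FILE *f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each node reads its own slice from its local disk (hypothetical path). */
    sprintf(path, "/local/data/slice.%d", rank);
    f = fopen(path, "r");
    if (f) {
        while (fscanf(f, "%lf", &x) == 1)
            local_sum += x;     /* the bulk of the work stays local */
        fclose(f);
    }

    /* Only the small partial results travel over the network. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}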
High Floating-Point Performance: To increase floating-point performance for CPU-intensive applications, small-scale (2 to 4 processor) system boards are used.
One point about the Beowulf architecture is that clusters can be built to different sets of requirements, as has been done at universities such as Drexel, GMU, Clemson, and UIUC. Thus, Beowulf is a very flexible architectural framework.
FIGURE 1: BEOWULF ARCHITECTURE
FIGURE 2: FAST ETHERNET SWITCHED NETWORK
FIGURE 3: CROSSBAR SWITCH NETWORK
Software Architecture of Beowulf: Until now, we have discussed the hardware architecture of Beowulf. Let us now turn to the software architecture.
The Beowulf software architecture is called Grendel. Grendel is implemented as an add-on to the commercially available, royalty-free Linux operating system. The Beowulf distribution includes several programming environments and development libraries as individually installable packages. For example, PVM and MPI are available; System V-style IPC and POSIX threads are also supported.
The main features of Grendel can be classified as follows:
- Global Process ID space
- At the library level
- Independent of external libraries
- Programming Models
- PVM/MPI
- Distributed Shared Memory
- Parallel File System
Let us discuss the above features in detail.
Global Process ID (GPID) Space: Each process running in a UNIX kernel has a unique identifier called the Process ID (PID). The uniqueness is limited to the single kernel under consideration. In a parallel, distributed context, it is often convenient for UNIX processes to have a PID that is unique across an entire cluster, spanning several kernels. This can be achieved in two ways.
GPID Space – Method1:
- This incorporates the notion of a SPMD context of program execution, where multiple copies of the same code run on a collection of nodes, and share a UNIX process ID.
- One implementation of this method is discussed in [4]. Here, a parallel process is made up of a number of parallel tasks. Parallel tasks within a single process are allocated the same process ID and context ID, on each cell (a cell being a processor node). A parallel task of one process can send signals to other parallel tasks in the same process running on other cells or nodes.
GPID Space – Method2:
- The second scheme is available at the library layer in PVM [5]. PVM provides each task running in its virtual machine with a task ID that is unique across all the hosts participating in the virtual machine. But, this method is restricted to programs written and compiled under the PVM library.
- Beowulf provides an implementation of this method, called GPID-PVM. The PID space is divided into two parts: one for the local processes and one for the global ones. Some local processes, like init (in the UNIX sense), are duplicated across the cluster. So, the process space would be cluttered if the local processes were also included in the global process space.
- This requires a static allocation of non-overlapping PID ranges. This ensures that runtime PID assignment requires no inter-node communication, as kernels allocate the PIDs from their locally allotted ranges.
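The static-range idea can be sketched as follows (the constants are invented for illustration and are not Grendel’s actual values): each node owns a disjoint slice of the PID space above a reserved local range, so a globally unique PID can be generated, and mapped back to its home node, without any inter-node communication.

#include <stdio.h>

#define LOCAL_PID_MAX 1024   /* PIDs below this stay node-local (assumed) */
#define PIDS_PER_NODE 4096   /* size of each node's global slice (assumed) */

/* Global PID = base of this node's slice + a locally chosen offset. */
static int make_gpid(int node, int local_offset)
{
    return LOCAL_PID_MAX + node * PIDS_PER_NODE + local_offset;
}

/* Recover the home node of a global PID; no communication is needed. */
static int gpid_home_node(int gpid)
{
    return (gpid - LOCAL_PID_MAX) / PIDS_PER_NODE;
}

int main(void)
{
    int gpid = make_gpid(3, 17);
    printf("gpid %d lives on node %d\n", gpid, gpid_home_node(gpid));
    return 0;
}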
Programming Models: There are several distributed programming models available on Beowulf. A couple of the most commonly used are PVM and MPI [6]. Also, a distributed shared memory model is planned.
- PVM/MPI
- Embody the message passing paradigm for application portability – an application written using the PVM/MPI libraries can be ported across any platform for which these libraries are available.
- Beowulf supports a slightly modified version of the Oak Ridge PVM [7] and an unchanged Ohio State LAM MPI package [8]
- Distributed Shared Memory
- Beowulf implements a page-based Network Virtual Memory (NVM), also known as Distributed Shared Memory (DSM). The initial implementation is based on the ZOUNDS system from Sarnoff [9].
- ZOUNDS is designed to achieve a 50-microsecond page fault over the MINI gigabit ATM interface.
- The basic idea is to flatten the page-fault/I/O path by adding shortcuts into the virtual memory system where needed (a user-space sketch of the general page-fault mechanism follows this list).
- Note that the Linux kernel provides a VFS-like interface into the virtual memory system. This makes it simpler to add transparent distributed-memory backends to implicitly managed namespaces.
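The page-fault path of a page-based DSM can be sketched in user space as follows. This is an illustration of the general technique, not ZOUNDS or Grendel code, and the remote fetch is stubbed out: the shared region starts with no access permission, the first touch raises a fault, and the handler makes the page accessible and installs a valid copy so the faulting instruction can be retried.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;
static size_t page;

/* Stub: a real DSM would fetch the page from its current owner
 * over the interconnect here. */
static void fetch_page_from_owner(char *addr) { memset(addr, 0, page); }

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    /* Round the faulting address down to its page boundary. */
    char *addr = (char *)((size_t)si->si_addr & ~(page - 1));
    (void)sig; (void)ctx;
    /* Make the page writable locally, then install a valid copy. */
    mprotect(addr, page, PROT_READ | PROT_WRITE);
    fetch_page_from_owner(addr);
}

int main(void)
{
    struct sigaction sa;
    page = (size_t)sysconf(_SC_PAGESIZE);

    /* The "shared" region starts with no access, so any touch faults. */
    region = mmap(NULL, 4 * page, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, NULL);

    region[0] = 42;             /* faults, page is "fetched", then retried */
    printf("first byte = %d\n", region[0]);
    return 0;
}

A real implementation also tracks page ownership and invalidates or downgrades copies on other nodes; that bookkeeping is the software-enforced consistency policy mentioned in the future directions below.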
Parallel File System: The basic aims of a parallel file system [10] are:
- To allow access to storage devices in parallel from multiple nodes – disk file transfers between separate pairs of disks are done in parallel. Note that a switched network provides a point-to-point connection between nodes connected at different ports of the switch.
- To allow data on multiple storage devices to appear as a single logical file.
Beowulf systems can take advantage of a number of libraries developed to provide parallel file system interfaces to Networks of Workstations ([11],[12], [13]).
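A hedged sketch of the single-logical-file idea (the striping scheme and constants are invented for illustration and are not the interface of the libraries cited above): a logical file offset is mapped to a (node, local offset) pair under simple round-robin striping, so consecutive stripes live on different disks and can be transferred in parallel over a switched network.

#include <stdio.h>

#define STRIPE_SIZE 65536L   /* bytes per stripe (assumed) */
#define NUM_NODES   8        /* nodes holding a piece of the file (assumed) */

/* Map a logical file offset to the node that stores it and to the
 * offset within that node's local piece of the file. */
static void map_offset(long logical, int *node, long *local)
{
    long stripe = logical / STRIPE_SIZE;
    *node  = (int)(stripe % NUM_NODES);
    *local = (stripe / NUM_NODES) * STRIPE_SIZE + logical % STRIPE_SIZE;
}

int main(void)
{
    int node;
    long local;
    map_offset(9 * STRIPE_SIZE + 100, &node, &local);
    printf("logical byte lives on node %d at local offset %ld\n",
           node, local);
    return 0;
}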
5. FUTURE DIRECTIONS
- The GPID concept is good for cluster-wide control and signaling of processes, but it does not by itself provide a global view of processes. For example, the UNIX commands ps, top, etc., will not work unmodified on the present Beowulf system. Work is underway to provide this capability, using the /proc pseudo-filesystem. The Linux implementation uses /proc to present almost all system information, and the work involves the conceptually simple step of combining the /proc directories of the cluster using the existing NFS capabilities (a minimal sketch of this aggregation follows this list).
- A page-based distributed shared memory uses the virtual memory hardware of the processor and a software-enforced ownership and consistency policy to give the illusion of a memory region shared among processes running an application. A more conventional DSM-NVM implementation is planned, along with support for a Network Memory Server.
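As a hedged sketch of the /proc-over-NFS idea from the first point above (the mount-point layout is hypothetical, not the planned Beowulf convention): if every node’s /proc were re-exported under a per-node directory, a cluster-wide tool could simply walk those directories. Here the nodes’ load averages are summed to give a crude cluster-wide load figure of the kind a modified top could report.

#include <stdio.h>

#define NUM_NODES 16

int main(void)
{
    char path[64];
    double load, total = 0.0;
    FILE *f;

    for (int n = 0; n < NUM_NODES; n++) {
        /* Hypothetical NFS re-export of each node's /proc. */
        sprintf(path, "/beowulf/node%02d/proc/loadavg", n);
        f = fopen(path, "r");
        if (!f)
            continue;               /* node down or not mounted */
        if (fscanf(f, "%lf", &load) == 1)
            total += load;          /* 1-minute load average */
        fclose(f);
    }
    printf("cluster load average: %.2f\n", total);
    return 0;
}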
6. CONCLUSIONS
- Using Beowulf, we have implemented a high-performance workstation at a fairly low price.
- A limitation is that Beowulf was built by and for researchers with parallel programming experience.
- The operating point targeted by the Beowulf approach is scientific applications and users requiring repeated use of large data sets and large applications, with easily delineated coarse-grained parallelism.
- The restriction to coarse-grained parallelism is one of the limitations of Beowulf, but rapidly improving networking speeds promise to reduce interconnect latency in the future and to allow finer-grained programs to work as well.
7. REFERENCES