Cluster Computing

Karl Strohmaier

The College of New Jersey

Fellow Researchers:

Emily Gibson

Gerik Zayatz

Faculty Mentor: Dr. Deborah Knox

Project final report submitted in partial fulfillment

of the requirements for CMSC 497, May 2002.

Abstract

This paper describes an investigation into the use of a parallel cluster. Having previously set up and configured a cluster of six machines (see Cluster Computing), our research team investigated various methodologies of parallel programming. Specifically, programming examples utilizing pipelining, divide and conquer, and synchronous execution strategies were studied.

We examine the specific implementation of each parallelization method. In addition, we evaluate results obtained from writing and running individual programs, coded in C/C++ with MPICH, that make use of the different methodologies.

1. Introduction

All aspects of cluster setup were examined in Cluster Computing [Strohmaier]. A brief overview follows.

1.1 About Clusters

The term cluster refers to a system of computers networked together and used as though they were a single computer, albeit one with many processors. A standard network of computers is therefore not a cluster, since its machines are viewed and used separately from one another. Clusters have enjoyed a jump in popularity only in recent years, as the price of off-the-shelf computers, and of hardware in general, has decreased. Once this decline took place, it became economically feasible to purchase numerous computers and bring them together to form a cluster.

However, the network and computers alone do not make a complete cluster; in fact, they do not make a cluster at all. The other necessary item is a message-passing library. Such libraries allow for the creation of parallel programs on a cluster by providing functions which use the network to send messages and data between the computers. The two main message-passing libraries currently available are PVM and MPI. Recently, PVM's popularity has decreased due to the development of MPI, an open standard.
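As a brief illustration of the kind of functions such a library provides, the following is a minimal MPI program in C. It is a sketch only, assuming an MPICH installation of the sort described in Appendix C; it simply sends one integer from process 0 to process 1 over the network.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, numprocs, value;
    MPI_Status stat;

    MPI_Init(&argc, &argv);                     /* start up the MPI environment  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* which process am I?           */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);   /* how many processes are there? */

    if (rank == 0)
    {
        value = 42;
        /* send an integer across the network to process 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        /* receive the integer sent by process 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

With MPICH, such a program is compiled with mpicc and launched across the cluster with a command such as mpirun -np 2.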

Our cluster consisted of 6 machines, all running RedHat 6.2. In the course of this semester's research, it was decided to upgrade to RedHat 7.2, as it provides a new kernel and several bug fixes over 6.2. Detailed instructions for such a conversion, and for the installation of 6.2 as well, can be found in Appendix C, a HOWTO for clustering written by Karl Strohmaier and Gerik Zayatz.

With a fully working, upgraded cluster, we undertook an examination of the parallel programming methodologies listed above. This paper focuses on the implementation of these methods using the cluster.

2. Pipelining

The parallel programming strategy of pipelining uses much of the same logic that underlies the design of processors. In processors such as the Intel Pentium 4 chip, there are many "stages" in the pipeline, each performing a different operation. So it is with parallel implementations of algorithms that employ pipelining.

The basic method is described in this section. If we wish to execute a program in which there are several stages, each of which may depend on what was previously calculated, we can use a pipeline.

For instance, we could start the computation on processor 1, with processor 0 serving as a master in case there needs to be any output. Of course, each processor can do its own output, but for all information to be displayed on one monitor, it is much easier to have a dedicated I/O processor. See Figure 1 below.

Figure 1: Schematic of a pipeline

Once processor 1 has done a portion or all of its calculations on the data, it can send that result off to processor 2. If there are more computations that need to be done, processor 1 can continue, while at the same time processor 2 is doing its work. And so, we see the parallel nature of pipelining. Each processor will be continuing with some work, sending off results to the next processor in the pipeline.

An example which clarifies the above discussion is the Sieve of Eratosthenes. This sieve serves as a way of generating prime numbers. Essentially, each stage in the sieve (pipeline) removes numbers divisible by a certain integer. Such a sieve was implemented by Gerik Zayatz. See Figure 2 below.

Figure 2: The Sieve of Eratosthenes

Processor 1 takes the initial list of numbers, which can be any length. It searches through the list, removing numbers which are divisible by 2. At some point, processor 1 sends the remaining numbers to processor 2, which removes all numbers divisible by 3, and so on. Eventually we remove all the numbers that are divisible by any of the integers considered.
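The following sketch shows what one stage of such a linear pipeline might look like in C with MPI. It is an illustration only, not Gerik Zayatz's actual implementation; it assumes a fixed maximum list size MAX_N, at least two processes, and lets the worker of rank r remove multiples of r + 1, with process 0 acting as the dedicated I/O processor described above.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define MAX_N 1000

int main(int argc, char *argv[])
{
    int rank, numprocs, count, kept, i;
    int list[MAX_N], filtered[MAX_N];
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    if (rank == 0)
    {
        /* master: build the initial list 2..MAX_N+1 and start the pipeline */
        for (i = 0; i < MAX_N; i++)
            list[i] = i + 2;
        MPI_Send(list, MAX_N, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* collect whatever survives the last stage and print it */
        MPI_Recv(list, MAX_N, MPI_INT, numprocs - 1, 0, MPI_COMM_WORLD, &stat);
        MPI_Get_count(&stat, MPI_INT, &count);
        for (i = 0; i < count; i++)
            printf("%d\n", list[i]);
    }
    else
    {
        int divisor = rank + 1;   /* stage 1 removes multiples of 2, stage 2 of 3, ... */

        /* receive the list from the previous stage */
        MPI_Recv(list, MAX_N, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &stat);
        MPI_Get_count(&stat, MPI_INT, &count);

        /* keep only the numbers not divisible by our divisor
           (the divisor itself is kept, since it may be prime) */
        kept = 0;
        for (i = 0; i < count; i++)
            if (list[i] == divisor || list[i] % divisor != 0)
                filtered[kept++] = list[i];

        /* forward the shorter list to the next stage, or back to the
           master if we are the last stage in the pipeline */
        MPI_Send(filtered, kept, MPI_INT,
                 (rank == numprocs - 1) ? 0 : rank + 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}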

It would seem, though, that we would need an infinite number of processors to find all the prime numbers (the numbers not divisible by any of the integers). This is not the case, however: there are two strategies which can be used, bearing in mind that it is not necessary to consider every integer from 1 to n to test whether a number n is prime. We need only look up to n/2, as any integer larger than n/2 cannot divide n an integral number of times. For example, no number larger than 50 can divide 100 evenly.

The two strategies that can be utilized are a looping pipeline and a pipeline whose stages remove numbers divisible by more than one integer. The looping pipeline works like a circular queue. Once we have reached Pi, the last processor, it sends its results back to the first processor, giving us the structure seen in Figure 3.

Such a strategy allows us to keep checking for divisibility for an unlimited number of times, independent of the number of processors we have.

Figure 3: A picture of a looping pipeline
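In code, the only change from the linear pipeline is the choice of destination rank: the last worker wraps back around to the first worker instead of reporting to the master. A sketch, assuming the workers occupy ranks 1 through numprocs - 1 as in the earlier example:

/* in a linear pipeline the last worker sends to the master (rank 0);
   in a looping pipeline it wraps around to the first worker instead */
int next = (rank == numprocs - 1) ? 1 : rank + 1;
MPI_Send(filtered, kept, MPI_INT, next, 0, MPI_COMM_WORLD);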

The second strategy we can utilize is the multiple-checking stage. In this view of the pipeline, there is no loop, but instead the simple linear pipeline we saw earlier. However, rather than having each stage remove numbers divisible by a single integer, each stage removes numbers divisible by several integers.

For example, earlier we had stage 1 removing only numbers that were divisible by 2. In the multiple-checking scheme, it would remove numbers divisible by 2, 3, 4, and 5. The number of integers each stage checks would of course depend on the length of the list you wish to search through and the number of processors you have.

This view of the pipeline allows us to improve on message passing. As each processor removes more elements from the list, less data is sent on. We therefore cut down on network traffic, increasing the speed of execution.
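A sketch of the filtering step inside such a multiple-checking stage follows. It assumes each stage is given a hypothetical range of divisors, first_div through last_div, and reuses the list, filtered, kept, count, and loop variables from the earlier sketch:

/* remove every number divisible by any divisor in [first_div, last_div],
   keeping the divisors themselves, since they may be prime */
kept = 0;
for (i = 0; i < count; i++)
{
    int keep = 1;
    for (d = first_div; d <= last_div; d++)
    {
        if (list[i] != d && list[i] % d == 0)
        {
            keep = 0;    /* divisible by one of our divisors - drop it */
            break;
        }
    }
    if (keep)
        filtered[kept++] = list[i];
}

/* the list that travels on to the next stage is now much shorter */
MPI_Send(filtered, kept, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);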

3.  Divide and Conquer

As the name suggests, the divide and conquer methodology is one in which the problem is broken up into small, similar parts. Once the problem is broken down, each processor can then work on one of the smaller parts. As it often turns out, those problems that lend themselves best to the divide and conquer method are those in which the small problems are very similar, as we are able to reuse code.

As in the pipeline, the master process serves as the main process for I/O. This time, however, the master process is also in charge of distributing the tasks to the other processes. In the pipeline, there was no centralized distribution of work, as each process sent its results directly on to the next process.

The structure of a divide and conquer method is seen in Figure 4.

Figure 4: Divide and Conquer - the whole is broken up into parts

Once each process has finished its calculations on the small piece, the master process will then gather all the individual results together to get the final result.

One example that conforms very well to the divide and conquer method of parallel programming, and the one I decided to work on, is the Mandelbrot Set (see Figure 4 above).

The Mandelbrot Set is a mathematical construct, created by the successive iterations of a single formula. Though the formula is simple, the Mandelbrot Set is one of the most complex sets known in mathematics today.

Generated by repeatedly applying z = z * z + c, starting from z = 0, the Mandelbrot Set is a set of points c plotted on the complex plane, where the horizontal axis represents the real part and the vertical axis the imaginary part. (Recall that a complex number z = a + bi, where a is the real part and b the imaginary part.)

The points that comprise the Mandelbrot Set are those points c for which the magnitude of z does not exceed 2 within a given number of iterations. The magnitude of z, also written Abs(z), equals sqrt(a^2 + b^2).
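For a single point c = (cr, ci), the escape test can be written as the following inner loop. This is a sketch with illustrative names, not the compute() routine from the actual program:

/* iterate z = z*z + c until |z| > 2 or we hit the iteration limit */
int mandel_iterations(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int n;
    for (n = 0; n < max_iter; n++)
    {
        double new_zr = zr * zr - zi * zi + cr;   /* real part of z*z + c      */
        double new_zi = 2.0 * zr * zi + ci;       /* imaginary part of z*z + c */
        zr = new_zr;
        zi = new_zi;
        if (zr * zr + zi * zi > 4.0)              /* |z| > 2, the point escapes */
            break;
    }
    return n;   /* points that never escape reach max_iter and belong to the set */
}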

Any point where Abs(z) exceeds 2 quickly runs off to infinity; such points can be viewed as unstable. Any point in the set, however, is stable, because its magnitude never exceeds 2. As such, the Mandelbrot Set can be viewed as part of the mathematical field of chaos theory.

Obviously, calculations done by all processors will be the same: z * z + c. Therefore, this works very nicely for divide and conquer. We can divide the complex plane into a series of strips; each strip will then be the smaller part that each processor will work on. At the end, the master process can take the strips back and join them together to get the entire set.

The question of how the number of processors relates to the number of strips comes into play, of course. In most scenarios, the relative numbers do not matter. In the course of coding, the complex plane was broken into 40 strips. This seemed an appropriate number, as our MPI installation could handle only about 30 processes. (The sending of messages requires RSH connections, and the limit on the master computer is around 30.)

For testing, we also made the number of strips smaller than the number of processors, and the program still ran correctly.

The relative number of processors and strips did not matter because of the way the strips were sent to the other processors. Inside the code, there was a for loop in which the master distributed the strips. This loop ran from 1 to the number of processors, n. In the cases where there were more strips than processors, the first n strips were sent out. Then, when a processor finished its work and reported back, it was given a new strip. If, however, there were more processors than strips, an if statement ensured that no more strips were sent out after all strips had been allocated.

3.1.  Code segments

This section briefly outlines some of the code used in the programming of the Mandelbrot Set. As the complete program is some 400 lines, it cannot all be given here.

3.1.1 Some code for the master process:

This is a portion of the code for the master process. This segment governs how the master sends away and receives back the portions of the complex plane.

One key discovery that allowed this to work was the MPI constant MPI_ANY_SOURCE. Prior to learning of it, I was quite uncertain how to proceed, as I needed the master to be dynamic, in the sense that it could not be left waiting for a message from any one particular process. If the master had been forced to wait on a specific process, execution would have been greatly slowed, if it worked at all.

/* send the stripes away - notice we send only numprocs stripes away;
   that may not be all the stripes, if numprocs < STRIPES */
for (ip = 1; ip < numprocs; ip++)
{
    if (sent <= STRIPES)
    {
        MPI_Send(&sent, 1, MPI_INT, ip, 0, MPI_COMM_WORLD);  /* which stripe? */
        stripe_in_proc[ip] = sent;   /* keep track of what's where */
        sent++;                      /* increment */
    }
}

/* receive back the stripes */
for (recv = 0; recv < STRIPES; recv++)
{
    /* receive a maximum amount, Y_RESN*(X_RESN/STRIPES+1), from any slave */
    MPI_Recv(store_x, Y_RESN*(X_RESN/STRIPES+1), MPI_DOUBLE, MPI_ANY_SOURCE, 0,
             MPI_COMM_WORLD, &stat);

    /* what process did we receive this from? */
    from_node = stat.MPI_SOURCE;

    /* how much did we receive? */
    MPI_Get_count(&stat, MPI_DOUBLE, &number_from_node);

    from_stripe = stripe_in_proc[from_node];

    /* receive the rest of the data from the same process */
    MPI_Recv(store_y, Y_RESN*(X_RESN/STRIPES+1), MPI_DOUBLE, from_node, 0,
             MPI_COMM_WORLD, &stat);
    MPI_Recv(iteration, Y_RESN*(X_RESN/STRIPES+1), MPI_INT, from_node, 0,
             MPI_COMM_WORLD, &stat);
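The excerpt ends just after a stripe's data have been received. The remainder of the receive loop hands the reporting slave either another stripe or the kill signal described in Section 3.1.2 below; the following is a reconstruction of that logic using the variable names above, not the original code verbatim:

    /* give the slave that just reported back another stripe, if any remain;
       otherwise send it the kill signal (-1) so it can shut down */
    if (sent <= STRIPES)
    {
        MPI_Send(&sent, 1, MPI_INT, from_node, 0, MPI_COMM_WORLD);
        stripe_in_proc[from_node] = sent;
        sent++;
    }
    else
    {
        int kill = -1;
        MPI_Send(&kill, 1, MPI_INT, from_node, 0, MPI_COMM_WORLD);
    }
}   /* end of the receive loop */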

3.1.2.  Code for the slaves

This code segment is fairly straightforward. The slaves loop endlessly, receiving portions of the complex plane. Upon doing so, they then make the necessary calculations and send the results back to the master.

The slaves know to stop receiving (meaning the plane has been fully calculated) upon reception of a -1. This value was chosen simply because a legitimate stripe number can never be negative.

/* keep going until we get the kill signal */
for (;;)
{
    /* receive our stripe */
    MPI_Recv(&stripe, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);

    /* if we got -1, the kill signal... */
    if (stripe < 0)
    {
        /* always call this before ending an MPI process */
        MPI_Finalize();
        exit(0);
    }

    /* do the computation with our portion of the whole plane */
    compute(minr, maxr, mini, maxi, start[stripe], last[stripe],
            store_x, store_y, &number, iteration);

    /* send the data back to the master for graphing */
    MPI_Send(store_x, number, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    MPI_Send(store_y, number, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    MPI_Send(iteration, number, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

3.1.3  Full code

The URL for the full code for the Mandelbrot Set program, as well as others, can be found in Appendix A.

4.  Synchronous execution

Synchronous execution is a strategy similar to the divide and conquer methodology examined above. However, while all processes in the synchronous execution paradigm perform the same calculations (as we saw in divide and conquer), that is where the similarity ends.

The difference is that in synchronous execution, as the name implies, the processes pass messages among themselves as the computation proceeds. As a result, the processes depend upon each other in order to keep a coherent data set.
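As a concrete illustration of this kind of dependence (a sketch only, not one of the programs written for this project), consider an iterative computation in which each process repeatedly updates its own block of an array and then exchanges boundary values with its neighbors before the next iteration. The names local, n, MAX_ITER, and update_block below are hypothetical:

/* one iteration: update our own block, then synchronize boundary values
   with our neighbors so every process starts the next iteration with a
   coherent view of the shared data */
for (iter = 0; iter < MAX_ITER; iter++)
{
    update_block(local, n);   /* hypothetical local computation on local[1..n] */

    /* exchange boundary values with the left and right neighbors;
       MPI_Sendrecv sends and receives in one call, avoiding deadlock */
    if (rank > 0)
        MPI_Sendrecv(&local[1], 1, MPI_DOUBLE, rank - 1, 0,
                     &local[0], 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, &stat);
    if (rank < numprocs - 1)
        MPI_Sendrecv(&local[n], 1, MPI_DOUBLE, rank + 1, 0,
                     &local[n + 1], 1, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, &stat);
}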