CPS532 Parallel Architectures

Review for the Second Exam

The exam will be in-class and timed for the last 75 minutes of the class. You may bring two pages of notes if you so desire. I will ask questions pertaining to concepts. I may also ask multiple choice, fill-in-the-word or true-false questions. These types of questions allow me to cover a lot of material. I will ask some questions from your homework and labs that we have covered.

This exam will cover my notes, Chapters 4, 5, 6, and 7 of the text, and the concepts contained in the labs on creating an MPI program (finding the sum) and the Mandelbrot set.

The way to study for this is to ask yourself what parts of the material are important and understand them. One way to do this is to write down keywords as they occur in your reading.

Parallel Computing Chapter 4 Notes

Homework: 2, 4, 5, 6, 8

Look at two fundamental strategies or techniques:

partitioning where the problem is divided into parts and each part computed separately

divide and conquer where the partitioning is done in a recursive manner by continually dividing the problem into smaller parts, solving the smaller parts and combining the results.

4.1 Partitioning

4.1.1 Partitioning Strategies

Partitioning applied to the program data

called data partitioning or domain decomposition

Partitioning applied to the functions of a program

called functional decomposition

Data partitioning is a main strategy for parallel programming.

The type of system being used has a major impact on how the partitioning is done.

Discussion of the program that sums a list of numbers.

Broadcasting the whole list of numbers versus sending a separate sublist to each slave breaks down into the following considerations:

the communication cost of multiple sends from the master to the slaves when sublists are used

simpler code at the slave end when sublists are sent, since no decision about the breakdown is necessary (a saving that doesn't amount to much)

in either case, each slave PE has to know its identity ("who am I")

Slaves are identified by a process ID, which is usually obtained by calling a library routine.

MPI requires communicators to be established, and processes have a rank within the communicator

see the MPI program for adding a list of numbers

Note that the reduction operation (add) is sometimes part of the routine name but is usually given as a parameter (e.g., MPI_SUM in MPI_Reduce). A communication group (communicator) must also be defined.
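As a reminder of the lab concept, here is a minimal sketch of such a program, assuming the count of numbers (N) divides evenly by the number of processes; it is an illustration, not the lab code itself. It uses MPI_Scatter to distribute the sublists and MPI_Reduce with MPI_SUM to combine the partial sums.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1024                              /* assumed problem size */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double numbers[N];                      /* full list, used only on the master */
    double partial = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* "who am I" */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0)                          /* master fills in the list */
        for (int i = 0; i < N; i++) numbers[i] = i + 1;

    int chunk = N / nprocs;                 /* size of each sublist */
    double *sublist = malloc(chunk * sizeof(double));

    /* Send a sublist to each process rather than broadcasting the whole list. */
    MPI_Scatter(numbers, chunk, MPI_DOUBLE,
                sublist, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)         /* local partial sum */
        partial += sublist[i];

    /* The operation (MPI_SUM) and the communicator are parameters. */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);

    free(sublist);
    MPI_Finalize();
    return 0;
}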

Analysis

The sequential computation is O(n). Total number of processes used in the parallel computation is m.

Speedup S = ts / tp = (n - 1) / (n/m + m - 2), which tends to m as n gets large.
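For a concrete feel (my numbers, not the text's): with n = 1024 numbers and m = 8 processes, S = 1023 / (128 + 6) ≈ 7.6, a little below the ideal speedup of 8 because of the master's combining work.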

4.1.2 Divide and Conquer

Should have a familiar ring to it from your algorithm course.

Characterized by dividing a problem into subproblems that are of the same form as the larger problem.

Further divisions into smaller subproblems are usually done by recursion.

The term divide and conquer is usually applied when the main work is in combining the results, as in quicksort and mergesort.

Author uses term divide and conquer anytime partitioning is continued on smaller and smaller problems.

When each division creates two parts, a recursive divide-and-conquer formulation creates a binary tree.

If the number of data items is a power of 2, we can divide the data into a perfectly balanced binary tree. Otherwise, some leaf nodes will end up one level higher than others.

Parallel implementation

In the sequential implementation, only one node of the tree at a time can be visited.

A parallel implementation allows more than one node to be processed at a time; we could have 2^(m+1) - 1 processors for dividing the task into 2^m parts.

More common to reuse processors at each level of the tree as illustrated in Figure 4.3 (following) which uses 8 processors

The final stage will have n/8 numbers per processor, or n/p for p processors, with log p levels in the tree.

When combining the partial sums, each odd-numbered processor passes its partial sum to the adjacent even-numbered processor.

The even-numbered processors then add the received partial sum to their own and pass the result onward.

See the notes between Figure 4.3 and Figure 4.4 of how this could map onto a hypercube.

Unlike the broadcast and reduce approach, the code has to be pretty processor specific.
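As a hedged illustration (mine, not the text's code) of how that processor-specific combining might look in MPI: each process starts with a partial sum, and at each of the log p levels the odd-numbered member of a pair sends its sum to the adjacent even-numbered member, which adds it in.

#include <mpi.h>

/* Combining phase of the tree sum; assumes p is a power of 2 and each
   process already holds its partial_sum.  Illustration only. */
double tree_sum(double partial_sum, int rank, int p, MPI_Comm comm)
{
    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == step) {
            /* odd member of the pair at this level: send the sum and drop out */
            MPI_Send(&partial_sum, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;
        } else if (rank % (2 * step) == 0) {
            /* even member: receive the neighbor's sum and add it in */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
            partial_sum += other;
        }
    }
    return partial_sum;    /* the final total ends up on rank 0 */
}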

Analysis

Assume that n is a power of 2 and that tstartup is not included.

Communication

There is a logarithmic number of steps in the division phase. The actual division of the data into two parts is ignored.

tcomm1 = (n/2)tdata + (n/4)tdata + ... + (n/p)tdata = (n(p - 1)/p) tdata

where tdata is the time to send 1 word.

The combining phase is similar except that only one data item is sent in each message.

tcomm2 = tdata log p

so total communication time is

tcomm = (n(p - 1)/p) tdata + tdata log p

which has a time complexity of O(n) for constant p.

Computation time complexity is

tcomp = n/p + log p, which is O(n) for constant p

So tp = (n(p - 1)/p) tdata + tdata log p + n/p + log p

The very best speedup we could expect is p when all p processors are computing their partial sums. The actual speedup would be less due to division and combining phases.

4.1.3 M-ary Divide and Conquer

Divide and conquer can also be applied where a task is divided into more than two parts at each stage. For example, if the task is broken into four parts at each stage, the sequential recursive definition divides into four subproblems and combines their four partial results, as sketched below.
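A minimal illustration of such a four-way recursive formulation for the summation problem (my sketch; the text's own code may differ):

/* Illustrative four-way (quad) divide-and-conquer sum. */
int add(int *s, int n)
{
    if (n <= 4) {                        /* small enough: solve directly */
        int sum = 0;
        for (int i = 0; i < n; i++) sum += s[i];
        return sum;
    }
    int q = n / 4;                       /* divide into four parts ...   */
    return add(s, q)                     /* ... solve each part ...      */
         + add(s + q, q)                 /* ... and combine the results  */
         + add(s + 2 * q, q)
         + add(s + 3 * q, n - 3 * q);    /* last part absorbs any remainder */
}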

A tree in which each node has four children is called a quadtree.

A quadtree has particular applications in decomposing two-dimensional regions into four subregions.

For example, a digitized image could be divided into four quadrants and then each of the four quadrants divided into four subquadrants, and so on, as shown in Figure 4.7.

An octtree is a tree in which each node has eight children and has application for dividing a three-dimensional space recursively.

An m-ary tree would be formed if the division is into m parts (i.e., a tree with m links from each node), which suggests that greater parallelism is available as m is increased because there are more parts that could be considered simultaneously.

4.2 DIVIDE-AND-CONQUER EXAMPLES - you should be familiar with one.

4.2.1 Sorting Using Bucket Sort – understand this example!

Most of the sequential sorting algorithms are based upon the compare and exchange of pairs of numbers. Bucket sort is not based upon compare and exchange, but rather is a partitioning method.
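A minimal sequential sketch of the idea, assuming the n numbers are roughly uniformly distributed over a known range [0, max): the range is divided into m regions, each number is dropped into the bucket for its region, and each bucket is then sorted separately. In the parallel version, each bucket (or group of buckets) would be handled by its own processor. This is an illustration only, not the text's code.

#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

void bucket_sort(int *a, int n, int max, int m)   /* m = number of buckets */
{
    int **bucket = malloc(m * sizeof(int *));
    int  *count  = calloc(m, sizeof(int));
    for (int b = 0; b < m; b++)
        bucket[b] = malloc(n * sizeof(int));      /* worst-case bucket size */

    for (int i = 0; i < n; i++) {                 /* partition into buckets */
        int b = (int)((long long)a[i] * m / max); /* region that a[i] falls in */
        if (b >= m) b = m - 1;
        bucket[b][count[b]++] = a[i];
    }

    int k = 0;
    for (int b = 0; b < m; b++) {                 /* sort each bucket, then concatenate */
        qsort(bucket[b], count[b], sizeof(int), cmp_int);
        for (int i = 0; i < count[b]; i++)
            a[k++] = bucket[b][i];
        free(bucket[b]);
    }
    free(bucket);
    free(count);
}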

4.3 Summary

This chapter introduced the following concepts:

Partitioning and divide-and-conquer concepts as the basis for parallel computing techniques

Tree constructions

Examples of partitioning and divide-and-conquer problems - namely, bucket sort, numerical integration, and the N-body problem.

Chapter 5: Pipelined Computations

HW: 2, 3, 4, 5, 9

This chapter is about a parallel processing technique called pipelining, which is applicable to a wide range of problems that are partially sequential in nature;

i.e., a sequence of steps must be undertaken.

Hence, we can use pipelining to parallelize sequential code.

5.1 PIPELINE TECHNIQUE

In the pipeline technique, the problem is divided into a series of tasks that have to be completed one after the other. (In fact, this is the basis of sequential programming.)

In pipelining, each task will be executed by a separate process or processor (each pipeline process is referred to as a pipeline stage).

Each stage will contribute to the overall problem and pass on information that is needed for subsequent stages.

This parallelism can be viewed as a form of functional decomposition. The problem is divided into separate functions that must be performed, but in this case, the functions are performed in succession. The input data is often broken up and processed separately.

As an example of a sequential program that can be formulated as a pipeline, consider a simple loop:

for (i = 0; i < n; i++) sum = sum + a[i];

which adds all the elements of array a to an accumulating sum.

The loop could be "unfolded" to yield

sum = sum + a[0]; sum = sum + a[1];

sum = sum + a[2]; sum = sum + a[3];

sum = sum + a[4];

One pipeline solution could have a separate stage for each statement.

Each stage accepts the accumulating sum on its input sin and one element of a[] on its input a, and produces the new accumulating sum on its output sout. Therefore, stage i performs

sout = sin + a[i];
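A rough sketch of what one such stage could look like with MPI point-to-point calls, assuming p processes arranged in a line with process i holding element a[i] and the sum entering at process 0 (my illustration, not the text's program):

#include <mpi.h>

/* One pipeline stage of the accumulating sum: receive the running sum from
   the left (sin), add this stage's element, and pass the result right (sout). */
void pipeline_stage(int my_element, int rank, int p, MPI_Comm comm)
{
    int sum = 0;

    if (rank > 0)                                   /* sin from process i-1 */
        MPI_Recv(&sum, 1, MPI_INT, rank - 1, 0, comm, MPI_STATUS_IGNORE);

    sum = sum + my_element;                         /* sout = sin + a[i] */

    if (rank < p - 1)                               /* pass the result onward to i+1 */
        MPI_Send(&sum, 1, MPI_INT, rank + 1, 0, comm);
    /* otherwise the last stage holds the final sum */
}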

Instead of simple statements, a series of functions could be performed in a pipeline fashion.

A frequency filter is a more realistic example in which the problem is divided into a series of functions (functional decomposition).

The objective here is to remove specific frequencies (say the frequencies f0, f1, f2, f3, etc.) from a (digitized) signal, f(t). The signal could enter the pipeline from the left.

Each stage is responsible for removing one of the frequencies.

Given that the problem can be divided into a series of sequential tasks, the pipelined approach can provide increased speed under the following three types of computations:

1. If more than one instance of the complete problem is to be executed

2. If a series of data items must be processed, each requiring multiple operations

3. If information to start the next process can be passed forward before the process has completed all its internal operations

The Type 1 arrangement is utilized widely in the internal hardware design of computers.

It also appears in simulation exercises where many simulation runs must be completed with different parameters to obtain the comparative results.

A Type 1 pipeline can be illustrated in the space-time diagram shown in Figure 5.4.

In this particular diagram, each process is assumed to have been given the same time to complete its task. Each time period is one pipeline cycle. Here, each instance of the problem requires six sequential processes.

The Type 2 arrangement, in which a series of data items must be processed in a sequence, appears in arithmetic calculations such as multiplying elements of an array, where individual elements enter the pipeline as a sequential series of numbers. The arrangement is shown in Figure 5.6, where in this case 10 processes form the pipeline and 10 elements, d0, d1, d2, d3, d4, d5, d6, d7, d8, and d9, are being processed. With p processes and n data items, the overall execution time is again given by (p - 1) + n pipeline cycles, assuming all the pipeline cycles are of equal length.

It is often the third arrangement, Type 3, that is utilized in parallel programs where there is only one instance of the problem to execute, but each process can pass on information to the next process before it has completed. Figure 5.7 shows space-time diagrams when information can pass from one process to another before the end of the execution of a process.

If the number of stages is larger than the number of processors in any pipeline, a group of stages can be assigned to each processor, as shown in Figure 5.8. Of course, now the pipeline stages within one processor are executed sequentially.

5.2 COMPUTING PLATFORM FOR PIPELINED APPLICATIONS

A key requirement for pipelining is the ability to send messages between adjacent processes in the pipeline. This suggests direct communication links between the processors onto which adjacent processes are mapped.

An ideal interconnection structure is a line or ring structure, such as a line of processors connected to a host system, as shown in Figure 5.9.

(Of course, lines and rings can be embedded perfectly on meshes and hypercubes.)

The seemingly inflexible line configuration is, in fact, very convenient for many applications, yet at very low cost.

Small transputer systems are often configured as lines.

Networked workstations may not really be a suitable platform for the pipelined programs in this chapter unless a more complex interconnection structure than a simple Ethernet is used.

In the following, we will assume an interconnection structure that can provide at least simultaneous transfers between adjacent processes or processors.

A little flexibility can be achieved on this matter by using (locally) blocking send()s (the send()s that are normally used). Then a process can continue with the next operation without waiting for the destination to be ready.

5.3 PIPELINE PROGRAM EXAMPLES - you should be able to explain one

In this section we will examine sample problems that can be pipelined and classify the solutions as Type 1, Type 2, or Type 3.

5.3.1 Adding Numbers

ttotal = (time for one pipeline cycle)(number of cycles)

ttotal = (tcomp + tcomm)(m + p - 1)

where there are m instances of the problem and p pipeline stages (processes). The computation and communication times are given by tcomp and tcomm, respectively. The average time for a computation is given by

ta = ttotal / m
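For a concrete feel (my numbers, not the text's): with p = 6 pipeline stages and m = 100 instances of the problem, ttotal = (tcomp + tcomm)(100 + 6 - 1) = 105(tcomp + tcomm), so ta = 105(tcomp + tcomm)/100 ≈ 1.05(tcomp + tcomm); as m grows, the average time per instance approaches one pipeline cycle.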

5.4 Summary

This chapter introduced the following:

The pipeline concept and its application areas

Analysis of pipelines

Examples illustrating the potential of pipelining, including insertion sort, prime number generation, and solving an upper triangular system of linear equations

Chapter 6 Synchronous Computations

Consider problems solved by a group of separate computations that must at times wait for each other before proceeding, thereby becoming synchronized.

An important class of such applications is called fully synchronous applications. In a fully synchronous application, all the processes are synchronized at regular points. Generally, the same computation or operation is applied to a set of data points. All the operations start at the same time in a lock-step manner, analogous to SIMD computations.

First, we will consider synchronizing processes and then fully synchronous applications.

6.1 SYNCHRONIZATION

HW: 1, 2, 4, 7, 11

6.1.1 Barrier

Scenario is a number of processes computing values.

Each process must wait until all processes have reached a particular reference point in their computations.

e.g. processes need to exchange data between themselves and then continue from a known state together.

Some mechanism is needed that prevents any process from continuing past a specified point until all the processes are ready.

The basic mechanism for regulating this situation is called a barrier. A barrier is inserted at the point in each process where it must wait. All processes can continue from this point when all the processes have reached it (or, in some implementations, when a stated number of processes have reached this point).

There is a potential race condition as processes race ahead toward their individual goals at different speeds.

Barriers apply to both shared memory and message-passing systems. In message-passing systems, barriers are often provided with library routines.

MPI has the barrier routine, MPI_Barrier(), with a named communicator as its only parameter. MPI_Barrier() is called by each process in the group, blocking until all members of the group have reached the barrier call and only returning then.

Barriers are naturally synchronous, and message tags are not used.

Since there is a single barrier call reused for all occasions in which a barrier is required, it is essential that each barrier call matches the corresponding barrier call in the other processes.
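A minimal usage sketch, assuming all processes in MPI_COMM_WORLD take part:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* ... each process computes its values here ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* no process passes this point until every
                                      process in the communicator has made the call */

    /* ... all processes continue from a known common state ... */

    MPI_Finalize();
    return 0;
}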

Figure 6.2 illustrates the library call approach. The way that the barrier call is implemented will depend upon the implementer, who in turn will be influenced by the underlying architecture. Certain underlying architectures will suggest specific efficient implementations.

MPI does not specify internal implementation. However, we need to know something about the implementation to assess the complexity of the barrier.

Common implementations of a barrier:

6.1.2 Counter Implementation

A centralized counter implementation (sometimes called a linear barrier)

A single counter is used to count the number of processes reaching the barrier. Before any process reaches its barrier, the counter is first initialized to zero. Then each process calling a barrier will increment the counter and check whether the correct number has been reached, say n. If the counter has not reached n, the process is stalled or placed in an "idle" state. If the counter has reached n, the process and all other processes waiting for the counter are released.

A mechanism must be in place to release idle processes.

Counter-based barriers often have two phases, an arrival phase (or trapping) and a departure (or release) phase. A process enters the arrival phase and does not leave this phase until all processes have arrived in this phase.

Then processes move to the departure phase and are released.

The code is built using (locally) blocking send()s and recv()s and counting using for loops. The barrier code for the slave processes is simply

send(Pmaster);

recv(Pmaster);

Messages could be received from slave processes in any order and are accepted as received, but messages are sent to slave processes in numeric order in this code.

Note that locally blocking send()s do not block; the sender does not wait for the matching recv().

The slave processes will move directly to their recv()s. The recv()s are blocking in that the processes will not move out of their departure phase prematurely. The arrival phase could also be implemented with a gather routine and the departure phase with a broadcast routine.
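A sketch of how the master's side of this counter barrier might look in MPI terms (my illustration; the text uses generic send()/recv() pseudocode). The master counts the slaves in during the arrival phase, accepting their messages in any order, and then releases them in numeric order.

#include <mpi.h>

/* Counter (linear) barrier with a separate master process (rank 0) and
   nslaves slave processes (ranks 1..nslaves).  Illustration only. */
void counter_barrier(int rank, int nslaves, MPI_Comm comm)
{
    int dummy = 0;

    if (rank == 0) {
        /* Arrival phase: count the slaves in, in any order. */
        for (int i = 0; i < nslaves; i++)
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm,
                     MPI_STATUS_IGNORE);
        /* Departure phase: release the slaves in numeric order. */
        for (int i = 1; i <= nslaves; i++)
            MPI_Send(&dummy, 1, MPI_INT, i, 0, comm);
    } else {
        MPI_Send(&dummy, 1, MPI_INT, 0, 0, comm);     /* "I have arrived"    */
        MPI_Recv(&dummy, 1, MPI_INT, 0, 0, comm,      /* wait to be released */
                 MPI_STATUS_IGNORE);
    }
}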

6.1.3 Tree Implementation

Barriers implemented with a counter have a time complexity of O(n) with n processes. A more efficient barrier can be implemented using a decentralized tree construction.

Understand the operation of the algorithm for an eight-node tree and how the nodes block during the three (log 8) stages the algorithm performs on a tree of this size.

The processes now must be released from the barrier. This can be done with a reverse tree construction.

The eight-process algorithm with the arrival phase and departure phase both implemented with trees requires 2 log 8 steps or, in general, 2 log n steps, a time complexity of O(log n).
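A hedged sketch of the whole scheme for n processes, n a power of 2 (my illustration, not the text's code): the arrival messages funnel toward process 0 over log n stages, and the release messages fan back out over another log n stages.

#include <mpi.h>

/* Tree barrier for n = 2^k processes; illustration only. */
void tree_barrier(int rank, int n, MPI_Comm comm)
{
    int dummy = 0;

    /* Arrival phase: at each stage the "odd" member of a pair signals the
       "even" member and then waits to be released. */
    for (int step = 1; step < n; step *= 2) {
        if (rank % (2 * step) == step)
            MPI_Send(&dummy, 1, MPI_INT, rank - step, 0, comm);
        else if (rank % (2 * step) == 0)
            MPI_Recv(&dummy, 1, MPI_INT, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
    }

    /* Departure phase: the release travels back down the same tree. */
    for (int step = n / 2; step >= 1; step /= 2) {
        if (rank % (2 * step) == 0)
            MPI_Send(&dummy, 1, MPI_INT, rank + step, 0, comm);
        else if (rank % (2 * step) == step)
            MPI_Recv(&dummy, 1, MPI_INT, rank - step, 0, comm,
                     MPI_STATUS_IGNORE);
    }
}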

6.1.4 Butterfly Barrier

The tree construction can be developed into a so-called butterfly, in which pairs of processes synchronize at each stage in the following manner (assuming eight processes as an example):

First stage: P0 ↔ P1, P2 ↔ P3, P4 ↔ P5, P6 ↔ P7

Second stage: P0 ↔ P2, P1 ↔ P3, P4 ↔ P6, P5 ↔ P7

Third stage: P0 ↔ P4, P1 ↔ P5, P2 ↔ P6, P3 ↔ P7

This assumes two "links" between synchronizing processes, which implies two pairs of send()/recv().

This would be used if data were exchanged between the processes (as in other applications of the butterfly). For a barrier, each synchronization requires only a single pair of send()/recv(). After all the synchronizing stages, each process will have been synchronized with each other process, and all processes can continue.

At stage s, process i synchronizes with process i + 2^(s-1) if n is a power of 2. If n is not a power of 2, the communication is between process i and process (i + 2^(s-1)) mod n. With n processes, the butterfly has log n steps (n being a power of 2), half the number of steps of the tree implementation, but the same time complexity of O(log n). The butterfly barrier maps onto a hypercube very efficiently (as does the tree implementation).
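For n a power of 2, the pairing at each stage can also be expressed as rank XOR step, which matches the stage listing above. A hedged sketch (mine, not the text's code), using MPI_Sendrecv so the single exchange per pair cannot deadlock:

#include <mpi.h>

/* Butterfly barrier for n = 2^k processes; at stage s each process pairs
   with the process whose rank differs in bit s.  Illustration only. */
void butterfly_barrier(int rank, int n, MPI_Comm comm)
{
    int out = 0, in = 0;

    for (int step = 1; step < n; step *= 2) {
        int partner = rank ^ step;              /* partner differs in exactly one bit */
        MPI_Sendrecv(&out, 1, MPI_INT, partner, 0,
                     &in,  1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);  /* one send/recv pair per stage */
    }
}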