INTRODUCTION TO PARALLEL ALGORITHMS

3.1 Performance measures

There are a number of different ways to characterize the performance of both parallel computers and parallel algorithms. Usually, the peak performance of a machine is expressed in units of millions of instructions executed per second (MIPS) or millions of floating point operations executed per second (MFLOPS). However, in practice, the realizable performance is clearly a function of the match between the algorithms and the architecture.

3.1.1 Speed-up, efficiency, and redundancy

Perhaps the best known measure of the effectiveness of parallel algorithms is the speed-up ratio (Sp) with respect to the best serial algorithm, defined as the ratio of the time required to solve a given problem by the best known serial method to the time required to solve the same problem by a parallel algorithm using p processors. Thus, if T(N) denotes the time required by the best known serial algorithm to solve a problem of size N, and Tp(N) denotes the time taken by a parallel algorithm using p processors in solving a problem of the same size, then

Sp = T(N) / Tp(N).

Clearly, Sp is a function of N, the size of the problem, and p, the number of processors. For notational simplicity, we often do not explicitly denote this dependence. Another related measure is known as the processor efficiency, Ep, which is the ratio of the speed-up to the number of processors. Thus

Ep = Sp / p.

Not all serial algorithms admit speed-up through a parallel implementation. But every parallel algorithm can be implemented serially in a straightforward way. Thus, since a parallel algorithm using p > 1 processors must perform at least as fast as a serial algorithm, it is desired that Sp ≥ 1.

In the above definition of speed-up, the parallel algorithm may be quite different from the serial algorithm. In many cases one might instead be interested in comparing the performance of the same algorithm on a serial machine and on a given parallel processor. In such cases, the speed-up with respect to a given algorithm is simply defined as

S'p = T1(N) / Tp(N).

Recall that Ti(N) is the time needed to execute the algorithm on i processors. Likewise, the processor efficiency in this case is

E'p = S'p / p.

Clearly, T(N) ≤ T1(N). Further, one step of a parallel algorithm using p processors takes at most p steps when implemented serially, so T1(N) ≤ p Tp(N). This in turn implies that S'p ≤ p and E'p ≤ 1. Since T(N) ≤ T1(N), it readily follows that Sp ≤ S'p ≤ p.

Any parallel algorithm which asymptotically attains linear speed-up, that is, Sp = Θ(p), is said to have optimal speed-up. It is conceivable that an algorithm with optimal speed-up may have very low processor efficiency, and likewise, an algorithm with good processor efficiency may not have large speed-up. In other words, a good serial algorithm may be a bad parallel algorithm, and a throwaway serial algorithm may end up being a reasonable parallel algorithm.

Another factor that reflects the performance of a parallel algorithm is the total number of scalar operations it requires. In the case of serial algorithms, the serial time complexity function itself denotes the number of scalar operations. But it is a common practice to design parallel algorithms by introducing extra or redundant scalar computations to achieve speed-up. The redundancy factor, Rp, of a p-processor algorithm is defined as the ratio of the total scalar operations performed by the parallel algorithm to the total scalar operations of the serial algorithm. Clearly, Rp ≥ 1.

Consider the problem of adding 8 numbers x1, x2, … , x8. A serial algorithm for this problem proceeds as follows:

1st step: y1 = x1 + x2

2nd step: y2 = y1 + x3

3rd step: y3 = y2 + x4

4th step: y4 = y3 + x5

5th step: y5 = y4 + x6

6th step: y6 = y5 + x7

7th step: y7 = y6 + x8

The same problem can be solved by either of the following parallel algorithms.

1st parallel algorithm

1st parallel step: y1 = x1 + x2, y2 = x3 + x4, y3 = x5 + x6, y4 = x7 + x8

2nd parallel step: z1 = y1 + y2, z2 = y3 + y4

3rd parallel step: w1 = z1 + z2

2nd parallel algorithm

1st parallel step: y1 = x1 + x2, y2 = x3 + x4

2nd parallel step: y3 = x5 + x6, y4 = x7 + x8

3rd parallel step: z1 = y1 + y2, z2 = y3 + y4

4th parallel step: w1 = z1 + z2
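Both algorithms combine pairs of partial sums at every step. As a sketch, the fan-in pattern of the 1st algorithm can be simulated serially in Python (the concrete input values below are illustrative, not from the text):

```python
def fan_in_sum(values):
    """Sum a list by repeatedly adding adjacent pairs; each pass over the
    list corresponds to one parallel step, since all pair additions in a
    pass are independent. Returns (total, number of parallel steps)."""
    steps = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)] + \
                 ([values[-1]] if len(values) % 2 else [])
        steps += 1
    return values[0], steps

total, height = fan_in_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, height)   # 36 3  -- three parallel steps, as above
```

For 8 inputs the simulation performs exactly the three parallel steps shown, confirming the height of the 1st algorithm.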

Let us calculate the above described performance measures for these parallel algorithms.

Results of calculations for the 1st parallel algorithm are as follows:

T(8) = 7, T4(8) = 3, S4 = 7/3, E4 = 7/12, R4 = 1.

Results of calculations for the 2nd parallel algorithm are as follows:

T(8) = 7, T2(8) = 4, S2 = 7/4, E2 = 7/8, R2 = 1.

It is obvious that the 1st parallel algorithm is the better of the two: it has the smaller height and the greater speed-up.
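With T(8) = 7 serial steps, 3 parallel steps on 4 processors for the 1st algorithm and 4 parallel steps on 2 processors for the 2nd (both performing the same 7 additions in total), the measures can be checked with a short script (the function name is ours):

```python
def measures(T_serial, T_par, p, ops_par, ops_serial):
    """Compute speed-up, efficiency, and redundancy from step/operation counts."""
    Sp = T_serial / T_par        # speed-up Sp = T(N) / Tp(N)
    Ep = Sp / p                  # efficiency Ep = Sp / p
    Rp = ops_par / ops_serial    # redundancy Rp
    return Sp, Ep, Rp

# 1st parallel algorithm: 3 steps, 4 processors, 7 additions in total
print(measures(7, 3, 4, 7, 7))
# 2nd parallel algorithm: 4 steps, 2 processors, 7 additions in total
print(measures(7, 4, 2, 7, 7))
```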

The number of parallel steps is called the height of a parallel algorithm, and the maximal number of operations in a single parallel step is called its weight. If there are several parallel algorithms, the one having the minimal height is considered the best; usually this algorithm has the maximal weight. To design the best parallel algorithm, in each step we try to find the maximal number of independent operations and carry them out simultaneously.

3.2  Basic arithmetic operations

Parallelism in problem solving can be achieved at (at least) three different levels: the procedure level, the expression level, and the level of basic arithmetic operations.

At the first two levels, the time complexity function merely counts the number of basic (arithmetic or comparison) operations. For example, at these two levels the addition of two integers is taken as a unit operation, independent of the size of the integers. But it is common knowledge that it takes more time to add two large integers than two small ones. Thus, for the overall speed-up, it is necessary to use the best possible algorithms to perform these basic arithmetic operations. Below we analyse the number of bit operations, such as AND and OR, needed to perform the basic arithmetic operations in parallel. Since this is the lowest level at which parallelism can be achieved, it is called bit-level or micro-level parallelism.

3.2.1  Analysis of parallel addition

Addition is the simplest of the basic arithmetic operations. Despite this simplicity, the development of an “optimal” parallel algorithm for it is far from simple.

Brent’s algorithm Let a = aN aN-1 … a2 a1 and b = bN bN-1 … b2 b1 be two N-bit integers in binary to be added. Let s = a + b, where s = sN sN-1 … s2 s1. It is well known that si = ai ⊕ bi ⊕ ci-1, where c0 = 0, and for i = 1 to N,

ci = gi ∨ (pi ∧ ci-1), with pi = ai ∨ bi and gi = ai ∧ bi,

where ⊕, ∨, and ∧ are the logical exclusive-OR, inclusive-OR and AND operations, respectively. Here ci is called the carry from the ith bit position, and pi and gi are the carry propagate condition and the carry generate condition, respectively. Clearly, we need c0 through cN-1 for computing s. Since c0 = 0, by distributing p1 we see that

c1 = g1,

and in general

ci = gi ∨ (pi ∧ gi-1) ∨ (pi ∧ pi-1 ∧ gi-2) ∨ … ∨ (pi ∧ pi-1 ∧ … ∧ p2 ∧ g1).

The propagate bits and the generate bits can be computed in parallel in a unit step using a linear array of N AND gates and N OR gates with two inputs. Once all the carry bits are known, si can be computed in just two steps. Thus, if TA(N) is the time required to add two N-bit binary numbers and TC(N-1) is the time to compute the N-1 carry bits c1 through cN-1 (that is, the height of the carry-computation stage), then TA(N) = TC(N-1) + 3. The carry bits can be computed in log N units of time by the associative fan-in technique, so TA(N) = log N + 3.
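The propagate/generate formulation can be checked with a bit-level simulation in Python (a serial rendering of the formulae; in Brent’s algorithm all carries would be evaluated in parallel, and the bit-list convention below is ours):

```python
def add_with_carries(a_bits, b_bits):
    """Add two binary numbers given as bit lists [a1, ..., aN] (LSB first),
    using pi = ai OR bi (propagate), gi = ai AND bi (generate),
    ci = gi OR (pi AND c(i-1)), si = ai XOR bi XOR c(i-1)."""
    N = len(a_bits)
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate conditions
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate conditions
    c = [0] * (N + 1)                             # c[0] = 0
    for i in range(1, N + 1):
        c[i] = g[i - 1] | (p[i - 1] & c[i - 1])
    s = [a_bits[i] ^ b_bits[i] ^ c[i] for i in range(N)]
    return s, c[N]                                # sum bits and final carry

# 1101 + 0110 (13 + 6 = 19 = 10011), bits listed LSB first
s, carry = add_with_carries([1, 0, 1, 1], [0, 1, 1, 0])
print(s, carry)   # [1, 1, 0, 0] 1
```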

3.2.2  Parallel multiplication

We first analyse the complexity of the Grade-School algorithm for the parallel multiplication of two numbers.

Grade-School algorithm The following example illustrates this algorithm. Let a=1101 and b=1101. Then

x1= 0001101, x2= 0000000, x3= 0110100, x4= 1101000.

Clearly, x1+x2 = 00001101 and x3+x4 = 10011100 can be computed in parallel, and the final result is obtained in one more addition: s = (x1+x2) + (x3+x4) = 10101001. The first step (forming the partial products) can be done in one unit of time. The associative fan-in algorithm takes O(log N) units of time to find the sum of two 2N-bit integers; therefore, the second step takes O(log² N) units of time to compute the sum of the N partial products. So the overall complexity (or height) of the Grade-School algorithm for the product of two N-bit integers is O(log² N).
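As a sketch, the Grade-School scheme can be simulated on Python integers (shifts model the bit-position offsets of the partial products, and the pairwise summation mirrors the fan-in step; the function name is ours):

```python
def grade_school_multiply(a, b, N):
    """Multiply two N-bit integers via partial products x_i = a * b_i * 2^(i-1),
    then sum the partial products pairwise (the fan-in pattern)."""
    xs = [((b >> i) & 1) * (a << i) for i in range(N)]    # partial products
    while len(xs) > 1:                                    # pairwise (parallel) sums
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)] + \
             ([xs[-1]] if len(xs) % 2 else [])
    return xs[0]

# a = 1101, b = 1101: partial products 13, 0, 52, 104; product 169 = 10101001
print(grade_school_multiply(0b1101, 0b1101, 4))   # 169
```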

Ofman-Wallace algorithm By cleverly exploiting the properties of the full adder, Ofman and Wallace independently developed a faster algorithm for multiplying two N-bit integers. The algorithm is based on the following result.

Let x = xN … x1, y = yN … y1, and z = zN … z1 be three N-bit integers. Then there exist two (N+1)-bit integers a = aN+1 aN … a1 and b = bN+1 bN … b1 such that a + b = x + y + z, where a and b can be obtained from x, y, and z in parallel in three steps by the formulae

ai = xi ⊕ yi ⊕ zi,
bi+1 = (xi ∧ yi) ∨ (yi ∧ zi) ∨ (xi ∧ zi),

for i = 1 to N, with b1 = 0 and aN+1 = 0. Since the computations of the ai and the bi can be done in parallel, a and b can be obtained in three steps. As an example, let x = 101, y = 110, and z = 011. Then it can be easily verified that a = 0000 and b = 1110, and indeed a + b = x + y + z = 14.

Applying this 3-to-2 reduction repeatedly to the partial products, followed by one final addition, the product of two N-bit integers can be obtained in O(log N) steps.
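The 3-to-2 reduction can be verified bit by bit (bit lists are LSB first; the helper names are ours):

```python
def carry_save(x_bits, y_bits, z_bits):
    """Reduce three N-bit integers (bit lists, LSB first) to two (N+1)-bit
    integers a, b with a + b = x + y + z:
    ai = xi XOR yi XOR zi, b(i+1) = majority(xi, yi, zi), b1 = 0."""
    N = len(x_bits)
    a = [x_bits[i] ^ y_bits[i] ^ z_bits[i] for i in range(N)] + [0]
    b = [0] + [(x_bits[i] & y_bits[i]) | (y_bits[i] & z_bits[i]) |
               (x_bits[i] & z_bits[i]) for i in range(N)]
    return a, b

def to_int(bits):
    """Interpret an LSB-first bit list as an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

# x = 101 (5), y = 110 (6), z = 011 (3), written LSB first
a, b = carry_save([1, 0, 1], [0, 1, 1], [1, 1, 0])
print(to_int(a), to_int(b))   # 0 14  -- and 0 + 14 = 5 + 6 + 3
```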

Karatsuba-Ofman algorithm We now describe an algorithm based on the divide-and-conquer strategy. Let a and b be N-bit integers. First, rewrite a and b as

a = A1 2^(N/2) + A0,

b = B1 2^(N/2) + B0,

where Ai and Bi are N/2-bit integers: A0 (respectively B0) is the remainder and A1 (respectively B1) the quotient when a (respectively b) is divided by 2^(N/2). Define

D0 = A0 B0, D1 = (A1 + A0)(B1 + B0), D2 = A1 B1.

Then ab = D2 2^N + (D1 − D2 − D0) 2^(N/2) + D0. Thus the product is obtained by performing three multiplications of (N/2)-bit integers (the recursive step) and a total of six additions/subtractions. Each addition/subtraction can be done in O(log N) steps, so the overall complexity of the algorithm is O(log² N).
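A recursive Python sketch of the scheme (the base-case threshold and function name are ours; the recombination uses ab = D2·2^N + (D1 − D2 − D0)·2^(N/2) + D0 with N split as half + (N − half)):

```python
def karatsuba(a, b, N):
    """Multiply two non-negative integers by splitting at 2^(N/2)."""
    if N <= 2:
        return a * b                                # small operands: multiply directly
    half = N // 2
    A1, A0 = a >> half, a & ((1 << half) - 1)       # quotient, remainder mod 2^half
    B1, B0 = b >> half, b & ((1 << half) - 1)
    D0 = karatsuba(A0, B0, half)                    # three half-size products
    D2 = karatsuba(A1, B1, N - half)
    D1 = karatsuba(A1 + A0, B1 + B0, half + 1)
    return (D2 << (2 * half)) + ((D1 - D2 - D0) << half) + D0

print(karatsuba(0b1101, 0b1101, 4))   # 169
```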

4  PRAM ALGORITHMS

4.1 A model of serial computation

The random access machine (RAM) is a model of a serial computer. A RAM consists of a memory, a read-only input tape, a write-only output tape, and a program. The program is not stored in memory and cannot be modified. The input tape contains a sequence of integers. Every time an input value is read, the input head advances one square. Likewise, the output head advances after every write. Memory consists of an unbounded sequence of registers, designated r0, r1, r2, … . Each register can hold a single integer. Register r0 is the accumulator, where computations are performed. The allowed RAM operations include load, store, read, write, add, subtract, multiply, divide, test, jump, and halt.

4.2 PRAM model of parallel computation

A PRAM consists of a control unit, a global memory, and an unbounded set of processors, each with its own private memory. Although active processors execute identical instructions, every processor has a unique index, and the value of a processor’s index can be used to enable or disable the processor or to influence which memory location it accesses.

A PRAM computation begins with the input stored in global memory and a single active processing element. During each step of the computation, an active, enabled processor may read a value from a single private or global memory location, perform a single RAM operation, and write into one local or global memory location. Alternatively, during a computation step a processor may activate another processor. All active, enabled processors must execute the same instruction, albeit on different memory locations. The computation terminates when the last processor halts.

Various PRAM models differ in how they handle read or write conflicts, i.e., when two or more processors attempt to read from, or write to, the same global memory location. Most of the results in the research literature have been based upon one of the following models:

1.  EREW (Exclusive Read Exclusive Write): read or write conflicts are not allowed.

2.  CREW (Concurrent Read Exclusive Write): concurrent read is allowed; multiple processors may read from the same global memory location during the same instruction step. Write conflicts are not allowed.

3.  CRCW (Concurrent Read Concurrent Write): concurrent reading and concurrent writing are allowed. A variety of CRCW models exist with different policies for handling concurrent writes to the same global address. We list three different models:

a.  COMMON. All processors concurrently writing into the same global address must be writing the same value.

b.  ARBITRARY. If multiple processors concurrently write to the same global address, one of the competing processors is arbitrarily chosen as the “winner,” and its value is written into the register.

c.  PRIORITY. If multiple processors concurrently write to the same global address, the processor with the lowest index succeeds in writing its value into the memory location.

The EREW PRAM model is the weakest. Clearly a CREW PRAM can execute any EREW PRAM algorithm in the same amount of time; the concurrent read facility is simply not used. Similarly, a CRCW PRAM can execute any CREW PRAM algorithm in the same amount of time.

The PRIORITY PRAM model is the strongest. Any algorithm designed for the COMMON PRAM model executes with the same complexity on the ARBITRARY and PRIORITY PRAM models as well, for if all processors writing to the same location write the same value, choosing an arbitrary processor as the “winner” produces the same result. Likewise, if an algorithm executes correctly when an arbitrary processor is chosen as the “winner,” the processor with the lowest index is as reasonable an alternative as any other. Hence any algorithm designed for the ARBITRARY PRAM model executes with the same time complexity on the PRIORITY PRAM.
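The three concurrent-write policies can be made concrete with a toy resolver (a model for exposition only; the processor indices and values are illustrative):

```python
def resolve_write(requests, policy):
    """Resolve concurrent writes to one global address.
    requests: list of (processor_index, value) pairs."""
    if policy == "COMMON":
        values = {v for _, v in requests}
        if len(values) > 1:
            raise ValueError("COMMON PRAM: all processors must write the same value")
        return values.pop()
    if policy == "ARBITRARY":
        return requests[0][1]          # any competing processor may win
    if policy == "PRIORITY":
        return min(requests)[1]        # lowest processor index wins
    raise ValueError("unknown policy")

writes = [(3, 7), (1, 9), (2, 7)]
print(resolve_write(writes, "PRIORITY"))   # 9 (processor 1 has the lowest index)
```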

4.3 PRAM algorithms

PRAM algorithms have two phases. In the first phase a sufficient number of processors are activated, and in the second phase these activated processors perform the computation in parallel.

4.3.1  Parallel reduction on EREW PRAM

Given n values a1, a2, … , an and an associative binary operator +, reduction is the process of computing a1 + a2 + … + an. Parallel summation is an example of a reduction operation. The operation of the algorithm on 4 + 3 + 8 + 2 + 9 + 1 + 0 + 5 + 6 + 3 is illustrated below.
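A serial simulation of such a reduction is sketched below (each iteration of the while loop corresponds to one parallel step; within a step, every active "processor" reads and writes disjoint cells, as the EREW model requires):

```python
def erew_sum(a):
    """EREW PRAM reduction: ceil(log2 n) parallel steps over a copy of the input.
    In step with the given stride, processor i (i a multiple of 2*stride)
    performs a[i] += a[i + stride]; all such accesses are conflict-free."""
    a = list(a)
    n = len(a)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):   # one parallel step
            a[i] += a[i + stride]
        stride *= 2
    return a[0]

print(erew_sum([4, 3, 8, 2, 9, 1, 0, 5, 6, 3]))   # 41
```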