95409 Lecture Note on Sep. 22 & 24, 1999

prepared by: Suganthan Sivagnanasundaram 204705

Tongling Mao

Bulk-Synchronous Parallel Model

A sequential computer channels all of its processing through one physical location, so only a certain number of operations can be performed in a given period of time. Sequential computers are built around a central unifying model, the von Neumann architecture, which allows hardware and software designers to build more efficient sequential computers without worrying about each other's diversity. In a parallel machine, processing occurs simultaneously at many different locations, so many more computational operations can be completed in the same amount of time. The von Neumann architecture functions as an efficient bridge connecting software and hardware to create an efficient sequential computer. To create a comparably efficient parallel computer, we need such a bridging model for parallel computation, one that can act as a standard everyone agrees on. The Bulk-Synchronous Parallel (BSP) model was proposed by Prof. L. G. Valiant of Harvard as such a model, capable of satisfying the stringent quantitative requirements of parallel computers.

The BSP Computer

A parallel computer built using the BSP model is illustrated below:


Pic.1 Simple Diagram of a basic BSP computer model. This diagram is extracted from

The basic requirements for a BSP computer are:

  • a set of processor-memory pairs (components)
  • a communications network that delivers messages point-to-point(router)
  • a mechanism for the efficient synchronization of all, or a subset, of the processors at regular time intervals

Note that the router is intended only for implementing storage access between different components; it has no combining, duplicating, or broadcasting facilities. This parallel computer is called a BSP computer because all processors, or subsets of them, are synchronized in bulk.

Computation on a BSP Computer

Computation on a BSP computer consists of a sequence of supersteps. Each superstep consists of steps followed by a barrier synchronization. A step is a local operation by a processor on locally available data. Non-local data access takes place at the barrier synchronization: requests for non-local data can be placed during a superstep, and the data becomes available at the synchronization after the current superstep ends. A non-local data request is non-blocking, so it does not hold up computation. After each synchronization, a global check is made to determine whether the superstep has been completed by all components. If all the components have completed their superstep, the machine proceeds to the next superstep; otherwise, a further period of time is allocated for the unfinished superstep to complete. An alternative synchronization mechanism can keep track of unfinished supersteps and allow the machine to proceed to the next superstep as soon as the previous one completes.
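The superstep structure described above can be sketched with threads and a barrier. This is a hypothetical simulation, not a real BSP library: the `inbox`/`outbox` arrays and the worker logic are invented here to illustrate that messages posted during one superstep only become readable after the barrier.

```python
import threading

P = 4
barrier = threading.Barrier(P)
inbox = [[] for _ in range(P)]    # messages readable in the NEXT superstep
outbox = [[] for _ in range(P)]   # messages posted during the CURRENT superstep
lock = threading.Lock()
result = [None]

def worker(pid, data):
    local = data[pid]
    # Superstep 1: local computation on locally available data ...
    partial = sum(local)
    # ... plus a non-blocking request: post the partial sum to processor 0.
    with lock:
        outbox[0].append(partial)
    barrier.wait()                # barrier synchronization: all posts are done
    inbox[pid], outbox[pid] = outbox[pid], []   # deliver this processor's messages
    barrier.wait()                # every inbox is now ready
    # Superstep 2: processor 0 now sees all four partial sums.
    if pid == 0:
        result[0] = sum(inbox[0])

data = [[1, 2], [3, 4], [5, 6], [7, 8]]
threads = [threading.Thread(target=worker, args=(i, data)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result[0])  # 36
```

The two `barrier.wait()` calls mirror the rule that non-local data requested during one superstep is only guaranteed to be available in the next.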


The following picture illustrates the phases of computation:

Pic 2. The phases of computation in the BSP model. This diagram is extracted from

Performance of BSP Computers

The performance of a BSP computer depends on:

  • p = number of processors
  • s = processor speed (steps/sec)
  • l = the cost to achieve barrier synchronization (# of steps performed in the time taken to synchronize)
  • g = the cost to deliver message data (in steps)
  • h = the maximum number of messages that each processor sends or receives in a single exchange

A step is a basic unit of calculation.

The value of g is approximated by the ratio of the number of local computational operations performed per second by all the processors to the total number of data words delivered per second by the communication system. Knowing g lets us determine the time taken to exchange data between processors: during an exchange in which each processor sends or receives h messages, gh steps could have been computed. Therefore, the closer g and l are to 1, the more efficient the parallel performance of a BSP computer.
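In the standard BSP cost model, which the parameters above define, a superstep performing at most w local steps per processor and exchanging at most h messages per processor costs w + g·h + l steps (some formulations use max(w, g·h) + l instead). A minimal sketch, with invented example numbers:

```python
def superstep_cost(w, h, g, l):
    """Cost, in processor steps, of one BSP superstep.
    w: maximum local computation steps on any processor
    h: maximum messages sent or received by any processor
    g: steps per message delivered by the router
    l: steps spent achieving barrier synchronization
    """
    return w + g * h + l

# Example with invented numbers: 10,000 local steps, h = 100 messages,
# g = 4 and l = 500 give a communication + synchronization overhead of 900.
print(superstep_cost(10_000, 100, 4, 500))  # 10900
```

The overhead term g·h + l shrinks relative to w as g and l approach 1, which is why small g and l make the machine behave almost like an idealized shared-memory computer.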

Increasing the number of processors without increasing the computational load only adds communication, latency, and transfer costs, because the processors must communicate with one another.

If g and l are equal to 1, the cost of accessing remote data is approximately equal to the cost of accessing local data, and the computation scales to the limit.

Current BSP Computers
These are some examples of BSP computer models:
  • a single processor with cache and off-chip memory,
  • networks of workstations or PCs with PVM, MPI or some other message passing software,
  • distributed memory processor arrays (e.g. IBM SP2, Intel Paragon, Meiko CS2, Parsytec GC/PowerPlus),
  • And, when due allowance is made for the actual disposition of physical memory, shared memory multiprocessors (e.g. Multi-processor INTEL Pentium Pro, Silicon Graphics Origin or Power Challenge, CRAY T3D/T3E, Convex SPP).

Note: Most of the notes are prepared using the following website;

and A Bridging Model for Parallel Computation by Leslie G. Valiant, Communications of the ACM, 33(8), Aug. 1990.

Algorithm 1.2 Sum on the PRAM Model

The following algorithm computes the sum of n numbers stored in an array A in the shared memory of a PRAM with n processors. Each processor Pi, 1 <= i <= n, executes the code below.

Begin

  1. global read(A(i), a)
  2. global write(a, B(i))
  3. for h = 1 to log n do
       if (i <= n/2^h) then
         begin
           global read(B(2i-1), x)
           global read(B(2i), y)
           Set z := x + y
           global write(z, B(i))
         end
  4. if i = 1 then global write(z, S)

End
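A sequential sketch of the algorithm above (the inner loop over i stands in for the n processors executing concurrently in one time unit; `pram_sum` is a name invented here):

```python
import math

def pram_sum(A):
    """Simulate Algorithm 1.2: the loops over i play the role of the
    processors Pi that would run concurrently on a PRAM."""
    n = len(A)                          # assumed to be a power of 2
    B = [0] * (n + 1)                   # 1-indexed shared array B
    for i in range(1, n + 1):           # steps 1-2: B(i) := A(i)
        B[i] = A[i - 1]
    for h in range(1, int(math.log2(n)) + 1):
        for i in range(1, n // 2 ** h + 1):   # only processors with i <= n/2^h
            B[i] = B[2 * i - 1] + B[2 * i]    # z := x + y; B(i) := z
    return B[1]                         # step 4: processor 1 writes B(1) to S

print(pram_sum([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

After iteration h, the first n/2^h entries of B hold sums of 2^h consecutive input elements, so after the last iteration B(1) holds the total.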

The following example was done in class for n = 8 (processors P1 through P8). After step 2, array B holds a copy of A; after iteration h = 1, the first 4 entries of B hold pairwise sums; after h = 2, the first 2 entries hold sums of quadruples; after h = 3, B(1) holds the total, which is then written to S. Steps 1 and 2 each take n operations in all.

The time taken to find the sum of n numbers is:

Reading array A – 1 time unit

Writing to array B – 1 time unit

Writing the sum to S – 1 time unit

Calculations – the array is processed log(n) times.

T(n) = log(n) + 3

= O(log n) (for the parallel model)

The sequential model takes O(n) time.

Therefore the speedup factor is:

Speedup = Seq. Time / Parallel Time = O(n)/O(log n) = O(n/log n)

The work done is:

Reading A – n

Writing to B – n

Writing the sum to S – 1

Total array processing – 4(n/2 + n/4 + … + 1) <= 4n (each active processor performs 4 operations per iteration: two reads, one addition, one write)

Total work = n + n + 4n + 1 = O(n)

Chapter 1.5

The Work-Time Presentation Framework of Parallel Algorithms

The work-time (WT) paradigm provides informal guidelines for a two-level top-down description of parallel algorithms.

  • Upper Level (Work-Time (WT) presentation of algorithms): describe the algorithm in terms of a sequence of time units, where each time unit may include any number of concurrent operations. Main advantage: there is no need to deal with processors; the presence of p processors would have bounded the number of operations to at most p in each time unit.
  • Lower Level: adapt the WT presentation to run on p processors; the result is a parallel algorithm that runs in T(n) time units while performing W(n) work.

The WT Scheduling Principle: Let Wi(n) be the number of operations performed in time unit i, where 1 <= i <= T(n). Simulate each set of Wi(n) operations in at most ⌈Wi(n)/p⌉ parallel steps by the p processors, for each 1 <= i <= T(n). If the simulation is successful, the corresponding p-processor PRAM algorithm takes

sum over i of ⌈Wi(n)/p⌉ <= sum over i of (Wi(n)/p + 1) <= W(n)/p + T(n) parallel steps.

The success of this principle depends on two implementation issues:

  • the calculation of Wi(n) for each i (usually trivial);
  • the allocation of each processor to the appropriate tasks to be performed by that processor.
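A small numeric illustration of the scheduling principle (the values are invented to match the sum algorithm above, which performs Wi(n) = n/2^i operations in time unit i):

```python
import math

def scheduled_steps(W_per_unit, p):
    """Parallel steps used when p processors simulate a WT presentation:
    time unit i with Wi operations is simulated in ceil(Wi / p) steps."""
    return sum(math.ceil(w / p) for w in W_per_unit)

# Sum algorithm on n = 16 inputs: Wi = [8, 4, 2, 1], so T(n) = 4, W(n) = 15.
W = [8, 4, 2, 1]
p = 4
print(scheduled_steps(W, p))   # 5 parallel steps
print(sum(W) / p + len(W))     # the bound W(n)/p + T(n) = 7.75
```

The simulated running time (5 steps) respects the bound W(n)/p + T(n), as the scheduling principle guarantees.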

Chapter 1 Conclusion:

  • Scheduling principle: relates the upper-level WT presentation to a lower-level p-processor algorithm
  • W(n): the total number of operations needed to execute the algorithm.
  • T(n): the number of time units required by the algorithm, where during each time unit a number of concurrent operations can take place.
  • Speedup = Sequential Time / T(n)
  • Cost = T(n) * P(n)
  • Efficiency = Speedup / p, for p processors
  • PRAM model

Chapter 2.

Basic Techniques

2.1 Balanced Trees

Example: The PRAM algorithm to compute the sum of n elements presented in Section 1.3.2 is based on a balanced binary tree.

This algorithm is an example of the general strategy of building a balanced binary tree on the input elements and traversing the tree forward and backward to and from the root.

An Optimal Prefix-Sum Problem:

A sequence of n elements {X1, X2, …, Xn} is drawn from a set S with a binary associative operation, denoted by *. The prefix sums of this sequence are the n partial sums (or products) defined by

Si = X1 * X2 * … * Xi, 1 <= i <= n.

A sequential algorithm:

S1 = X1, S2 = X1 * X2, …, Sn = X1 * X2 * … * Xn

T(n)= O(n), W(n) = O(n)

A parallel algorithm:

Build a balanced binary tree on the inputs. Each internal node represents the application of the operation * to its children during a forward traversal of the tree. Hence, each node v holds the sum of the elements stored in the leaves of the subtree rooted at v.

During the first time unit (taking n = 8), the four elements Y1 = X1 * X2, Y2 = X3 * X4, …, Y4 = X7 * X8 are computed. The second time unit corresponds to computing Z1 = Y1 * Y2, Z2 = Y3 * Y4, …, with a recursive call handling the half-size input Y1, …, Y4. In the third phase, the prefix sums of the original sequence are generated from the results of the recursive call.

T(n) = O (log n), W(n) = O (n)
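The algorithm above can be sketched as a recursive function (the sequential loops stand in for the concurrent operations of one time unit; `prefix_sums` is a name chosen here):

```python
def prefix_sums(X, op=lambda a, b: a + b):
    """Prefix sums via a balanced binary tree: pair adjacent elements
    (forward phase), recurse on the half-size sequence, then expand the
    half-size prefix sums back to full size (backward phase)."""
    n = len(X)                      # assumed to be a power of 2
    if n == 1:
        return X[:]
    # Forward phase (one parallel time unit): pairwise sums Y(i) = X(2i-1) * X(2i)
    Y = [op(X[2 * i], X[2 * i + 1]) for i in range(n // 2)]
    Z = prefix_sums(Y, op)          # recursive call on the half-size input
    # Backward phase: odd 0-based positions copy a value from Z directly;
    # even positions combine the previous Z entry with the matching X element.
    S = [0] * n
    for i in range(n):
        if i == 0:
            S[i] = X[0]
        elif i % 2 == 1:
            S[i] = Z[i // 2]
        else:
            S[i] = op(Z[i // 2 - 1], X[i])
    return S

print(prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

Because op can be any associative operation, the same code computes prefix products, maxima, and so on.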

Note: Building a balanced binary tree on the inputs and traversing it to or from the root leads to efficient algorithms for many simple problems. This is one of the most elementary and most useful parallel techniques.

Pointer Jumping /Pointer doubling:

Example : A Root Finding Problem:

Let F be a forest. F is specified by an array P of length n such that P(i) = j if (i, j) is an arc in F; that is, j is the parent of i in a tree of F. (If i is a root, P(i) = i.) The problem is to determine the root S(j) of the tree containing node j, for each j between 1 and n.

Sequential Algorithm:

One approach is based on first identifying the roots and reversing the links of the trees, then performing a traversal of each tree from its root. This problem can be solved in linear time.

Parallel Algorithm:

Pointer jumping (or path doubling):

Update the successor of each node to be that successor's successor.

Algorithm (Pointer Jumping)

Input: A forest of rooted directed trees, each with a self-loop at its root, such that each arc is specified by ( i, P(i)), where 1 <= i <= n .

Output: For each vertex i, the root S(i) of the tree containing i .

begin

  1. for 1 <= i <= n do in parallel

       Set S(i) := P(i)

       while (S(i) ≠ S(S(i))) do

         Set S(i) := S(S(i))

end

As the technique is applied repeatedly, the successor of a node becomes an ancestor that is closer and closer to the root of the tree containing that node. In fact, the distance between a node and its successor doubles at each iteration unless the successor of the successor is a root.

As for the time analysis, the number of iterations is O(log h), where h is the maximum height of a tree in the forest. Each iteration can be executed in O(1) parallel time with O(n) operations. Therefore, the algorithm runs in O(log h) time and performs a total of W(n) = O(n log h) operations.
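A sketch of pointer jumping (the list comprehension computes all new values of S before any is written, mimicking the synchronous parallel update; `find_roots` is a name chosen here):

```python
def find_roots(P):
    """P[i] is the parent of node i in a forest; P[i] == i at a root.
    Repeatedly replace each node's successor by the successor's successor,
    so the distance to the root roughly doubles away every round."""
    n = len(P)
    S = list(P)                                  # S(i) := P(i)
    while any(S[i] != S[S[i]] for i in range(n)):
        S = [S[S[i]] for i in range(n)]          # one round: S(i) := S(S(i))
    return S

# A forest of two trees: 3 -> 2 -> 1 -> 0 and 5 -> 4 (0 and 4 are roots).
P = [0, 0, 1, 2, 4, 4]
print(find_roots(P))  # [0, 0, 0, 0, 4, 4]
```

Building the new list before assigning it is essential: it reproduces the PRAM semantics in which all processors read the old S values before any processor writes.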

Questions of the day:

1. The concepts of W(n), T(n), speedup, Cost and Efficiency .

2. For the basic techniques, how are balanced trees used to solve the prefix-sum problem?

3. For the basic techniques, what is the pointer-jumping algorithm for determining the root S(j) of the tree containing each node j? What are its W(n) and T(n)?