PARALLEL PROCESSING
CS802
Part I
Instructor: Dr. C. N. Zhang
Department of Computer Science
University of Regina
Regina, Saskatchewan
Canada, S4S 0A2
Table of Contents
- Introduction to Parallel Processing
- Pipeline
- Design of pipeline
- RISC instruction pipeline
- Mapping nested loop algorithms into systolic arrays
- Parallelism Analysis
- Data dependency analysis for non-loop programs
- Data dependency analysis for loops
- Parallel Computer Architectures
- Review of computer architecture
- Computer classification
- SIMD Computers
- MIMD Computers
- Parallel Programming
- Parallelism for single loop algorithms
- Parallelism for nested loop algorithms
- Introduction to MultiPascal (Part II)
- Examples of Parallel Processing
- Intel MMX technology (Part III)
- Cluster computing (Part IV)
- Task scheduling in message passing MIMD
- Associative processing
- Data flow computing
- Introduction to Parallel Processing
- Parallel Processing: two or more tasks being performed (executed) simultaneously.
- Basic techniques for parallel processing
a. pipeline
b. replicated units
c. pipeline plus replicated units
- Parallel processing can be implemented at the following levels
a. at processor level
b. at instruction execution level
c. at arithmetic and logic operation level
d. at logic circuit level
Speed-up of the parallel processing: S = TS / TP,
where TS is the processing time required in a sequential (non-parallel) manner and TP is the processing time required by the parallel processing technique.
Efficiency of the parallel processing: E = S / n,
where n is the number of units used to perform the parallel processing.
Note: 0 < S ≤ n and 0 < E ≤ 1
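A minimal Python sketch of these two definitions ( the numbers in the usage line are made up purely for illustration ):

# Speed-up and efficiency of a parallel implementation.
def speedup(Ts, Tp):
    return Ts / Tp              # S = TS / TP

def efficiency(Ts, Tp, n):
    return speedup(Ts, Tp) / n  # E = S / n

# e.g. a job of 100 t done in 25 t on 5 units: S = 4.0, E = 0.8
print(speedup(100, 25), efficiency(100, 25, 5))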
- How To Improve Performance of Computer Systems?
- by electronic technology, e.g., new VLSI development
- by software development, e.g., better OS and compiler
- by architecture enhancements
- Architecture Enhancements
- pipeline
- parallelism
- memory hierarchy
- multiple processor computer systems
- hardware supports for OS and compiler
- special purpose architectures
Example: compute the column sums Yj = a1j + a2j + ... + anj, j = 1, 2, ..., m,
where the aijs are integers, i = 1, 2, ..., n, j = 1, 2, ..., m.
Assume that the time delay of the Adder is t. Sequential processing: Use one Adder.
TS = ( n-1 ) m t
Parallel Processing:
- use two adders ( each adder handles half of the m columns ):
Tp = ( n-1 ) ( m/2 ) t
S = 2
E = 1
- use m adders:
Tp = ( n-1 ) t
S = m
E = 1
- use n-1 adders ( n = 2^k ) arranged as a pipelined adder tree
e.g., n = 8: Tp = ( log2 n + m - 1 ) t
Figure 1.1: a tree of n - 1 = 7 adders; the m input columns ( a1j a2j ... a8j, j = 1, ..., m ) enter the tree one column per clock period.
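The pipelined adder-tree timing can be checked with a small cycle-level sketch, assuming one new column of operands enters the tree every clock period t; for n = 8 and m = 10 it reports log2 8 + 10 - 1 = 12 clock periods:

from math import log2

def pipelined_column_sums(columns):
    # n = 2**k operands per column, log2(n) adder stages, one new column per clock
    stages = int(log2(len(columns[0])))
    pending = list(columns)
    in_flight = []                       # (partial sums, stages still to pass)
    sums, clock = [], 0
    while pending or in_flight:
        clock += 1
        if pending:                      # a new column enters the first stage
            in_flight.append((pending.pop(0), stages))
        nxt = []
        for vals, left in in_flight:     # every column passes one adder stage
            vals = [vals[2 * i] + vals[2 * i + 1] for i in range(len(vals) // 2)]
            if left == 1:
                sums.append(vals[0])     # final sum leaves the tree this clock
            else:
                nxt.append((vals, left - 1))
        in_flight = nxt
    return sums, clock

print(pipelined_column_sums([[1] * 8 for _ in range(10)]))
# ([8, 8, 8, 8, 8, 8, 8, 8, 8, 8], 12)  ->  Tp = ( log2 8 + 10 - 1 ) t = 12 t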
2. Pipeline
2.1 Design of Pipeline
- The task must be divisible into a number of subtasks performed in sequence; ideally, all subtasks require the same amount of time.
- Design an independent hardware unit for each subtask.
- Connect all units in order.
- Arrange the input data and collect the output data.
First pipeline design ( n = 4 ):
a. Subtask1: S1 = a1j + a2j
Subtask2: S2 = S1 + a3j
Subtask3: Yj = S2 + a4j
b. One adder unit per subtask: unit-1 computes S1 from ( a1j, a2j ); unit-2 computes S2 from ( S1, a3j ); unit-3 computes Yj from ( S2, a4j ).
c. The three units are connected in a linear chain: ( a1j, a2j ) → unit-1 → ( S1, a3j ) → unit-2 → ( S2, a4j ) → unit-3 → Yj.
d. Input/output scheduling ( m = 4 ): the a1j and a2j streams enter unit-1 directly, the a3j stream is delayed by one clock and the a4j stream by two clocks ( zero-padded ), so that each operand meets its partial sum; the results Y1, Y2, Y3, Y4 leave unit-3 one per clock.
Tp = ( n + m - 2 ) t
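The schedule of this first design can be tabulated with a small sketch, assuming unit u works on column j at clock j + u - 1 ( which is how the skew in part d arranges the data ):

# Tabulate which (unit, column) pair is active at each clock of the chained design.
def chain_schedule(n, m):
    units = n - 1
    schedule = {}                            # clock -> list of (unit, column)
    for j in range(1, m + 1):
        for u in range(1, units + 1):
            schedule.setdefault(j + u - 1, []).append((u, j))
    return schedule

for clock, work in sorted(chain_schedule(4, 4).items()):
    print(clock, work)
# clock 1: [(1, 1)]  ...  clock 6: [(3, 4)]  ->  Tp = ( 4 + 4 - 2 ) t = 6 t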
Second pipeline design:
a. Subtask1: S1 = a1j + a2j and S2 = a3j + a4j
Subtask2: Yj = S1 + S2
b. Unit for subtask1: two adders computing S1 from ( a1j, a2j ) and S2 from ( a3j, a4j ) in parallel; unit for subtask2: one adder computing Yj from ( S1, S2 ).
c. The units are connected as a two-level adder tree: ( a1j, a2j, a3j, a4j ) → unit-1 → ( S1, S2 ) → unit-2 → Yj.
d. Input/output scheduling ( m = 4 ): the four columns ( a1j, a2j, a3j, a4j ), j = 1, ..., 4, enter the tree one column per clock, and the results Y1, Y2, Y3, Y4 leave one per clock.
Tp = ( m -1 + log n ) t
2.2 RISC instruction pipeline
2.2.1 Features of RISC
• Register-oriented organization
a) registers are the fastest available storage devices
b) saves memory access time for most data-passing operations
c) VLSI makes it possible to build a large number of registers on a single chip
• Few instructions:
a) there are two groups of instructions: register to register, memory to register or register to memory
b) complex instructions are done by software
c) fixed size instruction format
• Simple addressing modes
a) simplify memory address calculation
b) make it easy to pipeline
• Use register window technique
a) optimizing compilers with fewer instructions
b) based on large register file
• CPU on a single VLSI chip
a) avoid long time delay of off-chip signals
b) regular ALU and simple control unit
• Pipelined instruction execution
a) two-, three-, or four-stage pipelines
b) delayed-branch and data forward
2.2.2 Two types of instructions in RISC
1) Register-to-register, each of which can be done by the following two phases:
I : Instruction fetch
E: Instruction execution
2) Memory-to-register ( load ) and register-to-memory ( store ), each of which can be
done by the following three phases:
I: Instruction fetch
E: Memory address calculation
D: Memory operation
• Pipeline: a task or operation is divided into a number of subtasks that need to be performed in sequence, each of which requires the same amount of time.
Speed-up S = Ts / Tp, where Ts is the total time required for a job performed without the pipeline
and Tp is the time required by the pipeline.
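For an ideal, conflict-free pipeline this gives the familiar expression S = n k / ( k + n - 1 ) for n instructions on k equal stages; a minimal sketch ( an idealization — the RISC examples below include conflicts ):

# Ideal pipeline speed-up: n instructions, k stages of one clock each, no conflicts.
def pipeline_speedup(n, k):
    Ts = n * k              # non-pipelined: every instruction takes k clocks
    Tp = k + (n - 1)        # pipelined: first result after k clocks, then one per clock
    return Ts / Tp

print(pipeline_speedup(7, 2))    # e.g. 7 instructions on a 2-stage pipeline -> 1.75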
• Pipeline conflict: two subtasks on two pipeline units cannot be computed simultaneously; one task must wait until the other task is completed.
• To resolve pipeline conflicts in the instruction pipeline, the following techniques are used:
a. insert a no-operation stage ( N ), used by the two-way pipeline.
b. insert a no-operation instruction ( NOOP, phases I E ), used by the three-way and four-way pipelines.
• Two-way pipeline: there is only one memory in the computer, which stores both
instructions and data. Phases I and E can be overlapped.
Example:
Load A ← M
Load B ← M
Add C ← A + B
Store M ← C
Branch X
X: Add B ← C + A
Halt
Figure 2.1 Sequential execution
Figure 2.2 Two-way pipelining: the same program executed with the I and E phases of consecutive instructions overlapped ( N stages inserted where conflicts occur ).
Speed-up = 17 / 11 = 1.54
where N represents the no-operation ( waiting ) stage inserted by the OS.
• Three-way pipeline: there are two memory units; one stores instructions, the other stores
data. Phases I, E, and D can be overlapped. The NOOP instruction ( two stages: I E ) is
inserted when a conflict would otherwise occur.
Example:
Load A ← M
Load B ← M
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
X: Add B ← C + A
Halt
Figure 2.3 Three-way pipelining
Speed-up = 17 / 10 = 1.7
where NOOP represents the no-operation instruction ( waiting stage ) inserted by the OS.
• Four-way pipeline: same as three-way, but phase E is divided into:
E1: register file read
E2: ALU operation and register write
Phases of I, D, E1, and E2 can be overlapped.
Example:
Load A ← M
Load B ← M
NOOP
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
NOOP
X: Add B ← C + A
Halt
Figure 2.4 Four-way pipelining
Speed-up = 24 / 13 = 1.85
• Optimization of pipelining.
The goal is to reduce the number of NOOP instructions by reordering the sequence of the
program.
Example: ( three-way pipeline )
Inserted NOOP:                      Reversed instructions:
100 Load A ← M                      100 Load A ← M
101 Add B ← B + 1                   101 Jump 106
102 Jump 106                        102 Add B ← B + 1
103 NOOP                            106 Store M ← B
106 Store M ← B
Use of the delayed branch: the Add instruction is moved into the branch delay slot, so the NOOP is no longer needed.
Figure 2.5
2.3 Mapping nested loop algorithms into systolic arrays
2.3.1 Systolic Array Processors
• It consists of:
(i) Host computer: controls the whole processing.
(ii) Control unit: provides the system clock and control signals, feeds input data to the systolic array,
and collects results from the systolic array. Only boundary PEs are connected to the
control unit.
(iii) Systolic array: a network of multiple processors combined with pipelining.
Possible types of systolic arrays:
(a) Linear array:
Figure 2.6
(b) Mesh connected 2-D array:
Figure 2.7
(c) 2-D hexagonal array
Figure 2.8
(d) Binary tree
Figure 2.9
(e) Triangular array
Figure 2.10
• Features of systolic arrays:
(i) PEs are simple, usually contain a few registers and simple ALU circuits
(ii) There are a few types of PEs
(iii) Interconnections are local and regular
(iv) All PEs act rhythmically, controlled by a common clock
(v) The input data are used repeatedly as they pass through the PEs
Conclusion:- systolic arrays are very suitable for VLSI design
- systolic arrays are recommended for computation-intensive algorithms
Example-1: matrix-vector multiplication Y = A X,
where A is an ( n × n ) matrix and Y and X are ( n × 1 ) matrices.
Figure 2.11: a linear systolic array for matrix-vector multiplication ( n = 4 ); the aij inputs enter in skewed order ( a11 at t0; a21, a12 at t1; a31, a22, a13 at t2; a41, a32, a23, a14 at t3; ...; a44 at t6 ) and pass the PEs that hold x1, x2, x3, x4.
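A behavioural sketch of one common linear systolic schedule for Y = A X, assuming each PE j holds xj and row i of A reaches PE j at clock i + j ( consistent with the skew in Figure 2.11 ):

def systolic_matrix_vector(A, x):
    n = len(x)
    y = [0] * n
    for t in range(2 * n - 1):          # 2n - 1 clocks in total ( t0 ... t6 for n = 4 )
        for j in range(n):              # every PE works in parallel within one clock
            i = t - j                   # row of A handled by PE j at clock t
            if 0 <= i < n:
                y[i] += A[i][j] * x[j]  # PE j adds a(i+1, j+1) * x(j+1)
    return y

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(systolic_matrix_vector(A, [1, 1, 1, 1]))   # [10, 26, 42, 58]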
Example-2: matrix-matrix multiplication C = A × B, where A, B, and C are ( n × n ) matrices.
Figure 2.12: a two-dimensional ( hexagonal ) systolic array for matrix multiplication. The skewed A and B input streams enter the array from two directions while the C results move along a third direction; the data meet at the PEs along successive time fronts.
2.3.2 Mapping Nested Loop Algorithms onto Systolic Arrays
• Some design objective parameters:
(i) Number of PEs ( np ): total number of PEs required.
(ii) Computation time ( tc ): total number of clocks required for the systolic array to
complete a computation.
(iii) Pipeline period ( p ): the time period of pipelining, i.e., the number of clocks between two successive computations started in the same PE.
(iv) Space-time² ( np tc² ): used to measure cost and performance, assuming that speed is more important than cost.
• In general, a mapping procedure involves the following steps:
Step 1: algorithm normalization (all variables should be with all indices)
Step 2: find the data dependence matrix of the algorithm called D
Step 3: pick up a valid ( l × l ) transformation T ( suppose the algorithm is an l-nested loop )
Step 4: PE design and systolic array design
Step 5: systolic array I/O scheduling
Step 1: Algorithm normalization
Some variables of a given nested-loop algorithm may not carry all the loop indices; these are called broadcasting variables. In order to find the data dependence matrix of the algorithm, it is necessary to eliminate all broadcasting variables by renaming or by adding more variables.
Example: matrix-matrix multiplication C = A × B, which can be described by the following nested algorithm:
for i : = 1 to n
for j : = 1 to n
for k : = 1 to n
C( i, j ) : = C( i, j ) + A( i, k ) * B( k, j );
(suppose all C( i, j ) = 0 before the computation)
Where A, B, and C are broadcasting variables.
Algorithm normalization:
for i : = 1 to n
for j : = 1 to n
for k : = 1 to n
begin
A( i, j, k ) : = A( i, j-1, k );
B( i, j, k ) : = B( i-1, j, k );
C( i, j, k ) : = C( i, j, k-1 ) + A( i, j, k ) * B( i, j, k );
end
where A( i, 0, k ) = aik, B( 0, j, k ) = bkj, C( i, j, 0 ) = 0, and
C( i, j, n ) = cij
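A direct Python sketch of the normalized algorithm ( dictionary keys play the role of the indexed variables; the boundary assumptions A( i, 0, k ) = aik, B( 0, j, k ) = bkj, C( i, j, 0 ) = 0 are applied as stated above ):

def normalized_matmul(a, b):
    n = len(a)
    A, B, C = {}, {}, {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                A.setdefault((i, 0, k), a[i - 1][k - 1])   # A( i, 0, k ) = aik
                B.setdefault((0, j, k), b[k - 1][j - 1])   # B( 0, j, k ) = bkj
                C.setdefault((i, j, 0), 0)                 # C( i, j, 0 ) = 0
                A[(i, j, k)] = A[(i, j - 1, k)]            # propagate A along j
                B[(i, j, k)] = B[(i - 1, j, k)]            # propagate B along i
                C[(i, j, k)] = C[(i, j, k - 1)] + A[(i, j, k)] * B[(i, j, k)]
    return [[C[(i, j, n)] for j in range(1, n + 1)] for i in range(1, n + 1)]

print(normalized_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]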
Step2: Data dependence matrix
In a normalized algorithm, variables can be classified as
(i) generated variables: those variables appearing on the left side of the assignment statements
(ii) used variables: those variables appearing on the right side of the assignment statements
Consider a normalized algorithm:
for i1 : = l1 to n1
for i2 : = l2 to n2
:
:
for il : = ll to nl
begin
S1
S2
:
:
Sm
end
Let A( i11, i21, ..., il1 ) and B( i12, i22, ..., il2 ) be a pair of generated variable and used variable ( of the same array ) in a statement.
The vector ( i11 - i12, i21 - i22, ..., il1 - il2 )T is called a data dependence vector, denoted by dj, with respect to the
pair A( i11, i21, ..., il1 ) and B( i12, i22, ..., il2 ).
The data dependence matrix consists of all possible data dependence vectors of an algorithm.
Example: In the above example ( matrix-matrix multiplication ), the pairs A( i, j, k ) / A( i, j-1, k ), B( i, j, k ) / B( i-1, j, k ), and C( i, j, k ) / C( i, j, k-1 ) give
dA = ( 0, 1, 0 )T, dB = ( 1, 0, 0 )T, dC = ( 0, 0, 1 )T, so D = [ dA dB dC ].
Step3: Find a valid linear transformation matrix T:
Suppose l = 3 ( i1 = i, i2 = j, i3 = k )
where Π is a ( 1 × 3 ) matrix and S is a ( 2 × 3 ) matrix, i.e., T = [ Π ; S ].
T is a valid transformation if it meets the following conditions:
(a) det ( T ) ≠ 0
(b) Π di > 0 for all i
In fact, T can be viewed as a linear mapping which maps an index point ( i, j, k ) into ( t, x, y )T = T ( i, j, k )T.
Π is called the time transformation, S is called the space transformation.
det ( T ) ≠ 0 implies that the mapping is one to one.
A computation originally performed at ( i, j, k ) is now computed at the PE
located at ( x, y ) at time t, where t and ( x, y ) are determined as shown above.
Note that for a given algorithm there may exist a large number of valid
transformations. An important goal of mapping is to choose an "optimal" one,
as defined by the user; for example, one that reaches the minimum value of np tc².
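A minimal sketch of the validity test and the index-point mapping ( D below is the dependence matrix found in Step 2; the particular T is only an illustration, not necessarily the transformation used in the design that follows ):

def det3(M):
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_valid(T, D):
    Pi = T[0]                                   # first row = time transformation
    if det3(T) == 0:                            # condition (a): det(T) != 0
        return False
    for dep in zip(*D):                         # columns of D = dependence vectors
        if sum(p * x for p, x in zip(Pi, dep)) <= 0:
            return False                        # condition (b): Pi . di > 0
    return True

D = [[0, 1, 0],       # columns: dA, dB, dC of the normalized matrix multiplication
     [1, 0, 0],
     [0, 0, 1]]
T = [[1, 1, 1],       # Pi (time transformation)
     [0, 1, 0],       # S  (space transformation, first row)
     [0, 0, 1]]       # S  (space transformation, second row)
print(is_valid(T, D))                           # True

def apply(T, p):                                # map (i, j, k) to (t, x, y)
    return tuple(sum(r * q for r, q in zip(row, p)) for row in T)
print(apply(T, (2, 3, 1)))                      # (6, 3, 1): time 6, PE (3, 1)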
Step 4.1: PE design, including:
(a) arithmetic unit design according to the computations required in the body of the loop
(b) inputs and outputs of the PE and their directions.
The number of inputs and outputs is always the same as the number of vectors in the matrix D. The direction of each pair of ( input, output ) is determined by the mapped vector S dj drawn in the X-Y plane.
Step 4.2: Systolic array design includes:
(a) Locations of all PEs, which are obtained by the equation ( x, y )T = S ( i, j, k )T, where ( i, j, k ) ranges over
all possible index points in the given algorithm. For
instance, in the above example, suppose n = 3; then the set of all possible index points is
J = { ( 1, 1, 1 ), ( 1, 1, 2 ), ... , ( 3, 3, 3 ) }. All possible different values of the pair ( x, y ) form
the locations of the PEs.
(b) After having located all PEs in the X-Y plane, interconnections between PEs
are obtained according to the data path of each pair of input and output
determined in the PE design stage.
Example:
Consider the matrix-matrix multiplication again, with the D and the valid transformation T = [ Π ; S ] chosen above.
• PE design:
According to the body of the algorithm, each PE requires one adder and one multiplier.
Since D has three dependence vectors,
there are three pairs of inputs and outputs: ( ain, aout ), ( bin, bout ), and ( cin, cout ).
Figure 2.13: the PE cell. In every clock period the PE performs
aout : = ain
bout : = bin
c : = c + ain * bin
with the a data moving in the Y direction, the b data moving in the X direction, and c held in a local register.
The mapped vector of the C dependence ( S dC ) indicates that the variable C will stay in the PE register.
• More detailed design of the PE:
Figure 2.14
• Array design (n = 3)
According to ( x, y )T = S ( i, j, k )T, we have the following PE location table, where
1 ≤ i ≤ 3, 1 ≤ j ≤ 3, and 1 ≤ k ≤ 3.
Table 2.1
All possible PE locations are (-1, 1), (-1, 2), (-1, 3), (-2, 1), (-2, 2), (-2, 3), (-3, 1), (-3, 2), and (-3, 3).
Place all PEs on the X-Y plane according to those locations and connect them
according to the data paths determined in the PE design. Thus, we have the
following array ( np = 9 ):
Figure 2.15: the nine PEs placed in the X-Y plane at the locations listed above and connected according to the a, b, and c data paths.
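A small enumeration sketch, assuming S = [ -1 0 0 ; 0 1 0 ] ( a space transformation consistent with the PE locations ( -i, j ) listed above ) and the D of Step 2; it reproduces np = 9 and the a, b, c data-path directions of Figure 2.13:

S = [(-1, 0, 0), (0, 1, 0)]                        # assumed space transformation
D = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]              # dA, dB, dC
n = 3

def smap(v):                                       # (x, y) = S v
    return tuple(sum(s * q for s, q in zip(row, v)) for row in S)

locations = {smap((i, j, k)) for i in range(1, n + 1)
                             for j in range(1, n + 1)
                             for k in range(1, n + 1)}
print("np =", len(locations), sorted(locations))   # 9 PEs at (-i, j)
print([smap(d) for d in D])                        # [(0, 1), (-1, 0), (0, 0)]:
                                                   # a moves in +Y, b in -X, c stays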
Step5: I/O Scheduling: determine locations of all input and output data.
Computation time: tc = tmax - tmin + 1
In the above example, we have tmax = 9, tmin = 3, tc = 9 – 3 + 1 =7.
Reference time tref = tmin – 1 = 2.
Input data are: a11 a12 a13 a21 a22 a23 a31 a32 a33
b11 b12 b13 b21 b22 b23 b31 b32 b33
c11 c12 c13 c21 c22 c23 c31 c32 c33
Output data are: c11 c12 c13 c21 c22 c23 c31 c32 c33
Location of a11: according to the assumption of the normalized algorithm,
a11 = a( 1, 0, 1 ).
Since t = tref = 2, a11 should be placed
directly at its input position x = -1 and y = 0.
Location of a12: a12 = a( 1, 0, 2 ), so t = 3 > tref = 2; therefore, a12 is placed at x = -1
and y = -1 ( shifted back by one position along the direction of the "a" data path ).
Location of c11: c11 = c( 1, 1, 0 ),
therefore, c11 is located in the PE at x = -1 and y = 1.
Continue this process. Finally, we have the following I/O-scheduled systolic design:
Figure 2.16: the complete I/O-scheduled design; the b data enter as skewed streams b11 b21 b31 / b12 b22 b32 / b13 b23 b33 ( padded with leading zeros ), and the a data enter as skewed streams a11 / a21 a12 / a31 a22 a13 / a32 a23 / a33 ( padded with zeros ).
2.3.3 Approaches To Find Valid Transformations
For a given algorithm, there may exist a large number of valid transformations.
An important task in the mapping problem is to find an efficient, systematic approach to search all possible solutions and find an optimal one. In general, we hope to find a valid transformation such that:
(i) computation time is minimized
(ii) total number of PEs is minimized
(iii) pipeline period is 1
As shown in the above example, all these parameters ( tc, np, p ) can be calculated
after all the design steps are completed. In the following, we give simple
formulae for those design parameters without proof.
Suppose the normalized algorithm has the form:
for i : = 1 to n
for j : = 1 to n
for k : = 1 to n
begin
S1
S2
:
:
Sm
end
The valid transformation is T = [ Π ; S ], with Π = ( t11, t12, t13 ).
• Computation time ( tc ): tc = ( l1 - 1 ) | t11 | + ( l2 - 1 ) | t12 | + ( l3 - 1 ) | t13 | + 1,
where l1, l2, and l3 are the lengths of the three loops ( here l1 = l2 = l3 = n ).
• Number of PEs ( np ):
Tij is the ( i, j )-cofactor of matrix T.
• Pipeline period ( p ):
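A brute-force cross-check of the computation-time formula for the matrix-multiplication example, assuming Π = ( 1, 1, 1 ) ( consistent with tmin = 3 and tmax = 9 in Step 5 ):

n = 3
Pi = (1, 1, 1)                                   # assumed time transformation
times = [i * Pi[0] + j * Pi[1] + k * Pi[2]
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         for k in range(1, n + 1)]
tc_brute = max(times) - min(times) + 1           # tc = tmax - tmin + 1
tc_formula = (n - 1) * (abs(Pi[0]) + abs(Pi[1]) + abs(Pi[2])) + 1
print(tc_brute, tc_formula)                      # 7 7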
• Approach-1:
Three basic row operations:
(1) rowi <--> rowj
(2) rowi : = rowi + K rowj , i ≠ j, K is an integer
(3) rowi : = ( -1 ) rowi
Each row operation can be done by multiplying an elemental matrix by the given matrix.
Suppose that for a given matrix D, after applying basic row operations ( R1, R2, ..., Rn ), we obtain a matrix in which all elements of the first row are positive integers. We conclude that the matrix T = Rn ... R2 R1 is a valid transformation.
In fact, it is easy to verify:
(i) T = Rn ... R2 R1, so det ( T ) = det ( Rn ) ... det ( R1 ).
Since | det ( Ri ) | = 1 for every elemental row operation, | det ( T ) | = 1 and hence det ( T ) ≠ 0.
(ii) Since all elements of the first row of T D are greater than 0, Π di > 0 for all i.
Besides, according to the formula for p, we have p = 1.
Example: we use the same example again.
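A worked sketch of Approach-1 for the matrix-multiplication dependence matrix D ( the particular row operations below are only one possible choice ):

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

D  = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]   # columns dA, dB, dC

# R1: row1 := row1 + row2,  R2: row1 := row1 + row3  (both have determinant 1)
R1 = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
R2 = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]

T = matmul(R2, R1)            # T = R2 R1
print(T)                      # [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
print(matmul(T, D)[0])        # first row of T D: [1, 1, 1] -- all positive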
• Approach-2:
Given D = [d1, d2 ... dm]
Define P, a ( 2 × 9 ) universal neighbour connection matrix whose columns are the nine possible neighbour directions ( 0, 0 ), ( ±1, 0 ), ( 0, ±1 ), ( ±1, ±1 ) in the X-Y plane,
and let m be the number of vectors in the data dependency matrix D.
Step-1: select a Π such that Π di > 0 for 1 ≤ i ≤ m ( under the condition of … )
Step-2: pick up a ( 9 × m ) matrix K ( under the condition of … )
Step-3: find S by solving the equation S D = P K
Step-4: if det ( T ) ≠ 0 then done; otherwise pick up another Π ( or K ) and repeat
these steps until a valid transformation matrix T is found.
Example: Given
Step-1: pick up Π = [ 1 0 -1 ]
From the equations S D = P K:
t21 - t22 = 0
t21 - t23 = 0
t21 + t22 - 2 t23 = 0
3 t22 - 2 t23 = 1
t31 - t32 = 0
t31 - t33 = 0
t31 + t32 - 2 t33 = 0
3 t32 - 2 t33 = 1
3. Analysis of Parallelism in Computer Algorithms
3.1 Data dependences in non-loop programs
• Data dependence of two computations: one computation cannot start until the other one is completed.
• Parallelism detection: find the sets of computations that can be performed simultaneously.
• Types of data dependencies in non-loop algorithms or programs:
Suppose that Si and Sj are two computations (statements or instructions in a non-loop algorithm or program).
Ii , Ij and Oi , Oj are the inputs and outputs of Si and Sj , respectively.
a) Type-1: data flow dependence ( read after write, denoted by RAW )
•) condition (1) j > i
(2) Ij ∩ Oi ≠ ∅
example-1
:
:
100 R1 ← R2 + R3
:
105 R5 ← R1 + R4
example-2:
:
:
Si : A : = B + C
Sj : D : = A * E + 2
•) notation: ( computation Sj depends on computation Si )
b) Type-2: data antidependence ( write after read, denoted by WAR )
•) condition: (1) j > i
(2) Oj ∩ Ii ≠ ∅
example-1
:
:
200 R1 ← R2 + R3
:
205 R2 ← R2 + R5
example-2:
:
:
Si : A : = B * C
Sj : B : = D + 5
•) notation: ( computation Sj depends on computation Si )
c) Type-3: data output dependence ( write after write, denoted by WAW )
•) condition (1) j > i
(2) Oi ∩ Oj ≠ ∅
example:
:
:
Si : A : = B + C
:
Sj : A : = D * E
•) notation: ( computation Sj depends on computation Si )
• Data dependence graph ( DDG ): represents all data dependencies among statements
(instructions).
Example: find DDG for the following program.
100 R1 ← R3 div R2
101 R4 ← R4 * R5
102 R6 ← R1 + R3
103 R4 ← R2 + R3
104 R2 ← R4 + R2
Solution:
Si, Sj      Dependency              di
100, 101    no                      -
100, 102    type-1                  d1
100, 103    no                      -
100, 104    type-2                  d2
101, 102    no                      -
101, 103    type-3 and type-2       d4 and d5
101, 104    type-1                  d6
102, 103    no                      -
102, 104    no                      -
103, 104    type-1 and type-2       d7 and d8
DDG: nodes 100 to 104, with arcs d1: 100 → 102, d2: 100 → 104, d4 and d5: 101 → 103, d6: 101 → 104, and d7 and d8: 103 → 104.
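The solution table can be reproduced with a small sketch ( the output register and input-register sets below are read off the listing above ):

prog = [(100, "R1", {"R3", "R2"}),
        (101, "R4", {"R4", "R5"}),
        (102, "R6", {"R1", "R3"}),
        (103, "R4", {"R2", "R3"}),
        (104, "R2", {"R4", "R2"})]

for a in range(len(prog)):
    li, oi, ii = prog[a]
    for lj, oj, ij in prog[a + 1:]:
        deps = []
        if oi in ij: deps.append("type-1 (RAW)")   # later instruction reads oi
        if oj in ii: deps.append("type-2 (WAR)")   # later instruction writes an input
        if oi == oj: deps.append("type-3 (WAW)")   # both write the same register
        if deps:
            print(li, lj, ", ".join(deps))
# 100 102 type-1 (RAW)
# 100 104 type-2 (WAR)
# 101 103 type-2 (WAR), type-3 (WAW)
# 101 104 type-1 (RAW)
# 103 104 type-1 (RAW), type-2 (WAR)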
3.2 Data dependences in loops
Consider the case of single loop in the form:
For l = l1 to ln Do l = l1, ln
Begin
S1 S1
S2 or S2
: :
SN SN
End Enddo
A loop can be viewed as a sequence of statements in which the statements ( the body
of the loop ) are repeated n times,
e.g., ( n = 3, N = 2 ): S1(1), S2(1), S1(2), S2(2), S1(3), S2(3).
There are still three types of data dependencies; however, the conditions must be modified. Suppose that Si( li ) and Sj( lj ) are two statements in the loop, where li and lj are the subscript values of the variable shared by Si and Sj, respectively.
a) Type-1: (RAW)
•) condition: (1) li - lj > 0, or li = lj and j > i
(2) Oi ∩ Ij ≠ ∅
•) notation: an arc from Si to Sj labelled with the distance d = li - lj ≥ 0
•) example:
for l : = 1 to 5
begin
S1 : B( l + 1 ) : = A( l ) * C( l + 1 )
S2 : C( l + 4 ) : = B( l ) * A( l + 1 )
end
Si   Sj   shared variable   li - lj = d
S1   S2   B                 ( l + 1 ) - l = 1
S2   S1   C                 ( l + 4 ) - ( l + 1 ) = 3
i.e., S1 → S2 with d = 1, and S2 → S1 with d = 3.
b) Type-2: (WAR)
•) condition: (1) li - lj > 0, or li = lj and j > i
(2) Oj ∩ Ii ≠ ∅
•) notation: an arc from Si to Sj labelled with the distance d = li - lj ≥ 0
c) Type-3: (WAW) ( i ≠ j )
•) condition: (1) li - lj > 0, or li = lj and j > i
(2) Oi ∩ Oj ≠ ∅
•) notation: an arc from Si to Sj labelled with the distance d = li - lj ≥ 0
•) example:
DO l = 1, 4
S1 : A(l) : = B(l) + C(l)
S2 : D(l) : = A( l - 1) + C(l)
S3 : E(l) : = A( l - 1) + D( l – 2)
ENDDO
Si   Sj   shared variable   li - lj = d          Type     di
S1   S1   -                 -                    no       -
S1   S2   A                 l - ( l - 1 ) = 1    type-1   d1
S1   S3   A                 l - ( l - 1 ) = 1    type-1   d2
S2   S1   A                 ( l - 1 ) - l = -1   no       -
S2   S2   -                 -                    no       -
S2   S3   D                 l - ( l - 2 ) = 2    type-1   d3
S3   S1   A                 ( l - 1 ) - l = -1   no       -
S3   S2   D                 ( l - 2 ) - l = -2   no       -
S3   S3   -                 -                    no       -
DDG ( loop unrolled for l = 1, ..., 4 ): d1 arcs S1( l ) → S2( l + 1 ), d2 arcs S1( l ) → S3( l + 1 ), and d3 arcs S2( l ) → S3( l + 2 ).
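A small sketch that recovers the type-1 distances in the table above ( each access is encoded here as an ( array, offset from l ) pair, e.g. A( l - 1 ) becomes ("A", -1); only flow dependences are checked ):

stmts = {
    "S1": (("A", 0), [("B", 0), ("C", 0)]),   # (output access, input accesses)
    "S2": (("D", 0), [("A", -1), ("C", 0)]),
    "S3": (("E", 0), [("A", -1), ("D", -2)]),
}
for si, ((wa, woff), _) in stmts.items():
    for sj, (_, reads) in stmts.items():
        for ra, roff in reads:
            if ra == wa:
                d = woff - roff               # writer subscript minus reader subscript
                if d > 0 or (d == 0 and si < sj):
                    print(f"{si} -> {sj}: type-1 on {ra}, d = {d}")
# S1 -> S2: type-1 on A, d = 1
# S1 -> S3: type-1 on A, d = 1
# S2 -> S3: type-1 on D, d = 2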
3.4. Techniques for removing some Data Dependencies
• Variable renaming.
S1 : A : = B * C                   S1 : A : = B * C
S2 : D : = A + 1     ======>       S2 : D : = A + 1
S3 : A : = A * D                   S3 : AA : = A * D
Data dependencies: before renaming there are flow dependencies S1 → S2, S1 → S3, and S2 → S3, an antidependence S2 → S3 ( S3 rewrites A, which S2 reads ), and an output dependence S1 → S3 ( both write A );
after renaming, only the flow dependencies remain.
• Node splitting ( introducing a new variable ).
DO l = 1, N                               DO l = 1, N
                                            S0 : AA(l) : = A( l + 1 )
S1 : A(l) : = B(l) + C(l)     ======>       S1 : A(l) : = B(l) + C(l)
S2 : D(l) : = A(l) + 2                      S2 : D(l) : = A(l) + 2
S3 : F(l) : = D(l) + A( l + 1 )             S3 : F(l) : = D(l) + AA(l)
ENDDO                                     ENDDO
Data dependencies: before splitting, S1 → S2 and S2 → S3 are flow dependencies with d = 0, and the read of A( l + 1 ) in S3 creates an antidependence S3 → S1 with d = 1;
after splitting, that antidependence is carried by the new statement S0 ( S0 → S1, d = 1 ), and the A data reach S3 through a flow dependence S0 → S3 with d = 0.
• Variable substitution.
S1 : X : = A + B                   S1 : X : = A + B
S2 : Y : = C * D     ======>       S2 : Y : = C * D
S3 : Z : = E * X + F * Y           S3 : Z : = E * ( A + B ) + F * ( C * D )
Data dependencies: before substitution, S3 has flow dependencies on S1 ( X ) and S2 ( Y );
after substitution, S3 no longer depends on S1 or S2, so the three statements can be executed in parallel.
4. Parallel Computer Architectures
4.1 Review of Computer Architecture
a. Basic Components of a Computer System
- CPU: includes ALU and Control Unit
- Memory: cache memory, main memory, secondary memory
- I/O: DMA, I/O interface, and I/O devices
- Bus: a set of communication pathways ( data, address, and control signals ) connecting two or more components
- Single Processor Configurations
- Multiple bus connected computer systems.
Figure 4.1: two multiple-bus configurations, each with a separate I/O bus and memory bus.
- Single bus connected systems
Figure 4.2: a single-bus system; the arbiter and the BBSY, BR, and BG control lines are attached to the system bus.
Note:
- At any time only one unit can control the bus, even though two or more units may request the bus at the same time.
- The arbiter is used to select which unit gets to use the bus next.
- BR (bus request), BG (bus grant), SACK (Select acknowledgement), and BBSY (bus busy) are the most important control signals used by the arbiter.
- The timing diagram of these control signals ( BBSY, BR, BG, SACK ) is shown in Figure 4.3.
- Central Processing Unit (CPU)
- ALU : performs all logic and arithmetic operations and has some registers which store operands and results.
- Control Unit: produces all control signals such that instructions can be fetched and executed one by one. The control unit can be implemented by logic circuits or by microprogramming.
An instruction cycle includes several smaller function cycles, e.g.,
1. instruction fetch cycle.
2. indirect (operand fetch) cycle.
3. execution cycle.
d. Memory