PARALLEL PROCESSING
CS802
Part I
Instructor: Dr. C. N. Zhang
Department of Computer Science
University of Regina
Regina, Saskatchewan
Canada, S4S 0A2
Table of Contents
- Introduction to Parallel Processing
- Pipeline
- Design of pipeline
- RISC instruction pipeline
- Mapping nested loop algorithms into systolic arrays
- Parallelism Analysis
- Data dependency analysis for non-loop programs
- Data dependency analysis for loops
- Parallel Computer Architectures
- Review of computer architecture
- Computer classification
- SIMD Computers
- MIMD Computers
- Parallel Programming
- Parallelism for single loop algorithms
- Parallelism for nested loop algorithms
- Introduction to MultiPascal (Part II)
- Examples of Parallel Processing
- Intel MMX technology (Part III)
- Cluster computing (Part IV)
- Task scheduling in message passing MIMD
- Associative processing
- Data flow computing
- Introduction to Parallel Processing
- Parallel Processing: two or more tasks being performed (executed) simultaneously.
- Basic techniques for parallel processing
a. pipeline
b. replicated units
c. pipeline plus replicated units
- Parallel processing can be implemented at the following levels
a. at processor level
b. at instruction execution level
c. at arithmetic and logic operation level
d. at logic circuit level
Speed-up of the parallel processing: S = TS / TP,
where TS is the processing time required in a sequential (non-parallel) manner and TP is the processing time required by the parallel processing technique.
Efficiency of the parallel processing: E = S / n,
where n is the number of units used to perform the parallel processing.
Note: 0 < S ≤ n and 0 < E ≤ 1
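A minimal Python sketch of these two definitions ( the numbers in the usage line are made up purely for illustration ):

# Speed-up and efficiency of a parallel implementation.
def speedup(Ts, Tp):
    return Ts / Tp              # S = TS / TP

def efficiency(Ts, Tp, n):
    return speedup(Ts, Tp) / n  # E = S / n

# e.g. a job of 100 t done in 25 t on 5 units: S = 4.0, E = 0.8
print(speedup(100, 25), efficiency(100, 25, 5))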
- How To Improve Performance of Computer Systems?
- by electronic technology, e.g., new VLSI development
- by software development, e.g., better OS and compiler
- by architecture enhancements
- Architecture Enhancements
- pipeline
- parallelism
- memory hierarchy
- multiple processor computer systems
- hardware supports for OS and compiler
- special purpose architectures
Example: compute the column sums Yj = a1j + a2j + ... + anj, j = 1, 2, ..., m,
where the aijs are integers, i = 1, 2, ..., n, j = 1, 2, ..., m.
Assume that the time delay of the Adder is t. Sequential processing: Use one Adder.
TS = ( n-1 ) m t
Parallel Processing:
- use two adders ( each adder handles half of the m columns ):
Tp = ( n-1 ) ( m/2 ) t
S = 2
E = 1
- use m adders:
Tp = ( n-1 ) t
S = m
E = 1
- use n-1 adders ( n = 2^k ) arranged as a pipelined adder tree
e.g., n = 8: Tp = ( log2 n + m - 1 ) t
Figure 1.1: a tree of n - 1 = 7 adders; the m input columns ( a1j a2j ... a8j, j = 1, ..., m ) enter the tree one column per clock period.
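The pipelined adder-tree timing can be checked with a small cycle-level sketch, assuming one new column of operands enters the tree every clock period t; for n = 8 and m = 10 it reports log2 8 + 10 - 1 = 12 clock periods:

from math import log2

def pipelined_column_sums(columns):
    # n = 2**k operands per column, log2(n) adder stages, one new column per clock
    stages = int(log2(len(columns[0])))
    pending = list(columns)
    in_flight = []                       # (partial sums, stages still to pass)
    sums, clock = [], 0
    while pending or in_flight:
        clock += 1
        if pending:                      # a new column enters the first stage
            in_flight.append((pending.pop(0), stages))
        nxt = []
        for vals, left in in_flight:     # every column passes one adder stage
            vals = [vals[2 * i] + vals[2 * i + 1] for i in range(len(vals) // 2)]
            if left == 1:
                sums.append(vals[0])     # final sum leaves the tree this clock
            else:
                nxt.append((vals, left - 1))
        in_flight = nxt
    return sums, clock

print(pipelined_column_sums([[1] * 8 for _ in range(10)]))
# ([8, 8, 8, 8, 8, 8, 8, 8, 8, 8], 12)  ->  Tp = ( log2 8 + 10 - 1 ) t = 12 t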
2. Pipeline
2.1 Design of Pipeline
- The task must be divisible into a number of subtasks performed in sequence; ideally, all subtasks require the same amount of time.
- Design an independent hardware unit for each subtask.
- Connect all units in order.
- Arrange the input data and collect the output data.
First pipeline design ( n = 4 ):
a. Subtask1: S1 = a1j + a2j
Subtask2: S2 = S1 + a3j
Subtask3: Yj = S2 + a4j
b. One adder unit per subtask: unit-1 computes S1 from ( a1j, a2j ); unit-2 computes S2 from ( S1, a3j ); unit-3 computes Yj from ( S2, a4j ).
c. The three units are connected in a linear chain: ( a1j, a2j ) → unit-1 → ( S1, a3j ) → unit-2 → ( S2, a4j ) → unit-3 → Yj.
d. Input/output scheduling ( m = 4 ): the a1j and a2j streams enter unit-1 directly, the a3j stream is delayed by one clock and the a4j stream by two clocks ( zero-padded ), so that each operand meets its partial sum; the results Y1, Y2, Y3, Y4 leave unit-3 one per clock.
Tp = ( n + m - 2 ) t
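The schedule of this first design can be tabulated with a small sketch, assuming unit u works on column j at clock j + u - 1 ( which is how the skew in part d arranges the data ):

# Tabulate which (unit, column) pair is active at each clock of the chained design.
def chain_schedule(n, m):
    units = n - 1
    schedule = {}                            # clock -> list of (unit, column)
    for j in range(1, m + 1):
        for u in range(1, units + 1):
            schedule.setdefault(j + u - 1, []).append((u, j))
    return schedule

for clock, work in sorted(chain_schedule(4, 4).items()):
    print(clock, work)
# clock 1: [(1, 1)]  ...  clock 6: [(3, 4)]  ->  Tp = ( 4 + 4 - 2 ) t = 6 t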
Second pipeline design:
a. Subtask1: S1 = a1j + a2j and S2 = a3j + a4j
Subtask2: Yj = S1 + S2
b. Unit for subtask1: two adders computing S1 from ( a1j, a2j ) and S2 from ( a3j, a4j ) in parallel; unit for subtask2: one adder computing Yj from ( S1, S2 ).
c. The units are connected as a two-level adder tree: ( a1j, a2j, a3j, a4j ) → unit-1 → ( S1, S2 ) → unit-2 → Yj.
d. Input/output scheduling ( m = 4 ): the four columns ( a1j, a2j, a3j, a4j ), j = 1, ..., 4, enter the tree one column per clock, and the results Y1, Y2, Y3, Y4 leave one per clock.
Tp = ( m -1 + log n ) t
2.2 RISC instruction pipeline
2.2.1 Features of RISC
• Register-oriented organization
a) registers are the fastest available storage devices
b) saves memory access time for most data-passing operations
c) VLSI makes it possible to build a large number of registers on a single chip
• Few instructions:
a) there are two groups of instructions: register to register, memory to register or register to memory
b) complex instructions are done by software
c) fixed size instruction format
• Simple addressing modes
a) simplify memory address calculation
b) make it easy to pipeline
• Use register window technique
a) optimizing compilers with fewer instructions
b) based on large register file
• CPU on a single VLSI chip
a) avoid long time delay of off-chip signals
b) regular ALU and simple control unit
• Pipelined instruction execution
a) two-, three-, or four-stage pipelines
b) delayed-branch and data forward
2.2.2 Two types of instructions in RISC
1) Register-to-register, each of which can be done by the following two phases:
I : Instruction fetch
E: Instruction execution
2) Memory-to-register ( load ) and register-to-memory ( store ), each of which can be
done by the following three phases:
I: Instruction fetch
E: Memory address calculation
D: Memory operation
• Pipeline: a task or operation is divided into a number of subtasks that need to be performed in sequence, each of which requires the same amount of time.
Speed-up S = Ts / Tp, where Ts is the total time required for a job performed without the pipeline
and Tp is the time required by the pipeline.
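For an ideal, conflict-free pipeline this gives the familiar expression S = n k / ( k + n - 1 ) for n instructions on k equal stages; a minimal sketch ( an idealization — the RISC examples below include conflicts ):

# Ideal pipeline speed-up: n instructions, k stages of one clock each, no conflicts.
def pipeline_speedup(n, k):
    Ts = n * k              # non-pipelined: every instruction takes k clocks
    Tp = k + (n - 1)        # pipelined: first result after k clocks, then one per clock
    return Ts / Tp

print(pipeline_speedup(7, 2))    # e.g. 7 instructions on a 2-stage pipeline -> 1.75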
• Pipeline conflict: two subtasks on two pipeline units cannot be computed simultaneously; one task must wait until the other task is completed.
• To resolve pipeline conflicts in the instruction pipeline, the following techniques are used:
a. insert a no-operation stage ( N ), used by the two-way pipeline.
b. insert a no-operation instruction ( NOOP, phases I E ), used by the three-way and four-way pipelines.
• Two-way pipeline: there is only one memory in the computer, which stores both
instructions and data. Phases I and E can be overlapped.
Example:
Load A ← M
Load B ← M
Add C ← A + B
Store M ← C
Branch X
X: Add B ← C + A
Halt
Figure 2.1 Sequential execution
Figure 2.2 Two-way pipelining: the same program executed with the I and E phases of consecutive instructions overlapped ( N stages inserted where conflicts occur ).
Speed-up = 17 / 11 = 1.54
where N represents the no-operation ( waiting ) stage inserted by the OS.
• Three-way pipeline: there are two memory units; one stores instructions, the other stores
data. Phases I, E, and D can be overlapped. The NOOP instruction ( two stages: I E ) is
inserted when a conflict would otherwise occur.
Example:
Load A ← M
Load B ← M
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
X: Add B ← C + A
Halt
Figure 2.3 Three-way pipelining
Speed-up = 17 / 10 = 1.7
where NOOP represents the no-operation instruction ( waiting stage ) inserted by the OS.
• Four-way pipeline: same as three-way, but phase E is divided into:
E1: register file read
E2: ALU operation and register write
Phases of I, D, E1, and E2 can be overlapped.
Example:
Load A ← M
Load B ← M
NOOP
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
NOOP
X: Add B ← C + A
Halt
Figure 2.4 Four-way pipelining
Speed-up = 24 / 13 = 1.85
• Optimization of pipelining.
The goal is to reduce the number of NOOP instructions by reordering the sequence of the
program.
Example: ( three-way pipeline )
Inserted NOOP:                      Reversed instructions:
100 Load A ← M                      100 Load A ← M
101 Add B ← B + 1                   101 Jump 106
102 Jump 106                        102 Add B ← B + 1
103 NOOP                            106 Store M ← B
106 Store M ← B
Use of the delayed branch: the Add instruction is moved into the branch delay slot, so the NOOP is no longer needed.
Figure 2.5
2.3 Mapping nested loop algorithms into systolic arrays
2.3.1 Systolic Array Processors
• It consists of:
(i) Host computer: controls the whole processing.
(ii) Control unit: provides the system clock and control signals, feeds input data to the systolic array,
and collects results from the systolic array. Only boundary PEs are connected to the
control unit.
(iii) Systolic array: a network of multiple processors combined with pipelining.
Possible types of systolic arrays:
(a) Linear array:
Figure 2.6
(b) Mesh connected 2-D array:
Figure 2.7
(c) 2-D hexagonal array
Figure 2.8
(d) Binary tree
Figure 2.9
(e) Triangular array
Figure 2.10
• Features of systolic arrays:
(i) PEs are simple, usually contain a few registers and simple ALU circuits
(ii) There are a few types of PEs
(iii) Interconnections are local and regular
(iv) All PEs act rhythmically, controlled by a common clock
(v) The input data are used repeatedly as they pass through the PEs
Conclusion:- systolic arrays are very suitable for VLSI design
- systolic arrays are recommended for computation-intensive algorithms
Example-1: matrix-vector multiplication Y = A X,
where A is an ( n × n ) matrix and Y and X are ( n × 1 ) matrices.
Figure 2.11: a linear systolic array for matrix-vector multiplication ( n = 4 ); the aij inputs enter in skewed order ( a11 at t0; a21, a12 at t1; a31, a22, a13 at t2; a41, a32, a23, a14 at t3; ...; a44 at t6 ) and pass the PEs that hold x1, x2, x3, x4.
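A behavioural sketch of one common linear systolic schedule for Y = A X, assuming each PE j holds xj and row i of A reaches PE j at clock i + j ( consistent with the skew in Figure 2.11 ):

def systolic_matrix_vector(A, x):
    n = len(x)
    y = [0] * n
    for t in range(2 * n - 1):          # 2n - 1 clocks in total ( t0 ... t6 for n = 4 )
        for j in range(n):              # every PE works in parallel within one clock
            i = t - j                   # row of A handled by PE j at clock t
            if 0 <= i < n:
                y[i] += A[i][j] * x[j]  # PE j adds a(i+1, j+1) * x(j+1)
    return y

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(systolic_matrix_vector(A, [1, 1, 1, 1]))   # [10, 26, 42, 58]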
Example-2: matrix-matrix multiplication C = A × B, where A, B, and C are ( n × n ) matrices.
Figure 2.12: a two-dimensional ( hexagonal ) systolic array for matrix multiplication. The skewed A and B input streams enter the array from two directions while the C results move along a third direction; the data meet at the PEs along successive time fronts.
2.3.2 Mapping Nested Loop Algorithms onto Systolic Arrays
• Some design objective parameters:
(i) Number of PEs ( np ): total number of PEs required.
(ii) Computation time ( tc ): total number of clocks required for the systolic array to
complete a computation.
(iii) Pipeline period ( p ): the time period of pipelining, i.e., the number of clocks between two successive computations started in the same PE.
(iv) Space-time² ( np tc² ): used to measure cost and performance, assuming that speed is more important than cost.
• In general, a mapping procedure involves the following steps:
Step 1: algorithm normalization (all variables should be with all indices)
Step 2: find the data dependence matrix of the algorithm called D
Step 3: pick up a valid ( l × l ) transformation T ( suppose the algorithm is an l-nested loop )
Step 4: PE design and systolic array design
Step 5: systolic array I/O scheduling
Step 1: Algorithm normalization
Some variables of a given nested-loop algorithm may not carry all the loop indices; these are called broadcasting variables. In order to find the data dependence matrix of the algorithm, it is necessary to eliminate all broadcasting variables by renaming or by adding more variables.
Example: matrix-matrix multiplication C = A × B, which can be described by the following nested algorithm:
for i : = 1 to n
for j : = 1 to n
for k : = 1 to n
C( i, j ) : = C( i, j ) + A( i, k ) * B( k, j );
(suppose all C( i, j ) = 0 before the computation)
Where A, B, and C are broadcasting variables.
Algorithm normalization:
for i : = 1 to n
for j : = 1 to n
for k : = 1 to n
begin
A( i, j, k ) : = A( i, j-1, k );
B( i, j, k ) : = B( i-1, j, k );
C( i, j, k ) : = C( i, j, k-1 ) + A( i, j, k ) * B( i, j, k );
end
where A( i, 0, k ) = aik, B( 0, j, k ) = bkj, C( i, j, 0 ) = 0, and
C( i, j, n ) = cij
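A direct Python sketch of the normalized algorithm ( dictionary keys play the role of the indexed variables; the boundary assumptions A( i, 0, k ) = aik, B( 0, j, k ) = bkj, C( i, j, 0 ) = 0 are applied as stated above ):

def normalized_matmul(a, b):
    n = len(a)
    A, B, C = {}, {}, {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                A.setdefault((i, 0, k), a[i - 1][k - 1])   # A( i, 0, k ) = aik
                B.setdefault((0, j, k), b[k - 1][j - 1])   # B( 0, j, k ) = bkj
                C.setdefault((i, j, 0), 0)                 # C( i, j, 0 ) = 0
                A[(i, j, k)] = A[(i, j - 1, k)]            # propagate A along j
                B[(i, j, k)] = B[(i - 1, j, k)]            # propagate B along i
                C[(i, j, k)] = C[(i, j, k - 1)] + A[(i, j, k)] * B[(i, j, k)]
    return [[C[(i, j, n)] for j in range(1, n + 1)] for i in range(1, n + 1)]

print(normalized_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]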
Step2: Data dependence matrix
In a normalized algorithm, variables can be classified as
(i) generated variables: those variables appearing on the left side of the assignment statements
(ii) used variables: those variables appearing on the right side of the assignment statements
Consider a normalized algorithm:
for i1 : = l1 to n1
for i2 : = l2 to n2
:
:
for il : = ll to nl
begin
S1
S2
:
:
Sm
end
Let A( i11, i21, ..., il1 ) and B( i12, i22, ..., il2 ) be a pair of generated variable and used variable ( of the same array ) in a statement.
The vector ( i11 - i12, i21 - i22, ..., il1 - il2 )T is called a data dependence vector, denoted by dj, with respect to the
pair A( i11, i21, ..., il1 ) and B( i12, i22, ..., il2 ).
The data dependence matrix consists of all possible data dependence vectors of an algorithm.
Example: In the above example ( matrix-matrix multiplication ), the pairs A( i, j, k ) / A( i, j-1, k ), B( i, j, k ) / B( i-1, j, k ), and C( i, j, k ) / C( i, j, k-1 ) give
dA = ( 0, 1, 0 )T, dB = ( 1, 0, 0 )T, dC = ( 0, 0, 1 )T, so D = [ dA dB dC ].
Step3: Find a valid linear transformation matrix T:
Suppose l = 3 ( i1 = i, i2 = j, i3 = k )
where Π is a ( 1 × 3 ) matrix and S is a ( 2 × 3 ) matrix, i.e., T = [ Π ; S ].
T is a valid transformation if it meets the following conditions:
(a) det ( T ) ≠ 0
(b) Π di > 0 for all i
In fact, T can be viewed as a linear mapping which maps an index point ( i, j, k ) into ( t, x, y )T = T ( i, j, k )T.
Π is called the time transformation, S is called the space transformation.
det ( T ) ≠ 0 implies that the mapping is one to one.
A computation originally performed at ( i, j, k ) is now computed at the PE
located at ( x, y ) at time t, where t and ( x, y ) are determined as shown above.
Note that for a given algorithm there may exist a large number of valid
transformations. An important goal of mapping is to choose an "optimal" one,
as defined by the user; for example, one that reaches the minimum value of np tc².
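A minimal sketch of the validity test and the index-point mapping ( D below is the dependence matrix found in Step 2; the particular T is only an illustration, not necessarily the transformation used in the design that follows ):

def det3(M):
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_valid(T, D):
    Pi = T[0]                                   # first row = time transformation
    if det3(T) == 0:                            # condition (a): det(T) != 0
        return False
    for dep in zip(*D):                         # columns of D = dependence vectors
        if sum(p * x for p, x in zip(Pi, dep)) <= 0:
            return False                        # condition (b): Pi . di > 0
    return True

D = [[0, 1, 0],       # columns: dA, dB, dC of the normalized matrix multiplication
     [1, 0, 0],
     [0, 0, 1]]
T = [[1, 1, 1],       # Pi (time transformation)
     [0, 1, 0],       # S  (space transformation, first row)
     [0, 0, 1]]       # S  (space transformation, second row)
print(is_valid(T, D))                           # True

def apply(T, p):                                # map (i, j, k) to (t, x, y)
    return tuple(sum(r * q for r, q in zip(row, p)) for row in T)
print(apply(T, (2, 3, 1)))                      # (6, 3, 1): time 6, PE (3, 1)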
Step 4.1: PE design, including:
(a) arithmetic unit design according to the computations required in the body of the loop
(b) inputs and outputs of the PE and their directions.
The number of inputs and outputs is always the same as the number of vectors in the matrix D. The direction of each pair of ( input, output ) is determined by the mapped vector S dj drawn in the X-Y plane.
Step 4.2: Systolic array design includes:
(a) Locations of all PEs, which are obtained by the equation ( x, y )T = S ( i, j, k )T, where ( i, j, k ) ranges over
all possible index points in the given algorithm. For
instance, in the above example, suppose n = 3; then the set of all possible index points is
J = { ( 1, 1, 1 ), ( 1, 1, 2 ), ... , ( 3, 3, 3 ) }. All possible different values of the pair ( x, y ) form
the locations of the PEs.
(b) After having located all PEs in the X-Y plane, interconnections between PEs
are obtained according to the data path of each pair of input and output
determined in the PE design stage.
Example:
Consider the matrix-matrix multiplication again, with the D and the valid transformation T = [ Π ; S ] chosen above.
• PE design:
According to the body of the algorithm, each PE requires one adder and one multiplier.
Since D has three dependence vectors,
there are three pairs of inputs and outputs: ( ain, aout ), ( bin, bout ), and ( cin, cout ).
Figure 2.13: the PE cell. In every clock period the PE performs
aout : = ain
bout : = bin
c : = c + ain * bin
with the a data moving in the Y direction, the b data moving in the X direction, and c held in a local register.
The mapped vector of the C dependence ( S dC ) indicates that the variable C will stay in the PE register.
• More detailed design of the PE:
Figure 2.14
• Array design (n = 3)
According to ( x, y )T = S ( i, j, k )T, we have the following PE location table, where
1 ≤ i ≤ 3, 1 ≤ j ≤ 3, and 1 ≤ k ≤ 3.
Table 2.1
All possible PE locations are (-1, 1), (-1, 2), (-1, 3), (-2, 1), (-2, 2), (-2, 3), (-3, 1), (-3, 2), and (-3, 3).
Place all PEs on the X-Y plane according to those locations and connect them
according to the data paths determined in the PE design. Thus, we have the
following array ( np = 9 ):
Figure 2.15: the nine PEs placed in the X-Y plane at the locations listed above and connected according to the a, b, and c data paths.
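A small enumeration sketch, assuming S = [ -1 0 0 ; 0 1 0 ] ( a space transformation consistent with the PE locations ( -i, j ) listed above ) and the D of Step 2; it reproduces np = 9 and the a, b, c data-path directions of Figure 2.13:

S = [(-1, 0, 0), (0, 1, 0)]                        # assumed space transformation
D = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]              # dA, dB, dC
n = 3

def smap(v):                                       # (x, y) = S v
    return tuple(sum(s * q for s, q in zip(row, v)) for row in S)

locations = {smap((i, j, k)) for i in range(1, n + 1)
                             for j in range(1, n + 1)
                             for k in range(1, n + 1)}
print("np =", len(locations), sorted(locations))   # 9 PEs at (-i, j)
print([smap(d) for d in D])                        # [(0, 1), (-1, 0), (0, 0)]:
                                                   # a moves in +Y, b in -X, c stays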
Step5: I/O Scheduling: determine locations of all input and output data.
Computation time: tc = tmax - tmin + 1
In the above example, we have tmax = 9, tmin = 3, tc = 9 – 3 + 1 =7.
Reference time tref = tmin – 1 = 2.
Input data are: a11 a12 a13 a21 a22 a23 a31 a32 a33
b11 b12 b13 b21 b22 b23 b31 b32 b33
c11 c12 c13 c21 c22 c23 c31 c32 c33
Output data are: c11 c12 c13 c21 c22 c23 c31 c32 c33
Location of a11: according to the assumption of the normalized algorithm,
a11 = a( 1, 0, 1 ).
Since t = tref = 2, a11 should be placed
directly at its input position x = -1 and y = 0.
Location of a12: a12 = a( 1, 0, 2 ), so t = 3 > tref = 2; therefore, a12 is placed at x = -1
and y = -1 ( shifted back by one position along the direction of the "a" data path ).
Location of c11: c11 = c( 1, 1, 0 ),
therefore, c11 is located in the PE at x = -1 and y = 1.
Continue this process. Finally, we have the following I/O-scheduled systolic design:
Figure 2.16: the complete I/O-scheduled design; the b data enter as skewed streams b11 b21 b31 / b12 b22 b32 / b13 b23 b33 ( padded with leading zeros ), and the a data enter as skewed streams a11 / a21 a12 / a31 a22 a13 / a32 a23 / a33 ( padded with zeros ).
2.3.3 Approaches To Find Valid Transformations
For a given algorithm, there may exist a large number of valid transformations.
An important task in the mapping problem is to find an efficient, systematic approach to search all possible solutions and find an optimal one. In general, we hope to find a valid transformation such that:
(i) computation time is minimized
(ii) total number of PEs is minimized
(iii) pipeline period is 1
As shown in the above example, all these parameters ( tc, np, p ) can be calculated
after all the design steps are completed. In the following, we give simple
formulae for those design parameters without proof.
Suppose the normalized algorithm has the form:
for i : = 1 to n
for j : = 1 to n
for k : = 1 to n
begin
S1
S2
:
:
Sm
end
The valid transformation is T = [ Π ; S ], with Π = ( t11, t12, t13 ).
• Computation time ( tc ): tc = ( l1 - 1 ) | t11 | + ( l2 - 1 ) | t12 | + ( l3 - 1 ) | t13 | + 1,
where l1, l2, and l3 are the lengths of the three loops ( here l1 = l2 = l3 = n ).
• Number of PEs ( np ):
Tij is the ( i, j )-cofactor of matrix T.
• Pipeline period ( p ):
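A brute-force cross-check of the computation-time formula for the matrix-multiplication example, assuming Π = ( 1, 1, 1 ) ( consistent with tmin = 3 and tmax = 9 in Step 5 ):

n = 3
Pi = (1, 1, 1)                                   # assumed time transformation
times = [i * Pi[0] + j * Pi[1] + k * Pi[2]
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         for k in range(1, n + 1)]
tc_brute = max(times) - min(times) + 1           # tc = tmax - tmin + 1
tc_formula = (n - 1) * (abs(Pi[0]) + abs(Pi[1]) + abs(Pi[2])) + 1
print(tc_brute, tc_formula)                      # 7 7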
• Approach-1:
Three basic row operations:
(1) rowi <--> rowj
(2) rowi : = rowi + K rowj , i ≠ j, K is an integer
(3) rowi : = ( -1 ) rowi
Each row operation can be done by multiplying an elemental matrix by the given matrix.
Suppose that for a given matrix D, after applying basic row operations ( R1, R2, ..., Rn ), we obtain a matrix in which all elements of the first row are positive integers. We conclude that the matrix T = Rn ... R2 R1 is a valid transformation.
In fact, it is easy to verify:
(i) T = Rn ... R2 R1, so det ( T ) = det ( Rn ) ... det ( R1 ).
Since | det ( Ri ) | = 1 for every elemental row operation, | det ( T ) | = 1 and hence det ( T ) ≠ 0.
(ii) Since all elements of the first row of T D are greater than 0, Π di > 0 for all i.
Besides, according to the formula for p, we have p = 1.
Example: we use the same example again.
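A worked sketch of Approach-1 for the matrix-multiplication dependence matrix D ( the particular row operations below are only one possible choice ):

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

D  = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]   # columns dA, dB, dC

# R1: row1 := row1 + row2,  R2: row1 := row1 + row3  (both have determinant 1)
R1 = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
R2 = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]

T = matmul(R2, R1)            # T = R2 R1
print(T)                      # [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
print(matmul(T, D)[0])        # first row of T D: [1, 1, 1] -- all positive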
• Approach-2:
Given D = [d1, d2 ... dm]
Define P, a ( 2 × 9 ) universal neighbour connection matrix whose columns are the nine possible neighbour directions ( 0, 0 ), ( ±1, 0 ), ( 0, ±1 ), ( ±1, ±1 ) in the X-Y plane,
and let m be the number of vectors in the data dependency matrix D.
Step-1: select a Π such that Π di > 0 for 1 ≤ i ≤ m ( under the condition of … )
Step-2: pick up a ( 9 × m ) matrix K ( under the condition of … )
Step-3: find S by solving the equation S D = P K
Step-4: if det ( T ) ≠ 0 then done; otherwise pick up another Π ( or K ) and repeat
these steps until a valid transformation matrix T is found.
Example: Given
Step-1: pick up Π = [ 1 0 -1 ]
From the equations S D = P K:
t21 - t22 = 0
t21 - t23 = 0
t21 + t22 - 2 t23 = 0
3 t22 - 2 t23 = 1
t31 - t32 = 0
t31 - t33 = 0
t31 + t32 - 2 t33 = 0
3 t32 - 2 t33 = 1
3. Analysis of Parallelism in Computer Algorithms
3.1 Data dependences in non-loop programs
• Data dependence of two computations: one computation cannot start until the other one is completed.
• Parallelism detection: find the sets of computations that can be performed simultaneously.
• Types of data dependencies in non-loop algorithms or programs:
Suppose that Si and Sj are two computations (statements or instructions in a non-loop algorithm or program).
Ii , Ij and Oi , Oj are the inputs and outputs of Si and Sj , respectively.
a) Type-1: data flow dependence ( read after write, denoted by RAW )
•) condition (1) j > i
(2) Ij ∩ Oi ≠ ∅
example-1
:
:
100 R1 ← R2 + R3
:
105 R5 ← R1 + R4
example-2:
:
:
Si : A : = B + C
Sj : D : = A * E + 2
•) notation: ( computation Sj depends on computation Si )
b) Type-2: data antidependence ( write after read, denoted by WAR )
•) condition: (1) j > i
(2) Oj ∩ Ii ≠ ∅
example-1
:
:
200 R1 ← R2 + R3
:
205 R2 ← R2 + R5
example-2:
:
:
Si : A : = B * C
Sj : B : = D + 5
•) notation: ( computation Sj depends on computation Si )
c) Type-3: data output dependence ( write after write, denoted by WAW )
•) condition (1) j > i
(2) Oi ∩ Oj ≠ ∅
example:
:
:
Si : A : = B + C
:
Sj : A : = D * E
•) notation: ( computation Sj depends on computation Si )
• Data dependence graph ( DDG ): represents all data dependencies among statements
(instructions).
Example: find DDG for the following program.
100 R1 ← R3 div R2
101 R4 ← R4 * R5
102 R6 ← R1 + R3
103 R4 ← R2 + R3
104 R2 ← R4 + R2
Solution:
Si, Sj      Dependency              di
100, 101    no                      -
100, 102    type-1                  d1
100, 103    no                      -
100, 104    type-2                  d2
101, 102    no                      -
101, 103    type-3 and type-2       d4 and d5
101, 104    type-1                  d6
102, 103    no                      -
102, 104    no                      -
103, 104    type-1 and type-2       d7 and d8
DDG: nodes 100 to 104, with arcs d1: 100 → 102, d2: 100 → 104, d4 and d5: 101 → 103, d6: 101 → 104, and d7 and d8: 103 → 104.
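The solution table can be reproduced with a small sketch ( the output register and input-register sets below are read off the listing above ):

prog = [(100, "R1", {"R3", "R2"}),
        (101, "R4", {"R4", "R5"}),
        (102, "R6", {"R1", "R3"}),
        (103, "R4", {"R2", "R3"}),
        (104, "R2", {"R4", "R2"})]

for a in range(len(prog)):
    li, oi, ii = prog[a]
    for lj, oj, ij in prog[a + 1:]:
        deps = []
        if oi in ij: deps.append("type-1 (RAW)")   # later instruction reads oi
        if oj in ii: deps.append("type-2 (WAR)")   # later instruction writes an input
        if oi == oj: deps.append("type-3 (WAW)")   # both write the same register
        if deps:
            print(li, lj, ", ".join(deps))
# 100 102 type-1 (RAW)
# 100 104 type-2 (WAR)
# 101 103 type-2 (WAR), type-3 (WAW)
# 101 104 type-1 (RAW)
# 103 104 type-1 (RAW), type-2 (WAR)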
3.2 Data dependences in loops
Consider the case of single loop in the form:
For l = l1 to ln Do l = l1, ln
Begin
S1 S1
S2 or S2
: :
SN SN
End Enddo
A loop can be viewed as a sequence of statements in which the statements ( the body
of the loop ) are repeated n times,
e.g., ( n = 3, N = 2 ): S1(1), S2(1), S1(2), S2(2), S1(3), S2(3).
There are still three types of data dependencies; however, the conditions must be modified. Suppose that Si( li ) and Sj( lj ) are two statements in the loop, where li and lj are the subscript values of the variable shared by Si and Sj, respectively.
a) Type-1: (RAW)
•) condition: (1) li - lj > 0, or li = lj and j > i
(2) Oi ∩ Ij ≠ ∅
•) notation: an arc from Si to Sj labelled with the distance d = li - lj ≥ 0
•) example:
for l : = 1 to 5
begin
S1 : B( l + 1 ) : = A( l ) * C( l + 1 )
S2 : C( l + 4 ) : = B( l ) * A( l + 1 )
end
Si   Sj   shared variable   li - lj = d
S1   S2   B                 ( l + 1 ) - l = 1
S2   S1   C                 ( l + 4 ) - ( l + 1 ) = 3
i.e., S1 → S2 with d = 1, and S2 → S1 with d = 3.
b) Type-2: (WAR)
•) condition: (1) li - lj > 0, or li = lj and j > i
(2) Oj ∩ Ii ≠ ∅
•) notation: an arc from Si to Sj labelled with the distance d = li - lj ≥ 0
c) Type-3: (WAW) ( i ≠ j )
•) condition: (1) li - lj > 0, or li = lj and j > i
(2) Oi ∩ Oj ≠ ∅
•) notation: an arc from Si to Sj labelled with the distance d = li - lj ≥ 0
•) example:
DO l = 1, 4
S1 : A(l) : = B(l) + C(l)
S2 : D(l) : = A( l - 1) + C(l)
S3 : E(l) : = A( l - 1) + D( l – 2)
ENDDO
Si   Sj   shared variable   li - lj = d          Type     di
S1   S1   -                 -                    no       -
S1   S2   A                 l - ( l - 1 ) = 1    type-1   d1
S1   S3   A                 l - ( l - 1 ) = 1    type-1   d2
S2   S1   A                 ( l - 1 ) - l = -1   no       -
S2   S2   -                 -                    no       -
S2   S3   D                 l - ( l - 2 ) = 2    type-1   d3
S3   S1   A                 ( l - 1 ) - l = -1   no       -
S3   S2   D                 ( l - 2 ) - l = -2   no       -
S3   S3   -                 -                    no       -
DDG ( loop unrolled for l = 1, ..., 4 ): d1 arcs S1( l ) → S2( l + 1 ), d2 arcs S1( l ) → S3( l + 1 ), and d3 arcs S2( l ) → S3( l + 2 ).
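A small sketch that recovers the type-1 distances in the table above ( each access is encoded here as an ( array, offset from l ) pair, e.g. A( l - 1 ) becomes ("A", -1); only flow dependences are checked ):

stmts = {
    "S1": (("A", 0), [("B", 0), ("C", 0)]),   # (output access, input accesses)
    "S2": (("D", 0), [("A", -1), ("C", 0)]),
    "S3": (("E", 0), [("A", -1), ("D", -2)]),
}
for si, ((wa, woff), _) in stmts.items():
    for sj, (_, reads) in stmts.items():
        for ra, roff in reads:
            if ra == wa:
                d = woff - roff               # writer subscript minus reader subscript
                if d > 0 or (d == 0 and si < sj):
                    print(f"{si} -> {sj}: type-1 on {ra}, d = {d}")
# S1 -> S2: type-1 on A, d = 1
# S1 -> S3: type-1 on A, d = 1
# S2 -> S3: type-1 on D, d = 2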
3.4. Techniques for removing some Data Dependencies
• Variable renaming.
S1 : A : = B * C                   S1 : A : = B * C
S2 : D : = A + 1     ======>       S2 : D : = A + 1
S3 : A : = A * D                   S3 : AA : = A * D
Data dependencies: before renaming there are flow dependencies S1 → S2, S1 → S3, and S2 → S3, an antidependence S2 → S3 ( S3 rewrites A, which S2 reads ), and an output dependence S1 → S3 ( both write A );
after renaming, only the flow dependencies remain.
• Node splitting ( introducing a new variable ).
DO l = 1, N                               DO l = 1, N
                                            S0 : AA(l) : = A( l + 1 )
S1 : A(l) : = B(l) + C(l)     ======>       S1 : A(l) : = B(l) + C(l)
S2 : D(l) : = A(l) + 2                      S2 : D(l) : = A(l) + 2
S3 : F(l) : = D(l) + A( l + 1 )             S3 : F(l) : = D(l) + AA(l)
ENDDO                                     ENDDO
Data dependencies: before splitting, S1 → S2 and S2 → S3 are flow dependencies with d = 0, and the read of A( l + 1 ) in S3 creates an antidependence S3 → S1 with d = 1;
after splitting, that antidependence is carried by the new statement S0 ( S0 → S1, d = 1 ), and the A data reach S3 through a flow dependence S0 → S3 with d = 0.
• Variable substitution.
S1 : X : = A + B                   S1 : X : = A + B
S2 : Y : = C * D     ======>       S2 : Y : = C * D
S3 : Z : = E * X + F * Y           S3 : Z : = E * ( A + B ) + F * ( C * D )
Data dependencies: before substitution, S3 has flow dependencies on S1 ( X ) and S2 ( Y );
after substitution, S3 no longer depends on S1 or S2, so the three statements can be executed in parallel.
4. Parallel Computer Architectures
4.1 Review of Computer Architecture
a. Basic Components of a Computer System
- CPU: includes ALU and Control Unit
- Memory: cache memory, main memory, secondary memory
- I/O: DMA, I/O interface, and I/O devices
- Bus: a set of communication pathways ( data, address, and control signals ) connecting two or more components
- Single Processor Configurations
- Multiple bus connected computer systems.
Figure 4.1: two multiple-bus configurations, each with a separate I/O bus and memory bus.
- Single bus connected systems
Figure 4.2: a single-bus system; the arbiter and the BBSY, BR, and BG control lines are attached to the system bus.
Note:
- At any time only one unit can control the bus, even though two or more units may request the bus at the same time.
- The arbiter is used to select which unit gets to use the bus next.
- BR (bus request), BG (bus grant), SACK (Select acknowledgement), and BBSY (bus busy) are the most important control signals used by the arbiter.
- The timing diagram of these control signals ( BBSY, BR, BG, SACK ) is shown in Figure 4.3.
- Central Processing Unit (CPU)
- ALU : performs all logic and arithmetic operations and has some registers which store operands and results.
- Control Unit: produces all control signals such that instructions can be fetched and executed one by one. The control unit can be implemented by logic circuits or by microprogramming.
An instruction cycle includes several smaller function cycles, e.g.,
1. instruction fetch cycle.
2. indirect (operand fetch) cycle.
3. execution cycle.
d. Memory