CS 350: Principles of Parallel Computing

LECTURE 13

Matrix-Matrix Operations

1.  Matrix-Matrix Multiply

A×B = C

Matrix A:

A11 A12 … A1q
A21 A22 … A2q
A31 A32 … A3q
… …
An1 An2 … Anq

Matrix B:

B11 B12 … B1q
B21 B22 … B2q
B31 B32 … B3q
… …
Bn1 Bn2 … Bnq

The complexity of multiplying two matrices of order n is

T(n, 1) = c n^3

The core sequential code is only four lines (three nested loops plus one accumulation statement, with C assumed zero-initialized):

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

1.1  Method 1: Row-Column Partition (a.k.a. the Ring Method)

1.1.1  Step 0: Initial Matrix Setup

Here Xij@k denotes that element Xij is stored on processor k.

Matrix A:

A11@1 A12@1 … A1q@1
A21@2 A22@2 … A2q@2
A31@3 A32@3 … A3q@3
… …
An1@p An2@p … Anq@p

Matrix B:

B11@1 B12@2 … B1q@p
B21@1 B22@2 … B2q@p
B31@1 B32@2 … B3q@p
… …
Bn1@1 Bn2@2 … Bnq@p

1.1.2  Step 1:

Multiply Row 1 of A with Column 1 of B to create C11 by processor 1

Multiply Row 2 of A with Column 2 of B to create C22 by processor 2

… …

Multiply Row p of A with Column p of B to create Cpp by processor p

Therefore, the diagonal elements of C matrix are created in parallel.

1.1.3  Step 2:

Roll all rows of A up by one processor, then multiply the corresponding rows of A and columns of B to form the next (wrapped) diagonal of C elements.

Matrix A:

A21@1 A22@1 … A2q@1
A31@2 A32@2 … A3q@2
… …
A11@p A12@p … A1q@p

Matrix B:

B11@1 B12@2 … B1q@p
B21@1 B22@2 … B2q@p
B31@1 B32@2 … B3q@p
… …
Bn1@1 Bn2@2 … Bnq@p

1.1.4  Step 3:

Repeat Step 2 until every row of A has passed through every processor; the multiplication is then complete.

Timing analysis:

On one processor:

T_comp(n, 1) = c n^3 t_comp

where t_comp is the time needed to perform one unit of computation, such as multiplying a pair of numbers.

On p processors, the total cost is

T(n, p) = T_comp(n, p) + T_comm(n, p)

The communication cost to roll the rows of A is:

T_comm(n, p) = (p - 1) · (n^2/p) · t_comm ≈ n^2 t_comm

where t_comm is the time needed to communicate one number.

The computation cost of multiplying an (n/p) × n block of A with an n × (n/p) block of B, over all p steps, is

T_comp(n, p) = p^2 · c · (n/p)^3 · t_comp = c n^3 t_comp / p

Thus, the speedup is

S(n, p) = T(n, 1) / [T_comp(n, p) + T_comm(n, p)]

= p / {1 + c' · (p/n) · (t_comm/t_comp)}

Remarks:

1.  Overhead is proportional to p/n, i.e., speedup increases when n/p increases

2.  Overhead is proportional to t_comm/t_comp, i.e., speedup increases when the communication-to-computation cost ratio decreases

3.  The above two trends are universal in parallel computing

4.  Memory is parallelized

1.2  Method 2: Broadcast-Multiply-Roll (BMR, also known as Fox's algorithm)

1.2.1  Step 0: Initial Block Partition (shown for a 3 × 3 processor grid, p = 9):

Matrix A:

A11@1 A12@2 A13@3
A21@4 A22@5 A23@6
A31@7 A32@8 A33@9

Matrix B:

B11@1 B12@2 B13@3
B21@4 B22@5 B23@6
B31@7 B32@8 B33@9

Implementations:

1.2.2  Step 1:

Broadcast each diagonal block of A to all processors in its row:

Matrix A becomes:

A11@1 A11@2 A11@3
A22@4 A22@5 A22@6
A33@7 A33@8 A33@9

1.2.3  Step 2:

Multiply Row 1 of A with Row 1 of B, Row 2 of A with Row 2 of B, etc., to produce partial sums of C.

1.2.4  Step 3:

Broadcast the next (wrapped) diagonal blocks of A, roll the rows of B up by one processor row, and then multiply as in Step 2:

Matrix A becomes:

A12@1 A12@2 A12@3
A23@4 A23@5 A23@6
A31@7 A31@8 A31@9

Matrix B becomes:

B21@1 B22@2 B23@3
B31@4 B32@5 B33@6
B11@7 B12@8 B13@9

1.2.5  Step 4:

Repeat Step 3 above until all B blocks have traveled to all processor rows; the complete C elements are then gathered.

Done.

1.2.6  Timing analysis:

On one processor:

T_comp(n, 1) = 2 n^3 t_comp

where t_comp is the time needed to perform one unit of computation, such as multiplying a pair of numbers.

On p processors, the total cost includes the broadcast (communication), the sub-matrix multiply, and the rolling of matrix elements:

T(n, p) = T_B(n, p) + T_M(n, p) + T_R(n, p)

The broadcast cost is

T_B(n, p) = m^2 t_comm + (q - 2) t_startup

where m = n/q and the p processors form a q × q grid (q = √p).

The sub-matrix multiply cost is

T_M(n, p) = 2 m^3 t_comp

The rolling up sub-matrix cost is

T_R(n, p) = 2 m^2 t_comm

Thus, the speedup is

S(n, p) = T(n, 1) / [T_B(n, p) + T_M(n, p) + T_R(n, p)]

And the overhead is

h(n, p) = [1/(nq) + (p - 2q)/(2n^2)] · [t_comm/t_comp]

Remarks:

1.  Overhead is proportional to p/n, i.e., speedup increases when n/p increases

2.  Overhead is proportional to t_comm/t_comp, i.e., speedup increases when the communication-to-computation cost ratio decreases

3.  The above two trends are universal in parallel computing

4.  Memory is parallelized

1.3  Method 3: Column-Partition A and Row-Partition B (Matrix-Vector Multiply):

Matrix A:

A11@1 A12@2 … A1q@p
A21@1 A22@2 … A2q@p
A31@1 A32@2 … A3q@p
… …
An1@1 An2@2 … Anq@p

Vector B:

B1 @ 1
B2 @ 2
B3 @ 3
… …
Bn @ p

Implementations:

1.3.1  Step1:

Multiply the entire row of elements of A with the corresponding vector elements, then lump (sum) the partial products to form one element of the resulting vector C:

Matrix A (row 1):

A11@1 A12@2 … A1q@p

Vector B:

B1 @ 1
B2 @ 2
B3 @ 3
… …
Bn @ p

1.3.2  Step 2:

Repeat the above Step 1 for all rows to form the complete resulting vector C. Done.

1.3.3  Timing analysis:

Each processor now works on an n × (n/p) block of A, producing a partial result vector of length n. The computation time on each processor is

T_comp(n, p) = c · n · (n/p)

The communication cost to lump the partial products and form the scalars of C is:

T_comm(n, p) = d'' · p · (n/p) = d'' n

Thus, the speedup is

S(n, p) = T(n, 1) / [T_comp(n, p) + T_comm(n, p)] = p / [1 + c' · p/n]

Remarks:

1.  Overhead is proportional to p/n, i.e., speedup increases when n/p increases

2.  The method is less straightforward to implement

3.  Memory is parallelized, and so is the communication

4.  However, the communication cost has increased relative to Method 1