CS 350: Principles of Parallel Computing
LECTURE 13
Matrix-Matrix Operations
1. Matrix-Matrix Multiply
A×B = C
Matrix A:
A11 A12 … A1q
A21 A22 … A2q
A31 A32 … A3q
… …
An1 An2 … Anq
Matrix B:
B11 B12 … B1q
B21 B22 … B2q
B31 B32 … B3q
… …
Bn1 Bn2 … Bnq
The complexity of multiplying two matrices of order n is
T(n, 1) = c n³
The sequential kernel is only four lines of code (assuming 0-based arrays and C zero-initialized beforehand):
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];
1.1 Method 1: Row-Column Partition (also known as the Ring Method):
1.1.1 Step 0: Initial Matrix Setup
Matrix A:
A11@1 A12@1 … A1q@1
A21@2 A22@2 … A2q@2
A31@3 A32@3 … A3q@3
… …
An1@p An2@p … Anq@p
Matrix B:
B11@1 B12@2 … B1q@p
B21@1 B22@2 … B2q@p
B31@1 B32@2 … B3q@p
… …
Bn1@1 Bn2@2 … Bnq@p
1.1.2 Step 1:
Multiply Row 1 of A with Column 1 of B to create C11 by processor 1
Multiply Row 2 of A with Column 2 of B to create C22 by processor 2
… …
Multiply Row p of A with Column p of B to create Cpp by processor p
Therefore, the diagonal elements of C matrix are created in parallel.
1.1.3 Step 2:
Roll all rows of A up by one processor, then multiply the corresponding rows of A and columns of B to form the next diagonal of C elements.
Matrix A:
A21@1 A22@1 … A2q@1
A31@2 A32@2 … A3q@2
… …
A11@p A12@p … A1q@p
Matrix B:
B11@1 B12@2 … B1q@p
B21@1 B22@2 … B2q@p
B31@1 B32@2 … B3q@p
… …
Bn1@1 Bn2@2 … Bnq@p
1.1.4 Step 3:
Repeat Step 2 until all rows of A have passed through all processors; the multiplication is then complete.
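As an illustration, the routine below is a minimal sketch of how the ring method could be coded in C with MPI. It assumes (these details are not in the lecture) that nb = n/p, that each process already holds its nb rows of A (Aloc, nb × n, row-major), its nb columns of B (Bloc, n × nb, row-major), and a zero-initialized n × nb column block of C (Cloc), and that MPI ranks run 0..p-1 rather than 1..p; all names are illustrative.
#include <mpi.h>

void ring_matmul(double *Aloc, double *Bloc, double *Cloc,
                 int n, int nb, int rank, int p)
{
    int owner = rank;   /* index of the row block of A currently held */
    for (int step = 0; step < p; step++) {
        /* Multiply the current rows of A by the local columns of B:
           this produces the nb x nb block C(owner, rank). */
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                for (int k = 0; k < n; k++)
                    Cloc[(owner * nb + i) * nb + j] += Aloc[i * n + k] * Bloc[k * nb + j];

        /* Roll the rows of A up by one processor: send to rank-1,
           receive from rank+1, wrapping around the ring. */
        int up   = (rank - 1 + p) % p;
        int down = (rank + 1) % p;
        MPI_Sendrecv_replace(Aloc, nb * n, MPI_DOUBLE, up, 0, down, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        owner = (owner + 1) % p;
    }
}
After p such multiply-and-roll steps every row block of A has visited every process, and each process owns one complete column block of C.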
Timing analysis:
On one processor:
T_comp(n, 1) = c n³ t_comp
where t_comp is the time needed to perform a unit computation, such as multiplying a pair of numbers.
On p processors, the total cost is
T(n, p) = T_comp(n, p) + T_comm(n, p)
The communication cost to roll the rows of A is
T_comm(n, p) = (p - 1) · (n²/p) · t_comm ≈ n² t_comm
where t_comm is the time needed to communicate one number.
The computation cost of multiplying the local (n/p) × n block of A with the local n × (n/p) block of B, over all p steps, is
T_comp(n, p) = p² · c · (n/p)³ · t_comp = c n³ t_comp / p
Thus, the speedup is
S(n, p) = T(n, 1) / [T_comp(n, p) + T_comm(n, p)]
= p / {1 + c' (p/n) (t_comm/t_comp)}
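As a rough numerical illustration (the values are hypothetical, taking c' ≈ 1): if n/p = 100 and t_comm/t_comp = 10, the overhead term is c' (p/n) (t_comm/t_comp) ≈ (1/100) · 10 = 0.1, so S(n, p) ≈ p/1.1 ≈ 0.91 p, i.e., roughly 91% parallel efficiency.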
Remarks:
1. Overhead is proportional to p/n, i.e., speedup increases when n/p increases
2. Overhead is proportional to [tcomm/tcomp], i.e., speedup increases when tcomm/tcomp decreases
3. The above two are universal for parallel computing
4. Memory is parallelized
1.2 Method 2: Broadcast-Multiply-Roll (BMR method)
1.2.1 Step 0: Initial Block Partition:
Matrix A:
A11@1 A12@2 A13@3
A21@4 A22@5 A23@6
A31@7 A32@8 A33@9
Matrix B:
B11@1 B12@2 B13@3
B21@4 B22@5 B23@6
B31@7 B32@8 B33@9
Implementations:
1.2.2 Step 1:
Broadcast each diagonal block of A to all processors in its row:
Matrix A becomes:
A11@1 A11@2 A11@3
A22@4 A22@5 A22@6
A33@7 A33@8 A33@9
1.2.3 Step 2:
Multiply Row 1 of A with Row 1 of B, Row 2 of A with Row 2 of B, etc., to produce partial C.
1.2.4 Step 3:
Broadcast the next diagonal blocks of A, roll the rows of B up by one, and then multiply as in Step 2:
Matrix A becomes:
A12@1 A12@2 A12@3
A23@4 A23@5 A23@6
A31@7 A31@8 A31@9
Matrix B becomes:
B21@1 B22@2 B23@3
B31@4 B32@5 B33@6
B11@7 B12@8 B13@9
1.2.5 Step 4:
Repeat Step 3 until all blocks of B have visited every processor; the complete C blocks are then available.
Done.
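As an illustration, the routine below is a minimal sketch of how the broadcast, multiply, and roll steps could be coded in C with MPI on a q × q process grid. It assumes (details not in the lecture) that p = q², m = n/q, that each process already holds its m × m blocks Ablk and Bblk plus a zero-initialized Cblk, and that grid_comm contains exactly the p grid processes; all names are illustrative.
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void bmr_matmul(double *Ablk, double *Bblk, double *Cblk,
                int m, int q, MPI_Comm grid_comm)
{
    int rank;
    MPI_Comm_rank(grid_comm, &rank);
    int myrow = rank / q, mycol = rank % q;

    /* Communicators for this process's grid row and grid column. */
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(grid_comm, myrow, mycol, &row_comm);
    MPI_Comm_split(grid_comm, mycol, myrow, &col_comm);

    double *Atmp = malloc(m * m * sizeof(double));

    for (int s = 0; s < q; s++) {
        /* Steps 1 and 3: broadcast the (shifted) diagonal block
           A[myrow][(myrow+s) mod q] to every process in the same grid row. */
        int root = (myrow + s) % q;
        if (mycol == root)
            memcpy(Atmp, Ablk, m * m * sizeof(double));
        MPI_Bcast(Atmp, m * m, MPI_DOUBLE, root, row_comm);

        /* Step 2: multiply the broadcast A block with the current B block
           and accumulate into the partial C block. */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                for (int k = 0; k < m; k++)
                    Cblk[i * m + j] += Atmp[i * m + k] * Bblk[k * m + j];

        /* Roll B up by one grid row: send to the row above, receive from below. */
        int up   = (myrow - 1 + q) % q;
        int down = (myrow + 1) % q;
        MPI_Sendrecv_replace(Bblk, m * m, MPI_DOUBLE, up, 0, down, 0,
                             col_comm, MPI_STATUS_IGNORE);
    }

    free(Atmp);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
After the q stages, each process holds its finished m × m block of C.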
1.2.6 Timing analysis:
On one processor:
T_comp(n, 1) = 2 n³ t_comp
where t_comp is the time needed to perform a unit computation, such as multiplying a pair of numbers.
On p processors, the total cost includes the broadcast (communication), the sub-matrix multiply, and the rolling of matrix blocks:
T(n, p) = T_B(n, p) + T_M(n, p) + T_R(n, p)
The broadcast cost is
T_B(n, p) = m² t_comm + (q - 2) t_startup
where m = n/q and q = √p is the number of processor rows (and columns) in the grid.
The sub-matrix multiply cost is
T_M(n, p) = 2 m³ t_comp
The rolling up sub-matrix cost is
T_R(n, p) = 2 m² t_comm
Thus, the speedup is
S(n, p) = T(n, 1) / [T_B(n, p) + T_M(n, p) + T_R(n, p)]
And the Overhead is
h(n, p) = [1/(n q) + (p - 2q)/(2 n²)] · [t_comm/t_comp]
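To make the block sizes concrete (a hypothetical illustration using n = 300 on the 3 × 3 grid above): p = 9, q = 3, and m = n/q = 100, so per step each broadcast costs m² t_comm + (q - 2) t_startup = 10,000 t_comm + t_startup, each local multiply costs 2m³ t_comp = 2,000,000 t_comp, and each roll costs 2m² t_comm = 20,000 t_comm.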
Remarks:
1. Overhead is proportional to p/n, i.e., speedup increases when n/p increases
2. Overhead is proportional to [tcomm/tcomp], i.e., speedup increases when tcomm/tcomp decreases
3. The above two are universal for parallel computing
4. Memory is parallelized
1.3 Method 3: Column partition A and Row partition B:
Matrix A:
A11@1 A12@2 … A1q@p
A21@1 A22@2 … A2q@p
A31@1 A32@2 … A3q@p
… …
An1@1 An2@2 … Anq@p
Vector B:
B1 @ 1
B2 @ 2
B3 @ 3
…
Bn @ p
Implementations:
1.3.1 Step1:
Each processor multiplies its portion of a row of A with its portion of the vector B; the partial products are then lumped (summed across processors) to form one element of the resulting vector C:
Matrix A:
A11@1 A12@2 … A1q@p
Vector B:
B1 @ 1
B2 @ 2
B3 @ 3
…
Bn @ p
1.3.2 Step 2:
Repeat the above Step 1 for all rows to form the complete resulting vector C. Done.
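As an illustration, the routine below is a minimal sketch of Method 3 in C with MPI, following the matrix-vector form used above. It assumes (details not in the lecture) that nb = n/p, that Acols is the local n × nb column block of A (row-major), that Bpart holds the matching nb entries of B, and that the lumping is done with a global sum so every process ends up with the full result vector C; all names are illustrative, and the reduction shown is just one way to do the lumping.
#include <mpi.h>
#include <stdlib.h>

void colrow_multiply(const double *Acols, const double *Bpart,
                     double *C, int n, int nb)
{
    double *Cpart = calloc(n, sizeof(double));   /* local partial result, zeroed */

    /* Partial products: this process's columns of A times its piece of B. */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < nb; k++)
            Cpart[i] += Acols[i * nb + k] * Bpart[k];

    /* Lump (sum) the partial results across all processors to form C. */
    MPI_Allreduce(Cpart, C, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(Cpart);
}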
1.3.3 Timing analysis:
Each processor now works on an n × (n/p) block of A and its n/p elements of the vector, so the time on each processor is
T_comp(n, p) = c · n · (n/p)
The communication cost to lump the partial sums into the elements of C is
T_comm(n, p) = d'' · p · (n/p) = d'' · n
Thus, the speedup is
S(n, p) = T(n, 1) / [T_comp(n, p) + T_comm(n, p)] = p / [1 + c' p/n]
Remarks:
1. Overhead is proportional to p/n, or speedup increases when n/p increases
2. The method is not as easy to implement
3. Memory is parallelized and so is the communication
4. But the communication cost has increased relative to Method 1.