AMS 530: Principles of Parallel Computing

Project 1

Xu Shufu

Problem 1.2

Create two NxN square matrices with random elements whose values are within [-1, 1] and compute their product using P processors. Your program should be general for reasonable values of N and P.

Plot the speedup curves at N=1000, 2000, 4000 and P=1, 2, 4, 8 processors.

Solution:

[Algorithm]

To compute the matrix product A*B = C, we can divide A by rows into P blocks A_1, A_2, ..., A_P, all with the same number of rows. It is easy to see that C = [A_1*B; A_2*B; ...; A_P*B], i.e., the i-th row block of C is A_i*B. Hence if P processes can be used in the computation, we can divide A into P parts and let each process compute its own row block of C.
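To make the decomposition concrete, here is a minimal serial sketch of the kernel each process runs on its row block (block_multiply, rows0, and nrows are illustrative names, not taken from the actual program below):

/* Multiply one row block of A (rows [rows0, rows0+nrows)) by the full
 * N x N matrix B, writing the matching row block of C.
 * All matrices are stored row-major in 1-D arrays. */
void block_multiply(const double *A, const double *B, double *C,
                    int N, int rows0, int nrows)
{
    int i, j, k;
    for (i = rows0; i < rows0 + nrows; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0.0;              /* accumulates C(i,j) */
            for (k = 0; k < N; k++)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] = sum;
        }
    }
}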

In my program, process 0 first generates the matrices and then sends each of the other processes its block of A, together with the whole matrix B. The other processes compute their row blocks and send the results back to process 0. Of course, process 0 also computes one block of C itself.

To measure the time consumed by the computation, I inserted calls to MPI_Wtime() before and after the computation in each process, and let process 0 sum up the per-process times and print the total to the console.
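The timing idiom is simply the following (a sketch; MPI_Wtime() returns wall-clock time in seconds as a double):

double t = MPI_Wtime();                /* start of the timed region */
/* ... this process's share of the multiplication ... */
t = MPI_Wtime() - t;                   /* elapsed wall-clock seconds */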

One more thing to mention: I used a one-dimensional array to store each matrix in my program. This is convenient because element (i, j) of an NxN matrix is simply stored at index i*N + j in row-major order.
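For example (IDX is a hypothetical macro shown only to illustrate the mapping; the actual program below writes the arithmetic out directly):

#define IDX(i, j, N) ((i)*(N) + (j))   /* row-major index of element (i,j) */
/* matA[IDX(i, j, MAT_RANK)] is A(i,j); no pointer-to-pointer bookkeeping needed */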

[Program]

Please refer to the last two pages for the commented program source code.

[Result]

First of all, let us look at the tables below to get a sense of the speedup.

Table of the total time consumed by all processes, in seconds. Rows give the matrix size as the number of elements N^2 (i.e., N = 1000, 2000, 4000); columns give the number of processes P.

N^2 / P=1 / P=2 / P=4 / P=8
1,000,000 / 294.194909 / 314.286355 / 329.685670 / 412.958848
4,000,000 / 2846.372501 / 2704.187981 / 2885.517559 / 3056.368553
16,000,000 / 29138.321840 / 30511.331769 / 39101.949395 / 34597.477279

T(P,N): mean time per process, i.e., the total time above divided by P (seconds).

N^2 / P=1 / P=2 / P=4 / P=8
1,000,000 / 294.194909 / 157.1431775 / 82.4214175 / 51.619856
4,000,000 / 2846.372501 / 1352.0939905 / 721.37938975 / 382.046069125
16,000,000 / 29138.321840 / 15255.6658845 / 9775.48734875 / 4324.684659875

S(P,N) = T(1,N) / T(P,N): the speedup.

N^2 / P=1 / P=2 / P=4 / P=8
1,000,000 / 1 / 1.87 / 3.57 / 5.70
4,000,000 / 1 / 2.10 / 3.95 / 7.45
16,000,000 / 1 / 1.91 / 2.98 / 6.74

The following plot gives a much better view of the speedup.

[Plot: speedup curves S(P,N) versus P for N = 1000, 2000, 4000]

[Analysis]

The speedup does not grow as fast as the number of processes does, but as the problem size N grows, the speedup grows faster. Furthermore, there is a trend that the speedup will approach a limit as the number of processes keeps increasing. The reason, I suspect, is the overhead of communication among the processes. Also, unlike this one, most problems have only a limited portion of code that can be sped up by parallelizing it; that is exactly what Amdahl's Law says. Fortunately, computational problems like this one gain dramatic speedup through parallel computing.
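For reference, Amdahl's Law states that if only a fraction f of a program's work can be parallelized, the speedup on P processors is at most

S(P) = 1 / ((1 - f) + f/P),

which approaches the limit 1/(1 - f) no matter how large P becomes.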

Since there are many jobs running on the supercomputer, they affect each other to some extent, which can be seen clearly in the graph above.

Note: something is wrong with the case (N^2 = 4,000,000, P = 2) (marked green): it gives S(P,N) > P, which is probably just measurement error.

Project 1 source code

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

//should be changed accordingly for each run
#define MAT_SIZE 1000000 //4000000 16000000  (= MAT_RANK * MAT_RANK)
#define MAT_RANK 1000    //2000 4000
#define NUM_PROC 1       //2 4 8

double myrand(void);

int main(int argc, char **argv)
{
    //Data declaration
    double *matA, *matB, *matP;
    double time[NUM_PROC];
    int disp, length; //displacement and block length of each matrix block
    int i, j, k;      //counters
    int myrank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != NUM_PROC) MPI_Abort(MPI_COMM_WORLD, 1);

    //allocate memory for the three matrices; matP is zero-initialized
    //(calloc) because the multiplication accumulates into it with +=
    matA = malloc(sizeof(double) * MAT_SIZE);
    matB = malloc(sizeof(double) * MAT_SIZE);
    matP = calloc(MAT_SIZE, sizeof(double));

    length = MAT_SIZE / NUM_PROC; //elements in the block assigned to each process

    if (myrank == 0) //code for process zero
    {
        time[0] = MPI_Wtime();
        for (i = 0; i < MAT_SIZE; i++) matA[i] = myrand(); //Initializing A
        for (i = 0; i < MAT_SIZE; i++) matB[i] = myrand(); //Initializing B
        for (i = 1; i < NUM_PROC; i++) {
            disp = i * length; //block displacement in matrix for process i
            //sending part of matrix A ...
            MPI_Send(matA + disp, length, MPI_DOUBLE, i, i + 98, MPI_COMM_WORLD);
            //... and all of matrix B
            MPI_Send(matB, MAT_SIZE, MPI_DOUBLE, i, i + 99, MPI_COMM_WORLD);
        }
        //process 0 computes the first row block itself
        for (i = 0; i < MAT_RANK / NUM_PROC; i++)
            for (j = 0; j < MAT_RANK; j++)
                for (k = 0; k < MAT_RANK; k++)
                    matP[i*MAT_RANK + j] += matA[i*MAT_RANK + k] * matB[k*MAT_RANK + j];
        time[0] = MPI_Wtime() - time[0];
    }
    else //code for the worker processes (rank = myrank)
    {
        time[myrank] = MPI_Wtime();
        disp = myrank * length; //block displacement in matrix for this process
        MPI_Recv(matA + disp, length, MPI_DOUBLE, 0, myrank + 98, MPI_COMM_WORLD, &status);
        MPI_Recv(matB, MAT_SIZE, MPI_DOUBLE, 0, myrank + 99, MPI_COMM_WORLD, &status);
        for (i = 0; i < MAT_RANK / NUM_PROC; i++)
            for (j = 0; j < MAT_RANK; j++)
                for (k = 0; k < MAT_RANK; k++)
                    matP[disp + i*MAT_RANK + j] += matA[disp + i*MAT_RANK + k] * matB[k*MAT_RANK + j];
        //sending this block of the result to process 0
        MPI_Send(matP + disp, length, MPI_DOUBLE, 0, myrank + 999, MPI_COMM_WORLD);
        //time consumed by this process
        time[myrank] = MPI_Wtime() - time[myrank];
        MPI_Send(time + myrank, 1, MPI_DOUBLE, 0, myrank + 1000, MPI_COMM_WORLD);
        //freeing memory allocated
        free(matA);
        free(matB);
        free(matP);
    }

    if (myrank == 0) //gathering the results and the total time consumed
    {
        for (i = 1; i < NUM_PROC; i++) {
            disp = i * length; //block displacement in matrix for process i
            //receiving result blocks and per-process times iteratively
            MPI_Recv(matP + disp, length, MPI_DOUBLE, i, i + 999, MPI_COMM_WORLD, &status);
            MPI_Recv(time + i, 1, MPI_DOUBLE, i, i + 1000, MPI_COMM_WORLD, &status);
            time[0] += time[i];
        }
        printf("Time used (N^2=%d, P=%d): %f seconds\n", MAT_SIZE, NUM_PROC, time[0]);
        free(matA);
        free(matB);
        free(matP);
    }

    MPI_Finalize();
    return 0;
}

//uniform random double in [-1, 1)
double myrand(void)
{
    return -1.0 + 2.0 * rand() / (RAND_MAX + 1.0);
}