Project Report of Parallel Matlab Implementation with Communication System Simulation

Abstract: Many third-party, free toolkits for parallel Matlab implementation are available today. However, not all of them are simple and efficient enough to run in our laboratory’s PC environment. To find an appropriate tool for parallel Monte Carlo (MC) simulation, eight candidate tools were screened, and several of them were also tested on the Windows platform. “pMatlab” from MIT was finally selected for parallel simulation applications. First, a simple program written by the author was modified with “pMatlab,” and the resulting running-time reduction was verified. Second, a communication system simulation example from [1] was recoded in the so-called “parallel array programming” style, and the efficiency of the parallel computing was confirmed again.

1. Candidate Software Screening

In this project, eight currently available parallel-computing tools are cataloged into three categories: 1) Embarrassingly Parallel, 2) Message Passing, and 3) Back-End Support, as in Table 1. Most of these tools are not fit for a Windows intranet environment. Some are dormant and no longer updated to support newer versions of Windows; some have turned from free software into commercial software; and the others require a UNIX environment to support workload sharing. The only decent free tools usable in our current laboratory are MIT’s “MatlabMPI” and its upgraded version, “pMatlab.” Below is a brief item-by-item summary of these tools.

Category: Embarrassingly Parallel
Concept: Keeps the parent process working on the local processor while distributing child processes among the other processors; gathers all results returned by the child processes before the parent process carries on.
Disadvantages: 1. No communication between child processes; 2. Parallel iterative calculation unavailable; 3. Parallel algorithm uncontrollable.
Advantages: 1. Simple, as it works on the whole object script; 2. Free and compact.

Category: Message Passing
Concept: Every child process running on the other processors is independent and can be manipulated and invoked by peer / parent processes.
Disadvantages: 1. Complex coding; 2. Command-level parallel operation not as robust as script-level operation; 3. Communication overhead between processors is costly; 4. Limited replacements for original Matlab functions.
Advantages: 1. Flexible operations on child processes; 2. Compiled parallel commands similar to original Matlab commands in syntax.

Category: Back-End Support
Concept: Front-end functions work as a user interface; the front end packs and sends tasks to a computation engine, which should be a powerful computation library. Calculation is done by the engine ONLY, and the result is sent back to the client interface automatically or upon request.
Disadvantages: 1. Not truly parallel calculation; 2. Efficiency depends on the number of tasks in the queue; 3. A powerful calculation engine is costly.
Advantages: 1. Only one Matlab license needed.

Table 1. Three Categories of Parallel Matlab [2]

1.1 “Parmatlab”

Category: Embarrassingly parallel in a UNIX or Windows environment. Works on a multi-processor or network platform.

Status: Dormant. Supports Matlab R12; the last update was 2001-04-25.

Test Result: Could not pass its self-test with Matlab R2006b (a port-opening error in the mex file tcpipmex.c: “Too many open port?”).

Conclusion: The problem originates in a low-level C file and could not be fixed.

1.2 “Matlab Parallelization Toolkit”

Category: Embarrassingly parallel in a UNIX environment. Works on a multi-processor or network platform.

Status: Active. Supports Matlab R11, R12, R13, and R14SP1.

Test Result: Not tried so far, as no UNIX environment was available.

Conclusion: According to the documents offered by the developer, this tool has neat syntax, especially for calculations with low inter-process communication, e.g., “for” loops. However, as an embarrassingly parallel tool, it is still limited and requires multiple licenses for group calculation.

1.3 “MULTICORE”

Category: Embarrassingly parallel in a Windows environment. Works on a multi-processor or network platform.

Status: Active. Supports R2006b.

Test Result: Multi-processor mode did work, but the calculation speed was not improved notably; its efficiency is therefore code-dependent.

Conclusion: Although limited in functionality, this tool is easy to use, and “no special toolboxes are used, no compilation of mex-files is necessary”. Nevertheless, it is not fit for scenarios with huge numbers of function evaluations, where “every function evaluation only needs a fraction of a second to execute, then the overhead might be significant.”

1.4 “Cornell Multitasking Toolbox for Matlab (CMTM)”

(http://www.tc.cornell.edu/services/support/forms/cmtm.p2.htm)

Category: Message-passing parallel in a UNIX or Windows environment. Works on a multi-processor or network platform.

Status: Active, but only a time-limited trial version. Supports R2006b.

Test Result: An additional MPI/Pro 1.7.0 toolbox is required.

Conclusion: N/A

1.5 “MatlabMPI”

(http://www.ll.mit.edu/MatlabMPI/)

Category: Message-passing parallel in a UNIX or Windows environment. Works on a multi-processor or network platform.

Status: Dormant.

Test Result: Multi-processor mode worked. With one simple MatlabMPI command, a few of the tested programs completed their calculations in 30%-40% less time on a two-core PC. On the other hand, more than half of the programs were slowed down by the MatlabMPI command. Since the multi-processor mode of this tool is just a command that takes the original program name as one of its inputs, its efficiency is code-dependent. Further, the networking mode did not work, because MatlabMPI only recognizes remote computer names defined in UNIX syntax; neither Windows PC names nor IP addresses are accepted.

Conclusion: Calculation efficiency is code-dependent. Resource allocation is transparent to, and uncontrollable by, the user.
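For reference, running a script through MatlabMPI looks roughly like the line below. This is a sketch from memory of the MatlabMPI interface, so the exact MPI_Run signature should be verified against the distribution:

% Launch 'test.m' as 2 Matlab processes on the local machine (assumed call).
eval(MPI_Run('test', 2, {}));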

1.6 “pMatlab”

(http://www.ll.mit.edu/pMatlab/)

Category: Message-passing parallel in a UNIX or Windows environment. Works on a multi-processor or network platform.

Status: Active. An upgraded version of “MatlabMPI”.

Test Result: Program syntax is similar to the original Matlab, with much more flexible parallel commands for distribution and collection operations. It introduces a new underlying data type, the “distributed matrix,” which maps grid data onto different processors. Its networking mode has the same problem as “MatlabMPI”’s networking mode.

Conclusion: An innovative “parallel array programming” mechanism is introduced. More flexible and powerful parallel calculation applications can be developed with this kit than with MatlabMPI.
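To illustrate the mechanism, the following minimal sketch shows the map / local / aggregate pattern, using only pMatlab calls that also appear in the appendix listings of this report (the matrix sizes are arbitrary):

pMatlab_Init;                          % start pMatlab
Ncpus = pMATLAB.comm_size;             % number of participating CPUs
map1 = map([Ncpus 1], {}, 0:Ncpus-1);  % split rows, one block per CPU
A = zeros(Ncpus, 100, map1);           % distributed matrix ("dmat")
my_A = local(A);                       % this CPU's local rows only
my_A = my_A + 1;                       % purely local computation
A = put_local(A, my_A);                % write the local part back
A_full = agg(A);                       % aggregate onto the leader process
pMatlab_Finalize;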

1.7 “NetSolve”

(http://icl.cs.utk.edu/netsolve/index.html)

Category: Back-end support parallel on a network platform. The host works in a UNIX environment only; the slave machines can work in a UNIX or Windows environment.

Status: Active

Test Result: Could not pass its self-test without an available UNIX host server. The default processing server, “netsolve.cs.utk.edu”, had more than 13,000 tasks in its queue.

Conclusion: A local UNIX host server must be set up to accept and distribute tasks over a heterogeneous network.

1.8 “Distributed Computing Toolbox 3.1” from Mathworks

Category: Message-passing parallel in a Windows environment. Works on a multi-processor or network platform. For the network mode, the additional “MATLAB Distributed Computing Engine” is needed.

Status: Active

Test Result: N/A

Conclusion: Commercial software.

1.9 “Star-P”, originating from MIT’s “MATLAB*P” (Ron Choy et al.)

Category: Message-passing parallel in a UNIX or Windows environment.

Status: Active

Test Result: N/A

Conclusion: Commercial software.

2. Performance Measurement with a Simple Code

To examine the relationship between the running time T and the number of experiments N in one Monte Carlo (MC) implementation, two simple programs and their parallel counterparts (pMatlab modifications) are given in this section. In the first program, “test.m”, M random bits and noise samples are generated in each experiment, and N experiments in total are processed in one program execution. In the second example, “testB.m” (see the Appendix for the code of “test.m” and “testB.m”), an integrated bit generation / detection system with the same M-by-N data structure is given. As illustrated in Fig. 1, if the execution of “test.m” is strictly allocated to one processor, T is linear in the number of MC experiments N. On the other hand, since the M bits of each experiment are generated by the “rand(1,M)” function all at once, T does not increase linearly with M. For this reason, distributing the N iterations equally to two CPUs and letting the two processors run N/2 experiments each in parallel should roughly cut T in half. In theory, then, the efficiency of executing an MC simulation might scale linearly, doubling or quadrupling as the number of CPUs increases.

Fig. 1. T increasing with N vs. T increasing with M
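This expectation can be summarized by a simple cost model (a sketch for later reference; the overhead term is an assumption made explicit here, not a measured quantity):

$$T_1(N) \approx c\,N, \qquad T_P(N) \approx T_{\mathrm{ovh}} + \frac{c\,N}{P},$$

where c is the average cost of one M-bit experiment, P is the number of CPUs, and T_ovh is the fixed cost of the data distribution / collection operations.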

Thus, the “parallel versions” of the two examples were written to verify the above inference. The flow chart of “testpB.m”, the parallel counterpart of “testB.m”, is shown in Fig. 2 [3] (see the Appendix for the parallel versions of the two examples). For example, when running the simulation on two CPUs, the M×N information bit array is divided into two M×(N/2) bit matrices. In the four-CPU scenario, the dimensions of the “distributed matrix” on each CPU change to M by N/4.

Fig. 2. Flow Chart of the Parallel Program “testpB.m”
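The distribution and collection steps in Fig. 2 correspond to the following lines, condensed from the appendix listing “testpB.m” (a sketch of the pattern only; variable setup is abbreviated):

map1 = map([Ncpus 1], {}, 0:Ncpus-1);       % one block of rows per CPU
error_count = zeros(N, 1, map1);            % N x 1 distributed result vector
my_error_count = local(error_count);        % this CPU's local slice
% ... each CPU then loops over its own round(N/Ncpus) experiments ...
error_count = put_local(error_count, my_error_count);  % write results back
error_count_final = agg(error_count);       % gather on the leader process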

As a result, the running-time reduction achieved by the parallel Matlab program “testpB.m” is shown in Fig. 3.

Fig. 3. Running-time reduction achieved by the pMatlab program “testpB.m”

In Fig. 3, the “theoretical” running times T’ for 2 CPUs and T’’ for 4 CPUs represent one half and one quarter of the original one-CPU running time T, respectively. In fact, Fig. 3 only shows the general trend of the running-time reduction realized by parallel Matlab in our examples. For the exact quantitative efficiency improvements, consider the empirical time measurements T0, T0’, and T0’’ in Table 2.

No. of Experiments   T0 (sec)    T0' on 2 CPUs (sec)   T0'/T0   T0'' on 4 CPUs (sec)   T0''/T0
500                    23.766         22.500            0.95         16.188             0.68
1,000                  48.094         34.797            0.72         23.094             0.48
2,000                  96.266         61.656            0.64         35.531             0.37
4,000                 192.547        116.219            0.60         63.609             0.33
8,000                 385.531        226.781            0.59        117.375             0.30
16,000                769.516        441.406            0.57        224.594             0.29
32,000               1531.100        882.672            0.58        439.719             0.29
64,000               3079.000       1752.600            0.57        867.906             0.28

Table 2. Running-time measurements of “testpB.m” execution with the corresponding calculation-efficiency improvements

Notably, in the table, the more experiments the simulation contains, the greater the running-time reduction achieved by the parallel Matlab programs. The reason is that, compared with the time spent on the main simulation part, the time spent on the data distribution / collection operations (Fig. 2) can be regarded as fixed, no matter how large the distributed matrix is. Consequently, the longer the simulation runs, the closer the actual running time T0’ gets to the preceding “theoretical” running time T’.
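In terms of the cost model of Section 2, with per-experiment cost c and fixed overhead T_ovh, this is simply (a one-line sketch, not a measured fit):

$$\frac{T_P}{T_0} \approx \frac{1}{P} + \frac{T_{\mathrm{ovh}}}{c\,N} \;\longrightarrow\; \frac{1}{P} \qquad (N \to \infty),$$

which matches the monotonically decreasing ratio columns of Table 2. In the limit, this tendency can be expressed as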

$$\lim_{T_0 \to \infty} T_0' = T'_{\mathrm{theor}} = 0.5\,T_0 \quad \text{(for 2 CPUs)}$$

$$\lim_{T_0 \to \infty} T_0'' = T''_{\mathrm{theor}} = 0.25\,T_0 \quad \text{(for 4 CPUs)}$$

The largest time reduction observed in the above experiment is T0’ = 0.57 T0 (2 CPUs) and T0’’ = 0.28 T0 (4 CPUs), at N = 64,000.

3. Parallel Programming a MC Simulation Example of a Communication System

Finally, to verify the validity of pMatlab, an MC simulation example estimating the BER of a BPSK system was recoded with parallel arrays and run on a four-core PC (see the Appendix for the original program from [1] and the modified code). The estimation outputs of the “parallelized” simulation are the same as the results of the original example [1] (see the Appendix for the estimation results and running-time measurements). Fig. 4 compares the flow chart of the “parallelized” program with that of the original code. As mentioned before, setting up the distribution map, formatting the input / output distributed matrices, and post-processing the aggregated outputs are the key steps deserving the most attention; a condensed skeleton follows the figure.

Fig. 4. Comparison between the original code flow chart and the “parallelized” program flow chart
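The three key steps look as follows when distilled from the appendix listing “c102p_MCBPSKber_3.m” (a sketch; variable setup is abbreviated):

map1 = map([Ncpus 1], {}, 0:Ncpus-1);           % 1) distribution map
Tol_BER = zeros(Ncpus, length(EbNodB), map1);   % 2) distributed output matrix
my_Tol_BER = local(Tol_BER);                    %    local slice of the dmat
for k = 1:length(EbNodB)
    for j = 1:N/Ncpus                           % each CPU runs N/Ncpus trials
        [BER_j, Err_j] = c10_MCBPSKrun(my_M(k), z(k), delay, FilterSwitch);
        my_Tol_BER(k) = my_Tol_BER(k) + BER_j;
    end
end
Tol_BER = put_local(Tol_BER, my_Tol_BER);       %    write local results back
Tol_BER_final = agg(Tol_BER);                   % 3) aggregate on the leader
BER_avg = sum(Tol_BER_final, 1)/N;              %    post-process: average rows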

4. Appendix

1) “test.m”

N=input('Number of experiments >');
M=input('Number of bits/experiment >');
t0=cputime;
for j=1:N
    d=round(rand(1,M));    %data generate
    x=2*d-1;
    n=0.89*randn(1,M);
    y=x+n;
end;
t_sum=cputime-t0;
t_avg=t_sum/N;
disp(t_sum);
disp(t_avg);

------

2) “testp.m”

N=1e6;
M=1e4;
t0=cputime;
% Turn parallelism on or off.
PARALLEL = 1;
% Create Maps.
map1 = 1;
if (PARALLEL)
    % Initialize pMatlab.
    pMatlab_Init;
    Ncpus = pMATLAB.comm_size;
    my_rank = pMATLAB.my_rank;
    % Break up rows.
    map1 = map([Ncpus 1], {}, 0:Ncpus-1);
else
    Ncpus = 1;   % serial fallback so Ncpus is always defined
end
d=zeros(Ncpus,M,map1);
error_count=zeros(N,1,map1);
% Get the local portion of the global indices
my_i_global = global_ind(d, 1);
% Get the local portion of the distributed matrix
my_d = local(d);
my_error_count=local(error_count);
for i_local=1:round(N/Ncpus)
    my_d=round(rand(1,M));
    x=2*my_d-1;
    n=0.89*randn(1,M);
    y=x+n;
end;
% Finalize the pMATLAB program
if (PARALLEL)
    disp('SUCCESS_Parallel');
    pMatlab_Finalize;
end;
t_sum=cputime-t0;
t_avg=t_sum/N;
disp(t_sum);
disp(t_avg);

------

3) “testB.m”

clear all
N=input('Number of experiments >');
M=input('Number of bits/experiment >');
error_count=zeros(1,N);
t0=cputime;
for j=1:N
    for u=1:M
        d=round(rand(1));    %data generate
        x=2*d-1;
        n=0.89*randn(1);
        y=x+n;
        if y>0
            est=1;
        else
            est=0;
        end;
        if (est~=d)
            error_count(j)=error_count(j)+1;
        end;
    end;
end;
BER=error_count/M;
t_sum=cputime-t0;
t_avg=t_sum/N;
disp(t_sum);
disp(t_avg);

------

4) “testpB.m”

N=1e4;
M=1e4;
t0=cputime;
% Turn parallelism on or off.
PARALLEL = 1;
% Create Maps.
map1 = 1;
if (PARALLEL)
    % Initialize pMatlab.
    pMatlab_Init;
    Ncpus = pMATLAB.comm_size;
    my_rank = pMATLAB.my_rank;
    % Break up rows.
    map1 = map([Ncpus 1], {}, 0:Ncpus-1);
else
    Ncpus = 1;   % serial fallback so Ncpus is always defined
end
d=zeros(Ncpus,M,map1);
error_count=zeros(N,1,map1);
% Get the local portion of the global indices
my_i_global = global_ind(d, 1);
% Get the local portion of the distributed matrix
my_d = local(d);
my_error_count=local(error_count);
for i_local=1:round(N/Ncpus)
    for u=1:M
        my_d(u)=round(rand(1));
        x=2*my_d(u)-1;
        n=0.89*randn(1);
        y=x+n;
        if y>0
            est=1;
        else
            est=0;
        end;
        if (est~=my_d(u))
            my_error_count(i_local)=my_error_count(i_local)+1;
        end;
    end;
end;
error_count=put_local(error_count,my_error_count);
error_count_final=agg(error_count);
% Finalize the pMATLAB program
if (PARALLEL)
    disp('SUCCESS_Parallel');
    pMatlab_Finalize;
end;
t_sum=cputime-t0;
t_avg=t_sum/N;
disp(t_sum);
disp(t_avg);

------

5) “c10.2_MCBPSKber_3.m”

clear all
N=100;                       %Experiment times for each EbNo setting
EbNodB=0:8;                  %vector of Eb/No(dB) values
z=10.^(EbNodB/10);           %convert to linear scale
delay=54;                    %enter delay value (samples)
BER=zeros(1,length(z));      %initialize BER vector
Errors=zeros(1,length(z));   %initialize Errors vector
BER_T=qfunc(sqrt(2*z));      %theoretical (AWGN) BER vector
M=round(1000./BER_T);        %guarantee about 1000 errors per trial
FilterSwitch=1;
t=cputime;
Tol_BER=zeros(1,length(z));
Tol_Errors=zeros(1,length(z));
Avg_BER=zeros(1,length(z));
Avg_Errors=zeros(1,length(z));
for k=1:length(z)
    M(k)=max(1000,M(k));     %ensure at least one block processed
    for j=1:N                %N experiments for each EbNo setting
        [BER(j),Errors(j)]=c10_MCBPSKrun(M(k),z(k),delay,FilterSwitch);
        Tol_BER(k)=Tol_BER(k)+BER(j);
        Tol_Errors(k)=Tol_Errors(k)+Errors(j);
    end;
    Avg_BER(k)=Tol_BER(k)/N;
    Avg_Errors(k)=Tol_Errors(k)/N;
end;
disp(Avg_Errors);
disp(Avg_BER);
disp(BER_T);
semilogy(EbNodB,Avg_BER,'o',EbNodB,BER_T,'-r')
axis([min(EbNodB),max(EbNodB),1e-4,1]);
xlabel('E_b/N_o-dB');ylabel('Bit Error Rate');grid
legend('System under Study','AWGN Ref',0)
e=cputime-t;
disp(e);
%End

------

6) “c102p_MCBPSKber_3.m”

EbNodB=0:8;                %vector of Eb/No(dB) values
t=cputime;
z=10.^(EbNodB/10);         %convert to linear scale
BER_T=qfunc(sqrt(2*z));    %theoretical BER vector
delay=54;                  %enter delay value (samples)
N=100;
FilterSwitch=1;
% Turn parallelism on or off.
PARALLEL = 1;
% Create Maps.
map1 = 1;
if (PARALLEL)
    % Initialize pMatlab.
    pMatlab_Init;
    Ncpus = pMATLAB.comm_size;
    my_rank = pMATLAB.my_rank;
    % Break up rows.
    map1 = map([Ncpus 1], {}, 0:Ncpus-1);
else
    Ncpus = 1;   % serial fallback so Ncpus is always defined
end;
%Create 'M' (number-of-symbols input matrix), 'Errors' (error-count output
%matrix), and 'BER' (BER simulation output matrix).
%Configure an 'Ncpus' by 'length(EbNodB)' 'z' matrix for generating M.
zp=ones(Ncpus,length(EbNodB));
for i=1:Ncpus
    zp(i,:)=z(:);
end;
M=round((1000./(qfunc(sqrt(2*zp)))).*ones(Ncpus,length(EbNodB),map1));
Tol_Errors=zeros(Ncpus,length(EbNodB),map1);  %initialize Errors dmat
Tol_BER=zeros(Ncpus,length(EbNodB),map1);     %initialize BER dmat
Avg_Errors=zeros(1,length(EbNodB));
Avg_BER=zeros(1,length(EbNodB));
% Get the local portion of the global indices
my_i_global = global_ind(M, 1);
% Get the local portion of the distributed matrix
my_M = local(M);
my_Tol_Errors=local(Tol_Errors);
my_Tol_BER=local(Tol_BER);
%Loop over the local indices
for k=1:length(EbNodB)
    my_M(k)=max(1000,my_M(k));
    for j=1:N/Ncpus
        [BER(j),Errors(j)]=c10_MCBPSKrun(my_M(k),z(k),delay,FilterSwitch);
        my_Tol_BER(k)=my_Tol_BER(k)+BER(j);
        my_Tol_Errors(k)=my_Tol_Errors(k)+Errors(j);
    end % of j
end % of k
%Store the local portions back into the dmats
Tol_Errors=put_local(Tol_Errors,my_Tol_Errors);
Tol_BER=put_local(Tol_BER,my_Tol_BER);
%Finally, aggregate all of the output onto the leader process
Tol_Errors_final=agg(Tol_Errors);
Tol_BER_final=agg(Tol_BER);
Errors_final=zeros(1,length(EbNodB));
BER_final=zeros(1,length(EbNodB));
for i=1:Ncpus
    Errors_final(1,:)=Errors_final(1,:)+Tol_Errors_final(i,:);
    BER_final(1,:)=BER_final(1,:)+Tol_BER_final(i,:);
end;
Errors_avg=Errors_final/N;
BER_avg=BER_final/N;
% Finalize the pMATLAB program
if (PARALLEL)
    disp('SUCCESS_Parallel');
    pMatlab_Finalize;
end;
% Finally, display the resulting matrices on the leader
disp(Errors_avg);
disp(BER_avg);
disp(BER_T);
semilogy(EbNodB,BER_avg,'o',EbNodB,BER_T,'-r')
axis([min(EbNodB),max(EbNodB),1e-4,1]);
xlabel('E_b/N_o-dB');ylabel('Bit Error Rate');grid
legend('System under Study','AWGN Ref',0)
e=cputime-t;
disp(e);
%End

------

7) Estimation results of running “c102p_MCBPSKber_3.m”

-> One CPU
Errors (1.0e+003 *):
1.2032 1.2858 1.4093 1.5506 1.7338 2.0306 2.4688 3.1632 4.3995
BER Estimation:
0.1003 0.0756 0.0542 0.0361 0.0219 0.0122 0.0059 0.0024 0.0008
BER Theoretical:
0.0786 0.0563 0.0375 0.0229 0.0125 0.0060 0.0024 0.0008 0.0002
Running Time (sec): 5.1216e+004

-> Two CPUs
Errors (1.0e+003 *):
1.2034 1.2935 1.4046 1.5465 1.7436 2.0231 2.4672 3.1646 4.4057
BER Estimation:
0.1003 0.0761 0.0540 0.0360 0.0221 0.0121 0.0059 0.0024 0.0008
BER Theoretical:
0.0786 0.0563 0.0375 0.0229 0.0125 0.0060 0.0024 0.0008 0.0002
Running Time (sec): 2.6557e+004

-> Four CPUs
Errors (1.0e+003 *):
1.1995 1.2878 1.4063 1.5471 1.7292 2.0336 2.4623 3.1665 4.3916
BER Estimation:
0.1000 0.0758 0.0541 0.0360 0.0219 0.0122 0.0059 0.0024 0.0008
BER Theoretical:
0.0786 0.0563 0.0375 0.0229 0.0125 0.0060 0.0024 0.0008 0.0002
Running Time (sec): 1.3317e+004

References

[1] William Tranter, K. Sam Shanmugan, et al., Principles of Communication Systems Simulation with Wireless Applications, Prentice Hall, New Jersey, 2004.

[2] Ron Choy and Alan Edelman, “Parallel MATLAB: Doing It Right,” Proceedings of the IEEE, Vol. 93, No. 2, Feb. 2005.

[3] Hahn Kim and Albert Reuther, “Writing Parallel Parameter Sweep Applications with pMatlab,” MIT Lincoln Laboratory.
