1. Project Description:

In 3D, N particles are arranged, initially, in a 10x10x10 cube, i.e., any triplet of random numbers in [0, 10] can be the coordinates of a particle. These particles interact under the so-called Lennard-Jones potential

Vij(rij) = 4ε [ (σ / rij)^12 - (σ / rij)^6 ]

where Vij is the pair-wise potential between particles i and j separated by distance rij. Write a parallel program to minimize the total (Lennard-Jones) potential energy of the particle system by adjusting the particle positions. Please do the following three parts:

(1) Compute the minimal energy for N=16 particles by any method of your choice using 1 node. Invest the best practical effort for this part of the calculation (e.g., run your code for a couple of hours on one node).

(2) Now, use P=4, 8, 16, 20 nodes to minimize the system energy, individually. Terminate the minimizations when you reach a minimal energy similar to the one obtained with P=1 (as in (1) above). Results within 5 percent relative error are considered similar.

(3) Plot the speedup curve.

2. Introduction:

Molecular Dynamics (MD) simulation of systems interacting through non-covalent interactions is commonly handled with the Lennard-Jones (LJ) potential model (J. E. Lennard-Jones, Proc. Roy. Soc. A106, 463, 1924), which has been used with mixed success to describe interactions between the overlapping electron clouds of closed-shell (i.e., uncharged) atoms. The simulations assume that the forces acting on the atoms are short-ranged. The potential is strongly repulsive when two uncharged particles come too close and mildly attractive beyond the minimum, with a long tail in the energy curve shown below, and thus represents weak van der Waals interactions well:

Many liquids, inert gases and some simple solids can be simulated using the LJ potential. As a first step of any MD simulation, we need to minimize the energy of the system by moving the atoms while obeying the boundary conditions.

3. Algorithm:

The N particles in space interacting through the Lennard-Jones potential can be treated as randomly distributed particles following the 6-12 potential law given by:

Vij(rij) = 4ε [ (σ / rij)^12 - (σ / rij)^6 ]

The values of ε and σ are assumed to be 1 for the current purpose.
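
As a point of reference for the parallel code listed later, a minimal serial sketch of this pairwise energy sum in reduced units (ε = σ = 1) could look like the following; the program name, the array names and the use of the rand() extension are illustrative assumptions rather than part of the project code:

c     Minimal serial sketch: total LJ energy of npart particles
c     in a 10x10x10 box, reduced units (epsilon = sigma = 1).
c     Names and the rand() extension are illustrative assumptions.
      program lj_energy_sketch
      implicit real*8(a-h,o-z)
      parameter (npart=16)
      dimension px(npart), py(npart), pz(npart)
c     random initial positions in [0,10]
      do i = 1, npart
         px(i) = rand(0)*10.0d0
         py(i) = rand(0)*10.0d0
         pz(i) = rand(0)*10.0d0
      end do
c     sum Vij = 4*(1/rij**12 - 1/rij**6) over all unique pairs
      etot = 0.0d0
      do i = 1, npart-1
         do j = i+1, npart
            r2 = (px(i)-px(j))**2 + (py(i)-py(j))**2
     #         + (pz(i)-pz(j))**2
            r6 = 1.0d0/(r2*r2*r2)
            etot = etot + 4.0d0*(r6*r6 - r6)
         end do
      end do
      write(6,*) 'Total LJ energy = ', etot
      end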

The division of the particles is done in the following way:

Each particle is represented by a column of an array (x or x2) containing the particle position (PX, PY, PZ), the displacement from the previous position (FX, FY, FZ) and an LJ energy accumulator (FE). Initially, the particles in the box are divided equally among the processors, with each processor holding two sets of particles: one is its own share, and the other is a copy of another processor's share. The algorithm interchanges the particle arrays between processors in a systematic way, ultimately achieving the all-pairs communication that any pair-potential model requires.
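
A distilled sketch of this exchange pattern is given below. It is illustrative only: the particle count, the dummy data and the program name are assumptions, and for simplicity it rotates the travelling copy x2 all the way around the ring, whereas the full program further below circulates it only part-way and exploits the symmetry of the pair interaction.

c     Sketch of the ring exchange of the travelling copy x2.
c     Illustrative only: nlocal and the dummy fill-in are assumptions.
      program ring_exchange_sketch
      implicit real*8(a-h,o-z)
      include 'mpif.h'
      parameter (nlocal=4)
      dimension x(0:6,0:nlocal-1), x2(0:6,0:nlocal-1)
      integer dest, src, k
      integer status(MPI_STATUS_SIZE)
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
c     fill the owned set x with dummy data and copy it into x2
      do i = 0, nlocal-1
         do m = 0, 6
            x(m,i)  = dble(myrank*nlocal + i)
            x2(m,i) = x(m,i)
         end do
      end do
c     pass x2 around the ring: after each hop the local set x can be
c     paired against the particles of a different process
      dest = mod(nprocs+myrank-1, nprocs)
      src  = mod(myrank+1, nprocs)
      do k = 1, nprocs-1
         call mpi_sendrecv_replace(x2, 7*nlocal,
     #        MPI_DOUBLE_PRECISION, dest, 200, src, 200,
     #        MPI_COMM_WORLD, status, ierr)
c        ... accumulate x(i)-x2(j) pair interactions here ...
      end do
      call mpi_finalize(ierr)
      end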

Energy Calculation: The energy calculation on a bigger set of atoms requires atom decomposition so that each processor works on a smaller set of atoms. Here, each processor computes the LJ interactions of its assigned subset of atoms against all atoms. The particle array carries a provision for ignoring LJ interactions beyond a chosen cutoff distance. The routine exchanges the particle sets between processors and accumulates the sum of all LJ interactions.

Minimization of Energy - Metropolis Criterion: The program first places the atoms at random in a 10x10x10 box, uniformly distributed in the x, y and z directions. It checks that the particle number given in the file startup.inpt is divisible by the number of processes, and then calculates the LJ potential. It then moves each particle by a random displacement and recalculates the energy to check whether the new coordinates give a lower energy. Acceptance or rejection of the new positions is determined by the so-called Metropolis criterion, a common way of accepting or rejecting trial moves in statistical mechanics. If we define ΔE as the difference in energy between two successive steps in an MC simulation (i.e. ΔE = E(n+1) - E(n)), we accept the move unconditionally when ΔE < 0 and otherwise accept it with probability exp(-ΔE/T). The implication is straightforward: downhill changes are always accepted, and uphill changes are sometimes accepted. If a move is rejected, the program puts the particles back into their last accepted configuration, i.e. the lowest-energy configuration found so far. This ensures that the final configuration corresponds to the particles at their lowest-energy positions in the Lennard-Jones potential field.
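
A compact sketch of just this accept/reject step is shown below; the program name, the fake trial move and the temperature value are assumptions used only to make the fragment self-contained:

c     Sketch of the Metropolis accept/reject test at temperature T.
c     Names, the trial-move stand-in and T = 0.1 are assumptions.
      program metropolis_sketch
      implicit real*8(a-h,o-z)
      parameter (T=0.1d0, beta=1.0d0/T)
      eold = -1.0d0
      do icycle = 1, 10
c        enew would come from recomputing the LJ energy after a
c        random trial displacement of the particles
         enew = eold + (rand(0) - 0.5d0)
         de = enew - eold
         if (de .lt. 0.0d0 .or. exp(-beta*de) .gt. rand(0)) then
c           accept: keep the new configuration and its energy
            eold = enew
         else
c           reject: the particles would be moved back here
            continue
         end if
         write(6,*) icycle, eold
      end do
      end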

The topological consideration: The code assumes a ring topology as a general way of reducing communication between nodes. Each process interacts with the particles of its nprocs/2-1 anti-clockwise neighbors. Further improvement would require knowing the node arrangement in Galaxy.

The program MCLJ_metropolis.f performs the minimization of the LJ potential using two particle arrays, x and x2, whose elements are currently assigned to positions (PX, PY, PZ), cutoff distance (FC) and energy (FE). These arrays can be reused for storing individual particle velocities or forces if the code needs to be augmented into a full-featured MD code.

PROGRAM:

c ****************************************************************

c Monte Carlo/Metropolis algorithm based LJ MD 3D Parallel code ***

c ****************************************************************

c ****************************************************************

c ********** Santanu Chaudhuri *** 04-01-2003 *******************

c ********** AMS 530 Project2.1 *********************************

c ********** MPI Implementation of Sandia Algorithm ***********

implicit real*8(a-h,o-z)

include 'mpif.h'

integer myrank, ierr, nproc, nprocs, nparticle, nlocal

integer hostinit

integer PX, PY, PZ, FX, FY, FZ, FE

parameter (PX=0, PY=1, PZ=2, FX=3, FY=4, FZ=5, FE=6)

parameter (hostinit = 0)

double precision timebegin, timeend, startclock, stopclock

integer status(MPI_STATUS_SIZE)

integer i, icycle, j, k, dest, src

c **** Temperature

parameter(T=0.1, beta=1./T)

c **** Max Number of particles

parameter(Npart=400)

c **** Max Number of Cycles

parameter(Ncycle=1000,Nprint = Ncycle/10)

parameter(dxmax = 0.1d0)

c **** Random number generator's seed

parameter(iseed=1234567)

c **** Maximum number of Processes

parameter(maxproc=30)

c **** Set up arrays

dimension x(0:6,0:Npart-1), xnew(0:6,0:Npart-1)

dimension x2(0:6,0:Npart-1), x2new(0:6,0:Npart-1)

dimension last_disp(0:(Npart-1)*3,0:maxproc)

c ************ Start of MPI part ***************************

real dx(0:Npart), dy(0:Npart), dz(0:Npart)

real Energy(0:Npart), r2ij(0:Npart), dist(0:Npart)

real engstore(0:Ncycle), TotalEng, SumEng

call mpi_init (ierr)

call mpi_comm_rank (MPI_COMM_WORLD, myrank, ierr)

call mpi_comm_size (MPI_COMM_WORLD, nprocs, ierr)

c ******** Requires even no. of procs; particles divisible by procs *****

if (myrank .eq. hostinit) then

if(mod(nprocs,2).eq.0)then

open (20,file='startup.inpt')

read (20,*) nparticle

read (20,*) r2cut

if (mod(nparticle,nprocs) .ne. 0) then

write(6,*)'Number of particle not divisible by procs'

nparticle = -1

end if

else

write(6,*) 'Number of processes must be even'

nparticle = -1

end if

end if

c *************** Close the processor check part *****************

c ****** The number of particles is broadcast to all processes

call mpi_bcast (nparticle, 1, MPI_INTEGER, hostinit,

# MPI_COMM_WORLD, ierr)

c

nlocal = nparticle/nprocs

c ************** debug block ****************

write(6,*) "particle|Procs|nlocal", nparticle, nprocs, nlocal

c **********************************************

startclock= MPI_Wtime()

c ********** Hostinit can now initialize Coordinate *******

c

if (myrank .eq. hostinit) then

do i=0,nlocal-1

c ************ Random initial positions scaled into [0,10] ***********

x(PX,i) = (rand(0))*10.0d0

x(PY,i) = (rand(0))*10.0d0

x(PZ,i) = (rand(0))*10.0d0

x(FX,i) = 0.0d0

x(FY,i) = 0.0d0

x(FZ,i) = 0.0d0

x(FE,i) = 0.0d0

c write(6,*) "Charge case 1", pos1(MM,i)

end do

do k=0,nprocs-1

if (k .ne. hostinit) then

do i=0,nlocal-1

x2(PX,i) = (rand(0))*10.0d0

x2(PY,i) = (rand(0))*10.0d0

x2(PZ,i) = (rand(0))*10.0d0

x2(FX,i) = 0.0d0

x2(FY,i) = 0.0d0

x2(FZ,i) = 0.0d0

x2(FE,i) = 0.0d0

c write(6,*) "Charge case 2", pos2(MM,i)

end do

c ****************** Sending x2 to other procs *********

c ****************** Size is 7*nlocal ********************

call mpi_send (x2, 7*nlocal, MPI_DOUBLE_PRECISION,

# k, 100, MPI_COMM_WORLD, ierr)

c **** set count *********************************

end if

end do

else

call mpi_recv (x, 7*nlocal, MPI_DOUBLE_PRECISION,

# hostinit, 100, MPI_COMM_WORLD, status, ierr)

end if

c *************** Debug block ********************

c write(6,*) "sent arrays", nlocal

c ***************

do i= 0,nlocal-1

x2(PX,i) = x(PX,i)

x2(PY,i) = x(PY,i)

x2(PZ,i) = x(PZ,i)

x2(FX,i) = 0.0d0

x2(FY,i) = 0.0d0

x2(FZ,i) = 0.0d0

x2(FE,i) = 0.0d0

end do

c *** Start MC cycle (assuming rand() returns a random real in [0,1))

c startclock= MPI_Wtime()

do icycle = 1, Ncycle

c ****************** Random displacement if icycle > 1 *************

c

c ************Move the particles positions **************************

c ***************** After first step *********************

c ***************** Check if the Energy minimized ********

if (icycle.gt.1) then

do i=0,nlocal-1

c ************ move X, Y, Z in random directions ***********

x(FX,i) = (rand(0) - 0.5d0) * dxmax

x(FY,i) = (rand(0) - 0.5d0) * dxmax

x(FZ,i) = (rand(0) - 0.5d0) * dxmax

x(PX,i) = x(PX,i)+ x(FX,i)

x(PY,i) = x(PY,i)+ x(FY,i)

x(PZ,i) = x(PZ,i)+ x(FZ,i)

x(FE,i) = 0.0

end do

do i=0,nlocal-1

x2(FX,i) = (rand(0) - 0.5d0) * dxmax

x2(FY,i) = (rand(0) - 0.5d0) * dxmax

x2(FZ,i) = (rand(0) - 0.5d0) * dxmax

x2(PX,i) = x2(PX,i)+ x2(FX,i)

x2(PY,i) = x2(PY,i)+ x2(FY,i)

x2(PZ,i) = x2(PZ,i)+ x2(FZ,i)

x2(FE,i) = 0.0

end do

c ************** Receive x values from hostinit *****************

c call mpi_scatter(x, 7*nlocal, MPI_REAL,x, 7*nlocal, MPI_REAL

c # hostinit,MPI_COMM_WORLD, ierr)

end if

c ************** Zero the per-cycle energy accumulator ********

do i=0,nlocal-1

Energy(i) = 0.0

end do

c ************** First Part of Eng *****************

do i=0,nlocal-1

do j=i+1,nlocal-1

dx(i) = x(PX,i) - x2(PX,j)

dy(i) = x(PY,i) - x2(PY,j)

dz(i) = x(PZ,i) - x2(PZ,j)

r2ij(i) = dx(i)**2+dy(i)**2+dz(i)**2

if (r2ij(i).gt.0.0d0) then

r2ij(i)=1.0d0/(r2ij(i)*r2ij(i)*r2ij(i))

Energy(i) = Energy(i) -4.* ( r2ij(i) - r2ij(i)*r2ij(i) )

x(FE,i)=Energy(i)

end if

end do

end do

c *********** Ring Topology ******************

c *********** Need to make it general ********

c *********** Unknown in Galaxy **************

dest = mod (nprocs+myrank-1, nprocs)

src = mod (myrank+1, nprocs)
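
c *** Rotate the travelling copy x2 around the ring for nprocs/2-1

c *** hops; each hop lets the local set x interact with another

c *** process's particles. The symmetric half of the pair list is

c *** covered by the extra hop and the split loops further below.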

do k=0,nprocs/2-2

call mpi_sendrecv_replace(x2, 7*nlocal, MPI_DOUBLE_PRECISION,

# dest, 200, src, 200, MPI_COMM_WORLD, status, ierr)

do i=0,nlocal-1

do j=0,nlocal-1

dx(i) = x(PX,i) - x2(PX,j)

dy(i) = x(PY,i) - x2(PY,j)

dz(i) = x(PZ,i) - x2(PZ,j)

r2ij(i) = dx(i)**2+dy(i)**2+dz(i)**2

dist(i) = sqrt(r2ij(i))

if (r2ij(i).gt.0.0d0) then

r2ij(i)=1.0d0/(r2ij(i)*r2ij(i)*r2ij(i))

Energy(i) = Energy(i) -4.* ( r2ij(i) - r2ij(i)*r2ij(i) )

x(FE,i)=Energy(i)

end if

end do

end do

end do

c

c Now x2 is rotated once more so it is diametrically opposite its home

c process. x(i) accumulates the energy of interactions with particles

c 0,..,i-1 of the opposing process; the complementary pairs

c i+1,...,nlocal-1 are counted by the opposing process itself, so

c each pair contributes exactly once to the global energy sum.

c

if (nprocs .gt. 1) then

call mpi_sendrecv_replace (x2, 7*nlocal, MPI_DOUBLE_PRECISION,

# dest, 300, src, 300, MPI_COMM_WORLD, status, ierr)

do i=nlocal-1,0,-1

do j=i-1,0,-1

dx(i) = x(PX,i) - x2(PX,j)

dy(i) = x(PY,i) - x2(PY,j)

dz(i) = x(PZ,i) - x2(PZ,j)

r2ij(i) = dx(i)**2+dy(i)**2+dz(i)**2

dist(i) = sqrt(r2ij(i))

if (r2ij(i).gt.0.0d0) then

r2ij(i)=1.0d0/(r2ij(i)*r2ij(i)*r2ij(i))

Energy(i) = Energy(i) -4.* ( r2ij(i) - r2ij(i)*r2ij(i) )

x(FE,i)=Energy(i)

end if

end do

end do

c

c In half the processes we include the interaction of each particle with

c the corresponding particle in the opposing process.

if (myrank .lt. nprocs/2) then

do i=0,nlocal-1

dx(i) = x(PX,i) - x2(PX,i)

dy(i) = x(PY,i) - x2(PY,i)

dz(i) = x(PZ,i) - x2(PZ,i)

r2ij(i) = dx(i)**2+dy(i)**2+dz(i)**2

dist(i) = sqrt(r2ij(i))

if (r2ij(i).gt.0.0d0) then

r2ij(i)=1.0d0/(r2ij(i)*r2ij(i)*r2ij(i))

Energy(i) = Energy(i) -4.* ( r2ij(i) - r2ij(i)*r2ij(i) )

x(FE,i)=Energy(i)

end if

end do

endif

c

c ********** x2 array is returned to its home process****

c

dest = mod (nprocs+myrank-nprocs/2, nprocs)

src = mod (myrank+nprocs/2, nprocs)

call mpi_sendrecv_replace (x2, 7*nlocal, MPI_DOUBLE_PRECISION,

# dest, 400, src, 400, MPI_COMM_WORLD, status, ierr)

end if

c

c The per-particle energies in x are summed to give the total LJ energy

c

TotalEng=0.0d0

SumEng=0.0d0

do i=0,nlocal-1

if (x(FE,i).le.0.0d0) then

TotalEng = TotalEng + x(FE,i)

end if

end do

c write(6,*) "Total Energy", TotalEng, "from", myrank

c ************** Put TotalEng from all Procs to SumEng ***********

c ****** MPI_Barrier to finish all jobs ************************

c

call mpi_barrier (MPI_COMM_WORLD, ierr)

call MPI_reduce(TotalEng,SumEng,1,MPI_REAL,

x MPI_SUM,0,MPI_COMM_WORLD,ierr)

c ************** Energy is summed and stored for the MC step ******

engstore(icycle)=SumEng

c **************** Stepwise Timing and MPI_barrier ************************

c timeend = mpi_wtime ()

c ************ Use Metropolis criterion ********************

w=0.0d0

if (icycle.gt.1) then

w = engstore(icycle) - engstore(icycle-1)

end if

if((w.lt.0.d0).or.(exp(-beta*w).gt.rand(0))) then

c **** accept the move; update the coordinates and the stored energy

OptEng = engstore(icycle)

if(myrank.eq.hostinit)then

write(6,*) OptEng,icycle

end if

c ****************** store the good coords *******************

Do i=0,nlocal-1

xnew(PX,i)=x(PX,i)

xnew(PY,i)=x(PY,i)

xnew(PZ,i)=x(PZ,i)

x2new(PX,i)=x2(PX,i)

x2new(PY,i)=x2(PY,i)

x2new(PZ,i)=x2(PZ,i)

end do

else

c if (myrank.eq.hostinit) then

c write(6,*)"Rejected Energy", engstore(icycle)

c end if

c *************** Retain the previous energy as the lowest so far ***********

engstore(icycle) = engstore(icycle -1)

c *************** Put the particles back to where they were *************

c ************* Get the x2's from processes **************************

c ************* change them back ************************************

do i=0,nlocal-1

x2(PX,i)=x2(PX,i)-x2(FX,i)

x2(PY,i)=x2(PY,i)-x2(FY,i)

x2(PZ,i)=x2(PZ,i)-x2(FZ,i)

end do

do i=0,nlocal-1

x(PX,i)=x(PX,i)- x(FX,i)

x(PY,i)=x(PY,i)- x(FY,i)

x(PZ,i)=x(PZ,i)- x(FZ,i)

end do

end if

c ********** End of MC block **********

end do

c ****** MPI_Barrier to finish all jobs ************************

c

call mpi_barrier (MPI_COMM_WORLD, ierr)

stopclock = MPI_Wtime()

write(6,*) "Total time for Minimization",(stopclock-startclock)

call mpi_finalize (ierr)

end

4. Results:

The results can be discussed in terms of the suitability of the algorithm described above for different numbers of processors and different system sizes. The test runs were carried out for 10,000 Monte Carlo steps, by which point the energy is close to its minimum. The code was tested for 400, 800 and 1600 particles using 4, 8, 16 and 20 processors in order to analyze the effectiveness of the algorithm with respect to system size and parallel efficiency.

Table 1: Runtime in seconds (shown here are the node 0 times, i.e. the longest)

Processors    N=400 (s)        N=800 (s)         N=1600 (s)
 1            307.6 (375)      1116.6 (1500)     4476.76 (6000)
 4             76.9             279.15           1119.19
 8             60.5             188.55            570.4
16             54.2             175.5             505
20             46.9             104.26            410

The time taken by one processor was estimated in two ways. The unbracketed P=1 values were obtained by scaling the 4-processor time (T(1) ≈ 4 × T(4); for example, 4 × 76.9 = 307.6 s for N=400). The bracketed values were instead extrapolated from the 3-second serial run of the 16-particle, 10,000-step case, assuming a proportional increase in cost (which is not strictly true for codes with nested loops). The energy minimization results were comparable for the different processor counts, but the lowest energies are not expected to be identical because of the limited quality of the pseudo-random generator used. Below is the curve showing the energy going downhill.

5. Performance Analysis:

Performance analysis is done in terms of speedup curves. Speedup is defined as S(P,N) = T(1,N)/T(P,N), where T(1,N) is the time taken by the serial code on one processor and T(P,N) is the time taken on P processors. The speedup curve for the above calculations is as follows:

The dotted line is for the ideal case where S(P,N) = P. The performance of the code is not close to ideal, but the ability to handle bigger system sizes is a big advantage. The code is load balanced, but the iterative procedure with multiple sends and receives carries a large communication cost. As a result, it is not well suited to small system sizes. The speedup curve shows that the algorithm gives the best speedup for the 1600-particle system. In fact, for system sizes below about 50 particles, increasing the number of processors increases the computation time.
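
As a numerical check against Table 1, for the largest system the measured times give, for example, S(20,1600) = T(1,1600)/T(20,1600) = 4476.76/410 ≈ 10.9, compared with an ideal value of 20, while S(4,1600) is close to 4 by construction, since T(1,1600) was extrapolated from the 4-processor run.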

6. Conclusion:

The code shows that MPI has improved the performance and made it possible to handle significantly larger system sizes. The speedup improves markedly for the 1600-particle case in a 10x10x10 box. Other approaches, such as using cutoffs or more sophisticated force-decomposition algorithms, could improve the performance further and make it possible to handle even bigger systems.