HPL Benchmarking of TDG Cluster

Leonard Hung Morris Law

Hong Kong Baptist University

ABSTRACT

We benchmarked the TDG (Teaching Development Grant) Cluster with the HPL (High Performance Linpack) software. Trials were run step by step, with further tuning applied after each successive run. The results will be submitted to TOP500.org.

1. Setting up the Cluster

The cluster was composed of one master node and 64 compute nodes, each a Dell PowerEdge 2650 (2 x 2.8 GHz Xeon CPUs with 2 GB RAM), interconnected over Gigabit Ethernet through an Extreme BlackDiamond 6816 switch. The nodes were installed with Rocks 2.3.1 [1] as the operating system for the tests.

2. Running HPL

We first tested the xhpl binary shipped with Rocks under MPICH [2], but its performance was unremarkable. We then recompiled the ATLAS library, gcc 3.2.2 and HPL [3] to build a second xhpl, with only a small improvement. A third xhpl was compiled against the BLAS library developed by K. Goto [4] instead of the open BLAS and ATLAS libraries. A few trials on small problems showed that this version gave an average performance gain of about 30% over the BLAS-based builds, so we used it for all subsequent tests.

Since the chief bottleneck of the computation was network traffic, a larger job that used more memory in each node gave better performance. However, too large a problem forced the system to swap data to hard disk, which pulled performance down severely. We therefore ran a series of large jobs to trace out the peak performance while taking care to avoid any swapping.
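As a rough guide for choosing such job sizes, the largest N that still avoids swapping can be bounded by the aggregate memory, since HPL distributes an N x N matrix of 8-byte double-precision elements across the nodes. The sketch below illustrates the estimate; the 80% usable-memory fraction is an assumption rather than a measured figure for our cluster.

# Rough upper bound on the HPL problem size N for this cluster:
# the N x N matrix of 8-byte doubles must fit in a fraction of the
# aggregate RAM (the 0.8 usable fraction is an assumption).
import math

def max_problem_size(nodes, ram_per_node_gib, usable_fraction=0.8):
    usable_bytes = usable_fraction * nodes * ram_per_node_gib * 2**30
    return int(math.sqrt(usable_bytes / 8))  # 8 bytes per matrix element

# 64 compute nodes with 2 GB RAM each
print(max_problem_size(64, 2))  # about 117000, consistent with the peak found near N = 120000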

After locating the peak performance around N = 120000, we started to fine-tune the system in order to get the best results.

3. Fine Tuning

The following sequence of fine-tuning procedures was used to get the best results:

i. turn off logical processors (Hyper-Threading) for stability; with logical processors enabled, we found that a user-mode process that overflowed memory could hang the kernel in Rocks;

ii. set NB to 104 in the HPL.dat input file; other NB values were also tried, but this one performed best in our experiments;

iii. turn off the ganglia and httpd services on the compute nodes to free up memory and processor power;

iv. set the MPICH environment variable P4_GLOBMEMSIZE to 24 MB for large problems;

v. use a special machine-file (instead of round-robin node ordering) for MPICH to exploit the shared memory within each SMP node; the interconnection between two processors in the same node differs from that between two processors in separate nodes [5]; three more specially crafted machine-files were also tested (a sketch of such a node ordering is given after this list).
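As an illustration of step v, the following sketch builds a "block" machine-file in which both MPI ranks of a dual-processor node appear consecutively, alongside the round-robin ordering it replaces. This is a sketch for illustration only; our machine-files were crafted by hand, and the compute-0-X host names follow the usual Rocks naming convention but are an assumption here.

# Sketch: two MPICH machine-files for 64 dual-CPU nodes --
# a "block" ordering (both ranks of a node adjacent, so neighbouring
# ranks can communicate through shared memory) and a round-robin
# ordering (consecutive ranks always sit on different nodes, so every
# message between them crosses the network).
NODES = [f"compute-0-{i}" for i in range(64)]
CPUS_PER_NODE = 2

def block_order(nodes, cpus):
    return [n for n in nodes for _ in range(cpus)]

def round_robin_order(nodes, cpus):
    return [n for _ in range(cpus) for n in nodes]

with open("machines.block", "w") as f:
    f.write("\n".join(block_order(NODES, CPUS_PER_NODE)) + "\n")

with open("machines.rr", "w") as f:
    f.write("\n".join(round_robin_order(NODES, CPUS_PER_NODE)) + "\n")

Which ordering keeps more of HPL's communication inside a node depends on how the MPI ranks are mapped onto the P x Q process grid, which is why several hand-crafted machine-files were compared.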

The following chart shows part of the improvement in the benchmarks after step-by-step tuning:

Fig. 1 HPL benchmarks vs. problem size N

The first four points (from the left) used a round-robin machine-file with system monitoring turned on (i.e. the normal situation when the cluster is in service); the next four points used a special machine-file with system monitoring turned off; the last point shows a drop in performance caused by swapping.

4. Experimental Results

One of the peak performances, Rmax = 385.7 Gflops, was obtained with:

Nmax = 123000, NB = 104, P = 8, Q = 16

(P x Q = 128 processes, one per Xeon processor across the 64 dual-CPU compute nodes.)

A further search over medium-size problems was needed for the TOP500.org result submission. The value N(1/2) = 33500 was obtained where R = Rmax / 2 = 192.9 Gflops (± 5%). This value was first estimated to be around 31000 by extrapolation from the following graph:

Fig. 2 HPL benchmarks vs. problem size N (in log scale)

Since the curve is fairly linear for small problems, a first guess of N(1/2) obtained this way reduces the time needed for the search.
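A minimal sketch of such a first guess follows, assuming a handful of (N, performance) measurements from small runs are already available; the data points used here are placeholders rather than our measured values.

# Sketch: estimate N(1/2), the problem size at which performance reaches
# half of Rmax, by interpolating linearly in log10(N).
# The (N, Gflops) pairs below are placeholders, not our measured data.
import math

measurements = [(5000, 60.0), (10000, 95.0), (20000, 140.0), (40000, 210.0)]
target = 385.7 / 2  # half of the observed Rmax

def estimate_n_half(points, target):
    xs = [math.log10(n) for n, _ in points]
    ys = [r for _, r in points]
    for i in range(len(points) - 1):
        x0, y0, x1, y1 = xs[i], ys[i], xs[i + 1], ys[i + 1]
        # interpolate if the target falls inside this segment,
        # otherwise extrapolate from the last segment
        if y0 <= target <= y1 or i == len(points) - 2:
            return int(round(10 ** (x0 + (target - y0) * (x1 - x0) / (y1 - y0))))

print(estimate_n_half(measurements, target))  # a starting point for the real search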

To summarize, we repeated the test in a single run with N = 10000, 33500, 50000, 123000 and 124000, producing the file HPL_final_01.dat. This time Rmax = 383.2 Gflops at N = 123000 (total system noise introduces a disturbance of around 0.6% to Rmax), and R = 193.2 Gflops at N(1/2) = 33500, which falls within 1% of Rmax / 2. The efficiency E (in %) is calculated by:

E = Rmax / ( processor clock speed * no. of CPUs * 2 flops per cycle for Xeon )

  = 383.2 / ( 2.8 * 128 * 2 )

  = 383.2 / 716.8

  = 53.5%
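A quick cross-check of this arithmetic (a minimal sketch, with the figures taken directly from the formula above):

# Cross-check of the efficiency figure: Rpeak = clock * CPUs * flops per cycle.
clock_ghz = 2.8        # Xeon clock speed in GHz
cpus = 128             # 64 compute nodes x 2 processors
flops_per_cycle = 2    # the "2 for Xeon" factor in the formula above

rpeak = clock_ghz * cpus * flops_per_cycle   # theoretical peak in Gflops
rmax = 383.2                                 # measured Gflops from HPL_final_01.dat
print(f"Rpeak = {rpeak:.1f} Gflops, E = {100 * rmax / rpeak:.1f}%")
# prints: Rpeak = 716.8 Gflops, E = 53.5%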

The following chart shows the final result, compared with the earlier ATLAS-based build:

Fig. 3 HPL benchmarks vs. problem size N (finalized)

Comparison of the final result using the Goto library (circles) with that using the re-compiled ATLAS (crosses).

5. Conclusions

The HPL benchmark result of the TDG Cluster is Rmax = 383.2 Gflops, which would place it at position 162 in the TOP500 list of November 2002 [6].

6. References

[1] Rocks cluster. http://www.rocksclusters.org/Rocks/

[2] MPICH. http://www-unix.mcs.anl.gov/mpi/mpich/

[3] HPL. http://www.netlib.org/benchmark/hpl/

[4] K. Goto BLAS library. http://www.cs.utexas.edu/users/kgoto/

[5] T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, R. Rooholamini. “An Empirical Study of Processes-to-Processors Mapping on Small-Scale SMP Clusters for Message-Passing Efficiency,” Dell Computer Corporation.

[6] TOP500.org. http://www.top500.org/list/2002/11/