Appendix A.II Benchmark Instructions

Table of Contents

1 Introduction

2 Microbenchmarks

2.1 Intel/Pallas MPI

2.1.1 Setup and installation

2.1.2 Running IMB

2.1.3 Offeror Response

2.2 HPL

2.2.1 Building HPL

2.2.2 Running HPL

2.2.3 Offeror Response

2.3 STREAM

2.3.1 Build and Install

2.3.2 Benchmark runs

2.3.3 Offeror Response

3 I/O benchmarks

3.1 IOR

3.1.1 Procedure

3.1.2 Verification

3.1.3 Offeror Response

3.2 Bonnie++

3.2.1 Procedure

3.2.2 Verification

3.2.3 Offeror Response

3.3 mdtest

3.3.1 Procedure

3.3.2 Verification

3.3.3 Offeror Response

4 Application benchmarks

4.1 Gaussian

4.1.1 License and Installation instructions

4.1.2 Setting up and Running Benchmarks

4.1.3 Validation

4.1.4 Offeror Response

4.2 WRF

4.2.1 WRF Benchmark Verification Criterion

4.2.2 Initial steps, prerequisite software, and libraries

4.2.3 Configuring and compiling the WRF benchmark code

4.2.4 Running the WRF benchmark

4.2.5 Interpreting WRF performance

4.2.6 What to return from WRF benchmark

4.3 Gromacs

4.3.1 Building Gromacs

4.3.2 Setup and Running the Gromacs Benchmarks

4.3.3 Validation of Results

4.3.4 Offeror Response

4.4 VASP

4.4.1 License

4.4.2 Installation

4.4.3 Benchmarks

4.5 OpenFoam

4.5.1 Prerequisites

4.5.2 Configuring and Compiling the OpenFOAM Benchmark

4.5.3 Running the OpenFOAM Benchmark

4.5.4 Verifying the OpenFOAM Benchmark

4.5.5 What to return for the OpenFOAM Benchmark

4.6 Parallel EnergyPlus

4.6.1 Installation

4.6.2 Running EnergyPlus

1 Introduction

General benchmark instructions are given in the benchmarking section of Appendix A.I: Technical Specifications of the ESIF-HPC-1 RFP. This appendix gives benchmark-specific instructions that supplement that general benchmarking section.

2 Microbenchmarks

2.1 Intel/Pallas MPI

2.1.1 Setup and installation

The Intel MPI Benchmarks (IMB) source tarball is included in the provided media. After extraction, the src/make_ict file should be edited for the correct compiler and MPI library. The compiler options are defined in the makefile.base file.
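As a minimal sketch, entries of the following kind are typically set in src/make_ict; the compiler wrapper and flag shown are placeholders, not recommendations, and the variable names should be confirmed against the ReadMe provided with the IMB source.

CC = mpicc

OPTFLAGS = -O3

CLINKER = $(CC)

The suite can then be built with a command such as make -f make_ict IMB-MPI1, again confirming the exact target names against the provided documentation.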

2.1.2 Running IMB

The IMB-MPI1 executable may be run for benchmarking purposes in the following way:

% mpirun -np <NP> ./IMB-MPI1 -msglen <lengths> PingPong SendRecv Barrier AlltoAll AllReduce

Here <lengths> names a file containing the message sizes, in bytes, to be tested.
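For instance, a lengths file covering the zero-byte and 0.5 MB cases required below could contain one message size per line; the file name and contents here are illustrative only:

% cat lengths

0

524288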

The same MPI library shall be used for all tests.

Results for the following configurations shall be reported:

· Across the full set of standard nodes

· Within a group of tightly connected standard nodes (termed a “scalable unit” in section 3.2.1 of the Technical Specifications), if applicable

· Across the full set of accelerated nodes

· Within a group of tightly connected accelerated nodes (“scalable unit”), if applicable

· Across the full HPC system.

Additionally, for the Ping-Pong tests, results from the following configurations shall be reported:

· Between groups of tightly connected standard nodes, if applicable

· Between groups of tightly connected accelerated nodes, if applicable.

Except for the Ping-Pong tests, all tests should be run in two modes (an illustrative launch for each mode follows this list):

· One MPI rank per node

· N MPI ranks per node, where N is the number of physical x86-64 cores per node.
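As an illustrative sketch of the two modes, where <nodes> is the number of nodes used and N is the number of physical cores per node; the per-node process-count flag (-ppn) varies by MPI launcher and is an assumption here:

% mpirun -np <nodes> -ppn 1 ./IMB-MPI1 -msglen lengths SendRecv Barrier AlltoAll AllReduce

% mpirun -np <nodes*N> -ppn <N> ./IMB-MPI1 -msglen lengths SendRecv Barrier AlltoAll AllReduce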

2.1.2.1 Point-to-Point Tests

1. Zero-byte latency tests

The Ping-Pong zero-byte latency tests shall be conducted:

(a) Between two nodes that are nearest neighbors

(b) Between two nodes that are the farthest apart topologically.

2. Fixed Message-Size Tests

Results shall be reported using a fixed message size of 524,288 bytes or 0.5 MB for the SendRecv and Ping-Pong tests.

2.1.2.2 Collectives

Results shall be reported using a fixed message size of 524,288 bytes or 0.5 MB for the AllReduce and AlltoAll tests.

Results shall be reported for zero-byte Barrier time.

2.1.3 Offeror Response

The Offeror shall report the timing measurements for the tests described above in the MPI tab of the Offeror’s benchmark spreadsheet.

Optionally, the Offeror may report additional results for a fixed message size different from 0.5 MB.

The compiler, compiler flags, MPI library, and environment variables used should be included in the benchmark report. If results for a fixed message size other than 0.5 MB are included, a discussion of the value demonstrated by these results should also be included in the benchmark report.

The job scripts, makefiles, and output logs from all the runs shall be returned to NREL with the Offeror’s proposal.

2.2 HPL

HPL is a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

2.2.1 Building HPL

The HPL source is included in the provided media. After extraction from the tar file, build HPL according to the instructions in the INSTALL file in the hpl-2.0 directory. Choose the values in HPL.dat according to the instructions in the TUNING file, as appropriate and optimal for the Offeror's system.
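For reference, the HPL.dat lines most relevant to this tuning take the following form; the values shown are placeholders, not recommendations for the proposed system:

1            # of problems sizes (N)

100000       Ns

1            # of NBs

192          NBs

1            # of process grids (P x Q)

16           Ps

32           Qs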

2.2.2 Running HPL

Offerors shall run and report the results for the following cases:

(a) A single standard computational node

(b) 1024 standard computational nodes

(c) The full set of standard computational nodes.

If a constellation of scalable units is proposed, Offerors shall additionally run and report the results for HPL on the full scalable unit.

Offerors shall run a version of HPL tuned for the proposed accelerator node and report results for the following cases:

(d) A single accelerated computational node

(e) The full set of accelerated computational nodes.

Finally, Offerors shall report the results for HPL for the full ESIF-HPC-1 HPC resource, including all standard and accelerated nodes.

2.2.3 Offeror Response

Report results in the HPL tab of the Benchmark Results spreadsheet.

For each result reported, the benchmark report shall include the compiler flags used; all runtime environment variables, including the value of OMP_NUM_THREADS; the number of threads per node used; and the values of P, Q, N, and NB.

For accelerated nodes, the benchmark report shall detail the compiler and programming environment used and explain all code modifications needed to port and optimize HPL to run on the accelerators.

The output files from each run shall be included with the benchmark results as part of the Offeror's proposal.

2.3 STREAM

2.3.1 Build and Install

The stream.c benchmark file is included in the provided media. For additional information, please see www.cs.virginia.edu/stream/ref.html.

2.3.2 Benchmark runs

The STREAM benchmark shall be run on a single standard node, a single accelerated node, and on a large memory DAV node. For each, the Offeror should compile this program with the highest optimization level.

The default array size is 2 million words, with a total memory usage of 46 MB (shown in the output from the run). In addition to running with the default array size, Offerors shall change the value of the array size N so that the total memory used is 60% of the node memory.
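As a worked sizing sketch (the 64 GB node memory used here is a hypothetical figure): STREAM allocates three double-precision arrays, so total memory use is roughly 24 × N bytes, giving a target of N ≈ 0.60 × (node memory in bytes) / 24. For example:

# Hypothetical sizing for a 64 GB node; adjust MEM_BYTES for the proposed node

MEM_BYTES=64000000000

echo $(( MEM_BYTES * 60 / 100 / 24 ))    # suggested N, roughly 1.6 billion elements

The resulting value is then used when setting the array size N in stream.c.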

For the standard computational node and for the large memory DAV node, the Offeror shall run both memory sizes using (a) one thread, (b) C/2 threads, and (c) C threads, where C is the number of physical cores on a node and where the number of threads is set using the OMP_NUM_THREADS environment variable.
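For example, the C-thread case might be built and launched as follows; the compiler invocation, the OpenMP flag, and the core count of 36 are assumptions:

% cc -O3 -fopenmp stream.c -o stream

% export OMP_NUM_THREADS=36

% ./stream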

For the accelerated computational node, it is highly desired that the Offeror include results using the accelerator(s). STREAM results should be reported without data transfer between the host and accelerator(s), so the bandwidth of the local device memory is measured. In this case, N should be chosen so that 60% of the device memory is used.

2.3.3 Offeror Response

For each node type, report the memory bandwidth value from the Triad line of the STREAM output in the Offeror’s Benchmark Results spreadsheet. The benchmark report shall include the compiler and compiler options used, all environment variable settings, and the values of N.

Any source code modifications needed to run on the proposed accelerator(s) shall be provided, along with the compilation and execution commands used. A discussion of these modifications and of the commands used to run on the accelerator(s) should be included in the benchmark report.

The output files from each run shall be included as part of the Offeror's proposal.

3 I/O benchmarks

3.1 IOR

IOR is used for testing the performance of parallel file systems using various I/O modes and access patterns. IOR uses MPI for process synchronization. IOR version 2.10.3 is included in the benchmark distribution.

3.1.1 Procedure

To build and execute the IOR benchmark, perform the following:

1. Unzip and untar the IOR-2.10.3.tgz file.

2. cd to the IOR directory and build with the MPIIO API. Consult the USER_GUIDE file for detailed build instructions.

The Offeror shall run IOR to demonstrate the I/O performance to /scratch for each of the parallel filesystem technologies described in response to the Technical Specifications. IOR shall be run from client systems that are external to those system(s) providing the filesystem server services. The filesystem configuration and disk controller settings used to run IOR shall be the same as those used for running mdtest and Bonnie++. If this is not the case, the Offeror shall fully describe any differences and the rationale for the change.

The Offeror shall run IOR reporting read and write performance for MPIIO filePerProc mode and MPIIO shared mode. For each test, the Offeror shall identify the number of clients needed to demonstrate the maximum bandwidth and demonstrate that performance as measured by IOR is reproducible.

To avoid caching effects, all IOR test results shall be produced with "reorderTasksConstant" on. The Offeror should adjust the blockSize and segmentCount options so that the test file size is at least twice the size of the on-node memory.
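As a worked check of this sizing (the task count and node memory below are hypothetical): the data written per node is (tasks per node) × blockSize × segmentCount, which should be at least twice the node memory.

# Hypothetical sizing check; adjust the assumed values for the proposed system

TASKS_PER_NODE=16

BLOCK_BYTES=$(( 1024 * 1024 * 1024 ))    # e.g., blockSize=1g

SEGMENTS=8

NODE_MEM_BYTES=$(( 64 * 1024 * 1024 * 1024 ))    # 64 GiB, assumed

echo $(( TASKS_PER_NODE * BLOCK_BYTES * SEGMENTS >= 2 * NODE_MEM_BYTES ))    # prints 1 if the size is sufficient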

1. The Offeror shall run IOR in MPIIO filePerProc mode as shown below, with the configuration file (configXXX) as shown. The results of this test should be reported in the “as-is” section of the Benchmark Results spreadsheet.

# cat configXXX

IOR START

api=MPIIO

platform=<insert identifier>

blockSize=16M

transferSize=8M

segmentCount=8

repetitions=10

verbose=1

interTestDelay=2

filePerProc=1

readFile=1

writeFile=1

reorderTasksConstant=1

RUN

IOR STOP

# ./IOR -f ./configXXX > ./MPI_XXX.out
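IOR is an MPI program, so when multiple client tasks are used the same command is launched under the Offeror's MPI launcher; the launcher name and flags below are illustrative only:

# mpirun -np <NP> ./IOR -f ./configXXX > ./MPI_XXX.out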

The Offeror may optimize IOR MPIIO filePerProc performance by changing configuration file (configXXX) parameters, reporting the “optimized” performance in the Benchmark Results spreadsheet, and returning the modified configXXX file.

2. The Offeror shall run IOR in MPIIO shared mode by resetting the filePerProc specification to zero:

filePerProc=0

then rerun IOR and report the shared-mode performance results. Again, the Offeror may optimize IOR MPIIO shared-mode performance by changing configuration file (configXXX) parameters, reporting the “optimized” performance in the Benchmark Results spreadsheet, and returning the modified configXXX file.

3.1.2 Verification

There are no validation criteria for the IOR benchmark.

3.1.3 Offeror Response

The Offeror’s IOR command lines, configuration files, stdout filename, and IOR Max, Min, Mean, and StdDev metrics shall be recorded in the IOR worksheet of the Offeror’s Benchmark_Results.xlsx spreadsheet for each of the above-specified tests. The Offeror shall return the following files for each run:

· The scripts and configuration files used

· Standard output

· Standard error.

3.2 Bonnie++

Bonnie++ is a free file system benchmarking tool for UNIX-like operating systems, developed by Russell Coker. It is a benchmark suite that performs a number of simple tests of hard drive and file system performance, benchmarking the following factors:

· Data read and write speed

· Number of seeks that can be performed per second

· Number of file metadata operations that can be performed per second.

3.2.1 Procedure

The Offeror shall run the Bonnie++ benchmark to demonstrate the single node user-accessible file system performance from each computing resource (i.e., the standard computational nodes, the accelerated computational nodes, and the DAV nodes) to each file system (i.e., /scratch for each parallel file system technology and the home file system).

To build and run the bonnie++ benchmark, perform the following:

1) Unzip and untar the bonnie++-1.03e.tgz file.

2) cd to the bonnie++-1.03e directory; configure, make, and install the benchmark. The --prefix option may be used to install in a non-default location, as sketched below.
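A minimal sketch of step 2, assuming installation under the user's home directory:

% cd bonnie++-1.03e

% ./configure --prefix=$HOME/bonnie++

% make

% make install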

The Offeror shall construct one or more script(s) for running Bonnie++ from a single node to one or more user directories using 1, C/2, and C cores via the Bonnie++ semaphore mechanism, where C is the number of cores on a node. Directories may be defined to span multiple hardware components as long as a regular user would also be able to perform the operation; e.g., on Lustre:

mkdir stripe; lfs setstripe -c 4 -o 0 stripe

could be used to set up a specific stripe across specific OSTs.

An example running four processes with a semaphore is as follows:

bonnie++ -p4

bonnie++ -m RedRock-4 -y -f -d /scratch/wjones/osts/0 >& out4.0 &

bonnie++ -m RedRock-4 -y -f -d /scratch/wjones/osts/2 >& out4.2 &

bonnie++ -m RedRock-4 -y -f -d /scratch/wjones/osts/6 >& out4.6 &

bonnie++ -m RedRock-4 -y -f -d /scratch/wjones/osts/8 >& out4.8 &

Example script and output files are provided in the same directory as the source tar file.

The sum of the performance values from the output files should be reported in the benchmark spreadsheet.

3.2.2 Verification

No formal verification is provided with the benchmark. However, the output files from which the reported performance is aggregated should be provided in the Offeror’s response.

3.2.3 Offeror Response

The metrics reported by the Bonnie++ benchmark should be recorded in the Bonnie worksheet tab of the Offeror’s benchmark spreadsheet.

3.3 mdtest

The mdtest benchmark is used for testing the metadata performance of a file system, measuring file and directory creation, stat, and deletion performance.

3.3.1 Procedure

The Offeror shall run mdtest to demonstrate the metadata performance of each parallel filesystem technology described in response to the Technical Specifications for 1, 2, 4, 8, 16, 32, and 64 nodes. The file system configuration and disk controller settings used to run mdtest shall be the same as those used for running IOR and Bonnie++. If this is not the case, the Offeror shall fully describe any differences and the rationale for the change.

To build and run the mdtest benchmark, perform the following:

1. Unzip and untar the mdtest-1.8.3.tgz file.

2. cd to the mdtest-1.8.3 directory and consult the README file for detailed build instructions.

The Offeror shall construct one or more script(s) for running mdtest from multiple client nodes simultaneously, utilizing the following parameters:

mdtest -I 16 -z 8 -b 2 -R4 -i 10
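mdtest is an MPI program and is launched across the client nodes under the Offeror's MPI launcher; an illustrative invocation for the 64-node case with one task per node (the launcher name and flags are assumptions) is:

% mpirun -np 64 ./mdtest -I 16 -z 8 -b 2 -R4 -i 10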

3.3.2 Verification

There are no validation criteria for the mdtest benchmark.

3.3.3 Offeror Response

The metrics reported by the mdtest benchmark shall be recorded in the mdtest worksheet tab of the Offeror’s benchmarks spreadsheet. The Offeror shall return the following files for each run: