Performance summary of the Cubed-Sphere Dynamical Core (c192, 63 vertical levels, 30 tracers) for the Held-Suarez test case.

August 31, 2014

This document describes the work to date on a performance study of the latest version of the cubed-sphere atmospheric dynamical core on CPU architectures. The study was performed on Gaea and Titan (both AMD Interlagos). All experiments used the c192 configuration with 63 vertical levels and 30 tracers, running the Held-Suarez test case, as this is a typical configuration used at GFDL in production. The study has been coordinated with Eric Dolven from Cray and Chuck Yount from Intel: Eric has been working with the Cray compiler, my task has been to work with the Intel compiler, and Chuck has been working on the VTUNE analysis, which we hope will lead to an improvement in the OpenMP performance of the code.[Srinath V1][Unknown A2]

The work to date with the Intel compiler has been more successful than with the Cray compiler. The Cray compiler had several difficulties running the code, particularly with OpenMP. The initial results from the Cray compiler were significantly better than those from Intel, but after some optimization work with the Intel compiler this difference has been reversed.

The general conclusions from the study are given below. The experiments were all performed on both Gaea and Titan. Gaea is our current production platform, and the comparison with Titan was important because Titan provides access to later software releases that are not yet available on Gaea. Comparing the two platforms at the core counts of interest, we see no significant differences in performance, with one exception that is discussed below. All times reported are in seconds for the main computational loop (FV_DYNAMICS) of a 5-day experiment.

The compiler releases used for the experiments were: Intel (13.1.3.292) and CCE (8.2.2).

Definitions of the code bases:

Original-code: The original code provided for the study.

Latest-code: The original code with an updated version of the MPI-communications library.

  1. Comparison between performance on the Titan and Gaea platforms

Titan with latest-code (Intel compiler):

Time   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES   Scale
 462          1               864         864     1.0
 302          1              1536        1536     1.5
 178          1              3456        3456     2.6
 462          1               864         864     1.0
 296          2               864        1728     1.6
 171          4               864        3456     2.7

Gaea with latest-code (Intel compiler):

Time   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES   Scale
 564          1               864         864     1.0
 374          1              1536        1536     1.5
 226          1              3456        3456     2.5
 564          1               864         864     1.0
 298          2               864        1728     1.9
 171          4               864        3456     3.3

Notice that there is a significant per-core performance difference between Titan and Gaea for 864 MPI-Ranks and 1 OpenMP thread. The conjecture is that this difference is the result of Titan having twice the memory bandwidth per compute unit compared with Gaea (Titan has 8 and Gaea has 16 compute units per compute node). Unfortunately, the hardware-derived metrics on Gaea with perftools are not working, so it has not been possible to confirm this proposition. We are also seeing significantly better OpenMP scaling, relative to MPI scaling, on Gaea than on Titan. The hypothesis is that, for a given number of cores, OpenMP puts less pressure on the shared memory resources than MPI.[Srinath V3][Unknown A4]

  2. Comparison between the performance with synchronous and asynchronous communications

The code used in this study was identical in the two cases, except that the communications scheme was changed from synchronous to asynchronous.
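
For reference, the distinction between the two schemes is sketched below in simplified Fortran/MPI form. This is illustrative only: the routine and buffer names are assumptions for the sketch, not the interface of the model's MPI-communications library.

  subroutine halo_exchange_sketch(sbuf_e, rbuf_w, n, east, west, comm)
    use mpi
    integer, intent(in)  :: n, east, west, comm
    real(8), intent(in)  :: sbuf_e(n)   ! data for the eastern neighbor
    real(8), intent(out) :: rbuf_w(n)   ! halo data from the western neighbor
    integer :: ierr, req(2)
    integer :: status(MPI_STATUS_SIZE), statuses(MPI_STATUS_SIZE,2)
    integer, parameter :: tag = 0

    ! Synchronous scheme: the exchange blocks until it completes.
    call MPI_SENDRECV(sbuf_e, n, MPI_REAL8, east, tag,              &
                      rbuf_w, n, MPI_REAL8, west, tag,              &
                      comm, status, ierr)

    ! Asynchronous scheme: post non-blocking operations, do work
    ! that does not need the halo, then wait before using it.
    call MPI_IRECV(rbuf_w, n, MPI_REAL8, west, tag, comm, req(1), ierr)
    call MPI_ISEND(sbuf_e, n, MPI_REAL8, east, tag, comm, req(2), ierr)
    ! ... compute on interior points here ...
    call MPI_WAITALL(2, req, statuses, ierr)
  end subroutine halo_exchange_sketch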

Gaea with latest-code and asynchronous communication (Intel compiler):

Time   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES   Scale
 564          1               864         864     1.0
 374          1              1536        1536     1.5
 226          1              3456        3456     2.5

Gaea with latest-code and synchronous communication (Intel compiler):

Time   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES   Scale
 577          1               864         864     1.0
 386          1              1536        1536     1.5
 228          1              3456        3456     2.5

The results show that the performance difference between the asynchronous and synchronous communications schemes is not significant. A detailed analysis shows that the most important factors determining the MPI scaling are the imbalance in the times for the compute-dense regions (D_SW, C_SW, GEOPK, ...) and the frequency of the MPI communication.

These imbalances can be seen in the compute-dense regions for the case OMP_NUM_THREADS=4 and MPI_Ranks=864:

Function Name   Minimum Time (s)   Maximum Time (s)
TRACER_2D            45.58              51.80
D_SW                 42.32              46.37
REMAPPING            19.79              21.22
GEOPK                13.47              13.72
C_SW                  8.65               9.72

The imbalances are a consequence of the different amounts of work on the tile edges and corners. To perform the communication, synchronization has to occur between the MPI-Ranks, and it is this imbalance time that gates the MPI scaling, not the time spent sending MPI messages.

  3. Performance study with the latest version of update domains

The latest version of the communications library provides the capability of using OpenMP for handling the buffers in the communications. As can be seen from the tables below, the time differences are negligible, but the changes are worth implementing given the direction of future architectures toward significantly larger core counts; a sketch of the buffering idea follows the tables.

Titan with latest-code (Intel compiler):

Time   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES   Scale
 462          1               864         864     1.0
 302          1              1536        1536     1.5
 178          1              3456        3456     2.6

Titan with original-code (Intel compiler):

Time   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES   Scale
 463          1               864         864     1.0
 306          1              1536        1536     1.5
 183          1              3456        3456     2.5
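
A minimal sketch of the buffering idea is given below, assuming a simple contiguous pack of an eastern halo strip; the routine and its arguments are invented for the illustration, and the actual update-domains buffering in the library is more involved.

  subroutine pack_east_halo(q, buf, is, ie, js, je, halo)
    integer, intent(in)  :: is, ie, js, je, halo
    real(8), intent(in)  :: q(is-halo:ie+halo, js-halo:je+halo)
    real(8), intent(out) :: buf(halo*(je-js+1))
    integer :: i, j, k
    ! Thread the pack loop over rows; each (i,j) maps to a unique
    ! buffer position, so no two threads write the same element.
!$OMP PARALLEL DO PRIVATE(i,j,k)
    do j = js, je
       do i = ie-halo+1, ie
          k = (j-js)*halo + (i-(ie-halo+1)) + 1
          buf(k) = q(i,j)
       enddo
    enddo
  end subroutine pack_east_halo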

  4. Improvements in overall code performance

The general profile for each of the major segments of the code is shown below:

Group Percent | Group Name
--------------+---------------------
       100.0% | TOTAL
        69.7% |   USER CODE
        20.6% |   MPI
        12.3% |     MPI_ALLREDUCE
         6.4% |     MPI_WAIT
         6.5% |   PTHREAD
         3.8% |     pthread_cond_wait
         2.0% |     pthread_join

MPI accounts for roughly 20% of the total time, and this MPI time is dominated by MPI_ALLREDUCE, a blocking global collective that should be avoided where possible.
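
One common mitigation, sketched below, is the non-blocking MPI-3 form of the collective, which lets the reduction overlap with computation that does not depend on its result. This is illustrative only and is not a change that has been made to the code.

  ! Sketch: overlap a global sum with independent work using the
  ! non-blocking MPI-3 collective MPI_IALLREDUCE.
  use mpi
  real(8) :: local_sum, global_sum
  integer :: req, ierr

  call MPI_IALLREDUCE(local_sum, global_sum, 1, MPI_REAL8, MPI_SUM, &
                      MPI_COMM_WORLD, req, ierr)
  ! ... computation that does not depend on global_sum ...
  call MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)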

4.1 Single-core performance

The approach has been to use the high-level optimization flags in the Intel compiler.

The original times were generated with the standard set of compiler flags typically used in our production runs:

-O3 -debug minimal -fp-model precise -override-limits -g [Srinath V5][Unknown A6]

The optimized set of high-level compiler flags is shown below; -prof-use enables profile-guided optimization, using profile data gathered from an earlier run instrumented with -prof-gen:

-O3 -debug minimal -override-limits -unroll-aggressive -ipo -opt-prefetch=3 -prof-use [Srinath V7][Unknown A8]

with the floating-point model flags:

-fp-model fast=1 -ftz -no-prec-div -no-prec-sqrt -fimf-precision=low

Analysis of compiler listings from the Cray compiler shows that most regions of the code have been vectorized, unrolled, blocked, and collapsed. Attempts to further guide the compiler to perform more of these operations with loop-level compiler directives have to be made with care: by using such a directive, the user guarantees that all the addresses involved are safe to evaluate and that execution will not cause traps. Adding compiler directives for these cases would improve performance; however, the degree of improvement depends on the data pattern and many other factors, even where the compiler listing indicates that a directive would help to some degree.
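
As a schematic illustration of the kind of guarantee involved (the loop and array names here are invented for the example, not taken from the model), the IVDEP directive asserts exactly this to the compiler:

  ! IVDEP asserts that the iterations carry no dependence, so the
  ! compiler may vectorize without a runtime check; the programmer
  ! takes responsibility for that guarantee.
!DIR$ IVDEP
  do i = ifirst, ilast
     a(i) = a(i+ioff) + b(i)   ! safe only if ioff causes no overlap
  enddo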

Some consideration has been given to improving the performance of the code by reducing the memory footprint of the working arrays within the model. However, a detailed analysis of the compute-intensive sections of the code shows that the profile is relatively flat, and the code sections that do consume a significant number of cycles, such as this loop in FYPPM:

do j=js-2,je+2
   do i=ifirst,ilast
      ! xt is a centered difference of q; dm is its monotonicity-limited value
      xt = 0.25*(q(i,j+1) - q(i,j-1))
      dm(i,j) = sign(min(abs(xt),                                   &
                         max(q(i,j-1), q(i,j), q(i,j+1)) - q(i,j),  &
                         q(i,j) - min(q(i,j-1), q(i,j), q(i,j+1))), &
                     xt)
   enddo
enddo

involve only a minimal number of floating-point operations, and their working-set size cannot be reduced.

The Cray compiler does show some regions of the code where it outperforms the Intel compiler. Determining exactly why this is the case would require an intensive hardware-counter analysis of both builds, and the potential gains are not currently worth the manpower needed for such a task.

4.2 MPI scaling performance

The MPICH_RANK_REORDER_METHOD environment variable, with its associated layout file, was used to optimize the multi-node rank layout.
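
For reference, the mechanism on the Cray systems takes the following form; the rank list shown is purely illustrative, not the layout actually used.

  export MPICH_RANK_REORDER_METHOD=3   # 3 = use a custom ordering
  # The custom ordering is read from a file named MPICH_RANK_ORDER in
  # the working directory: a comma-separated list of ranks in placement
  # order, e.g. grouping neighboring subdomains onto the same node:
  # 0,1,2,3,32,33,34,35,...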

4.3 OpenMP scaling performance

There is additional work that could be done to improve OpenMP performance; the VTUNE performance analysis tool identified several such areas, and those changes were implemented in the code.
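
The flavor of one such change is sketched below; the loops and routine names are schematic, not taken from the model. Profiles of this kind often point at fork/join overhead, which can be reduced by fusing adjacent parallel regions:

  ! Before: two parallel regions, hence two fork/join pairs.
  !$OMP PARALLEL DO
  do k = 1, npz
     call step_a(k)
  enddo
  !$OMP PARALLEL DO
  do k = 1, npz
     call step_b(k)
  enddo

  ! After: one parallel region containing two work-shared loops.
  !$OMP PARALLEL
  !$OMP DO
  do k = 1, npz
     call step_a(k)
  enddo
  !$OMP DO
  do k = 1, npz
     call step_b(k)
  enddo
  !$OMP END PARALLEL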

With all of the above changes, the resulting performance is as follows.

Titan with latest-code (Intel compiler):

Time (Optimized)   Time (Non-optimized)   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES
      364                  462                   1              864         864
      237                  302                   2              864        1728
      137                  178                   4              864        3456

The results for Intel show an approximate 27-30% speedup (e.g., 462/364 = 1.27).

Titan with latest-code (Cray compiler):

Time   OMP_NUM_THREADS   MPI_Ranks   NUM_CORES   Scale
 423          1               864         864     1.0
 244          4               864        3456     1.7

[Srinath V1]Did VTUNE help with OpenMP scaling? Was Inspector used to find possible errors with threading?

[Unknown A2]Reply to Srinath Vadlamani (09/29/2014, 21:36): "..."

VTUNE did help with scaling, but it required a collaboration with Intel to get any meaningful results. Inspector has been very helpful. I should have added that the VTUNE data was gathered on Stampede and not Gaea.

[Srinath V3]Is this because of the suspected higher bandwidth? What hardware metrics would be used to confirm bandwidth usage? I see that VTUNE has a "bandwidth" collect category I hope to explore.

[Unknown A4]Reply to Srinath Vadlamani (09/29/2014, 21:56): "..."

A working version of VTUNE is not available on Gaea. In fact, none of the performance tools worked on Gaea; besides, the hardware metrics on the AMD Interlagos are almost nonexistent, so I'm not sure the hardware data is available to draw any conclusions.

[Srinath V5]I can't find this flag in Intel 15.0.0 ifort on Babbage.

[Unknown A6]Reply to Srinath Vadlamani (09/29/2014, 22:27): "..."

These are just the standard flags we use in the GFDL production model runs on Gaea with ifort v13...

[Srinath V7]Did this just help with profiling and not performance?

[Unknown A8]Reply to Srinath Vadlamani (09/29/2014, 22:28): "..."

The -prof-use turns on PGO; look at the Intel documentation as it explains how to use it.