http://software.intel.com/en-us/forums/topic/495421

I have continued to investigate why I don't appear to be getting the expected performance improvements from either AVX instructions or OpenMP.

Considering Jim's and Tim's comments about a memory access bottleneck, I have investigated the impact of memory footprint on the vector and OpenMP performance improvements, with some success.

I have taken a recent example I have posted of OpenMP usage and run it for:

·  Different vector instruction compilation options, and

·  Varying memory usage footprints.

I have also carried out these tests on two processors that I have available:

·  Intel® Xeon® W3520 @ 2.67 GHz with 8 MB cache, 12.0 GB memory and a 120 GB SSD

·  Intel® Core™ i5-2540M @ 2.60 GHz with 3 MB cache, 8.0 GB memory and a 128 GB SSD

For those who know processors, both of these are cheap and have relatively poor performance for their processor class, so I am investigating the performance improvements that can be achieved on low-spec Intel processors.

Apart from processor class (for available instruction set) and processor clock rate, the other important influences on performance are:

·  Processor cache size (8 MB and 3 MB)

·  Memory access rate (1066 MHz and 1333 MHz)

I presume the cache size is fixed by the processor chip, while the memory access rate is chosen by the PC manufacturer?

Unfortunately, I am not lucky enough to have a wide range of these options available to test, but perhaps others can.

Compiler Options

I have investigated compiler options for vector instructions and for OpenMP calculation; six combinations have been used. For vector instructions I have used:

/O2 /QxHost (use the best vector instructions available: AVX on the i5 and SSE4.2 on the Xeon)

/O2 /QxSSE2 (limit vector instructions to SSE2)

/O2 /Qvec- (no vector instructions)

These have been combined with /Qopenmp to identify the combined performance improvement that could be possible; the six build combinations are shown below.
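For reference, the six builds were of the following form (the source file name matmul_test.f90 is just a placeholder for the actual test program):

   ifort /O2 /QxHost             matmul_test.f90
   ifort /O2 /QxSSE2             matmul_test.f90
   ifort /O2 /Qvec-              matmul_test.f90
   ifort /O2 /QxHost  /Qopenmp   matmul_test.f90
   ifort /O2 /QxSSE2  /Qopenmp   matmul_test.f90
   ifort /O2 /Qvec-   /Qopenmp   matmul_test.f90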

Memory Options

A range of memory footprints from 0.24 MB up to 10 GB has been tested, although the performance levels out once the memory footprint significantly exceeds the cache capacity.
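As a rough cross-check on those footprint figures (assuming the footprint counted is the three n x n real(8) matrices A, B and C):

   footprint ≈ 3 * n * n * 8 bytes
   n = 100     ->  3 * 100 * 100 * 8       = 240,000 bytes  ≈ 0.24 MB
   n = 20,000  ->  3 * 20,000 * 20,000 * 8 ≈ 9.6 GB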

For subsequent tests, the array dimension N is increased by 25% for each successive test:

   x = x * 1.25
   n(i) = nint (x/4.) * 4                        ! round n to a multiple of 4 so real(8) columns keep 32-byte alignment
   call matrix_multiply ( n(i), times(:,i), cycles(i) )

The sample program I am using calculates a matrix multiplication, where (real(8)) [C] = [C] + [A'].[B].

The advantage of this computation is that OpenMP can be applied at the outer loop, providing maximum efficiency for potential multi-processing. When run, it always shows about 99% CPU usage in Task Manager.

For small matrix sizes, the matrix multiply computation is cycled (repeated), with the OpenMP loop inside the cycle loop. This appears to be working, with a target elapsed time of at least 5 seconds (about 10 billion operations) being achieved.
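For anyone wanting to reproduce the test, here is a minimal sketch of what the matrix_multiply routine described above looks like. The argument list matches the driver snippet, but the internal names, the random initialisation and the handling of the times argument are my assumptions, not the exact code:

   subroutine matrix_multiply ( n, times, cycles )
      integer, intent(in)  :: n                      ! matrix dimension (a multiple of 4)
      integer, intent(in)  :: cycles                 ! repeat count so small n still runs for ~5 seconds
      real(8), intent(out) :: times(:)               ! timing results; how these are filled is assumed
      real(8), allocatable :: A(:,:), B(:,:), C(:,:)
      integer :: i, j, ic

      allocate ( A(n,n), B(n,n), C(n,n) )
      call random_number (A)
      call random_number (B)
      C = 0
      times = 0                                      ! timing detail omitted here (see the timing sketch under Actual Results)

      do ic = 1,cycles                               ! cycle loop outside the OpenMP loop
!$OMP PARALLEL DO PRIVATE (i,j), SHARED (A,B,C,n)
         do j = 1,n
            do i = 1,n
               C(i,j) = C(i,j) + dot_product ( A(:,i), B(:,j) )   ! [C] = [C] + [A'].[B]
            end do
         end do
!$OMP END PARALLEL DO
      end do
   end subroutine matrix_multiply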

Idealised Results

From an idealised estimate of performance improvement:

SSE2 should provide a 2 x improvement over /Qvec- (two real(8) values per 128-bit register).

AVX should provide a 4 x improvement (assuming 256-bit registers; the matrix size has been chosen as a multiple of 4 so that the dot_product calls operate on 32-byte-aligned data).

OpenMP should provide a 4 x improvement for 4 CPUs.

This potentially implies up to 16 x for /QxAVX /Qopenmp, although this never occurs!
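Spelling that arithmetic out for real(8) data:

   ideal vector gain with SSE2   = 128 bits / 64 bits per real(8) = 2
   ideal vector gain with AVX    = 256 bits / 64 bits per real(8) = 4
   ideal OpenMP gain             = 4 threads                      = 4
   combined ideal (AVX + OpenMP) = 4 x 4                          = 16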

Actual Results

Results have been assessed based on run time (measured with QueryPerformanceCounter).

Performance has also been calculated as Mflops (millions of floating point operations per second), where I have counted a single floating point operation as one evaluation of "s = s + A(k,i)*B(k,j)". This could arguably be counted as two operations (a multiply and an add), since there is now little difference in cost between multiplication and addition.

Performance has also been reported as a ratio relative to /O2 /Qvec- for the same memory size. This gives a relative performance factor for the vector or OpenMP improvement.
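As an illustration of how the run time and Mflops figures are produced, here is a sketch using the standard SYSTEM_CLOCK intrinsic as a stand-in for the QueryPerformanceCounter call actually used (the variable names are mine):

   integer(8) :: tick0, tick1, tick_rate
   real(8)    :: seconds, ops, mflops

   call system_clock ( tick0, tick_rate )
   call matrix_multiply ( n(i), times(:,i), cycles(i) )
   call system_clock ( tick1 )

   seconds = real ( tick1 - tick0, 8 ) / real ( tick_rate, 8 )
   ops     = real ( n(i), 8 )**3 * real ( cycles(i), 8 )     ! one operation per "s = s + A(k,i)*B(k,j)"
   mflops  = ops / seconds / 1.0d6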

The results show that some performance improvement is being achieved by vector or OpenMP computation, but nowhere near as much as the ideal case.

While OpenMP always shows a 4 x increase in CPU usage, the run-time improvement is typically much less. This is best assessed by comparing the OpenMP run-time performance to the single-CPU run-time performance.
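In other words, the figure to watch is the parallel efficiency rather than the CPU usage:

   speedup    = (single-CPU run time) / (OpenMP run time on 4 threads)
   efficiency = speedup / 4

Efficiency is only 100% if the 4 x CPU usage converts fully into a 4 x reduction in run time.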

The biggest single influence on the achieved performance improvement is the memory footprint. For the hardware I am using, there is little (in fact, no) improvement from AVX instructions once the calculation no longer fits in cache. My previous testing with large memory footprints had appeared to show that I was not getting the AVX instructions to work at all.

I would have to ask: does AVX help for non-cached computation? I have not seen it do so on my hardware. Also, if AVX instructions only pay off when the data is in cache, what is all the talk about alignment? I do not understand the relationship between memory alignment and cache alignment of vectors.

Any AVX benefit for non-cached calculations can be masked by the memory access speed. I need to repeat the tests on hardware with faster memory.
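To put a very rough number on that masking (order-of-magnitude figures only, assuming the dot_product kernel has to stream A from main memory once the matrices no longer fit in cache):

   bytes fetched per multiply-add (out of cache)        ~ 8 to 16
   bandwidth needed at 2.6 GHz, one multiply-add/clock  ~ 2.6e9 * 8 bytes ≈ 21 GB/s (up to ~42 GB/s)
   practical DDR3-1066/1333 bandwidth                   ~ 10 to 25 GB/s

So even unvectorised code can come close to saturating the memory bus, which would explain why the wider AVX registers show no benefit once the data is not cached.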

Reporting

I am preparing some tables and charts of the performance improvements I have identified.

The vertical axis is Mflops or performance ratios.

The horizontal axis is the memory footprint, plotted on a log scale; this clearly shows the impact of the cache and the lack of AVX benefit for the large-memory runs.

Summary

·  The memory access bottleneck is apparent, but I don’t know how to overcome it.

·  From the i5 results, the expected AVX performance gain does not appear to be realised.

·  The influence of cache size and memory access rates can be seen in the Mflop charts below.

These tests probably show that notionally better processors are only worthwhile if the main bottleneck on performance is one that those processors actually address. In the cases I have tested, I appear to get minimal benefit from AVX because of the memory access rate limit.

I would welcome any comments on these results, and hope others can run the tests on alternative hardware configurations, noting the main hardware features identified above, including cache size and memory access speed.

i5 Mflop Results

Xeon Mflop Results

i5 Performance Improvement

Xeon Performance Improvement

Summary of test differences:

Characteristic / Xeon(R) CPU W3520 / Core(TM) i5-2540M
CPU clock rate (GHz) / 2.67 / 2.60
Memory access rate (MHz) / 1066 / 1333
Memory installed (GB) / 12.0 / 8.0
Cache size / 8 MB / 3 MB L3
Intel compiler / Version 12.1.5.344 Build 20120612 / Version 13.0.1.119 Build 20121008
CPU driver / 6.1.7600.16385 / 6.1.7600.16385
Driver date / 21/06/2006 / 21/06/2006

Comparison between Xeon and Core i5

This could be a very telling chart:

It shows the difference between in-cache performance and the reduced performance once the cache is too small.

For OpenMP, the Xeon and the i5 have very similar performance, with the difference in cache capacity roughly visible.

For the single-process runs, the difference between the Xeon and the i5 is more noticeable. This could be due either to the use of AVX instructions or to the different memory access rates; I am not sure which.

The question I have is: is the single-process difference due to the use of AVX instructions on the i5, or due to the memory access speed difference?

Based on other tests, there is little to suggest that AVX is contributing on this Core i5.

Further Testing

I have tested an option to provide a temporary vector for each thread, with the aim of improving cache use for the key data. The code is:

   real*8 :: B_j(n)

   ....

!==== Matrix multiplication, using a per-thread copy of B(:,j) to minimise memory access

!z    allocate ( B_j(n) )          ! allocate failed, but an automatic array ran

   do ic = 1,cycles
!$OMP PARALLEL DO PRIVATE (i,j,B_j), SHARED (a,b,c,n)
      do j = 1,n
         B_j = b(:,j)              ! temporary copy of column j of B for each thread, to assist the cache
         do i = 1,n
            C(i,j) = C(i,j) + dot_product ( A(:,i), B_j )
         end do
      end do
!$OMP END PARALLEL DO
   end do

There was no significant improvement in performance on either the Xeon or the i5 using this approach for the OpenMP tests!