Sample Write-up on Use of Intel(R) Parallel Studio to Optimize the Application for the Intel® Parallel Studio Parallelism Contest & Lab at the Intel® Developer Forum San Francisco 2009

About This Document

Introduction

The problem......

Serial Program Characterization

Intel(R) Parallel Amplifier: Hotspot Analysis

Introducing Threading

Choosing Parallelization Model

Expressing Parallelism

Intel(R) Parallel Inspector: Threading Errors

Intel(R) Parallel Amplifier: Concurrency Analysis and Locks and Waits Analysis

Performance Results

Benchmark Results

Core Utilization

Conclusion

About This Document

This document can be used as a sample write-up by participants of the “Intel® Parallel Studio Parallelism Contest”. It contains general recommendations on how to use Intel Parallel Studio to optimize applications, along with examples of its usage on one of the Intel Parallel Studio product samples, N-queens. We recommend that all contestants use a similar document structure. Your submittal should focus on the steps you took to parallelize the bodytrack application using Intel® Parallel Studio, giving details as demonstrated in this write-up for N-queens.

Introduction

We will use the N-queens sample application to demonstrate usage of Intel(R) Parallel Studio. The N-queens source code and project files can be found in the <Parallel Studio install directory>\Samples directory. The complete sample code guide can be downloaded from the following URL:

The problem

The goal of the N-Queens problem is to find the total number of distinct ways to place a given number of chess queens on a square chess board with a specific number of squares on each side (N), such that no two queens are able to attack each other.

Two queens can attack each other when they are placed on the same horizontal row, on the same vertical column, or on one of the (2*(2*N - 1)) possible diagonals.

Serial Program Characterization

It is always a good idea to optimize serial application performance before introducing threads. But the very first step is setting a baseline: the time that the original version of the application takes to run before any changes are applied.

The next step is finding hotspots in your application. A hotspot is a function that consumes a significant amount of time during the application run. It is important to apply optimizations to hotspot functions first, because improving their performance will likely improve the overall performance of the entire application.

Note: It is recommended that you build your application in release mode for measuring a baseline performance.

Intel(R) Parallel Amplifier: Hotspot Analysis

Intel(R) Parallel Amplifier, one of the components of Intel(R) Parallel Studio, finds hotspots in your application. The serial version of the N-queens application has only one hotspot, the recursive function setQueen, as determined by Intel Parallel Amplifier:

This means that we should focus on optimizing this function to improve the overall performance of the application. You can double-click on any function name, and Intel Parallel Amplifier will take you to the source view, where you will see CPU time broken down by source line:

This capability might be very helpful during serial optimization. Moreover, whenever you decide to change the code, just double-click on the source line and you will be transferred to the Microsoft Visual Studio* editor: change the function code, re-build the application, run Intel Parallel Amplifier again, and compare the performance profiles.

Note: Optimization is an iterative process. Once you have improved the performance of the first hotspot function, the performance profile will change and Intel Parallel Amplifier will show a new list of hotspot functions. In this particular example, we didn’t do anything special in terms of serial optimization, but it is usually a good idea to first try re-structuring your code if you see an opportunity to improve performance that way, and to try using the Intel(R) C++ Compiler and its optimization options.

Introducing Threading

The sample code guide for the N-queens example (URL provided in the Introduction chapter) describes usage of all supported threading techniques. In this write-up, we will focus on the version which uses OpenMP*.

Choosing Parallelization Model

Intel(R) Parallel Studio supports multiple methods of expressing parallel solutions in C and C++:

- Parallel extensions supported by Intel(R) Parallel Composer

- OpenMP*

- Intel(R) Threading Building Blocks

Choosing the right parallelization technique is very important, and you should consider many factors while making the decision. Some techniques require compiler support but are easier to use in simple cases; others are not portable but allow for maximum control. There are solutions that might better fit your programming style and development environment. We recommend reading the following article, which compares some techniques and highlights the factors that should be taken into account:

In your own write-up for the bodytrack application, you might want to explain how you chose the threading model used to optimize the app.

Note: The Intel(R) IPP library contains highly optimized and threaded functions which your program could call to possibly achieve higher performance.

Expressing Parallelism

We used Intel(R) Parallel Amplifier to determine the hotspot function, setQueen, and that function was our main target for serial optimizations. However, when you look for the best place to apply parallelism, the hotspot function is not always the best candidate: its running time may be the longest as an aggregated measure, but in most cases that is because it is called many times, on each iteration of a loop in some other function at a higher level of the call tree.

Often, threading at a higher level provides a better performance improvement, because parallelization of the hotspot function may be too fine-grained and may lead to high parallel overheads. Intel(R) Parallel Amplifier shows the call tree for every hotspot function, which you could use to find the right level for parallelization. In the case of N-Queens, there is a function called ‘solver’ that calls setQueen in a for loop. ‘solver’ is a good candidate for parallelization and, just as an example, we applied the OpenMP* pragma parallel for to make the loop execute in parallel (line #3):

OpenMP* requires a compiler that supports it, and Intel(R) Parallel Composer does. First, you should configure your Microsoft Visual Studio* project to use the Intel(R) C++ Compiler by right-clicking on the project in Solution Explorer and choosing the option “Use Intel C++...”:

To enable OpenMP*, you should change your project properties as shown on the screenshot below:

Note: You could use “Select Build Components” to have Intel(R) Parallel Composer configure your project to use the available threading libraries, Intel(R) Threading Building Blocks and Intel(R) IPP:

OpenMP* is not the only way of expressing parallelism. Intel(R) Threading Building Blocks, for example, provides easy-to-use classes and template functions that represent generic implementations of common parallel programming patterns. In the case of the N-queens application, the template function tbb::parallel_for could be used to parallelize the function ‘solver’. The class tbb::task could be used to implement recursive creation of parallel tasks to parallelize the recursive function.

Intel(R) Parallel Inspector: Threading Errors

Parallel programs might contain data races, deadlocks, and other threading errors that are usually hard to detect and debug. Intel(R) Parallel Inspector can be used to find those problems. After adding the OpenMP* pragma parallel for, we ran Intel(R) Parallel Inspector, and it found a data race in the application. The following screenshot shows the list of errors and remarks that Intel Parallel Inspector found:

For any item in the list, it is possible to get more details by clicking the “Interpret Results” button. The screenshot below shows a code snippet for one of the Observations in the selected Problem Set:

It turns out that the global variable nrOfSolutions is accessed and modified by multiple threads (in the setQueen function), and no synchronization is used to protect this variable from nondeterministic modifications. This is a data race, which we could fix in multiple ways.

In an attempt to fix the data race, we used an OpenMP* critical section and modified the code in the following way (line #5):

When we re-ran Intel(R) Parallel Inspector, it didn’t detect any more data races or other threading errors.

Note: Intel(R) Parallel Inspector can also help find memory errors such as memory leaks and incorrect memory usage.

A parallel program that uses locks may start performing slower, so we need to do a performance analysis of this parallel application to determine whether there are any performance issues caused by locking and to see if our parallelization strategy can be improved.

Note: Intel(R) Parallel Amplifier and Parallel Inspector support analysis only of OpenMP* applications compiled with the Intel(R) C++ Compiler. If you decide to use OpenMP* to optimize the bodytrack application, please convert your project to use the Intel(R) C++ Compiler (instructions can be found in the chapter “Expressing Parallelism”).

Intel(R) Parallel Amplifier: Concurrency Analysis and Locks and Waits Analysis

We ran Intel(R) Parallel Amplifier Concurrency analysis, and while it showed a good level of core utilization (green bars on the screenshots below), the average concurrency level was 1.94 on the dual-core test system (the number in the red circle on the Summary pane); the ideal concurrency level for a dual-core system would be 2:

Often, poor concurrency (red bars) is caused by inefficient use of synchronization objects or by workload imbalance. To find performance issues caused by locks, we ran Intel(R) Parallel Amplifier Locks and Waits analysis and noted that the OpenMP* critical section we used to protect access to nrOfSolutions caused threads to stay in an idle state (gray bar) and serialized the program (red bar):

In fact, using a critical section or another type of synchronization object just to increment a counter is not the most efficient approach. We decided to use OpenMP* atomic instead and, after re-running the modified program with Intel(R) Parallel Amplifier, got a nearly ideal concurrency level and better performance:

The changed source code is as follows (line #5):

Note: The Intel(R) Threading Building Blocks library provides the template class tbb::atomic, which implements atomic operations in C++ style and can be used with integral types, enumerations, and pointer types.

Performance Results

To collect the final performance statistics, build your application in release mode and make sure to use all the compiler optimization options that make the compiler generate the most optimal code. Measure performance by running your application with 1, 2, ... N threads (where N is the number of cores on your system) to study how well your solution scales with a growing number of cores.

Benchmark Results

This section is a placeholder for the performance charts which you should collect for your own write-up. Use the original application’s execution time as a baseline. Adding a scalability profile here will be an added advantage.

Note: For precise timing, do not run your benchmarks under Intel(R) Parallel Amplifier, because it introduces some analysis overhead. The best way to determine the correct execution time is to run the app from the command line multiple times and use the average time as the final result.

Core Utilization

Use Intel(R) Parallel Amplifier Concurrency Analysis to determine average concurrency level for your optimized bodytrack application. Details on how to measure average concurrency level with Intel(R) Parallel Amplifier are described in chapter “Intel(R) Parallel Amplifier: Concurrency Analysis and Locks and Waits Analysis”.

Conclusion

This section is a placeholder for you to share with us your experience using Intel(R) Parallel Studio. If you have any improvement suggestions for the product, we would very much appreciate hearing about them.
