Multi-core Programming

—Student Workbook

Yang Quansheng(杨全胜)

Based on Intel’s manual

School of Computer Science & Engineering

Southeast University

Lab. 1: Intel® Compiler Switches

In this activity, you will set up the environment and compile with both Microsoft* Visual C++ .NET (MSVC*) and the Intel® C++ Compiler (icl).

Setting up

1. Open an Intel compiler command prompt window (Start -> All Programs -> Intel(R) Software Development Tools -> Intel(R) C++ Compiler 10.0 -> Build Environment for IA-32 Applications).

2. Using Windows Explorer, navigate to the CompilerSwitches folder (this is in the directory where you initially copied the class files) and unzip the file RayTrace2.zip, if it has not already been unzipped.

Compiling with MSVC*

1. Using the Intel compiler command prompt, change to the Raytrace2 directory:

> cd D:\classfiles\CompilerSwitches\raytrace2\source\RayTrace2

2. Make the files and clean up:

> nmake /f raytrace2.mak clean

3. Complete the process:

> nmake /f raytrace2.mak CPP=cl.exe

Rendering the Image

1. Render the image by executing the following script:

> raytrace2 320 240

> Press ‘g’ to begin the render

> Press ‘q’ to quit the application

2. Record time elapsed ______

Compiling with Intel C++ Compiler

1. Make the files and clean up:

> nmake /f raytrace2.mak clean

2. Complete the process:

> nmake /f raytrace2.mak

3. Render the image by executing the following script:

> raytrace2 320 240

> Press ‘g’ to begin the render

> Press ‘q’ to quit the application

4. Record time elapsed ______

Using High Level Optimizer (-O3)

1. Compile for high-level optimizations (HLO):

> nmake /f raytrace2.mak clean

> nmake /f raytrace2.mak CF="-O3"

2. Render the image.

3. Record time elapsed ______

Using Inter-procedural Optimization (-Qipo)

1. Compile for inter-procedural optimizations:

> nmake /f raytrace2.mak clean

> nmake /f raytrace2.mak CF="-Qipo" LF="-Qipo"

2. Render the image.

3. Record time elapsed ______

Using Profile-guided Optimization (-Qprof_gen, -Qprof_use)

1. Compile to create PGO instrumented binary:

> nmake /f raytrace2.mak clean

> nmake /f raytrace2.mak CF="-Qprof_gen -Qprof_dir ..\RayTrace2"

2. Render the image.

Note: The instrumented executable runs considerably slower than a normal build, so expect the reported time to be long.

3. Record time elapsed ______

4. Compile to ‘use’ PGO information:

> nmake /f raytrace2.mak clean

> nmake /f raytrace2.mak CF="-Qprof_use -Qprof_dir ..\RayTrace2"

Note: Ignore messages stating “no .dpi information.”

5. Render the original image (from Step 2).

6. Record time elapsed ______

Using Vectorization (-QxP)

1. Compile for vectorization:

> nmake /f raytrace2.mak clean

> nmake /f raytrace2.mak CF="-QxP"

2. Render the image.

3. Record time elapsed ______
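As an illustration of what the vectorizer looks for, the following is a minimal sketch of a loop that is usually a good candidate: iterations are independent and memory is accessed with unit stride. The function name saxpy and its signature are hypothetical and are not taken from the RayTrace2 sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example, not part of RayTrace2.
// Computes y[i] = a * x[i] + y[i]. No iteration depends on another and the
// arrays are walked with unit stride, which is the pattern an
// auto-vectorizer can typically map onto SIMD instructions.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Loops whose iterations depend on earlier results, or that call opaque functions in the body, are much less likely to be vectorized.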

Using All Optimization Options (-O3, -QxP, IPO and PGO)

1. Compile using the following options: -O3, -QxP, IPO, and PGO (note: this combination does not work with version 10.0):

> nmake /f raytrace2.mak clean

> nmake /f raytrace2.mak CF="-O3 -QxP -Qipo -Qprof_use -Qprof_dir ..\RayTrace2" LF="-Qipo"

Note: You need not collect additional profile information; use the existing profile from “Using Profile-guided Optimization”.

2. Render the image.

3. Record time elapsed ______

Lab. 2: Basics of Intel® VTune™ Performance Analyzer

Activity 1: Finding Hotspots

In this activity, you will collect sampling data for the program gzip.exe and identify the function that consumes the most execution time.

Collect Sampling Data for the Clockticks Event

1. Make sure that the virus scanner is disabled.

2. Double click the VTune Performance Analyzer icon on the desktop.

3. Click New Project.

4. Click Sampling Wizard.

5. Click OK.

6. Select Windows*/Windows* CE/Linux Profiling.

7. Uncheck Automatically generate tuning advice.

8. Click Next.

9. In the Application To Launch text box, type:

D:\classfiles\VTuneBasics\gzip\release\gzip.exe

10. In the Command Line Arguments text box, type: -f testfile.dat

11. Click Finish. The VTune analyzer will now launch and profile gzip.

Figure 2.1. Sampling Configuration Wizard

Questions

What function in gzip.exe takes the most time?

Which function in gzip.exe has the highest CPI?

Which line of source in gzip.exe has the most clocktick samples?

Is gzip.exe multithreaded?

Activity 2: Sampling Over Time

In this activity, you learn how to use the Sampling Over Time view. You analyze the performance of the sample application matrix.exe.

Create a Sampling Activity

1. Create a new Activity by clicking on Activity->New Activity.

2. Click Sampling Wizard.

3. Click OK.

4. Select Windows*/Windows* CE/Linux Profiling.

5. Uncheck Automatically generate tuning advice.

6. Click Next.

7. In the Application To Launch text box, type:

D:\classfiles\VTuneBasics\matrix_mt\release\matrix.exe

8. Click Finish.

9. After the VTune analyzer finishes collecting data, click the Process button.

10. Press CTRL+A to select all processes.

11. Click the Display Over Time View button, shown at left.

12. You can now see when each of the different processes was running. You can do this for the Process, Thread, and Module views.

13. To zoom in on a particular time region, select the time region by clicking and dragging over it. Then click the Zoom In button, shown at left.

14. To see the regular sampling view for that time region, click the Regular Sampling View button, shown at left. You will now be able to drill down into your code for that time region.

Activity 3: Introduction to Call Graph

In this activity, you will collect call graph data for the program gzip and identify the function that takes the most time and the functions that call it.

Collect the call graph data

1. Create a new Activity by clicking on Activity->New Activity.

2. Double click Call Graph Wizard.

3. Click Windows*/Linux* Profiling.

4. Click Next.

5. Type: D:\classfiles\VTuneBasics\gzip\release\gzip.exe in the Application to launch text box.

6. Type -f testfile.dat in the Command Line Arguments text box.

7. Click Finish.

What function has the most time spent in it and what functions are calling it?

Activity 4: Using the Windows Command Line Interface

In this activity, you learn how to use the Windows* Command Line Interface to profile an application.

Collect Samples Based on the Clockticks Event using the Command Line Interface

1. Open the command prompt by clicking Start->Programs->Accessories->Command Prompt.

2. Go to the following directory by typing: CD D:\classfiles\VTuneBasics\gzip\release\

3. Create a new Activity that will launch gzip and collect samples based on the Clockticks event by typing: vtl activity gzip -c sampling -app gzip.exe,"-f testfile.dat"

4. Run the last Activity created by typing vtl run.

(Alternatively, you can append the word “run” to the end of the command in Step 3 to run the Activity immediately after it is created.)

View the Profiling Data for gzip

1. Type: vtl view -modules to see the number of samples for each module system-wide.

2. Type: vtl view -hf -mn gzip.exe to see the function level breakdown of the samples in gzip.exe.

Pack the Data and View It in the GUI

1. Type vtl pack gzip.

This creates a file called gzip.vxp. This file can be transported from computer to computer and also opened in the VTune analyzer’s GUI.

2. Start VTune Performance Analyzer (for example, by double-clicking the desktop icon, if present).

3. Click Browse for Existing File.

4. Open: D:\classfiles\VTuneBasics\gzip\release\gzip.vxp

5. Click OK.

6. Click OK.

The GUI displays the performance data.

Lab. 3: Intel® Math Kernel Library

Activity 1: Matrix Multiply Sample

This activity demonstrates performance characteristics of BLAS levels 1, 2 and 3, compared to C source code. In this activity you will inspect, build and run a matrix multiply sample using source code, DDOT, DGEMV and DGEMM.

1. Navigate to the folder D:\classfiles\MKL_Overview\DGEMM, and open the file mkl_lab_solution.c using any editor of your choice. Go through the code quickly to confirm the 4 implementations of matrix multiply.

2. Examine the Makefiles supplied, and identify the key link steps to enable using MKL.

If necessary, edit the file so that all include or library paths are correct. Use these Makefiles to build the demo.

> nmake /f makefile

Run the programs and record the timings.

3. Note differences in timings among the different implementations:

roll_your_own:______
using DDOT:______
using DGEMV:______
using DGEMM:______

4. MKL functions assume a default threads value of 1. Change this value by setting the environment variable OMP_NUM_THREADS, for example:

set OMP_NUM_THREADS=2

5. Observe the performance at different numbers of threads.

What happens when the number of threads exceeds the number of physical processors?
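To make the comparison concrete, the sketch below contrasts a hand-written triple loop with a formulation built on one dot product per output element, which is how a level 1 routine such as DDOT would be used. It is only an illustration under stated assumptions: the function names, the row-major square-matrix layout, and the hand-written dot() helper are hypothetical stand-ins and are not the code in mkl_lab_solution.c or the MKL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption of this sketch: n x n matrices stored row-major in a flat vector.
using Matrix = std::vector<double>;

// "Roll your own": plain triple loop computing C = A * B.
Matrix multiply_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Hand-written stand-in for a strided level 1 dot product (not the MKL routine).
double dot(const double* x, std::size_t incx,
           const double* y, std::size_t incy, std::size_t n) {
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k) sum += x[k * incx] * y[k * incy];
    return sum;
}

// Level 1 formulation: one dot product per element of C, so n * n calls.
Matrix multiply_with_dot(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            // Row i of A has stride 1; column j of B has stride n.
            c[i * n + j] = dot(&a[i * n], 1, &b[j], n, n);
    return c;
}
```

A level 2 version would replace the inner two loops with one matrix-vector product per column, and a level 3 version would replace all three loops with a single matrix-matrix call, which gives the library the most freedom to block for cache.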

Activity 2: Monte Carlo Calculation of Pi

In this activity, you will modify the Monte Carlo computation of pi to use the random number feature of the Vector Statistical Library (VSL). You will also make use of the multithreading capabilities of VSL.

1. Navigate to the folder D:\classfiles\MKL_Overview\MonteCarloPi, and open the file pimonte.c using any editor of your choice. Go through the code quickly to understand how the rand() function is implemented.

Thought question: Could this loop be threaded? Yes

2. Examine the file pimonte_VSL.c, and understand the changes necessary to implement a library call to replace rand().

Thought questions: Why is this not a 1:1 substitution for rand()?

What is the purpose of, and sensitivity to, blocksize?

What are the parameters BRNG and VSL_BRNG_MCG31?

Are they the best choices for this computation?

Could this implementation be threaded?

3. Examine the Makefiles supplied, and identify the key differences. Use these Makefiles to build versions of this application with rand() and with VSL (recall the syntax "make -f"). Note the impact of the "-xP" switch in the compiler report, in the two versions. Note also the difference in results for pi, and for the run times of each image.

4. As in the previous exercise, change the number of threads used and observe the changes in both performance and in the value of pi.
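The reason a block-oriented generator is not a 1:1 substitute for rand() can be sketched as follows: random numbers are produced a buffer at a time and then consumed, so the loop is restructured around a block size. This sketch uses the C++ standard library generator as a stand-in for VSL; the function name estimate_pi, its parameters, and the choice of std::mt19937 are assumptions for illustration and do not reflect pimonte_VSL.c.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: estimates pi by sampling points in the unit square
// and counting those that fall inside the quarter circle of radius 1.
// Random numbers are generated one block at a time, mimicking the structure
// that a vector random number generator imposes on the loop.
double estimate_pi(std::size_t samples, std::size_t block_size, unsigned seed) {
    assert(samples > 0 && block_size > 0);
    std::mt19937 engine(seed);  // stand-in generator, not a VSL BRNG
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> buffer(2 * block_size);  // x and y for each sample

    std::size_t inside = 0;
    std::size_t done = 0;
    while (done < samples) {
        const std::size_t count = std::min(block_size, samples - done);
        // Step 1: fill the buffer (a single library call in a vector RNG).
        for (std::size_t i = 0; i < 2 * count; ++i) buffer[i] = uniform(engine);
        // Step 2: consume the block.
        for (std::size_t i = 0; i < count; ++i) {
            const double x = buffer[2 * i];
            const double y = buffer[2 * i + 1];
            if (x * x + y * y <= 1.0) ++inside;
        }
        done += count;
    }
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

A small block size spends most of the time in call overhead, while a very large one may no longer fit in cache, which is one way to think about the sensitivity question above.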

Lab. 4: Programming with Windows* Threads

Activity 1: Starting With HelloThreads

Build & Run HelloThreads Program

1. Close Microsoft Visual Studio, if it is started.

2. With Windows Explorer*, open the folder D:\classfiles\Win32Threads\HelloThreads\.

3. With Microsoft Visual Studio, open the file HelloThreads.sln by double-clicking it.

4. From the Build menu, select Configuration Manager and then select Debug build.

5. From the Project menu, click Properties and then click the C/C++ folder.

6. Make sure that Debug options are selected, as shown in Figure 4.1.

7. Make sure that Optimization is disabled, as shown in Figure 4.2.

8. Make sure that thread-safe libraries are selected, as shown in Figure 4.3.

9. Make sure that Debug symbols are preserved during the link phase, as shown in Figure 4.4.

10. From the Build menu, select Build Solution to build your project.

11. From the Debug menu, select Start Without Debugging to run the program.

12. In Microsoft Visual Studio’s Solution Explorer, expand the HelloThreads project and select the file main.cpp to open it, as shown in Figure 4.5.

13. Modify the thread function to report the thread creation sequence (that is, “Hello Thread 0”, “Hello Thread 1”, “Hello Thread 2”, and so on).

Hint: Use the CreateThread() loop variable to give each thread a unique number.

14. Build and execute your program.

In what order do the threads execute?

Do the results look correct?

Why or why not?
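The key point behind the hint is that each thread needs its own copy of the loop index; handing every thread the address of the same loop variable lets the value change before the thread reads it. The sketch below shows the idea with std::thread as a portable stand-in for CreateThread(); the function names and the seen vector are hypothetical and are not the contents of main.cpp.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Thread function: receives its identifier by value, so it cannot be
// affected by later changes to the loop variable in the creating thread.
void hello(int id, std::vector<int>& seen) {
    std::printf("Hello Thread %d\n", id);
    seen[id] = 1;  // each thread writes only its own slot, so no lock is used
}

// Creates num_threads threads and reports which identifiers were observed.
std::vector<int> run_hello_threads(int num_threads) {
    assert(num_threads > 0);
    std::vector<int> seen(num_threads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        // i is copied into the new thread at creation time.
        threads.emplace_back(hello, i, std::ref(seen));
    }
    for (std::thread& t : threads) t.join();
    return seen;
}
```

Every identifier is printed exactly once, but the order of the printed lines can differ from run to run. With Win32 threads, a comparable approach is to pass a pointer to per-thread storage (for example, an element of an array of identifiers) as the thread parameter.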

Review Questions

The execution order of threads is unpredictable.

True False

What build options are required for any threaded software development?

Figure 4.1. Project Setting – C/C++ Folder – Debug Options

Figure 4.2. Project Settings – C/C++ Folder – Optimization Options

Figure 4.3. Project Settings – C/C++ Folder – Thread-safe Libraries Options

Figure 4.4. Linker Settings – Debugging Folder

Figure 4.5. Solution Explorer – HelloThreads Project

Activity 2: Approximating Pi with Numerical Integration

Build and Run the Serial Program

1. With Windows Explorer, open the folder D:\classfiles\Win32 Threads\Pi\.

2. Start Pi.sln by double-clicking it.

3. From the Build menu, select Set Active Configuration and then select Debug build.

4. From the Project menu, click Properties and click the C/C++ folder. From the Build menu, select Build Solution to build your project.

5. From the Debug menu, select Start Without Debugging to run the program.

Is the value of the Pi (3.1415…) printed correct?

Why or why not?

Correct Errors and Validate Results

1. In Microsoft Visual Studio’s Solution Explorer, expand the Pi project and open the file Pi.cpp.

2. On the C/C++ folder, make sure that thread-safe libraries are selected, as shown in Figure 4.6.

Figure 4.6. Project Settings – C/C++ Folder – Thread-safe Libraries Options

3. Thread the serial code to compute Pi using four threads. The bulk of the computation being done is located in the body of the loop. Encapsulate the loop computations into a function and devise a method to ensure that the iterations are divided amongst the threads such that each iteration is computed by only one thread.

4. Use a CRITICAL_SECTION to protect shared resources accessed by more than one thread. Locate any data races in your program and correct those errors. Some logic changes from the serial version might be required to create code that is both correct and safe.

Challenge: Minimize the number of lines of code in the critical section(s).

Hint: Think about local variables.

5. When you have changed the source code, from the Build menu, select Build Solution to rebuild and then from the Debug menu, select Start Without Debugging to execute.

6. Keep correcting your source code until you see the correct value of Pi being printed.

The correct value of Pi is 3.141592654.

Hint: A complete solution to the lab is provided in the file PiSolution.cpp, which is located in the following directory: D:\classfiles\Multi-Core\Windows\Win32 Threads\Pi\
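One way to approach steps 3 and 4 is sketched below. It assumes the usual midpoint-rule integration of 4/(1 + x*x) over [0, 1], which may differ from the formula in Pi.cpp, and it uses std::thread and std::mutex as portable stand-ins for Win32 threads and a CRITICAL_SECTION. The function name compute_pi and its parameters are hypothetical; this is not the code in PiSolution.cpp.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: midpoint-rule approximation of the integral of
// 4 / (1 + x^2) over [0, 1], with iterations dealt out cyclically to threads.
double compute_pi(long num_steps, int num_threads) {
    assert(num_steps > 0 && num_threads > 0);
    const double step = 1.0 / static_cast<double>(num_steps);
    double pi = 0.0;      // shared result
    std::mutex pi_lock;   // plays the role of the CRITICAL_SECTION

    auto worker = [&](int id) {
        double local_sum = 0.0;  // private to this thread, so no race here
        // Thread id handles iterations id, id + num_threads, ...
        // so each iteration is computed by exactly one thread.
        for (long i = id; i < num_steps; i += num_threads) {
            const double x = (static_cast<double>(i) + 0.5) * step;
            local_sum += 4.0 / (1.0 + x * x);
        }
        // The protected region is a single update of the shared variable.
        std::lock_guard<std::mutex> guard(pi_lock);
        pi += local_sum * step;
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_threads; ++id) threads.emplace_back(worker, id);
    for (std::thread& t : threads) t.join();
    return pi;
}
```

Accumulating into a local variable and entering the protected region once per thread, rather than once per iteration, is the idea behind the challenge and the hint about local variables.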

Review Questions

All threads should use the same CRITICAL_SECTION.

True False

Threading errors in software can always be corrected by using only synchronization objects.

True False

CRITICAL_SECTION objects should always be declared as global variables.

True False

What build options are required for any threaded software development?

Activity 3: Using Semaphores

The application opens an input text file. Threads read in lines from the text file and count the total number of words in the line, as well as the number of words with an even number of letters and an odd number of letters. When done, the text file is closed and the final totals are printed.

Build and Run the Serial Program

1. With Windows Explorer, open the folder D:\classfiles\Win32Threads\SemaphoreLab\.

2. Double-click on the SemaphoreLab.sln icon. Within Microsoft* Visual Studio, you will find two projects. One is the serial version of the code and the second is the first attempt at threading that code with Win32 Threads.

3. Be sure the Serial project has been selected as the Startup Project. Build the solution and run the serial application using the Start Without Debugging command from the Debug menu.

4. Note the output generated:

Total Words: ______
Total Even Words: ______
Total Odd Words: ______

Build & Run Threaded Program

1. Set the Threaded project as the Startup Project. Build the solution and run the threaded application using the Start Without Debugging command from the Debug menu. Do you get the same answers as above?

2. Examine the source code. Identify the global variables that are being accessed by each thread.

3. Rewrite the threaded application to protect the use of these global variables. Wherever mutual exclusion is needed in your solution, use a binary semaphore. Be sure to declare the semaphores at the proper level and initialize them before use.

4. Build and run the threaded code until you are able to achieve the same totals as the serial version of the application.
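The pattern of guarding shared totals with a binary semaphore can be sketched as follows. Several things here are assumptions for illustration: the BinarySemaphore class is a small portable stand-in built from a mutex and a condition variable (comparable to a Win32 semaphore created with an initial and maximum count of 1), the lines are already in memory rather than read from a shared file, and a word's length is taken as its character count including punctuation. None of the names come from the lab sources.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Portable stand-in for a binary semaphore (hypothetical helper).
class BinarySemaphore {
public:
    void wait() {    // acquire: blocks until the semaphore is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return available_; });
        available_ = false;
    }
    void signal() {  // release: makes the semaphore available again
        {
            std::lock_guard<std::mutex> lock(mutex_);
            available_ = true;
        }
        cv_.notify_one();
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool available_ = true;  // starts available, like an initial count of 1
};

struct Totals { int words = 0; int even = 0; int odd = 0; };

// Counts words across lines using num_threads threads; the shared totals
// are updated only while the semaphore is held.
Totals count_words(const std::vector<std::string>& lines, int num_threads) {
    assert(num_threads > 0);
    Totals totals;           // shared by all threads
    BinarySemaphore guard;   // declared where every thread can reach it

    auto worker = [&](int id) {
        Totals local;        // private counts for this thread
        for (std::size_t i = static_cast<std::size_t>(id); i < lines.size();
             i += static_cast<std::size_t>(num_threads)) {
            std::istringstream in(lines[i]);
            std::string word;
            while (in >> word) {
                ++local.words;
                if (word.size() % 2 == 0) ++local.even; else ++local.odd;
            }
        }
        guard.wait();
        totals.words += local.words;
        totals.even += local.even;
        totals.odd += local.odd;
        guard.signal();
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_threads; ++id) threads.emplace_back(worker, id);
    for (std::thread& t : threads) t.join();
    return totals;
}
```

In the lab itself the threads also share the input file, so reading the next line is another operation that may need its own protection.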

Activity 4: Using Events

The application computes an approximation of the natural logarithm of (1 + x), -1 < x <= 1, using the Mercator series (ln(x + 1) = x - x^2/2 + x^3/3 - x^4/4 + x^5/5 - ...). Compute threads are created in a suspended mode. These threads are released to compute the series elements that have been assigned to them after a "master" thread has been created. The master thread waits on a thread count variable that is incremented by each compute thread as it finishes. Once all threads have completed computing the partial sums, the master thread does a final summation and terminates. Results are printed by the process thread after cleaning up all the objects and handles.

Build and Run Original Threaded Program

1. With Windows Explorer*, open the folder C:\classfiles\Win32Threads\N_ThreadEvent.

2. Double-click on the N_ThreadEvent.sln icon.

3. Within Microsoft* Visual Studio, examine the source code until you understand how the threads are created and interact with the others.

4. Build the solution and run the application using the Start Without Debugging command from the Debug menu.

Modify Original Threaded Program to Use Events

1. Modify the threaded code to use events to signal the master thread rather than the count of a global variable. Some hints for how to do this are given below:

a. Create an array of event handles, one event per thread. This can be done in the same way that the thread handle array is allocated. Be sure to initialize each event.

b. Replace the spin-wait in the master thread with a wait on all the events being signaled from each compute thread.

c. Replace the protected increment of the thread count variable with a signal of the event in the array that is indexed with the thread number. (Do you still need the critical section?)

2. Build the solution and run the application using the Start Without Debugging command from the Debug menu.
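The structure described in hints a to c can be sketched as follows. The Event class is a hypothetical portable stand-in for a Win32 manual-reset event, built from a mutex and a condition variable; a separate start event stands in for resuming threads that were created suspended; and the calling thread plays the role of the master. The names and the way terms are assigned to threads are assumptions and are not taken from the N_ThreadEvent sources.

```cpp
#include <cassert>
#include <cmath>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Portable stand-in for a manual-reset event (hypothetical helper).
class Event {
public:
    void set() {   // signal the event and wake every waiter
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }
    void wait() {  // block until the event has been signaled
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return signaled_; });
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Approximates ln(1 + x) with the first num_terms terms of the Mercator series.
double mercator_log(double x, int num_terms, int num_threads) {
    assert(x > -1.0 && x <= 1.0 && num_terms > 0 && num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);  // one slot per thread
    std::vector<Event> done(num_threads);           // one event per thread
    Event start;  // stands in for releasing suspended threads

    auto compute = [&](int id) {
        start.wait();
        double sum = 0.0;
        // Thread id handles terms id + 1, id + 1 + num_threads, ...
        for (int k = id + 1; k <= num_terms; k += num_threads) {
            const double sign = (k % 2 == 1) ? 1.0 : -1.0;
            sum += sign * std::pow(x, k) / k;
        }
        partial[id] = sum;  // written by this thread only
        done[id].set();     // replaces the increment of a shared counter
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_threads; ++id) threads.emplace_back(compute, id);

    start.set();                      // release the compute threads
    for (Event& e : done) e.wait();   // wait until every event is signaled

    double result = 0.0;
    for (double p : partial) result += p;
    for (std::thread& t : threads) t.join();
    return result;
}
```

Waiting on each event in turn has the same effect here as a single wait for all events; with Win32 handles, WaitForMultipleObjects with the wait-all flag set performs that wait in one call.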