# The Challenges of Writing Portable, Correct and High Performance Libraries for Gpus Or How to Avoid the Heroics of GPU Programming

Portable, Correct and High

Performance Libraries for GPUs or How to Avoid the Heroics of GPU

Programming

Prof. Miriam Leeser

Department of Electrical and Computer Engineering

Northeastern University

Boston, MA

mel@coe.neu.edu Outline

• Biomedical GPU applications at Northeastern

– Sparse Linear Solvers in SCIRun

– Lung Tumor Tracking

• The challenge of writing libraries for GPUs

– Example: Calculating Pi

• Correctness and GPUs

– PUG: A Symbolic Verifier for GPU Kernels

• Floating Point Arithmetic on GPUs

– Floating Point Correctness

• Future Directions GPU Programming

• Currently we are focusing on NVIDIA and CUDA programming

• Also interested in AMD/ATI hardware and OpenCL as a programming model

– CUDA programming tools are more mature at the moment

– AMD has released their Fusion architecture

– Intel recently released Sandy Bridge, a chip that integrates processor and GPU on a single die

• Programming model?

– This is a rapidly changing field!

• For the rest of this talk, for GPU read “NVIDIA” and for GPU programming, read “CUDA” GPU as a General Purpose

Computing Platform

Speedups are impressive!

Real Time Elimination Lattice-Boltzmann Method

Genetic Algorithm

Total Variation Modeling

2300 X 1000 X 1840 X of Undersampling Artifacts for Numerical Fluid Mechanics

2600 X

Monte Carlo Simulation Stochastic Differential K-Nearest Neighbor

Of Photon Migration Equations Search

Fast Total Variation for

Computer Vision

1000 X 675 X 470 X

1000 X

Source: CUDA Zone at How to get 1000x speedup

• Match algorithm parameters to the GPU you are running on

• Even better, pick the algorithm parameters to match the GPU!

• Example: template matching

– Pick template sizes that fit in shared memory

– Pick image sizes that fit into texture memory

– Pick image sizes so that the number of pixels are a multiple of the optimal number of threads

• Hardware changes …. reoptimize! Examples of Heroic programming How to write a Generic GPU library

• Cannot pick your parameters

– Have to work with the problem parameters

• Cannot optimize to the GPU architecture

– Different Nvidia GPUs have different

• Number of Streaming Multiprocessors (SMs)

• Sizes of memories

• Single or double precision Floating Point

• Can we still deliver performance?

• Can we guarantee correct behavior across multiple platforms? Real World Applications

• Some applications we are working on at

Northeastern:

– Linear Systems Solvers for SCIRun

– Lung Tumor Tracking

• Examples of library code for GPUs that must work over a range of parameters University of Utah’s SCIRun

• SCIRun is a biomedical problem solving environment

• Allows scientists to create a network of mathematical functions

• GUI allows easy simulation design with the ability to draw the flow of data to many algorithms

• Many of these algorithms are time consuming

… and display parallelism!

This work was made possible in part by software from the NIH/NCRR Center for Integrative Biomedical Computing,

P41-RR12553-10. www.sci.utah.edu/cibc/software/106-scirun.html SolveLinearSystem Module

• Solves Ax=b using iterative methods

• Matrices are large and sparse

• Contains several different algorithms that exhibit parallelism

• User selectable GPU check-box ParallelLinearAlgebra (CPU)

Original CPU CG Layout

• All general algorithms exist at

SCIRun’s module level

SolveLinearSystem

(Algorithms)

• ParallelLinearAlgebra class implements CPU matrix and vector computations with basic

BCG Others

CG optimizations (loop unrolling)

• Algorithms call these low level math

ParallelLinearAlgebra

(Low-Level Math) functions as an abstraction

Add Multiply

• This structure lends itself to a convenient GPULinearAlgebra sibling GPULinearAlgebra Conjugate Gradient Algorithm

• Iterative Solver

• Data in these problems are generally sparse

• Due to matrix-vector sparse matrix multiplication (SPMV), there is good parallelism

SPMV

• Limited by sparse data organization and algorithmic dependencies (both within and across iterations) Experiment Details

• Test Machine

• CPU – Intel Core 2 E6300 1.86GHz, 2Mb L2 cache, 1066MHz FSB

• GPU - NVIDIA GeForce 280 GTX 1GB RAM PCIe card

• Test conditions

• Tests were run 10 times each and averaged

• Test time is end to end – includes all data transfer and setup overheads involved in GPU version

• Test data

• Heart Ischemia Model

• University of Florida’s Sparse Matrix Collection

14 Input Data

Number of Rows

4000

3500

3000

2500

2000

1500

• The sparse matrices vary in

1000

500

0size from 6.3K to 3.5M rows

Nonzeros

16000

14000

12000

10000

8000

6000

4000

2000

0

• Nonzeros vary from 42K to

14.8M

15 Conjugate Gradient

GPU/CPU End to End Speedup – Nearly identical performance in each of 10 runs

CPU: Intel Core 2 1.86GHz

GPU: NVIDIA GeForce GTX 280

Double Precision is used in all examples

16 Jacobi and Minimal Residual

CPU: Intel Core 2 1.86GHz

GPU: NVIDIA GeForce GTX 280

Double Precision is used in all examples

107K x 107K Heart Ischemia Model

Time (seconds)

Algorithm Speedup

CPU GPU

CG 164.57 31.05

5.3x

3.4x

6.9x

Jacobi 7.42 1.46

MinRes 81.96 11.80

17 Third Party Implementations

• Many third party packages are available as open source

• They may perform better but are more difficult or impossible to incorporate into the user experience of SCIRun

• CNC Number Cruncher (CG implementation)

• Gocad Research Group – Nancy University, France

107K x 107K Heart Ischemia Model

Time (seconds)

Algorithm Speedup

CPU GPU

CG 164.57 31.05

5.3x

5.9x

3rd Party CG 164.57 27.98

L.Buatois, G.Caumon B.Lévy, Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU, High

Performance Computing and Communications, Third International Conference, HPCC, 2007.

18 Validation of Results

Difference in Iterations to Converge

1.2

1

0.8

0.6

0.4

0.2

0

507 1016 1580 2174 3640 4126 5759 8485 15681 25195

CPU Iterations to Converge

• Different orders of operations affect the iterations necessary to achieve desired error

• Double precision is necessary to limit this

• The CPU and GPU differ in the number of iterations needed to converge by less than 1%

19 Discussion

• Speedup was achieved using the original CPU algorithm

• The only added operations are transferring the data to the GPU

• The algorithms were accelerated by a simple technique that can be applied to algorithms throughout SCIRun

Iterative portion of the Jacobi Method solver

20 Where the Performance is Realized

• In SpMV, each row is computed by one thread

• Small number of rows = low utilization of GPU

• The other vector operations (mostly accelerated via the CUBLAS library) are relatively fast but occupy a low % of total time

(ms) (ms)

Calculation CPU GPU Speedup

Time Distribution of Operations

Data Copy and Setup 0.08 190.22 -2377.75x

Preconditioner 145.11 8.26 17.57x

SpMV 130.68 9.37 13.95x

SpMV

Subtract

Subtract 21.09 0.72 29.29x

Dot Product (2)

Dot Product (2 per iter) 12.97 0.53 24.47x

Norm 6.77 0.45 15.04x

Normalize

Scale and Add (2)

Scale and Add (2 per iter) 0.72 27.25x 19.62

Total time per iteration 13.04 17.15x 223.72

21 A video recording of the CPU and GPU versions of the Conjugate Gradient Algorithm

CPU GPU Lung Tumor Tracking

Work with research by Ying Cui, Jennifer Dy, Gregory Sharp, Brian

Alexander, and Steve Jiang from Northeastern University, Massachusetts General Hospital, Harvard

Medical School, and University of California San Diego

• Some tumors move significantly during breathing

– Therapy targets large area with weak radiation – prevents damage to normal tissue

– Goal is to use higher-intensity focused radiation without implanted markers

• Two template-based tracking algorithms proposed:

– Motion-enhanced templates and Pearson’s correlation

– Eigen templates and mean-squared error

23 Tumor Marking Tool

Templates are generated with a MATLAB tool

Tumor is selected by a clinician

24 Motion Enhancement

Static Motion

Image Enhanced

• Motion enhanced image is difference between the image and average of images used for templates

– Moving structures are emphasized

Image: Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang, "Multiple Template Based Fluoroscopic

Tracking of Lung Tumor Mass without Implanted Fiducial Markers," Physics in Medicine and 25

Biology, Vol. 52, pp. 6229-6242, 2007. Multiple Templates

Image brightness tumor position varies during breathing

Periodic: multiple representative templates from different points in the respiration cycle are used to compensate

Image: Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang, "Multiple Template Based Fluoroscopic

Tracking of Lung Tumor Mass without Implanted Fiducial Markers," Physics in Medicine and 26

Biology, Vol. 52, pp. 6229-6242, 2007. Lung Tumor Reference Data Sets

Patient Frames Templates Template Size Shift ±V/±H

1442 53×54 18/9 12

348 211/5 13 23×21

259 39/4 10 76×45

290 411 156×116 9/3

273 512 86×78 11/6

210 614 9/2 141×107

• Set of fixed large templates

– Significant variation in template sizes and much larger than typical

• Small search with single ROI per frame

• Different part of the problem space from traditional template matching

27 Similarity Score

• Pearson’s correlation for similarity score

A A B B

MN MN

M N corr2(A, B)

2

2

A A B B

MN MN

M N M N

• Sliding window operation

– Template data (A) fixed for given template

– ROI data (B) depends on window location and frame

• Average subtraction prevents frequency domain Similarity Score

• Pearson’s correlation for similarity score

C

A B B

MN MN

M N corr2(A, B)

2

AD

B B

MN

M N

• Sliding window operation

– Template data (A) fixed for given template

– ROI data (B) depends on window location and frame

• Average subtraction prevents frequency domain Corr2() Numerator

C

A B B

MN MN

M N

• Corr2 implemented as several kernels

– Average, denominator, and numerator

– Numerator is where most of the work is done

• Problem instance complicates mapping to CUDA

– Where to generate parallelism

– Few templates (10 to 14)

– Assume streaming (one ROI at a time)

– Search space is small (95 to 703 total positions) Data Placement

• Common CUDA GPU paradigm: use shared memory to cache working set

– Adjacent template locations share all but one row/column of frame data

– Shared memory blocking reduces global memory accesses

• Does not work for case study

– Only smallest template and ROI fit

– Largest templates too large for constant or shared memory Tiled Template (1)

C

A B B

MN MN

M N

• corr2() numerator outer loops are addition

• Break into tiles and separately process subtemplates

– Generates more blocks for parallelism

– Reduces working set size for shared memory

• Second step to combine the partial results

– Reduction kernel Tiled Template (2)

• Tiles mapped to grid of thread blocks

– Supports arbitrary template sizes

– One thread per shift location

• Efficient tile size may not match arbitrary template dimensions

– Leftover regions

– Corr2() complicates padding Tiled Template Example

• Edge tiles adjust only one dimension of main tiles

• Patient 4 template size: 156 × 116 pixels

Main Tile Main Right Tile Right Bottom Bottom Corner

Size Blocks Size Blocks Tile Size Tiles Tile Size

8×8 19×14 8×4 19 4×8 14 4×4

16×10 9×11 916×6 12×16 11 12×6

4×4 39×29 Remaining Kernels

C

A B B

MN MN

M N corr2(A, B)

2

AD

B B

MN

M N

• Frame averages and denominator contribution similar to numerator

– Tiled and reduction steps

– Simpler than numerator

• Final fraction is an element wise operation Experimental Setup

• Benchmarked arbitrary tile sizes

– 4×4 to 16×16

– Thread counts arbitrarily fixed

• Compared against

– MATLAB (optimized)

– Multi-threaded C (non-vectorized)

• Benchmarking hardware

– Intel Xeon W3580 (4 i7 cores @ 3.33 GHz, 6MB L2)

– Nvidia Tesla C1060 with CUDA 3.1

– Ubuntu 10.04 64-bit (GCC 4.4.3) and MATLAB R2010a Performance

• Good performance across patients

– Includes data transfer

• Limited parallelism available for patient 2

– Small template and small

ROI

GPU vs Multithreaded C:

Patient 123456

Best Tile Size 8×8 16×2 4×4 8×8 16×10 8×8

Template Size 141×107 53×54 23×21 76×45 156×116 86×78

4.09 0.64 Total Speedup 3.61 6.92 5.66 6.86

Performance Analysis

• Numerator dominates

– GPU runtime

– Most complicated

• Processes templates independently

• Other stages applied once per frame

• Still room for optimization Real Applications, Real Data

• Biomedical Applications

– Linear Solvers in SCIRun

– Tumor tracking

• These are real world apps with real data … and real speedup!

• Speedup shown is end to end application speedup

• Algorithms NOT chosen to show off best of a given GPU architecture

• For accelerating SCIRun and Matlab (lung tumor tracking) want generic solutions

– Run on a family of GPUs

– Run on a variety of different data sizes

• Why is this hard?

• How do we know the GPU results are correct? Nvidia GPU differences

Release Floating

GPU SMs | Cores Clock Memory

Date Point

Nov. 2006

SP

GeForce 8800 GTX 16 | 128 575MHz 768MB

SP

GeForce 8800 GS Jan. 2008 12 | 96 550MHz 384MB

GeForce 280 GTX Jun. 2008 30 | 240 602MHz 1024MB

Tesla C1060 Jun. 2008 30 | 240 602MHz 4096MB

SP/DP

SP/DP

• All GPUs in the table have 16KB shared memory per SM

• All launch 32 threads per warp Example: Sequential Computation of Pi

Static long num_steps = 100000; double step; void main ()

{ int i; double x, pi, sum = 0.0; step = 1.0/(double) num_steps; for (i=1; i = num_steps; i++){ x = (i-0.5)*step; sum = sum + 4.0/(1.0+x*x);

}pi = step * sum;

}

• From OpenMP tutorial given at SC’99 by Tim Mattson and Rudolf

Eigenmann:

Parallelizing Sequential Pi

Static long num_steps = 100000; double step; void main ()

{ int i; double x, pi, sum = 0.0; step = 1.0/(double) num_steps; for (i=1; i = num_steps; i++){ x = (i-0.5)*step; sum = sum + 4.0/(1.0+x*x);

}pi = step * sum;

}

• Each calculation of x inside the for loop is independent

• First pass: Each CUDA thread calculates x and the 4.0/(1.0+x*x)

• Second pass: reduction addition to compute sum

• Final step pi= step*sum done on the CPU Snippet of CUDA for first pass

__global__ void calculateStepKernel (double *ref_d, double step, int length)

{extern __shared__ double sdata[]; int tid = blockIdx.x * blockDim.x + threadIdx.x; int lid = threadIdx.x; sdata[lid] = (tid length) ? ref_d[tid] : 0; if(tid==0) sdata[lid]=0.0; if(tid 0 tid = length) sdata[lid] = 4.0/(1.0+((tid-0.5)*step)*((tid-0.5)*step)); ref_d[tid]=sdata[lid];

}CUDA for reduction sum if(tid length)

{for (unsigned int s=1; s blockDim.x; s *= 2)

{int index = 2 * s * threadIdx.x; if (index blockDim.x index+s = length) sdata[index] += sdata[index+s];

__syncthreads();

}

}We run reduction code

Release

GPU SMs | Cores Clock Memory Result

Date

Nov. 2006

GeForce 8800 GTX 16 | 128 575MHz 768MB

X

√

√

√

GeForce 8800 GS Jan. 2008 12 | 96 550MHz 384MB

GeForce 280 GTX Jun. 2008 30 | 240 602MHz 1024MB

Tesla C1060 Jun. 2008 30 | 240 602MHz 4096MB

• It works on 3 out of 4 GPUs. Why? Sync_threads is in wrong place

Correct

Incorrect for (unsigned int s=1; s blockDim.x; s *= 2) if(tid length)

{

{int index = 2 * s * threadIdx.x; for (unsigned int s=1; s blockDim.x; s *= 2)

{if(tid length) int index = 2 * s * threadIdx.x; if (index blockDim.x index+s = length) sdata[index] += sdata[index+s];

{if (index blockDim.x index+s = length) sdata[index] += sdata[index+s];

}

__syncthreads();

}

__syncthreads();

}

}

• CUDA development forums say:

• the behavior is undefined if you disobey the rule to hit all the same synchronizations with all threads

– CUDA gdb did not find the bug Pi calculation lessons

• Cannot test my CUDA (or OpenCL) library kernels on every GPU model

• Library developer needs better tools to check for correctness

– CUDA GDB is not sufficient

– Other more powerful tools, but not symbolic

• PUG: A Symbolic Verifier for GPU Kernels developed at the University of Utah by Ganesh

Gopalakrishnan’s group

• http://www.cs.utah.edu/formal_verification/PUG/ PUG Contributions (Ganesh Gopalakrishnan)

• PUG is more comprehensive than other approaches

• PUG guarantees the absence of these errors (two threads modeled):

– Data races

– Mismatched barriers ( __syncthread() )

– Assertions on values placed by users

• PUG helps understand ranges of parameters / dimensions

– Many “bugs” are correct code run outside allowed dimensions

• V0.3 of PUG (under development) can verify for arbitrary number of threads:

– A large class of practical kernels

• Useful for comparing unoptimized and optimized kernels

• PUG gives valuable information on the degree of bank conflicts

• PUG uses a symbolic FV approach

– Symbolic approach can be very effective for CUDA

48 Floating Point Correctness

• Most published work on floating point correctness predates the release of double precision GPUs

• Now GPUs have double precision

– Do we need to worry about correctness anymore?

• Correctness of individual operations

– add, sub, div, sqrt

– These are all IEEE Floating Point Compliant

• Does that mean that code produces the same answer on every CPU or GPU that is IEEE Floating Point Compliant? Example: Sequential Computation of Pi

Static long num_steps = 100000; double step; void main ()

{ int i; double x, pi, sum = 0.0; step = 1.0/(double) num_steps; for (i=1; i = num_steps; i++){ x = (i-0.5)*step; sum = sum + 4.0/(1.0+x*x);

}pi = step * sum;

}

• From OpenMP tutorial given at SC’99 by Tim Mattson and Rudolf

Eigenmann:

What is this code doing?

• It is performing an integral of the of the derivative of the arctangent.

• The derivative of the arctangent is 1/(1+x2).

• A definite integral from 1 to 0 gives arctan(1) - arctan(0)

– which is pi/4 - 0 or just pi/4

• The code folds the factor of 4 into the integral. Approximating an Integral

• The code is doing a Riemann sum approximation of the integral using segment midpoints with num_steps segments.

Static long num_steps = 100000; double step; void main ()

Δx

{ int i; double x, pi, sum = 0.0; step = 1.0/(double) num_steps; for (i=1; i = num_steps; i++){ x = (i-0.5)*step; sum = sum + 4.0/(1.0+x*x);

}pi = step * sum;

}

• What I learned in high school math:

– The more rectangles, the closer to the exact solution Single Precision – No Optimization

Pi Results - CPU

3.150

3.145

3.140

Intel CPU

3.135

3.130

Number of Steps Single Precision – No Optimization

Pi Results - CPU vs GPU

3.150

3.145

3.140

Intel CPU

3.135

NVIDIA GTX 280

3.130

Number of Steps Correctness of Operations vs.

Correctness of Applications

• Floating Point is not associative!

• Manycore processing reorders operations

– Results are not necessarily the same

– Is the CPU version correct?

• Example: Summation: Floating Point Arithmetic and Pi

• Are the results the same on every platform?

• We tested on 2 CPUs and 2 GPUs:

– CPU - System 1: Intel Xeon W3580 @ 3.33GHz

– CPU - System 2: Intel Xeon E5520 @ 2.27GHz

– GPU - System 1: Tesla C1060

– GPU - System 2: ATI Radeon 5700

• Compared results to Wikipedia Pi:

– 3.14159265358979323846264338327950288419

7169399375105820974944 … Different Pi Calculations Tested

• Single Precision with no optimization

– On the CPU, calculate the step and keep a running sum over the loop to use to compute pi (as written)

– On the GPU, one kernel to calculate step at each iteration of the loop (at once in an array), then a reduction sum of the array to compute pi

• Single Precision with optimization

– On the CPU, instead of one accumulator, 100 elements are summed at a time, then this partial sum is added to the accumulator

• Double Precision with no optimization

– Same as Single Precision but in double

• Double Precision with optimization

– Same as Single Precision but in double Single Precision with no optimization

• CPUs get identical results to each other (expected).

– reach their most correct value of Pi at 512 steps, but as you go to more steps the result gets worse and at 4 million steps the result is not even accurate to 2 digits.

• The GPUs perform identical to each other

– Reach their best result at 1024 steps (a closer Pi than the CPU ever achieved by 1 order of magnitude error).

– As long as the GPU versions work (up to 4mil or more steps) they equal their best Pi result.

• The reduction style of calculation limits the loss of precision when summing large amounts of data. Single Precision with Optimization

• CPUs now both achieve the same results as the GPUs at 1024 steps and still hold a good value of Pi through 67 million steps

– after 1024 steps the CPU result still degrades a bit from its best value

– the GPU acts the same as it did before optimization

Pi Results - CPU vs GPU

3.150

Intel CPU

3.140

3.130

NVIDIA GTX

280

Number of Steps

02000000 4000000 Double Precision

• Double Precision with no optimization

– CPUs achieve their best Pi at about 1mil steps,

– GPU doesn't reach it's best until 16mil steps

• last run that fit due to ATI GPUs 256mb memory limit

– After 1mil steps, the CPU degraded as it did in single precision tests

– GPU may still get even better values beyond 16mil steps, as it increased until this point

• Double Preision with optimization

– CPU reaches best Pi at 8mil steps this time Floating Point and Pi

• Reduction sum gives more accurate floating point result than accumulated sum

– there is a point at which the CPU starts degrading in performance while the GPU results remain close to their best value through much larger tests

• Could code the CPU using reductions to match

GPU accuracy

– Results in slower run times

• Calculating Pi is pretty simple! Definitions of FP correctness

• Herbrand Equivalence -- the strongest

– Two symbolic expressions S(X) are Herbrand equivalent if and only if they are exactly equal. Two Herbrand equivalent programs will produce the same results, independently of the way in which the arithmetic operations are implemented.

• IEEE Equivalence -- the middle road

– Two IEEE equivalent programs must produce the same output on any platform implementing IEEE arithmetic

– There are a number of identities for real arithmetic that also hold for IEEE arithmetic

• x + y = y + x, xy = yx, and 1x = x1 = x + 0 = 0 + x = x=1 = x

– Two symbolic expressions S(X) are considered to be IEEE equivalent if one can be transformed to the other by a finite sequence of transformations corresponding to such identities.

•

•

S(X) denotes symbolic expressions over a set of symbolic constants X = fx1; x2; : : :g

Combining Symbolic Execution with Model Checking to Verify Parallel Numerical Programs Stephen F.

Siegel, Anastasia Mironova, George S. Avrunin, and Lori A. Clarke, ACM TOSEM 17, 2008 Definitions of FP correctness (2)

• Real Equivalence -- the weakest

– Two elements of S(X) are considered to be equivalent if one can be transformed to the other using any identities of real numbers, including those that do not hold for IEEE arithmetic:

• The associativity of addition or multiplication, he distributive property, …

• Two real equivalent programs would produce the same results if all computations were performed as real arithmetic, but they may produce different results when run on an actual computer, even one that implements