The Challenges of Writing Portable, Correct and High Performance Libraries for GPUs, or How to Avoid the Heroics of GPU Programming
Prof. Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
mel@coe.neu.edu

Outline
• Biomedical GPU applications at Northeastern
– Sparse Linear Solvers in SCIRun
– Lung Tumor Tracking
• The challenge of writing libraries for GPUs
– Example: Calculating Pi
• Correctness and GPUs
– PUG: A Symbolic Verifier for GPU Kernels
• Floating Point Arithmetic on GPUs
– Floating Point Correctness
• Future Directions

GPU Programming
• Currently we are focusing on NVIDIA and CUDA programming
• Also interested in AMD/ATI hardware and OpenCL as a programming model
– CUDA programming tools are more mature at the moment
– AMD has released their Fusion architecture
– Intel recently released Sandy Bridge, a chip that integrates processor and GPU on a single die
• Programming model?
– This is a rapidly changing field!
• For the rest of this talk, for GPU read “NVIDIA” and for GPU programming, read “CUDA”

GPU as a General Purpose Computing Platform
Speedups are impressive!
[Figure: speedups reported on the NVIDIA CUDA Zone, ranging from 470x to 2600x, for applications including real-time elimination of undersampling artifacts, the Lattice-Boltzmann method for numerical fluid mechanics, a genetic algorithm, total variation modeling, Monte Carlo simulation of photon migration, stochastic differential equations, k-nearest neighbor search, and fast total variation for computer vision]
Source: CUDA Zone

How to get 1000x speedup
• Match algorithm parameters to the GPU you are running on
• Even better, pick the algorithm parameters to match the GPU!
• Example: template matching
– Pick template sizes that fit in shared memory
– Pick image sizes that fit into texture memory
– Pick image sizes so that the number of pixels is a multiple of the optimal number of threads
• Hardware changes … reoptimize!
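As a rough illustration (not code from the talk), the relevant hardware limits can be queried at run time through cudaGetDeviceProperties and used to pick parameters, instead of hard-coding values tuned for one card; the heuristics below are assumptions made only for this sketch.

#include <stdio.h>
#include <cuda_runtime.h>

// Query the current GPU and derive launch parameters from its actual limits
// rather than hard-coding values tuned for one specific card.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs: %d, shared memory per block: %zu bytes, warp size: %d\n",
           prop.multiProcessorCount, prop.sharedMemPerBlock, prop.warpSize);

    // Example heuristics (assumptions, not from the talk): keep a template
    // tile within shared memory and round the thread count to a whole
    // number of warps.
    size_t maxTileBytes = prop.sharedMemPerBlock / 2;   // leave room for other data
    int threadsPerBlock = ((256 + prop.warpSize - 1) / prop.warpSize) * prop.warpSize;

    printf("max tile bytes: %zu, threads per block: %d\n",
           maxTileBytes, threadsPerBlock);
    return 0;
}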

Examples of heroic programming

How to write a Generic GPU library
• Cannot pick your parameters
– Have to work with the problem parameters
• Cannot optimize to the GPU architecture
– Different Nvidia GPUs have different
• Number of Streaming Multiprocessors (SMs)
• Sizes of memories
• Single or double precision Floating Point
• Can we still deliver performance?
• Can we guarantee correct behavior across multiple platforms?

Real World Applications
• Some applications we are working on at Northeastern:
– Linear Systems Solvers for SCIRun
– Lung Tumor Tracking
• Examples of library code for GPUs that must work over a range of parameters

University of Utah’s SCIRun
• SCIRun is a biomedical problem solving environment
• Allows scientists to create a network of mathematical functions
• GUI allows easy simulation design with the ability to draw the flow of data to many algorithms
• Many of these algorithms are time consuming
… and display parallelism!
This work was made possible in part by software from the NIH/NCRR Center for Integrative Biomedical Computing,
P41-RR12553-10. www.sci.utah.edu/cibc/software/106-scirun.html

SolveLinearSystem Module
• Solves Ax=b using iterative methods
• Matrices are large and sparse
• Contains several different algorithms that exhibit parallelism
• User selectable GPU check-box

Original CPU CG Layout
• All general algorithms exist at SCIRun’s module level
• The ParallelLinearAlgebra class implements CPU matrix and vector computations with basic optimizations (loop unrolling)
• Algorithms call these low-level math functions as an abstraction
• This structure lends itself to a convenient GPULinearAlgebra sibling
[Diagram: the SolveLinearSystem module (algorithms: CG, BCG, others) calls the ParallelLinearAlgebra (CPU) low-level math layer (Add, Multiply)]

Conjugate Gradient Algorithm
• Iterative solver
• Data in these problems are generally sparse
• Sparse matrix-vector multiplication (SpMV) provides good parallelism
• Limited by sparse data organization and algorithmic dependencies (both within and across iterations)
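For orientation, here is a minimal host-side sketch of the conjugate gradient iteration expressed in terms of low-level primitives (spmv, dot, axpy); the names and the CSR structure are illustrative assumptions, not SCIRun's ParallelLinearAlgebra interface.

#include <math.h>
#include <stdlib.h>

// Compressed sparse row storage (illustrative, not SCIRun's matrix type).
typedef struct { int n; int *rowptr; int *col; double *val; } CsrMatrix;

static void spmv(const CsrMatrix *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}
static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}
static void axpy(int n, double alpha, const double *x, double *y) {  /* y += alpha * x */
    for (int i = 0; i < n; i++) y[i] += alpha * x[i];
}

/* Solve A x = b; x holds the initial guess on entry and the solution on exit. */
void conjugate_gradient(const CsrMatrix *A, const double *b, double *x,
                        int max_iter, double tol) {
    int n = A->n;
    double *r  = (double *)malloc(n * sizeof(double));
    double *p  = (double *)malloc(n * sizeof(double));
    double *Ap = (double *)malloc(n * sizeof(double));

    spmv(A, x, Ap);                              /* r = b - A*x, p = r */
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);

    for (int it = 0; it < max_iter && sqrt(rr) > tol; it++) {
        spmv(A, p, Ap);                          /* the expensive SpMV step */
        double alpha = rr / dot(n, p, Ap);
        axpy(n,  alpha, p,  x);                  /* x += alpha * p  */
        axpy(n, -alpha, Ap, r);                  /* r -= alpha * Ap */
        double rr_new = dot(n, r, r);
        for (int i = 0; i < n; i++) p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
}

The point of the sketch is structural: every operation inside the loop is one of a handful of low-level math calls, which is exactly the layer that can be swapped for a GPULinearAlgebra sibling.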
Experiment Details
• Test Machine
– CPU: Intel Core 2 E6300, 1.86GHz, 2MB L2 cache, 1066MHz FSB
– GPU: NVIDIA GeForce 280 GTX, 1GB RAM, PCIe card
• Test conditions
– Tests were run 10 times each and averaged
– Test time is end to end – it includes all data transfer and setup overheads involved in the GPU version
• Test data
– Heart Ischemia Model
– University of Florida’s Sparse Matrix Collection
Input Data
[Two bar charts: number of rows and number of nonzeros for each test matrix]
• The sparse matrices vary in size from 6.3K to 3.5M rows
• Nonzeros vary from 42K to 14.8M
Conjugate Gradient
[Chart: GPU/CPU end-to-end speedup – nearly identical performance in each of 10 runs]
CPU: Intel Core 2 1.86GHz
GPU: NVIDIA GeForce GTX 280
Double precision is used in all examples
Jacobi and Minimal Residual
CPU: Intel Core 2 1.86GHz
GPU: NVIDIA GeForce GTX 280
Double precision is used in all examples

107K x 107K Heart Ischemia Model, time in seconds:

Algorithm   CPU      GPU     Speedup
CG          164.57   31.05   5.3x
Jacobi      7.42     1.46    3.4x
MinRes      81.96    11.80   6.9x
Third Party Implementations
• Many third party packages are available as open source
• They may perform better but are more difficult or impossible to incorporate into the user experience of SCIRun
• CNC Number Cruncher (CG implementation)
• Gocad Research Group – Nancy University, France
107K x 107K Heart Ischemia Model, time in seconds:

Algorithm      CPU      GPU     Speedup
CG             164.57   31.05   5.3x
3rd Party CG   164.57   27.98   5.9x

L. Buatois, G. Caumon, B. Lévy, "Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU," High Performance Computing and Communications, Third International Conference (HPCC), 2007.
Validation of Results
[Chart: difference in iterations to converge, plotted against CPU iterations to converge (507 to 25195)]
• Different orders of operations affect the number of iterations necessary to achieve the desired error
• Double precision is necessary to limit this
• The CPU and GPU differ in the number of iterations needed to converge by less than 1%
Discussion
• Speedup was achieved using the original CPU algorithm
• The only added operations are transferring the data to the GPU
• The algorithms were accelerated by a simple technique that can be applied to algorithms throughout SCIRun
[Figure: iterative portion of the Jacobi Method solver]
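The code in the figure is not reproduced here; as a stand-in, a minimal sketch of one Jacobi sweep for A x = b (dense storage used only to keep it short) shows why each row update is independent and therefore easy to offload:

// One Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i].
// SCIRun's matrices are sparse; dense indexing is used here just for brevity.
void jacobi_sweep(int n, const double *A, const double *b,
                  const double *x, double *x_new)
{
    for (int i = 0; i < n; i++) {          // every row is independent
        double s = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i) s += A[i * n + j] * x[j];
        x_new[i] = (b[i] - s) / A[i * n + i];
    }
}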
Where the Performance is Realized
• In SpMV, each row is computed by one thread
• Small number of rows = low utilization of GPU
• The other vector operations (mostly accelerated via the CUBLAS library) are relatively fast but occupy a low % of total time
[Pie chart: time distribution of operations]

Calculation                  CPU (ms)   GPU (ms)   Speedup
Data Copy and Setup          0.08       190.22     -2377.75x
Preconditioner               145.11     8.26       17.57x
SpMV                         130.68     9.37       13.95x
Subtract                     21.09      0.72       29.29x
Dot Product (2 per iter)     12.97      0.53       24.47x
Norm                         6.77       0.45       15.04x
Scale and Add (2 per iter)   19.62      0.72       27.25x
Total time per iteration     223.72     13.04      17.15x
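For reference, the one-row-per-thread SpMV pattern described above looks roughly like this in CUDA for CSR storage (an illustration, not the SCIRun kernel):

// Each thread computes one row of y = A * x for a matrix stored in CSR format.
// Matrices with few rows leave much of the GPU idle, which is why small
// problems see limited benefit.
__global__ void spmv_csr_row_per_thread(int num_rows,
                                        const int *rowptr, const int *col,
                                        const double *val, const double *x,
                                        double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
            sum += val[k] * x[col[k]];
        y[row] = sum;
    }
}

// Launch example: one thread per row, rounded up to whole blocks.
// spmv_csr_row_per_thread<<<(num_rows + 255) / 256, 256>>>(num_rows, rowptr, col, val, x, y);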
[Video: recordings of the CPU and GPU versions of the Conjugate Gradient algorithm running side by side]

Lung Tumor Tracking
Joint work with Ying Cui, Jennifer Dy, Gregory Sharp, Brian Alexander, and Steve Jiang from Northeastern University, Massachusetts General Hospital, Harvard Medical School, and University of California San Diego
• Some tumors move significantly during breathing
– Therapy targets large area with weak radiation – prevents damage to normal tissue
– Goal is to use higher-intensity focused radiation without implanted markers
• Two template-based tracking algorithms proposed:
– Motion-enhanced templates and Pearson’s correlation
– Eigen templates and mean-squared error
Tumor Marking Tool
Templates are generated with a MATLAB tool
Tumor is selected by a clinician

Motion Enhancement
[Images: static image vs. motion-enhanced image]
• Motion enhanced image is difference between the image and average of images used for templates
– Moving structures are emphasized
Image: Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang, "Multiple Template Based Fluoroscopic Tracking of Lung Tumor Mass without Implanted Fiducial Markers," Physics in Medicine and Biology, Vol. 52, pp. 6229-6242, 2007.

Multiple Templates
Image brightness and tumor position vary during breathing
Periodic: multiple representative templates from different points in the respiration cycle are used to compensate
Image: Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang, "Multiple Template Based Fluoroscopic Tracking of Lung Tumor Mass without Implanted Fiducial Markers," Physics in Medicine and Biology, Vol. 52, pp. 6229-6242, 2007.

Lung Tumor Reference Data Sets
Patient   Frames   Templates   Template Size   Shift ±V/±H
1         442      12          53×54           18/9
2         348      13          23×21           11/5
3         259      10          76×45           9/4
4         290      11          156×116         9/3
5         273      12          86×78           11/6
6         210      14          141×107         9/2
• Set of fixed large templates
– Significant variation in template sizes, which are much larger than is typical for template matching
• Small search with a single ROI per frame
• A different part of the problem space from traditional template matching
Similarity Score
• Pearson’s correlation for similarity score

corr2(A, B) = \frac{\sum_M \sum_N (A_{mn} - \bar{A})(B_{mn} - \bar{B})}{\sqrt{\left(\sum_M \sum_N (A_{mn} - \bar{A})^2\right)\left(\sum_M \sum_N (B_{mn} - \bar{B})^2\right)}}

• Sliding window operation
– Template data (A) is fixed for a given template
– ROI data (B) depends on window location and frame
• Average subtraction prevents a frequency-domain implementation
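As a point of reference, a straightforward CPU implementation of corr2 for a single window position might look like the following sketch (illustrative names and layout, not the authors' code):

#include <math.h>

// Pearson's correlation (corr2) between an M x N template A and the M x N
// window of the ROI whose top-left corner is (row_off, col_off).
// roi_stride is the width of the full ROI image in elements.
double corr2_window(const float *A, const float *roi, int M, int N,
                    int row_off, int col_off, int roi_stride)
{
    double meanA = 0.0, meanB = 0.0;
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            meanA += A[m * N + n];
            meanB += roi[(row_off + m) * roi_stride + (col_off + n)];
        }
    meanA /= (double)(M * N);
    meanB /= (double)(M * N);

    double num = 0.0, denA = 0.0, denB = 0.0;
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            double a = A[m * N + n] - meanA;
            double b = roi[(row_off + m) * roi_stride + (col_off + n)] - meanB;
            num  += a * b;      // numerator
            denA += a * a;      // template part of the denominator
            denB += b * b;      // window part of the denominator
        }
    return num / sqrt(denA * denB);
}

// The sliding-window search evaluates this at every shift position and keeps
// the location with the highest score.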
Similarity Score
• Pearson’s correlation for similarity score, split into a numerator C and a per-window denominator:

C = \sum_M \sum_N (A_{mn} - \bar{A})(B_{mn} - \bar{B})

corr2(A, B) = \frac{C}{\sqrt{A_D \cdot \sum_M \sum_N (B_{mn} - \bar{B})^2}}

where A_D = \sum_M \sum_N (A_{mn} - \bar{A})^2 is constant for a given template.

• Sliding window operation
– Template data (A) is fixed for a given template
– ROI data (B) depends on window location and frame
• Average subtraction prevents a frequency-domain implementation

Corr2() Numerator
C = \sum_M \sum_N (A_{mn} - \bar{A})(B_{mn} - \bar{B})

• Corr2 implemented as several kernels
– Average, denominator, and numerator
– The numerator is where most of the work is done
• The problem instance complicates mapping to CUDA
– Where to generate parallelism
– Few templates (10 to 14)
– Assume streaming (one ROI at a time)
– Search space is small (95 to 703 total positions)

Data Placement
• Common CUDA GPU paradigm: use shared memory to cache the working set
– Adjacent template locations share all but one row/column of frame data
– Shared memory blocking reduces global memory accesses
• Does not work for this case study
– Only the smallest template and ROI fit
– The largest templates are too large for constant or shared memory

Tiled Template (1)
C = \sum_M \sum_N (A_{mn} - \bar{A})(B_{mn} - \bar{B})

• The corr2() numerator’s outer loops are additions
• Break the template into tiles and process the subtemplates separately
– Generates more blocks for parallelism
– Reduces the working set size for shared memory
• A second step combines the partial results
– Reduction kernel

Tiled Template (2)
• Tiles are mapped to a grid of thread blocks
– Supports arbitrary template sizes
– One thread per shift location
• An efficient tile size may not match arbitrary template dimensions
– Leftover regions
– Corr2() complicates padding
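A hedged sketch of the tiled-numerator idea: one thread block handles one template tile, each thread handles one shift position, and per-tile partial sums are written out for a later reduction kernel. The tile size, launch geometry, and the assumption that the template dimensions are multiples of the tile size are illustrative simplifications, not the authors' implementation. Note that because the template terms (A − Ā) sum to zero over the whole template, the window mean B̄ cancels out of the numerator, so only the mean-subtracted template is needed here.

#define TILE_H 8
#define TILE_W 8

// One block per template tile, one thread per shift position (assumes
// blockDim.x >= num_shifts_v * num_shifts_h and tile-aligned template sizes).
// tmpl holds A - mean(A); roi holds the raw ROI frame.
__global__ void corr2_numerator_tiled(const float *tmpl, int tmpl_w,
                                      const float *roi, int roi_w,
                                      int num_shifts_v, int num_shifts_h,
                                      float *partial)   // [numTiles x numShifts]
{
    __shared__ float tile[TILE_H][TILE_W];

    // Cooperative load of this block's template tile into shared memory.
    int tid = threadIdx.x;
    for (int i = tid; i < TILE_H * TILE_W; i += blockDim.x) {
        int r = blockIdx.y * TILE_H + i / TILE_W;
        int c = blockIdx.x * TILE_W + i % TILE_W;
        tile[i / TILE_W][i % TILE_W] = tmpl[r * tmpl_w + c];
    }
    __syncthreads();

    int num_shifts = num_shifts_v * num_shifts_h;
    if (tid < num_shifts) {
        int sv = tid / num_shifts_h;           // vertical shift for this thread
        int sh = tid % num_shifts_h;           // horizontal shift for this thread
        float sum = 0.0f;
        for (int r = 0; r < TILE_H; r++)
            for (int c = 0; c < TILE_W; c++) {
                int rr = blockIdx.y * TILE_H + r + sv;
                int cc = blockIdx.x * TILE_W + c + sh;
                sum += tile[r][c] * roi[rr * roi_w + cc];
            }
        int tileIdx = blockIdx.y * gridDim.x + blockIdx.x;
        partial[tileIdx * num_shifts + tid] = sum;  // combined later by a reduction kernel
    }
}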
Tiled Template Example
• Edge tiles adjust only one dimension of the main tiles
• Patient 4 template size: 156 × 116 pixels

Main Tile   Main     Right Tile   Right    Bottom      Bottom   Corner
Size        Blocks   Size         Blocks   Tile Size   Tiles    Tile Size
8×8         19×14    8×4          19       4×8         14       4×4
16×10       9×11     16×6         9        12×10       11       12×6
4×4         39×29    –            –        –           –        –

Remaining Kernels
C = \sum_M \sum_N (A_{mn} - \bar{A})(B_{mn} - \bar{B})

corr2(A, B) = \frac{C}{\sqrt{A_D \cdot \sum_M \sum_N (B_{mn} - \bar{B})^2}}

• Frame averages and the denominator contribution are computed similarly to the numerator
– Tiled and reduction steps
– Simpler than the numerator
• The final fraction is an element-wise operation

Experimental Setup
• Benchmarked arbitrary tile sizes
– 4×4 to 16×16
– Thread counts arbitrarily fixed
• Compared against
– MATLAB (optimized)
– Multi-threaded C (non-vectorized)
• Benchmarking hardware
– Intel Xeon W3580 (4 i7 cores @ 3.33 GHz, 6MB L2)
– Nvidia Tesla C1060 with CUDA 3.1
– Ubuntu 10.04 64-bit (GCC 4.4.3) and MATLAB R2010a

Performance
• Good performance across patients
– Includes data transfer
• Limited parallelism available for patient 2
– Small template and small ROI

GPU vs. multithreaded C:

Patient          1         2        3       4       5         6
Best Tile Size   8×8       16×2     4×4     8×8     16×10     8×8
Template Size    141×107   53×54    23×21   76×45   156×116   86×78
Total Speedup    4.09      0.64     3.61    6.92    5.66      6.86
Performance Analysis
• Numerator dominates
– GPU runtime
– Most complicated
• Processes templates independently
• Other stages applied once per frame
• Still room for optimization

Real Applications, Real Data
• Biomedical Applications
– Linear Solvers in SCIRun
– Tumor tracking
• These are real world apps with real data … and real speedup!
• Speedup shown is end to end application speedup
• Algorithms NOT chosen to show off best of a given GPU architecture
• For accelerating SCIRun and Matlab (lung tumor tracking) want generic solutions
– Run on a family of GPUs
– Run on a variety of different data sizes
• Why is this hard?
• How do we know the GPU results are correct?

Nvidia GPU differences
GPU                Release Date   SMs | Cores   Clock    Memory   Floating Point
GeForce 8800 GTX   Nov. 2006      16 | 128      575MHz   768MB    SP
GeForce 8800 GS    Jan. 2008      12 | 96       550MHz   384MB    SP
GeForce 280 GTX    Jun. 2008      30 | 240      602MHz   1024MB   SP/DP
Tesla C1060        Jun. 2008      30 | 240      602MHz   4096MB   SP/DP

• All GPUs in the table have 16KB shared memory per SM
• All launch 32 threads per warp

Example: Sequential Computation of Pi
static long num_steps = 100000;
double step;
void main()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
}
• From an OpenMP tutorial given at SC’99 by Tim Mattson and Rudolf Eigenmann
Parallelizing Sequential Pi
static long num_steps = 100000;
double step;
void main()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
}
• Each calculation of x inside the for loop is independent
• First pass: each CUDA thread calculates x and 4.0/(1.0+x*x)
• Second pass: a reduction addition computes sum
• Final step: pi = step * sum is done on the CPU

Snippet of CUDA for first pass
__global__ void calculateStepKernel(double *ref_d, double step, int length)
{
    extern __shared__ double sdata[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x;
    sdata[lid] = (tid < length) ? ref_d[tid] : 0;
    if (tid == 0) sdata[lid] = 0.0;
    if (tid > 0 && tid <= length)
        sdata[lid] = 4.0 / (1.0 + ((tid - 0.5) * step) * ((tid - 0.5) * step));
    ref_d[tid] = sdata[lid];
}

CUDA for reduction sum
if (tid < length)
{
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x && index + s <= length)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
}
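For completeness, a hedged sketch of the host side of the first pass (it assumes the calculateStepKernel above is compiled in the same file; for brevity the per-step terms are copied back and summed on the host, standing in for the GPU reduction pass):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    const int num_steps = 100000;
    const double step = 1.0 / (double)num_steps;
    const int threads = 256;
    const int blocks = (num_steps + threads) / threads;   // covers tid = 1 .. num_steps
    const int total = blocks * threads;

    double *ref_d;
    cudaMalloc((void **)&ref_d, (size_t)total * sizeof(double));
    cudaMemset(ref_d, 0, (size_t)total * sizeof(double));

    // First pass: one term per thread, written into ref_d.
    calculateStepKernel<<<blocks, threads, threads * sizeof(double)>>>(ref_d, step, num_steps);

    // Copy the terms back and sum them on the host (the talk's version uses
    // the GPU reduction shown above instead).
    double *terms = (double *)malloc(total * sizeof(double));
    cudaMemcpy(terms, ref_d, (size_t)total * sizeof(double), cudaMemcpyDeviceToHost);

    double sum = 0.0;
    for (int i = 0; i < total; i++) sum += terms[i];
    printf("pi = %.15f\n", step * sum);                    // final step on the CPU

    cudaFree(ref_d);
    free(terms);
    return 0;
}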
We run the reduction code

GPU                Release Date   SMs | Cores   Clock    Memory   Result
GeForce 8800 GTX   Nov. 2006      16 | 128      575MHz   768MB    ✗
GeForce 8800 GS    Jan. 2008      12 | 96       550MHz   384MB    ✓
GeForce 280 GTX    Jun. 2008      30 | 240      602MHz   1024MB   ✓
Tesla C1060        Jun. 2008      30 | 240      602MHz   4096MB   ✓

• It works on 3 out of 4 GPUs. Why?

__syncthreads() is in the wrong place
Correct:
for (unsigned int s = 1; s < blockDim.x; s *= 2)
{
    int index = 2 * s * threadIdx.x;
    if (tid < length)
    {
        if (index < blockDim.x && index + s <= length)
            sdata[index] += sdata[index + s];
    }
    __syncthreads();
}

Incorrect:
if (tid < length)
{
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x && index + s <= length)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
}
• CUDA development forums say:
– "the behavior is undefined if you disobey the rule to hit all the same synchronizations with all threads"
– CUDA-gdb did not find the bug

Pi calculation lessons
• Cannot test my CUDA (or OpenCL) library kernels on every GPU model
• Library developer needs better tools to check for correctness
– CUDA GDB is not sufficient
– Other more powerful tools, but not symbolic
• PUG: A Symbolic Verifier for GPU Kernels developed at the University of Utah by Ganesh
Gopalakrishnan’s group
• http://www.cs.utah.edu/formal_verification/PUG/

PUG Contributions (Ganesh Gopalakrishnan)
• PUG is more comprehensive than other approaches
• PUG guarantees the absence of these errors (two threads modeled):
– Data races
– Mismatched barriers ( __syncthreads() )
– Assertions on values placed by users
• PUG helps understand ranges of parameters / dimensions
– Many “bugs” are correct code run outside allowed dimensions
• V0.3 of PUG (under development) can verify for arbitrary number of threads:
– A large class of practical kernels
• Useful for comparing unoptimized and optimized kernels
• PUG gives valuable information on the degree of bank conflicts
• PUG uses a symbolic FV approach
– Symbolic approach can be very effective for CUDA
Floating Point Correctness
• Most published work on floating point correctness predates the release of double precision GPUs
• Now GPUs have double precision
– Do we need to worry about correctness anymore?
• Correctness of individual operations
– add, sub, div, sqrt
– These are all IEEE Floating Point Compliant
• Does that mean that code produces the same answer on every CPU or GPU that is IEEE Floating Point compliant?

Example: Sequential Computation of Pi
static long num_steps = 100000;
double step;
void main()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
}
• From an OpenMP tutorial given at SC’99 by Tim Mattson and Rudolf Eigenmann
What is this code doing?
• It is performing an integral of the derivative of the arctangent.
• The derivative of the arctangent is 1/(1+x²).
• A definite integral from 0 to 1 gives arctan(1) - arctan(0)
– which is pi/4 - 0, or just pi/4
• The code folds the factor of 4 into the integral.
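In symbols, the quantity being approximated is

\[
\pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \sum_{i=1}^{N} \frac{4}{1+x_i^2}\,\Delta x,
\qquad \Delta x = \frac{1}{N},\quad x_i = \left(i-\tfrac{1}{2}\right)\Delta x,
\]

with N = num_steps; each term of the sum is independent, which is what both the OpenMP and the CUDA versions exploit.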
Approximating an Integral
• The code is doing a Riemann sum approximation of the integral using segment midpoints, with num_steps segments.
[Figure: midpoint rectangles of width Δx under the curve]
static long num_steps = 100000;
double step;
void main()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
}
• What I learned in high school math:
– The more rectangles, the closer to the exact solution

Single Precision – No Optimization
[Chart: Pi results on the Intel CPU vs. number of steps, single precision, no optimization]

Single Precision – No Optimization
[Chart: Pi results, Intel CPU vs. NVIDIA GTX 280, vs. number of steps, single precision, no optimization]

Correctness of Operations vs. Correctness of Applications
• Floating point arithmetic is not associative!
• Manycore processing reorders operations
– Results are not necessarily the same
– Is the CPU version correct?
• Example: summation (see the sketch below)
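A small illustration (not from the talk) of how summation order alone changes a single-precision result; the data values are arbitrary, and the point is the difference, not the exact digits:

#include <stdio.h>

// Sum the same array two ways: a single running accumulator (like the CPU
// loop) and blocked partial sums (closer in spirit to a GPU reduction).
int main()
{
    enum { N = 1 << 20 };
    static float a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1.0f / (float)(i + 1);          // arbitrary data with mixed magnitudes

    float serial = 0.0f;
    for (int i = 0; i < N; i++) serial += a[i];

    float blocked = 0.0f;
    for (int i = 0; i < N; i += 1024) {        // partial sums of 1024 elements
        float part = 0.0f;
        for (int j = i; j < i + 1024; j++) part += a[j];
        blocked += part;
    }

    // The two results generally differ in the low-order digits, even though
    // both are valid sequences of IEEE single-precision additions.
    printf("serial  = %.8f\nblocked = %.8f\n", serial, blocked);
    return 0;
}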
Floating Point Arithmetic and Pi
• Are the results the same on every platform?
• We tested on 2 CPUs and 2 GPUs:
– CPU - System 1: Intel Xeon W3580 @ 3.33GHz
– CPU - System 2: Intel Xeon E5520 @ 2.27GHz
– GPU - System 1: Tesla C1060
– GPU - System 2: ATI Radeon 5700
• Compared results to the value of Pi from Wikipedia:
– 3.141592653589793238462643383279502884197169399375105820974944…

Different Pi Calculations Tested
• Single precision with no optimization
– On the CPU, calculate the step and keep a running sum over the loop to compute pi (as written)
– On the GPU, one kernel calculates the term for each iteration of the loop (all at once, into an array), then a reduction sum of the array computes pi
• Single precision with optimization
– On the CPU, instead of one accumulator, 100 elements are summed at a time, then this partial sum is added to the accumulator (see the sketch after this list)
• Double precision with no optimization
– Same as single precision but in double
• Double precision with optimization
– Same as single precision but in double
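A hedged sketch of the "optimized" CPU accumulation described above; the 100-element block size comes from the slide, everything else just mirrors the earlier Pi loop in single precision:

#include <stdio.h>

int main()
{
    const long num_steps = 100000;
    const float step = 1.0f / (float)num_steps;

    float sum = 0.0f;
    for (long i = 1; i <= num_steps; i += 100) {
        float partial = 0.0f;                        // sum 100 terms at a time
        long end = (i + 100 <= num_steps + 1) ? i + 100 : num_steps + 1;
        for (long j = i; j < end; j++) {
            float x = (j - 0.5f) * step;
            partial += 4.0f / (1.0f + x * x);
        }
        sum += partial;                              // add the partial sum to the accumulator
    }
    printf("pi = %.7f\n", step * sum);
    return 0;
}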
Single Precision with no optimization
• The CPUs get identical results to each other (expected)
– They reach their most accurate value of Pi at 512 steps, but as the number of steps grows the result gets worse; at 4 million steps the result is not even accurate to 2 digits
• The GPUs perform identically to each other
– They reach their best result at 1024 steps (an order of magnitude closer to Pi than the CPU ever achieves)
– For as many steps as the GPU versions were run (4 million or more), they continue to equal their best Pi result
• The reduction style of calculation limits the loss of precision when summing large amounts of data

Single Precision with Optimization
• The CPUs now both achieve the same result as the GPUs at 1024 steps and still hold a good value of Pi through 67 million steps
– After 1024 steps the CPU result still degrades a bit from its best value
– The GPU behaves the same as it did before the optimization
[Chart: Pi results, Intel CPU vs. NVIDIA GTX 280, vs. number of steps (0 to 4,000,000), single precision with optimization]

Double Precision
• Double precision with no optimization
– The CPUs achieve their best Pi at about 1 million steps
– The GPU doesn't reach its best until 16 million steps
• the last run that fit, due to the ATI GPU's 256MB memory limit
– After 1 million steps, the CPU degraded as it did in the single precision tests
– The GPU may still reach even better values beyond 16 million steps, as its result was still improving at that point
• Double precision with optimization
– The CPU reaches its best Pi at 8 million steps this time

Floating Point and Pi
• A reduction sum gives a more accurate floating point result than an accumulated sum
– There is a point at which the CPU's accuracy starts degrading, while the GPU results remain close to their best value through much larger tests
• The CPU could be coded using reductions to match the GPU accuracy
– This results in slower run times
• Calculating Pi is pretty simple!

Definitions of FP correctness
• Herbrand Equivalence -- the strongest
– Two symbolic expressions S(X) are Herbrand equivalent if and only if they are exactly equal. Two Herbrand equivalent programs will produce the same results, independently of the way in which the arithmetic operations are implemented.
• IEEE Equivalence -- the middle road
– Two IEEE equivalent programs must produce the same output on any platform implementing IEEE arithmetic
– There are a number of identities for real arithmetic that also hold for IEEE arithmetic
• x + y = y + x,  xy = yx,  1x = x1 = x,  x + 0 = 0 + x = x,  x/1 = x
– Two symbolic expressions S(X) are considered to be IEEE equivalent if one can be transformed to the other by a finite sequence of transformations corresponding to such identities.


S(X) denotes symbolic expressions over a set of symbolic constants X = {x1, x2, ...}

S. F. Siegel, A. Mironova, G. S. Avrunin, and L. A. Clarke, "Combining Symbolic Execution with Model Checking to Verify Parallel Numerical Programs," ACM TOSEM 17, 2008.

Definitions of FP correctness (2)
• Real Equivalence -- the weakest
– Two elements of S(X) are considered to be equivalent if one can be transformed to the other using any identities of real numbers, including those that do not hold for IEEE arithmetic:
• The associativity of addition or multiplication, the distributive property, …
• Two real equivalent programs would produce the same results if all computations were performed in real arithmetic, but they may produce different results when run on an actual computer, even one that implements IEEE arithmetic.
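As a small illustration (not from the cited paper) of the gap between real equivalence and IEEE equivalence: the two expressions below are equal over the reals, yet reassociating the additions changes the IEEE double-precision result.

#include <stdio.h>

int main()
{
    // (a + b) + c and a + (b + c) are real-equivalent but not IEEE-equivalent:
    // reassociating the additions changes which rounding errors occur.
    double a = 1e16, b = -1e16, c = 1.0;

    double left  = (a + b) + c;   // (a + b) is exactly 0, so the result is 1
    double right = a + (b + c);   // (b + c) rounds back to -1e16, so the result is 0

    printf("(a + b) + c = %g\n", left);
    printf("a + (b + c) = %g\n", right);
    return 0;
}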