Nathaniel HartWinter 2014 Term ReportPage 1
UW Bothell Computing & Software SystemsTerm Report:
Winter 2014, MASS CUDA Parallel-Computing Library
Nathaniel Hart
3-19-2014
Contents
Work Summary
CUDA Resources
Hardware
Language
Advanced Programming Methodologies
Source Files
Programming Patterns
Suggested Improvements
Future Work
Bibliography
Appendix A
Work Summary
For the past quarter, I have been pursuing the following goals:
- Understand CUDA programming
- Understand other versions of MASS (Java, C++)
- Write the CUDA specification
The overarching goal was to prepare for a quarter of CSS 600 independent study where I could spend my time implementing the specification that I wrote this quarter. This paper describes how I went about attaining these goals, and how this has prepared me for the implementation phase of this project. This paper is also intended to serve as a guide to future students who work on this project in order to accelerate their ramp up on CUDA and the library’s source code.
In pursuit of these goals, I sought out resources on the CUDA language, reviewed all readily available publications about CUDA and GPGPU (general purpose GPU, or the practice of using GPUs for non-graphics applications). I also reviewed the structure and code of the CUDA and C++ versions, read prior students’ term reports and theses, conducted minor testing to explore potential architectural solutions, and adapted the MASS C++ specification to the unique needs and requirements of the CUDA project.
The majority of the work put forth this quarter was pure learning. As such, there is no executable code to show for this quarter. However, the existing code can be found on the UWB Linux file system at:
/net/metis/home3/MASS/multi-gpu-mass/master/src
This is the code that was generated by a previous student, Rob Jordan, and is explained by his term paper. As part of my examination of this code, I created a class diagram showing the relationships between the various classes and resources (see Appendix A).This diagram should be updated prior to making any changes the source code, as it will help others understand, use, and modify this library.
CUDA Resources
There are three primary areas of study when it comes to CUDA: the hardware architecture and its limitations, the CUDA language constructs and syntax that are different from conventional C++, and advanced programming methodologies to either exploit or work around the unique strengths and limitations of the hardware.
Hardware
To learn the CUDA language and how it applies to GPU hardware architecture, I primarily used a Coursera course entitled “Heterogeneous Parallel Programming”. I would highly recommend this course to any future CUDA students, as it covers the majority of the high-level concepts necessary to become self-sufficient on CUDA. I also used the accompanying textbook. This course covers the architecture of a GPU, the hardware threading model, the limitations of SIMD (Single Instruction Multiple Data) processors and PCIe communication, and the CUDA language syntax.
The type of GPU being used also has an impact on the programming techniques you can use. As such, I’ll note the GPUs that Hercules has two GPUs:
- A Quadro NVS 295 with compute capability 1.1. This is a severely limited computing capability, and only supports a subset of the C programming language.
- A GeForce GTX 680 with compute capability 3.0. This is good, as it can support virtual functions and inheritance.
Compute capability has a strong effect on the commands you can issue in CUDA and the programming architecture you can use. For instance, the Quadro NVS 295’s capability of 1.1 will not allow the use of virtual functions or inheritance, whereas the GeForce GTX 680’s capability of 3.0 does. As inheritance and virtual functions are key to the software architecture of the MASS CUDA library, this means that testing any multi-GPU functionality will need to occur on a different computer, or will require the purchase and installation of a second GPU with a compute capability of 3.0 or greater.
Language
Once you have familiarized yourself with the hardware, you should also look at the CUDA C Programming Guide. This goes into much greater depth on the CUDA language, including programming the full API, technical specifications, and code samples. Between the Heterogeneous Parallel Programming course and this document, any student should be able to fully understand the existing MASS CUDA code base.
Further readings should include the term papers by other CUDA project contributors, specifically Tosa Ojiru and Piotr Warczak. Tosa’s thesis gives an excellent presentation of the entire problem space, from the high-level goal of MASS to the specifics of exchanging data across GPUs. It is an excellent read and will contribute greatly to the understanding of this project.
Furthermore, Piotr’s last publication “Coordinating Multiple GPU Devices to Run MASS Applications” provides an excellent overview of how CUDA programs execute and the benefits provided by GPGPU.
Advanced Programming Methodologies
There are some more advanced programming techniques that will be necessary to understand in order to fully understand the project at hand and prevent performance degradation. The following are the most vital:
Branching Statements:
The SIMD processors in a CUDA GPU require that all threads being executed in a warp (batch) be executing the same instruction at the same time, albeit with different data. This means that a GPU can easily handle a statement like ‘x += 5’ even though each thread has a different value of x. The following statement might be a problem:
1:if( x < 10 ) {
2: x *= 3;
3:} else {
4: x += 5;
5:}
If all threads contain a value of x such that the conditional statement evaluates to either true or false for all threads in the warp, there is no problem. If, however, some of the threads in this warp have a value less than 10, and some greater, a SIMD processor MUST execute the same instruction in all threads at the same time, regardless of the branching flow of control. To solve this problem, both brancheswill be executed for all threads in the warp, first one, then the other, and each thread will keep the result from the branch that “belongs” to that thread. This means that each block of a branching statement is executed sequentially by the GPU.If this type of branching occurs frequently, you can start to see a significant performance degradation.
Interleaving Data Transfer with Streams:
NVidia’s hardware allows for multitasking when performing IO operations and computing results.If you are processing a large amount of data, it can be processed in chunks while minimizing the amount of time the GPU is not processing results. This means that you can simultaneously be copying data into the GPU, performing computation on data already in the GPU, and copying out results from a previous computation. This is achieved with CUDA streams, and involves queuing up instructions to the GPU asynchronously. While powerful, this requires the use of three streams per GPU, and coordinating this flow is quite complex to achieve.
In reality, the process of copying data in and out typically takes significantly longer than the computation, so there is very little performance hit if you only use two streams: one to copy in data, and one to issue kernel instructions copy out results. In the case of simple computations, use of two streams will be adequate for most applications.
Source Files
There are many existing files, but documentation is spotty at best. Here is a list of the files, with a short description of what each does. Review this in conjunction with the class diagram in Appendix A.
•bytes.h
◦Contains byte conversion macros go from bytes to KB, MB, or GB.
•cudaUtil.h & .cu
◦Contains error checking macros to use when calling device or global functions.
◦Contains multi-gpu util functions to sync multiple devices.
•gpulauncher.h
◦Contains three kernel functions to support creation of places, places.updateAll(), and places.callAll().
◦Labeled as a .h file, but contains full implementation details due to compiler constraints.
•grid2d.h
◦Labeled as a .h file, but contains full implementation details due to compiler constraints.
◦No documentation as to purpose of this class.
◦No private field documentation.
•main.cu
◦This is the main method for the implementation of Wave2D.
•mass.h & .cu
◦Implements the one of the three main portions of the MASS library.
•place.h
◦Labeled as a .h file, but contains full implementation detailsdue to compiler constraints.
◦Each place has a dimension? Does it need to know about overall grid to function?
◦Knows about neighbors? Can't coordinates be used to access neighbors?
•Places.h & .cu
◦Better documentation.
◦Spend time studying the “border” functions (exchangeBOrders, computeBorderCells). These appear to require extensive communication, and exact functionality isn't clear at this time.
•Range2d.h
◦Labeled as a .h file, but contains full implementation detailsdue to compiler constraints.
◦Zero documentation!
◦Appears to define a range as an upper left and lower right point in 2d cartesian plot.
◦Nothing seems to enforce this ordering of start and end (i.e. start.x could be larger than end.x) This could lead to an effective range of zero. USE CAREFULLY.
◦GetDimensions() returns a statically allocated dim3 object. Wouldn't this go out of scope and be unreliable?
•Stripe2d.h
◦Labeled as a .h file, but contains full implementation details due to compiler constraints.
◦Contains mysterious “TODO handle ghosts == 0?” What happens when there is no shadow space?
Programming Patterns
There are some excellent patterns already existing in the CUDA source code. The first is an excellent composition of Place indexing via a layering of various indexing modules. The most basic element is a point, which is mostly an X and Y value. This will need a Z value to become a true 3D spatial simulator. Each point is part of a range, which provides the logic of distance and global indexing. A range is used to describe a stripe, which is a portion of the Places object that is assigned to a single GPU. A grid is comprised of all stripes across all GPUs and contains the logic for tracking the location of each Place in the distributed Places array. This is a good separation of concerns and allows for the maximum reuse of code, as a multiple grids can be layered to create a 3D Places environment.
Also, the details of CUDA kernel functions is hidden from the end user by wrapping each kernel function in a standard C++ function call. This way, the end user can use a familiar API to use the GPU without actually needing to know anything about CUDA.
This pattern allows for the creation of place and agent objects in parallel directly in the GPU rather than creating them on the host computer and then copying them to the GPU. This is potentially a large speed up, and should be maintained in future versions.
Suggested Improvements
My initial review of the code led to some suggestions regarding the documentation and details surrounding the project. The number one obstacle to understanding this code is the lack of comments detailing WHY something was done, or how a future developer could use a feature.
There needs to be an effort put into documenting not what the code does, but why the code was written in the way it was.
- Have MASS::init( ) do more to define the number of GPUs, device streams, etc… Too much of that is handled by the main function right now.
- Documentation could be done via doxygen or a similar tool.
- Documentation needs to be addedfor many classes as to purpose, or possible use of class.
- It may be possible to simplify development within the library and add stability by using the CUDA Thrust library. This library contains many helper functions like sort, and are designed for GPGPU.
- May be made more readable/user friendly by wrapping CUDA functions in well labeled wrappers that contain error checking and calculate common parameters automatically. i.e:
1:void allocate(void *args){
2:ERROR_CHECK( cudaMalloc(args, sizeof(args) ) );
3:}
- Would be better to make abstract methods in place and agent, and have end user extend the class and override those methods. This would hide many of the implementation details from the end user. Now possible with compute capability 3.0.
- Add“const correctness” on functions that pass by reference.
Future Work
I plan on approaching this project using Test Driven Development to ensure quality. Thus, before beginning work, I need to determine a test plan, not only to demonstrate performance gains, but to evaluate if the software I implement meets the specification above. Whether this involves writing an actual test program, or just crunching random numbers as quickly as possible, has yet to be determined. Once a testing plan is in place, I will plan a roadmap for the next 10 weeks, laying out checkpoints, milestones, and goals to guide the implementation.
As soon as that occurs, I will place the existing source code into a revision control system and begin implementing the new specification, modifying the code where necessary and adding the new agent features. Each new change or feature will require unit tests, which will be written prior to changing the existing code. This will limit the amount of time spent debugging code, and reduce the number of errors that need to be solved as the project progresses, as well as providing future students tests that can be run when they modify the code themselves.
This will hopefully result in high-quality, maintainable code that will meet the immediate needs of the project and be maintainable as the project adaptsto unforeseen requirements of future users.
Bibliography
Hwu, W.-m. W. (2014). Heterogeneous Parallel Programming. Retrieved from Coursera:
Jordan, R. (2012, December). MASS: A Parallelizing Library for Multi-Agent Spatial Simulation. Retrieved from dslab home page: https://depts.washington.edu/dslab/MASS/reports/RobJordan_au12.pdf
Kirk, D. B., & Hwu, W.-m. W. (2010). Programming Massively Parallel Processors: A Hands-on Approach. Burlington, MA: Elsevier Inc.
NVidia. (2013, July 19). CUDA C Programming Guide. Retrieved from NVidia Developer Zone:
NVidia. (2013, March). Thrust. Retrieved from NVIDIA Developer Zone:
Ojiru, T. (2012). Implementing the Multi-agent spatial simulation (MASS) library on the Graphics Processor Unit. Retrieved from dslab home page: https://depts.washington.edu/dslab/MASS/reports/TosaOjiru_thesis.pdf
Warczak, P. (2012). Coordinating Multiple GPU Devices to Run MASS Applications. Retrieved from dslab homepage: https://depts.washington.edu/dslab/MASS/reports/PiotrWarczak_sp12.docx
Appendix A
Figure 1 Wave2D Class Diagram
UW Bothell Computing & Software Systems