Fall 2011

Implementing MASS on GPU

Introduction

Developing applications that deliver the best possible performance is what drives researchers and industry to look for ways of improving run times. Graphics applications like rendering and ray tracing, and non-graphics applications like weather prediction, would be unusable without significant effort put into improving their run times through parallelism and multiple processing units. The key to such performance improvement is taking advantage of data parallelism and/or task parallelism. For example, in a rendering program, a single-threaded routine can be written to draw one pixel; data parallelism can then be applied to perform that same draw-pixel operation on every pixel, reducing the total completion time by running the work in parallel across many threads (Nickolls & Dally, 2010). A sketch of this one-thread-per-pixel pattern is given below.
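As an illustrative sketch only (not code from the MASS library), the CUDA kernel below applies the one-thread-per-pixel pattern described above. The names fill_pixel, draw_all_pixels, and d_image are hypothetical and serve only to show how data parallelism maps one thread to one data element.

    // Each GPU thread computes the coordinates of its own pixel and draws it.
    __global__ void fill_pixel(unsigned char *image, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            image[y * width + x] = 255;   // this thread "draws" one pixel
        }
    }

    // Hypothetical host-side launch: one thread per pixel, grouped into
    // 16x16 blocks. d_image is assumed to already reside in device memory.
    void draw_all_pixels(unsigned char *d_image, int width, int height)
    {
        dim3 threads(16, 16);
        dim3 blocks((width + 15) / 16, (height + 15) / 16);
        fill_pixel<<<blocks, threads>>>(d_image, width, height);
    }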

Background on MASS

Multi-Agent Spatial Simulation (MASS) is a parallel computing library whose design is based on multiple agents, each acting as a simulation entity that resides in a network-independent array spanning multiple computing nodes. The application space of the MASS library covers the single-program multiple-data (SPMD) programming model, i.e., applications that employ parallel algorithms. Example applications of the MASS library include molecular dynamics, Schrödinger's wave equation, Fourier's heat equation, battle games, and computational fluid dynamics.

The MASS library is a collection of APIs that abstract away the low-level parallelism required to speed up an application written with the library's functions. Harnessing the power of GPUs, these low-level routines spread their computation across thousands of lightweight parallel threads that execute on the computing devices that house the GPUs.

GPU APIs

Programmable GPUs have been available since 2001 (Nickolls & Dally, 2010), at first for graphics applications only, with NVIDIA's GeForce 3 programmable through OpenGL and DirectX 8. By 2006, NVIDIA had introduced the first unified graphics and computing GPU architecture, programmable in C with CUDA as well as through DirectX 10 and OpenGL. OpenCL and DirectCompute are other GPU APIs with programming models similar to CUDA's. OpenCL was originally developed by Apple Inc. and has since been adopted by GPU manufacturers such as Intel, AMD, NVIDIA, and ARM. OpenCL's programming model targets a heterogeneous set of GPUs, while NVIDIA's CUDA API is mainly compatible with GPUs made by NVIDIA. In this study, NVIDIA's CUDA is the API chosen for implementing the MASS library.

Why CUDA?

CUDA was developed by NVIDIA in 2006 and is essentially an extension to the C programming language that lets functions be defined to run on NVIDIA GPUs. More recently, as of CUDA 4.1, advances in the CUDA API may make it possible in the near future to target GPUs from other manufacturers such as AMD and Intel. CUDA has been used to write thousands of applications (Owens et al., 2008), and numerous scientific studies on GPU computing use CUDA as the GPU API of choice.

Implementing CallAll() using CUDA

CallAll()'s functionality is to execute a user-specified function on every place in Places, using CUDA's API to spawn one GPU thread per place so that the function runs in parallel across all places. A minimal kernel sketch is given below.
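The sketch below shows one way CallAll() could be realized with CUDA, assuming a simplified Place struct with a single index field; set_place_indexes() and call_all_kernel are illustrative names rather than the exact MASS implementation.

    // Simplified stand-in for the MASS Place element.
    struct Place {
        int index;   // position of this place in the Places array
    };

    // User-specified device function applied to a single place:
    // here it simply records the place's own index.
    __device__ void set_place_indexes(Place *p, int i)
    {
        p->index = i;
    }

    // One GPU thread is spawned per place; each thread applies the
    // user function to its own element, so all places update in parallel.
    __global__ void call_all_kernel(Place *places, int size)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size) {
            set_place_indexes(&places[i], i);
        }
    }

Host code would allocate a Place array on the device, copy the data over, launch something like call_all_kernel<<<(size + 255) / 256, 256>>>(d_places, size), and copy the results back; the index update shown here is the kind of operation testplaces.cu exercises.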

The output of running the test program testplaces.cu shows the result of parallel GPU threads being used to update the index of each Place in Places.

Performance Evaluation

In terms of performance, the CallAll() implementation above spawns one thread for each place, thereby updating the index of every place simultaneously. For a larger array of places, e.g., 500 place elements, each Place element is still modified by calling set_place_indexes() to update its index. Therefore, as long as the number of places N does not exceed the number of threads the GPU can execute concurrently, the parallel running time of applying set_place_indexes() to every place in Places is O(1). However, for a function sample_function() whose own cost is O(N), the time to run it over all Places through CallAll() on CUDA will be O(N), as the sketch below illustrates.
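To make the O(N) case concrete, the hypothetical device function below (reusing the Place struct from the earlier sketch) stands in for sample_function(): every place still gets its own thread, but each thread performs O(N) work of its own, so the overall parallel time cannot drop below O(N).

    // Hypothetical stand-in for sample_function(): even though CallAll()
    // gives every place its own thread, each thread loops over all N
    // places, so the parallel running time stays O(N) rather than O(1).
    __device__ void sample_function(Place *places, int size, int i)
    {
        int sum = 0;
        for (int j = 0; j < size; ++j) {   // O(N) work inside one thread
            sum += places[j].index;
        }
        places[i].index = sum;
    }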

Limitations

Despite the substantial gains that can be achieved by using GPUs to implement the MASS library functions, there are some drawbacks, described below:

No support for automatic memory allocation on GPU

MASS functions can be implemented to run on the GPU using GPU APIs like CUDA or OpenCL; however, these APIs require that memory be allocated on the GPU beforehand for every function that needs to execute on the GPU device. This poses a significant challenge for a MASS library user who wants to take advantage of the GPU's power without having to rewrite an existing CPU function to run on the GPU. The limitation arises because memory is not automatically allocated on the GPU when a GPU API function is called. Automatic memory allocation is supported for C/C++ on the CPU, but CUDA and OpenCL, as extensions of C, do not provide automatic memory allocation for the data a kernel operates on. This makes it difficult to write a generic C function that has a fair chance of executing on the GPU without knowing in advance how much memory the function will use.
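The host-side boilerplate sketched below (reusing the Place struct and call_all_kernel from the earlier sketch, with run_call_all as a hypothetical wrapper) illustrates the point: the caller must know the size of the data in advance and explicitly allocate, copy, and free device memory before and after the kernel can run.

    #include <cuda_runtime.h>

    // Explicit allocation and data movement CUDA requires before any
    // kernel can touch the places array; none of this happens automatically.
    void run_call_all(Place *host_places, int size)
    {
        Place *d_places = NULL;
        size_t bytes = size * sizeof(Place);

        cudaMalloc((void **)&d_places, bytes);               // explicit device allocation
        cudaMemcpy(d_places, host_places, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (size + threads - 1) / threads;
        call_all_kernel<<<blocks, threads>>>(d_places, size); // one thread per place

        cudaMemcpy(host_places, d_places, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_places);
    }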

Client has to know how to program in target GPU API

The missing support for automatic memory allocation described above results in the developer having to learn how to program in the target API in order to run a specific function on the GPU. This might not be such a bad thing, considering the minimal learning curve required to perform parallelization with GPU APIs like CUDA. However, it goes against the fundamental design of the MASS library, which intends for users to extend MASS by writing CPU functions that "invisibly" run on whatever GPU is installed.

Conclusion

This independent study investigated the use of GPUs to significantly improve the performance of the MASS library functions by introducing a low-level implementation layer that runs on the GPU and exploits parallelism across lightweight GPU threads. Developers implementing applications with MASS library functions can take advantage of GPU parallelism without having to implement parallel algorithms themselves. The major limitation of the CUDA API, and of other GPU APIs like OpenCL, for implementing GPGPU programs is the inability to implicitly allocate memory on the GPU for client programs that need to run a GPU function. One way to overcome this limitation is to implement a preprocessor that converts a CPU function into a GPU function and also parses all the data structures needed for the computation, so that memory is allocated before the function is run on the GPU.

References

Nickolls, J., & Dally, W. J. (2010). The GPU computing era. IEEE Micro, 30(2), 56-69.

Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.
