The DARPA Data Transposition Benchmark on a Reconfigurable Computer

S. Akella, D. A. Buell, L. E. Cordova

Department of Computer Science and Engineering,

University of South Carolina,

Columbia, South Carolina, 29208.

{akella | buell | cordoval}@cse.sc.edu

J. Hammes

SRC Computers, Inc.

4240 North Nevada Avenue

Colorado Springs, Colorado 80907

MAPLD, September 7-9, 2005

Abstract

The Defense Advanced Research Projects Agency (DARPA) has recently released a set of six discrete mathematics benchmarks that can be used to measure the performance of high productivity computing systems. Benchmark 3 requires transposition, bit by bit in blocks, of a long bit stream. In this paper we present the design and implementation of this benchmark on a high performance reconfigurable computer, the SRC Computers SRC-6. We evaluate the performance of this machine by benchmarking our implementation against a standard C-based implementation of the same algorithm on a single-processor Pentium PC.

Index Terms—Reconfigurable Architectures, FPGA, Verilog, C, Matrix Transposition.

1. Introduction

The six DARPA Discrete Mathematics Benchmarks can be used to measure the performance of high productivity computing systems [1]. The benchmarks were written for 64-bit machines and coded in Fortran 77 or C. In addition, some of the benchmarks are available in MPI, Shmem, and UPC. DARPA is interested in experimenting with all six algorithms and seeks performance improvements through novel implementation methods. The six benchmarks are described briefly on the University of South Carolina reconfigurable computing research group's website [2].

In this paper we look at DARPA Benchmark 3, the Data Transposition algorithm. The problem definition for this benchmark is given below.

  • Let {A_i} be a stream of n-bit integers of length L. Consider each successive block of n integers as an n x n matrix of bits. For each such matrix, transpose the bits such that bit b_ij is interchanged with bit b_ji. The resulting stream {A'_i} will have the property that:
  • A'_1, A'_{n+1}, A'_{2n+1}, etc. will hold, sequentially, the high-order bits of {A_i};
  • A'_2, A'_{n+2}, A'_{2n+2}, etc. will hold, sequentially, the next-to-high-order bits of {A_i};
  • Output the stream {A'_i}.
  • Parameters for this benchmark are:

Length of Stream / Bit-width (n) / No. of Iterations
10^7 / 32 / 400
10^7 / 64 / 230
10^7 / 1024 / 12

We see that the Data Transposition benchmark comprises three sub-benchmarks that need to be implemented. They vary in the bit-width (n) of the integers and in the number of iterations run over the input stream; the stream length L is set to 10^7 for all three sub-benchmarks.

2. Previous Research

There has been relatively little previous research on this data transposition problem in which the transposition is done at the bit level. Most previous work has assumed that each element of the matrix is a full word. Several methodologies, such as parallel matrix transpose algorithms [3, 4], mesh implementations [5], and hypercube algorithms [6], have been investigated, but none of them is relevant to what we do here. We therefore implement the algorithm with a methodology suited to the underlying architecture. First we look at the software implementation, and then we discuss the SRC-6 implementation.

3. Software implementation

We first look at a software implementation of this benchmark. The DARPA benchmark "rules of the game" require that we first build and time the original code, making only the changes necessary for correct execution. The original code is written in Fortran, uses a recursive transposition algorithm, and targets a 64-bit architecture, which makes it difficult to run on the available 32-bit Pentium processor PC. It also assumes that all input words can be accessed at the same time, which is not suitable for direct porting to the SRC-6 architecture. We therefore implemented the benchmark in C using a nested loop construct, both for simplicity and for easier portability to the SRC-6 computer. The C code works on one word of data at a time, which matches the SRC-6 hardware architecture. The loop construct is the same for the 32-bit and 64-bit benchmarks. The code for the 1024-bit benchmark has slight modifications to work on an array of 64-bit words instead of an entire 1024-bit word. The methodology used for the software implementation is given below:

  • Read the bit stream in words of n bits (32, 64, or 1024 bits, depending on the benchmark).
  • There are n n-bit inputs and n n-bit outputs; the outputs are the transposed values of the inputs.
  • Each of the n bits of the ith input word supplies the ith bit of one of the n output words.
  • The inner loop runs n times, once for each output generated.
  • In each iteration of the inner loop, one bit of the input word just read is picked and placed in the ith position of the corresponding output word.
  • The outer loop runs the inner loop n times, once for each input word read.

This setup is slightly different for the 1024-bit benchmark since the natural word length of the PC is restricted to 64 bits and the natural data transfer unit to the reconfigurable resources on the SRC-6 is a 64-bit word. We use sixteen 64-bit words to represent one 1024-bit word. Thus, while working on one 1024-bit input word, we actually work on sixteen 64-bit words each time through the inner loop. Similarly, when generating each 1024-bit output word, we generate sixteen 64-bit words that, concatenated together, form the 1024-bit word.
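To make the nested-loop methodology concrete, the following is a minimal plain-C sketch of the 32/64-bit case, shown here for n = 64. The function and variable names are ours; the benchmark code additionally handles file I/O, the iteration count, and the 1024-bit packing described above.

#include <stdint.h>

/* Transpose one 64 x 64 block of bits: bit j of in[i] becomes bit i of out[j].
   Bits are numbered from the most significant end. */
void transpose_block_64(const uint64_t in[64], uint64_t out[64])
{
    for (int j = 0; j < 64; j++)
        out[j] = 0;
    for (int i = 0; i < 64; i++) {          /* outer loop: one input word per pass   */
        for (int j = 0; j < 64; j++) {      /* inner loop: one bit for each output   */
            uint64_t bit = (in[i] >> (63 - j)) & 1ULL;
            out[j] |= bit << (63 - i);      /* bit j of input i -> bit i of output j */
        }
    }
}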

The code for each benchmark is run on a computer which has dual Intel Pentium 4 processors, and the timing results collected are presented in Table 1.

4. The SRC-6 reconfigurable computer

The SRC-6, by SRC Computers [7], is one of the limited number of reconfigurable computers commercially available. This computer allows the programmer to overlook the details of the underlying hardware architecture and focus on the function being implemented. This approach helps decrease the time to solution by making software development accessible to programmers and mathematicians.

4.1. The Hardware Architecture [8]

The SRC-6 system architecture includes Intel® microprocessors running the Linux operating system. The reconfigurable logic resource is referred to by SRC as a MAP®; the MAP® boards normally come in pairs attached to these microprocessors. Each MAP® consists of two Xilinx® FPGA chips and a control processor. Code for the Intel processors is written in standard C or Fortran. Code for the MAP® hardware can be written in MAP C or MAP Fortran and compiled by an SRC-proprietary compiler that targets the MAP® components. Calls to execute on the MAP® are function/subroutine calls from the standard C or Fortran modules.

The MAP® consists of both fixed and reconfigurable hardware. The fixed part includes general control logic, the DMA engine in the control FPGA, and six 4 MB banks of on-board memory referred to as OBMs. The reconfigurable hardware consists of two Xilinx® XC2V6000 FPGA chips referred to as user logic. The architectural block diagram is given in Figure 1.

Figure 1. MAP® interface block diagram [8]

Most computations will first require a transfer of input data to the OBMs from the microprocessor memory through the control processor interface. The FPGA user logic then performs computations on the input data by reading it from the OBMs, perhaps writing intermediate or final results back to the OBMs; the results are then transferred from the OBMs to the microprocessor memory through the control processor interface.

An important point to notice, especially when comparing 32-bit and 64-bit computations as in this benchmark, is that the reconfigurable hardware is strongly oriented toward 64-bit words. The OBM is organized as six banks of memory, each read and written as 64-bit words, and DMA transfers data to and from the microprocessor in 64-bit quantities. We would expect, therefore, that a 32-bit benchmark would actually be hampered by the 64-bit orientation of the machine, and we would have as a goal that the 1024-bit benchmark function very nearly like an array of sixteen 64-bit-word transpositions. This is something of an oversimplification, but it may serve to place the implementations in perspective.

4.2. The Programming Environment

The SRC-6 programming environment involves a traditional software compilation process along with a specialized MAP® compilation process used to compile the code that runs on the MAP® hardware. Code that runs on the Intel® processors is compiled separately. The application code has two main source files, one intended to run on the Intel® processors and one intended to run on the MAP® hardware. Both source files can be written in either C or Fortran. The MAP® source file contains functions that can call user hardware macros to be executed on the MAP® hardware. The hardware macros can be built in or user-defined in either the VHDL or Verilog hardware description languages.

The object files from the microprocessor C or Fortran compiler and from the MAP® compiler are linked together to form the final application executable.

5. The SRC-6 Implementation

The SRC-6 environment provides us with the option of programming in C or Fortran. We have chosen to implement all the DARPA benchmarks in C. We have implemented the benchmarks in two ways:

  • The transposition operation is implemented using pure C code which we refer to as a C Map implementation.
  • The transposition operation is written in Verilog and a function call is made to this macro from the MAP® C source file. We refer to this as the Verilog Map implementation.

The SRC-6 implementation differs from the Fortran code in that the Fortran implementation assumes the capability to access multiple words at the same time, whereas the SRC-6 hardware requires additional clock cycles to access multiple words from the same bank. We therefore designed implementations suited to this underlying architecture that work on one word, two words, and then four words of data at a time. The two methodologies have different performance results.

We initially look at the basic 32-bit and 64-bit C Map and Verilog Map implementations, then the basic 1024-bit C Map and Verilog Map implementations. Later we scale up the architecture and look at multi-unit parallel implementations and implementations with 128-bit data transfers for all three sub-benchmarks.

6. The C Map Implementation

A pure C implementation was done for each of the 32-bit, 64-bit, and 1024-bit benchmarks. The basic code format is similar for all three. A C main program reads the input bit stream from an input file, calls a MAP function to perform the transposition operation, and then writes the results to an output file. The input bit stream is stored in one long array in the main program. The data arrangement within the input array is the same for the 32-bit and 64-bit input data. Once the input data is stored within the array, it is transferred from the common memory to the OBM bank through DMA calls. The MAP® has six OBM banks of 524288 words, each word being 64 bits. The input stream of 10 million words is transferred in blocks of 262144 words at a time, since the entire input cannot be loaded into the OBM banks at once.
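The actual code is shown in Figures 2 and 3. Purely to make the block-wise data flow concrete, the sketch below gives a hypothetical, software-only skeleton of the host loop for the 64-bit case; the function name and structure are ours, and the stand-in routine simply performs the transposition in software at the point where the real code DMA-transfers the block into an OBM bank and invokes the MAP function.

#include <stdint.h>
#include <stdlib.h>

#define BLOCK_WORDS 262144L     /* 64-bit words moved per block, as in the text */
#define N 64                    /* word width for the 64-bit sub-benchmark      */

/* Software stand-in for the MAP call: transpose every successive N x N bit
   matrix inside one block.  On the SRC-6 this work is done in the user logic
   after the block has been DMA-transferred into an OBM bank. */
static void map_transpose_block(const uint64_t *in, uint64_t *out, long nwords)
{
    for (long base = 0; base + N <= nwords; base += N)
        for (int j = 0; j < N; j++) {
            uint64_t w = 0;
            for (int i = 0; i < N; i++)
                w |= ((in[base + i] >> (N - 1 - j)) & 1ULL) << (N - 1 - i);
            out[base + j] = w;
        }
}

int main(void)
{
    const long total = 10000000L;                     /* 10 million input words */
    uint64_t *in  = calloc(total, sizeof *in);
    uint64_t *out = calloc(total, sizeof *out);
    if (in == NULL || out == NULL)
        return 1;

    /* ... in the benchmark, in[] is filled from the input file here ... */

    for (long off = 0; off < total; off += BLOCK_WORDS) {
        long n = total - off;
        if (n > BLOCK_WORDS)
            n = BLOCK_WORDS;
        map_transpose_block(in + off, out + off, n);  /* one block at a time */
    }

    /* ... in the benchmark, out[] is written to the output file here ... */
    free(in);
    free(out);
    return 0;
}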

6.1. 32-bit and 64-bit benchmarks.

The main C program and the map C code for the 64-bit benchmark are presented in Figures 2 and 3.

The code is similar for the 32-bit benchmark. The MAP C code constructs are mostly similar to the standard C constructs. Certain constructs are vendor-specific and are used instead of traditional C constructs for compatibility with the current version of the compiler. In the MAP C code, the vendor-specific macro 'selector_64' is used instead of a large switch statement. A code sample is presented below:

selector_64 (i==0, temp0, temp, &temp);

The macro selects variable temp0 and assigns it to temp when the condition i==0 is true. We can use the above construct on different conditions that represent different cases of a switch statement. We note that constructs like this are necessary for ensuring that efficient code is generated; although the transposition itself is a relatively simple operation, expressing the bit-level selection in a 64-bit word can be textually tedious to the point of obscuring the underlying algorithm.
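Read this way, the call above behaves like a guarded assignment; in ordinary C its effect (though not the hardware the MAP compiler generates from it) could be written roughly as:

temp = (i == 0) ? temp0 : temp;   /* keep the old value when the condition is false */

A chain of such calls with different conditions then plays the role of the cases of a switch statement selecting among temp0, temp1, and so on.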

The data transposition is implemented using shift and bitwise OR operations, as shown in the code of Figure 3.

Figure 2. Version 1 main.c code for the 64-bit C Map implementation

Figure 3. Version 1 map C code for the 64-bit C Map implementation

6.1.1. Code Improvements. The version 1 code presented in Figure 3 works similarly to the software implementation. Its inner loop generates one output word at a time, and the outer loop runs the inner loop until all the outputs are generated. This code is not productive and yields little performance benefit, as shown in the timing results of Table 2. We would like to exploit the inherent parallelism of the FPGA architecture, so we modify the code to operate on all n output values (for each block, i.e., one n x n matrix of data) at the same time.

We use a temporary variable for each of the n output values, which requires additional space on the FPGA. Each time one n-bit input word is read, we obtain one bit of each of the output values. The ith input supplies the ith bit of each of the outputs; its bits are thus placed in the ith position of the respective temporary variables using shift and OR operations. Thus, when we have read the nth input value, we have all the transposed values. We then transfer these values onto the OBM. This modification allows us to generate the n output values in n cycles instead of the n² cycles originally required.

This modification works quite well for the 32-bit benchmark, but with the 64-bit benchmark we run into synthesis problems because we run out of FPGA resources. This comes from an increase in the number of temporary variables (32 to 64) and a corresponding increase in the number of shift and OR operators. To overcome the increase in resource usage we must modify the shift operations portion of the code. The original code uses variable-distance shifts and thus generates shift operators of different sizes. This can be modified to perform shifts of constant distances, which take little or no logic to implement. A code sample is provided below:

temp17 |= ((AL[l] & (1ULL << 46)) << 17) >> i;

The statement can be replaced by the following statement that uses constant-distance shifts.

temp17 = (temp17 << 1) | (1 & (AL[l] >> 46));

The original functionality is retained, but the resource usage is drastically reduced, especially because we have 64 such modifications for the 64-bit transposition operation. The modified code for version 2 is presented in Figure 4. This code synthesizes successfully, using resources well below the total available on the XC2V6000 chip.
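The overall effect of the version 2 code can be summarized by the following plain-C sketch, which for brevity uses an array of temporaries where the MAP C code of Figure 4 uses 64 individual scalar variables (temp0 through temp63); the naming follows the samples above.

#include <stdint.h>

/* Version 2 pattern for one 64 x 64 block: every input word AL[l] contributes
   one bit to every output, and each output is built only with constant shifts
   by 1.  temp[17] collects bit 46 of each input, matching the sample above
   (63 - 46 = 17). */
void transpose_block_v2(const uint64_t AL[64], uint64_t temp[64])
{
    for (int k = 0; k < 64; k++)
        temp[k] = 0;
    for (int l = 0; l < 64; l++)
        for (int k = 0; k < 64; k++)
            temp[k] = (temp[k] << 1) | (1ULL & (AL[l] >> (63 - k)));
}

In software this sketch is of course still O(n²) bit operations; the gain on the MAP comes from the 64 inner updates being independent, so they can all occur in the same clock cycle, which is what reduces the block transpose to n cycles.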

Finally, we use compiler-pragma parallel sections to overlap the DMA transfer of input data with computation on data transferred to the OBMs in a previous step. This provides significant performance benefits, as was shown in previous work on the DARPA Boolean equation benchmark [9], by permitting full overlap of data movement with computation. The code for the 32-bit and 64-bit benchmarks has been run on the SRC-6 computer, and the timing results are presented in Table 3.

The 1024-bit benchmark requires either a multi-unit implementation, in which multiple 64-bit transposition units are employed to carry out the 1024-bit transposition, or calling a 64-bit unit multiple times to transpose the 1024-bit matrix. We first look at the Verilog Map implementation of the 32-bit and 64-bit benchmarks before we discuss the 1024-bit implementation for both the C Map and the Verilog Map implementations.

Figure 4. Final version map C code for the 64-bit C Map implementation

6.2. The Verilog Map implementation

As we have mentioned earlier, the SRC-6 programming environment allows us to incorporate user-defined hardware macros designed in either VHDL or Verilog. The C Map implementation provides a good speed-up over a standard processor, but we wanted to explore the possibility of designing the application in a hardware description language, which gives us a highly customized implementation. This is similar to the traditional software programming methodology in which critical functions are rewritten in assembly language for better performance. The hardware description languages provide the ability to operate at the bit level instead of at the word level as in C. This advantage can be exploited in designing the transposition algorithm, which mainly requires bit manipulation. We therefore model the transposition algorithm in Verilog and implement it on the SRC-6 platform. An additional benefit of this extra implementation is that it lets us measure the quality of the code synthesized from C by the SRC MAP C compiler.

For all three benchmarks, the basic idea and setup of the Verilog design are similar to those of the MAP C code. The basic hardware architecture of a 64-bit transposition module is given in Figure 5.

Figure 5. Basic architecture of the 64-bit Verilog transposition unit

The module consists of 64 units (dt_op) in parallel, each working on one of the 64 output values. Each unit holds a 64-bit shift register and a 64-bit OR, and performs the following operation:

temp <= (temp << 1) | ({63'b0, A});

This is different from the C version in that we operate on one bit of the input held in the variable A. Also, the concatenation of A with the word 63’b0 (63-bit word having a value of zero) is not possible in C using a simple bit-level operation as in the code above but would in C need shift and addition operations. A setup similar to that of the 64-bit is used for the 32-bit benchmark, with the inputs and outputs of the macro being 32-bit words. The 32-bit and 64-bit benchmarks using Verilog macros were implemented on the SRC-6 and the timing results we collected are presented in Table 4.