MorphoSys: An Integrated Reconfigurable System for Data-Parallel Computation-Intensive Applications

Hartej Singh, Ming-Hau Lee, Guangming Lu, Fadi J. Kurdahi and Nader Bagherzadeh,

Department of Electrical and Computer Engineering,

University of California, Irvine, CA 92697

Abstract: In this paper, we propose the MorphoSys reconfigurable system, which is targeted at data-parallel and computation-intensive applications. This architecture combines a reconfigurable array of processor cells with a RISC processor core and a high bandwidth memory interface unit. We introduce the system-level model and describe the array architecture, its configuration memory, inter-connection network, role of the control processor and related components. We demonstrate the flexibility and efficacy of MorphoSys by simulating video compression (MPEG-2) and target-recognition applications on its behavioral VHDL model. Upon evaluating the performance of these applications in comparison with other implementations and processors, we find that MorphoSys achieves performance improvements of more than an order of magnitude. MorphoSys architecture demonstrates the effectiveness of utilizing reconfigurable processors for general purpose as well as embedded applications.

Index Terms: Reconfigurable processors, reconfigurable cell array, SIMD (single instruction multiple data), context switching, automatic target recognition, template matching, multimedia applications, video compression, MPEG-2.

1. Introduction

Reconfigurable computing systems are systems that combine reconfigurable hardware with software programmable processors. These systems have some ability to configure or customize a part of the hardware unit for one or more applications [1]. Reconfigurable computing is a hybrid approach between the extremes of ASICs (Application-specific ICs) and general-purpose processors. A reconfigurable system would generally have wider applicability than an ASIC and better performance than a general-purpose processor.

The significance of reconfigurable systems can be illustrated using an example. Many applications are heterogeneous and comprise several sub-tasks with different characteristics. For instance, a multimedia application may include a data-parallel task, a bit-level task, irregular computations, some high-precision word operations and perhaps a real-time component. For such applications, the ASIC approach would mandate a large number of separate chips, which is uneconomical, while most general-purpose processors would very likely not satisfy the performance constraints for the entire application. A reconfigurable system, however, may be designed so that it can be optimally reconfigured for each sub-task through a configuration plane. Such a system would have a very high probability of meeting the application constraints within a single chip. Moreover, it would be useful for general-purpose applications, too.

Conventionally, the most common devices used for reconfigurable computing are field programmable gate arrays (FPGAs) [2]. FPGAs allow designers to manipulate gate-level devices such as flip-flops, memory and other logic gates. However, FPGAs have full utility only for bit-level operations. They are slower than ASICs, have lower logic density and perform inefficiently for 8-bit or wider datapath operations. Hence, many researchers have proposed other models of reconfigurable computing systems that target different applications. PADDI [3], rDPA [4], DPGA [5], MATRIX [6], Garp [7], RaPiD [8], REMARC [9], and RAW [10] are some of the systems that have been developed as prototypes of reconfigurable computing systems. These are discussed briefly in Section 3.

1.1 MorphoSys: An Integrated System with Reconfigurable Array Processor

In this paper, we propose MorphoSys as an implementation of a novel model for reconfigurable computing systems. This design model, shown in Figure 1, places a reconfigurable SIMD component on the same die as a powerful general-purpose RISC processor and a high-bandwidth memory interface. The intent of the MorphoSys architecture is to demonstrate the viability of this model. This integrated architecture model has the potential to satisfy the increasing demand for low-cost stream/frame data processing needed for multimedia applications.

Figure 1: An Integrated Architectural Model for Processors with Reconfigurable Systems

For the current implementation, the reconfigurable component is an array of processor cells controlled by a basic version of a RISC processor. Thus, MorphoSys may also be classified as a reconfigurable array processor. MorphoSys targets applications with inherent parallelism, high regularity, computation-intensive nature and word-level granularity. Some examples of these applications are video compression (DCT/IDCT, motion estimation), graphics and image processing, DSP transforms, etc. However, MorphoSys is flexible and robust enough to also support complex bit-level applications such as ATR (Automatic Target Recognition) or irregular tasks such as zig-zag scan, and to provide high-precision multiply-accumulates for DSP applications.

1.2 Organization of paper

Section 2 provides brief explanations of some terms and concepts used frequently in reconfigurable computing. Section 3 presents a brief review of relevant research contributions. Section 4 introduces the system model for MorphoSys, our prototype reconfigurable computing system. Section 5 describes the architecture of the MorphoSys reconfigurable cell array and associated components. Section 6 describes the programming and simulation environment, including mView, a graphical user interface for the programming and simulation of MorphoSys. Section 7 illustrates the mapping of a set of applications (video compression and ATR) to MorphoSys, and provides performance estimates for these applications, as obtained from simulation of behavioral VHDL models, in comparison with other systems and processors. Finally, Section 8 presents conclusions from this research effort.

2. Taxonomy for Reconfigurable Systems

In this section, we provide definitions for parameters that are frequently used to characterize the design of a reconfigurable computing system.

(a) Granularity (fine versus coarse): This refers to the data size for operations. Bit-level operations correspond to fine-grain granularity, whereas coarse-grain granularity implies operations on word-size data. Depending upon the granularity, the reconfigurable component may be a look-up table, a gate, an ALU-multiplier, etc.

(b) Depth of Programmability (single versus multiple): This pertains to the number of configuration planes resident in a reconfigurable system. Systems with a single configuration plane have limited functionality. Systems with multiple configuration planes may perform different functions without having to reload configuration data.

(c) Reconfigurability (static versus dynamic): A system may need to be frequently reconfigured for executing different applications. Reconfiguration is either static (execution is interrupted) or dynamic (in parallel with execution). Single-configuration systems typically have static reconfiguration. Dynamic reconfiguration is very useful for multi-configuration systems.

(d) Interface (remote versus local): A reconfigurable system has a remote interface if the system’s host processor is not on the same chip/die as the programmable hardware. A local interface implies that the host processor and programmable logic reside on the same chip.

(e) Computation model: For most reconfigurable systems, the computation model may be described as either SIMD or MIMD. Some systems may also follow the VLIW model.
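The five parameters above can be read as the fields of a small record. The following Python sketch is illustrative only: the class and function names are our own and hypothetical, and the two example profiles use only the classifications stated elsewhere in this paper (Section 3 for Splash, Section 4.2 for MorphoSys).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Granularity(Enum):
    FINE = "fine"      # bit-level operations
    COARSE = "coarse"  # word-level operations


class Depth(Enum):
    SINGLE = "single"      # one configuration plane
    MULTIPLE = "multiple"  # several resident configuration planes


class Reconfiguration(Enum):
    STATIC = "static"    # execution is interrupted
    DYNAMIC = "dynamic"  # proceeds in parallel with execution


class Interface(Enum):
    REMOTE = "remote"  # host processor on a different chip
    LOCAL = "local"    # host processor on the same chip


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical record holding the taxonomy parameters (a) to (e)."""
    name: str
    granularity: Granularity
    depth: Depth
    reconfiguration: Reconfiguration
    interface: Interface
    computation_model: Optional[str] = None  # e.g. "SIMD"; None if not stated


def switches_without_reload(profile: SystemProfile) -> bool:
    """Multiple resident planes allow a change of function without a reload."""
    return profile.depth is Depth.MULTIPLE


splash = SystemProfile("Splash", Granularity.FINE, Depth.SINGLE,
                       Reconfiguration.STATIC, Interface.REMOTE)
morphosys = SystemProfile("MorphoSys", Granularity.COARSE, Depth.MULTIPLE,
                          Reconfiguration.DYNAMIC, Interface.LOCAL, "SIMD")

for system in (splash, morphosys):
    print(system.name, switches_without_reload(system))
```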

3. Related Research Contributions

There has been considerable research effort to develop reconfigurable computing systems. Research prototypes with fine-grain granularity include Splash [11], DECPeRLe-1 [12], DPGA [5] and Garp [7]. Array processors with coarse-grain granularity, such as PADDI [3], rDPA [4], MATRIX [6], and REMARC [9] form another class of reconfigurable systems. Other systems with coarse-grain granularity include RaPiD [8] and RAW [10].

The Splash [11] and DECPeRLe-1 [12] computers were among the first research efforts in reconfigurable computing. Splash, a linear array of processing elements with limited routing resources, is useful mostly for linear systolic applications. DECPeRLe-1 is organized as a two-dimensional array of 16 FPGAs with more extensive routing. Both systems are fine-grained, with remote interface, single configuration and static reconfigurability.

PADDI: This system [3] has a set of concurrently executing 16-bit functional units (EXUs). Each of these has an eight-word instruction memory. The EXU communication network uses crossbar switches. Each EXU has dedicated hardware for fast arithmetic operations. Memory resources are distributed among EXUs. PADDI targets real-time DSP applications (filters, convolvers, etc.).

rDPA: The reconfigurable data-path architecture (rDPA) [4] consists of a regular array of identical data-path units (DPUs). Each DPU consists of an ALU, a micro-programmable control and four registers. The rDPA array is dynamically reconfigurable and scalable. The ALUs are intended for parallel and pipelined implementation of complete expressions and statement sequences. The configuration is done through mapping of statements in high-level languages to rDPA using DPSS (Data Path Synthesis System).

MATRIX: This architecture [6] is unique in that it aims to unify resources for instruction storage and computation. The basic unit (BFU) can serve either as a memory or a computation unit. The 8-bit BFUs are organized in an array, and each BFU has a 256-word memory, ALU-multiply unit and reduction control logic. The interconnection network has a hierarchy of three levels. It can deliver up to 10 GOPS (Giga-operations/s) with 100 BFUs when operating at 100 MHz.
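The peak figure quoted for MATRIX is consistent with simple arithmetic. The short Python check below assumes that each BFU completes one operation per clock cycle; that assumption is ours and is not stated in the description above. The helper name `peak_gops` is hypothetical.

```python
def peak_gops(units: int, clock_mhz: float, ops_per_unit_per_cycle: float = 1) -> float:
    """Peak throughput in giga-operations per second.

    Assumes every unit issues `ops_per_unit_per_cycle` operations on every cycle.
    """
    ops_per_second = units * clock_mhz * 1e6 * ops_per_unit_per_cycle
    return ops_per_second / 1e9


# 100 BFUs at 100 MHz, one operation per BFU per cycle (assumed)
print(peak_gops(100, 100))
```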

RaPiD: This is a linear array (8 to 32 cells) of functional units [8], configured to form a linear computation pipeline. Each array cell has an integer multiplier, three ALUs, registers and local memory. Segmented buses are used for efficient utilization of inter-connection resources. It achieves performance close to its peak of 1.6 GOPS for applications such as FIR filters or motion estimation.

REMARC: This system [9] consists of a reconfigurable coprocessor, which has a global control unit for 64 programmable blocks (nano processors). Each 16-bit nano processor has a 32-entry instruction RAM, a 16-bit ALU, a 16-entry data RAM, an instruction register, and several registers for program data, input data and output data. The interconnection is two-level (2-D mesh and global buses across rows and columns). The global control unit (1024-entry instruction RAM with data and control registers) controls the execution of the nano processors and transfers data between the main processor and nano processors. This system performs remarkably well for multimedia applications, such as MPEG encoding and decoding (though it is not specified whether it satisfies the real-time constraints).

RAW: The main idea of this approach [10] is to implement a highly parallel architecture and fully expose low-level details of the hardware architecture to the compiler. The Reconfigurable Architecture Workstation (RAW) is a set of replicated tiles, where each tile contains a simple RISC processor, some bit-level reconfigurable logic and some memory for instructions and data. Each RAW tile has an associated programmable switch which connects the tiles in a wide-channel point-to-point interconnect. When tested on benchmarks ranging from encryption and sorting to FFT and matrix operations, it provided gains from 1X to 100X, as compared to a Sun SparcStation 20.

DPGA: A fine-grain prototype system, the Dynamically Programmable Gate Array (DPGA) [5] uses traditional 4-input lookup tables as the basic array element. DPGA supports rapid run-time reconfiguration. Small collections of array elements are grouped as sub-arrays that are tiled to form the entire array. A sub-array has complete row and column connectivity. Reconfigurable crossbars are used for communication between sub-arrays. The authors suggest that DPGAs may be useful for implementing systolic pipelines, utility functions and even FSMs, with utilization gains of 3-4X.

Garp: This fine-grained approach [7] has been designed to fit into an ordinary processing environment, where a host processor manages the main thread of control while only certain loops and subroutines use the reconfigurable array for speedup in performance. The array is composed of rows of blocks, which resemble the CLBs of the Xilinx 4000 series [13]. There are at least 24 columns of blocks, while the number of rows is implementation specific. The blocks operate on 2-bit data. There are vertical and horizontal block-to-block wires for data movement within the array. Separate memory buses move information (data as well as configuration) in and out of the array. Speedups ranging from 2X to 24X are obtained for applications such as encryption, image dithering and sorting.

4. MorphoSys: Components, Features and Program Flow

Figure 2 shows the organization of the integrated MorphoSys reconfigurable computing system. It is composed of an array of reconfigurable cells (RC Array) with its configuration data memory (Context Memory), a control processor (Tiny RISC), a data buffer (Frame Buffer) and a DMA controller.

Figure 2: Block diagram of MorphoSys (M1 chip)

The correspondence between this figure and the architectural model in Figure 1 is as follows: the RC Array with its Context Memory corresponds to the reconfigurable processor array (SIMD co-processor), the Tiny RISC corresponds to the Main Processor, and the high-bandwidth memory interface is implemented as the Frame Buffer and the DMA Controller.

4.1 System Components

Reconfigurable Cell Array: The main component of MorphoSys is the 8 x 8 RC (Reconfigurable Cell) Array, shown in Figure 3. Each RC has an ALU-multiplier and a register file and is configured through a 32-bit context word. The context words for the RC Array are stored in Context Memory.

Figure 3: MorphoSys 8 x 8 RC Array with 2-D Mesh and Complete Quadrant Connectivity

Host/Control processor: The controlling component of MorphoSys is a 32-bit processor, called Tiny RISC. This is based on the design of a RISC processor in [14]. Tiny RISC handles general-purpose operations and also controls operation of the RC array. It initiates all data transfers to and from the Frame Buffer and configuration data load for the Context Memory.

Frame Buffer: An important component is the two-set Frame Buffer, which is analogous to a data cache. It makes memory accesses transparent to the RC Array by overlapping computation with data load and store, alternately using the two sets. MorphoSys performance benefits tremendously from this data buffer. A dedicated data buffer has been missing in most contemporary reconfigurable systems, with consequent degradation of performance.
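The alternating use of the two Frame Buffer sets can be sketched as follows. This is a sequential Python model under simplifying assumptions: the function name `run_double_buffered` is hypothetical, the load of the idle set is written before the computation although in hardware the two would proceed concurrently, and store-back of results is omitted.

```python
def run_double_buffered(blocks, compute):
    """Process `blocks` with two buffer sets that swap roles after each step.

    Returns (results, schedule), where each schedule entry is
    (step, set_being_computed_on, set_being_loaded_or_None).
    """
    sets = [None, None]
    results, schedule = [], []
    if not blocks:
        return results, schedule

    active = 0
    sets[active] = blocks[0]  # initial load; there is nothing to overlap with yet

    for step in range(len(blocks)):
        idle = 1 - active
        has_next = step + 1 < len(blocks)
        if has_next:
            # In hardware this load would overlap with the computation below.
            sets[idle] = blocks[step + 1]
        results.append(compute(sets[active]))
        schedule.append((step, active, idle if has_next else None))
        active = idle  # the freshly loaded set becomes the working set

    return results, schedule


blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
results, schedule = run_double_buffered(blocks, sum)
print(results)
print(schedule)
```

Because the working set and the set being loaded are always different, the array never waits on a load except for the very first block.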

4.2 Features of MorphoSys

The RC Array follows the SIMD model of computation. All the RCs in the same row or column share the same configuration data, but each RC operates on different data. Sharing the context across a row or column is useful for data-parallel applications. In brief, important features of MorphoSys are:

Coarse-level granularity: MorphoSys differs from bit-level FPGAs and other fine-grain reconfigurable systems, in that it operates on 8 or 16-bit data. This ensures better silicon utilization (higher logic density), and faster performance for word-level operations as compared to FPGAs. MorphoSys is free from variable wire propagation delays, an undesirable characteristic of FPGAs.

Configuration: The RC Array is configured through context words. Each context word specifies an instruction opcode for the RC, provides control bits for the input multiplexers, and specifies constant values for computations. The configuration data is stored as context words in the Context Memory.

Considerable depth of programmability: The Context Memory can store up to 32 planes of configuration. The user has the option of broadcasting contexts across rows or columns.

Dynamic reconfiguration capability: MorphoSys supports dynamic reconfiguration. Context data may be loaded into a non-active part of the Context Memory without interrupting RC Array operation. Context loads and reloads are specified through Tiny RISC and actually done by the DMA controller.

Local/Host Processor and High-Speed Memory Interface: The control processor (Tiny RISC) and the RC Array are resident on the same chip. This prevents I/O limitations from affecting performance. In addition, the memory interface is through an on-chip DMA Controller, for faster data transfers between main memory and Frame Buffer. It also helps to reduce reconfiguration time.
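The interaction between context planes, row/column broadcast and dynamic reconfiguration can be sketched in Python. This is a behavioral illustration under stated assumptions, not the M1 design: contexts are modeled as Python callables rather than 32-bit context words, a plane is assumed to hold one context per row or column (eight entries), and the names `ContextMemory` and `broadcast` are hypothetical.

```python
N = 8        # the RC Array is 8 x 8
PLANES = 32  # the Context Memory stores up to 32 planes


class ContextMemory:
    """Hypothetical model: each plane holds N contexts (an assumption)."""

    def __init__(self):
        self.planes = [None] * PLANES
        self.active = None  # index of the plane currently in use, if any

    def load(self, plane, contexts):
        # Dynamic reconfiguration: only a non-active plane may be written.
        if plane == self.active:
            raise ValueError("cannot load into the active plane")
        if len(contexts) != N:
            raise ValueError("a plane needs one context per row/column")
        self.planes[plane] = list(contexts)

    def activate(self, plane):
        if self.planes[plane] is None:
            raise ValueError("plane has not been loaded")
        self.active = plane
        return self.planes[plane]


def broadcast(contexts, data, mode):
    """Apply contexts[i] to every cell of row i ('row') or column i ('column')."""
    if mode not in ("row", "column"):
        raise ValueError("mode must be 'row' or 'column'")
    out = [[0] * N for _ in range(N)]
    for r in range(N):
        for c in range(N):
            op = contexts[r] if mode == "row" else contexts[c]
            out[r][c] = op(data[r][c])  # same context, different data per cell
    return out


memory = ContextMemory()
memory.load(0, [lambda x, k=k: x + k for k in range(N)])
contexts = memory.activate(0)

# Plane 1 can be filled while plane 0 is the active plane.
memory.load(1, [lambda x: 2 * x] * N)

data = [[r * N + c for c in range(N)] for r in range(N)]
by_row = broadcast(contexts, data, "row")
by_column = broadcast(contexts, data, "column")
print(by_row[2][5], by_column[2][5])
```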

4.3 Tiny RISC Instructions for MorphoSys

Several new instructions were introduced in the Tiny RISC instruction set for effective control of the MorphoSys RC Array operations. These instructions are summarized in Table 1. They perform the following functions:

data transfer between main memory (SDRAM) and Frame Buffer,
loading of context words from main memory into Context Memory, and
control of execution of the RC Array.

There are two categories of these instructions: DMA instructions and RC instructions. The DMA instruction fields specify load/store, memory address, number of bytes to be transferred and Frame Buffer or Context Memory address. The RC instruction fields specify context for execution, Frame Buffer address and broadcast mode (row/column, broadcast versus selective).
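The fields named above can be written down as two small records. The Python sketch below is a hypothetical illustration: it lists only the fields mentioned in the text, does not model bit widths or encodings (which are not given here), and derives load versus store from the mnemonic. The class and property names are our own.

```python
from dataclasses import dataclass

# DMA mnemonics as listed in Table 1
DMA_MNEMONICS = {"LDCTXT", "LDFB", "STFB"}


@dataclass(frozen=True)
class DmaInstruction:
    """Fields of a DMA instruction as described in the text (no encoding)."""
    mnemonic: str
    memory_address: int   # main memory address
    num_bytes: int        # number of bytes to be transferred
    target_address: int   # Frame Buffer or Context Memory address

    def __post_init__(self):
        if self.mnemonic not in DMA_MNEMONICS:
            raise ValueError(f"not a DMA instruction: {self.mnemonic}")
        if self.num_bytes <= 0:
            raise ValueError("num_bytes must be positive")

    @property
    def is_store(self) -> bool:
        return self.mnemonic == "STFB"

    @property
    def target(self) -> str:
        return "Context Memory" if self.mnemonic == "LDCTXT" else "Frame Buffer"


@dataclass(frozen=True)
class RcInstruction:
    """Fields of an RC instruction as described in the text (no encoding)."""
    mnemonic: str
    context: int               # context selected for execution
    frame_buffer_address: int
    by_row: bool               # row versus column mode
    selective: bool            # selective versus broadcast

    def describe(self) -> str:
        axis = "row" if self.by_row else "column"
        scope = "selective" if self.selective else "broadcast"
        return (f"{self.mnemonic}: context {self.context}, "
                f"FB@{self.frame_buffer_address:#x}, {axis}, {scope}")


load = DmaInstruction("LDFB", 0x1000, 64, 0x0)
execute = RcInstruction("DBCBSC", 3, 0x40, by_row=False, selective=True)
print(load.target, load.is_store)
print(execute.describe())
```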

Table 1: Modified Tiny RISC Instructions for MorphoSys M1 Chip

Mnemonic / Description of Operation
LDCTXT / Initiate loading of context into Context Memory
LDFB, STFB / Initiate data transfers between Frame Buffer and Memory
CBCAST / Execute (broadcast) specific context in RC Array
SBCB / Execute RC context, and read one operand data from Frame Buffer into RC Array
DBCBSC, DBCBSR / Execute RC context on one specific column or row, and read two operand data from Frame Buffer into RC column
DBCBAC, DBCBAR / Execute RC context, and read two operand data from Frame Buffer into RC column
WFB, WFBI / Write data from specific column of RC Array into Frame Buffer (with indirect or immediate address)
RCRISC / Write data from RC Array to Tiny RISC

4.4 MorphoSys Program Flow

Next, we illustrate the typical operation of the MorphoSys system. The Tiny RISC processor handles the general-purpose operations itself. Specific parts of applications, such as multimedia tasks, are mapped to the RC Array. The Tiny RISC processor initiates the loading of the context words (configuration data) for these operations from external memory into the Context Memory through the DMA Controller (Figure 2) using the LDCTXT instruction. Next, it issues the LDFB instruction to signal the DMA Controller to load application data, such as image frames, from main memory to the Frame Buffer. At this point, both configuration and application data are ready.
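The set-up phase described above can be traced with a small Python model. This is a sketch under stated assumptions: memories are modeled as word-addressed Python containers, transfers complete immediately, and the classes `DmaController` and `TinyRisc` and their methods are hypothetical names, not part of the MorphoSys tools.

```python
class DmaController:
    """Hypothetical model that copies words out of main memory on request."""

    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.context_memory = {}
        self.frame_buffer = {}

    def _copy(self, destination, mem_addr, count, dest_addr):
        for i in range(count):
            destination[dest_addr + i] = self.main_memory[mem_addr + i]

    def load_context(self, mem_addr, count, ctx_addr):
        self._copy(self.context_memory, mem_addr, count, ctx_addr)

    def load_frame_buffer(self, mem_addr, count, fb_addr):
        self._copy(self.frame_buffer, mem_addr, count, fb_addr)


class TinyRisc:
    """Issues the instructions; the DMA controller performs the transfers."""

    def __init__(self, dma):
        self.dma = dma
        self.trace = []  # mnemonics in issue order

    def ldctxt(self, mem_addr, count, ctx_addr):
        self.trace.append("LDCTXT")
        self.dma.load_context(mem_addr, count, ctx_addr)

    def ldfb(self, mem_addr, count, fb_addr):
        self.trace.append("LDFB")
        self.dma.load_frame_buffer(mem_addr, count, fb_addr)

    def ready(self):
        # Both configuration and application data must be in place.
        return bool(self.dma.context_memory) and bool(self.dma.frame_buffer)


# Words 0-1 stand for context words, words 2-5 for application data.
main_memory = [0xA0, 0xA1, 10, 20, 30, 40]
dma = DmaController(main_memory)
risc = TinyRisc(dma)

risc.ldctxt(mem_addr=0, count=2, ctx_addr=0)
ready_after_context = risc.ready()
risc.ldfb(mem_addr=2, count=4, fb_addr=0)

print(risc.trace, ready_after_context, risc.ready())
```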