An Ultimate Minimal RISC Processor for Space Applications

Elisha K. Kane[*]

Mesa State College, Grand Junction, Colorado, 81501

Many embedded applications, from signal acquisition to process monitoring and control, benefit from programmability. This paper discusses the design and implementation of an “ultimate” minimal RISC processor suitable for embedded processing. Among the advantages of this approach are low gate count and the ability to configure the processor exactly as needed with a minimum of complexity. The simplicity of this processor and the current state of FPGA technology make it possible to develop low-cost, low-power processors for embedded space applications.

I. The Advantages of an Ultimate Minimal RISC Design

A comparison between the number of gates used in the ARM 968E-S soft processor core and the ultimate RISC design presented in this paper helps to illustrate the advantages of lower gate count. The 968E-S is designed for real-time embedded systems, and uses about 68K gates in a minimal configuration. The ultimate minimal RISC processor described here uses under 100 flip-flops and just under 2000 gates (estimated). Lower gate counts mean lower power requirements. Propagation delays are also reduced, which helps to meet timing constraints. In addition, for systems operating in harsh environments, the lower gate count means that larger gates can be used, which results in a more robust device. Lower gate count also allows more processors to fit in the same die area, an advantage for applications like image processing where one processor per pixel might be desirable. The simplicity of the ultimate RISC makes it easy to integrate into a custom ASIC or FPGA solution.

II. What Is an “Ultimate” Minimal RISC Machine?

It has been shown that it is possible to implement a fully functional processor with only one instruction (Ref. 1). That instruction can be either subtract memory from memory and branch if negative, or move memory to memory. A move memory-to-memory processor is capable of computation if the ALU and the accumulator are mapped to memory. This approach is known as a move machine. The ultimate minimal RISC processor described in this paper is a move machine based on the one described in Ref. 2 and is referred to herein as the Really Simple Processor, or RSP. The simplicity of this design allows a very low gate count.

It might seem as though a processor with only one instruction that moves data from one place to another is incapable of performing a conditional branch. However, if the program counter (PC) is also mapped to memory, then a move machine like the RSP can perform conditional branching. Other common instructions can be built the same way, just as any RISC architecture composes complex operations from simpler ones. For example, in the assembler written to test this processor, instructions for dereferencing memory locations have been implemented and used to write a sort. The code sample below illustrates how the conditional branch instruction “jn” (jump on negative) is implemented as a series of moves. Other complicated instructions are achieved similarly.

    ;setup to test the jn (jump on negative)
    mov 0x01, acc            ;move immediate to accumulator
    mov 0x03, sub            ;move immediate to subtract; result should be negative
    ;the jn implementation
    mov ccn, acc             ;move negative flag to accumulator
    mov 0x02, and            ;test if set
    mov acc, shr             ;shift flag right one bit
    mov (code_ptr + 2), add  ;if ccn was 0, result is address following jn
    mov acc, pc              ;if ccn was 1, result is address after next; move result to program counter
    ;control goes here when jn fails; moving the address of a label to the PC here makes one branch
    mov labeladdress, pc     ;if result was not negative (ccn == 0), control goes here
    ;control goes here when jn succeeds; this is the other branch
    mov stuff, acc           ;if result was negative (ccn == 1), control goes here
    ...etc.

Because the move machine can perform computations, read and write data, and implement conditional and unconditional branches, it should be Turing equivalent, memory limits aside. In principle, this processor design can compute anything that any other processor can.
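
The jn sequence above can be traced with a small software model. The following Python sketch is illustrative only and is not the Retro design or the Perl assembler. The register addresses, the flag bit positions (zero in bit 0, negative in bit 1), the choice to update flags only on add and subtract, and the `imm` helper that stands in for assembler-supplied constants are all assumptions made for this example. The program is also held in a separate list rather than in the unified memory the RSP uses.

```python
# Hypothetical register addresses inside the 0xf0-0xff window (not from the paper).
PC, ACC, ADD, SUB, AND, SHR, CCN = 0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA, 0xF9
ZERO_FLAG, NEG_FLAG = 0x01, 0x02   # assumed flag bit positions
MASK = 0xFFFF                      # 16-bit data words


def imm(value):
    """Stand-in for a constant the assembler would place in memory."""
    return ("imm", value)


class MoveMachine:
    def __init__(self, program):
        self.program = program     # list of (src, dst) moves
        self.mem = [0] * 256
        self.pc = self.acc = self.flags = 0

    def _set_acc(self, value, update_flags=False):
        self.acc = value & MASK
        if update_flags:           # assumption: only add/sub change the flags
            self.flags = (ZERO_FLAG if self.acc == 0 else 0) | \
                         (NEG_FLAG if self.acc & 0x8000 else 0)

    def read(self, src):
        if isinstance(src, tuple):
            return src[1] & MASK
        if src == ACC:
            return self.acc
        if src == CCN:
            return self.flags
        if src == PC:
            return self.pc
        return self.mem[src]

    def write(self, dst, value):
        if dst == ACC:
            self._set_acc(value)
        elif dst == ADD:
            self._set_acc(self.acc + value, True)
        elif dst == SUB:
            self._set_acc(self.acc - value, True)
        elif dst == AND:
            self._set_acc(self.acc & value)
        elif dst == SHR:
            self._set_acc(value >> 1)
        elif dst == PC:
            self.pc = value & 0xFF
        else:
            self.mem[dst] = value & MASK

    def run(self):
        # Halt convention for this sketch: stop when the PC leaves the program.
        while self.pc < len(self.program):
            src, dst = self.program[self.pc]
            self.pc += 1
            self.write(dst, self.read(src))
        return self.acc


def jn_demo(a, b):
    """Compute a - b, then branch on negative using only moves."""
    program = [
        (imm(a), ACC),          # 0: acc = a
        (imm(b), SUB),          # 1: acc = a - b, flags updated
        (CCN, ACC),             # 2: jn begins: flags -> acc
        (imm(NEG_FLAG), AND),   # 3: isolate the negative flag
        (ACC, SHR),             # 4: shift it down to 0 or 1
        (imm(5 + 2), ADD),      # 5: add code_ptr + 2 (address 7)
        (ACC, PC),              # 6: pc = 7 (not negative) or 8 (negative)
        (imm(10), PC),          # 7: jn failed: jump to the other branch
        (imm(0xAAAA), ACC),     # 8: jn succeeded: negative path
        (imm(11), PC),          # 9: jump past the end to halt
        (imm(0xBBBB), ACC),     # 10: non-negative path
    ]
    return MoveMachine(program).run()
```

Under these assumptions, `jn_demo(1, 3)` follows the negative path and leaves 0xAAAA in the accumulator, while `jn_demo(3, 1)` leaves 0xBBBB.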

III. The Really Simple Processor

The RSP in this paper is a processor with 8-bit addresses and 16-bit data. Nothing prevents it from being designed for any other address and data width. Memory mapping in the RSP is arranged so that the code segment begins at location 0x00 and extends upward. Each addressable register and ALU function is mapped from the top of memory downward. In this 8/16 version of the RSP, the PC, other registers and ALU functions all occupy the upper 16 words, between 0xf0 and 0xff. The data segment begins at 0xef and extends downward. The initial tool used to design the RSP is a graphical Register Transfer Level simulator called Retro (Ref. 3). The RSP is currently being redesigned in Verilog for implementation in a Xilinx FPGA. The memory block in the Retro simulator is capable of loading contents from a file. A small macro assembler written in Perl allows programs to be written for the processor, then loaded and run in the simulator. The processor has run a simple bubble sort, demonstrating the ability to compare, swap and conditionally branch. Figure 1 shows a block diagram of the RSP.

Figure 1. RSP Block Diagram

There are six registers in this design. The PC holds the address of the memory word that contains the current source and destination operands. The SRC and DST registers are loaded with the source and destination addresses, respectively. The TMP register is used to transfer data between the SRC and DST addresses, and holds the data between the read and write operations. The ACC (accumulator) register is loaded with the appropriate result of the ALU, selected by the destination operand. The Flags register is read only, and contains the flags for negative, zero, underflow, and carry, as well as the leftmost and rightmost bits from the shift operations.

The execution cycle is a five-step process:

1. Load the SRC and DST registers with the contents of the current memory location.

2. Put the SRC onto the address bus; increment the PC.

3. Read from the source address and write that into the TMP register.

4. Put the DST address on the bus.

5. Write the contents of TMP into the destination address.
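
The five steps above can be sketched in Python as a single function acting on a plain 256-word memory. This is a minimal illustration, not the Retro or Verilog design. The paper does not give the instruction word layout, so packing the source address in the high byte and the destination address in the low byte is an assumption, as are the names `execute_move` and `address_bus`. Memory-mapped registers are left out to keep the bus sequence visible.

```python
def execute_move(mem, pc):
    """Run one move on a 256-word memory and return the incremented PC."""
    word = mem[pc]                        # step 1: fetch, load SRC and DST
    src = (word >> 8) & 0xFF              # assumed layout: high byte = source
    dst = word & 0xFF                     # assumed layout: low byte = destination
    address_bus = src                     # step 2: SRC onto the address bus,
    pc = (pc + 1) & 0xFF                  #         and increment the PC
    tmp = mem[address_bus]                # step 3: read the source into TMP
    address_bus = dst                     # step 4: DST onto the address bus
    mem[address_bus] = tmp                # step 5: write TMP to the destination
    return pc


mem = [0] * 256
mem[0x00] = (0xEF << 8) | 0xEE            # one move: copy word 0xef to word 0xee
mem[0xEF] = 0x1234                        # data at the top of the data segment
pc = execute_move(mem, 0x00)
```

After the call, word 0xee holds 0x1234 and `pc` is 1, ready to fetch the next move.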

The clock pulses that provide timing are shown in Figure 2. Pulse 0 loads the SRC and DST registers. Pulse 1 puts SRC onto the address line. Pulse 2 puts DST onto the address line. Pulse 3 puts the contents of TMP onto the data bus. Pulse 4 loads the PC with the next address, either from an increment or from the data bus. Pulse 5 loads TMP with what is on the data bus from the memory. Pulse 6 places either the ACC or the Flags register contents on the data bus. Pulse 7 selects the source from which the PC gets loaded, either from an increment or from the data bus.

Figure 2. Clock Pulses

Timing in the simulator runs in increments of nanoseconds. The entire 5-step execution cycle takes 70 nanoseconds. A great improvement to this execution cycle can be achieved by using dual ported memory. It should be possible to reduce the cycle to two steps:

1. Put the source and destination addresses from the current memory location onto the source and destination address busses; increment the PC.

2. Write data to memory.

This arrangement would also eliminate the necessity of the TMP register, further saving on the overall size and power requirements of the processor.

At the current 70 nanoseconds per move instruction, the processor executes about 14.3 million moves per second. The assembler for the RSP supports 23 instructions. Seven of those take more than one move to accomplish. The shift right and shift left instructions take up to 8 moves each. The rest take only one move. The rotate instructions, which take 5 moves per bit rotation, are not counted in this performance estimate because they appear to be less commonly used. The average number of moves per instruction is about 2.1, which results in about 6.8 million instructions per second. A dual ported memory design with an execution cycle of 30 nanoseconds results in about 15.8 million instructions per second. These numbers give a very rough estimate of integer performance. Of course, with faster cycles the performance will increase.
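
The arithmetic behind these estimates can be reproduced in a few lines of Python. The function names are chosen for this example only. The inputs are the cycle times and the rounded average of 2.1 moves per instruction quoted above.

```python
def moves_per_second(cycle_ns):
    """Moves per second for a given move cycle time in nanoseconds."""
    return 1e9 / cycle_ns


def instructions_per_second(cycle_ns, moves_per_instruction=2.1):
    """Rough instruction rate: move rate divided by average moves per instruction."""
    return moves_per_second(cycle_ns) / moves_per_instruction


single_port_mips = instructions_per_second(70) / 1e6   # five-step cycle
dual_port_mips = instructions_per_second(30) / 1e6     # two-step cycle
print(f"{moves_per_second(70) / 1e6:.1f} M moves/s")
print(f"{single_port_mips:.1f} M instructions/s at 70 ns")
print(f"{dual_port_mips:.1f} M instructions/s at 30 ns")
```

With the rounded average of 2.1, the 30 nanosecond case comes out nearer 15.9 than 15.8 million instructions per second. The small gap is presumably a rounding effect in the average.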

The RTL layout schematic in Retro is shown in Figure 3. This represents a snapshot of the design. Note that some elements of this layout are present only for experimental and debugging purposes, such as the instruction counter register in the upper left and the expanded set of gates performing the bitwise AND function.

Figure 3. RSP Layout in Retro Simulator

IV. Improvements for the RSP

As mentioned above, the use of dual ported memory will shorten the execution cycle of the processor significantly. It will also reduce the total number of flip-flops needed, reducing the size and power requirements of the processor. Once the design has been successfully described in Verilog, synthesis for a particular FPGA will allow a better estimate of how fast the processor can actually run on hardware.

The current design lacks I/O lines. Mapping one or more memory locations to I/O lines is an easy way to provide them. A logical place for the mapping would be between the internal resource mappings, such as PC and accumulator, and the top of the data segment.
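
As a rough illustration of this placement, the Python sketch below decodes an 8-bit address into one of three regions. The register window at 0xf0 comes from the memory map described earlier. The four-word I/O window, the names `decode` and `Bus`, and the way I/O writes are latched are hypothetical choices for this example.

```python
REG_BASE = 0xF0                  # from the paper: registers and ALU functions at 0xf0-0xff
IO_WORDS = 4                     # hypothetical size of the I/O window
IO_BASE = REG_BASE - IO_WORDS    # 0xec-0xef become I/O locations
DATA_TOP = IO_BASE - 1           # data segment now begins at 0xeb and grows downward


def decode(addr):
    """Classify an 8-bit address as 'register', 'io', or 'memory'."""
    if not 0 <= addr <= 0xFF:
        raise ValueError("address out of range")
    if addr >= REG_BASE:
        return "register"
    if addr >= IO_BASE:
        return "io"
    return "memory"


class Bus:
    def __init__(self):
        self.mem = [0] * 256
        self.io_out = {}         # last value driven on each I/O location

    def write(self, addr, value):
        region = decode(addr)
        if region == "io":
            self.io_out[addr] = value & 0xFFFF
        elif region == "memory":
            self.mem[addr] = value & 0xFFFF
        # Register writes would go to the PC and ALU logic, omitted here.


bus = Bus()
bus.write(0xEC, 0x0001)          # a move to 0xec drives an output line
```

Because I/O is just another destination address, no new instruction is needed. The cost is that the top of the data segment moves down by the size of the I/O window.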

As designed, the RSP has no way to receive and respond to interrupts. With the addition of an interrupt line and a path that places the PC on the data bus, the processor can respond to external interrupts by saving the PC and jumping to an interrupt service routine. This allows the processor to be used in monitoring and control applications. The added complexity of this functionality is minimal, especially if dual ported memory is used to eliminate the need for the TMP register. Total gate count should not go up significantly, especially compared to other small RISC soft processors.
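
One way such an interrupt response might look is sketched below in Python. The paper does not specify the mechanism, so the service routine address, the location used to save the PC, and the function name `next_pc` are all hypothetical.

```python
ISR_ADDR = 0x80    # hypothetical start of the interrupt service routine
PC_SAVE = 0xEF     # hypothetical data word reserved for the saved PC


def next_pc(pc, irq, mem):
    """Choose the PC for the next move, honoring a pending interrupt."""
    if irq:
        mem[PC_SAVE] = pc      # PC placed on the data bus and saved
        return ISR_ADDR        # execution continues in the service routine
    return pc


mem = [0] * 256
resumed = next_pc(0x05, True, mem)
```

In this arrangement the return from the service routine needs no extra hardware: a single move from the saved location to the PC resumes the interrupted program.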

Improvements can be made to the processor by taking advantage of resources provided by FPGA devices. For example, the Xilinx XC3S200 has 18x18 bit multipliers built in. These provide fast multiplication in the ALU for DSP applications. This particular FPGA also includes blocks of dual ported memory on-chip, which reduces the number of external lines needed. It also allows the elimination of the TMP register and a decrease in the number of steps in the execution cycle. The programs for the RSP should be able to fit into any on-chip memory, boosting performance by acting like a level 1 cache.

V. Applications for an Ultimate RISC Processor

This processor (or a variation of it) is suited for image processing, for example performing edge detection with a parallel array of processors, one per pixel. Other massively parallel applications could benefit from such small processors because so many more of them can fit onto a given device. Adaptive computing platforms may also benefit from the use of such simple processors. They could be used for dynamic reconfiguration of FPGA devices, giving embedded systems the ability to perform more than one function with the same hardware and so increasing a system's capabilities. These processors could also provide dedicated encryption/decryption engines for a system.

VI. Conclusion

An ultimate RISC processor designed as a move machine, like the RSP described in this paper, is very simple yet still very capable. It requires a very small number of gates and flip-flops to implement, estimated at fewer than 2000 and 100, respectively. The low gate count means lower power requirements. Additionally, the reduced area required for implementation allows more functionality in a given device. The small size allows processing power to be included in very small devices that lack the resources for more complicated processor designs. Lower gate count also means that larger gates can be used to increase robustness in harsh environments. The simplicity of the processor does not limit its capabilities: it is well suited for image processing, monitoring systems, and performing configuration. Designers can easily customize the processor for the application being developed.

VII. References

1. Laplante, P.A., A Novel Single Instruction Computer Architecture, ACM Computer Architecture News, Volume 18, Issue 4, December 1990, 22-26, ACM Press, New York.

2. Jones, D.W., The Ultimate RISC, ACM Computer Architecture News, Volume 16, Issue 3, June 1988, 48-55, ACM Press, New York.

3. Chansavat, B., Register Transfer Object Hardware Simulator, The University of Western Australia, Dept. of Electrical & Electronic Engineering, Centre of Intelligent Information Processing Systems.

4. Guo, Zhi; Najjar, Walid; Vahid, Frank; Vissers, Kees, A quantitative analysis of the speedup factors of FPGAs over processors, Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays, 162-170, ACM Press, New York, 2004.

5. Aletan, Samuel O., An Overview of RISC Architecture, Proceedings of the 1992 ACM/SIGAPP Symposium on Applied Computing: Technological Challenges of the 1990s, 11-20, ACM Press, New York, 1992.

6. Laplante, P.A.; Gilreath, W., One Instruction Set Computers for Image Processing, The Journal of VLSI Signal Processing, Volume 38, Number 1, August 2004, 45-61, Springer Science + Business Media B.V.

7. Mavis, D.; Cox, B.; Adams, D.; Greene, R., A Reconfigurable, Nonvolatile, Radiation Hardened Field Programmable Gate Array (FPGA) for Space Applications, Session B of the Military and Aerospace Applications of Programmable Devices and Technologies Conference, 1998.

8. Mohanram, K., Closed-form simulation and robustness models for SEU-tolerant design, Department of Electrical and Computer Engineering, Rice University, Houston, Texas, 77005.

[*] Undergraduate Student, Computer Science