Time is Out of Joint

A Tomasulo OOE Processor with Reorder Buffer

By

Knaves Unthrifty The

A Company of Shakespeare’s Finest Coders

Carl Wang / cs152-calwang
Howard Tsai / cs152-howardt
Shelley Chen / cs152-shelleyc
Will Leven / cs152-wleven


Abstract

The project we present is a significantly scaled down version of our original 68-point beast. We implemented a Tomasulo-style out of order execution processor with a reorder buffer. But since there are no floating point operations, the only multi-cycle instructions are loads and stores. In addition, we don’t have any branch prediction—static or dynamic—which made development significantly faster. However all the interfaces were designed to support additions such as branch prediction in case we had extra time.

♣Division of Labor

Since this project consisted of many different parts, we split up the individual parts and designed them somewhat separately. While designing and implementing, we kept in constant communication (verbal, electronic or through online notebooks) with each other so that we would avoid interfacing errors.

Carl Wang / Load/store unit / Instruction Issuer
Howard Tsai / Reservation stations / Integer unit
Instruction Queue / Instruction Decode/Issuer
Branch Functional Unit
Shelley Chen / Reorder Buffer\Commit Unit / Debugging
Will Leven / Register Status Table / CDB Arbitration
Monitors / Instruction Fetch

♣Detailed Strategy

To implement the Tomasulo processor, we scrapped most of our pipelined processor from the previous labs. The sections that we kept mostly unmodified included the caches, memitrator and the register file. With the exception of basic building blocks, such as muxes, everything else was designed from scratch to support the new architecture.

Instruction Fetch Unit

The instruction fetch unit monitors the instruction queue and cache and manages the PC accordingly. Upon the queue detecting a branch or jump, this unit will freeze the PC and write one more instruction(delay slot inst.) into the queue. It then waits for a branch/jump complete signal from the datapath before moving on with the correct PC. Upon a instruction cache miss, the unit will freeze the PC and wait until the valid instruction appears and writes that instruction into the queue before continuing.

Instruction Queue

The instruction queue permits instruction fetch to be de-coupled from the rest of the processor. It takes an instruction, the pc, and pc+4 from the instruction fetch unit and stores these into a queue. It also examines instructions to locate branch/jump instructions and signals the instruction fetch accordingly. The queue stalls the instruction fetch unit if it is full and stalls the decoder if it is empty.

Decode and Issue

Decoding and issuing instructions is more complicated than the original five-stage pipeline decode stage. Since values for registers can be found either in the register file, broadcasted on the CDB, or coming from a reservation station. The “name” for the value that an instruction might need is no longer a register, with its number stored in the instruction itself. Now that we have register renaming with our Tomasulo processor, we need to look in the register status table for the name of the ROB number with that register number. The register status number will say if the register is busy or not. If so, the decode unit needs to find the value from the associated reservation station. And if the reservation station is busy, the only other place to find the missing data is on the CDB. However, if the instruction hasn’t been committed yet, the decode unit load the instruction and whatever values for operands it.

Also, instructions can go to one of three different functional units. The decode unit uses the opcode and the funct field to determine which functional unit it goes into. Since this is a very important part in the life of an instruction, there is a monitor that prints out some information every time an instruction is issued.

Branch Ops Functional Unit

This unit handles all PC-changing instructions. There are four reservations stations, although in our current non-branch predicting Tomasulo only one station will ever be used. Again, the branch logic block was from the 5-stage pipeline. Technically, the only instructions that require the functionality of a reservation station are branches and jump registers. But to simply our datapath, all jumps are loaded into this Unit, they are ‘ready’ by default. Note, that since all these instructions are done in the execute stage, there is always an extra “jump/branch” delay slot invisible to the programmer. The only instruction that requires the CDB is ‘jal’, which broadcasts $ra. This unit also interfaces with the instruction fetch unit since it needs to provide information about what new PC to use and if a branch is finally calculated.

Integer Ops Functional Unit

The I-OPS unit handles all the non-PC changing and memory access MIPS instructions. It consists of a control, four reservation station units, and a non-pipelined execution datapath. The execution datapath is constructed out of the parts from the 5-stage pipeline. The functional unit has an interface to the issue unit as well as the CDB. The functional unit’s control maintains the state of the reservation stations, arbitrates request for the execution datapath, and sends request for CDB control.

Here we optimized for high throughput and ease of interface. To get the functional unit as busy as possible, arbitration is performed on every edge, thus execution occurs on every cycle. The control for deciding which station to write to next is handled inside he unit.

Memory Unit
The load store unit had an additional requirement in that it needs to handle both memory and register dependencies. The reservation stations will handle the register dependencies but nothing really handles memory dependency. To ensure that memory dependencies don't cause problems, we impose the rule that if a store has been issued, no loads after it can execute until the store has committed. This is so that any loads that might load from an uncommitted instruction won't load garbage. While this policy does make execution slower, this approach makes it easier to create a working processor, which is our number one goal.
To keep the addresses in order, we use a buffer that keeps track of which instruction was issued first. The controller and the arbiter will look in the buffer to find the oldest-issued instruction. The buffer ends up containing pointers to reservation stations (within the load/store unit). The controller and the arbiter pick pointers from the buffer to decide who can go. If an instruction (namely a load since stores will never happen out of order) executes before the older ones in the queue, then the other instructions behind it will squish the instruction that was just removed from the reorder buffer.
The reservation stations never need to know what order its instruction was issued in. In fact, the reservation stations are just there to look for register dependencies in the instructions. Otherwise, they just serve as fancy registers.
For choosing which instruction to commit, the arbiter will first see if there is a store in the oldest slot. If so, and the ROB says the store can commit, the load/store unit will send the store to memory. Otherwise, it will look for the first ready load that is not behind a store. This way, priority is given to older instructions since it is likely that they have more dependent instructions.
Unit Arbiter

The Unit Arbiter determines which functional unit gets access to the CDB. The arbiter always grants the jump/branch unit access when it needs it. We made this decision because we have no branch prediction and want to make decisions on branches as quickly as possible in order to continue fetching instructions. The arbiter implements a round-robin scheme to choose between the integer and memory units.

Reorder Buffer
On the surface, the reorder buffer seems very straightforward. It just a table which keeps track of the order that instructions are issued. However, in our processor, the reorder buffer ended up doubling up as a commit unit as well. Not only does the reorder buffer keep the order that instructions were issued, but it also writes the destination values to the register file once the instruction has finished execution. Thus, it needs to interact with multiple components in the system, which includes the register status table, the CDB, and the register file.
The reorder buffer is a queue data structure, so we needed a head and tail pointer to keep track of the state and the size of the buffer. The head pointer keeps track of the next empty slot in the reorder buffer while the tail pointer keeps track of the next instruction to commit. The buffer control can tell that the buffer is empty if the head pointer equals the tail pointer and the tail entry is invalid.
Register Status Table

The Register Status Table (RST) tracks dependencies of registers and is the critical element for performing register renaming. The RST contains thirty-two intelligent data storage elements, one for each register. The elements are uniquely programmed with register number. The elements have storage for a Reorder Buffer (ROB) number and a busy bit. They also have the ability to detect if any of four register number inputs match their value. The RST needs to handle up to four requests simultaneously. The Decode/Issue Unit queries the RST for the status of up to two register numbers. The RST returns either “not busy” if the register’s value is in the register file, or “busy” and a ROB number to identify a dependency. If necessary, the Decode/Issue Unit also sets the destination register entry to busy and assigns its ROB value. Finally, the RST also allows the ROB to change the status of a register.

Monitors

The issue monitor tracks the issuing of instructions from the Decode/Issue Unit. It outputs the cycle number, PC, instruction, and Tag or ROB number. This unit could be expanded by tracking the status of the registers when the instructions are issued. The CDB monitor tracks the status of the Common Data Bus. It records the cycle number, the tag, and the functional unit that has control. One general design flaw was our failure to pass PC’s and complete instructions to reservation stations. This complicated debugging, although we could usually identify functions by their ROB number. The commit monitor tracks instructions leaving the ROB. This monitor displays the same information as the issue monitor.

Reservation Station

Four of these generic reservation stations are used in every functional unit. It is designed as a hybrid of registers and control that watches the CDB for values that it needs. The standard entries are supported: Vj, Vk, Qj, Qk, Tag, & Op. There are also other signals used to interface with the rest of the datapath, such as CDB. The write enable is falling-edge triggered.

The station determines when it’s ‘ready’ to execute and signals a request for the execution datapath when all dependencies have been clear. It monitors the CDB for the correct tag and latches in the bus value. In the interest of time, the reservation station is implemented in VHDL.

The design was geared towards maximum usage of the station. Thus, a station’s ready logic is asynchronous and will signal that it is ready before the falling clock. This allows execution to start immediately on the falling edge. This requires that there be a comparison set-up time. Furthermore, its execution request signal is cleared on a rising edge, thus allowing it be raised again on the following falling edge, thereby eliminating any overhead stalls in reusing these stations.

Datapath Highlights

The datapath is divided into the stages of inst. Fetch, decode & issue, execution, and commit. The memory and register file sits on the side, accessible by the different stages.

One key performance aspect we aimed for was that dependent instructions can run back-to-back. To do so would mean the result of the execution of the first is ready for use in the stalling instruction in the very next cycle. This required that either a) we forward the output of a unit to every reservation station, or much simpler b) combine the EX and Write stage. This is analogous to the “removal” of the WB stage was handled in our 5-stage pipeline since writing to the register file happen on a clock edge. Here the execution and request for the CDB happens in parallel. We are guaranteed by the ‘CDB arbiter’ that on the coming falling edge, the result of the execution can be latched into the ROB and any reservation station.

A second thing we want to highlight the optimization of the way branch and jump delay slots are fetched. Since we do not have branch prediction, one of our biggest bottlenecks is the resulting stalls the fetch. As an effort to alleviate the problem somewhat, the fetch control makes sure the delay slot instruction is fetched before we stallthe PC thus allowing its issue while the branch is calculated.

Areas of Difficulty

In order to convert our processor into the Tomasulo architecture, we pretty much needed to scrap everything we had done up until lab 6. The only component that was reusable was the memory system. Just redesigning and implementing everything in itself is very time consuming. In addition to this, testing everything was very difficult as well. Since everyone worked on different independent components of the processor, it was very hard to determine the interfaces between these components. Many times it wasn’t obvious what the inputs and outputs to certain components were until we tried to hook things up together. This makes for a lot of debugging and redesigning of components.

♣Results

Critical Path

The critical path for our Tomasulo processor is illustrated above. The critical path continues to be a memory access, more specifically on a cache hit in the load/store unit. Bascially, the Reorder buffer controller outputs the tail_store signal (letting the load/store unit know that a store instruction needs to be committed), which goes into the load / store abiter, who outputs asserts the enable signal to the cache, passing along the address that needs to be read or modified. The cache checks the instruction’s tag with that of the address in the cache set (with a comparator) and outputs the data for the cache line that the hit was for. The breakdown of the critical path is below:

Reorder Buffer Control:6ns

Load / Store Arbiter:6ns

Address calculation16ns

32x2 multiplexor (to choose address from pipeline):1.4 ns

Comparator:10 ns

Cache line output muxes:2.5 ns + 1.4 ns

Total delay:43.2 ns

Performance Results

To test the performance of our processor, we just did a side-by-side comparison between our Tomasulo processor and our 5 stage pipelined processor from lab 6. The calculated CPIs are shown below for three different programs:

Program / CPI with Tomasulo processor / CPI with 5 Stage Pipeline
Lab 5 mystery program / 1.68 / 2.68
Quick Sort / 1.89 / 2.63
Partial Sums / 1.95 / 1.82

From the above results, it’s clear that the Tomasulo architecture does not improve the performance of the CPU for all programs. Partial Sums doesn’t have a lot of load and store instructions for the Tomasulo architecture to take advantage of.

Conclusion

This project has taught us much more than we ever had hoped to learn about designing a processor. At the start of the project, none of us really understood how the Tomasulo architecture really worked, so the initial learning curve was steep. A lot of time spent sitting down and discussing how the different components would interface with each other.

The downside of implementing a Tomasulo processor is that it is not designed for speed when the instruction set doesn’t include multicycle instructions. However, the insruction set that our processor supported didn’t include multiplication/division. Given more time, we would have definitely implemented a branch predictor. The more instructions that you issue, the faster the program should run on a Tomasulo processor. Also, we would have put in more reservation stations so that we wouldn’t have such a big problem with structural hazards.

Appendix I (notebooks):

ShelleyChen (116 Hours)