Jack Kang

Benjamin Lee

David Lee

Lyle Takacs

Computer Science 152
Report – Final
15 May 2003

Contents

Abstract

Division of Labor

Strategy

Superscalar Processor

Superscalar Datapath [Design | Test]

Instruction Cache Block [Design | Test]

Issue Unit [Design | Test]

Forwarding [Design | Test]

Branch Prediction

Branch Target Buffer [Design | Test]

Branch History Table [Design | Test]

Static Branch Prediction [Design]

General Testing

Results

Conclusion

Note: Appendices included in supplemental files. All hyperlinks reference these files.

Appendix I – Notebooks

Jack Kang

Benjamin Lee

David Lee

Lyle Takacs

Appendix II – Schematics

Top Pipeline Schematic

Bottom Pipeline Schematic

Cache Block Schematic

Decode Block Schematic

Execute Block Schematic

Instruction Cache Block Schematic

Memory Block Schematic

Superscalar Datapath (Lower Right) Schematic

Superscalar Datapath (Upper Left) Schematic

Superscalar Datapath (Lower Left) Schematic

Superscalar Datapath (Upper Right) Schematic

Superscalar Datapath (Overall) Schematic

Appendix III – Verilog

Appendix IV – Testing

Issue Unit

Forwarding

Branch Prediction

General

Abstract

The purpose of this lab is to enhance the 5-stage pipelined processor from previous lab work with a major sub-project and one or more minor sub-projects, as stated in the lab specifications. For the major sub-project, the team selected a 2-way superscalar processor with two pipelines, one instruction stream, and one cache. For the minor sub-project, the team implemented dynamic branch prediction with a branch target buffer and a branch history table.

The performance gains of a superscalar processor come from exploiting instruction-level parallelism to achieve greater instruction throughput than was possible in our previous pipelined implementations. The ability to execute instructions in parallel is limited by the ability to issue instructions in parallel. A significant design issue and functional component of our superscalar processor is therefore the issue unit, which checks dependencies between instructions and determines the number and order of instructions executed on the two pipelines. In addition to the dual-issue fetch/decode unit, the two pipelines must also support forwarding both within each pipeline and between pipelines.

The number of delay slots increases with a superscalar implementation, compounding the effects of data and control hazards. This motivates our minor sub-project: branch prediction is intended to reduce the performance impact of the extra delay slots. The branch target buffer holds the target addresses of past branch instructions, making the target address available to the PC in the fetch stage. The branch history table records the taken history of branches, making the taken signal available to the PC source mux in the fetch stage. With these buffers, the effective number of delay slots is reduced, along with the performance penalties due to control hazards.

Static branch prediction is also enabled by the stream buffer supplying instructions to the issue unit. When a branch is issued, the instructions in the buffer continue to be supplied to the issue unit; when a branch is taken, the stream buffer is flushed. This scheme amounts to static branch prediction in which every branch is predicted not taken and the stream buffer is flushed whenever a branch is in fact taken.

The final result of this lab is a five-stage, two-way superscalar processor implemented as a Harvard architecture. The design of the pipeline stages was derived from our past implementations of those stages. The stages were grouped into schematic blocks, and the final superscalar processor was assembled in schematic. The new forwarding, control signals, and logic for dual issue were implemented in Verilog. The design process included dependency checking and forwarding as well as dynamic branch prediction.

Division of Labor

Jack Kang

Design and implementation of the dual issue unit, including the FIFO, instruction cache, and issuer.

General testing and debugging in simulation and board.

Benjamin Lee

Design and implementation of the branch history table and branch target buffer.

Datapath assembly.

General testing and debugging in simulation.

David Lee

Design and implementation of the dual issue unit, including the FIFO, instruction cache, and issuer.

General testing and debugging in simulation and board.

Lyle Takacs

Design and implementation of forwarding unit.

Modifications to the monitor module.

Datapath assembly.

General testing and debugging in simulation.

Strategy

Part I – Superscalar Processor

  1. Superscalar Datapath

Superscalar Datapath (Lower Right) Schematic

Superscalar Datapath (Upper Left) Schematic

Superscalar Datapath (Lower Left) Schematic

Superscalar Datapath (Upper Right) Schematic

Superscalar Datapath (Overall) Schematic

Design – Superscalar Datapath

The superscalar datapath requires duplicating the execute stage of the pipeline. Specifically, the processor has one instruction fetch stage, one instruction decode stage, two execute stages, one memory stage, and one write-back stage. Although instructions can execute in parallel, only one memory stage exists, which means that only one of the two instructions issued in parallel may access memory. The corresponding stage in the other pipeline has no functionality but must still propagate its data and control signals through a “memory” stage in order to keep the same number of stages in each pipeline. Since only one pipeline has a memory stage, there is one DRAM and two caches (instruction and data), and no modifications to the memory interface were necessary.

The final high level schematic includes all five stages of the pipeline modularized into several larger inclusive blocks:

Decode Stage Verilog

Decode Stage

The decode stage module is the symbol corresponding to the Verilog instantiations of a FIFO queue and the issue unit. The issue unit checks dependencies between instructions prior to issuing them to the pipelines (a detailed discussion of dual issue follows). The FIFO queue buffers issued instructions so that the pipelines are, as far as possible, always issued instructions. It hides from the pipelines the fact that instructions may be stalled and/or swapped by the issue unit, providing a continuous stream of decoded instructions. The decode stage also interfaces with the RegFile, to decode each instruction and fetch the corresponding operands, and with the instruction cache block, to receive instructions from memory. The instruction cache block has been modified to fetch enough instructions to keep the issue unit supplied (a detailed discussion of the modified instruction cache block follows).

Decode Block

Execute Block

Decode Block

In prior implementations of our five-stage pipelined processor, a portion of the execution of R-type instructions occurred in the decode stage. Specifically, a branch is resolved and its address calculated in this stage, and a shift instruction is also executed in this stage. The superscalar implementation of our processor required separating the decoding portion of this stage (in the Decode Stage block) from the executing portion (in the Decode Block block). The Decode Block contains an extender and a shifter that were originally used for branch calculations but are now used to support the shift operation in the next pipeline stage. It also contains the forwarding muxes for the decode stage, so that forwarded values can be used for branch target calculations. The actual branch calculations now occur in the issue unit, and these forwarded values are passed into the issue unit accordingly.

Execute Block

The execute block encapsulates the ALU, the forwarding muxes for ALU operands, the SLT unit, and the Shifter unit. This block is included in both the top and bottom pipeline.

Memory Block Schematic

Memory Block

The memory block encapsulates memory-mapped I/O, the data cache, and the associated read and write logic.

Top Pipeline

Bottom Pipeline

Pipeline Top

The top pipeline contains the three remaining pipeline registers (ID/EX, EX/MEM, MEM/WB). Between the ID/EX and EX/MEM registers is the execute block. Between the EX/MEM and MEM/WB registers is a single mux that combines the results from the execute block into a single value to forward into the decode and execute stages.

Pipeline Bottom

The bottom pipeline contains the three remaining pipeline registers (ID/EX, EX/MEM, MEM/WB). Between the ID/EX and EX/MEM registers are the execute block and two muxes taking in the original register values and the next PC as seen by a jump register instruction. These muxes provide support for a jump register instruction followed by an instruction that uses $R31 (e.g. jr loop; addiu $31, $31, 4).

The memory block is located between the EX/MEM and MEM/WB registers. It is also supported by a new multiplexor that takes the executed values from the top pipeline and uses them as input to memory. This allows an instruction to execute and write a value into the register file in the top pipeline while a store of the same value occurs in the bottom pipeline (e.g. addu $1, $2, $3; sw $1, 0($0)).

Forward Verilog

Forward Schematic

Forward

The forwarding unit is positioned between the two pipelines, taking inputs from various pipeline stages to check for dependencies (a detailed discussion of the forwarding unit follows). It also outputs the select signals to both pipelines to control the forwarding in various stages.

Branch Target Buffer Verilog

Branch History Table Verilog

Branch Target Buffer

The branch target buffer connects to the decode stage block to provide the PC with a branch target address in the fetch stage, before the actual address is calculated in the decode stage (a detailed discussion of the branch target buffer follows). The buffer holds the last target address of each branch, is direct mapped, and replaces entries upon a conflict miss. Our team implemented 32-entry, 256-entry, and 2048-entry buffers, but decided against the 2048-entry buffer because such a large module would take a significant amount of time to map to the board.
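The lookup-and-replace behavior described above can be illustrated with a behavioral sketch. This is a Python model, not the team's Verilog; the class and field names, and the word-aligned indexing scheme, are assumptions for illustration.

```python
# Behavioral sketch of a direct-mapped branch target buffer.
# Names and the indexing scheme are assumptions, not the team's Verilog.
class BranchTargetBuffer:
    def __init__(self, entries=32):
        self.entries = entries
        self.tags = [None] * entries   # PC of the branch that owns each slot
        self.targets = [0] * entries   # last target address seen for that branch

    def _index(self, pc):
        # PCs are word aligned: drop the 2 low bits, then take the index bits
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Fetch-stage lookup: return the buffered target, or None on a miss."""
        i = self._index(pc)
        return self.targets[i] if self.tags[i] == pc else None

    def update(self, pc, target):
        """Called once the real target is known; conflicting entries are replaced."""
        i = self._index(pc)
        self.tags[i], self.targets[i] = pc, target
```

A second branch mapping to the same slot simply evicts the first, which is the conflict-miss replacement policy the report describes.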

Branch History Table

The branch history table connects to the decode stage to provide the taken signal to the PC source multiplexer in the fetch stage, before the actual comparator generates the true taken signal in the decode stage (a detailed discussion of the branch history table follows). The history table is a 2-bit predictor and employs hysteresis to improve prediction accuracy. Our team implemented 32-entry, 256-entry, and 2048-entry tables; again, we decided against the 2048-entry table because such a large module would take a significant amount of time to map to the board.
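The 2-bit hysteresis scheme can be sketched as a single saturating counter per table entry. This is a behavioral Python model, not the team's Verilog; the initial "weakly not taken" state is an assumption.

```python
# Behavioral sketch of one 2-bit entry in the branch history table.
# Counter states 0-1 predict not taken, 2-3 predict taken. A single
# outcome moves the counter one step, so the prediction only flips
# after two consecutive mispredicts -- the hysteresis the report cites.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 1  # assumed initial state: weakly not taken

    def predict(self):
        return self.counter >= 2  # True means predict taken

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```

For a loop branch that is taken many times and falls through once, hysteresis keeps the predictor in a taken state across the single not-taken outcome, avoiding two mispredicts per loop visit.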

Register File

The register file was changed considerably. It needed two sets of Ra and Rb outputs, one for the top pipeline and one for the bottom pipeline, with two corresponding sets of inputs selecting which registers to read. During writes, if both pipelines tried writing to the same register, the bottom pipeline would take precedence, as the bottom pipeline contained the later instruction.

Additionally, there was a special write input port and an output port for register $31. This was necessary so we could run the jal instruction in the decode stage and immediately write into register $31; otherwise, the forwarding unit would have to come through the issuer in order to take care of instructions immediately after the jal that modify register $31 (since the address is calculated in decode). This created a WAW hazard: if another instruction modifying register $31 was already in the pipeline, it would overwrite this value. We fixed this by disabling pipeline writes to register $31 for 2 cycles after a jal.
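The write-precedence rule can be made concrete with a behavioral sketch. This is a Python model for illustration only; the port names and the `(regnum, value)` write representation are assumptions, and the 2-cycle jal interlock is not modeled.

```python
# Behavioral sketch of the dual-write register file precedence rule:
# when both pipelines write the same register in one cycle, the bottom
# pipeline (the later instruction) wins. Port names are assumptions.
class RegFile:
    def __init__(self):
        self.regs = [0] * 32

    def write_cycle(self, top=None, bottom=None, jal_r31=None):
        """top/bottom are optional (regnum, value) writes from the two
        pipelines; jal_r31 models the dedicated $31 port used by jal
        in the decode stage."""
        for wr in (top, bottom):   # bottom applied last, so it takes precedence
            if wr is not None and wr[0] != 0:   # $0 stays hardwired to zero
                self.regs[wr[0]] = wr[1]
        if jal_r31 is not None:
            self.regs[31] = jal_r31
```

Applying the bottom pipeline's write last is the whole precedence mechanism: the later instruction's value is the one that survives the cycle.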

Testing – Datapath

Testing of the superscalar datapath was only possible after the issue unit was attached to the two pipelines. These two components were thoroughly tested with various small assembly files to test specific cases in conjunction with more comprehensive tests. A detailed discussion of datapath testing may be found in the part on general testing (III).

  2. Instruction Cache Block

Instruction Cache Block Schematic

Instruction Cache Controller Verilog

Two Way Cache Verilog

Design – Instruction Cache Block

Because the superscalar processor must issue 2 instructions per cycle, the instruction cache was changed to have a 128-bit output, i.e. 4 instructions per cycle. Each cacheline is striped across four 32-bit SRAM blocks so that, on a read, all four instructions can be accessed simultaneously. Fetching 4 instructions at a time keeps the processor busy while the cache is fetching new instructions. It also lets the issue unit look ahead in the instruction stream: if a branch or jump is coming and its delay slot is not in the current block, the cache can begin fetching the delay slot.

The instruction cache is now direct mapped, as opposed to the two-way set-associative cache of Lab 6, because the instruction stream is generally sequential, so a two-way set-associative cache would give little benefit. It would instead complicate the cache, requiring muxes that take in 256 bits and output 128 bits, and slow the cache down.
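The direct-mapped lookup with 4-instruction (16-byte) lines can be sketched as a simple address split. The line size follows from the 128-bit output described above; the line count is an assumption, since the report does not state the cache size.

```python
# Sketch of the address split for the direct-mapped instruction cache.
# The 16-byte (4-instruction) line size comes from the 128-bit output;
# the line count is an assumption for illustration.
LINE_BYTES = 16
NUM_LINES = 64   # assumed: 1 KB of instruction cache

def split_address(addr):
    """Return (tag, line index, instruction offset within the line)."""
    offset = (addr % LINE_BYTES) // 4         # which of the 4 instructions
    index = (addr // LINE_BYTES) % NUM_LINES  # which cache line
    tag = addr // (LINE_BYTES * NUM_LINES)    # remaining high bits
    return tag, index, offset
```

Direct mapping means the index bits alone pick the line; a hit only requires one tag comparison, with no 256-to-128-bit way-select muxes.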

Testing – Instruction Cache Block

To test the instruction cache, we loaded it with blocks of instructions by telling it to load each instruction individually. We then read from the cache by giving it an address and reading out the block. To make sure that no two blocks overlapped, we loaded several adjacent blocks and read out the final block values.

  3. Issue Unit

Issue Unit Verilog

FIFO Verilog

Design – Issue Unit

The issuer is broken into 2 pieces: a FIFO buffer and an issue unit. The FIFO can store up to 8 instructions (2 blocks of instructions), which it grabs from the instruction cache. It stores 2 blocks because if a branch or jump is the last instruction of a block, its delay slot must be in the next block; with 2 blocks in the FIFO, we can be sure that when the branch or jump reaches the top of the buffer, the delay slot will have been fetched. The top 2 instructions in the FIFO are visible to the issue unit at all times. The issue unit analyzes these 2 instructions and, upon deciding to issue zero, one, or two instructions to the rest of the datapath, removes zero, one, or two instructions from the top of the queue.

The FIFO is implemented as 8 registers with 2 pointers. One pointer points to the current top entry in the FIFO. This and the next entry are the visible entries to the issue unit. The other pointer keeps track of the next empty slot. This slot, plus the 3 slots immediately after it, are the 4 slots that are filled when 128 bits are read from the cache. Finally, there is a counter that keeps track of the number of valid entries in the FIFO. When this drops to a certain level, then the FIFO unit will grab more instructions from the cache. The command to grab more instructions comes from the issue unit, which is also responsible for sending the proper pc to the instruction cache.

The issue unit is basically a controller that grabs two instructions from the top of the FIFO and analyzes them. There are a few cases worth noting:

  • If there are no structural hazards or data forwarding hazards between the two instructions then we can issue 2 instructions, with the first instruction in the top pipeline and the second in the bottom pipeline.
  • Because we only have one memory stage, memory instructions (lw and sw) must be issued in the bottom pipeline, which has the memory stage. If we get two memory instructions in a row, we can only issue one at a time. If, however, the memory instruction comes with a 2nd instruction that has no dependency on the memory instruction, we can swap them so that the later instruction executes in the top pipeline and the memory instruction goes into the bottom pipeline. There was a small optimization done here in the case of two load instructions loading to the same register. If we saw that case, we would only issue the 2nd instruction and discard the first.
  • Branches and jumps must be issued in the top pipeline with its delay slot instruction in the bottom pipeline. This is so that we can preserve the number of delay slots after a branch or jump instruction. No matter where the branch or jump is located in a block of instructions, the number of delay slots after it is a constant of one slot.
  • If there are any dependencies between 2 instructions, only one instruction is issued, otherwise 2 instructions are issued. A signal is sent back to the FIFO to remove one, two, or zero instructions and output the next pair of instructions.
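The issue cases above can be condensed into a decision function. This is a simplified Python sketch, not the team's Verilog: instructions are modeled as dicts with an assumed 'kind' ('mem', 'branch', or 'alu') and register fields, and the swapped-load optimization and delay-slot corner cases are omitted.

```python
# Sketch of the dual-issue decision for the two visible instructions.
# The dict representation ('kind', 'rd', 'rs', 'rt') is an assumption.
# Returns (top_slot, bottom_slot); None means the slot issues a bubble.
def issue(a, b):
    if a is None:
        return None, None
    if b is None:
        return single(a)
    # data dependency: the second instruction reads the first's destination
    if a["rd"] in (b["rs"], b["rt"]):
        return single(a)
    if a["kind"] == "branch":
        return a, b                 # branch on top, delay slot on the bottom
    if a["kind"] == "mem" and b["kind"] == "mem":
        return single(a)            # one memory stage: issue one at a time
    if a["kind"] == "mem":
        return b, a                 # swap so the memory op gets the bottom pipeline
    return a, b

def single(a):
    # memory instructions must use the bottom (memory-stage) pipeline
    return (None, a) if a["kind"] == "mem" else (a, None)
```

The FIFO then removes as many instructions as were issued (here, two unless `single` was taken) and exposes the next pair.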

Testing – Issue Unit