Abstract

The final design is a super-scalar MIPS microprocessor that can handle two instructions in parallel when there are no dependencies between them. Because more instructions are executed per cycle than in the single-issue processor, instruction cache misses become a significant bottleneck, so a stream buffer is included to relieve this problem. A victim cache is also added to reduce the miss rate of the data cache. Together, these two features address the high cost of a DRAM access. Finally, the first-level cache was reorganized as a CAM-based (content-addressable memory) cache.

Division of Labor

Implementation of the super-scalar design was split into three parts: schematic drawing and wiring, control logic and hazard handling, and memory. Memory consists of the memory controller for the new DRAM component, the stream buffer, and the victim cache. Jeff handled the memory controller, Qian worked on the datapath drawing and wiring, Steve worked on the stream buffer, and Kent worked on the victim cache. Each member was responsible for testing and verifying their assigned component. In addition, everyone contributed to the control logic and hazard handling because of the complexity of the subject.

Detailed Strategy

Datapath

The structure of the datapath itself is fairly simple: much like the single-issue, 5-stage pipeline covered in class, the super-scalar version features the same stages, but many components are duplicated to allow for complete parallelism. The datapath is also aligned so that only even instructions can go through the top (even) pipeline, while only odd instructions can enter the bottom (odd) one. A high-level diagram is given below.

In general, many components of the datapath were duplicated for the parallel execution of the even and the odd instructions, while others were given additional ports. As a result, the instruction bandwidth of the datapath doubles and the CPI of the processor drops, since more instructions complete per cycle.

The following table summarizes how the classic 5-stage datapath was changed at each pipeline stage.

Table 1. Datapath Modifications

Pipeline Stage / Components Duplicated / Components with Wider Bandwidth
IF / Pipeline registers / Instruction Cache outputs
ID / Instruction Decode Controller; Extender; Branch Comparator; Forwarding muxes; Branch PC bus; Jump PC bus; Pipeline registers / Register File outputs; Hazard Controller inputs and outputs; Stall Unit inputs; Forwarding mux inputs
EX / ALU; SLT Unit; Shifter; Pipeline registers / None
MEM / Pipeline registers; Memory Source Muxes / None
WB / Register File Source Select Mux / Register File inputs; Monitors

In the instruction fetch stage, the output bandwidth of the instruction cache is doubled so that the datapath can fetch the even and odd instructions in parallel. Since the even and the odd instructions sit in the same 2-word cache line, this change was made in the cache datapath. The data cache of the MEM stage, on the other hand, does not need to be modified, because our memory system allows only one memory access at a time; the stall unit in the decode stage separates two parallel memory-access instructions.

The major modifications of the datapath are in the instruction decode stage and the execution stage. Components that operate on an individual instruction, such as the branch comparator, the extender, and the ALU, are duplicated. The hazard and stall units, however, must take the instructions in the ID, EX, and MEM stages of both the even and the odd pipelines in order to determine dependencies. One special case is the register file: to work with the super-scalar scheme, it must be able to read and write two different registers, as well as the same register, in parallel. If both the even and the odd instructions write back to the same register file location, the odd instruction must win, since it is later in program order; this resolves the WAW hazard.
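
As a rough sketch of this write-priority rule (the signal names here are ours, not taken from the schematic), the register file's two write ports can be arbitrated inside one process; because the odd write is assigned last, it wins whenever both ports target the same register:

    -- Sketch only: hypothetical names; assumes regs is an array of 32-bit
    -- std_logic_vector registers and ieee.numeric_std is in scope. The write
    -- edge is an assumption, not taken from the actual design.
    process (clk)
    begin
        if falling_edge(clk) then
            if even_wr_en = '1' then
                regs(to_integer(unsigned(even_wr_addr))) <= even_wr_data;
            end if;
            if odd_wr_en = '1' then
                -- assigned after the even write, so on a WAW conflict the
                -- odd (later-in-program-order) value wins
                regs(to_integer(unsigned(odd_wr_addr))) <= odd_wr_data;
            end if;
        end if;
    end process;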

Beyond the general modifications, the 2-way super-scalar pipeline adds complexity to calculating new PCs. The following table shows the PC calculation for the different instruction types.

Table 2. PC Calculation

Instruction Type / New PC Calculation
Arithmetic, Logic, Memory / PC = PC + 8
Even Branch / PC = PC + 4 + Branch Value
Odd Branch / PC = PC + 8 + Branch Value
Jump, JR / PC = Jump Value
Even JAL / PC = Jump Value; $31 = PC + 4
Odd JAL / PC = Jump Value; $31 = PC + 8

To give a more intuitive view of the implementation, schematic captures of the branch and jump PC calculations are presented below:

Picture 1 shows the implementation of the branch PC calculation. We cannot use PC[31:0] directly for the even branch target, since PC = PC + 8 in the super-scalar pipeline. Therefore, PCD[31:3] (the PC of the decode stage, i.e. the "current" PC) is used as the base PC, and the result signal of the even branch comparator supplies bit 2, which in effect adds 4 to the base PC when the even branch is taken.
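
In VHDL terms the trick amounts to a concatenation followed by the offset add; a minimal sketch, with hypothetical signal names and ieee.numeric_std assumed in scope:

    -- PCD(31 downto 3) is the decode-stage PC with its low three bits dropped.
    -- Inserting the even comparator result as bit 2 adds 4 exactly when the
    -- even branch is taken; bits 1:0 are always zero for word-aligned PCs.
    branch_base   <= PCD(31 downto 3) & even_branch_taken & "00";
    branch_target <= std_logic_vector(unsigned(branch_base) + unsigned(branch_offset));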

Picture 1. Branch PC calculation

Picture 2 shows the implementation of the jump PC calculation. The first-level mux selects between the Jump and JR targets, and the second-level mux selects between the even and odd instructions.
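
A sketch of the same two-level selection in VHDL (signal names hypothetical):

    -- First level: J/JAL take the immediate target, JR takes the register value.
    even_jump_pc <= even_rs_value when even_is_jr = '1' else even_jump_imm;
    odd_jump_pc  <= odd_rs_value  when odd_is_jr  = '1' else odd_jump_imm;
    -- Second level: pick whichever pipe actually holds the jump instruction.
    jump_pc      <= odd_jump_pc when odd_is_jump = '1' else even_jump_pc;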

Picture 2. Jump PC calculation

Super-scalar Structure

As mentioned before, the microprocessor features two pipelines for parallel operation, and this parallelism gives rise to complicated hazard and stall cases.

Stalls

There are six stall cases that must be handled. The first two occur when one of the instructions in the execution stage is a branch or jump instruction, in which case the delay slot must be handled. More precisely, if the branch occupies the even slot, its delay slot has already been executed in parallel with it, so the instructions in the decode stage (both even and odd) must be replaced with no-ops. (See figure 1.)

If the branch is in the odd slot, however, the even instruction in the decode stage needs to be executed while the odd one must be swapped with a no-op. (See figure 2.)

The third case occurs when an instruction in the execution stage is a JAL or a LW and the even instruction in the decode stage depends on it. Forwarding cannot resolve this dependency in time, so both instructions in the decode stage must be stalled.

The fourth stall case is when the same dependency exists for the odd instruction in the decode stage. Here, the even instruction is allowed to execute while the odd instruction is stalled for one cycle; afterwards, the even instruction is replaced with a no-op, since it has already executed once, and the odd instruction is permitted to enter the execution stage. The fifth case is when both instructions in the decode stage access memory. Since there is only one data cache, parallel access is impossible, so the even instruction goes first while the odd one is stalled until the following cycle. Finally, the last case occurs when the odd instruction in the decode stage depends on the even instruction in the same stage. Because both instructions are still in the decode stage, nothing has been computed yet, so forwarding is impossible. This is handled identically to the previous case: the odd instruction is stalled for one cycle. (See figure 4.)
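
A minimal sketch of the last two conditions (hypothetical signal names; each *_id_* signal describes an instruction currently in decode):

    -- Case 5: both decode-stage instructions access memory; only one data
    -- cache exists, so the odd pipe must wait a cycle.
    dual_mem_access <= '1' when (even_id_is_mem = '1' and odd_id_is_mem = '1') else '0';
    -- Case 6: the odd decode instruction reads a register that the even
    -- decode instruction writes; nothing is computed yet, so forwarding
    -- cannot help.
    intra_pair_raw  <= '1' when (even_id_reg_write = '1' and
                                 (odd_id_rs = even_id_rd or odd_id_rt = even_id_rd)) else '0';
    stall_odd       <= dual_mem_access or intra_pair_raw;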

Because of the alignment of the pipelines, a major problem occurs when there is a branch or jump to an odd instruction. Instructions come in pairs in this processor, so when the target of the branch is an odd instruction, it is crucial that the even instruction paired with it (the instruction immediately before the target) is not executed. A branch handler therefore watches for branches and jumps to odd addresses. Once the branch is taken, the stall handler discussed above handles the delay slot so that only one instruction is executed there. Then, after the new PC is calculated and its corresponding instruction pair fetched, the branch handler checks the target PC: if the target PC is even, the pipeline behaves normally, whereas if the target is odd, the handler ignores the even instruction. (See figure 5.)
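
Since instruction pairs are 8-byte aligned, the check reduces to a single bit of the target PC; a sketch with hypothetical names:

    -- Bit 2 of the target PC distinguishes the odd slot of an aligned pair.
    -- When a taken branch/jump lands on the odd slot, the even instruction
    -- of the newly fetched pair is squashed (replaced with a no-op).
    squash_even <= '1' when (target_taken = '1' and target_pc(2) = '1') else '0';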

Stream Buffer

The stream buffer sits between the first-level cache and the DRAM. With a slow memory access time, an instruction fetch on every cycle would be very inefficient. The stream buffer improves efficiency because it performs a burst to fetch four instructions on every stream buffer miss: one normal read from the DRAM takes 9 cycles, but one burst (two consecutive reads) takes 10 cycles. There are three cases in which a burst will occur.

Case 1: At the very beginning of the program, the stream buffer is empty and needs to fetch the first four instructions from the DRAM.

Case 2: The processor branches or jumps to an instruction that is in neither the stream buffer nor the first-level cache.

Case 3: Two hits in the stream buffer are followed by a miss, meaning that the next instruction has not been prefetched.
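
All three cases reduce to the same condition: the requested PC matches neither half of the buffer. A sketch of the trigger (hypothetical names):

    -- A burst starts whenever the cache requests a line the buffer does not
    -- hold: at start-up the buffer is empty (case 1), a branch/jump may
    -- leave the prefetched stream (case 2), and sequential fetch eventually
    -- outruns the last prefetched pair (case 3).
    sb_miss     <= '1' when (match_first_half = '0' and match_second_half = '0') else '0';
    start_burst <= request_from_cache and sb_miss;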

Stream Buffer Controller

The stream buffer controller works on the positive edge of a phase-shifted clock, delayed by 10 ns. The reason for the phase-shifted clock is that the registers in the first-level cache work on the negative edge of the clock, and the request signal from the first-level cache rises 5 ns after a cache miss. By using the phase-shifted clock, we can send a fake wait_sig to the cache to tell it to stall the instruction (before the negative edge of the normal clock) if there is a miss in the stream buffer. The trick is to interleave the phase-shifted clock for the stream buffer controller with the negative edge of the normal clock for the cache and pipelined datapath.
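
In simulation, such a phase-shifted clock can be generated with a single delayed assignment; a one-line sketch (signal names ours):

    -- 10 ns phase-shifted copy of the system clock for the controller.
    shifted_clk <= transport clk after 10 ns;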

Without the stream buffer, the first-level cache operates mainly by watching the wait_sig from the DRAM. With the stream buffer in place, the stream buffer controller instead takes in the wait_sig from the DRAM and sends a fake wait_sig to the first-level cache. This is necessary because during burst mode the DRAM controller pulls its wait_sig low for one cycle at a time, and this signal cannot be sent directly to the first-level cache; the fake wait_sig is held continuously high, as the cache expects per the specification. The dips matter to the stream buffer controller, however: the first dip enables the first half of the stream buffer to store the first two instructions, and when the wait_sig goes low for the second time, the controller enables the second half of the stream buffer to store the next two instructions. An example of the wait_sig timing from the DRAM during a burst is shown below.

The following is the pseudo code; dip_count (maintained elsewhere in the controller) counts how many times wait_sig_from_DRAM has dipped low during the current burst:

    if (no match in buffer) then
        if (request_from_cache = '0') then
            wait_sig_to_cache <= '0';
            request_to_DRAM   <= '0';
        elsif ((request_from_cache = '1') and (wait_sig_from_DRAM = '0')) then
            request_to_DRAM   <= '1';
            wait_sig_to_cache <= '1';
            if (dip_count = 0) then            -- first dip: latch instructions 0 and 1
                enable_first_half_reg  <= '1';
                enable_second_half_reg <= '0';
            elsif (dip_count = 1) then         -- second dip: latch instructions 2 and 3
                enable_first_half_reg  <= '0';
                enable_second_half_reg <= '1';
            else
                enable_first_half_reg  <= '0';
                enable_second_half_reg <= '0';
            end if;
        end if;
    else                                       -- hit in the stream buffer
        wait_sig_to_cache <= '0';
        request_to_DRAM   <= '0';
    end if;

Once the four instructions are in the stream buffer, every subsequent first-level cache miss that hits in the buffer carries a miss penalty of only one cycle. Each time the cache controller sends a request and the PC to the stream buffer, the stream buffer compares the PC with the address tag associated with each half of the buffer and, based on the comparator results, selects the matching half to output to the cache.
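
A sketch of this compare-and-select path (hypothetical names; each half of the buffer holds one 2-instruction line and its tag):

    -- Compare the requested PC against both halves' tags in parallel.
    match_first_half  <= '1' when pc_from_cache(31 downto 3) = tag_first_half  else '0';
    match_second_half <= '1' when pc_from_cache(31 downto 3) = tag_second_half else '0';
    -- Drive whichever half matched back to the first-level cache.
    data_to_cache <= data_first_half when match_first_half = '1' else data_second_half;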

High-Level Schematic of the Stream Buffer

Victim Cache

The implementation of the victim cache is similar to the first-level cache in terms of control logic and schematic design; the main differences are in the handling of data input and output. The victim cache contains four cache lines, is fully associative, and uses a FIFO (first in, first out) replacement policy. The reason for using FIFO rather than the random policy of the first-level cache is that random replacement is more effective for larger caches. The small size of the victim cache makes it probable that certain cache blocks would be replaced more often than others; since the victim cache writes out to the DRAM on a miss, replacing blocks more frequently raises the AMAT (hit time + miss rate × miss penalty). A FIFO policy gives each block the same chance of being replaced. Theoretically, LRU is the best replacement policy, but it is too difficult to implement with more than two entries.
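
FIFO over four lines needs only a 2-bit pointer; a minimal sketch (hypothetical names, clock edge assumed):

    -- The pointer names the next line to evict; advancing it on every
    -- allocation yields round-robin (FIFO) replacement, wrapping mod 4.
    process (clk)
    begin
        if rising_edge(clk) then
            if allocate = '1' then
                fifo_ptr <= std_logic_vector(unsigned(fifo_ptr) + 1);
            end if;
        end if;
    end process;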

Top Level Schematic of Victim Cache in Memory Hierarchy

Because the victim cache is fully associative, each cache block component contains a comparator to determine a hit between the address and its tag. Since the input to each cache block comes only from the first-level cache, the victim cache blocks are simplified versions of the first-level cache blocks. Schematically, the victim cache sits between the first-level cache and the DRAM. The goal is to make the victim cache transparent to the system, so that the first-level cache believes it is talking directly to the DRAM; the victim cache therefore intercepts all intermediate signals and must drive the data as well as the control lines. Muxes select which cache block to output and where to send it (first-level cache or memory). To avoid losing cache data while swapping the two cache lines, two additional registers hold the outputs from the first-level cache and the victim cache. In addition to sending data back to the first-level cache, the victim cache must also send the dirty bit, so that the first-level cache knows whether the incoming data is dirty and applies the correct replacement behavior on subsequent memory accesses.

Changes to the datapath are local to the memory system. The first-level cache needs extra output signals to send the tag, valid, and dirty bits to the victim cache, plus an extra input to receive the dirty bit from the victim cache. The DRAM needs a mux to choose the correct burst signal from the victim cache or the stream buffer.

Victim Cache Control

The first-level cache control works on the rising edge of the clock, while the memory controller works on the falling edge. To act transparently, the victim cache must operate within this window and present control signals at the correct time for both the first-level cache and the memory controller. A delayed clock is used to set up this timing. The following RTL describes the behavior of the victim cache control.

    if (first_level_cache_request = '1') then
        reg1 <= first_level_cache_line;        -- line being evicted from the first-level cache
        reg2 <= victim_cache_line;             -- hit line, or the line chosen for replacement
        if (replaced_block_dirty = '1') then
            DRAM <= reg2;                      -- write the dirty victim line back to memory
        end if;
        if (victim_cache_hit = '1') then
            first_level_cache <= reg2;         -- swap: the victim line returns to the first-level cache
        else
            first_level_cache <= DRAM;         -- miss: the line comes from the DRAM instead
        end if;
        victim_cache <= reg1;                  -- the evicted line enters the victim cache
    end if;

Control Signals

- To the first-level cache

Because the first-level cache interacts with the DRAM solely through the wait signal, the victim cache must simulate that wait signal to get the appropriate response from the first-level cache. The wait signal must be held high while the victim cache is writing a cache line to the DRAM or while the first-level cache is reading from the DRAM.
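
A sketch of that simulated wait signal (hypothetical names):

    -- Hold the first-level cache in a stall while either transaction is pending.
    wait_to_first_level <= writeback_in_progress or dram_read_in_progress;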

- To the DRAM

Because the victim cache performs only one memory access at a time, the burst signal is held low. The request signal to the DRAM comes only from the victim cache; any request from the first-level cache is first interpreted by the victim cache, which determines which memory requests, if any, to perform.

CAM-Based Cache

The major difference between the CAM-based cache and a standard cache is how the address tags are stored and handled. In the CAM-based cache, the tags are stored in registers entirely separate from the data words, and each tag register is accompanied by a comparator that determines whether the register matches the input address. The match signals from the registers are used to output the hit data word immediately, instead of waiting on the control logic.
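
A sketch of the CAM lookup (hypothetical names; NUM_LINES is the number of cache lines, tag_reg and data_reg are register arrays, and match is a std_logic_vector):

    -- One comparator per tag register produces a one-hot match vector.
    gen_match : for i in 0 to NUM_LINES - 1 generate
        match(i) <= '1' when (valid(i) = '1' and tag_reg(i) = addr_tag) else '0';
    end generate;

    -- The match lines select the hit word directly, with no decode step.
    process (match, data_reg)
        variable any_hit : std_logic;
    begin
        any_hit := '0';
        hit_data <= (others => '0');
        for i in 0 to NUM_LINES - 1 loop
            if match(i) = '1' then
                any_hit  := '1';
                hit_data <= data_reg(i);
            end if;
        end loop;
        hit <= any_hit;
    end process;

Because the match vector is one-hot, the hit word is available combinationally, which is what lets the cache return data without waiting on the controller.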

Top Level Schematic of CAM-Based Cache