CS152 Lab 7
Cache and Main Memory
Team Name: I Wish I Was In 150

Aka No Nome

May 10th, 2001

Daniel Patterson (cs152-patter)

William “Billy” Martin (cs152-bem)

Hsinwei “Frank” Chou (cs152-hchou)

Michael Barrientos (cs152-mbarrien)

Abstract:

For the final project we implemented a deep pipeline with 8 stages. We also implemented branch prediction and jump register prediction, and we made a few cache optimizations.

The objective for this lab was to build the fastest processor possible that still works. Our primary objective for speedup was to reduce the cycle time as much as possible. We also tried to improve CPI by reducing the cost of branches (with prediction) and loads (with cache optimization). In theory our datapath should have a very low cycle time, but in practice it does not.

Division of Labor:

Each person implemented the component of their choice. Billy did the deep pipelining, which included designing the ALU and the datapath. Frank did branch prediction, which involved keeping track of history, FSMs, and whatever else was needed to get a high prediction rate. Dan worked on the cache, optimizing it as much as possible and adding new features. Mike created a JR prediction unit.

Detailed Strategy:

Branch Prediction with history table

There is a branch prediction unit whose sole job is to look up a table of 256 prediction entries and output one of them based on the current global history. We decided on an 8-bit history index (256 slots) rather than the traditional 1024 slots because we didn't think we would have enough branches to fill up all 1024. The prediction unit takes the current global history as input and outputs a 1-bit predict value based on the finite state machine stored at that history's slot in the table. The prediction unit also takes in the actual branch result and a branch_known flag signal that is asserted when the branch result is known. When branch_known is asserted, the prediction unit takes in the history associated with that prediction and updates the finite state machine for that particular history.

The history unit takes the predicted result as input and updates the current history to reflect the predicted value. This unit then passes the history along the pipeline so a later instruction can use it. The history unit has a flag signal called branch_history_shift that shifts in a 1 whenever we predict branch_taken and a 0 otherwise. The history unit needs to be squashed and restarted from time to time: when a misprediction occurs, the history has to be reverted to the old history, since branches may have followed branches and the first branch may not have resolved before the second branch required a prediction. So, when a branch misses and renew_branch[7:0] is asserted, we flush the instructions in the pipeline and revert to the old history.
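The predict/update/revert behavior described above can be sketched in software roughly as follows. This is a hedged illustration only: the real unit is a schematic, not code, and the class and method names (BranchPredictor, predict, resolve) are invented for the sketch. We assume a 2-bit saturating counter as the per-slot FSM.

```python
HIST_BITS = 8                    # 8-bit global history -> 256 table slots
MASK = (1 << HIST_BITS) - 1

class BranchPredictor:
    def __init__(self):
        # one 2-bit saturating counter (the per-slot FSM) per history pattern
        self.table = [1] * (1 << HIST_BITS)   # start weakly not-taken
        self.history = 0                      # global history register

    def predict(self):
        idx = self.history
        taken = self.table[idx] >= 2
        old_history = self.history            # kept for misprediction recovery
        # branch_history_shift: speculatively shift the prediction into history
        self.history = ((self.history << 1) | int(taken)) & MASK
        return taken, idx, old_history

    def resolve(self, idx, actually_taken, old_history, mispredicted):
        # branch_known: update the FSM at the slot this prediction used
        ctr = self.table[idx]
        self.table[idx] = min(3, ctr + 1) if actually_taken else max(0, ctr - 1)
        if mispredicted:
            # renew_branch: revert to the old history, then shift in the real outcome
            self.history = ((old_history << 1) | int(actually_taken)) & MASK
```

On an always-taken branch this trains up quickly: once each history pattern the program visits holds a "taken" counter, the predictor stops missing.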

Optimization of the memory system

The main goal was to lower the cache miss rate and penalty. Several strategies were implemented to allow this. They were:

Fully associative cache with LRU replacement: By using an algorithm to determine the least recently used block in the cache, an efficient replacement policy was implemented. It uses a matrix of DFFs that holds the order in which blocks were used. Full associativity eliminates conflict misses, minimizing the total number of cache misses.
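The matrix scheme can be sketched in software as follows (a hedged illustration: the hardware is a grid of DFFs, and the class and method names here are invented for the sketch):

```python
class MatrixLRU:
    """LRU matrix: bit [i][j] == 1 means block i was used more
    recently than block j."""
    def __init__(self, nblocks):
        self.n = nblocks
        self.m = [[0] * nblocks for _ in range(nblocks)]

    def touch(self, i):
        # on an access to block i: set row i to all 1s, then clear column i
        for j in range(self.n):
            self.m[i][j] = 1
        for j in range(self.n):
            self.m[j][i] = 0

    def lru(self):
        # the least recently used block is the one whose row is all 0s
        for i in range(self.n):
            if not any(self.m[i]):
                return i
```

For example, after touching blocks 0, 1, 2, 3 in order, block 0's row has been zeroed by each later touch, so `lru()` returns 0; touching 0 again makes block 1 the victim.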

Interleaved memory with burst mode: Burst mode is used for instruction fetch, to get successive instructions as quickly as possible. Interleaving allows the two words of a memory block to be fetched simultaneously.

CAM cache: Helps to lower hit time by computing the hit signal right at the cache block.

Victim cache: This enables blocks to be accessed after they have been evicted from the cache without the delay of a memory access. Our implementation operates in one cycle, allowing a read and a write on both the victim cache and the standard cache simultaneously.
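The victim-cache lookup and swap can be sketched roughly as below. This is a simplified, hedged model: the names and sizes are illustrative, the main cache's recency is modeled with an ordered dict rather than the LRU matrix above, the victim cache is FIFO, and the real design does the lookup and swap in a single cycle rather than sequentially.

```python
from collections import OrderedDict

class VictimCachedLookup:
    def __init__(self, main_size=4, victim_size=2):
        self.main = OrderedDict()     # tag -> block, most recent at the end
        self.victim = OrderedDict()   # tag -> block, FIFO order
        self.main_size = main_size
        self.victim_size = victim_size

    def access(self, tag):
        if tag in self.main:
            self.main.move_to_end(tag)        # refresh recency
            return "main hit"
        if tag in self.victim:
            block = self.victim.pop(tag)      # recover the evicted block
            self._fill(tag, block)            # (may push another block out)
            return "victim hit"
        self._fill(tag, "fetched-from-memory")
        return "miss"

    def _fill(self, tag, block):
        if len(self.main) >= self.main_size:
            old_tag, old_block = self.main.popitem(last=False)  # evict LRU
            if len(self.victim) >= self.victim_size:
                self.victim.popitem(last=False)                 # drop oldest victim
            self.victim[old_tag] = old_block
        self.main[tag] = block
```

So a block that was just evicted from the main cache can still be served from the victim cache instead of costing a full memory access.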

8 stage pipeline

We implemented an 8-stage pipeline for our processor. We based it on the suggested 2IF, 1DEC, 2EXE, 2MEM, 1WB model. We tried to design it to allow for the smallest cycle time possible.

Stage IF1

In this stage the instruction is fetched from the cache. Since we are using a CAM cache, this access can be completed in one cycle. The CAM cache is capable of fetching an instruction on a hit (or asserting its stall output on a miss) in about 16 ns. This leaves little room for other manipulations, so no further action is taken in this stage.

Stage IF2

This stage is really more of a decode stage than an IF stage, since the instruction was fetched in the first IF stage. But we called it IF2 anyway, in order to conform with the assignment.

In this stage the instruction is processed to determine whether it is a branch, and the branch or jump target address is calculated. Some control signals that will be needed in later stages are also calculated now. Branch prediction and JR prediction are also done in this stage. The critical path for this stage is the time to calculate the branch address (using a 10-bit adder) plus the time to select the NPC (using tri-states).

Stage DEC1

This is the stage in which the registers are looked up in the regfile. The regfile is implemented using registers and tri-states, so it should be quite fast. A forwarding mux in this stage forwards values from the ALU and MEM stages when necessary. Calculating all necessary hazards and stalls also takes place in this stage, and this is most likely where the critical path lies.

Stage EXE1

In this stage the first half of arithmetic instructions is computed, and logic instructions that have their values ready are also computed. Arithmetic instructions pass through a carry-save addition/subtraction component and then through a partial carry-lookahead adder/subtractor. These units together take about 11 ns, but values that need to be forwarded are ready as soon as they come out of the carry-save component, which means that forwarding can take place in parallel with the carry-lookahead computation and thus does not impact the critical path for this stage. (More on forwarding using a CSA later.)

The critical path for this stage most likely involves the logical operations (including shifts), which must be performed as well as selected, and then forwarded.

Stage EXE2

In this stage the second half of arithmetic instructions is computed, and logic instructions that didn't have their values ready by EXE1 are computed here as well. This second-stage logic operation greatly complicates forwarding, since the start and end times for logic operations are variable, but it eliminates otherwise unavoidable stalls.

The second half of the arithmetic here essentially just completes the CLA operation and also computes the SLT and SLTU values. This information must then be selected by a mux before being passed to the next stage, and this is most likely where the critical path for this stage lies.

Branches are also computed in this stage. This allows the branch to have all the information it needs to decide whether or not to branch, as a prior instruction completing in EXE2 can forward its value to the branch.

The branch is implemented using a comparator and also checks the prediction for this branch, and squashes instructions in the pipeline on a mispredict. These branch calculations may be the critical path for this stage, but if so that can easily be fixed by passing on the branch information to the next stage and finishing the calculation then.

The address for loads and stores is also calculated here, since these addresses can be computed with a 10-bit adder and thus don’t need to depend on the ALU.

Stage MEM1

In this stage values are stored and fetched from memory. Since the cache takes a relatively long time, nothing else is done in this stage.

Stage MEM2

In this stage loaded values from MEM1 are selected and forwarded as necessary. Since data from a load cannot actually be forwarded until this stage, an instruction that needs the data must stall before entering EXE1 until the load reaches this stage. The data can then be forwarded.

Ideally, an instruction that depends on the load but doesn't actually need the value until EXE2 (such as a branch, a logic op, or another load) could stall for one less cycle than normal. However, we did not have time to implement this distinction, so all instructions are treated the same for the purposes of load stalling. The exception is SW instructions, which do not stall when they follow a load, since they don't need the load's value until they reach the MEM1 stage; there is enough time between the load's completion in MEM2 and when the SW needs the value in MEM1 to implement SW following LW with no stall.

Stage WB1

The writeback stage. What can I say? You can’t really see this stage on the schematic, it is the regfile loading in the writeback value.

Super-duper CSA forwarding

Normally, with a 2-stage ALU, an instruction that needs the value of the preceding instruction would have to stall for a cycle. However, we have worked out a way for nearly all instructions to correctly forward partial answers to the next instruction.

For logical operations this is not too interesting because they only take one cycle anyway. But for arithmetic operations it is interesting to note that they can execute consecutively without any stalls at all! Further, the forwarding required to implement this does not increase the cycle time at all.

This is done by using a carry-save addition/subtraction unit to calculate and forward partial values from EXE1 to the next instruction's EXE1. The CSA unit is very fast, so its value can be calculated and forwarded without delay. Its output is also fed into a normal CLA adder/subtractor that is split across the two EXE stages. Getting this to work for all cases was very tricky, but we got it working and the processor almost never stalls!

What made this tricky is that the carry-save adder had to be redesigned to handle both addition and subtraction, in combination. This results in a set of carries, each of which can be either positive or negative. As if doing that for the CSA weren't hard enough, I had to make sure the CLA unit knew how to handle these bizarre carries as well. Thus the CLA unit had to propagate and generate both positive and negative carries (in combination) and come up with a useful result in the end. Whew!

The strangest case that had to be handled was when an addition or subtraction added a register to itself. Since the CSA can only take in one set of carries at a time, it can't forward both values simultaneously. (Originally I wanted a 3-stage ALU with the CSA in its own separate stage; this proved impossible for this very reason.) However, this case is handled by recognizing that the two numbers being added or subtracted are the same, so simply doubling the carry (a left shift by one) and feeding that to the CSA produces a correct result.
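The core of the forwarding trick can be sketched numerically. This hedged sketch covers only the plain unsigned-add case (the real unit also handles subtraction with signed carries, which is the hard part described above); the function name and register values are illustrative.

```python
MASK = 0xFFFFFFFF  # 32-bit wrap-around

def csa(a, b, c):
    """3:2 carry-save compressor: returns (s, carry) with
    (s + carry) mod 2^32 == (a + b + c) mod 2^32."""
    s = (a ^ b ^ c) & MASK
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return s, carry

# addu $1, $2, $3 : EXE1 produces r1 in carry-save form (s1, c1);
# the full sum s1 + c1 is only completed by the CLA in EXE2.
r2, r3, r6 = 123456789, 987654321, 42
s1, c1 = csa(r2, r3, 0)

# addu $9, $1, $6 : the dependent add starts in the very next cycle,
# feeding the forwarded pair (s1, c1) plus its other operand into its CSA.
s2, c2 = csa(s1, c1, r6)
r9 = (s2 + c2) & MASK          # completed later by the CLA; no stall needed
assert r9 == (r2 + r3 + r6) & MASK
```

The point is that the dependent add never needs the completed sum: the (sum, carry) pair is just as good an input to the next CSA, so back-to-back arithmetic runs without stalls.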

There are some cases where the ALU simply must stall. Here’s one of them:

addu $1, $2, $3

and $4, $1, $7

addu $9, $4, $6

Here, the and needs a complete result to finish and thus cannot be computed until the second EXE stage. The second add then needs the value of the and, which won't be ready until the end of EXE2, after the add would have wanted to start. The processor must therefore stall the second add. It is rather ironic that an and is normally much faster to compute than an add, yet for our processor replacing the above and with an add results in no stalls at all. Sadly, I couldn't figure out how to incorporate logic into a CSA adder. Note also that any number of logic operations (each depending on the instruction before it) in place of the and will still force a stall. It is hoped that such code sequences don't come up very often.

Another time when the processor must stall is on an arithmetic or SLT instruction that depends on the result of an SLT. Nothing can be done about this, since an SLT does not produce anything remotely useful until the end of EXE2, and ariths and other SLTs need their values by EXE1. However, an SLT that follows an arithmetic operation can use the carry from that operation to do its subtraction, and this does not stall the processor. At any rate, it seems unlikely that someone would want to perform an arithmetic operation on the result of an SLT.
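The stall rules in the last two paragraphs can be summarized as a small predicate. This is a hedged paraphrase of the cases described above, not the actual hazard logic; the instruction-class names and function name are our own.

```python
def needs_stall(producer, consumer):
    """True when a back-to-back dependent pair must stall.
    producer/consumer are one of: "arith", "logic", "slt".
    Loads are handled separately (their value arrives even later, in MEM2)."""
    # logic ops and SLTs only have a complete result at the end of EXE2 ...
    late_result = producer in ("logic", "slt")
    # ... which is too late for anything that must start computing in EXE1.
    # (Logic consumers can slip to EXE2, and arith producers forward in
    # carry-save form, so those pairs are fine.)
    early_consumer = consumer in ("arith", "slt")
    return late_result and early_consumer

assert needs_stall("arith", "logic") is False  # add -> and: the and slips to EXE2
assert needs_stall("logic", "arith") is True   # and -> add: the example above
assert needs_stall("slt",   "arith") is True   # slt -> add: must stall
assert needs_stall("arith", "slt")   is False  # add -> slt: slt reuses the carry
```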

What stalls our processor most is an LW instruction. Nothing can be done; its value is not ready until MEM2. An instruction that needs an LW's value by EXE1 thus requires 3 delay slots. Ideally an instruction that needs the LW's value only by EXE2 would need just 2 delay slots, but we did not have time to implement this distinction.

What we probably shouldn’t have done

One mistake we made when designing our processor was to implement almost everything in schematic (really, almost everything; go ahead and click on some component you think is VHDL. Yup, it's a schematic). VHDL components are nice because their delay times are fixed, and they are unaffected by fan-in and fan-out. Since we have almost no VHDL components, calculating our critical path is virtually impossible. Although our schematics are highly modularized, their delays are non-constant, so calculating a worst-case critical path is impossible. To make matters worse, we are constantly haunted by fan-out. In designing my components, I tried to always consider fan-out and use fanning sets of Inv4xs, but it is easy to tie a load to too many things without realizing it, and fan-out bugs are very hard to deal with.

In the end, our processor does not realize its theoretical cycle time (the longest stage should be the cache, which takes about 16 ns), and this is most likely due to fan-out bugs. Had we started with a VHDL approach instead of schematics, it would have been much easier to arrange the VHDL components in an efficient manner, even though in theory the schematic versions should be faster.

Results

Our processor runs at somewhat sporadic clock rates. Since we can't nail down a critical path, we don't know exactly what the cycle time is. A cycle time of 30 ns seems to work, but in theory we should be able to run with a cycle time close to 16 ns. This was not realized, however.

Conclusion

We feel that we had a lot of creative ideas for our processor, but we were unable to reach their potential because our approach was flawed and also because we had great difficulty getting our simulations to run. Our files were constantly being corrupted, so we had to spend a lot of time backing up and re-copying projects and looking for corrupted files, and little time optimizing our processor. If we had more usable software, our processor would have been much better.

Still, our processor is very good anyway. Even without reaching its full potential, the processor has respectable performance, and the abstract design is excellent. The components (branch prediction, JR prediction, and deep pipelining) work very well together and cover each other's weaknesses nicely. Best of all, it works!

For testing we almost exclusively used the features of Digital Fusion to debug our processor. Viewdraw marks each wire with its value, which is infinitely superior to any monitor utility we could write. In previous labs, the monitor helped us very little in debugging (except of course for debugging the monitor). We used very small tests most of the time, and also relied on the completeness of lab5_mystery to verify our processor was correct (we had to break it up into small pieces at first).

Particularly nasty bugs included a crucial forwarding arrow we forgot to implement the first time around, and of course fan-out related bugs!

Appendix I – Notebooks:

Dan Patterson - journal.doc

Mike Barrientos - mikelog.txt

Hsinwei Chou - frank_log6.txt

William Martin - billylog6.txt

Appendix II – Schematics:

Too many to list. We have >100 schematics: U:\cs152\mbarrien\lab7frank\sch\

The highest level ones are:

The overall datapath: workdamnit\sch\deeppipeline.1

The ALU: workdamnit\sch\alu2stg.1

The Memory: workdamnit\sch\idmem.1

The Branch prediction : workdamnit\sch\branch_predict.1

Appendix III – VHDL Files:

There are too many to list; please see our directory

U:\cs152\mbarrien\lab7frank\behv\

Appendix IV – Testing:

Test using lab5_mystery: mystery program lab 5

Test using branch predict tests: branch predict test

Test using history test: ..\cs152-hchou\lab7\sch\history_test.1