Computer Science 152

Final Project Report

Dual-Processor and Memory Enhancements

The Drunken Hooligans

Glenn Chiu, Jeryl Soto Contemprato, Jeremy Lau,

Raymond Liu, Vishnu Ramaswamy

Dis W 2-4

10 May 01

Abstract

The goal of our final project was to achieve better performance than a simple 5-stage pipeline with just a data cache and an instruction cache can achieve. Our strategy was not to drastically change the pipeline, but to stall less frequently on the DRAM and to hook up two identical processors so we could issue up to two instructions per cycle; when one processor is stalling on the DRAM, the other might still be doing useful work independently. The key issue was sharing memory, which involves arbitrating between requests to the DRAM as well as keeping all values stored in the caches and buffers valid and up to date.

To decrease the number of accesses that go all the way to the DRAM, we implemented victim caches, write buffers, and stream buffers. Our result was a processor with a smaller average memory access time and with greater instruction throughput for programs that can be split between two processors. One remaining problem is that values are sometimes not read correctly from the DRAM.

Division of Labor

The first component of the project was taking two of our processors, each with just one data cache and one instruction cache, and putting them together. We implemented synchronization variables, modified the processors to be aware of them, and widened the arbiter to take requests from more sources. We tested this processor with programs that wrote to memory at the same time (but not to the same addresses) and used synchronization variables.

Another part of the project was to work on the victim cache without integrating it into the data memory portion of the processors. In addition to creating the victim cache, this part involved changing the cache controller to output victim data to the victim cache and to accept data from sources like the victim cache. It also required changes to the cache lines to support signals for communicating with the victim cache. It was tested somewhat, but not enough to ensure that it would work with the cache; that was left for integration.

There was also work on the write buffer. This component consisted entirely of VHDL code. We would not test this component until integration.

The last piece before integration was the stream buffer. The stream buffer would only improve performance with burst mode, so we had to implement that in this part as well. This was tested in a separate folder along with the new burst-capable controller and a DRAM.

After all these parts were done, we integrated them together and debugged the processor with all subprojects completed. We also had to check for cache coherency between the blocks in the processors.

Here is who was responsible for the different parts, though not always the only person to work on them. Everyone had to alter the processor schematic at some point for design, integration, or debugging purposes.

Synchronization Variables – Jeremy

Arbiter – Jeryl

Victim Cache – Vishnu

Write Buffer – Glenn

Stream Buffer – Glenn

Burst Memory – Vishnu

Everyone worked on integration and debugging, which involved changing a little bit of every VHDL component and the processor and cache schematics.

Strategy


At the highest level, we have the two processors, the shared synchronization variables, the arbiter, and the DRAM. All accesses to the DRAM must go through the arbiter. The processors communicate directly with each other when a lw and a sw are in the memory stage to ensure the validity of memory values.

The synchronization variables were implemented as a state machine and separate registers. The synchronization variables send a wait signal back to the processors, which freeze while it is asserted; the wait arises from conflicts over a variable, not from access time.
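
The sketch below illustrates the idea for a single lock-style variable; the entity and signal names here are our own for illustration, not the actual Sync_control interface.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative controller for one synchronization variable (a lock).
-- A processor asserts reqN with wantN = '1' to grab the lock and with
-- wantN = '0' to release it; waitN freezes a processor that is trying
-- to grab a lock the other side holds.
entity sync_var is
  port (
    clk          : in  std_logic;
    req0, want0  : in  std_logic;   -- processor 0: request, grab/release
    req1, want1  : in  std_logic;   -- processor 1: request, grab/release
    wait0, wait1 : out std_logic    -- freeze signals to the processors
  );
end sync_var;

architecture behav of sync_var is
  type owner_t is (FREE, P0, P1);
  signal owner : owner_t := FREE;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case owner is
        when FREE =>
          if req0 = '1' and want0 = '1' then
            owner <= P0;                     -- grant to processor 0 first
          elsif req1 = '1' and want1 = '1' then
            owner <= P1;
          end if;
        when P0 =>
          if req0 = '1' and want0 = '0' then owner <= FREE; end if;
        when P1 =>
          if req1 = '1' and want1 = '0' then owner <= FREE; end if;
      end case;
    end if;
  end process;

  -- The wait comes from a conflict over the variable, not access time.
  wait0 <= '1' when req0 = '1' and want0 = '1' and owner = P1 else '0';
  wait1 <= '1' when req1 = '1' and want1 = '1' and owner = P0 else '0';
end behav;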

The arbiter is the same in concept as earlier ones, except that now we have to arbitrate among many more requests. The arbiter has an idle state and a state for every request it might be handling. When it is in the state to handle request X, it tells X to wait until the DRAM asserts that the data is ready. While one requester is in control of the arbiter, the others are told to wait (if they were requesting). When a request is fulfilled, the arbiter returns to the idle state and another requester may gain control.
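
Below is a sketch of the arbiter's structure with only two requesters for brevity (our actual arbiter coordinates six request sources, with a serve state for each); all names are illustrative.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative two-requester arbiter; the real one has a SERVE state
-- for each of its six request sources.
entity arbiter is
  port (
    clk            : in  std_logic;
    req0, req1     : in  std_logic;   -- request lines from the sources
    dram_ready     : in  std_logic;   -- DRAM asserts when data is ready
    grant0, grant1 : out std_logic;
    wait0, wait1   : out std_logic
  );
end arbiter;

architecture behav of arbiter is
  type state_t is (IDLE, SERVE0, SERVE1);
  signal state  : state_t := IDLE;
  signal g0, g1 : std_logic;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when IDLE =>
          if req0 = '1' then    state <= SERVE0;   -- fixed priority
          elsif req1 = '1' then state <= SERVE1;
          end if;
        when SERVE0 | SERVE1 =>
          if dram_ready = '1' then state <= IDLE;  -- request fulfilled
          end if;
      end case;
    end if;
  end process;

  -- The source being served owns the DRAM; any other requester waits.
  g0 <= '1' when state = SERVE0 else '0';
  g1 <= '1' when state = SERVE1 else '0';
  grant0 <= g0;
  grant1 <= g1;
  wait0  <= req0 and not g0;
  wait1  <= req1 and not g1;
end behav;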

Now let’s take a look inside the processors.


If the instruction we want is not inside the instruction cache, it might be in the stream buffer, which we check in parallel with the instruction cache. If the instruction misses in both the instruction cache and the stream buffer, the stream buffer issues a request to the arbiter for the next several instructions after the one that missed, fetched in burst mode. Having the stream buffer increases the maximum cost of an instruction cache miss. However, getting several instructions in burst mode is faster than if we had pre-charged the address signals for each of them one after the other. Since most of the time the next instructions in memory will be needed next (high spatial locality), the amortized cost of instruction fetches is reduced.
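
The sketch below shows the parallel lookup and the miss signal that triggers a burst request; the single-address comparison and all names are simplifications for illustration, since the real stream buffer holds several sequential instructions.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative fetch-path selection: the instruction cache and the
-- stream buffer are probed in the same cycle; only a miss in both
-- raises a burst request to the arbiter.
entity ifetch_select is
  port (
    pc_addr  : in  std_logic_vector(31 downto 0);
    ic_valid : in  std_logic;
    ic_addr  : in  std_logic_vector(31 downto 0);
    ic_data  : in  std_logic_vector(31 downto 0);
    sb_valid : in  std_logic;
    sb_addr  : in  std_logic_vector(31 downto 0);
    sb_data  : in  std_logic_vector(31 downto 0);
    instr       : out std_logic_vector(31 downto 0);
    sb_req      : out std_logic;   -- ask the arbiter for a burst fetch
    sb_req_addr : out std_logic_vector(31 downto 0)
  );
end ifetch_select;

architecture comb of ifetch_select is
  signal ic_hit, sb_hit : std_logic;
begin
  ic_hit <= '1' when ic_valid = '1' and ic_addr = pc_addr else '0';
  sb_hit <= '1' when sb_valid = '1' and sb_addr = pc_addr else '0';

  -- Miss in both: request the missing instruction and the next several
  -- after it in burst mode (the burst itself is handled elsewhere).
  sb_req      <= not (ic_hit or sb_hit);
  sb_req_addr <= pc_addr;

  instr <= ic_data when ic_hit = '1' else sb_data;
end comb;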

When there is a miss in the data cache, we take the ejected value and save it in the fully associative victim cache. With a direct-mapped data cache, a victim cache greatly reduces the rate of conflict misses between two memory addresses that map to the same cache line. Since our data cache is itself fully associative, adding the victim cache is equivalent to having a larger data cache, which reduces the capacity miss rate. The victim cache has cache lines to store data like the data cache, but it does not have a state machine controller because it does not make requests to the arbiter.
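
Below is a behavioral sketch of the lookup-and-swap, assuming four entries and FIFO replacement for illustration; the real Viccache is built from cache line blocks in schematic rather than a single VHDL process.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative 4-entry fully associative victim cache with FIFO
-- replacement. On a data cache miss, the tags are searched; on a hit
-- the line is handed back while the line ejected from the data cache
-- takes its slot (a swap).
entity victim_cache is
  port (
    clk        : in  std_logic;
    lookup     : in  std_logic;                      -- data cache missed
    addr       : in  std_logic_vector(31 downto 0);  -- missing address
    evict_addr : in  std_logic_vector(31 downto 0);  -- line ejected by D-cache
    evict_data : in  std_logic_vector(63 downto 0);  -- two-word line
    hit        : out std_logic;
    hit_data   : out std_logic_vector(63 downto 0)
  );
end victim_cache;

architecture behav of victim_cache is
  type line_arr is array (0 to 3) of std_logic_vector(63 downto 0);
  type tag_arr  is array (0 to 3) of std_logic_vector(31 downto 0);
  signal lines : line_arr;
  signal tags  : tag_arr;
  signal valid : std_logic_vector(0 to 3) := (others => '0');
  signal fifo  : integer range 0 to 3 := 0;          -- next slot to replace
begin
  process (clk)
    variable found : boolean;
  begin
    if rising_edge(clk) then
      hit <= '0';
      if lookup = '1' then
        found := false;
        for i in 0 to 3 loop                         -- associative search
          if valid(i) = '1' and tags(i) = addr then
            hit      <= '1';
            hit_data <= lines(i);
            tags(i)  <= evict_addr;                  -- swap: victim in,
            lines(i) <= evict_data;                  -- requested line out
            found := true;
          end if;
        end loop;
        if not found then                            -- miss: keep the victim
          tags(fifo)  <= evict_addr;
          lines(fifo) <= evict_data;
          valid(fifo) <= '1';
          fifo <= (fifo + 1) mod 4;
        end if;
      end if;
    end if;
  end process;
end behav;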

The write buffer enables us to queue up multiple write requests without stalling the processor. When the arbiter alerts the write buffer that it is idle, the write buffer finally makes a request. However, when the write buffer is full we must stall the processor and move one of the entries to DRAM to make room for the new entry. Our cache lines are two words, so on a write miss we do not store the sw value in the cache, because we would have to stall to read the other word from the DRAM; our goal is to never stall on a sw. If a lw tries to read an address that is in the write buffer and has not been committed to DRAM yet, we make sure that the value is retrieved from the write buffer instead of making a DRAM request. The write buffer is a VHDL state machine that keeps its state and storage internally; it is not implemented as a schematic and has no discrete registers inside.
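
A behavioral sketch of the queueing, redundant-write merging, and lw forwarding follows; the four-entry depth and all names are assumptions for illustration, not the actual Writebuffer interface.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative 4-entry write buffer: enqueues sw requests, merges
-- writes to an address already queued, retires the oldest entry when
-- the arbiter grants the DRAM, and forwards values to a matching lw.
entity write_buffer is
  port (
    clk     : in  std_logic;
    sw_en   : in  std_logic;                      -- sw in the memory stage
    sw_addr : in  std_logic_vector(31 downto 0);
    sw_data : in  std_logic_vector(31 downto 0);
    deq_en  : in  std_logic;                      -- arbiter grant: retire oldest
    lw_addr : in  std_logic_vector(31 downto 0);  -- lw forwarding check
    lw_hit  : out std_logic;
    lw_data : out std_logic_vector(31 downto 0);
    full    : out std_logic                       -- processor must stall
  );
end write_buffer;

architecture behav of write_buffer is
  type word_arr is array (0 to 3) of std_logic_vector(31 downto 0);
  signal addrs, datas : word_arr;
  signal valid : std_logic_vector(0 to 3) := (others => '0');
begin
  -- Forwarding: a lw address is compared against every queued store,
  -- so a pending write is seen instead of the stale DRAM value.
  process (lw_addr, addrs, datas, valid)
  begin
    lw_hit  <= '0';
    lw_data <= (others => '0');
    for i in 0 to 3 loop
      if valid(i) = '1' and addrs(i) = lw_addr then
        lw_hit  <= '1';
        lw_data <= datas(i);
      end if;
    end loop;
  end process;

  full <= '1' when valid = "1111" else '0';

  process (clk)
    variable merged : boolean;
  begin
    if rising_edge(clk) then
      if deq_en = '1' then                  -- retire entry 0 (oldest),
        for i in 0 to 2 loop                -- compact the queue down
          addrs(i) <= addrs(i + 1);
          datas(i) <= datas(i + 1);
          valid(i) <= valid(i + 1);
        end loop;
        valid(3) <= '0';
      elsif sw_en = '1' then
        merged := false;
        for i in 0 to 3 loop                -- same address already queued:
          if valid(i) = '1' and addrs(i) = sw_addr then
            datas(i) <= sw_data;            -- just overwrite the value
            merged := true;
          end if;
        end loop;
        if not merged then
          for i in 0 to 3 loop
            if valid(i) = '0' then          -- first free slot keeps order
              addrs(i) <= sw_addr;
              datas(i) <= sw_data;
              valid(i) <= '1';
              exit;
            end if;
          end loop;
        end if;
      end if;
    end if;
  end process;
end behav;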

The data memories of the processors must communicate with each other to preserve cache coherency. Stored instructions are assumed to be consistent with the DRAM because we assume instructions do not get overwritten. However, it is possible that one processor writes to a data address that the other processor reads from. When the other processor does a store word, the data cache, victim cache, and write buffer each check whether any of their entries match the address of the store word; if so, the entry is invalidated. We invalidate matching write buffer entries because we want only the most recent value to be written, and we do not know which processor's request the arbiter will pick first. On a lw, we must check whether the address matches any entry in our write buffer, since it would hold the most up-to-date value. Because we have a multiprocessor, the most recent value might also be in the write buffer of the other processor.

Here is what needs to be passed between the processors. On a sw, the other processor needs to know the address you are storing to and that you are doing a sw. On a lw, you need to ask the other processor to check its write buffer, giving it the address you are looking for; it must return (within the same cycle) whether there was a hit in its write buffer and, if so, what the value was. Since a cache line is two words wide, we go through this check twice, once for each word. The cache lines may have to check multiple addresses against their tags each cycle: their own processor's lw/sw, and a sw from the other processor for invalidation.
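
The sketch below shows the dual address check for a single cache line; the names and the one-word tag comparison are simplifications we assume for illustration.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative cache line with the dual address check: every cycle it
-- compares its tag against its own processor's access and against a
-- sw address snooped from the other processor; a remote match clears
-- the valid bit so the stale copy is never read.
entity coherent_line is
  port (
    clk         : in  std_logic;
    local_addr  : in  std_logic_vector(31 downto 0);
    remote_sw   : in  std_logic;                      -- other CPU storing
    remote_addr : in  std_logic_vector(31 downto 0);
    fill_en     : in  std_logic;                      -- line being filled
    fill_tag    : in  std_logic_vector(31 downto 0);
    local_hit   : out std_logic
  );
end coherent_line;

architecture behav of coherent_line is
  signal tag   : std_logic_vector(31 downto 0);
  signal valid : std_logic := '0';
begin
  local_hit <= '1' when valid = '1' and tag = local_addr else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if fill_en = '1' then
        tag   <= fill_tag;                            -- new line, valid
        valid <= '1';
      elsif remote_sw = '1' and valid = '1' and tag = remote_addr then
        valid <= '0';                                 -- invalidate
      end if;
    end if;
  end process;
end behav;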

Testing

Our testing began at the component level and ended at the final datapath level. We only added one thing to the dual processor at a time. We also modified the disassembly monitor so that it kept track of cycle count. Each processor had its own disassembly monitor.

Dual processor:

We tested the dual processor without the victim cache, write buffer, or stream buffer, and checked that we could use the synchronization variables. All our tests completed after some debugging. We first used a sequential store word diagnostic to determine that stores were handled correctly. After this, array copy diagnostics were run to check that the correct values are read (cache misses and hits are handled correctly) and that we write back correct values. These tests used a diagnostic designed to run on a multiprocessor, confirming that the six-way coordination of the arbiter was working properly. Finally, we wrote a diagnostic file with a heavy amount of lock grabbing and releasing, and had the two processors run different algorithms to exercise all the different responsibilities.

Victim cache:

We tested the victim cache in schematic with a data cache. We wrote a test script and viewed the waveforms to verify that the data cache and victim cache interacted properly. We tested to ensure that values were properly swapped between the two caches during victim cache hits.

Write buffer:

The write buffer was written in VHDL. We tested it with a script in Digital Fusion. We verified that it was able to enqueue and dequeue elements correctly. We also made the write buffer smart enough to eliminate redundant entries with the same address. We eventually installed the write buffer into the dual processor datapath and tested and debugged it extensively. Because of ambitious timing techniques, we were unable to get the write buffer to work as expected with the test files.

Stream buffer:

We tested the stream buffer with a new DRAM controller that supports burst mode and a DRAM, hooking the two components up together to make sure that they interacted correctly. Because our arbiter and cache react to signals asynchronously, it was difficult to integrate the stream buffer into the datapath.

Dual processor with write buffer and cache coherency signals:

We had to change the processor schematics and the cache lines to accept multiple addresses to check for hits. We also had to allow for individual cache line invalidation, in case another cache contained a more recent copy.

Results

Our cycle time is determined by the minimum 200ns RAS pre-charge time of the slow DRAM. In our previous simulations, the DRAM signal pre-charge time was 50ns. The critical path through the memory stage and all the multiplexors and blocks is still less than 200ns, even after the new additions, so a larger cycle time would not have improved processor performance or simplified our design. The datapath we started with used that pre-charge time as the cycle time, rather than dividing it among multiple smaller cycles. The goal of this project was not to change the datapath but to improve average memory access time and instruction throughput per cycle, so we did not change the pipeline to divide the memory stage into multiple cycles. Only lw and sw exercise the critical path, because they need the RAS signal to perform a memory access. This memory-imposed limit could be removed with a simple clock sampler: our processor could run with a 49ns cycle while the memory is clocked at a different speed. Due to time constraints we were not able to implement this, although it could easily be incorporated into our design.
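
A sketch of that clock sampler idea, which we did not implement, might look like the following; the divider ratio and all names are assumptions.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative clock sampler: the processor runs on clk_fast while
-- the memory system sees a copy divided down by 2*HALF_PERIOD, so
-- only memory operations pay the full DRAM timing.
entity clock_sampler is
  generic (HALF_PERIOD : integer := 2);  -- e.g. ~50ns core, 200ns memory
  port (
    clk_fast : in  std_logic;
    clk_mem  : out std_logic
  );
end clock_sampler;

architecture behav of clock_sampler is
  signal count : integer range 0 to HALF_PERIOD - 1 := 0;
  signal mem_r : std_logic := '0';
begin
  process (clk_fast)
  begin
    if rising_edge(clk_fast) then
      if count = HALF_PERIOD - 1 then
        count <= 0;
        mem_r <= not mem_r;   -- toggle every HALF_PERIOD fast cycles
      else
        count <= count + 1;
      end if;
    end if;
  end process;
  clk_mem <= mem_r;
end behav;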

Multiprocessing should improve performance by keeping one processor working when we would otherwise have a complete stall. In the worst case, both processors stall on memory frequently and we get very long DRAM stalls. Multiprocessing also adds flexibility for operating system multitasking, since it allows the OS to reserve a processor for particular tasks. Most importantly, we increase throughput by an ideal factor of two: if the caches are effective and synchronization blocking is kept to a minimum, we can have two instructions running per cycle.

Since the data cache was already fully associative, the victim cache merely acted as an extension to it and helped alleviate capacity misses. We probably should have transferred some of this additional capacity to the instruction cache by reducing the size of the data cache and increasing the size of the instruction cache.

The stream buffer can be very effective at reducing the instruction miss rate given reasonable spatial locality. Combining the stream buffer with burst-mode reading decreased the cost of using it, since on average we can read more than one instruction per cycle.

The write buffer’s effectiveness is limited by the speed of the DRAM. If the time it takes to flush a buffer entry is long compared to the interval between new writes entering the buffer, the buffer will always be full and every store word will result in a stall.

Getting the various memory components to work together was a challenge, because many of the components in the processor are asynchronous, with behavior dependent on external signals rather than clocked state. This resulted in several race conditions, where the request signals and the data associated with the request do not arrive at the appropriate times.

Conclusion

In this project, we gained experience dealing with complex memory validity issues in realistic memory systems, especially for multiprocessors. The communication interface between different components has to be thoroughly designed before they can be put together in hardware. We also learned that there are simple, hardware-inexpensive techniques for improving the effective access time of memory that do not require advanced circuit technology, beyond simply making caches larger and faster with larger cache lines.

Making changes to the memory system was not as easy as expected, and the conflicts caused by multiple cache and buffer blocks, multiple processors, and the two-word cache lines were significant. The slow DRAM hit our memory access time hard, and a second-level cache would help greatly in decreasing that metric. Processing all signals synchronously would have made the design much simpler, though we would then have had to divide the memory stage into multiple cycles, since even a hit would have taken multiple cycles.

Appendix A: Notebooks

These can be found in the notebook subdirectory:

JournalFinalRaymond.txt

journalJeryl.doc

journalGlenn.doc

journalVishnu.doc

journalJeremy.txt

Appendix B: Schematics

Relevant schematics in the sch subdirectory:

Dual – Entire datapath

Mempath64 – Single processor

Memblock – Memory block

Aribter64 – New arbiter

Sync – Synchronization block

Cache64 – New cache

Cacheline64 – New cache line block

Instrc – Instruction cache with stream buffer

Streambuffer – Stream buffer

Viccache – Victim cache

Appendix C: VHDL Files

Mem_control – Memory controller

Arbiterm_control – Arbiter control logic

Sync_control – Synchronization control

Cache64_control – Cache control

Sync_logic – Sub-synchronization block

Writebuffer – Write buffer

Iccache – Instruction cache controller

Appendix D: Testing

Diagnostic scripts/programs:

Dual test

Merge sort, the old final project mystery program

Partial sums, the final project mystery program