Fetch Directed Prefetching – A Study

Gokul Nadathur

Nitin Bahadur

Sambavi Muthukrishnan

{ gokul, bnitin, sambavi}@cs.wisc.edu

Dec 8, 2000

University of Wisconsin-Madison

Abstract

This report presents our work on the fetch directed prefetching architecture proposed by Reinman et al. [Reinman 99]. The fetch directed architecture decouples the branch predictor from the main pipeline so that the branch predictor can run ahead of the pipeline on pipeline stalls and i-cache misses. Other components of the architecture help keep the next instruction blocks ready for the fetch unit. We also explored prefetching using stream buffers over the base architecture. We implemented both on SimpleScalar 2.0. Our simulations on a subset of the SPEC95 integer benchmarks show that fetch directed prefetching improves performance, and gives a greater improvement in IPC than the base SimpleScalar architecture with stream buffers. We observed a peak improvement of 8.8% for fetch directed prefetching and 6.29% for stream buffers among the benchmarks we ran. Our statistics show that these modest gains can be improved by multi-porting and/or pipelining the cache hierarchy. Thus the fetch directed architecture with intelligent prefetching hides i-cache stall latencies, improving fetch bandwidth and performance.

  1. Introduction

As computer architecture advances, most modern processors aim at achieving better performance through improved instruction level parallelism. This places a large demand on the instruction fetch mechanism and the memory/cache hierarchy. Also, with the advent of superscalar processors that can execute multiple instructions per cycle, there is a need for better branch prediction (perhaps multiple branch predictions per cycle). The rate at which the BTB and branch predictor can be cycled thus has an effect on the number of instructions that can actually be executed, since it limits the fetch bandwidth. Unless the instruction fetch and memory hierarchy respond to the needs of the execution engine, the true benefits of the advances in ILP cannot be realized.

This argument motivates our study on instruction prefetching techniques. When instructions are prefetched, there is a possibility of hiding memory latencies due to cache misses. We study two techniques: stream buffers and fetch-directed prefetching [Jouppi 90 and Reinman 99].

While stream buffers approach the problem of improving fetch bandwidth simply by prefetching instructions/cache lines sequentially on a cache miss, the fetch directed architecture improves upon this by modifying the baseline architecture to provide a better idea of what should be prefetched. The fetch directed architecture works by letting the branch predictor work independently of the fetch unit. This enables a prediction each cycle (dependent, of course, on the cycle time of the branch predictor and the BTB). So the branch predictor is never stalled by instruction cache misses (when the fetch unit is stalled), and similarly the fetch unit is not stalled by any latency in branch prediction. The predictions produced each cycle may be used to determine what to prefetch (with some suitable filtering mechanism).

The fetch directed architecture also proposes modifications to the Branch Target Buffer that enable better space utilization: all addresses that fall through (even branches predicted as fall through) are grouped together, so BTB entries are not wasted on such fall-through branches.

The main goals of our project were to study the fetch directed architecture proposed in [Reinman 99], to study prefetching in this context, to modify the SimpleScalar tool set for this architecture, and to revalidate the authors' conclusions from our simulation results. We also compare the performance of the fetch directed prefetching scheme with the stream buffers technique.

This paper is organized as follows. Section 2 describes the fetch directed architecture in detail. Section 3 describes the stream buffers and the fetch directed prefetching which we have implemented. The results of our simulation studies are presented in Section 4 and we finally conclude in Section 5.

2. Fetch Directed Prefetching Architecture

The fetch directed architecture uses a decoupled branch predictor and an instruction cache. The decoupled branch predictor runs ahead of the execution PC making predictions. The other components of the architecture make use of the predictions to supply the fetch engine with the instruction addresses to be executed next. Any prefetching mechanism can be added to the fetch directed architecture to make use of the instruction addresses which are available before they are used. Since stream buffers [Jouppi 90] show consistently better performance, we add stream prefetching to the base design to yield an architecture which combines the use of future predictions and prefetching to supply the fetch engine with the correct instructions and hide the L1 i-cache miss latencies. The main components of the fetch directed prefetching architecture are:

1) The decoupled branch predictor

2) Fetch Target Buffer (FTB)

3) Fetch Target Queue (FTQ)

4) Prefetch Filtration Mechanism

5) Prefetch Instruction Queue (PIQ)

6) Prefetch Buffers and Prefetch Mechanism

The fetch directed prefetching architecture is shown in Figure 1. The fetch target buffer (FTB) is similar to the branch target buffer (BTB). The fetch target queue (FTQ) contains a list of instruction address blocks which should be next used by the fetch stage. The prefetch instruction queue (PIQ) contains a list of i-cache blocks which should be prefetched by the prefetch unit. Each component of the architecture is described in detail in the following sections.

2.1 Decoupled branch predictor

Motivation for a decoupled predictor

The decoupled branch predictor is the focal point of the fetch directed architecture. The predictor is decoupled from the main pipeline, which enables it to run ahead of the pipeline. The advantage of this is that the branch predictor is able to work even when the fetch unit is stalled. The branch predictor produces a useful prediction every cycle which may be used in later cycles by the fetch unit. This scheme, however, has the problem that the history seen by the branch predictor may not be up to date, since it makes predictions well in advance of the fetch unit using any of those predictions.

Changes to SimpleScalar

The changes made to the SimpleScalar tool set to incorporate a decoupled branch predictor were the following. Changes were needed to the main pipeline to extract the branch predictor out of it. Essentially, the original main pipeline called the branch predictor for every fetch address. Instead, the fetch addresses were now specified by the Fetch Target Queue (Section 2.3). An entry was popped off the queue to specify a range of addresses to access. When this range was completely fetched and put into the dispatch queue, the next range was read off the FTQ. Every range in an FTQ entry represents a fetch target block, starting at a taken-branch target or the starting address of the program and ending at a taken branch or at the size limit of an FTQ entry.
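
As a rough sketch of this control flow (all identifiers below are illustrative names of our own, not actual SimpleScalar 2.0 symbols), the modified fetch stage now consumes address ranges from the FTQ instead of invoking the branch predictor inline:

    /* Illustrative sketch only: names below are ours, not SimpleScalar's. */
    typedef unsigned md_addr_t;        /* stands in for SimpleScalar's address type */

    struct fetch_range {
        md_addr_t start;               /* next instruction address in the block */
        md_addr_t end;                 /* address just past the fetch block     */
    };

    extern int  ftq_dequeue(struct fetch_range *r);   /* assumed FTQ accessor   */
    extern int  dispatch_queue_full(void);            /* assumed                */
    extern void fetch_one_inst(md_addr_t pc);         /* i-cache access +       */
                                                      /* dispatch enqueue       */
    void fetch_stage(void)
    {
        static struct fetch_range cur = { 0, 0 };

        /* current range exhausted: pop the next block off the FTQ */
        if (cur.start >= cur.end && !ftq_dequeue(&cur))
            return;                    /* FTQ empty: fetch idles this cycle */

        while (cur.start < cur.end && !dispatch_queue_full()) {
            fetch_one_inst(cur.start);
            cur.start += 4;            /* fixed 4-byte instructions assumed */
        }
    }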

The other changes were to the branch mispredict and branch misfetch handling. In either case, the entire fetch directed architecture state associated with the mispredicted branch needs to be flushed. The PCs used by the branch predictor and the main pipeline also need to be reset to the correct target PC.

There was a need to maintain the branch predictor update information from the branch prediction stage to the branch predictor update stage (when the correct outcome of the branch is known). In the fetch directed architecture, the branch predictor runs ahead of the main pipeline and hence this information cannot be stored in the fetch data queue (in the way that the baseline architecture works). So we added an extra queue between the branch predictor and the fetch stage. This queue maintains the direction update information for the branches until it can be put into the fetch data queue in the fetch stage.
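
A minimal sketch of that extra queue, assuming a simple ring buffer and our own guess at the fields that must survive from prediction time to fetch time (none of these names come from SimpleScalar), might look like:

    /* Illustrative only: one record per predicted branch, pushed at
     * prediction time and drained when the branch reaches the fetch
     * stage, where it joins the fetch data queue. */
    typedef unsigned md_addr_t;

    struct bpred_upd_rec {
        md_addr_t branch_pc;           /* address of the predicted branch */
        int       pred_taken;          /* predicted direction             */
        void     *dir_update;          /* predictor-specific update state */
    };

    #define UPDQ_SIZE 32
    static struct bpred_upd_rec updq[UPDQ_SIZE];
    static int updq_head, updq_tail;   /* simple ring buffer */

    static int updq_push(struct bpred_upd_rec r)
    {
        int next = (updq_tail + 1) % UPDQ_SIZE;
        if (next == updq_head)
            return 0;                  /* queue full: predictor must stall */
        updq[updq_tail] = r;
        updq_tail = next;
        return 1;
    }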

2.2 Fetch Target Buffer (FTB)

The fetch target buffer is used for storing the fall through and target addresses of the branch ending a basic block. If the branch ending the basic block is predicted as taken, then the target address is used for next cycle's FTB lookup; otherwise the fall through address is used. Each entry in the FTB contains a tag, the taken target address and the fall through address of the block. It is similar to the basic block target buffer (BBTB) [Yeh-Pat 92]. The FTB differs from the BBTB in three aspects:

a) Basic blocks with branches that are not taken are not stored in the FTB.

b) The full fall through address of the basic block is not stored. Instead only the size of the basic block is stored. This reduces the storage requirements for the fall through address (it has been shown [Reinman 99] that the typical distance between the current fetch address and the BBTB's fall through address is not large).

c) The FTB is not coupled with the instruction cache, so that it can continue operation even when the i-cache is stalled due to a miss.

Each cycle the FTB is accessed by the branch predictor with an address and a prediction [Reinman2 99]. If a matching entry is found in the FTB, an entry is inserted into the FTQ starting at the current address and ending at the last instruction of the basic block. Based on the prediction, the target or fall through address is returned to the branch predictor lookup routine, which will use this address for looking up the FTB in the next cycle. Thus the branch predictor and FTB together predict the next set of blocks that will be used by the fetch engine. By predicting the blocks before they are used by the fetch engine, the prefetch mechanism can prefetch the instructions in time for use, thus avoiding i-cache miss latencies. An advantage of the FTB is that it can contain fetch blocks which span multiple cache blocks, so on an FTB hit, a single- or multi-block entry from the FTB is inserted into the FTQ (in one or more entries). If there is an FTB miss, then a sequential block starting at the next cache-aligned address and spanning up to 2 cache blocks is inserted into the FTQ. Since not-taken branches do not have entries in the FTB, their lookup will fail and sequential blocks will be inserted into the FTQ. This not only gives correct operation but also reduces the space requirement of the FTB by not storing not-taken branches. On repeated FTB lookup failures, sequential blocks are inserted into the FTQ until a misprediction occurs, at which point the FTB is updated with the correct information and lookup failures decrease. Since the FTB is of fixed size, entries in the FTB are replaced using an LRU policy. A two-level FTB would help reduce the FTB access time while allowing a larger second-level FTB; we have currently implemented only a single-level FTB.
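
The following sketch illustrates the FTB entry layout and the per-cycle lookup described above; all names, the 32-byte line size and the helper routines are our own assumptions, not code from [Reinman 99] or SimpleScalar:

    /* Illustrative only: field and routine names are ours. */
    typedef unsigned md_addr_t;
    #define CACHE_BLK 32                     /* assumed L1 line size */
    #define BLK_ALIGN(a) ((a) & ~(md_addr_t)(CACHE_BLK - 1))

    struct ftb_entry {
        md_addr_t tag;       /* fetch block starting address          */
        md_addr_t target;    /* taken-branch target address           */
        unsigned  size;      /* block size; fall-through = tag + size */
    };

    extern struct ftb_entry *ftb_probe(md_addr_t pc);     /* assumed */
    extern void ftq_enqueue(md_addr_t lo, md_addr_t hi);  /* assumed */

    /* One lookup per cycle: returns the address to look up next cycle. */
    md_addr_t ftb_lookup(md_addr_t pc, int pred_taken)
    {
        struct ftb_entry *e = ftb_probe(pc);
        if (e) {                             /* hit: enqueue the fetch block */
            ftq_enqueue(pc, pc + e->size);   /* may span multiple cache lines */
            return pred_taken ? e->target : pc + e->size;
        }
        /* miss: enqueue a sequential block of up to two cache lines */
        ftq_enqueue(BLK_ALIGN(pc), BLK_ALIGN(pc) + 2 * CACHE_BLK);
        return BLK_ALIGN(pc) + 2 * CACHE_BLK;
    }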

Changes to SimpleScalar

The FTB is used by the branch predictor and it fills in entries into the FTQ. So the decoupled branch predictor was made to use the FTB and to update it on all branch lookups and branch updates. The blocks produced by the FTB are then inserted into the FTQ.

2.3 Fetch Target Queue (FTQ)

The FTQ is a queue of instruction address blocks. It is used to bridge the gap between the branch predictor and the instruction cache [Reinman 99]. The fetch engine dequeues the FTQ to get the instruction addresses to fetch. The FTQ provides the buffering necessary to allow the branch predictor and i-cache to operate autonomously. It allows the branch predictor to work ahead of the i-cache when the latter is stalled due to an i-cache miss.

Each FTQ entry contains up to 3 fetch block entries and the target address of the branch ending the basic block. Each fetch block entry represents an L1 i-cache line and contains a valid bit, an enqueued-prefetch bit (whether the entry has already been enqueued for prefetch) and the cache block address.
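
A possible layout of such an entry, with illustrative field names of our own choosing, is:

    /* Illustrative layout only: field names are ours. */
    typedef unsigned md_addr_t;
    #define FTQ_BLOCKS_PER_ENTRY 3

    struct ftq_fetch_block {
        unsigned  valid    : 1;   /* sub-entry holds a cache block */
        unsigned  enq_pref : 1;   /* already enqueued on the PIQ?  */
        md_addr_t blk_addr;       /* L1 cache block address        */
    };

    struct ftq_entry {
        struct ftq_fetch_block blk[FTQ_BLOCKS_PER_ENTRY];
        md_addr_t target;         /* target of the branch ending the block */
    };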

Changes to SimpleScalar

The FTQ is filled by the FTB and provides instruction address blocks to the fetch unit. So the fetch unit had to be modified to make use of the instruction blocks provided by the FTQ. The FTQ also provides some interfaces for use by the prefetch filtering mechanism.

2.4 Prefetch Instruction Queue (PIQ) and Prefetch Filter

The Prefetch Instruction Queue is a queue of addresses waiting to be prefetched by the prefetch engine. Each PIQ entry indicates a cache line which is to be prefetched. Entries are inserted into the PIQ every cycle by the prefetch filtration mechanism. The earlier a prefetch can be initiated before the fetch block reaches the i-cache, the greater the potential to hide the miss latency. At the same time, the farther back the fetch block is on the FTQ, the more likely it is to lie on a mispredicted path. So the heuristic [Reinman 99] chooses entries starting from the second FTQ entry and going up to the tenth. The first entry in the FTQ is skipped, as there would be little benefit in starting to prefetch it when it is so close to being fetched by the i-cache. By searching only 9 entries, we reduce the time required to find matching entries to insert into the PIQ. Depending on exact timings, multiple entries can be inserted into the PIQ in one cycle; currently we enqueue up to 3 entries at a time. Associated with each PIQ entry is a prefetch buffer. We implemented the PIQ and the prefetch stream buffers with the same number of entries, but it is possible to use fewer prefetch buffers depending on the memory hierarchy and latencies. The prefetch mechanism dequeues entries from the PIQ and schedules them for prefetch. Thus the filtration mechanism fills the PIQ while the prefetch mechanism empties it.
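
A sketch of this per-cycle filter, reusing the illustrative ftq_entry layout from Section 2.3 and assuming hypothetical accessor routines, could be:

    /* Illustrative only: scans FTQ entries 2..10 (0-based indexes 1..9)
     * and enqueues up to 3 not-yet-enqueued cache blocks per cycle. */
    extern struct ftq_entry *ftq_peek(int idx);   /* assumed accessors */
    extern int  piq_full(void);
    extern void piq_enqueue(md_addr_t blk_addr);

    void prefetch_filter(void)
    {
        int i, b, enq = 0;
        for (i = 1; i <= 9 && enq < 3; i++) {     /* skip the head entry */
            struct ftq_entry *fe = ftq_peek(i);
            if (!fe)
                break;                            /* FTQ shorter than i+1 */
            for (b = 0; b < FTQ_BLOCKS_PER_ENTRY && enq < 3; b++)
                if (fe->blk[b].valid && !fe->blk[b].enq_pref && !piq_full()) {
                    piq_enqueue(fe->blk[b].blk_addr);
                    fe->blk[b].enq_pref = 1;      /* never enqueue twice */
                    enq++;
                }
        }
    }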

Changes to SimpleScalar

A new mechanism was added which checked the FTQ for any entries which could be enqueued for prefetch and were not already enqueued. The filtration scheme looked at FTQ entries, filtered them based on the above heuristic and enqueued them into the PIQ for prefetch. Whenever the L2 bus was free and a stream buffer was available, the prefetch mechanism dequeued an entry from the PIQ and started prefetching it.

2.5 Recovery on mispredictions

On a misprediction, the FTQ, PIQ and prefetch buffer entries are invalidated and the FTB is updated with the correct branch information. The sizes of the FTQ and PIQ can be set based on their observed utilizations.
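
A minimal sketch of this recovery step, with all routine names assumed for illustration, is:

    /* Illustrative only: discard all speculative prefetch state and
     * restart both the predictor PC and the fetch PC at the target. */
    typedef unsigned md_addr_t;

    extern void ftq_flush(void), piq_flush(void), pbuf_invalidate_all(void);
    extern void ftb_update(md_addr_t br_pc, md_addr_t target, int taken);
    extern md_addr_t bpred_pc, fetch_pc;   /* decoupled and pipeline PCs */

    void fdp_recover(md_addr_t br_pc, md_addr_t correct_pc, int taken)
    {
        ftq_flush();                  /* drop queued fetch blocks        */
        piq_flush();                  /* drop pending prefetch requests  */
        pbuf_invalidate_all();        /* drop prefetched cache lines     */
        ftb_update(br_pc, correct_pc, taken);  /* fix the FTB entry      */
        bpred_pc = fetch_pc = correct_pc;
    }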

3. Stream Buffers

3.1 Introduction

Stream buffers are used to reduce the penalty due to compulsory and capacity misses. A stream buffer is a small cache that is one cycle away from the L1 cache: a miss in L1 serviced by the stream buffer incurs a penalty of 1 cycle. The buffers are probed in parallel with an i-cache access, and if the i-cache access misses, the cache block is fetched into the i-cache from the stream buffer in the next cycle. The buffers can be probed in different ways, and we have explored two probe policies. In the first, the buffer is a FIFO queue in which only the head entry has a comparator, so any probe is compared against the head of the queue. If the entry is present it is serviced; otherwise the entire queue is flushed and prefetching restarts from the block sequentially following the miss. This scheme has the advantage of being simple, requiring a comparator only at the head. The second policy is LRU. It uses a fully associative prefetch buffer with an LRU replacement strategy, adding a replaceable bit to each prefetch buffer entry. A prefetch is initiated only if there is an entry marked as replaceable in the prefetch buffer, choosing the least recently used replaceable entry. On a miss all entries are marked as replaceable. Entries are also marked as replaceable when the matching cache blocks are fetched during the instruction cache fetch.
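
The FIFO policy might be sketched as follows, assuming it is invoked on an L1 i-cache miss; the accessor names and 32-byte line size are our own assumptions:

    /* Illustrative only: FIFO policy with a comparator at the head.
     * Called on an L1 i-cache miss; returns nonzero on a stream
     * buffer hit (the line fills L1 in the following cycle). */
    typedef unsigned md_addr_t;
    #define CACHE_BLK 32                       /* assumed L1 line size */

    extern int  sbuf_head_valid(void);         /* assumed accessors */
    extern md_addr_t sbuf_head_addr(void);
    extern void sbuf_flush(void);
    extern md_addr_t sbuf_next_prefetch;       /* next block to stream in */

    int sbuf_probe_fifo(md_addr_t blk_addr)
    {
        if (sbuf_head_valid() && sbuf_head_addr() == blk_addr)
            return 1;                          /* hit: fill L1 next cycle */
        sbuf_flush();                          /* miss: discard the stream */
        sbuf_next_prefetch = blk_addr + CACHE_BLK;  /* restart sequentially */
        return 0;
    }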

Changes to SimpleScalar

The stream buffer logic and policies were implemented separately. Changes were made to the fetch unit of SimpleScalar, the L1 i-cache miss handler and the main pipeline loop. Additional configuration parameters were added to configure the stream buffer for different runs. The fetch unit of SimpleScalar was modified to probe the stream buffer in parallel with the i-cache lookup. The prefetch was initiated after the fetch in the pipeline to give priority to instruction fetches over prefetches. The L1 miss handler was modified so that a miss in L1 with a corresponding hit in the stream buffer is serviced by the stream buffer. If a miss occurs both in L1 and the stream buffer, then an L2 cache access is made; during such an access, the stream buffers are flushed appropriately and the next prefetch address is also set. Cases in which an instruction fetch is blocked by an ongoing prefetch were also handled.
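
Putting the pieces together, a sketch of the resulting fetch decision path (illustrative names only; actual latencies are configuration dependent) is:

    /* Illustrative only: decision path for an instruction fetch, with
     * the stream buffer probed alongside the L1 i-cache. */
    typedef unsigned md_addr_t;

    extern int  icache_probe(md_addr_t blk);   /* assumed accessors */
    extern int  sbuf_probe(md_addr_t blk);
    extern void icache_fill_from_sbuf(md_addr_t blk);
    extern unsigned l2_access(md_addr_t blk);  /* returns L2/memory latency */
    #define L1_HIT_LAT 1

    unsigned ifetch_latency(md_addr_t blk)
    {
        int l1_hit = icache_probe(blk);
        int sb_hit = sbuf_probe(blk);          /* probed in parallel */
        if (l1_hit)
            return L1_HIT_LAT;
        if (sb_hit) {
            icache_fill_from_sbuf(blk);        /* move the line into L1 */
            return L1_HIT_LAT + 1;             /* one-cycle penalty     */
        }
        return l2_access(blk);                 /* miss in both: go to L2 */
    }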