DATA CACHE PREFETCHING SCHEMES FOR MULTIMEDIA
AND IMAGE PROCESSING APPLICATIONS
ALAA R. ALAMELDEEN, DEEPAK JINDAL, SANJEEV KULKARNI
Dept. of Computer Sciences, University of Wisconsin-Madison
ABSTRACT
Data prefetching is an important approach to reducing the data cache miss rate caused by compulsory and capacity cache misses. Prefetching can be performed in hardware or in software. The basic idea is to keep track of a program's data access patterns and to use this information to anticipate the location of data that will be accessed next. Image processing and multimedia applications are among the fastest-growing application areas in computer systems. They are memory-intensive and time-critical, but they have predictable data access patterns. These properties make them ideal candidates for data prefetching schemes.
In this paper, we study several hardware-based data prefetching schemes and their effect on the performance of image processing and multimedia applications. We introduce a modification to the correlated-prefetching hardware scheme presented by Chen and Baer that removes its reliance on compiler support. We describe a set of multimedia and image processing benchmark programs. Simulation experiments are conducted to compare different hardware prefetching schemes; they show satisfactory results for most of the image processing and multimedia applications without hurting performance for other SPEC benchmarks.
KEY WORDS
Prefetching, hardware-based prefetching, data cache, reference prediction, memory access patterns, image processing, multimedia.
1. INTRODUCTION
Due to the widening speed gap between processors and memory, the increasing speed of processors cannot be exploited efficiently unless memory latency can be tolerated. The invention of cache memories as an intermediate storage level between main memory and the processor was a breakthrough that helped reduce the gap between processor and main memory speeds. However, memory latency cannot be eliminated entirely. Upon a cache miss, the processor has to wait until the data is fetched into the cache and delivered to the processor. This wait time cannot be fully hidden even by out-of-order processors.
Several techniques have been proposed to reduce the data access penalty that results from cache misses. These include [5]:
- Increasing the cache size, which is limited by the available on-chip area
- Increasing cache associativity to reduce conflict misses, although this may increase the cache hit time [6]
- Jouppi's victim cache [8], which reduces conflict misses by storing blocks replaced from the cache in a small associative victim cache. However, its benefit diminishes as cache size and/or associativity increase.
- Data prefetching, which reduces compulsory and capacity cache misses by attempting to have data in the cache before it is actually required, whether it has been referenced before and replaced (capacity miss) or not (compulsory miss). Data prefetching is defined as moving data from a higher level of the memory hierarchy to a lower level of the memory hierarchy before the data is actually needed [14].
Data prefetching can be done in hardware or in software. Hardware-based data prefetching, explained in more detail in the next section, is based on adding a hardware unit that fetches data into the data cache before it is needed; the prefetched addresses are predicted from previously observed access patterns. Software-based prefetching [2, 9, 10, 11, 16], on the other hand, is based on analysis of the static program. An intelligent compiler may insert special instructions that prefetch data many cycles ahead of their use by other instructions. These techniques can issue more prefetches and can prefetch complex access patterns that would be too costly to detect in hardware. This comes at the cost of added prefetch instructions (including a change in the instruction set architecture) as well as additional address calculations in the program (which may slow it down). Many machines include architectural support for software-based prefetching schemes; one example is the HP PA-8000 [14].
Multimedia and image processing applications are growing application areas in modern computing environments, a trend reinforced by internet technology and its many services, such as video-on-demand. Inspired by this trend, most modern architectures include some architectural support for these applications, such as Intel MMX [7]. The main problem with these applications is that they are memory intensive, processing large volumes of data, which makes them rather slow. For example, a typical 24-bit true-color image of dimensions 500x500 requires approximately 750 Kbytes of storage (500 x 500 x 3 bytes), which cannot fit in L1 caches of practical sizes. However, the memory access patterns of most of these applications are quite regular and predictable, which makes them well suited to data prefetching schemes.
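To illustrate this regularity, consider a simple brightness-adjustment kernel over such an image. The dimensions, array layout and function name below are our own assumptions rather than code from any benchmark, but the traversal is representative: every access walks the pixel array with a constant stride, so each next address is trivially predictable.

    /* Hypothetical kernel over a 500x500 24-bit image (one byte per channel).
       Total data touched: 500 * 500 * 3 = 750,000 bytes, far more than a typical L1 cache. */
    #define HEIGHT 500
    #define WIDTH  500

    unsigned char image[HEIGHT][WIDTH][3];

    void brighten(int delta)
    {
        for (int i = 0; i < HEIGHT; i++)
            for (int j = 0; j < WIDTH; j++)
                for (int c = 0; c < 3; c++)
                    image[i][j][c] += delta;    /* constant-stride accesses (saturation ignored) */
    }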
The rest of this paper is organized as follows. Section 2 surveys some hardware-based data prefetching schemes, with particular focus on Chen and Baer's basic and lookahead prefetching schemes [3]. Section 3 explains the Chen and Baer correlated reference prediction scheme. Section 4 introduces our modification to the correlated prefetching scheme. Section 5 describes the image processing benchmarks used to compare the performance of different hardware-based prefetching schemes. Section 6 describes our simulation model and performance metrics. Section 7 presents the results and their explanations. Section 8 provides conclusions and suggestions for future work.
2. HARDWARE-BASED DATA PREFETCHING
Hardware-Based data prefetching schemes dynamically predict the memory address to prefetch based on addresses of previous memory accesses. These techniques can be classified into two main classes:
- On-chip schemes, which depend on the addresses required by the processor in all data references. Examples include Chen and Baer's basic, lookahead and correlated schemes [3].
- Off-chip schemes, which depend on the addresses that result in L1 cache misses, since these are usually the only addresses available outside the processor chip. These include Jouppi’s stream buffers [8] enhanced by Palacharla and Kessler [12].
There are arguments for both approaches. On-chip schemes can be more accurate in predicting prefetched data, thus removing a large portion of the compulsory and capacity cache misses. On the other hand, on-chip schemes consume precious chip area that could otherwise be used for a larger cache, and they increase the required memory bandwidth. Off-chip schemes do not frequently replace useful data in the data cache and require much less memory bandwidth, but their prefetching accuracy is usually not as good as that of on-chip schemes.
2.1 Memory Access Patterns
Prefetching schemes, both on-chip and off-chip, base their prediction of the next address needed by an instruction on some regularity in the memory access pattern. For example, in a program segment consisting of nested loops, the memory access patterns can be divided into the following four main categories [3] (a short code fragment after the list illustrates each of them):
1. Scalar, which is a simple variable reference that does not change with respect to the loop index.
2. Zero stride, which is a reference inside an inner loop with a subscript expression that does not change with respect to the inner loop index, but may change with respect to the outer loop (e.g. a reference to A[i] inside an inner loop indexed by j).
3. Constant stride, which is a reference inside a loop with a subscript expression that increases with a constant rate with respect to the loop index (e.g. the reference A[i] inside the loop indexed by i).
4. Irregular, which is any pattern other than the previous three (e.g., linked-list traversals, tree traversals, etc.).
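As a concrete illustration, the following function is a made-up C example (not taken from any benchmark) containing instances of all four categories:

    /* Illustrative example; the arrays, list type and parameters are hypothetical. */
    struct node { int value; struct node *next; };

    int access_patterns(int s, int n, int m, int A[], int B[], struct node *head)
    {
        int sum = s;                                /* scalar: same address every iteration        */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                sum += A[i];                        /* zero stride with respect to inner index j   */
                sum += B[j];                        /* constant stride with respect to j           */
            }
        for (struct node *p = head; p != NULL; p = p->next)
            sum += p->value;                        /* irregular: pointer-chasing traversal        */
        return sum;
    }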
Hardware-based techniques implement special circuits that detect regularity in the memory access patterns of an instruction or a group of instructions, and base their predictions on these patterns. Some techniques even extend their predictions to patterns classified as irregular, such as linked-list traversals [13].
2.2 Jouppi’s Stream Buffer Scheme
This scheme is a variation of the One Block Lookahead (OBL) prefetching scheme [15]. The OBL scheme prefetches cache block i+1 whenever cache block i is referenced. Jouppi [8] suggested an off-chip scheme in which several data streams can be prefetched using FIFO stream buffers. A stream buffer is allocated whenever there is a data cache miss on an address, and the buffer proceeds by prefetching successive addresses. Each stream buffer entry stores the next address to be referenced, the cache block and tag, and a valid bit. An incrementor is used to generate the next prefetch address. Note that stream buffers are only allocated on a data cache miss, so this scheme requires far less memory bandwidth than most on-chip schemes.
Palacharla and Kessler [12] extended Jouppi's scheme to detect and prefetch non-unit strides. They added fields containing the last instruction address (PC), stride and state to each stream buffer. The last instruction address is used to keep track of the last reference to this stream buffer. The stride field, which is the difference between the last two references to this stream buffer, is used to determine the next address to be prefetched. The state field determines whether to proceed with prefetching or not, and is updated based on the success of previous predictions.
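A minimal sketch of the per-buffer state described above could look as follows; the field names, the 64-byte line size and the state encoding are our own assumptions, and the actual designs in [8, 12] differ in detail.

    /* Sketch of one stream buffer entry, extended with stride detection as in [12]. */
    struct stream_buffer {
        unsigned long next_addr;    /* next address to prefetch                     */
        unsigned long last_pc;      /* PC of the last load that used this buffer    */
        long          stride;       /* difference between the last two references   */
        int           state;        /* e.g. ALLOCATED, STEADY, NO_PREFETCH          */
        unsigned char valid;        /* valid bit for the buffered block             */
        unsigned long tag;          /* tag of the buffered block                    */
        unsigned char block[64];    /* prefetched cache block (assumed 64-byte line) */
    };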
2.3 Chen and Baer’s Reference Prediction Schemes
Chen and Baer [3] introduced three variations of on-chip hardware-based data prefetching. These variations handle the scalar, zero-stride and constant-stride access patterns. The basic scheme depends on the construction of a reference prediction table (RPT) with an entry for each instruction in the program that references memory (i.e., loads and stores). An entry in the RPT consists of the instruction address (PC), the previous data address referenced by this instruction, and the stride, which is the difference between the last two referenced data addresses. In addition, an RPT entry contains a state field that provides information about the success of previous prefetches for this entry. Data prefetching is triggered when the program counter reaches an instruction that has a corresponding entry in the RPT. If the state of the corresponding RPT entry indicates that data accesses can be predicted, the data at address (current address + stride) is prefetched to the cache. The state field of an RPT entry is updated on any memory reference by the instruction whose address is stored for that entry, according to the state transition diagram shown in figure 1. All states are predictable except for the No-pred state. Whenever data from a certain address is to be prefetched, it is added to an Outstanding Request List (ORL). Data is prefetched from the ORL in order as long as no demand cache misses occur. Demand cache misses have precedence over ORL prefetches, and a reference is not prefetched if it would result in an exception (e.g., a page fault).
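The following C-style sketch shows one RPT entry and a simplified update routine. The state transitions are one plausible reading of figure 1, and enqueue_prefetch() is a hypothetical helper standing in for adding an address to the ORL; neither is the exact hardware described in [3].

    /* Sketch of an RPT entry and its update on a memory reference. */
    enum rpt_state { INITIAL, TRANSIENT, STEADY, NO_PRED };

    struct rpt_entry {
        unsigned long  pc;          /* address of the load/store instruction   */
        unsigned long  prev_addr;   /* last data address referenced by it      */
        long           stride;      /* difference of the last two addresses    */
        enum rpt_state state;
    };

    void enqueue_prefetch(unsigned long addr);   /* hypothetical ORL interface */

    void rpt_update(struct rpt_entry *e, unsigned long addr)
    {
        int correct = ((long)(addr - e->prev_addr) == e->stride);

        switch (e->state) {                       /* update prediction confidence        */
        case INITIAL:   e->state = correct ? STEADY    : TRANSIENT; break;
        case TRANSIENT: e->state = correct ? STEADY    : NO_PRED;   break;
        case STEADY:    e->state = correct ? STEADY    : INITIAL;   break;
        case NO_PRED:   e->state = correct ? TRANSIENT : NO_PRED;   break;
        }
        if (!correct)
            e->stride = (long)(addr - e->prev_addr);
        e->prev_addr = addr;

        if (e->state != NO_PRED)                  /* all states except No-pred predict   */
            enqueue_prefetch(addr + e->stride);   /* request (current address + stride)  */
    }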
This scheme decreases data access penalty on nearly all benchmarks, but it has a drawback due to its strategy of prefetching the predicted reference for the next execution of the current instruction. If the next occurrence of the same instruction is too close (as in small loops), the prefetched data may not arrive in time for the instruction to use. On the other hand, if the next occurrence of the instruction is too far (as in large loops), the predicted data may arrive too early and replace a useful block in the data cache, or might even get replaced before being used. Due to both of these drawbacks, Chen and Baer introduced the lookahead reference prediction scheme.
In the lookahead reference prediction scheme, an advanced copy of the program counter, the lookahead program counter (LA_PC), runs ahead of the actual program counter and is used to trigger data prefetches. LA_PC stays ahead of the actual PC by slightly more cycles than the memory latency. If LA_PC matches an address in an RPT entry, the next predicted reference is added to the ORL. This increases the probability that the prefetched data will arrive in the cache just before it is actually needed. The lookahead scheme adds another field to the RPT entries, called times, which indicates the number of iterations LA_PC is ahead of the actual PC. Updating LA_PC requires branch prediction to predict the outcomes of branches. When a branch misprediction is discovered while executing the normal instruction stream, LA_PC is reset back to PC and all requests on the ORL are removed. When PC stalls because of a cache miss, LA_PC can still advance and issue prefetches to the ORL.
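A rough sketch of how the times field might be maintained is given below, reusing the rpt_entry structure and the hypothetical enqueue_prefetch() helper from the previous sketch; this is our own simplification of the mechanism in [3], not its exact hardware.

    /* Sketch of lookahead triggering: the RPT entry is extended with 'times'. */
    struct la_rpt_entry {
        struct rpt_entry base;    /* pc, prev_addr, stride, state as before          */
        int              times;   /* iterations LA_PC is ahead of the actual PC      */
    };

    /* Called when LA_PC reaches an instruction with a predictable RPT entry. */
    void lookahead_trigger(struct la_rpt_entry *e)
    {
        if (e->base.state != NO_PRED) {
            e->times++;                           /* LA_PC moves one more iteration ahead */
            enqueue_prefetch(e->base.prev_addr + (long)e->times * e->base.stride);
        }
    }

    /* Called when the actual PC reaches the same instruction. */
    void la_rpt_reference(struct la_rpt_entry *e, unsigned long addr)
    {
        if (e->times > 0)
            e->times--;                           /* the real PC caught up by one iteration */
        rpt_update(&e->base, addr);               /* usual state and stride update          */
    }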
Chen and Baer also described another scheme, the correlated reference prediction scheme, which is going to be discussed and modified in the next sections.
2.4 Other Schemes
Joseph and Grunwald [4] introduced an off-chip scheme that uses the miss address stream as a prediction source. They assume that an observed Markov model can approximate the miss reference stream. A model is dynamically constructed whenever a program is being executed. This means that the Markov model captures the past activity of all programs running on a system and uses this information to predict future references. The Markov model is approximated by limiting the number of states as well as the out-degree of each state to be applicable in hardware. The main disadvantage of this scheme is its high hardware complexity [4].
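As a toy approximation of such a predictor (the table size, out-degree and replacement policy are arbitrary assumptions, and enqueue_prefetch() is the same hypothetical helper as before), one could keep a small table mapping a miss address to the miss addresses that have followed it:

    /* Toy Markov miss predictor: each state records up to OUT_DEGREE successor misses. */
    #define STATES     256
    #define OUT_DEGREE 4

    struct markov_state {
        unsigned long miss_addr;                 /* miss address this state represents */
        unsigned long next_miss[OUT_DEGREE];     /* successor misses seen in the past  */
    };

    struct markov_state table[STATES];
    unsigned long last_miss;

    void on_cache_miss(unsigned long addr)
    {
        /* record addr as a successor of the previous miss (simple FIFO insert) */
        struct markov_state *s = &table[last_miss % STATES];
        for (int i = OUT_DEGREE - 1; i > 0; i--)
            s->next_miss[i] = s->next_miss[i - 1];
        s->next_miss[0] = addr;
        s->miss_addr = last_miss;

        /* predict: prefetch the recorded successors of the current miss address */
        struct markov_state *p = &table[addr % STATES];
        if (p->miss_addr == addr)
            for (int i = 0; i < OUT_DEGREE; i++)
                if (p->next_miss[i])
                    enqueue_prefetch(p->next_miss[i]);

        last_miss = addr;
    }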
Roth et al. [13] introduced a Dependence-Based prefetching scheme, which dynamically identifies the load instructions that access a linked list structure, and constructs a description of the steps the program has followed to traverse the structure. Based on the assumption that the program will continue to follow the same steps, a prefetch engine takes this description and speculatively executes the load instructions in parallel with the original program’s execution. Details of this scheme can be found in [13].
3. CORRELATED PREFETCHING SCHEME
From the results in the published papers on the above schemes, on-chip schemes seem to yield much better performance (i.e., lower miss rates and data access penalties) than off-chip schemes. Chen and Baer [3] introduced another scheme, the correlated reference prediction scheme, which handles two levels of stride changes. They introduced this scheme to handle triangular loops. A typical triangular loop is:
    for (i = 0; i < n; i++)
        for (j = 0; j < i; j++)
            A[i][j] = ...;
References in the inner loop are constant strides, but the frequent stride changes on each transition to the outer loop cause a lot of useless prefetches in the lookahead scheme. In image processing applications, if processing is done for only part of the image, a change of stride occurs whenever a line of an image is processed and the processing moves to the next line.
The key idea behind Chen and Baer's correlated reference prediction scheme is to keep track of adjacent accesses in inner loops (as in the basic and lookahead schemes), as well as accesses correlated with changes in the loop level. They assume that loop branches are backward branches. Thus, a not-taken branch triggers the correlated scheme to use the higher level of strides, since a not-taken loop branch implies the end of the loop.
Their implementation of the correlated scheme is restricted to only two levels of nested loops. The additional hardware, compared with the lookahead scheme, is:
- A 2-bit shift register that stores the branch history of the last two loop branches
- An additional (prev_addr, stride) pair in each entry of the RPT.
The use of branch history depends on some support from the compiler, which is supposed to flag loop branches to distinguish them from other branches (like if-then-else branches inside the loop). Any RPT entry using the correlated scheme contains the following fields (a C-style declaration of such an entry is sketched after the list):
- tag: load/store Instruction address
- p_addr0, stride0: Previous address and stride used when the inner loop branch is not taken
- p_addr1, stride1: Previous address and stride for the normal instruction reference inside the inner loop
- times: The number of iterations (of the inner loop) the LA_PC is ahead of the actual PC
- state: Same as in figure 1
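Put together, an entry for the correlated scheme could be declared as in the sketch below; the field names mirror the list above, but the declaration itself is our own illustration rather than the layout used in [3].

    /* Sketch of one RPT entry for the correlated scheme (two levels of strides). */
    struct corr_rpt_entry {
        unsigned long tag;        /* load/store instruction address                         */
        unsigned long p_addr0;    /* previous address used when the inner-loop branch is    */
        long          stride0;    /*   not taken (outer-loop level)                         */
        unsigned long p_addr1;    /* previous address and stride for the normal references  */
        long          stride1;    /*   inside the inner loop                                */
        int           times;      /* inner-loop iterations LA_PC is ahead of the actual PC  */
        int           state;      /* same state machine as in figure 1                      */
    };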
The results presented by Chen and Baer [3] show a slight improvement for the correlated scheme over the lookahead scheme, and it was our intuition that the improvement would be more significant for image processing and multimedia applications.
4. MODIFIED CORRELATED PREFETCHING SCHEME
There are two main disadvantages to the original correlated scheme. First, the compiler support that Chen and Baer assumed in obtaining their results means the scheme cannot be implemented solely in hardware. Second, the hardware implementation depends on whether the loops are compiled with forward or backward branches. Both of these disadvantages greatly limit the applicability of the scheme as a purely hardware-based prefetching scheme.