VLSI Architecture


CONTENTS

I. Introduction
II. Related Work
·  Dynamically Resizable Instruction Cache
·  Cache Decay
·  Partitioned Cache Architecture
·  Selective Cache Ways
III. Time Based Leakage Control in Partitioned Cache Architecture
·  Overview
·  Block Diagram
·  Implementation
·  Placement Strategies
·  Prediction Strategies
·  Deciding Cache Decay Interval
IV. Conclusion

REFERENCES


Section I: INTRODUCTION

Advances in semiconductor technology have steadily increased the density of transistors per chip. The amount of information storable on a given amount of silicon has roughly doubled every year since the technology was invented. Processor performance improved accordingly, and chip energy dissipation rose with each processor generation, creating a need for low power circuit design. Low power is important in portable devices because the weight and size of the device are determined by the battery capacity required, which in turn depends on the power dissipated in the circuit. The cost of supplying power and the associated cooling, reliability issues, and expensive packaging have made low power a concern in nonportable applications such as desktops and servers too. Although most power dissipation in CMOS CPUs is dynamic power dissipation (a function of the operating frequency of the device and the switching capacitance), leakage power (a function of the number of on-chip transistors) is becoming increasingly significant, since leakage current flows in every transistor that is powered on, irrespective of signal transitions. Most of the leakage energy comes from memories: the cache occupies much of a CPU chip's area and contains a large number of transistors, so reducing leakage in the cache yields a significant reduction in the overall leakage energy of the processor. This paper suggests an architectural approach for reducing leakage energy in caches.

Various approaches have been suggested at both the architecture and circuit level to reduce leakage energy. One approach counts the total number of misses in a cache and upsizes or downsizes the cache depending on whether the miss count is greater or less than a preset value; the cache dynamically resizes to the application's required size and the unused sections of the cache are shut off. Another method, called cache decay, turns off cache lines when they hold data that is not likely to be reused. The cache lines are shut off during their dead time, that is, the time after the last access and before the eviction: if a specific number of cycles has elapsed and the data is still unused, that cache line is shut off. A third approach, called selective cache ways, disables a portion of the cache ways. This application-sensitive method enables all the cache ways (a way is one of the n sections in an n-way set-associative cache) when high performance is required and enables only a subset of the ways when cache demands are not high.

This paper is organized as follows: Section II narrates the work done related to this problem, Section III describes our approach, which uses a time based decay policy in a partitioned architecture of the level 2 cache, and finally Section IV presents the conclusion.

Section II: Related Work

Dynamically resizable instruction cache:

This method exploits the fact that cache utilization varies depending on the application requirements. By shutting off the portion of the cache that is unused, leakage energy can be reduced significantly. It uses a dynamically resizable I-cache architecture, which resizes in accordance with the application requirements and uses a circuit-level technique called gated-Vdd to turn off unused portions of the cache. The number of misses is counted periodically (say, every 1 million instructions), and the cache size is increased or decreased depending on whether the count is more or less than a preset value. The cache is also prevented from thrashing by fixing a minimum size below which the cache cannot be decreased.
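The periodic resizing decision described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the threshold, interval, and size bounds are assumptions chosen only to show the decision logic.

```python
# Sketch of the miss-driven resizing decision of a dynamically resizable
# I-cache. All names and constants below are illustrative assumptions.

MISS_THRESHOLD = 1000   # preset miss-count bound per interval (assumed)
MIN_SIZE_KB = 8         # minimum size that prevents thrashing (assumed)
MAX_SIZE_KB = 64        # full cache size, as in the 64 KB example

def resize(current_size_kb: int, misses_this_interval: int) -> int:
    """Return the new cache size after one monitoring interval.

    More misses than the preset value -> upsize; fewer misses ->
    downsize, but never below the fixed minimum size.
    """
    if misses_this_interval > MISS_THRESHOLD:
        return min(current_size_kb * 2, MAX_SIZE_KB)
    return max(current_size_kb // 2, MIN_SIZE_KB)
```

The minimum-size clamp is what prevents the thrashing mentioned above: a downsize request can never shrink the cache past the floor.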

Merits:

·  Reduces the average size of a 64 KB cache by 62%, lowering leakage energy, while the performance degradation stays within 4%.

·  By employing a wide NMOS dual-Vt gated-Vdd implementation, leakage is virtually eliminated (connecting the gated-Vdd transistor in series with the SRAM cell exploits the stacking effect) with only a 5% area increase.

·  By controlling the miss rate with reference to a preset value, the performance degradation and the increase in lower cache levels' energy dissipation (due to misses in the L1 cache) are kept low.

·  The dynamic energy of the counter hardware is small, as the average number of bits switching on a counter increment is less than two (the ith bit of a counter switches only once every 2^i increments).

Demerits:

·  Resizing affects the miss rate; a miss in the L1 cache leads to dynamic energy dissipation in the L2 cache, so the number of accesses to the L2 cache should be kept low.

·  There is extra L1 dynamic energy due to the resizing bits.

·  Resizing circuitry may increase energy dissipation, offsetting the gains from cache resizing, so the resizing frequency should be low.

·  A longer resizing interval may span multiple application phases, reducing the opportunity for resizing, while a shorter resizing interval may increase overhead.

·  Resizing from one size to another modifies the set-mapping function for blocks and may result in an incorrect lookup.

·  For an application that requires a small I-cache, the dynamic component will be large due to the large number of resizing tag bits.

·  The gated-Vdd transistor must be large enough to sink the current flowing through the SRAM cells during read/write operations, but too large a gated-Vdd transistor reduces the stacking effect and increases the area overhead.

Cache decay:

In this technique, cache lines that hold data not likely to be reused are turned off. It exploits the fact that cache lines are used frequently when data is first brought in, after which there is a period of dead time before the data is evicted. By turning off cache lines during their dead time, leakage energy can be reduced significantly without incurring additional misses, so performance is comparable to a conventional cache. The policy used here is a time-based policy that turns a cache line off if a preset number of cycles has elapsed since its last access.
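The time-based policy above can be sketched in a few lines: a line records the cycle of its last access, and a periodic check powers it down once a full decay interval has passed with no access. The class and field names are illustrative assumptions, not the paper's hardware.

```python
# Minimal sketch of the time-based decay policy: a cache line is turned
# off once a preset number of cycles has elapsed since its last access.

class DecayLine:
    def __init__(self, decay_interval: int):
        self.decay_interval = decay_interval  # cycles of allowed idleness
        self.last_access = 0
        self.powered_on = False

    def access(self, cycle: int) -> None:
        # Any hit resets the idle clock and (re)enables the line.
        self.last_access = cycle
        self.powered_on = True

    def tick(self, cycle: int) -> None:
        # Shut the line off once it has been idle for a full decay interval.
        if self.powered_on and cycle - self.last_access >= self.decay_interval:
            self.powered_on = False
```

Because the line is only disabled after a long idle period, a well-chosen decay interval shuts lines off mostly during their dead time, which is why the policy adds few extra misses.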

Fig A

As seen in Fig. A, the access interval is the time between two hits, and the dead time is the time between the last hit and the time at which the data is evicted.

Fig B

As seen in Fig. B, the dead time for most of the benchmarks is high.

Merits:

·  A 70% reduction in L1 data cache leakage energy is achieved.

·  Program performance or dynamic power dissipation will not be affected much as the cache line is turned off only during its dead time.

·  Results show that dead times are long, thus moderately easy to identify.

·  Very successful if application has poor reuse of data like streaming applications.

·  Can be applied to outer levels of cache hierarchy (as outer levels are likely to have longer generations with larger dead time intervals)

Demerits:

·  There might be additional L1 misses if a cache line is shut off too early.

·  Shorter decay intervals (time after which cache line is shut off) reduce leakage energy but may increase miss rate, leading to dynamic energy dissipation in lower level memory.

Partitioned Cache Architecture:

This method partitions the cache into smaller units (subcaches), each of which acts as a cache on its own, and selectively disables unused components. Since the partition is at the architecture level, the data placement and data probing mechanisms are more sophisticated than those at the circuit level. The topologies of the subcaches may differ. The cache predictor tells the cache controller which subcaches should be activated. The effectiveness of the probing strategy determines the number of subcaches accessed per data reference, and thus the energy consumed.
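The relationship between prediction quality and energy can be sketched as follows: a predictor names one subcache to probe first, a wrong guess forces reprobes, and the number of subcaches touched per reference serves as the energy proxy the text describes. Everything here (representing subcaches as sets of addresses, the probe order) is an illustrative assumption.

```python
# Sketch of probing in a partitioned cache: fewer subcaches touched per
# data reference means less energy; mispredictions cause reprobes.

def probe(subcaches, predictor, addr):
    """Return (hit, subcaches_touched) for one data reference.

    `subcaches` is a list of sets of resident addresses (illustrative);
    `predictor(addr)` names the subcache to activate first.
    """
    first = predictor(addr)  # predicted subcache id
    order = [first] + [i for i in range(len(subcaches)) if i != first]
    touched = 0
    for sid in order:        # reprobe the rest only on a misprediction
        touched += 1
        if addr in subcaches[sid]:
            return True, touched
    return False, touched    # miss: every subcache was probed
```

With a perfect predictor every hit touches exactly one subcache; a poor predictor drives `touched` toward the total number of subcaches, which is the reprobing penalty listed under the demerits.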

Merits:

·  Reduces per access energy costs

·  Improves locality behavior

·  Smaller components that consume less energy

·  The subcaches can all be identical or have different topologies

·  Both performance and energy can be optimized

·  Breaking up into sub caches or sub banks reduces wiring and diffusion capacitances of bit lines and wiring and gate capacitances of word lines. Thus dynamic energy consumption when accessing the cache will be less.


Demerits:

·  If a large number of cycles is spent servicing a memory request because of a poor probing strategy, performance is degraded.

·  Performance depends on the effectiveness of the probing policy; if the probing policy is poor, there will be a reprobing penalty.

·  Energy depends on the number of subcaches accessed per data reference.

Selective cache ways:

This method exploits the subarray partitioning that is usually already present: it enables all the cache ways when required for high performance, but only a subset of the cache ways when cache demands are not high. Since only a subset of the cache ways is active, leakage energy can be reduced significantly. This strategy exploits the fact that cache requirements vary considerably between applications as well as within an application. A software visible register called the cache way select register (CWSR) signals the hardware to enable or disable particular ways, and special instructions are provided for writing and reading the CWSR. Software also plays a role in analyzing application cache requirements, enabling cache ways, and saving the CWSR; thus this is a combination of hardware and software elements. The degree to which ways are disabled depends on the relative energy dissipation of the different memory hierarchy levels and how they are affected by disabling ways.
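The CWSR can be modeled as a simple bit mask, with bit i enabling way i. A minimal sketch, assuming a 4-way set-associative cache; the helper names are hypothetical, and in the real scheme the register is manipulated through special instructions rather than function calls.

```python
# Sketch of the cache way select register (CWSR) as a bit mask:
# bit i set -> way i enabled. N_WAYS is an assumed 4-way configuration.

N_WAYS = 4

def enabled_ways(cwsr: int) -> list:
    """Ways the hardware may probe under the current CWSR setting."""
    return [w for w in range(N_WAYS) if cwsr & (1 << w)]

def set_low_power(cwsr: int, keep: int) -> int:
    """Disable all but the lowest `keep` ways when cache demand is low."""
    return cwsr & ((1 << keep) - 1)
```

For example, moving a 4-way cache to a low-demand phase with `keep=2` leaves only ways 0 and 1 active, so the other two ways stop dissipating leakage.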

Section III: Time Based Leakage Control in Partitioned Cache Architecture

Overview:

The level 2 cache is larger than the level 1 cache, so it dissipates more leakage energy; reducing leakage energy in the L2 cache therefore reduces the overall leakage energy to a great extent. This paper combines two existing strategies, partitioning and time based cache decay, to reduce leakage power in the level 2 cache. The level 2 cache is partitioned into smaller units, each of which is a cache by itself, called a subcache. Methods have been proposed for partitioning the cache structure, shutting off parts of cache ways during their dead time, and partitioning the subarrays of a cache structure. In this paper we suggest partitioning the cache structure into small caches called subcaches and implementing cache decay (shutting off portions of the cache ways) in each subcache. This can reduce the leakage energy significantly. The subcache architecture enjoys the following benefits:

·  Reduces per access energy costs

·  Improves locality behavior

·  Smaller components that consume less energy

·  Both performance and energy can be optimized

·  Breaking up into subcaches or subbanks reduces wiring and diffusion capacitances of bit lines and wiring and gate capacitances of word lines. Thus dynamic energy consumption when accessing the cache will be less.

This architecture selectively disables unused subcaches and activates the one holding the data, so leakage energy can be reduced significantly. By applying the time based cache decay technique to each subcache, only part of the cache ways are enabled within a subcache, and the power wasted during dead times (when a cache way is idle) is avoided; thus this combination of partitioning and cache decay can reduce leakage energy more than either technique applied alone. Cache decay is an appropriate technique to use within a subbank for the following reasons:

·  Program performance is not affected much, as a cache line is turned off only during its dead time.

·  Time based cache decay works well when the reuse of data is poor; reuse of data in the L2 cache is lower than in the L1 cache, so it is appropriate to apply this technique to the L2 cache.

·  Outer levels of hierarchy are likely to have longer generations with larger dead time intervals, which is what is required for this time based cache decay technique.

·  The fraction of time a cache way is dead increases with higher miss rates, as the lines spend more of their time about to be evicted.

Implementation:

The block diagram of the hardware implementation is shown in Fig. C. The level 2 cache is divided into smaller units, each acting as a cache by itself; these are called SUBBANKS or SUBCACHES. The subcache that needs to be activated is decided by a logic block called the CACHE PREDICTOR. This operation is performed concurrently with the table lookup operation in order to avoid delay in the critical path. The output of the cache predictor is either a subcache id or a no-prediction. If the output is a no-prediction, a logic block called the DEFAULT PREDICTOR is used to select the cache for activation. Based on the cache predictor output, the CACHE CONTROLLER activates the appropriate subcache. The check is made only against the cache ways that are active within the subcache; not all cache ways are enabled within the subcache. Disabling the cache ways within a subcache is done by means of a time based decay policy (Fig. D).

The time based decay policy is implemented in each subcache. Each cache line within a subcache is connected to a 2-bit local counter, which increments its value on receiving a tick pulse from a global counter. The two inputs the local counter receives are the global tick signal T and the cache line access signal WRD. When the 2-bit counter reaches its maximum value, the decay interval, which is the time allowed before the line is shut off, has elapsed (it is found that for L2 caches the decay interval should be in the range of tens of thousands of cycles). On every access to the cache line, the 2-bit counter is reset to its initial value. Once the counter saturates at its maximum value, the cache line is shut off using the gated-Vdd technique: the gated-Vdd transistor connected in series with the SRAM cell is turned off, disabling that cache way. Thus the cache ways that are idle are disabled.
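The hierarchical counter scheme above can be sketched in software as follows: a shared global counter supplies the coarse tick T, and each line's 2-bit local counter advances on T, resets on the access signal WRD, and shuts the line off when it saturates. This is a behavioral sketch under the assumptions stated in the comments, not the hardware design itself.

```python
# Behavioral sketch of the per-line 2-bit decay counter. The tick spacing
# (set by the global counter) determines the decay interval: with ticks
# every N cycles, a line is shut off after roughly 3*N idle cycles.

class LocalCounter:
    MAX = 3  # a 2-bit counter saturates at 3

    def __init__(self):
        self.value = 0
        self.line_on = True

    def on_access(self):
        # WRD signal: any access to the line resets the counter
        # and keeps (or brings) the line powered on.
        self.value = 0
        self.line_on = True

    def on_tick(self):
        # T signal: global tick pulse; an already-off line stays off.
        if not self.line_on:
            return
        self.value += 1
        if self.value >= self.MAX:
            # Saturated: the gated-Vdd transistor in series with the
            # SRAM cell is turned off, disabling this cache line.
            self.line_on = False
```

Because the local counter is only 2 bits and all lines share one global counter, the area and dynamic-energy overhead of tracking idleness stays small, while the coarse tick spacing yields the tens-of-thousands-of-cycles decay intervals the text recommends for L2.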