OpenSPARC Primer 3

The L2 cache of the OpenSPARC T2 processor is pictured below. A recommended example project is expanding or reducing the L2 cache. Another suggestion is to incorporate the T2 L2 cache into the T1 for performance analysis.

Note: The section below is taken from Appendix B of the OpenSPARC Internals book and describes the OpenSPARC T1 L2 cache in detail.

-----------------------------------------------Begin Quotation-------------------------------------------------------

B.2 L2 Cache

The OpenSPARC T1 processor L2 cache is 3 Mbytes in size and is composed of four symmetrical banks that are interleaved on a 64-byte boundary. Each bank operates independently of the others. The banks are 12-way set associative and 768 Kbytes in size. The block (line) size is 64 bytes, and each L2 cache bank has 1024 sets.
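The geometry above can be checked with a little arithmetic. The following Python sketch derives the set count from the bank size, associativity, and line size, and then splits a physical address into offset, bank, index, and tag fields. The 40-bit physical address width and the exact field positions (bank select directly above the 64-byte line offset) are assumptions made for illustration. They are consistent with the 64-byte interleave and with the 22-bit tag quoted later in this section, but the text does not spell out the layout.

```python
# Figures taken from the text above.
BANKS = 4
WAYS = 12
LINE_BYTES = 64
BANK_BYTES = 768 * 1024

# Assumption: 40-bit physical address (not stated in this section).
PA_BITS = 40

SETS = BANK_BYTES // (WAYS * LINE_BYTES)      # sets per bank

OFFSET_BITS = LINE_BYTES.bit_length() - 1     # log2(64)
BANK_BITS = BANKS.bit_length() - 1            # log2(4)
INDEX_BITS = SETS.bit_length() - 1            # log2(sets per bank)
TAG_BITS = PA_BITS - OFFSET_BITS - BANK_BITS - INDEX_BITS


def split_address(pa):
    """Split a physical address into (tag, index, bank, offset).

    Assumed layout, low to high: line offset, bank select, set index, tag.
    """
    offset = pa & (LINE_BYTES - 1)
    bank = (pa >> OFFSET_BITS) & (BANKS - 1)
    index = (pa >> (OFFSET_BITS + BANK_BITS)) & (SETS - 1)
    tag = pa >> (OFFSET_BITS + BANK_BITS + INDEX_BITS)
    return tag, index, bank, offset


if __name__ == "__main__":
    print("sets per bank:", SETS)
    print("tag bits:", TAG_BITS)
    print("total bytes:", BANKS * BANK_BYTES)
    print(split_address(0x40))   # consecutive 64-byte lines land in consecutive banks
```

With this layout, consecutive 64-byte lines rotate through the four banks, which is what interleaving on a 64-byte boundary means in practice.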

The L2 cache accepts requests from the SPARC CPU cores on the processor-to-cache crossbar (PCX), responds on the cache-to-processor crossbar (CPX), and maintains on-chip coherency across all L1 caches on the chip by keeping a copy of all L1 tags in a directory structure. Since the OpenSPARC T1 processor implements a system-on-a-chip with a single memory interface and no L3 cache, no off-chip coherency is required for the OpenSPARC T1 L2 cache other than that the L2 cache must be coherent with main memory.

Each L2 cache bank consists of three main subblocks:

• sctag (secondary cache tag) — Contains the tag array, VUAD array, L2 cache directory, and the cache controller

• scbuf — Contains the write-back buffer (WBB), fill buffer (FB), and DMA buffer

• scdata — Contains the scdata array

Coherency and ordering in the L2 cache are described as follows:

• Loads update directory and fill the L1 cache on return.

• Stores are nonallocating in the L1 cache.

• There are two flavors of stores: total store order (TSO) and relaxed memory order (RMO).

Only one outstanding TSO store to the L2 cache per thread is permitted in order to preserve the store ordering. There is no such limitation on RMO stores.

• No tag check is done at a store buffer insert.

• Stores check the directory and determine an L1 cache hit.

• The directory sends store acknowledgments or invalidations to the SPARC core.

• Store updates happen to D$ on a store acknowledge.

• The crossbar orders the responses across cache banks.
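The store-ordering rule in the list above (at most one outstanding TSO store per thread, no such limit for RMO stores) can be illustrated with a small model. This is a minimal sketch, not the actual store buffer logic: the class name `ThreadStoreIssue`, its methods, and the in-order FIFO issue discipline are all hypothetical and chosen only to make the rule concrete.

```python
from collections import deque


class ThreadStoreIssue:
    """Hypothetical per-thread model of store issue to the L2 cache."""

    def __init__(self):
        self.pending = deque()        # stores waiting to issue, in program order
        self.tso_outstanding = False  # True while a TSO store awaits its L2 ack

    def enqueue(self, store_id, kind):
        if kind not in ("TSO", "RMO"):
            raise ValueError("kind must be 'TSO' or 'RMO'")
        self.pending.append((store_id, kind))

    def issue(self):
        """Issue stores from the head of the queue until a TSO store is blocked."""
        issued = []
        while self.pending:
            store_id, kind = self.pending[0]
            if kind == "TSO":
                if self.tso_outstanding:
                    break             # only one TSO store may be in flight
                self.tso_outstanding = True
            self.pending.popleft()
            issued.append(store_id)
        return issued

    def acknowledge_tso(self):
        """Model the store acknowledgment returning from the L2 cache."""
        self.tso_outstanding = False


if __name__ == "__main__":
    t = ThreadStoreIssue()
    t.enqueue("s0", "TSO")
    t.enqueue("s1", "TSO")
    t.enqueue("s2", "RMO")
    print(t.issue())      # s1 must wait for the ack of s0
    t.acknowledge_tso()
    print(t.issue())
```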

B.2.1 L2 Cache Single Bank

The L2 cache is organized into four identical banks. Each bank has its own interface with the J-Bus, the DRAM controller, and the CCX. The L2 cache consists of the following blocks:

• Arbiter — Manages the access to the L2 cache pipeline from the various sources that request access. Gets input from the following:

• Instructions from the CCX and from the bypass path for input queue (IQ)

• DMA instructions from the snoop input queue

• Instructions for recycle from the fill buffer and the miss buffer

• Stall signals from the pipeline

• L2 tag — Contains the sctag array and the associated control logic. Each 22-bit tag is protected by 6 bits of SEC ECC. The state of each line is maintained by valid (v), used (u), allocated (a), and dirty (d) bits. These bits are stored in the L2 VUAD array.

• L2 VUAD array — Organizes the four state bits for sctags in a dual-ported array structure:

• Valid (v) – Set when a new line is installed in a valid way; reset when that line is invalidated.

• Used (u) – Is a reference bit used in the replacement algorithm. Set when there are any store/load hits (1 per way); cleared when there are no unused or unallocated entries for that set.

• Allocated (a) – Set when a line is picked for replacement. For a load or an ifetch, cleared when a fill happens; for a store, when the store completes.

• Dirty (d) – (per way) Set when a store modifies the line; cleared when the line is invalidated.

• L2 data (scdata) — Is a single-ported SRAM structure. Each L2 cache bank is 768 Kbytes in size, with each logical line 64 bytes in size. The bank allows read access of 16 bytes and 64 bytes, and each cache line has 16 byte-enables to allow writing into each 4-byte part. A fill updates all 64 bytes at a time.

Each 32-bit word is protected by seven bits of SEC/DED ECC. (Each line is 32 × [32 + 7 ECC] = 1248 bits). All subword accesses require a read-modify-write operation to be performed, referred to as partial stores.

• Input queue (IQ, a 16-entry FIFO) — Queues the packets arriving on the PCX when they cannot be immediately accepted into the L2 cache pipe. Each entry in the IQ is 130 bits wide. The FIFO is implemented with a dual-ported array. The write port is used for writing into the IQ from the PCX interface. The read port is for reading contents for issue into the L2 cache pipeline.

• Output queue (OQ, a 16-entry FIFO) — Queues operations waiting for access to the CPX. Each entry in the OQ is 146 bits wide. The FIFO is implemented with a dual-ported array. The write port is used for writing into the OQ from the L2 cache pipe. The read port is used for reading contents for issue to the CPX.

• Snoop input queue (SNPIQ, a 2-entry FIFO) — Stores DMA instructions coming from the JBI. The non-data portion (the address) is stored in the SNPIQ. For a partial line write (WR8), both the control and the store data are stored in the SNPIQ.

• Miss buffer (MB) — Stores instructions that cannot be processed as a simple cache hit; includes the following:

• True L2 cache misses (no tag match)

• Instructions that have the same cache line address as a previous miss or an entry in the write-back buffer

• Instructions requiring multiple passes through the L2 cache pipeline (atomics and partial stores)

• Unallocated L2 cache misses

• Accesses causing tag ECC errors

A read request is issued to the DRAM and the requesting instruction is replayed when the critical quadword of data arrives from the DRAM. All entries in the miss buffer that share the same cache line address are linked in the order of insertion to preserve coherency. Instructions to the same address are processed in age order, whereas instructions to different addresses are not ordered and exist as a free list.
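The ordering rule for the miss buffer (same-line entries replay in insertion order, different lines are unordered) can be sketched as a map from cache line address to a queue. This is an illustrative model only: the class `MissBuffer` and its methods are hypothetical, and the real structure is a fixed-size buffer with linked entries rather than a dictionary.

```python
from collections import defaultdict, deque

LINE_BYTES = 64  # L2 line size from the text


class MissBuffer:
    """Hypothetical model: one insertion-ordered chain per cache line address."""

    def __init__(self):
        self.chains = defaultdict(deque)

    def insert(self, instr, addr):
        """Record an instruction; return True if it is the first for its line.

        For a true miss, the first entry is the one that would trigger
        the read request to the DRAM.
        """
        line = addr // LINE_BYTES
        first = line not in self.chains
        self.chains[line].append(instr)
        return first

    def data_arrived(self, addr):
        """Return the instructions to replay for this line, oldest first."""
        line = addr // LINE_BYTES
        return list(self.chains.pop(line, ()))


if __name__ == "__main__":
    mb = MissBuffer()
    mb.insert("load A", 0x1000)
    mb.insert("store A", 0x1008)   # same 64-byte line as the load
    mb.insert("load B", 0x2000)    # different line, no ordering against A
    print(mb.data_arrived(0x1000))
```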

• Fill buffer (FB) — Temporarily stores data arriving from the DRAM on an L2 cache miss request. The fill buffer is divided into a RAM portion, which stores the data returned from the DRAM waiting for a fill to the cache, and a CAM portion, which contains the address. The fill buffer has a read interface with the DRAM controller.

• Write-back buffer (WBB) — Stores the 64-byte evicted dirty data line from the L2 cache. The WBB is divided into a RAM portion, which stores evicted data until it can be written to the DRAM, and a CAM portion, which contains the address. The WBB has a 64-byte read interface with the scdata array and a 64-bit write interface with the DRAM controller. The WBB reads from the scdata array faster than it can flush data out to the DRAM controller.

• Remote DMA write buffer (RDMA–4-entry buffer) — Accommodates the cache line for a 64-byte DMA write. The output interface is with the DRAM controller, which it shares with the WBB. The WBB has a direct input interface with the JBI.

• L2 cache directory — Participates in coherency management and maintains the inclusive property of the L2 cache. Also ensures that the same line is not resident in both the I-cache and the D-cache (across all CPUs).

Each L2 cache directory has 2048 entries, with one entry per L1 tag that maps to a particular L2 cache bank. Half the entries correspond to the L1 instruction cache (I-cache), and the other half of the entries correspond to the L1 data cache (D-cache).

The L2 cache directory is split into two directories, which are similar in size and functionality: an I-cache directory (icdir) and a D-cache directory (dcdir).
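The VUAD state bits described earlier in this list suggest a not-recently-used style of replacement. The following sketch models one 12-way set under that reading: a hit sets the used bit, a victim is chosen among ways that are neither used nor allocated, and the used bits are cleared when no such way remains. Scanning from way 0 and the class name `SetState` are assumptions for illustration; the text does not say where the hardware starts its search.

```python
WAYS = 12  # associativity of each L2 bank, from the text


class SetState:
    """Hypothetical model of the VUAD bits for one cache set."""

    def __init__(self):
        self.valid = [False] * WAYS
        self.used = [False] * WAYS
        self.alloc = [False] * WAYS
        self.dirty = [False] * WAYS

    def hit(self, way, is_store=False):
        """A load or store hit marks the way as used; a store also dirties it."""
        self.used[way] = True
        if is_store:
            self.dirty[way] = True

    def _candidates(self):
        return [w for w in range(WAYS)
                if not self.used[w] and not self.alloc[w]]

    def pick_victim(self):
        """Pick a way for replacement and mark it allocated.

        Returns None if every way is already allocated to a pending fill.
        """
        ways = self._candidates()
        if not ways:
            # No unused, unallocated way is left: clear the used bits and retry.
            self.used = [False] * WAYS
            ways = self._candidates()
        if not ways:
            return None
        victim = ways[0]          # assumption: lowest-numbered candidate wins
        self.alloc[victim] = True
        return victim

    def fill(self, way):
        """A fill installs the new line and releases the allocation."""
        self.valid[way] = True
        self.alloc[way] = False
        self.dirty[way] = False


if __name__ == "__main__":
    s = SetState()
    for w in range(WAYS):
        s.hit(w)
    print(s.pick_victim(), s.pick_victim())
```

The allocated bit keeps a way that is waiting for its fill from being chosen a second time, which is why the second call in the usage example returns a different way.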

B.2.2 L2 Cache Instructions

The following instructions follow a skewed, rather than simple, pipeline.

• A load instruction to the L2 cache is caused by any one of the following conditions:

• A miss in the L1 cache (the primary cache) by a load, prefetch, block load, or a quad load instruction

• A streaming load issued by the stream processing unit (SPU)

• A forward request read issued by the IOB

The output of the scdata array, returned by the load, is 16 bytes in size. This size is the same as the size of the L1 data cache line. An entry is created in the D-cache directory. Any matching I-cache directory entry is invalidated for the L1 cache of every CPU in which it exists.

From an L2 cache perspective, a block load is the same as eight load requests. A quad load is the same as four load requests.

• An ifetch instruction is issued to the L2 cache in response to an instruction missing the L1 I-cache. The I-cache line size is 256 bits. The L2 cache returns the 256 bits of data to the requesting CPU in two packets over two cycles on the CPX. The two packets are returned atomically. The L2 cache then creates an entry in the I-cache directory and invalidates any existing entry in the D-cache directory.

• A store instruction to the L2 cache is caused by any of the following conditions:

• A miss in the L1 cache by a store, block store, or a block init store instruction

• A streaming store issued by the stream processing unit (SPU)

• A forward request write issued by the IOB

The store instruction writes data into the scdata array at a granularity of 32 bits. An acknowledgment packet is sent to the CPU that issued the request, and an invalidate packet is sent to all other CPUs. The I-cache directory entry for every CPU is cammed and invalidated. The D-cache directory entry of every CPU, except the requesting CPU, is cammed and invalidated.

Partial stores (PSTs) perform sub-32-bit writes into the scdata array. A partial store is executed as a read-modify-write operation. In the first step, the cache line is read and merged with the write data. It is then saved in the miss buffer. The cache line is written into the scdata array in the second pass of the instruction through the pipe.
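The two-pass read-modify-write sequence for a partial store can be sketched as follows. The model works on a single 32-bit word rather than a whole cache line, and the function names and the byte-enable encoding (bit i selects byte i of the word) are assumptions for illustration. ECC generation is omitted.

```python
WORD_BYTES = 4  # store granularity of the scdata array, from the text


def merge_partial(old_word, new_bytes, byte_enables):
    """Merge write data into an existing word.

    Bit i of byte_enables selects new_bytes[i]; otherwise old_word[i] is kept.
    """
    if len(old_word) != WORD_BYTES or len(new_bytes) != WORD_BYTES:
        raise ValueError("both operands must be exactly one word long")
    return bytes(
        n if (byte_enables >> i) & 1 else o
        for i, (o, n) in enumerate(zip(old_word, new_bytes))
    )


def partial_store(scdata, addr, new_bytes, byte_enables):
    """Model a partial store as two passes over a bytearray standing in for scdata."""
    base = addr & ~(WORD_BYTES - 1)          # align down to the containing word

    # Pass 1: read the current word and merge in the write data.
    # The merged value stands in for the copy saved in the miss buffer.
    merged = merge_partial(bytes(scdata[base:base + WORD_BYTES]),
                           new_bytes, byte_enables)

    # Pass 2: write the full merged word back into the array.
    scdata[base:base + WORD_BYTES] = merged
    return merged


if __name__ == "__main__":
    line = bytearray(b"\x11\x22\x33\x44" + bytes(60))
    partial_store(line, 1, b"\xaa\xbb\xcc\xdd", 0b0010)   # update byte 1 only
    print(line[:4].hex())
```

Writing back a full word in the second pass means the array only ever sees whole 32-bit updates, which matches the per-word ECC protection described for scdata.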

• Three types of atomic instructions are processed by the L2 cache: load-store unsigned byte (LDSTUB), SWAP, and compare and swap (CAS). These instructions require two passes down the L2 cache pipeline.

• The following I/O instructions from the J-Bus interface (JBI) are processed by the L2 cache:

• RD64 (block read) — Goes through the L2 cache pipe like a regular load from the CPU. On a hit, 64 bytes of data are returned to the JBI. On a miss, the L2 cache does not allocate but sends a non-allocating read to the DRAM. It gets 64 bytes of data from the DRAM and sends the data directly to the JBI (read-once data only) without installing it in the L2 cache. The CTAG (the instruction identifier) and the 64-byte data are returned to the JBI on a 32-bit interface.

• WRI (write invalidate) — Accepts a 64-byte write request and looks up tags as it goes through the pipe. Upon a tag hit, invalidates the entry and all primary cache entries that match. Upon a tag miss, does nothing (it just continues down the pipe) to maintain the order. The CTAG is returned to the JBI when the processor sends an acknowledgment to the cache line invalidation request sent over the CPX. After the instruction is retired from the pipe, 64 bytes of data are written to the DRAM.

• WR8 (partial line write) — Supports the writing of any subset of eight contiguous bytes to the scdata array by the JBI. Does a two-pass partial store if an odd number of byte enables are active or if there is a misaligned access; otherwise, does a regular store. On a miss, data is written into the scdata cache and a directory entry is not created. The CTAG is returned to the JBI when the processor sends an acknowledgment to the cache line invalidation request sent over the CPX.

• Eviction — Sends a request to the DRAM controller to bring the cache line from the main memory after a load or a store instruction miss in the L2 cache.

• Fill — Issued following an eviction after an L2 cache store or load miss. The 64-byte data arrives from the DRAM controller and is stored in the fill buffer. Data is read from the fill buffer and written into the L2 cache scdata array.

• L1 cache invalidation — Invalidates the four primary cache entries as well as the four L2 cache directory entries corresponding to each primary cache tag entry. Is issued whenever the CPU detects a parity error in the tags of I-cache or D-cache.

• Interrupts — When a thread wants to send an interrupt to another thread, that interrupt is sent through the L2 cache. The L2 cache treats the thread like a bypass. After a decode, the L2 cache sends the instruction to the destination CPU if it is an interrupt.

• Flush — The OpenSPARC T1 processor requires the FLUSH instruction. Whenever self-modifying code is executed, the first instruction at the end of the self-modifying sequence should come from a new stream. An interrupt with br = 1 is broadcast to all CPUs. (Such an interrupt is issued by a CPU in response to a FLUSH instruction.) A flush stays in the output queue until all eight receiving queues are available. This is a total store order (TSO) requirement.

B.2.3 L2 Cache Pipeline

L2 cache pipeline — The L2 cache processes three main types of instructions:

• Requests from a CPU by way of the PCX. Instructions include load, streaming load, ifetch, prefetch, store, streaming store, block store, block init store, atomics, interrupt, and flush.

• Requests from the I/O by way of the JBI. Instructions include block read (RD64), write invalidate (WRI), and partial line write (WR8).