@head: Performance Improvements, Hazards, and Resolution Techniques in Pipelined ASICs

@deck: Adding pipeline stages to a microprocessor core increases performance, but there are a number of potential hazards that need to be overcome in order to achieve a robust and efficient pipelined design.

@text: The ever-increasing demand for high speed ASICs is driving the requirement to increase circuit throughput in terms of calculations per clock cycle. (For the purposes of this article, the term ASIC will be assumed to encompass microprocessor-type devices and SoC components.)

One way to increase the performance of an ASIC is to use pipelining, but this is at the expense of increases in system latency and area. Pipelining decreases combinational delays by inserting registers in long combinational paths, thereby allowing the system’s clock frequency to be increased, which results in higher overall performance.

Factors Affecting the Maximum Clock Frequency

There are a number of different factors that affect the maximum frequency of the clock in a pipelined system. First, consider an “ideal” path between two pipeline stages (Figure 1).

If we assume that the clock signal reaches both banks of registers simultaneously, that this is a perfect clock without any jitter, that the clock-to-output delay of register A (TCQA) is zero, and that the data setup and hold times associated with register B (TSB and THB, respectively) are zero, then the maximum frequency FMAX is the reciprocal of the maximum delay path through the combinational logic; that is:

(1) FMAX = 1/TPERIOD = 1/TCOMB

In the real world, however, the clock signal will arrive at the input to register B some time after it arrives at register A due to the propagation delay of the track. This is referred to as clock skew (TSKW). Negative clock skew occurs if the delay between the clocks of two adjacent registers is more than the combinational path between the two registers. This results in a race condition in which new data arrives before the old data can be successfully latched. When compounded across all of the clock nets in a complex digital product, these tiny differences in propagation delay often lead to unacceptable degradations in overall system timing margins.

Another real-world phenomenon is known as clock jitter (TJIT). This refers to the fact that there may be variations between the arrival times of consecutive clock edges as shown in Figure 2. In turn, this affects the duty cycle of the clock.

Last but not least, we need to account for the setup time associated with register B (TSB), which refers to the amount of time the data has to be present prior to the clock edge arriving at the register. Thus, the real-world equation for the maximum frequency is:

(2) FMAX = 1/TPERIOD = 1/(TCQA + TCOMB + TSB + TSKW + TJIT)

Thus, the concept of pipelining is that, by splitting the combinational logic into smaller pieces and adding more register stages, we can increase the maximum frequency at which the circuit can operate and thereby increase the total circuit throughput in terms of the number of calculations that can be performed each second.

Performance Increases from Pipelining

Now let us consider the performance increases that are possible due to pipelining. The latency of a system is the time between data arriving at the inputs to the pipeline and the results corresponding to this set of inputs becoming available at the outputs from the pipeline. Consider a circuit implemented using only a single pipeline stage as illustrated in Figure 3a (this may also be referred to as an unpipelined circuit). In this case, the latency of the system – which is equal to the clock period – is defined as shown in equation (3):

(3) TLATENCY(1-STAGE) = TCOMB + TREGISTER + TCLOCKING

where TREGISTER is the register overhead (TCQ + TSETUP) and TCLOCKING is the clocking overhead (TSKW + TJIT). Now consider what happens if the same circuit is pipelined into n stages. In this case, the period for any stage (TSTAGE) is derived as follows:

(4) TSTAGE = TCOMB(STAGE) + TREGISTER + TCLOCKING

Ideally, the combinational logic delays will be identical for each stage in the pipeline:

(5) TCOMB(STAGE) = TCOMB / n

In this case, the minimum possible clock period (TPIPELINE) for an n-stage pipeline is:

(6) TPIPELINE = TSTAGE = TCOMB / n + TREGISTER + TCLOCKING

And the latency of the n-stage pipeline as a whole is n times the clock period:

(7) TLATENCY(N-STAGE) = nTPIPELINE

Thus, assuming an ideal clock period, the final latency of an n-stage pipeline is:

(8) TLATENCY(N-STAGE) = TCOMB + n(TREGISTER + TCLOCKING)

This allows us to calculate the speed increases associated with pipelining as follows:

(9) Speed increase = FAFTER / FBEFORE

where FBEFORE and FAFTER refer to the system’s clock frequency before and after pipelining, respectively. Remembering that the frequency is the reciprocal of the clock period, the speed increase is therefore given by:

(10) = TBEFORE / TAFTER

where TBEFORE and TAFTER refer to the clock period before and after pipelining, respectively. From (3) we know that TBEFORE = TCOMB + TREGISTER + TCLOCKING. Similarly, from (6) we know that TAFTER = TCOMB / n + TREGISTER + TCLOCKING. This means that:

(11) = (TCOMB + TREGISTER + TCLOCKING) / (TCOMB / n + TREGISTER + TCLOCKING)

Now, if we specify the register and clocking overhead associated with an unpipelined circuit as being a fraction k of the total clock period for that circuit, then:

(12) k = (TREGISTER + TCLOCKING) / (TCOMB + TREGISTER + TCLOCKING)

Substituting the value of k in equation (11) gives:

(13) Speed increase = 1 / ( (1-k) / n + k )

Now, the throughput of a system can be defined as the number of calculations performed each clock cycle. This means that the performance increase provided by pipelining a system can be defined as:

(14) (Average calculation time per instruction before pipelining) / (Average calculation time per instruction after pipelining)

Let’s assume that the number of instructions per clock cycle T is IPC (“instructions per clock”). In this case, the average calculation time per instruction = T/IPC. Substituting this in equation (14) gives:

(15) Performance increase = (IPCAFTER x TBEFORE) / (IPCBEFORE x TAFTER)

Assuming that no additional micro-architectural features are implemented to improve the IPC after pipelining the circuit, this gives:

(16) = (IPCAFTER / IPCBEFORE) x (1 / ( (1-k) / n + k ) )

Now let us consider a practical example of the performance increase resulting from an increase in the number of pipeline stages. Increasing the number of pipeline stages from 10 in the Pentium III processor to 20 in the Pentium IV processor resulted in a 20% reduction in instructions per cycle (IPC). Assuming a constant 2% timing overhead as a fraction of the total unpipelined delay, the performance increase is calculated as:

(17) = (IPCAFTER / IPCBEFORE) x ( ((1-k) / nBEFORE + k) / ((1-k) / nAFTER + k) )

= 0.8 x ( ((1-0.02)/10 + 0.02) / ((1-0.02)/20 + 0.02) )

= 1.37

Thus, from this equation we see that – if implemented at the same technology node – the Pentium IV has only 37% better performance despite having twice the number of pipeline stages. (Note also that the 20% reduction in the IPC for the Pentium IV is due to branch misprediction, pipeline stalls, and other hazards caused by the higher degree of complex logic as compared to the Pentium III.)

Pipelining a DLX Microprocessor

The DLX is a theoretical 32-bit RISC microprocessor whose architecture is an emerging academic standard. Each DLX instruction consists of at most five elements: Instruction Fetch (IF), Instruction Decode (ID), Execution/Effective Address Cycle (EX), Memory Access (MEM), and Write Back (WB). First consider an unpipelined single-cycle DLX datapath as shown in Figure 4.

Now let’s modify the original unpipelined circuit by adding five pipeline stages, where each stage is associated with one of the main operations (Figure 5).

In the case of the original single-cycle DLX datapath shown in Figure 4, instructions cannot be executed in parallel, so each instruction can be processed only when the previous instruction has been completed. Thus, assuming that the single-cycle clock period were 10 ns as shown in Figure 6(a), executing five instructions one after the other would take 5 x 10 ns = 50 ns.

Next, consider the multi-cycle datapath shown in Figure 6(b). In this case it takes five clock cycles to complete the instruction, but the period of each cycle is only 2 ns. Now, if five instructions were executed sequentially as in the single-cycle datapath, this would take 5 x 5 x 2 = 50 ns (that’s five instructions each consuming five clock cycles where each clock cycle is 2 ns). However, if we take the same multi-cycle datapath and run it in a pipelined mode, it requires only nine clock cycles to execute five instructions as shown in Figure 6(c), which means that these five instructions complete in 9 x 2 ns = 18 ns. Thus, the performance increase due to pipelining = 50/18 = 2.8 (keeping latency constant for a single instruction).

It’s important to remember that there is an additional overhead with the pipelined implementation due to the clock skew and register delays. This overhead (which is not reflected in Figure 6 for simplicity) limits the amount of speedup that can be achieved.

The key principles associated with this form of pipelining are as follows:

o  All instructions that share a pipeline must have the same stages in the same order; for example, an “add” instruction will not do anything during the MEM stage.

o  All intermediate values must be latched in each cycle.

o  There should not be any functional block reuse.

o  All operations in one stage should complete in one cycle.

Pipelining Hazards and Solutions

In this context, the term “hazard” refers to a situation that interferes with the operation of the pipeline and prevents the next instruction from executing during its designated clock cycle. Thus, hazards tend to reduce the theoretical speedup that can be gained by pipelining. The three main types of hazard are structural, data, and control. As we shall see, the common solution to all of these hazard types is to insert one or more “bubbles” (gaps”) into the pipeline so as to stall the pipeline until the hazard is resolved.

Structural Hazards: In the case of a structural hazard, the hardware cannot support all possible combinations of instructions in simultaneous overlapping execution. The result is resource contention in which multiple instructions require access to the same resource. For example, consider the execution of a Load instruction using a single-port memory as shown in Figure 7.

In this case, two instructions try to access the same memory during the same clock cycle (one during the MEM stage and the other during the IF stage), hence a memory conflict occurs. A simple solution to this problem is to stall the pipeline for one clock cycle when the conflict occurs. This results in a pipeline bubble (Figure 8) with a corresponding increase in clock latency.

An alternative solution could be to keep the memories for the IF and MEM stages separate so as to avoid any possible conflict caused by simultaneous memory access. This solves the problem of this form of structural hazard, but at the expense of additional resources.

Data Hazards: In the case of a data hazard, the execution of the current instruction depends on the result from the previous instruction. For example, consider the following sequence of instructions:

ADD R1, R2, R3 (R1 = R1 + R2 + R3)

XOR R4, R5, R1

SUB R7, R1, R6

OR R9, R8, R1

As we see, all of the instructions use R1 following the first instruction (Figure 9). In this case, it is almost impossible for the instructions following the ADD to give correct results since they all require the result from the ADD (in the form of the output from its WB stage) during their ID stages.

The usual solution for this type of hazard is the concept of data/register forwarding. The way this works is that the selected data is not really used in ID stage as shown in Figure 9, but rather in the ALU (EX) stage as shown in Figure 10.

Data forwarding rules are as follows:

o  The ALU output from the EX/MEM buffer of the first instruction (the instruction whose output is supposed to be used in subsequent instructions) is always fed back to the ALU input of the next instruction.

o  If the forwarding hardware detects that its source operand has a new value, the logic selects the newer result as opposed to the value read from the register file (ID/EX buffer). Figure 11 shows the inclusion of a multiplexer to implement this feature.

o  Results need to be forwarded not only from the immediate previous instruction but also from any instruction that started three cycles before (shown as 1, 2, and 3 in Figure 10).

o  The result from an EX/MEM (1 cycle before) and a MEM/WB (2 cycles before) are forwarded to both ALU inputs (shown as 1 and 2, respectively, in Figure 10).

o  Reading from the register file is performed in the first half of the cycle and writing in the second half of the cycle (3 cycles before, shown as 3 in Figure 10)

Some types of data hazards prevent the pipelined circuit from giving correct results as follows:

o  Read After Write (RAW): This is the most common data hazard and is solved by data forwarding

o  Write After Write (WAW): Here, two instructions write to the same register, but one instruction does so before the other. The DLX avoids this by waiting for the WB stage to write to the registers, which means that there are no WAW hazards in the DLX.