Alternatives to ROBs

There are some problems with using a reorder buffer.

•When instruction execution times vary widely, a large number of results must be buffered.

•It requires complex bypassing to allow instructions to use values in the ROB as inputs.

•Access to the register values in the buffer is essentially fully associative, since any register's value can be in any slot.

There are two main alternatives to a ROB—history buffers and future files.

History buffer (HB)

The history buffer is a queue. Each entry contains—

•the program counter of the associated instruction,

•the number of the register in the register file to be overwritten by the associated instruction,

•the old value of that register,

•any exception caused by the instruction, and

•a flag to indicate the completion of the instruction.
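As a concrete sketch, here is one way an HB entry might be modeled in Python (the field names are illustrative only, not from any particular design):

from dataclasses import dataclass
from typing import Optional

@dataclass
class HBEntry:
    pc: int                     # program counter of the associated instruction
    dest_reg: Optional[int]     # register to be overwritten (None if no destination)
    old_value: Optional[int]    # previous value of that register, saved at dispatch
    exception: Optional[str]    # exception raised by the instruction, if any
    completed: bool = False     # set when the instruction finishes execution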

With a history buffer, results go into the register file out of order.

•Instructions complete out of order.

•The register file is updated immediately.

•But old state is kept in the history buffer until all previous ops are finished.

This requires hardware to “roll back” the history buffer when an exception occurs.

Operation of HB

The HB is a FIFO with head and tail pointers (like the ROB).

On instruction dispatch—

•We reserve the HB entry at tail, and advance the tail pointer.

•If the instruction writes a register Rx, we read the old value of Rx from the register file and save it in the HB.

On instruction execution/completion—

•We write the register file when the instruction completes.

•Updates occur out of order: we have a “messy register file” (imprecise).

•If an instruction caused an exception, we set the exception bit in the HB for the offending instruction.

On instruction retirement—

•We check the instruction at the head of the HB.

–If completed, we check the exception bit.

–If no exception occurred, we simply advance the head pointer, thereby deleting the entry from the history buffer.

–If an exception occurred, instruction issue is immediately halted.

The processor waits for all “in-flight” instructions to complete.

The history buffer is scanned from tail to head.

The old values of registers in the history buffer are written back to the register file.

This amounts to “undoing” the speculative updates.
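The whole dispatch/complete/retire/rollback protocol can be captured in a toy Python model. This is a sketch, not a hardware description: the queue is a plain list, register values are ints, and the wait for in-flight instructions is only noted in a comment.

class HistoryBuffer:
    def __init__(self, num_regs):
        self.rf = [0] * num_regs   # "messy" register file, updated out of order
        self.hb = []               # FIFO: head is hb[0], tail is the end of the list

    def dispatch(self, pc, dest_reg):
        # Reserve an entry at the tail and save the OLD value of the destination.
        entry = {"pc": pc, "dest": dest_reg, "old": self.rf[dest_reg],
                 "exception": None, "done": False}
        self.hb.append(entry)
        return entry

    def complete(self, entry, new_value, exception=None):
        # Write the register file immediately, out of order.
        self.rf[entry["dest"]] = new_value
        entry["exception"] = exception
        entry["done"] = True

    def retire(self):
        # Examine only the instruction at the head of the HB.
        if self.hb and self.hb[0]["done"]:
            if self.hb[0]["exception"] is None:
                self.hb.pop(0)       # advance head; the entry is discarded
            else:
                self.rollback()      # (real hardware first waits for all
                                     # in-flight instructions to complete)

    def rollback(self):
        # Scan tail -> head, restoring old values: undo the speculative updates.
        for entry in reversed(self.hb):
            self.rf[entry["dest"]] = entry["old"]
        self.hb.clear()

Note that an instruction that faults after younger instructions have already written the register file is undone along with them, leaving the file in the state it had just before the faulting instruction.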

Precise Tomasulo pipeline with HB

HB analysis

Advantages—

•Operands are always read from the (immediately updated) register file, so unlike the ROB, no bypassing or associative lookup into the buffer is needed.

Disadvantages—

•Exception recovery is slow: after waiting for in-flight instructions to complete, the HB must be rolled back entry by entry.

•Reading the old value of the destination register at dispatch requires an extra register-file read port.


Future file (FF)

A future file is “the reverse of” a history buffer.

With a FF, we maintain two register files:

•A future (“messy”) register file, updated out of order; the imprecise Tomasulo pipeline is otherwise unchanged.

•A separate architectural register file, updated in order by a reorder buffer.

The architectural file has in-order results.

The future file has out-of-order results for use as operands.

Operation of FF

Upon dispatch, an instruction is allocated a reorder buffer entry at the tail of the ROB.

Operands are read from the future file.

On instruction completion—

•The instruction writes its result into both the reorder buffer and the future file.

On instruction retirement—

•If the instruction at the head of the ROB completed with no exception, its result is copied into the architectural register file and its ROB entry is deleted.

•If the instruction at the head of the reorder buffer encountered an exception,

–instruction issue is halted,

–the pipelines are flushed, and

–the contents of the architectural register file are copied into the future file.

The architectural register file always contains precise state for exception handling.
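The protocol can be sketched as a toy Python model (the names, the list-based ROB, and integer register values are simplifications for illustration):

class FutureFileMachine:
    def __init__(self, num_regs):
        self.future = [0] * num_regs   # "messy" file: source of operands, written OOO
        self.arch = [0] * num_regs     # architectural file: updated in order
        self.rob = []                  # FIFO of pending instructions

    def dispatch(self, dest_reg):
        entry = {"dest": dest_reg, "value": None, "done": False, "exception": None}
        self.rob.append(entry)         # allocate at the tail of the ROB
        return entry

    def read_operand(self, reg):
        return self.future[reg]        # operands always come from the future file

    def complete(self, entry, value, exception=None):
        # The result is written into BOTH the ROB entry and the future file.
        entry.update(value=value, done=True, exception=exception)
        self.future[entry["dest"]] = value

    def retire(self):
        if not (self.rob and self.rob[0]["done"]):
            return
        entry = self.rob[0]
        if entry["exception"] is None:
            self.arch[entry["dest"]] = entry["value"]   # in-order arch update
            self.rob.pop(0)
        else:
            self.rob.clear()                # issue halted, pipelines flushed
            self.future = list(self.arch)   # copy precise state into the future file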

Precise Tomasulo Pipeline with FF

FF analysis

Advantages—

•Unlike the ROB, no extra bypassing or associative lookup is needed to supply operands: they are read directly from the future file.

•Unlike the HB, no extra register-file read ports are needed to save old values at dispatch.

Disadvantages—

•Slow recovery: on an exception, the entire architectural register file must be copied into the future file.

•Two complete register files must be maintained.

Branch Mispredictions

Are branch mispredictions like exceptions? Yes and no.

Handle exactly like exception (low performance):

•Wait until mispredicted branch reaches head of ROB.

•Quash entire ROB, set all “In RF” bits.

•Restart along correct path, no problem.

•But, delaying recovery until retirement is a big performance penalty.

Better performance:

•Easy part: Quash bad instructions immediately by moving ROB tail pointer to just after branch entry.

•Hard part: Restore register-file tags to their state before the branch so that renaming restarts correctly.

•How? Checkpoint register-file tags at every branch instruction. Restore tags from checkpoint when misprediction detected.

–“Logical spaces” are managed like a stack.

–At checkpoints, the current logical space is copied into an empty logical space. The stack is pushed.

–On a mispredicted branch, the stack is popped.
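A toy Python model of tag checkpointing (the branch IDs and the map-per-checkpoint representation are assumptions for illustration):

class RenameCheckpoints:
    def __init__(self, num_regs):
        # tags[r] is None when r's value is in the register file,
        # otherwise the tag of the instruction producing it.
        self.tags = {r: None for r in range(num_regs)}
        self.checkpoints = []              # stack of (branch_id, saved tag map)

    def rename(self, dest_reg, tag):
        self.tags[dest_reg] = tag          # later readers will wait on this tag

    def on_branch(self, branch_id):
        # "Push": copy the current logical space onto the stack.
        self.checkpoints.append((branch_id, dict(self.tags)))

    def on_mispredict(self, branch_id):
        # "Pop": discard every logical space younger than the mispredicted
        # branch, then restore its checkpoint so renaming restarts correctly.
        while self.checkpoints:
            bid, saved = self.checkpoints.pop()
            if bid == branch_id:
                self.tags = saved
                return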

Superscalar processing

[H&P §3.6] CPU_time = IC × CPI × CT

For scalar pipelining,

CPI ≥ 1 (Flynn’s bottleneck)

⇒ The only way to increase instructions per second is to reduce CT further.
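A quick numeric illustration in Python, with made-up values:

# Hypothetical numbers, purely for illustration.
IC  = 1_000_000_000    # instructions
CPI = 1.0              # the best a scalar pipeline can do (Flynn's bottleneck)
CT  = 1e-9             # 1 ns cycle time (1 GHz clock)

print(IC * CPI * CT)          # CPU_time = 1.0 second
print(IC * CPI * (CT / 2))    # halving CT gives 0.5 second, the only scalar lever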

The limits of pipelining

•Latch overhead and clock skew:

Overhead becomes a larger fraction of the cycle time as stages shrink.

Pipelining eventually becomes inefficient: diminishing returns.

•Can only break logic up so far.

•Pipelining is ineffective when there are dependences.

For example: you can pipeline L1-cache accesses, but it still takes X ns to get the data out. An instruction that needs that data must wait X ns regardless of how fast you make the clock and how many stages there are.

How do we get around these limits? By superscalar pipelining.

•Scalar: 1 instruction/cycle

•Superscalar: N instructions/cycle

We increase bandwidth at each stage of the pipeline.

A 2-wide superscalar pipeline: six instructions finish in seven cycles.

Cycle:    1    2    3    4    5    6    7
Instr 1:  IF   ID   EX   MEM  WB
Instr 2:  IF   ID   EX   MEM  WB
Instr 3:       IF   ID   EX   MEM  WB
Instr 4:       IF   ID   EX   MEM  WB
Instr 5:            IF   ID   EX   MEM  WB
Instr 6:            IF   ID   EX   MEM  WB
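The staggered diagram can be generated mechanically; here is a small Python sketch (the stage names and issue width are parameters, not tied to any real machine):

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instrs, issue_width):
    # Each group of issue_width instructions starts one cycle after the last.
    for i in range(n_instrs):
        start = i // issue_width
        print(" " * (5 * start) + "".join(f"{s:<5}" for s in STAGES))

pipeline_diagram(6, issue_width=2)   # reproduces the diagram above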

There may be limitations on parallel issue, for example—

•No more than one memory instruction per clock cycle.

•A limit of a single branch per cycle.

In order to maximize the number of instructions issued per clock, both static and dynamic scheduling techniques are used.
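A per-cycle issue check might look like the following Python sketch (the instruction representation and the particular limits are invented for illustration):

def select_issue_group(window, max_issue=4, max_mem=1, max_branch=1):
    # window holds decoded instructions in program order; issue stops at the
    # first instruction that would violate a limit (in-order issue).
    group, mem, br = [], 0, 0
    for instr in window:
        if len(group) == max_issue:
            break
        if instr["kind"] == "mem":
            if mem == max_mem:
                break
            mem += 1
        elif instr["kind"] == "branch":
            if br == max_branch:
                break
            br += 1
        group.append(instr)
    return group

window = [{"kind": "alu"}, {"kind": "mem"}, {"kind": "mem"}, {"kind": "alu"}]
print(len(select_issue_group(window)))   # 2: the second memory op blocks issue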

Formal definitions versus popular usage

Formal: “Superscalar” applies to any multiple-issue machine.

Popular: “Superscalar” also implies dynamic scheduling.

Statically scheduled, multiple-issue machines are popularly called VLIW – “very long instruction word.” In a VLIW, the compiler groups multiple independent instructions into a larger package.

New complexity introduced by superscalar architecture

A superscalar architecture adds complexity to almost every pipeline stage.

IF

Parallel access to I$.

Instruction-alignment issues.

ID

Parallel register access: highly multi-ported register file.

RAW hazard cross-checks required.

Renaming/dependence checking is now more complex than a simple RF lookup: each instruction must also be checked against the prior instructions in its own fetch group.

IS/EX

Parallel execution—replicate datapaths, FUs, etc.

MEM

More than one memory access per cycle? If so, the D$ must be multi-ported.

WB

Many result buses (bypasses) and register-file write ports.

Superscalar complexity and cycle time

The IPC / CT trade-off:

•Increasing superscalar complexity can hurt cycle time

•Microarchitect must balance instructions per cycle (IPC) and cycle time (CT) to achieve best overall performance.

Where complexity (potentially) impacts cycle time:

•Highly multi-ported register files and data caches.

•Dependence cross-checks and renaming within a fetch group.

•Many result buses, bypasses, and register-file write ports.
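The trade-off can be made concrete with invented numbers:

# Two hypothetical design points, purely for illustration.
narrow = {"ipc": 1.6, "ct_ns": 1.0}   # simpler core, faster clock
wide   = {"ipc": 2.0, "ct_ns": 1.4}   # more issue slots, slower clock

def ns_per_instr(d):
    return d["ct_ns"] / d["ipc"]      # time per instruction = CT / IPC

print(ns_per_instr(narrow))   # 0.625 ns/instruction
print(ns_per_instr(wide))     # 0.70 ns/instruction: the wider core loses here

Here a 25% IPC gain is outweighed by a 40% cycle-time penalty, so the "wider" design is the slower machine overall.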

Superscalar evolution

Historical perspective—

•1 integer + 1 floating-point

•Any 2 instructions

•Any 4 instructions

•Any n instructions? How far?

Register for ECE 721, “Advanced Microarchitecture,” next fall and find out!

You’ll learn about superscalar and many other execution paradigms that may replace it in the future.

© 2005 Edward F. Gehringer, ECE 463/521 Lecture Notes, Spring 2005

Based on notes from Drs. Tom Conte & Eric Rotenberg of NCSU

Figures from CAQA used with permission of Morgan Kaufmann Publishers. © 2003 Elsevier Science (USA)