CPU design

CISC and RISC

Addressing modes

CPU processing methods

timing issues and pipelining

scalar and superscalar processor organization

Microprogramming

Reading: Englander, chap 10

CISC and RISC

CISC – Complex Instruction Set Computer

Became popular in the 1970s and 1980s

The goal – combine several instructions into one instruction to simplify the code generated by high-level languages

IBM S/390

Intel x86 family of CISC processors

RISC – Reduced Instruction Set Computer

Sun SPARC

IBM RT

PowerPC

StrongARM

Both CISC and RISC architectures are consistent with the von Neumann computer characteristics

CISC – Complex Instruction Set Computer

Reasons for complexity?

Simplify high level language coding

Belief that the semantic gap should be narrowed

The gap between machine-level instructions and high-level language statements

Examples

  • VAX Sort instruction
  • IBM 360 MVC instruction (move character)

Resulting in:

  • A large number of complex instructions (200-300)
  • More complex hardware
  • Increased power requirements
  • Slower execution
  • Inefficient use of memory

CISC

Hopkins 1986: IBM/370

10% of instructions account for 71% of executions.

Programmers avoid complex instructions

85% of instructions in 5 languages consisted of:

  • assignment statements
  • IF statements
  • procedure calls

Implied that an increase in CPU performance could be accomplished by optimizing the LOAD, STORE, and BRANCH instructions

CISC

Procedure and Function Calls create significant bottlenecks

Need to pass arguments from one procedure to another

Need to store the general-registers’ values so the called subroutine can use them, and then restore the original values when the call returns

Studies revealed that such calls comprised a significant portion of overall program execution time due to the modular approach of well-designed programs.

Confirmed that most complex instructions and addressing modes were largely unused by compilers, and sometimes not even implemented

Used by assembly programmers

But most programmers used a high-level language

The needed additional hardware complexity for carrying out CISC instructions slowed down the execution of other frequently used CISC instructions.

Bulk of computer programs are very simple at the instruction level

Little payoff in making complex instructions

RISC – Reduced Instruction Set Computer

Make the common case go fast; by making simple instructions fast, most programs will go fast

  • Load/Store architecture
  • Only way to communicate with memory is via Load/Store from register file. E.g., an ADD can’t have an operand be a memory address

Simplifies communications and pipelining (coming up)

Means a lot of registers

Tradeoff: simpler CPU means there is space to put more registers on the chip

Limited, simple instruction set (50-200)

Register-oriented instructions with limited memory access

  • Fixed length, fixed format instruction
  • SPARC has 5 instruction formats
  • DEC VAX has dozens
  • Limited address modes
  • Large bank of registers
  • Circular register buffer

How RISC Differs From CISC

A limited, simple instruction set

Goal: create instructions that can (1) execute quickly (2) at high clock speeds (3) using a hard-wired pipelined implementation

Although there is no limit on the number of instructions in a RISC instruction set, there tend to be fewer than in CISC

Register-oriented instructions with very limited memory access

Most RISC instructions operate only with registers

Only a few basic LOAD and STORE instructions access data in memory

This saves processing time.

How RISC Differs From CISC

Fixed-length, fixed-format instruction word

If instructions have the same size and same format, then they can be fetched and decoded independently during the same clock interval.

Therefore, the RISC architecture encourages a very small number of different instruction formats

E.g.: 5 SPARC-chip formats (RISC) (see p. 187, Figure 7.15) versus dozens of DEC VAX-chip formats (CISC)
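The payoff of a fixed-length, fixed-format word can be sketched with a toy decoder. The 32-bit layout below (an 8-bit opcode and three 8-bit register fields) is hypothetical, not an actual SPARC format; the point is that fixed fields decode with a few constant-time mask-and-shift operations:

```python
# Hypothetical fixed 32-bit instruction format:
#   bits 31-24: opcode   bits 23-16: rd   bits 15-8: rs1   bits 7-0: rs2
def decode(word):
    opcode = (word >> 24) & 0xFF
    rd     = (word >> 16) & 0xFF
    rs1    = (word >> 8)  & 0xFF
    rs2    =  word        & 0xFF
    return opcode, rd, rs1, rs2

# Encode "ADD r1, r2, r3" with a made-up opcode 0x01, then decode it.
word = (0x01 << 24) | (1 << 16) | (2 << 8) | 3
assert decode(word) == (0x01, 1, 2, 3)
```

Because every field sits at a known bit position, the hardware never has to examine one field to find the next, which is what lets fetch and decode fit into a single clock interval.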

Limited addressing modes

To speed up instruction execution, many RISC CPUs use only one addressing mode for addressing memory

Usually direct or register indirect addressing with an offset

Large bank of registers

This avoids storing variables and intermediate results in memory during program execution

Consequently, a program does not need many LOAD and STORE instructions.

Circular Register Buffer

Problem

When executing subroutine calls and returns, register values need to be stored and then reloaded

When switching between programs in a multitasking system (context switching), register values also need to be stored and then restored

This can create a significant execution-speed overhead.

A RISC solution – the Circular Register Buffer

Common in many RISC designs

A CPU’s registers are arranged in a circular shape

These registers are typically grouped into pie-shaped components consisting of 8 registers each

An additional 8 registers (“global registers”) are arranged in a block and used for program global variables.

The circular register buffer at any time can be divided into a “window”

A window consists of 3 adjacent pie-shaped components (24 registers)

A “current window pointer” indicates where a window begins

Given a window (24 registers) and the global registers (8 registers), the program thinks that the CPU only has 32 registers.

How windows work

1st pie-shaped component (8 registers)

Used to store incoming parameters from the calling subroutine

Can also be used to return values to the calling subroutine

2nd pie-shaped component (8 registers)

Stores local variables

Used for temporary storage

3rd pie-shaped component (8 registers)

Stores arguments to pass to another subroutine that it will call, as well as the return values from the called subroutine.

When a subroutine is called:

The current window pointer is shifted 16 registers to the right

Thus, the 3rd window component (which has any arguments that it wants to pass) becomes the 1st window component of the called subroutine

When a subroutine returns values

The return values are stored in the called subroutine’s 1st window component

The current window pointer is then shifted 16 registers to the left

The 1st pie-shaped component of the called subroutine (which has the return values) becomes the 3rd component again for the calling subroutine

The values in these registers can now be used by the calling subroutine
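The window mechanics above can be sketched in Python. The 64-register file size and the register numbering below are assumptions for illustration; what matters is the overlap rule, that the caller's 3rd component is physically the same set of registers as the callee's 1st:

```python
# Sketch of a circular register window, assuming a hypothetical
# 64-register file divided into 8-register pie-shaped components.
NUM_REGS = 64          # physical registers arranged in a circle
WINDOW = 24            # a window = 3 adjacent components of 8 registers
SHIFT = 16             # a call/return moves the pointer by 2 components

class RegisterWindows:
    def __init__(self):
        self.cwp = 0   # current window pointer

    def window(self):
        """Physical indices of the 24 registers visible right now."""
        return [(self.cwp + i) % NUM_REGS for i in range(WINDOW)]

    def call(self):
        self.cwp = (self.cwp + SHIFT) % NUM_REGS

    def ret(self):
        self.cwp = (self.cwp - SHIFT) % NUM_REGS

rw = RegisterWindows()
caller_out = rw.window()[16:24]   # caller's 3rd component: outgoing args
rw.call()
callee_in = rw.window()[0:8]      # callee's 1st component: incoming args
assert caller_out == callee_in    # same physical registers – nothing copied
```

The assertion is the whole advantage: passing arguments and return values is just a pointer shift, with no copying of register contents.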

Advantage of this approach

Saves execution time by eliminating the need to copy, store, and restore register values on subroutine calls and returns.

Circular Register Buffer

Addressing Modes

The LMC provides only a single method of addressing memory and uses only one register (the calculator, which serves as its general-purpose register)

The LMC addressing method used is called Direct, Absolute Addressing

Direct Addressing

The data to be accessed is not part of the instruction itself

Rather the data is reached directly from the address in the instruction

absolute addressing

The address given in the instruction is the actual memory location being addressed

Why Have More Than 1 Addressing Mode?

Allows a greater range of memory addresses to be accessed by an instruction while using only a reasonable number of bits for the instruction address field

Easier to write certain types of programs (loops that use an index to address different elements in a table or array)

Permits greater flexibility, speed, and efficiency in manipulating data.

Register Addressing

The instruction contains the address of a register (or registers) – not a memory location

  • Instructions execute faster
  • Do not need to access memory
  • Some instruction steps can be carried out simultaneously
  • E.g.: MOVE instruction
  • Contents of source register can be moved directly to the destination register
  • PC can be incremented in parallel with earlier steps

RISC machines have an instruction set made up almost completely of register operations

  • Consequently, RISC machines are designed with a larger number of registers than typical CISC computers.

Relative Addressing

To allow the addressing of large amounts of memory while maintaining a reasonably-sized instruction address field

Permits moving an entire program to a different memory location without changing any of its memory-referencing instructions, since addresses are expressed as offsets

  • Known as relocatability, which is important to the efficient use of the computer because a program can be placed anywhere convenient

Base-Offset Addressing Mode

  • An address is determined by using a starting address in addition to an offset/displacement from the starting point
  • In other words, the instruction address field does not hold the absolute/exact address to be accessed
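A minimal sketch of base-offset addressing, with made-up addresses, showing why an offset-based instruction survives relocation unchanged:

```python
# The instruction's address field holds an offset; the effective address
# is base + offset. Only the base changes when the program is moved, so
# the instruction itself never has to be rewritten. Addresses are
# illustrative.
def effective_address(base, offset):
    return base + offset

offset = 40                                       # fixed in the instruction
assert effective_address(1000, offset) == 1040    # program loaded at 1000
assert effective_address(5000, offset) == 5040    # same program moved to 5000
```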

Addressing Modes

Direct addressing separates the data and the instruction:

data changes without affecting the instruction

action on data by other instructions

Immediate addressing contains the data in the instruction:

faster

inflexible

Indirect addressing

Separates the address of the data from the instruction

  • Useful when the data’s address varies throughout the execution of a program
  • Compare the use of pointers
  • Instruction holds the (constant) address of the address of the data.
  • Used, for example, in subscripting tables. (p294)

Addressing Modes

Register Indirect Addressing

Used in Intel x86, Compaq DEC VAX, Motorola 68x00

Similar to indirect addressing but the location holding the data address is a general purpose register.

Fast and efficient


Addressing Modes

Indexed Addressing

similar to Register Indirect Addressing

uses a general purpose or index register to increment the address held in the instruction.
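The modes above can be contrasted side by side on a toy memory and register file (all addresses and values are made up for illustration):

```python
# Toy memory (address -> contents) and register file.
memory = {100: 7, 200: 100, 300: 42}
registers = {"R1": 100, "X": 200}   # X serves as an index register

immediate      = 100                              # operand is in the instruction itself
direct         = memory[100]                      # instruction holds the data's address -> 7
indirect       = memory[memory[200]]              # address 200 holds the data's address -> 7
register_indir = memory[registers["R1"]]          # R1 holds the data's address -> 7
indexed        = memory[100 + registers["X"]]     # instruction address + index -> memory[300] = 42

assert direct == indirect == register_indir == 7
assert indexed == 42
```

Note how indirect and register-indirect reach the same data through one extra level, the difference being whether the pointer lives in memory or in a fast general-purpose register.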


CPU processing methods

Since computers basically execute instructions, it is useful to have CPUs execute as many instructions as possible in a given amount of time = increased performance

The built-in clock runs continuously when the computer is on

Determinant of computer speed:

  • Clock frequency
  • Number of steps required by each instruction

Conceptually, each clock pulse controls one step in the instruction sequence

  • Sometimes, each instruction step is further broken down into several smaller microcycle steps
  • In this case, several clock pulses are needed for one instruction step.
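The two determinants above combine into a simple rate calculation. The figures below are illustrative, assuming one clock pulse per instruction step:

```python
# Instruction rate = clock frequency / steps per instruction.
clock_hz = 2_000_000_000     # illustrative 2 GHz clock: 2 billion pulses/second
steps_per_instruction = 4    # e.g. fetch, decode, execute, write-back

instructions_per_second = clock_hz // steps_per_instruction
seconds_per_instruction = steps_per_instruction / clock_hz
print(instructions_per_second)   # 500000000
```

Either determinant improves performance: raise the clock frequency, or (the RISC strategy) reduce the number of steps each instruction needs.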

Pipelines

Definition

The technique of overlapping instructions so that more than one instruction step/phase is being executed at a time

One of the major advantages in modern computing design

Has resulted in large increases in program execution speed

NOTE:

With pipelining, it still takes the same amount of time to complete an instruction as without pipelining

BUT, the overall AVERAGE number of instructions performed in a given time period is increased

Similar to an automobile assembly line.
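The assembly-line analogy can be put into numbers. With illustrative figures (a 4-stage instruction cycle and 100 instructions), each instruction still takes 4 cycles, but once the pipeline is full one instruction completes every cycle:

```python
stages = 4          # e.g. fetch, decode, execute, write-back
n = 100             # instructions to run

sequential_cycles = n * stages          # one instruction at a time: 400
pipelined_cycles  = stages + (n - 1)    # fill the pipe, then 1/cycle: 103

print(sequential_cycles)   # 400
print(pipelined_cycles)    # 103
```

The per-instruction latency is unchanged (4 cycles), but the average throughput approaches one instruction per cycle as n grows, which is the point made above.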

The RISC approach lends itself well to a technique that can greatly improve processor performance called pipelining

more difficult with CISC instructions

The clock pulse as master control for the CPU

In the simplest case, one pulse=one instruction step

Scalar And Superscalar CPU Organization

Concurrent processing of instructions using “pipelining”

Scalar processing is where one instruction at a time is being executed to completion - even with pipelining.

Superscalar processing is where more than one instruction can be completed with a single clock cycle - by support of more than one “execution unit”

Scalar Processing

Various phases of the fetch-execute cycle are separated into different components (fetch, decode, execute, write-back/store)

Instructions are then pipelined so that different phases of multiple instructions are processed simultaneously

But only 1 instruction at a time is actually being executed to completion (i.e., only one instruction completes at a time)

Limitations with pipelining (and hence scalar processing)

Different instructions have different numbers of steps (i.e., are different sizes)

Conditional branch instructions may change the sequence of instructions to be executed

  • Can be a protracted delay while new instructions are reloaded into the pipeline.

Superscalar processing

Definition

  • The ability to process more than one instruction each clock cycle

With a perfect pipeline (as in scalar processing), it might be possible to achieve an average processing rate of one instruction each clock cycle

Superscalar processing offers the possibility to achieve an average processing rate of more than one instruction per clock cycle.


Instruction Prefetch

Simple version of Pipelining – treating the instruction cycle like an assembly line

Fetch accessing main memory

Execution usually does not access main memory

Can fetch next instruction during execution of current instruction

Called instruction prefetch

Improved Performance

But not doubled:

Fetch usually shorter than execution

Prefetch more than one instruction?

Any jump or branch means that prefetched instructions are not the required instructions

Add more stages to improve performance

But more stages can also hurt performance…

Problems in superscalar processing:

Instructions in the wrong order

Data dependency

Deliberately out of order? “Search ahead”

The Intel P6 searches ahead 20-30 instructions.
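A data dependency can be illustrated with a toy instruction list (registers and opcodes are made up). The second instruction reads a register that the first one writes, so the two cannot complete in the same cycle in program order, while the third is independent and is a candidate for "search ahead":

```python
# Each instruction: (opcode, destination register, source registers...)
program = [
    ("ADD", "r1", "r2", "r3"),   # r1 = r2 + r3
    ("SUB", "r4", "r1", "r5"),   # r4 = r1 - r5  <- reads r1, written above
    ("MUL", "r6", "r7", "r8"),   # independent: safe to move ahead
]

def depends_on(later, earlier):
    """True if `later` reads a register that `earlier` writes
    (a read-after-write dependency)."""
    return earlier[1] in later[2:]

assert depends_on(program[1], program[0])       # must wait for the ADD
assert not depends_on(program[2], program[0])   # can issue alongside it
```

A search-ahead scheduler does essentially this check over a window of upcoming instructions, issuing the independent ones early.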

Changes in program flow due to branch instructions

conditional branch instructions can create delays or even incorrect results

speculative execution using additional registers

Conflicts for resources

use speculative execution registers where instructions compete for resources.

Modern CPU block diagram

PowerPC

Pentium II and III

IBM S/390 processors

Instruction unit maintains fetch pipeline

Completion/Retire unit for speculative instructions.

Summary

RISC vs CISC

CISC complexity/RISC simplicity

Reasons for dominance of CISC in 2000?

Where is RISC succeeding?

Modes of Addressing

variety of simple variations on a theme

Scalar, superscalar and pipelining

CISC and RISC

Addressing Modes

Pipelines