CPU design
CISC and RISC
Addressing modes
CPU processing methods
Timing issues and pipelining
Scalar and superscalar processor organization
Microprogramming
Reading: Englander, chap 10
CISC and RISC
CISC – Complex Instruction Set Computer
Became popular in the 1970s and 1980s
The goal – combine several instructions into one instruction to simplify the code generated by high-level languages
Examples: the IBM S/390 and Intel x86 families
RISC – Reduced Instruction Set Computer
Sun SPARC
IBM RT
PowerPC
StrongARM
Both CISC and RISC architectures are consistent with the von Neumann computer characteristics
CISC – Complex Instruction Set Computer
Reasons for complexity?
Simplify high level language coding
Belief that the semantic gap should be narrowed
The gap between machine-level instructions and high-level language statements
Examples
- VAX Sort instruction
- IBM 360 MVC instruction (move character)
Resulting in:
- A large number of complex instructions (200-300)
- More complex hardware
- Increased power requirements
- Slower execution
- Inefficient use of memory
CISC
Hopkins 1986: IBM/370
10% of instructions account for 71% of executions.
Programmers avoid complex instructions
85% of instructions in 5 languages consisted of:
- assignment statements
- IF statements
- procedure calls
Implied that an increase in CPU performance could be accomplished by optimizing the LOAD, STORE, and BRANCH instructions
CISC
Procedure and Function Calls create significant bottlenecks
Need to pass arguments from one procedure to another
Need to store the general registers’ values so the called subroutine can use them, and then restore the original values when the call returns
Studies revealed that such calls comprised a significant portion of overall program execution time due to the modular approach of well-designed programs.
Confirmed that most complex instructions and addressing modes were largely unused by compilers, and sometimes not even implemented
Used by assembly programmers
But most programmers used a high-level language
The additional hardware complexity needed to carry out complex CISC instructions slowed down the execution of the other, frequently used instructions.
Bulk of computer programs are very simple at the instruction level
Little payoff in making complex instructions
RISC – Reduced Instruction Set Computer
Make the common case go fast; by making simple instructions fast, most programs will go fast
- Load/Store architecture
- The only way to communicate with memory is via LOAD/STORE to and from the register file. E.g., an ADD cannot have a memory address as an operand
Simplifies communications and pipelining (coming up)
Means a lot of registers
Tradeoff: simpler CPU means there is space to put more registers on the chip
Limited, simple instruction set (50-200)
Register-oriented instructions with limited memory access
- Fixed-length, fixed-format instructions
- SPARC has 5 instruction formats
- DEC VAX has dozens
- Limited addressing modes
- Large bank of registers
- Circular register buffer
How RISC Differs From CISC
A limited, simple instruction set
Goal: create instructions that can (1) execute quickly (2) at high clock speeds (3) using a hard-wired pipelined implementation
Although there is no limit on the number of instructions in a RISC instruction set, there tend to be fewer than in CISC
Register-oriented instructions with very limited memory access
Most RISC instructions operate only with registers
Only a few basic LOAD and STORE instructions access data in memory
This saves processing time.
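As an illustration, here is a minimal C sketch of the load/store idea; the instruction names in the comments are generic RISC-style placeholders, not taken from any particular instruction set:

/* The statement c = a + b, with a, b and c held in memory.
   On a load/store (RISC-style) machine this becomes roughly:
       LOAD  r1, a
       LOAD  r2, b
       ADD   r3, r1, r2   <- operates on registers only
       STORE r3, c
   A CISC-style machine might instead offer a single ADD whose
   operands can be memory addresses. */
void add_example(int *a, int *b, int *c) {
    int r1 = *a;       /* LOAD a into a register          */
    int r2 = *b;       /* LOAD b into a register          */
    int r3 = r1 + r2;  /* ADD: register-to-register only  */
    *c = r3;           /* STORE the result back to memory */
}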
How RISC Differs From CISC
Fixed-length, fixed-format instruction word
If instructions have the same size and same format, then they can be fetched and decoded independently during the same clock interval.
Therefore, the RISC architecture encourages a very small number of different instruction formats
Eg: 5 SPARC-chip formats (RISC) (see p. 187, Figure 7.15) versus dozens of Digital VAX-chip formats (CISC)
Limited addressing modes
To speed up instruction execution, many RISC CPUs use only one addressing mode for addressing memory
Usually direct or register indirect addressing with an offset
Large bank of registers
This avoids storing variables and intermediate results in memory during program execution
Consequently, a program does not need many LOAD and STORE instructions.
Circular Register Buffer
Problem
When executing subroutine calls and returns, register values need to be stored and then reloaded
When switching between programs in a multitasking system (context switching), register values also need to be stored and then restored
This can create a significant execution-speed overhead.
A RISC solution: the circular register buffer
Common in many RISC designs
A CPU’s registers are arranged in a circular shape
These registers are typically grouped into pie-shaped components consisting of 8 registers each
An additional 8 registers (“global registers”) are arranged in a block and used for program global variables.
At any time, a portion of the circular register buffer is designated as a “window”
A window consists of 3 adjacent pie-shaped components (24 registers)
A “current window pointer” indicates where a window begins
Given a window (24 registers) and the global registers (8 registers), the program thinks that the CPU only has 32 registers.
How windows work
1st pie-shaped component (8 registers)
Used to store incoming parameters from the calling subroutine
Can also be used to return values to the calling subroutine
2nd pie-shaped component (8 registers)
Stores local variables
Used for temporary storage
3rd pie-shaped component (8 registers)
Stores arguments to pass to another subroutine that it will call, as well as the return values from that called subroutine.
When a subroutine is called:
The current window pointer is shifted 16 registers to the right
Thus, the 3rd window component (which has any arguments that it wants to pass) becomes the 1st window component of the called subroutine
When a subroutine returns values
The return values are stored in the called subroutine’s 1st window component
The current window pointer is then shifted 16 registers to the left
The 1st pie-shaped component of the called subroutine (which has the return values) becomes the 3rd component again for the calling subroutine
The values in these registers can now be used by the calling subroutine
Advantage of this approach
Saves execution time by eliminating the need to copy, store, and restore register values and arguments when subroutines are called and return.
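A minimal C sketch of the window mechanism; the buffer size (128 physical registers) and the values passed are illustrative assumptions, not figures from the text:

#include <stdio.h>

#define TOTAL_REGS  128   /* size of the circular buffer (assumed)       */
#define SHIFT        16   /* a call/return moves the pointer by 2 groups */

static int regs[TOTAL_REGS];  /* the circular register buffer */
static int cwp = 0;           /* current window pointer       */

/* Register r (0..23) of the current 24-register window maps to a
   physical register; the modulo makes the buffer circular. */
static int *window_reg(int r) {
    return &regs[(cwp + r) % TOTAL_REGS];
}

int main(void) {
    *window_reg(16) = 42;               /* caller puts an argument in its 3rd group */

    cwp = (cwp + SHIFT) % TOTAL_REGS;   /* "call": shift the window pointer right   */
    printf("callee sees argument %d\n", *window_reg(0));  /* same physical register */

    *window_reg(0) = 99;                /* callee leaves a return value in its 1st group */
    cwp = (cwp - SHIFT + TOTAL_REGS) % TOTAL_REGS;        /* "return": shift back left    */

    printf("caller sees return value %d\n", *window_reg(16));
    return 0;
}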
Circular Register Buffer
Addressing Modes
The LMC provides only a single method of addressing memory and uses only one register (the calculator, which serves as a general-purpose register)
The LMC addressing method used is called Direct, Absolute Addressing
Direct Addressing
The data to be accessed is not part of the instruction itself
Rather, the data is reached directly from the address given in the instruction
Absolute Addressing
The address given in the instruction is the actual memory location being addressed
Why Have More Than 1 Addressing Mode?
Allows a greater range of memory addresses to be accessed by an instruction while using only a reasonable number of bits for the instruction address field
Easier to write certain types of programs (loops that use an index to address different elements in a table or array)
Permits greater flexibility, speed, and efficiency in manipulating data.
Register Addressing
The instruction contains the address of a register (or registers), not a memory location
- Instructions execute faster
- Do not need to access memory
- Some instruction steps can be carried out simultaneously
- Eg: MOVE instruction
- Contents of source register can be moved directly to the destination register
- The PC can be incremented in parallel with earlier steps
RISC machines have instruction sets made up almost entirely of register operations
- Consequently, RISC machines are designed with a larger number of registers than typical CISC computers.
Relative Addressing
To allow the addressing of large amounts of memory while maintaining a reasonably-sized instruction address field
Permits moving an entire program to a different memory location without changing any of its memory-referencing instructions, since addresses are expressed as offsets
- Known as relocatability, which is important to the efficient use of the computer because a program can be loaded wherever convenient
Base-Offset Addressing Mode
- An address is determined by adding an offset/displacement to a starting (base) address
- In other words, the instruction address field does not hold the absolute/exact address to be accessed
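A minimal C sketch of base-offset addressing and relocatability; array indices stand in for memory addresses, and the particular numbers are illustrative:

#include <stdio.h>

int main(void) {
    int memory[100] = {0};
    int base = 10;    /* base register: where the block was loaded        */
    int offset = 5;   /* the offset is all the instruction needs to carry */

    memory[base + offset] = 7;   /* effective address = base + offset */
    printf("stored at address %d\n", base + offset);

    base = 40;                   /* "relocate" the block: only the base changes    */
    memory[base + offset] = 7;   /* the same instruction (same offset) still works */
    printf("stored at address %d\n", base + offset);
    return 0;
}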
Addressing Modes
Direct addressing separates the data and the instruction:
the data can change without affecting the instruction
other instructions can act on the same data
Immediate addressing contains the data in the instruction:
faster (no separate memory access for the operand)
inflexible (the operand is fixed when the program is written)
Indirect addressing
Separates the address of the data from the instruction
- Useful when the data’s address varies during the execution of a program
- Compare the use of pointers
- Instruction holds the (constant) address of the address of the data.
- Used, for example, in subscripting tables. (p294)
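The pointer comparison can be made concrete with a small C sketch; the double pointer stands in for the instruction’s constant address field and is purely illustrative:

#include <stdio.h>

int main(void) {
    int data = 5;
    int *address_of_data = &data;                /* memory word holding the data's address      */
    int **instruction_field = &address_of_data;  /* the instruction holds this constant address */

    /* Indirect access: follow the constant address to get the data's
       address, then follow that to reach the data itself. */
    printf("%d\n", **instruction_field);

    int other = 9;
    address_of_data = &other;              /* the data's address changes during execution... */
    printf("%d\n", **instruction_field);   /* ...but the instruction is unchanged            */
    return 0;
}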
Addressing Modes
Register Indirect Addressing
Used in Intel x86, Compaq DEC VAX, Motorola 68x00
Similar to indirect addressing, but the location holding the data’s address is a general-purpose register.
Fast and efficient
Addressing Modes
Indexed Addressing
Similar to register indirect addressing
Uses a general-purpose or index register whose contents are added to the address held in the instruction.
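A small C sketch of indexed addressing used for table subscripting; the effective address is the table’s base address (held in the instruction) plus the contents of the index register:

#include <stdio.h>

int main(void) {
    int table[5] = {10, 20, 30, 40, 50};  /* base address held in the instruction */
    int index;                            /* stands in for the index register     */

    /* Each access uses effective address = table base + index register;
       stepping the index register walks the table without changing the
       instruction itself. */
    for (index = 0; index < 5; index++)
        printf("table[%d] = %d\n", index, table[index]);
    return 0;
}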
CPU processing methods
Since computers basically execute instructions, it is useful to have the CPU execute as many instructions as possible in a given amount of time
= increased performance
The built-in clock runs continuously when the computer is on
Determinants of computer speed:
- Clock frequency
- Number of steps required by each instruction
Conceptually, each clock pulse controls one step in the instruction sequence
- Sometimes, each instruction step is further broken down into several smaller microcycle steps
- In this case, several clock pulses are needed for one instruction step.
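A back-of-envelope C sketch of how these two factors combine; the clock rate, step count, and instruction count are illustrative assumptions:

#include <stdio.h>

/* Rough model: execution time ~ instructions x steps per instruction x clock period. */
int main(void) {
    double clock_hz = 2.0e9;             /* 2 GHz clock -> 0.5 ns per pulse        */
    double steps_per_instruction = 4.0;  /* assume 4 clocked steps per instruction */
    double instructions = 1.0e9;         /* assume a 1-billion-instruction program */

    double seconds = instructions * steps_per_instruction / clock_hz;
    printf("%.2f seconds\n", seconds);   /* 1e9 * 4 / 2e9 = 2.00 seconds */
    return 0;
}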
Pipelines
Definition
The technique of overlapping instructions so that more than one instruction step/phase is being executed at a time
One of the major advances in modern computer design
Has resulted in large increases in program execution speed
NOTE:
With pipelining, it still takes the same amount of time to complete an instruction as without pipelining
BUT, the overall AVERAGE number of instructions performed in a given time period is increased
Similar to an automobile assembly line.
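A minimal C sketch of the assembly-line effect, assuming an ideal four-stage pipeline with no stalls (stage and instruction counts are illustrative):

#include <stdio.h>

int main(void) {
    int stages = 4;           /* e.g., fetch, decode, execute, write-back */
    int instructions = 100;

    /* Each instruction still takes 'stages' cycles from start to finish,
       but once the pipeline is full one instruction completes per cycle. */
    int unpipelined = instructions * stages;        /* 400 cycles */
    int pipelined   = stages + (instructions - 1);  /* 103 cycles */

    printf("unpipelined: %d cycles, pipelined: %d cycles\n",
           unpipelined, pipelined);
    return 0;
}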
The RISC approach lends itself well to pipelining, a technique that can greatly improve processor performance
Pipelining is more difficult with CISC instructions
The clock pulse acts as the master control for the CPU
In the simplest case, one pulse = one instruction step
Scalar And Superscalar CPU Organization
Concurrent processing of instructions using “pipelining”
Scalar processing: one instruction at a time is executed to completion, even with pipelining.
Superscalar processing: more than one instruction can be completed in a single clock cycle, supported by more than one “execution unit”
Scalar Processing
Various phases of the fetch-execute cycle are separated into different components (fetch, decode, execute, write-back/store)
Instructions are then pipelined so that different phases of multiple instructions are processed simultaneously
But only one instruction at a time is actually executed to completion (i.e., at most one instruction completes per clock cycle)
Limitations with pipelining (and hence scalar processing)
Different instructions have different numbers of steps (i.e., are different sizes)
Conditional branch instructions may change the sequence of instructions to be executed
- There can be a protracted delay while new instructions are reloaded into the pipeline.
Superscalar processing
Definition
- The ability to process more than one instruction per clock cycle
With a perfect pipeline (as in scalar processing), it might be possible to achieve an average processing rate of one instruction each clock cycle
Superscalar processing offers the possibility of achieving an average processing rate of more than one instruction per clock cycle.
Instruction Prefetch
Simple version of Pipelining – treating the instruction cycle like an assembly line
Fetch accesses main memory
Execution usually does not access main memory
Can fetch next instruction during execution of current instruction
Called instruction prefetch
Improved Performance
But not doubled:
Fetch is usually shorter than execution (see the sketch after this list)
Prefetch more than one instruction?
Any jump or branch means that prefetched instructions are not the required instructions
Add more stages to improve performance
But more stages can also hurt performance…
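A minimal C sketch of why prefetch improves but does not double performance; the cycle counts are illustrative and assume the fetch of the next instruction can be completely hidden behind the current execution:

#include <stdio.h>

int main(void) {
    int fetch = 1, execute = 2;   /* assumed cycles: fetch shorter than execution */
    int instructions = 100;

    int no_prefetch   = instructions * (fetch + execute);  /* 300 cycles */
    int with_prefetch = fetch + instructions * execute;    /* only the first fetch
                                                               is not overlapped: 201 cycles */
    printf("without prefetch: %d cycles, with prefetch: %d cycles\n",
           no_prefetch, with_prefetch);
    return 0;
}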
Problems in superscalar processing:
Instructions in the wrong order
Data dependency
Deliberately out of order? “Search ahead”
The Intel P6 searches ahead 20-30 instructions.
Changes in program flow due to branch instructions
Conditional branch instructions can create delays or even incorrect results
Solution: speculative execution using additional registers
Conflicts for resources
Solution: use the speculative-execution registers where instructions compete for the same resources.
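A small C sketch of the hazards listed above, with ordinary C statements standing in for individual instructions; all names are illustrative:

/* Data dependency, branch, and resource issues at the statement level. */
int hazards(int a, int b, const int *table, int i) {
    int x = a + b;    /* instruction 1                                       */
    int y = x * 2;    /* instruction 2: data dependency - must wait for x    */
    int z = a - b;    /* independent: a "search ahead" CPU could issue this  */
                      /* early or in parallel with the two above             */

    if (a > b)        /* conditional branch: the next instruction is unknown */
        z = table[i]; /* until the compare finishes, so it may be executed   */
                      /* speculatively using additional registers            */
    return y + z;
}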
Modern CPU block diagram
PowerPC
Pentium II and III
IBM S/390 processors
The instruction unit maintains the fetch pipeline
The completion/retire unit handles speculative instructions.
Summary
RISC vs CISC
CISC complexity/RISC simplicity
Reasons for dominance of CISC in 2000?
Where is RISC succeeding?
Modes of Addressing
a variety of simple variations on a theme
Scalar, superscalar and pipelining