Comp326 Review

by: Mamta Patel May/2003

·  3 driving forces behind architecture innovations:

o  technology, applications, programming languages/paradigms

von Neumann model / Dataflow model

o  non-deterministic / deterministic

o  has side-effects (due to multiple assignment) / side-effect-free (single assignment)

o  inherently sequential / explicitly parallel (concurrency is explicit)

o  imperative languages / functional languages

o  separation between data and control (control flow) / only a data space, no control space (no control flow)

o  problems with the dataflow model: not general-purpose enough; a thread per actor is too much to manage at run time (synchronization overhead), which makes it impractical as an execution architecture

·  Moore’s Law: #devices (transistors) on a chip doubles every 2 years

·  Amdahl’s Law: speedup = exec time_old / exec time_new = 1 / [(1 - Frac_enh) + Frac_enh / Speedup_enh]

o  Frac_enh = the fraction of execution time that can be enhanced (always <= 1)

o  Speedup_enh = the gain/speedup factor of the enhanced portion (always > 1)

o  the synchronization overhead is ignored by Amdahl’s Law
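
A quick worked check of the formula (a minimal C sketch; the 80% / 2x figures are made up for illustration, not from the notes):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - frac_enh) + frac_enh / speedup_enh) */
    double amdahl(double frac_enh, double speedup_enh) {
        return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh);
    }

    int main(void) {
        /* if 80% of execution time is enhanceable and that part gets 2x faster,
           the overall speedup is only 1 / (0.2 + 0.4) = 1.67 */
        printf("speedup = %.2f\n", amdahl(0.8, 2.0));
        return 0;
    }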

·  latency and throughput are not inversely related

o  if no concurrency, then the 2 are inversely related

o  otherwise, the 2 are independent parameters

·  CPU time = (CPI * n) / clock rate ; n = #instructions in the program

·  MIPS = clock rate / (CPI * 10^6) = instruction count / (exec time * 10^6)
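
A small worked example tying the two formulas together (illustrative numbers, not from the notes):

    #include <stdio.h>

    int main(void) {
        double n = 1e9;              /* instruction count              */
        double cpi = 2.0;            /* average clocks per instruction */
        double clock_rate = 500e6;   /* 500 MHz                        */

        double cpu_time = cpi * n / clock_rate;    /* = 4 s             */
        double mips1 = clock_rate / (cpi * 1e6);   /* = 250             */
        double mips2 = n / (cpu_time * 1e6);       /* = 250, same value */

        printf("CPU time = %.1f s, MIPS = %.0f = %.0f\n", cpu_time, mips1, mips2);
        return 0;
    }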

·  CPI is instruction-dependent, program-dependent, machine-dependent, benchmark-dependent

·  problems with MIPS:

o  it is instruction-set dependent

o  it varies with diff programs on same computer

o  it can be inversely proportional to actual performance

·  3 main reasons for emergence of GPR (general-purpose register) architectures:

o  registers are faster than memory

o  registers are more efficient for a compiler to use than other forms of internal storage

o  registers can be used to hold variables (reduced memory traffic, and hence latency due to memory accesses)

Instruction-Set Architectures

Stack
-  Advantages: simple model; good code density; generates short instructions; simple address format
-  Disadvantages: inherently sequential data structure; programming may be complex (lots of stack overhead); the stack becomes a bottleneck (the organization is too limiting); lack of random access to operands

Register-Register (load-store)
-  Advantages: simple, fixed-length instructions; simple code-generation model (separation of concerns); instructions take similar numbers of clocks to execute; tolerates high degrees of latency; supports a higher degree of concurrency; makes hardware optimizations possible
-  Disadvantages: higher instruction count than models with memory references in instructions; more instructions and lower instruction density => larger programs

Register-Memory
-  Advantages: data can be accessed without a separate load instruction first; instruction format is easy to encode; good code density
-  Disadvantages: CPI varies depending on operand location; operands are not equivalent since the source operand of a binary operation is destroyed; encoding a register number and a memory address in each instruction may restrict the #registers

Addressing Modes
-  Advantages: can reduce instruction count
-  Disadvantages: complicates CPU design; incurs runtime overhead; some modes are rarely used by compilers

·  GCD test: for two references to the same array with index expressions a*i + b and c*i + d, if a loop-carried dependence exists, then gcd(a, c) | (d - b)

o  ie. if gcd(a, c) does not divide (d - b), then no loop-carried dependence exists
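
A minimal sketch of the test (the loop and the coefficients a=2, b=3, c=6, d=1 are invented for illustration):

    #include <stdio.h>

    /* Euclid's algorithm */
    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    int main(void) {
        /* hypothetical loop body:  x[2*i + 3] = x[6*i + 1] + ... ;  */
        int a = 2, b = 3, c = 6, d = 1;

        if ((d - b) % gcd(a, c) != 0)
            printf("gcd(a,c) does not divide (d-b): no loop-carried dependence\n");
        else
            printf("gcd(a,c) divides (d-b): a loop-carried dependence MAY exist\n");
        return 0;
    }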

Techniques to Remove Dependences and Increase Concurrency

·  rescheduling – shuffle around the instructions, put in delay slots if possible

·  loop unrolling – don’t forget to rename registers when you unroll (rename AS NEEDED); see the C sketch after this list

·  software pipelining – make pipe segments, reverse order and rename IF NECESSARY

o  rename registers to get rid of WAR/WAW dependencies

o  unroll the loop a couple of times

o  select instructions for the pipe segments

o  reverse the order of the instructions and adjust the displacements (offsets) in the LD (or SD) instructions

·  dynamic scheduling – can have out of order completion of instructions

o  Scoreboard

o  Tomasulo’s Algorithm
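
A minimal C sketch of the loop-unrolling-with-renaming idea mentioned above (illustrative code, not from the notes; assumes n is a multiple of 4):

    #include <stdio.h>

    /* the body is unrolled 4 times; each copy accumulates into its own variable
       (the hand-done equivalent of renaming registers), so the copies have no
       WAR/WAW dependences on one another and can be scheduled freely */
    double sum_unrolled(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum = %.0f\n", sum_unrolled(x, 8));   /* 36 */
        return 0;
    }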

Scoreboard
-  centralized
-  stages: issue, read operands, execute, write-back
-  table format: FU, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
-  delays issue on WAW and structural hazards
-  delays RAW hazards until resolved
-  delays write-back on WAR hazards
-  limited by:
o  amount of parallelism available among instructions
o  #scoreboard entries
o  # and types of FUs
o  presence of WAR and WAW hazards

Tomasulo’s Algorithm
-  distributed
-  stages: issue, execute, write-back
-  table format: FU, Busy, Op, Vj, Vk, Qj, Qk, A
-  reservation stations (provide register renaming)
-  common data bus (access to it is serialized)
-  handles WAR and WAW hazards by register renaming
-  delays issue on structural hazards
-  delays RAW hazards until resolved

Vectors

·  vector execution time (clock cycles) = ⌈n/MVL⌉*(Tloop + Tstartup) + n*Tchime

o  n = actual vector length; MVL = max vector length

o  Tloop = time to execute the scalar loop code; Tstartup = flush (start-up) time for all convoys

o  Tchime = #chimes/convoys ( = 1 when we use chaining)
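
A worked example of the timing formula (all numbers are illustrative, not from the notes):

    #include <stdio.h>

    int main(void) {
        int n = 200, mvl = 64;       /* vector length, max vector length */
        double t_loop = 15;          /* scalar loop overhead             */
        double t_startup = 49;       /* start-up time for all convoys    */
        double t_chime = 3;          /* number of chimes/convoys         */

        int strips = (n + mvl - 1) / mvl;   /* ceil(n/MVL) = 4 */

        /* 4*(15 + 49) + 200*3 = 856 clock cycles */
        double cycles = strips * (t_loop + t_startup) + n * t_chime;
        printf("total = %.0f clock cycles\n", cycles);
        return 0;
    }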

·  don’t forget to consider the chaining overhead in your calculations

·  WAR and WAW hazards (false dependencies) drastically affect vectorization since they serialize the processing of elements

·  stripmining = break a vector of length n down into subvectors of size <= MVL (e.g. 64); a C sketch follows these bullets

·  vector stride = “distance” between two successive vector elements

·  #memory servers (banks) actually activated = n/gcd(n, vector stride)

o  n = number of banks (n-way interleaved memory); e.g. with 8 banks and stride 6, gcd(8, 6) = 2, so only 8/2 = 4 banks are used

o  vector stride affects the vector access pattern of load and store operations
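
A minimal strip-mining sketch in C (illustrative; MVL = 64 as in the notes, the vscale function is invented):

    #include <stdio.h>

    #define MVL 64   /* maximum vector length */

    /* process a vector of length n in strips of at most MVL elements;
       each inner loop stands in for one vector operation of length len */
    void vscale(double *y, const double *x, double a, int n) {
        for (int low = 0; low < n; low += MVL) {
            int len = (n - low < MVL) ? (n - low) : MVL;  /* length of this strip */
            for (int i = 0; i < len; i++)
                y[low + i] = a * x[low + i];
        }
    }

    int main(void) {
        double x[200], y[200];
        for (int i = 0; i < 200; i++) x[i] = i;
        vscale(y, x, 2.0, 200);             /* 200 elements = 3 strips of 64 + one of 8 */
        printf("y[199] = %.0f\n", y[199]);  /* 398 */
        return 0;
    }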

Cache

·  miss penalty affected by:

o  main memory bandwidth/concurrency

o  write policy

o  cache line size

·  hit ratio affected by:

o  cache management policy

o  cache size

o  program behaviour (temporal/spatial locality)

·  cache size affected by:

o  technology advancement

o  program behaviour (temporal/spatial locality)

o  hit rate

·  4 questions in Memory Hierarchy:

o  placement (mapping)

§  direct – block maps to only 1 line

§  fully-associative – block maps to any line

§  set-associative – block maps to lines in selected set

o  addressing

§  direct – check 1 tag

§  fully-associative – check all tags (expensive)

§  set-associative – check tags of selected set

o  replacement policy – random, LRU, FIFO

o  write policy

§  write through – easier to implement, cache is always clean (data coherency), high memory bandwidth

§  write back – uses less memory bandwidth, cache may have dirty blocks, more difficult to implement
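
The addressing question above comes down to splitting the address into offset, index, and tag bits; a small C sketch with made-up cache parameters:

    #include <stdio.h>

    /* number of bits needed to index x entries (x assumed to be a power of 2) */
    static unsigned log2u(unsigned x) {
        unsigned bits = 0;
        while (x > 1) { x >>= 1; bits++; }
        return bits;
    }

    int main(void) {
        /* illustrative parameters: 32 KB cache, 64-byte blocks, 4-way set
           associative, 32-bit addresses */
        unsigned cache_size = 32 * 1024, block_size = 64, assoc = 4;
        unsigned sets = cache_size / (block_size * assoc);    /* 128 sets */

        unsigned offset_bits = log2u(block_size);             /* 6  */
        unsigned index_bits  = log2u(sets);                   /* 7  */
        unsigned tag_bits    = 32 - offset_bits - index_bits; /* 19 */

        printf("offset = %u, index = %u, tag = %u bits\n",
               offset_bits, index_bits, tag_bits);
        return 0;
    }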

·  types of misses:

o  compulsory (cold) – refer to inevitable misses that occur when the cache is empty

o  conflict – if a block maps to a cache line that is occupied by another block

§  conflict misses decrease as associativity increases

o  capacity – misses due to cache size (cache is full and we need to replace a line with the requested block)

§  decreases as the size of the cache increases

·  how to reduce miss rate:

o  change parameters of cache (block size, cache size, degree of associativity)

o  use a victim cache (“backup” memory where you dump all blocks that have been “thrown” out of the cache recently)

o  prefetching techniques

o  programming techniques (improve spatial locality)

§  merging arrays, loop interchange, loop fusion, blocking (e.g. blocked matrix multiplication); a loop-interchange sketch follows this list
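
A loop-interchange sketch (illustrative code, not from the notes): the interchanged version walks the array in row-major order, so consecutive accesses fall in the same cache line.

    #define N 1024
    static double a[N][N];

    /* column-order traversal: each access touches a different row, i.e. a
       different cache line -> poor spatial locality */
    void zero_by_columns(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0;
    }

    /* after loop interchange: row-order traversal matches C's row-major layout
       -> good spatial locality */
    void zero_by_rows(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
    }

    int main(void) {
        zero_by_columns();
        zero_by_rows();
        return 0;
    }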

·  how to reduce the miss penalty

o  read through – fetch the needed word first so the CPU does not have to wait for the whole block; the rest of the block arrives afterwards

o  sub-block replacement – bring in sub-block, not whole block

o  non-blocking caches

o  multi-level caches

·  CPU time = (CPU clock cycles + Memory stall cycles)*clock cycle time

·  Memory stall cycles = #misses*miss penalty

·  Memory stall cycles = IC*miss penalty*miss rate*(memory accesses/instruction)

·  Memory stall cycles = IC*(misses/instruction)*miss penalty

·  misses/instruction = miss rate*(memory accesses/instruction)

·  Average memory access time = hit time + miss rate*miss penalty

·  CPU time = IC*(CPI + (memory accesses/instruction)*miss rate*miss penalty)*CCT, where CPI here is the base CPI without memory stalls
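
Putting the cache formulas together with made-up numbers (a minimal C sketch, not from the notes):

    #include <stdio.h>

    int main(void) {
        double cpi = 1.5;            /* base CPI, no memory stalls      */
        double acc_per_instr = 1.3;  /* memory accesses per instruction */
        double miss_rate = 0.02;
        double miss_penalty = 50;    /* cycles                */
        double hit_time = 1;         /* cycles                */
        double ic = 1e9;             /* instruction count     */
        double cct = 1e-9;           /* 1 ns clock cycle time */

        double amat = hit_time + miss_rate * miss_penalty;                  /* 2.0 cycles */
        double stalls_per_instr = acc_per_instr * miss_rate * miss_penalty; /* 1.3 cycles */
        double cpu_time = ic * (cpi + stalls_per_instr) * cct;              /* 2.8 s      */

        printf("AMAT = %.1f cycles, CPU time = %.1f s\n", amat, cpu_time);
        return 0;
    }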

·  spatial locality: the next references will likely be to addresses near the current one

·  temporal locality: a recently referenced item is likely to be referenced again in the near future

Shared Memory Multiprocessing (SMP)

·  UMA = Uniform Memory Access Machine

o  centralized memory

o  address-independent

o  time to access memory is equal among the processors

·  NUMA = Non-Uniform Memory Access Machine

o  distributed memory

o  address-dependent

o  time to access memory depends on the distance between the requesting processor and the data location

·  COMA = Cache-Only Memory Access Machine

·  problems with SMP:

o  memory latency

§  sharing causes memory latency that doesn’t scale well with multiprocessing (kills concurrency)

§  sharing is expensive, but you can’t have concurrency without sharing

o  synchronization overhead

§  synchronization is required to ensure atomicity semantics in concurrent systems

§  locality obtained through replication incurs synchronization overhead

Memory Consistency Models
Sequential Consistency (SC)
-  Description: all memory operations are performed sequentially; requires serialization and delay of memory operations within a thread; the result is as if all memory operations were performed in some sequential order consistent with the individual program orders
-  Rule: do all memory operations in order, and delay every memory operation until the previous ones are done
-  Problems: kills concurrency because of the serialization of memory operations

Weak Ordering (WO)
-  Description: classifies memory accesses as ordinary data accesses (R, W) and synchronization accesses (S); allows concurrent R/W as long as no data dependencies exist between them
-  Rules: S → R; S → W; S → S; R → S; W → S
-  Problems: the ordering is still too strict; not all programs will run correctly under WO

Release Consistency (RC)
-  Description: a refinement of the WO model; classifies synchronization accesses as acquire (SA) and release (SR); until an acquire is performed, all later memory operations are stalled; a release cannot be performed until all past memory operations are performed
-  Rules: SA → R; SA → W; SA → SA/SR; SR → SA/SR; R → SR; W → SR
-  Problems: a program may still give non-SC results (ie. if it is not data-race-free)

·  a data race exists when two conflicting memory operations occur between two threads (ie. a conflict pair occurs at run time)

·  a properly-labelled program is one that cannot have a data race (ie. we get only SC results on an RC platform)
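
The labelling idea can be sketched with C11 acquire/release atomics (a minimal sketch assuming pthreads and <stdatomic.h>; the variables and the value 42 are invented; compile with -pthread). Without the acquire/release labels on flag, the two accesses to data would be an unordered conflict pair, ie. a data race.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int data = 0;           /* ordinary data access (R, W) */
    atomic_int flag = 0;    /* synchronization access (S)  */

    void *producer(void *arg) {
        (void)arg;
        data = 42;                                              /* ordinary W   */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* release (SR) */
        return NULL;
    }

    void *consumer(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)  /* acquire (SA) */
            ;                                                   /* spin          */
        printf("data = %d\n", data);                            /* must print 42 */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }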

Cache Coherence

·  locality is improved by giving each processor its own local cache, at the expense of replication management overhead

Snooping Bus Protocol
-  Features: a common bus to which all caches are connected; each cache block maintains status information; a write is broadcast on the bus and automatically used to invalidate/update local copies elsewhere
-  Advantages: centralized protocol; no directory overhead
-  Disadvantages: writes are serialized by the common bus, which destroys concurrency; poor scalability due to serialization (only one event can occur on the bus at a time)

Directory-Based Protocol
-  Features: distributed protocol used to route messages and maintain cache coherence; a central directory holds all status information about the cache blocks; all accesses must go through the directory
-  Advantages: the solution is more scalable; enhances concurrency via concurrent messages and processing
-  Disadvantages: directory overhead (synchronization cost); message routing

·  write invalidate protocol:

o  processor has exclusive access to item before it writes it; all copies of the item in other caches are invalidated (invalidate all old copies of the data)

·  write update/broadcast protocol:

o  all copies of the item in other caches are updated with new value being written (update all old copies of the data)

·  write-invalidate is preferred over write-update because it uses less bus bandwidth

Snooping Bus – Processor Side
Request / Original state of cache block / New state of cache block : Actions
-  Read miss / Invalid / Shared : place read miss on bus
-  Read miss / Shared / Shared : place read miss on bus*
-  Read miss / Exclusive / Shared : write old block back to memory; place read miss on bus*
-  Read hit / Shared / Shared : read data from cache
-  Read hit / Exclusive / Exclusive : read data from cache
-  Write miss / Invalid / Exclusive : place write miss on bus
-  Write miss / Shared / Exclusive : place write miss on bus*
-  Write miss / Exclusive / Exclusive : write old block back to memory; place write miss on bus*
-  Write hit / Shared / Exclusive : place write miss on bus
-  Write hit / Exclusive / Exclusive : write data in cache

* may cause an address conflict (eg. with direct mapping we always replace the cache block – not always true for other addressing schemes)

·  each cache block has 3 states:

o  invalid

o  shared (read-only)

o  exclusive (read-write)

·  observe that for processor side, we deal with 4 types of requests (read miss, read hit, write miss, write hit)

·  for bus side, we deal with 2 types of requests (read miss, write miss) – write-back block is handled internally

o  the request on the bus is from another processor (we only do something if the address of the request matches the address of our own cache line)

Snooping Bus – Bus Side
Request / Original state of cache block / New state of cache block : Actions
-  Read miss / Invalid / Invalid : no action in our cache
-  Read miss / Shared / Shared : no action in our cache
-  Read miss / Exclusive / Shared : place the cache block on the bus (share our copy with the other processor); change the state of our block to Shared
-  Write miss / Invalid / Invalid : no action in our cache
-  Write miss / Shared / Invalid : invalidate our block since another processor wants to write to it
-  Write miss / Exclusive / Invalid : write back our block; invalidate it since another processor wants to write to it
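
The processor-side table can also be read as a small state machine per cache block; a minimal C sketch (the enum and function names are invented, and bus/memory actions are just printed):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } State;
    typedef enum { READ_MISS, READ_HIT, WRITE_MISS, WRITE_HIT } Request;

    /* returns the new state of the cache block, printing the actions from the
       processor-side table above */
    State processor_side(State s, Request r) {
        switch (r) {
        case READ_MISS:
            if (s == EXCLUSIVE) printf("write old block back to memory\n");
            printf("place read miss on bus\n");
            return SHARED;
        case READ_HIT:
            printf("read data from cache\n");
            return s;        /* Shared stays Shared, Exclusive stays Exclusive */
        case WRITE_MISS:
            if (s == EXCLUSIVE) printf("write old block back to memory\n");
            printf("place write miss on bus\n");
            return EXCLUSIVE;
        case WRITE_HIT:
            if (s == SHARED) printf("place write miss on bus\n");
            else             printf("write data in cache\n");
            return EXCLUSIVE;
        }
        return s;
    }

    int main(void) {
        State s = INVALID;
        s = processor_side(s, READ_MISS);   /* Invalid -> Shared    */
        s = processor_side(s, WRITE_HIT);   /* Shared  -> Exclusive */
        printf("final state = %d\n", s);
        return 0;
    }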

·  in directory-based protocol, messages are sent btwn local cache (local processor), home directory, and remote cache (remote processor)

o  local cache: processor cache generating the request

o  home directory: directory containing the status info for each cache block

o  remote cache: processor cache containing cache copy of the requested block