Comp326 Review

by: Mamta Patel May/2003

·  3 driving forces behind architecture innovations:

o  technology, applications, programming languages/paradigms

von Neumann model / Dataflow model

o  non-deterministic / deterministic

o  has side-effects (due to multiple assignment) / side-effect-free (single assignment)

o  inherently sequential / explicitly parallel (concurrency is explicit)

o  imperative languages / functional languages

o  separation between data and control (control flow) / only a data space, no control space (no control flow)

o  problems with the dataflow model: not general-purpose enough; a thread per actor is too much to manage at run time (synchronization overhead), which makes it impractical as an execution architecture

·  Moore’s Law: #devices (transistors) on a chip doubles every 2 years

·  Amdahl’s Law: speedup = exec time_old / exec time_new = 1 / [(1 - Frac_enh) + Frac_enh / Speedup_enh]

o  Frac_enh = the fraction of execution time that can be enhanced (always <= 1)

o  Speedup_enh = the gain/speedup factor of the enhanced portion (always > 1)

o  the synchronization overhead is ignored by Amdahl’s Law
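
A quick worked check of the formula (a minimal C sketch; the 80% / 2x figures are made up for illustration, not from the notes):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - frac_enh) + frac_enh / speedup_enh) */
    double amdahl(double frac_enh, double speedup_enh) {
        return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh);
    }

    int main(void) {
        /* if 80% of execution time is enhanceable and that part gets 2x faster,
           the overall speedup is only 1 / (0.2 + 0.4) = 1.67 */
        printf("speedup = %.2f\n", amdahl(0.8, 2.0));
        return 0;
    }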

·  latency and throughput are not inversely related

o  if no concurrency, then the 2 are inversely related

o  otherwise, the 2 are independent parameters

·  CPU time = (CPI * n) / clock rate ; n = #instructions in the program

·  MIPS = clock rate / (CPI * 10^6) = instruction count / (exec time * 10^6)
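
A small worked example tying the two formulas together (illustrative numbers, not from the notes):

    #include <stdio.h>

    int main(void) {
        double n = 1e9;              /* instruction count              */
        double cpi = 2.0;            /* average clocks per instruction */
        double clock_rate = 500e6;   /* 500 MHz                        */

        double cpu_time = cpi * n / clock_rate;    /* = 4 s             */
        double mips1 = clock_rate / (cpi * 1e6);   /* = 250             */
        double mips2 = n / (cpu_time * 1e6);       /* = 250, same value */

        printf("CPU time = %.1f s, MIPS = %.0f = %.0f\n", cpu_time, mips1, mips2);
        return 0;
    }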

·  CPI is instruction-dependent, program-dependent, machine-dependent, benchmark-dependent

·  problems with MIPS:

o  it is instruction-set dependent

o  it varies with diff programs on same computer

o  it can be inversely proportional to actual performance

·  3 main reasons for emergence of GPR (general-purpose register) architectures:

o  registers are faster than memory

o  registers are more efficient for a compiler to use than other forms of internal storage

o  registers can be used to hold variables (reduced memory traffic, and hence latency due to memory accesses)

Instruction-Set Architectures

Stack
-  Advantages: simple model; good code density; generates short instructions; simple address format
-  Disadvantages: inherently sequential data structure; programming may be complex (lots of stack overhead); the stack becomes a bottleneck (the organization is too limiting); lack of random access to operands

Register-Register (load-store)
-  Advantages: simple, fixed-length instructions; simple code-generation model (separation of concerns); instructions take similar numbers of clocks to execute; tolerates high degrees of latency; supports a higher degree of concurrency; makes hardware optimizations possible
-  Disadvantages: higher instruction count than models with memory references in instructions; more instructions and lower instruction density => larger programs

Register-Memory
-  Advantages: data can be accessed without a separate load instruction first; instruction format is easy to encode; good code density
-  Disadvantages: CPI varies depending on operand location; operands are not equivalent since the source operand of a binary operation is destroyed; encoding a register number and a memory address in each instruction may restrict the #registers

Addressing Modes
-  Advantages: can reduce instruction count
-  Disadvantages: complicates CPU design; incurs runtime overhead; some modes are rarely used by compilers

·  GCD test: for two references to the same array with index expressions a*i + b and c*i + d, if a loop-carried dependence exists, then gcd(a, c) | (d - b)

o  ie. if gcd(a, c) does not divide (d - b), then no loop-carried dependence exists
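
A minimal sketch of the test (the loop and the coefficients a=2, b=3, c=6, d=1 are invented for illustration):

    #include <stdio.h>

    /* Euclid's algorithm */
    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    int main(void) {
        /* hypothetical loop body:  x[2*i + 3] = x[6*i + 1] + ... ;  */
        int a = 2, b = 3, c = 6, d = 1;

        if ((d - b) % gcd(a, c) != 0)
            printf("gcd(a,c) does not divide (d-b): no loop-carried dependence\n");
        else
            printf("gcd(a,c) divides (d-b): a loop-carried dependence MAY exist\n");
        return 0;
    }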

Techniques to Remove Dependences and Increase Concurrency

·  rescheduling – shuffle around the instructions, put in delay slots if possible

·  loop unrolling – don’t forget to rename registers when you unroll (rename AS NEEDED); see the C sketch after this list

·  software pipelining – make pipe segments, reverse order and rename IF NECESSARY

o  rename registers to get rid of WAR/WAW dependencies

o  unroll the loop a couple of times

o  select instructions for the pipe segments

o  reverse the order of the instructions and adjust the displacements (offsets) in the LD (or SD) instructions

·  dynamic scheduling – can have out of order completion of instructions

o  Scoreboard

o  Tomasulo’s Algorithm
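
A minimal C sketch of the loop-unrolling-with-renaming idea mentioned above (illustrative code, not from the notes; assumes n is a multiple of 4):

    #include <stdio.h>

    /* the body is unrolled 4 times; each copy accumulates into its own variable
       (the hand-done equivalent of renaming registers), so the copies have no
       WAR/WAW dependences on one another and can be scheduled freely */
    double sum_unrolled(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum = %.0f\n", sum_unrolled(x, 8));   /* 36 */
        return 0;
    }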

Scoreboard
-  centralized
-  stages: issue, read operands, execute, write-back
-  table format: FU, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
-  delays issue on WAW and structural hazards
-  delays RAW hazards until resolved
-  delays write-back on WAR hazards
-  limited by:
o  amount of parallelism available among instructions
o  #scoreboard entries
o  # and types of FUs
o  presence of WAR and WAW hazards

Tomasulo’s Algorithm
-  distributed
-  stages: issue, execute, write-back
-  table format: FU, Busy, Op, Vj, Vk, Qj, Qk, A
-  reservation stations (provide register renaming)
-  common data bus (access to it is serialized)
-  handles WAR and WAW hazards by register renaming
-  delays issue on structural hazards
-  delays RAW hazards until resolved

Vectors

·  vector execution time (clock cycles) = ⌈n/MVL⌉*(Tloop + Tstartup) + n*Tchime

o  n = actual vector length; MVL = max vector length

o  Tloop = time to execute the scalar loop code; Tstartup = flush (start-up) time for all convoys

o  Tchime = #chimes/convoys ( = 1 when we use chaining)
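
A worked example of the timing formula (all numbers are illustrative, not from the notes):

    #include <stdio.h>

    int main(void) {
        int n = 200, mvl = 64;       /* vector length, max vector length */
        double t_loop = 15;          /* scalar loop overhead             */
        double t_startup = 49;       /* start-up time for all convoys    */
        double t_chime = 3;          /* number of chimes/convoys         */

        int strips = (n + mvl - 1) / mvl;   /* ceil(n/MVL) = 4 */

        /* 4*(15 + 49) + 200*3 = 856 clock cycles */
        double cycles = strips * (t_loop + t_startup) + n * t_chime;
        printf("total = %.0f clock cycles\n", cycles);
        return 0;
    }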

·  don’t forget to consider the chaining overhead in your calculations

·  WAR and WAW hazards (false dependencies) drastically affect vectorization since they serialize the processing of elements

·  stripmining = break a vector of length n down into subvectors of size <= MVL (e.g. 64); a C sketch follows these bullets

·  vector stride = “distance” between two successive vector elements

·  #memory servers (banks) actually activated = n/gcd(n, vector stride)

o  n = number of banks (n-way interleaved memory); e.g. with 8 banks and stride 6, gcd(8, 6) = 2, so only 8/2 = 4 banks are used

o  vector stride affects the vector access pattern of load and store operations
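
A minimal strip-mining sketch in C (illustrative; MVL = 64 as in the notes, the vscale function is invented):

    #include <stdio.h>

    #define MVL 64   /* maximum vector length */

    /* process a vector of length n in strips of at most MVL elements;
       each inner loop stands in for one vector operation of length len */
    void vscale(double *y, const double *x, double a, int n) {
        for (int low = 0; low < n; low += MVL) {
            int len = (n - low < MVL) ? (n - low) : MVL;  /* length of this strip */
            for (int i = 0; i < len; i++)
                y[low + i] = a * x[low + i];
        }
    }

    int main(void) {
        double x[200], y[200];
        for (int i = 0; i < 200; i++) x[i] = i;
        vscale(y, x, 2.0, 200);             /* 200 elements = 3 strips of 64 + one of 8 */
        printf("y[199] = %.0f\n", y[199]);  /* 398 */
        return 0;
    }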

Cache

·  miss penalty affected by:

o  main memory bandwidth/concurrency

o  write policy

o  cache line size

·  hit ratio affected by:

o  cache management policy

o  cache size

o  program behaviour (temporal/spatial locality)

·  cache size affected by:

o  technology advancement

o  program behaviour (temporal/spatial locality)

o  hit rate

·  4 questions in Memory Hierarchy:

o  placement (mapping)

§  direct – block maps to only 1 line

§  fully-associative – block maps to any line

§  set-associative – block maps to lines in selected set

o  addressing

§  direct – check 1 tag

§  fully-associative – check all tags (expensive)

§  set-associative – check tags of selected set

o  replacement policy – random, LRU, FIFO

o  write policy

§  write through – easier to implement, cache is always clean (data coherency), high memory bandwidth

§  write back – uses less memory bandwidth, cache may have dirty blocks, more difficult to implement
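
The addressing question above comes down to splitting the address into offset, index, and tag bits; a small C sketch with made-up cache parameters:

    #include <stdio.h>

    /* number of bits needed to index x entries (x assumed to be a power of 2) */
    static unsigned log2u(unsigned x) {
        unsigned bits = 0;
        while (x > 1) { x >>= 1; bits++; }
        return bits;
    }

    int main(void) {
        /* illustrative parameters: 32 KB cache, 64-byte blocks, 4-way set
           associative, 32-bit addresses */
        unsigned cache_size = 32 * 1024, block_size = 64, assoc = 4;
        unsigned sets = cache_size / (block_size * assoc);    /* 128 sets */

        unsigned offset_bits = log2u(block_size);             /* 6  */
        unsigned index_bits  = log2u(sets);                   /* 7  */
        unsigned tag_bits    = 32 - offset_bits - index_bits; /* 19 */

        printf("offset = %u, index = %u, tag = %u bits\n",
               offset_bits, index_bits, tag_bits);
        return 0;
    }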

·  types of misses:

o  compulsory (cold) – refer to inevitable misses that occur when the cache is empty

o  conflict – if a block maps to a cache line that is occupied by another block

§  conflict misses decrease as associativity increases

o  capacity – misses due to cache size (cache is full and we need to replace a line with the requested block)

§  decreases as the size of the cache increases

·  how to reduce miss rate:

o  change parameters of cache (block size, cache size, degree of associativity)

o  use a victim cache (“backup” memory where you dump all blocks that have been “thrown” out of the cache recently)

o  prefetching techniques

o  programming techniques (improve spatial locality)

§  merging arrays, loop interchange, loop fusion, blocking (e.g. blocked matrix multiplication); a loop-interchange sketch follows this list
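
A loop-interchange sketch (illustrative code, not from the notes): the interchanged version walks the array in row-major order, so consecutive accesses fall in the same cache line.

    #define N 1024
    static double a[N][N];

    /* column-order traversal: each access touches a different row, i.e. a
       different cache line -> poor spatial locality */
    void zero_by_columns(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0;
    }

    /* after loop interchange: row-order traversal matches C's row-major layout
       -> good spatial locality */
    void zero_by_rows(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
    }

    int main(void) {
        zero_by_columns();
        zero_by_rows();
        return 0;
    }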

·  how to reduce the miss penalty

o  read through – fetch the needed word first so the CPU does not have to wait for the whole block; the rest of the block arrives afterwards

o  sub-block replacement – bring in sub-block, not whole block

o  non-blocking caches

o  multi-level caches

·  CPU time = (CPU clock cycles + Memory stall cycles)*clock cycle time

·  Memory stall cycles = #misses*miss penalty

·  Memory stall cycles = IC*miss penalty*miss rate*(memory accesses/instruction)

·  Memory stall cycles = IC*(misses/instruction)*miss penalty

·  misses/instruction = miss rate*(memory accesses/instruction)

·  Average memory access time = hit time + miss rate*miss penalty

·  CPU time = IC*(CPI + (memory accesses/instruction)*miss rate*miss penalty)*CCT, where CPI here is the base CPI without memory stalls
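
Putting the cache formulas together with made-up numbers (a minimal C sketch, not from the notes):

    #include <stdio.h>

    int main(void) {
        double cpi = 1.5;            /* base CPI, no memory stalls      */
        double acc_per_instr = 1.3;  /* memory accesses per instruction */
        double miss_rate = 0.02;
        double miss_penalty = 50;    /* cycles                */
        double hit_time = 1;         /* cycles                */
        double ic = 1e9;             /* instruction count     */
        double cct = 1e-9;           /* 1 ns clock cycle time */

        double amat = hit_time + miss_rate * miss_penalty;                  /* 2.0 cycles */
        double stalls_per_instr = acc_per_instr * miss_rate * miss_penalty; /* 1.3 cycles */
        double cpu_time = ic * (cpi + stalls_per_instr) * cct;              /* 2.8 s      */

        printf("AMAT = %.1f cycles, CPU time = %.1f s\n", amat, cpu_time);
        return 0;
    }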

·  spatial locality: the next references will likely be to addresses near the current one

·  temporal locality: a recently referenced item is likely to be referenced again in the near future

Shared Memory Multiprocessing (SMP)

·  UMA = Uniform Memory Access Machine

o  centralized memory

o  address-independent

o  time to access memory is equal among the processors

·  NUMA = Non-Uniform Memory Access Machine

o  distributed memory

o  address-dependent

o  time to access memory depends on the distance between the requesting processor and the data location

·  COMA = Cache-Only Memory Access Machine

·  problems with SMP:

o  memory latency

§  sharing causes memory latency that doesn’t scale well with multiprocessing (kills concurrency)

§  sharing is expensive, but you can’t have concurrency without sharing

o  synchronization overhead

§  synchronization is required to ensure atomicity semantics in concurrent systems

§  locality obtained through replication incurs synchronization overhead

Memory Consistency Models
Sequential Consistency (SC)
-  Description: all memory operations are performed sequentially; requires serialization and delay of memory operations within a thread; the result is as if all memory operations were performed in some sequential order consistent with the individual program orders
-  Rule: do all memory operations in order, and delay every memory operation until the previous ones are done
-  Problems: kills concurrency because of the serialization of memory operations

Weak Ordering (WO)
-  Description: classifies memory accesses as ordinary data accesses (R, W) and synchronization accesses (S); allows concurrent R/W as long as no data dependencies exist between them
-  Rules: S → R; S → W; S → S; R → S; W → S
-  Problems: the ordering is still too strict; not all programs will run correctly under WO

Release Consistency (RC)
-  Description: a refinement of the WO model; classifies synchronization accesses as acquire (SA) and release (SR); until an acquire is performed, all later memory operations are stalled; a release cannot be performed until all past memory operations are performed
-  Rules: SA → R; SA → W; SA → SA/SR; SR → SA/SR; R → SR; W → SR
-  Problems: a program may still give non-SC results (ie. if it is not data-race-free)

·  a data race exists when two conflicting memory operations occur between two threads (ie. a conflict pair occurs at run time)

·  a properly-labelled program is one that cannot have a data race (ie. we get only SC results on an RC platform)
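
The labelling idea can be sketched with C11 acquire/release atomics (a minimal sketch assuming pthreads and <stdatomic.h>; the variables and the value 42 are invented; compile with -pthread). Without the acquire/release labels on flag, the two accesses to data would be an unordered conflict pair, ie. a data race.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int data = 0;           /* ordinary data access (R, W) */
    atomic_int flag = 0;    /* synchronization access (S)  */

    void *producer(void *arg) {
        (void)arg;
        data = 42;                                              /* ordinary W   */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* release (SR) */
        return NULL;
    }

    void *consumer(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)  /* acquire (SA) */
            ;                                                   /* spin          */
        printf("data = %d\n", data);                            /* must print 42 */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }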

Cache Coherence

·  locality is improved by giving each processor its own local cache, at the expense of replication management overhead

Snooping Bus Protocol
-  Features: a common bus to which all caches are connected; each cache block maintains status information; a write is broadcast on the bus and automatically used to invalidate/update local copies elsewhere
-  Advantages: centralized protocol; no directory overhead
-  Disadvantages: writes are serialized by the common bus, which destroys concurrency; poor scalability due to serialization (only one event can occur on the bus at a time)

Directory-Based Protocol
-  Features: distributed protocol used to route messages and maintain cache coherence; a central directory holds all status information about the cache blocks; all accesses must go through the directory
-  Advantages: the solution is more scalable; enhances concurrency via concurrent messages and processing
-  Disadvantages: directory overhead (synchronization cost); message routing

·  write invalidate protocol:

o  processor has exclusive access to item before it writes it; all copies of the item in other caches are invalidated (invalidate all old copies of the data)

·  write update/broadcast protocol:

o  all copies of the item in other caches are updated with new value being written (update all old copies of the data)

·  write-invalidate is preferred over write-update because it uses less bus bandwidth

Snooping Bus – Processor Side
Request / Original state of cache block / New state of cache block : Actions
-  Read miss / Invalid / Shared : place read miss on bus
-  Read miss / Shared / Shared : place read miss on bus*
-  Read miss / Exclusive / Shared : write old block back to memory; place read miss on bus*
-  Read hit / Shared / Shared : read data from cache
-  Read hit / Exclusive / Exclusive : read data from cache
-  Write miss / Invalid / Exclusive : place write miss on bus
-  Write miss / Shared / Exclusive : place write miss on bus*
-  Write miss / Exclusive / Exclusive : write old block back to memory; place write miss on bus*
-  Write hit / Shared / Exclusive : place write miss on bus
-  Write hit / Exclusive / Exclusive : write data in cache

* may cause an address conflict (eg. with direct mapping we always replace the cache block – not always true for other addressing schemes)

·  each cache block has 3 states:

o  invalid

o  shared (read-only)

o  exclusive (read-write)

·  observe that for processor side, we deal with 4 types of requests (read miss, read hit, write miss, write hit)

·  for bus side, we deal with 2 types of requests (read miss, write miss) – write-back block is handled internally

o  the request on the bus is from another processor (we only do something if the address of the request matches the address of our own cache line)

Snooping Bus – Bus Side
Request / Original state of cache block / New state of cache block : Actions
-  Read miss / Invalid / Invalid : no action in our cache
-  Read miss / Shared / Shared : no action in our cache
-  Read miss / Exclusive / Shared : place the cache block on the bus (share our copy with the other processor); change the state of our block to Shared
-  Write miss / Invalid / Invalid : no action in our cache
-  Write miss / Shared / Invalid : invalidate our block since another processor wants to write to it
-  Write miss / Exclusive / Invalid : write back our block; invalidate it since another processor wants to write to it
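
The processor-side table can also be read as a small state machine per cache block; a minimal C sketch (the enum and function names are invented, and bus/memory actions are just printed):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } State;
    typedef enum { READ_MISS, READ_HIT, WRITE_MISS, WRITE_HIT } Request;

    /* returns the new state of the cache block, printing the actions from the
       processor-side table above */
    State processor_side(State s, Request r) {
        switch (r) {
        case READ_MISS:
            if (s == EXCLUSIVE) printf("write old block back to memory\n");
            printf("place read miss on bus\n");
            return SHARED;
        case READ_HIT:
            printf("read data from cache\n");
            return s;        /* Shared stays Shared, Exclusive stays Exclusive */
        case WRITE_MISS:
            if (s == EXCLUSIVE) printf("write old block back to memory\n");
            printf("place write miss on bus\n");
            return EXCLUSIVE;
        case WRITE_HIT:
            if (s == SHARED) printf("place write miss on bus\n");
            else             printf("write data in cache\n");
            return EXCLUSIVE;
        }
        return s;
    }

    int main(void) {
        State s = INVALID;
        s = processor_side(s, READ_MISS);   /* Invalid -> Shared    */
        s = processor_side(s, WRITE_HIT);   /* Shared  -> Exclusive */
        printf("final state = %d\n", s);
        return 0;
    }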

·  in directory-based protocol, messages are sent btwn local cache (local processor), home directory, and remote cache (remote processor)

o  local cache: processor cache generating the request

o  home directory: directory containing the status info for each cache block

o  remote cache: processor cache containing cache copy of the requested block