1.  How does Global Code Scheduling work? (Trace Scheduling and Superblocks)

Loop unrolling and software pipelining:

These are useful if there are no conditionals within the loop as it will be relatively easy to find independent code.
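For instance, a conditional-free loop unrolls cleanly. A minimal C sketch (function name and unroll factor are my own choices): the four statements in the unrolled body touch disjoint elements, so the scheduler can reorder or co-issue them freely.

```c
/* Unrolled-by-4 vector add: no conditionals in the body, so each of
   the four statements is independent of the others and can be
   scheduled in any order (or issued in parallel). */
void vec_add_unrolled4(int *a, const int *b, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* unrolled by 4 */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];          /* independent of the line above */
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    for (; i < n; i++)                 /* leftover iterations */
        a[i] += b[i];
}
```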

Once there are conditionals within the loop (especially a tight loop) it will be difficult to find independent instructions to schedule

Very likely that whatever code is in the loop body will depend on the outcome of the conditional. E.g.:

for(i = 0; i < n; i++)
{
    a[i] = a[i] + b[i];
    ....
    if(a[i] == 0)
    {
        b[i] = b[i] - 5;
        ....
    }
    else
    {
        X += b[i] + 7;
        ....
    }
    c[i] = ....
}

It is difficult to schedule the unrolled version of this loop because most of the work in each iteration depends on the outcome of the condition if(a[i] == 0).

Global code scheduling attempts to schedule code across conditional statements to allow for more ILP.


Complications of Global Scheduling

i)  Which part of the conditional is most frequently executed? To get maximum savings we will need to move instructions out of the part of the branch that is executed most frequently. For this discussion we will assume that the THEN portion is executed most of the time.

ii)  What is the cost of moving the statement to above the conditional branch? Obviously we will require empty ILP slots above the branch in order to have any benefit (e.g. if the CPU can issue 3 instructions per cycle but there is only 1 instruction above the branch that can be issued, then there are two empty ILP slots).

iii)  What kind of compensation code must be inserted into the ELSE portion of the branch, and how much? E.g. if we hoisted b[i] = b[i] - 5 above the branch, then we must insert b[i] = b[i] + 5 into the ELSE portion to undo it.

iv)  How easy is it to schedule the compensation code? If the cost of the compensation code is high, then we must ensure that the ELSE portion is executed very infrequently compared to the THEN portion.

v)  Is it easier to move the c[i] statement up? What if the THEN or ELSE portions also depend on c[i]? Moving it up means that the correctness of the THEN and ELSE portions is affected.

Many other complications involved.
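To make the trade-off concrete, here is a sketch in C of complication (iii) applied to the loop above (the "...." filler statements are dropped, and the two function names are mine): the frequent THEN statement is hoisted above the branch, and compensation code is added to the now-infrequent ELSE path so both versions compute identical results.

```c
/* Original loop body, simplified from the example above.
   THEN path: b[i] -= 5;  ELSE path: X += b[i] + 7. */
void original_loop(int *a, int *b, int *x, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        if (a[i] == 0)
            b[i] = b[i] - 5;
        else
            *x += b[i] + 7;
    }
}

/* Globally scheduled version: the frequent THEN statement is hoisted
   above the branch; the ELSE path gets compensation code to undo the
   speculative write. */
void scheduled_loop(int *a, int *b, int *x, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        b[i] = b[i] - 5;              /* hoisted from the THEN path */
        if (a[i] != 0) {
            b[i] = b[i] + 5;          /* compensation code */
            *x += b[i] + 7;
        }
    }
}
```

The hoist pays off only when the ELSE path (and hence the compensation code) runs rarely; otherwise we execute extra instructions on the common path.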

How do we do global code scheduling?

Trace Scheduling

Idea:

To find as many instructions as possible along the frequently executed path (the THEN part in our case), move them up, and compact them into as few wide instructions (for VLIW) or issue packets (for superscalar processors) as possible.

Step 1: Trace Selection

a.  The loop is unrolled (since loop branches are usually taken) to form a sequence of instructions that is likely to be executed (this is called a "trace").

A trace may have multiple exit points (corresponding to the ELSE portion of our code) and multiple entry points (corresponding to returns to the trace, e.g. for the next iteration).

If the trace is only allowed to have one entry point but multiple exit points, it is called a “superblock”. This restriction is placed to ease moving code.

E.g. if we unrolled the above code four times and assume that the THEN portion is always taken, our superblock will be:


a[i] = a[i] + b[i];
b[i] = b[i] - 5;
....
c[i] = ....
if(a[i] != 0) goto EXIT;

a[i+1] = a[i+1] + b[i+1];
b[i+1] = b[i+1] - 5;
....
c[i+1] = ....
if(a[i+1] != 0) goto EXIT;

a[i+2] = a[i+2] + b[i+2];
b[i+2] = b[i+2] - 5;
....
c[i+2] = ....
if(a[i+2] != 0) goto EXIT;

a[i+3] = a[i+3] + b[i+3];
b[i+3] = b[i+3] - 5;
....
c[i+3] = ....
if(a[i+3] != 0) goto EXIT;

....
goto SKIP;

EXIT:
....
X += b[i] + 7;
....

SKIP:

-  More code is actually inserted at the end of the superblock to finish up the remaining iterations of the loop. Other code might also be inserted to track which point in the trace we exited from.

-  This is not necessary for non-superblock traces (with multiple entry points), since additional iterations can be reached by jumping back into the trace. But moving code across entry points introduces other problems, so it is simpler to just finish the remaining iterations at the exit point.

Step Two: Trace Compaction

This is basically code scheduling. Code from the unrolled loop is moved into empty ILP slots above its current position, and compensation code is inserted into other portions to preserve correctness.

2.  What is a write miss?

A write miss, like a read miss, is a write to a memory location whose block is not currently in the cache (either never loaded, or loaded and later invalidated).

Two ways to handle:

i)  No-write-allocate

Useful for write-through caches. We simply update the memory location without creating an entry in the cache.

Of course if we write to the same location again we will have another write-miss.

ii)  Write-allocate

More complicated. A new cache entry is created in the cache, and the new value is written to this entry (and to memory if we are doing write-through).

Note that multiple-word blocks will require a read from memory before the write, otherwise the other values in the block will be incorrect. E.g.

A write of the value 1 to location x+2 misses. We allocate a cache block (assume 4-word blocks), read locations x, x+1, x+2 and x+3 from memory, then modify location x+2 in the cache (and in memory if write-through).
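The write-allocate sequence just described can be sketched as a toy one-block, 4-word cache (all names and the structure here are illustrative, not any real design): on a miss the whole block is filled from memory first, so the other three words stay correct after the write.

```c
#include <string.h>

#define BLOCK_WORDS 4

/* Toy single-block cache: one valid bit, the block's base address,
   and four data words. */
typedef struct {
    int valid;
    int base;                 /* block-aligned base address */
    int data[BLOCK_WORDS];
} Cache;

/* Write-allocate: on a write miss, read the whole block from memory
   first, then modify the target word in the cache (and in memory,
   if we are doing write-through). */
void cache_write(Cache *c, int *mem, int addr, int value, int write_through) {
    int base = addr - (addr % BLOCK_WORDS);
    if (!c->valid || c->base != base) {               /* write miss */
        memcpy(c->data, &mem[base], sizeof(c->data)); /* block fill first */
        c->base = base;
        c->valid = 1;
    }
    c->data[addr - base] = value;                     /* update word in cache */
    if (write_through)
        mem[addr] = value;                            /* update memory too */
}
```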

3.  How to distinguish between RAW and WAR?

RAW – Read-after-write

The results of a previous write are read. E.g.:

I0: a[i] = b[i] + 3;

I1: c[i] = a[i] + 1;

There is a RAW between I0 and I1 based on a[i]. This is the only TRUE dependency and cannot be removed by renaming.

WAR – Write-after-read

A variable is written to, and this variable was read in a previous instruction. E.g.:

I0: a[i] = b[i]+3;

I1: b[i] = 7+c[i];

Here there is a WAR between I0 and I1 based on b[i]. This is hazardous because if I1 is executed and retired before I0, I0 will read the wrong value in b[i].

Can be resolved by renaming each new copy of b[i]:

I0: a[i] = b[i]+3;

I1: b1[i] = 7+c[i];

Now if I1 is executed and retired before I0, no problem.

WAW (not asked but added for completeness)

Two instructions write to the same variable. E.g.:

I0: a[i] = 3+b[i];

I1: a[i] = 7 – c[i];

This is hazardous because if I1 is executed and retired before I0, the final a[i] answer is wrong. Again resolve by renaming.
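The WAR case and its renaming fix can be demonstrated directly. A small C sketch (function names are mine) simulates executing I1 before I0: without renaming I0 reads the clobbered b, with renaming it still sees the original value.

```c
/* Program order:  I0: a = b + 3;   I1: b = 7 + c;
   Here we simulate the out-of-order case where I1 retires first. */

/* WITHOUT renaming: I1 clobbers b before I0 reads it. */
int exec_without_rename(int b, int c) {
    b = 7 + c;           /* I1 executes and retires first */
    int a = b + 3;       /* I0 now reads the new b: wrong result */
    return a;
}

/* WITH renaming: I1 writes a fresh location b1, so I0 still sees the
   original b regardless of execution order. */
int exec_with_rename(int b, int c) {
    int b1 = 7 + c;      /* I1 (destination renamed to b1) */
    int a = b + 3;       /* I0 reads the original b: correct result */
    (void)b1;            /* b1 becomes the architectural b afterwards */
    return a;
}
```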

4.  What is the justification behind IA-64?

Main justification:

a.  Superscalar architectures attempt to discover and benefit from ILP using hardware:

i)  ILP search is confined to a narrow window of instructions in the “pre-fetch queue” or “issue buffer”. CPU cannot “see” ILP beyond this narrow window. Large windows are expensive and can be hard to fill.

ii)  Difficult to find ILP in the instructions. Many conditions need to be met (e.g. dependencies) to be able to issue the instructions in parallel.

iii)  Branches introduce a new complication: Not sure which portion to fetch instructions from. Further complicates instruction fetch and issue.

iv)  Cost of a wrong branch assumption is very high. Sophisticated branch prediction with prediction buffers, statistical predictors etc. is required to reduce poor predictions. Very expensive and hard to build.

v)  Loads are expensive and cause pipelines to idle.

b.  Why not move all the complication to software?

i)  Software is cheap, simple to write.

ii)  Extensive support exists to debug and fix faulty software. Faulty hardware, on the other hand, is a disaster (e.g. the Pentium FDIV bug).

c.  Three ideas in Itanium (IA-64):

i)  Pass the responsibility of finding independent instructions to the software (compiler).

i.  IA-64 is a VLIW architecture. Three instructions can fit into a single instruction word.

ii.  All three instructions in the instruction word are simply sent to the execution units.

iii.  Dependency information is inserted by the compiler as a set of "template bits" in the instruction word. These indicate independence not only within a single instruction word but also across instruction words. The CPU looks only at the template to determine which instructions can execute in parallel; it does not try to discover ILP on its own.

ii)  Forget about branch prediction.

Compiler inserts predicates to indicate the THEN and the ELSE portions of code. E.g. for the following code:

      cmp r1, r2
      bne ELSE
      addi r1, r1, 2    // THEN
      sub r3, r7, r1    // portion
      j JOIN            // Skip over ELSE
ELSE: subi r1, r1, 7
      mul r3, r7, r1
JOIN: ...

Using Stallings's notation, this is re-written as:

<P1, P2> = cmp(r1, r2);
<P1> addi r1, r1, 2
<P1> sub r3, r7, r1
<P2> subi r1, r1, 7
<P2> mul r3, r7, r1

The code for both predicates is executed together, but only the results of the correct predicate are committed.

This allows us to fully utilize all the execution units without having to roll back any mispredicted branches.
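As a sketch of these semantics (not of real IA-64 encoding; the struct and function names are mine), the predicated example can be modeled in C: both arms compute into temporaries, and only the arm whose predicate is true commits its results.

```c
/* Registers used by the example: r1, r3, r7 (r2 is the compare operand). */
typedef struct { int r1, r3, r7; } Regs;

void run_predicated(Regs *r, int r2) {
    int p1 = (r->r1 == r2);           /* <P1, P2> = cmp(r1, r2) */
    int p2 = !p1;

    /* Both arms execute (conceptually in parallel) into temporaries... */
    int then_r1 = r->r1 + 2;          /* <P1> addi r1, r1, 2 */
    int then_r3 = r->r7 - then_r1;    /* <P1> sub  r3, r7, r1 */
    int else_r1 = r->r1 - 7;          /* <P2> subi r1, r1, 7 */
    int else_r3 = r->r7 * else_r1;    /* <P2> mul  r3, r7, r1 */

    /* ...but only the true predicate's results are committed. */
    if (p1) { r->r1 = then_r1; r->r3 = then_r3; }
    if (p2) { r->r1 = else_r1; r->r3 = else_r3; }
}
```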

iii)  Speculative Loading

Loads from memory can be very slow. Thus we want to begin loads as early as possible.

The Itanium also supports speculative loading, which allows the compiler to move loads much earlier in the code, as far away as possible from the instructions that use the loaded values.

Results are not committed until a “check” instruction is executed. Exceptions are also delayed until the “check” is executed.

This separation between the "load" (when memory is accessed) and the "check" (when the result is committed) lets us start memory loads early, while the effect of the load (i.e. modifying the target register) takes place only at the point where the result is needed.
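A rough C model of this load/check split (illustrative names only, not real IA-64 mnemonics): a faulting speculative load does not trap immediately but poisons its result with a NaT-style flag, and the exception surfaces only when the check consumes the value.

```c
#include <stddef.h>

/* A register value plus a deferred-fault flag (NaT-style). */
typedef struct { int value; int nat; } SpecReg;

/* Speculative load: hoisted early, never traps. A bad address just
   poisons the result (here, NULL stands in for any faulting access). */
SpecReg speculative_load(const int *addr) {
    SpecReg r = {0, 0};
    if (addr == NULL)
        r.nat = 1;          /* fault deferred, no exception yet */
    else
        r.value = *addr;
    return r;
}

/* The "check", placed where the value is actually needed: commit the
   loaded value, or report the deferred exception now.
   Returns 0 on success, -1 if the earlier load had faulted. */
int speculative_check(SpecReg r, int *out) {
    if (r.nat)
        return -1;          /* exception raised only at the check */
    *out = r.value;
    return 0;
}
```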

Surprisingly, the benefits of IA-64 are still not clear in practice.

5.  From page 203 of CA-AQA 3rd Ed, Fig 3.13: in the first row (d=2), why does the b1 branch prediction start with NT/NT with the LHS bolded, whereas b2 starts with NT/NT with the RHS bolded?

Fig 3.13 is about using correlation bits to do branch prediction.

The bolding in Fig 3.13 depends on whether the previous branch was taken (bold RHS) or not taken (bold LHS). This is illustrated below.

Main Idea:

Base our prediction of the current branch on the behavior of other branches

We use two bits: the first bit indicates what prediction to make if the previous branch was not taken, and the second bit indicates what prediction to make if the previous branch was taken. This is shown below:

Prediction bits / Prediction if previous branch not taken / Prediction if previous branch taken
NT/NT / NT / NT
NT/T  / NT / T
T/NT  / T  / NT
T/T   / T  / T

Individually each bit is treated like a 1-bit predictor (i.e. its direction is set according to whether the current branch was actually taken or not).


Given the following C code:

if(d == 0)
    d = 1;
if(d == 1)
    d = 2;

Assuming d is in register r1, this gives us:

    bnez r1, L1       ; Branch b1. Taken if d != 0
    addi r1, r0, 1    ; Set r1 to 1 if d==0. Note that r0 is
                      ; a special register that always holds
                      ; the value 0
L1: addi r3, r1, -1   ; Add -1 to r1 (i.e. subtract 1). If r1 == 1,
                      ; the result in r3 is zero.
    bnez r3, L2       ; Branch b2. Taken if d != 1
                      ; (the code for d = 2 would go here)
L2: ....

If this is in a loop, the predictors for b1 will depend on b2, and the predictors for b2 will depend on b1. Initially we assume both branches are not taken. The highlighted entries in Fig 3.13 indicate which of the two predictions (previous branch not taken / previous branch taken) is consulted.

The table below shows several iterations through the loop when d takes values in the range [0,2]. Brackets mark the prediction bit consulted: the left bit if the previous branch was not taken, the right bit if it was taken.

Iter / d=? / b1 pred / b1 act / new b1 pred / b2 pred / b2 act / new b2 pred
0 / 2 / [NT]/NT / T  / T/NT / NT/[NT] / T  / NT/T
1 / 0 / T/[NT]  / NT / T/NT / [NT]/T  / NT / NT/T
2 / 1 / [T]/NT  / T  / T/NT / NT/[T]  / NT / NT/NT
3 / 2 / [T]/NT  / T  / T/NT / NT/[NT] / T  / NT/T
4 / 0 / T/[NT]  / NT / T/NT / [NT]/T  / NT / NT/T
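The table can be cross-checked with a small simulation. Below is a minimal C sketch of the (1,1) correlating predictor (all names are mine): each branch keeps two 1-bit predictions, indexed by whether the most recent branch (of either kind) was taken. For the d sequence 2, 0, 1, 2, 0 it predicts 6 of the 10 branch outcomes correctly, matching the mispredictions visible in the table.

```c
/* Two 1-bit predictions per branch:
   bit[0] is used when the previous branch was not taken,
   bit[1] when it was taken. 0 = predict NT, 1 = predict T. */
typedef struct { int bit[2]; } Pred;

/* Run the loop body for each d value and count correct predictions.
   b1 (bnez r1) is taken when d != 0; b2 (bnez r3) is taken when
   d != 1, evaluated after the fall-through d = 1 update. */
int count_correct(const int *d_vals, int n) {
    Pred b1 = {{0, 0}}, b2 = {{0, 0}};   /* initially predict NT everywhere */
    int last = 0;                        /* outcome of the previous branch */
    int correct = 0;
    for (int i = 0; i < n; i++) {
        int d = d_vals[i];

        int b1_act = (d != 0);                     /* b1: bnez r1 */
        if (b1.bit[last] == b1_act) correct++;
        b1.bit[last] = b1_act;                     /* 1-bit predictor update */
        last = b1_act;

        if (d == 0) d = 1;                         /* fall-through: d = 1 */
        int b2_act = (d != 1);                     /* b2: bnez r3 */
        if (b2.bit[last] == b2_act) correct++;
        b2.bit[last] = b2_act;
        last = b2_act;
    }
    return correct;
}
```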