1 Use the Following Code Fragment

Homework 2

Due March 26

1 Use the following code fragment:

Loop: LD R1, 0(R2) ; load R1 from address 0+R2

DADDI R1, R1, #1 ; R1=R1+!

SD 0(R2), R1 ; store R1 at address 0+R2

DADDI R2, R2, #4 ; R2=R2+$

DSUB R4, R3, R2 ; R4=R3-R2

BNEZ R4, Loop ; branch to Loop if R4!=0

Assume that the initial value of R3 = R2+396.

Throughout this exercise use the classic RISC five-stage integer pipeline and assume all memory accesses take 1 clock cycle.

a. Show the timing of this instruction sequence for the RISC pipeline without any forwarding or bypassing hardware but assuming a register read and a write in the same clock cycle. Use a pipeline timing chart like Figure A.10. Assume that the branch is handled by simply stall the pipeline, and the branch condition checking and target address calculation are performed in the second stage. How many cycles does this loop take to execute?

b. Show the timing of this instruction sequence for the RISC pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. How many cycles does this loop take to execute?

2. We will now add support for register-memory ALU operations to the classic five-stage RISC pipeline. To offset this increase in complexity, all memory addressing will be restricted to register indirect (i.e., all addresses are simply a value held in a register: no offset or displacement may be added to the register value). For example, the register-memory instruction ADD R4, R5, (R1) means add the contents of register R5 to the contents of the memory location with address equal to the value in register R1 and put the sum in register R4. Register-register ALU operations are unchanged. Answer the following for integer RISC pipeline.

a. List a rearranged order of the five traditional stages of the RISC pipeline that will support register-memory operations implemented exclusively by register indirect addressing.

b. For the reordered stages of the RISC pipeline, what new data hazards are created by this addressing mode? Give an instruction sequence illustrating each new hazard.

3. Consider the following MIPS assembly code.

LD R1, 45(R2)

DADD R7, R1, R5

DSUB R8, R1, R6

OR R9, R5, R1

BNEZ R7, target

DADD R10, R8, R5

XOR R2, R3, R4

a. Identify each dependence by type (data or control); list the two instructions involved; identify which instruction is dependent; and, if there is one, name the storage location involved.

Hint: this question just checks the dependence between two instructions (not just two consecutive instructions). It has nothing to do with pipelining. Do not assume that this code will be executed in a five-stage pipeline. Please use a table to list all dependence. Below is an example

Dependence / Independent instruction / Dependent instruction / Storage location
1 / Data / LD R1,… / DADD R7, .. / R1
2 / Data / LD R1,… / DSUB / R1
3
.
.

b. Use information about the MIPS five-stage pipeline from Appendix A and assume a register file that writes in the first half of the clock cycle and reads in the second half-cycle forwarding. Which of the dependences that you found in part (a) become hazards and which do not? Why?

Hint: I would like to see answers like the following:

From the table in part (a), dependences #?, #?, .. and #? will (or will not) become hazards since …

You may need to consider the cases with forwarding and without forwarding.

4. The following loop computes Y[i] = a ´ X[i] + Y[i], the key step in a Gaussian elimination. Assume the pipeline latencies from Figure 2.2 (or the following table) and a 1-cycle delayed branch.

Loop: L.D F0, 0(R1) ; load X[i]

MUL.D F0, F0, F2 ; multiply a*X[i]

L.D F4, 0(R2) ; load Y[i]

ADD.D F0, F0, F4 ;add a*X[i]+Y[i}

S.D 0(R2),F0 ;store Y[i]

DSUBUI R1, R1,#8 ;decrement X index

DSUBUI R2, R2, #8 ;decrement Y index

BNEZ R1,loop ; loop if not done

Assume a single-issue pipeline. Unroll the loop as many times as necessary to schedule it without any delays, collapsing the loop overhead instructions. Show the schedule. What is the execution time per element?

Hint: you need to unroll the loop twice. “per element” means “per iteration”.