Some Homework Questions 2016

Others may be discussed during class hours. This file will be updated as we progress in the course.

Q1) Create a 3-4 slide PPT that compares MIPS with ARM’s recent ISA. Present it the way we presented the MIPS ISA in the lecture, focusing on the types of instructions and their encodings.

Q2) Run the Geekbench 3 benchmark suite on your computer to measure CPU performance. Compare your results with those of your project colleagues and discuss the speedups and the reasons for the differences.
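
For the comparison, a speedup is simply the ratio of two scores, since Geekbench scores are higher-is-better. The sketch below is only an illustration: the function name, the workload names, and all score values are made-up placeholders, not real measurements.

```python
def speedup(score_a: float, score_b: float) -> float:
    """Speedup of machine A over machine B for a higher-is-better score."""
    if score_b <= 0:
        raise ValueError("reference score must be positive")
    return score_a / score_b


# Placeholder scores only (hypothetical values, replace with your own results).
my_scores = {"single_core": 3000.0, "multi_core": 9000.0}
peer_scores = {"single_core": 2500.0, "multi_core": 4500.0}

for workload in my_scores:
    ratio = speedup(my_scores[workload], peer_scores[workload])
    print(f"{workload}: {ratio:.2f}x")
```

A ratio above 1 means the first machine is faster on that workload; comparing the single-core and multi-core ratios is one way to start the discussion of why the machines differ.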

Q3) Assume two embedded processors that are identical except for their memory systems.

One uses CAM caches and the other uses 2-way RAM-tag caches, with the same block size and overall size. Assume that in the I$ one can eliminate 60% of all tag checks, and that tag power is otherwise 65% of total read power in the CAM cache but only 35% in the 2-way cache. There are no cache misses. The CAM and the tag array have 32 lines, as in the lecture version. The power of a CAM lookup is proportional to the number of rows it needs to check. Assume that the power of a tag lookup is proportional to the associativity (how many rows you check). Which CPU is more efficient and why? Under what assumptions? Make up other assumptions if you need to.

Q4) A single-issue processor fetching one instruction per cycle consumes 25% of its energy in the instruction memory system; all accesses in the application considered are from the instruction cache, i.e., there is no DRAM access, and only one instruction is fetched per cycle. Now assume a branch predictor that has roughly the same power-per-access as a single instruction cache fetch due to a similar associative organization. 20% of all instructions executed are branches. A single instruction fetch and a branch prediction each take one cycle to complete.

(i) How much energy is consumed (in processor energy %) in the branch predictor? Assume that all branches are predicted correctly and CPI is 1. (Hint: this is an easy question; don’t overthink it).

(ii) Now assume that all branches are predicted Taken in the Decode stage. The Decode stage can calculate the target address, but whether the branch was predicted correctly is known only in the Execute stage. The pipeline is 5 stages long with Fetch, Decode, Execute, Memory, and Write-back. Assume that 60% of the branches are conditional taken, 30% are conditional untaken, and 10% are unconditional. 20% of all instructions are branches. No branch delay slots could be filled.

First list the penalty in cycles (if any) for conditional taken, conditional untaken, and unconditional branches. Then calculate the CPI, assuming that the CPI is 1 without branch penalty. (Hint: for the CPI calculation, imagine 100 instructions. Think about the execution time with and without branch penalty).

Penalty conditional taken:

Penalty conditional untaken:

Penalty unconditional branch:

New CPI:
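
To check your arithmetic, the general relation is: new CPI = base CPI + (fraction of branches) x (average penalty per branch). The sketch below is a hypothetical helper, not part of the assignment; its name and parameters are assumptions, and the numbers in the usage example are deliberately different from those in the question.

```python
def effective_cpi(base_cpi, branch_fraction, branch_mix):
    """Return the CPI once branch penalties are added.

    branch_mix is a list of (share_of_branches, penalty_cycles) pairs;
    the shares must sum to 1.
    """
    total_share = sum(share for share, _ in branch_mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("branch shares must sum to 1")
    # Average number of stall cycles added per branch instruction.
    avg_penalty = sum(share * penalty for share, penalty in branch_mix)
    return base_cpi + branch_fraction * avg_penalty


# Illustrative numbers only (not the ones from the question):
# 10% branches, half of them paying a 2-cycle penalty, half paying none.
print(effective_cpi(1.0, 0.10, [(0.5, 2), (0.5, 0)]))
```

Plugging in your own per-type penalties and the branch mix from the question gives the value to compare against your hand calculation.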

Q5) (i) Assuming the following 2-bit predictor (see below), identify a pattern of 6 branches (e.g., in T and N sequence notation) that would be poorly predicted (less than 10% accuracy). Assume that the predictor is originally in the T*N state and that your first branch is not taken (N).

Propose a different 2-bit scheme that would handle the original pathological pattern better than this predictor by at least 40%. You also need to point out the weakness of your new scheme -- if any, e.g., its pathological case. To get full credit you need to show (a) the new state machine, and (b) patterns. (Hint: we studied in class one example of a 2-bit scheme that would also work).

Baseline Scheme (above) Your scheme (show above):

Pathological pattern:

Weakness of new scheme, e.g., pattern:
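
To test candidate patterns quickly, a small simulator can help. The sketch below models a standard 2-bit saturating counter; this is an assumption for illustration only and is not the baseline scheme from the figure, whose state machine you would have to encode yourself. The function name and the state numbering (0 = strongly not taken, 3 = strongly taken) are likewise assumptions.

```python
def saturating_counter_accuracy(pattern: str, start_state: int = 2) -> float:
    """Fraction of branches in `pattern` (a string of 'T'/'N') predicted
    correctly by a 2-bit saturating counter starting in `start_state`."""
    if not pattern or any(ch not in "TN" for ch in pattern):
        raise ValueError("pattern must be a non-empty string of 'T' and 'N'")
    if start_state not in (0, 1, 2, 3):
        raise ValueError("start_state must be 0, 1, 2, or 3")

    state = start_state
    correct = 0
    for outcome in pattern:
        taken = outcome == "T"
        predicted_taken = state >= 2  # states 2 and 3 predict taken
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(pattern)


# A single not-taken branch in a run of taken branches costs one misprediction.
print(saturating_counter_accuracy("TTTNTT", start_state=3))  # 5 of 6 correct
```

Replacing the transition rule with the one from your own state machine lets you compare schemes on the same 6-branch pattern.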

(ii) Describe how you can save the energy attributed to branch prediction. Mention at least two methods we discussed in class. (Hint: think about how you can make branch prediction more efficient).

Method 1:

Method 2: