CPE 631 Advanced Computer Systems Architecture:
Homework #2

Issued: 02/05/03
Due: 02/19/03

Q#1. Exercise 5.4 (Textbook, p. 515)

Q#2. (a) Write a simple C program for matrix multiplication MC = MA x MB. Matrices MA, MB, and MC are square, with NxN elements of type double. Let N be an input parameter of your program. Using Intel’s VTune Performance Analyzer (see Getting Started Tutorial: Intel’s VTune Performance Analyzer: Ajay&Swathi Tutorial), measure the number of clock cycles for program execution, the number of instructions executed, the number of load and store instructions retired, and the number of L1 and L2 cache misses. Assume N = 512.
(b) Write a C program for matrix multiplication MC = MA x MB that uses a blocked algorithm to reduce the cache miss rate (Textbook, p. 433). Let N and B (the blocking factor) be input parameters of your program. Repeat the same measurements as above for different values of the blocking factor B.
(c) Find the optimal value of the blocking factor on your machine. Discuss the results.
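
As a starting point for parts (a) and (b), here is a minimal sketch of both kernels. It is not the required solution: the function names, the row-major one-dimensional array layout, and the jj/kk/i/j/k loop ordering are assumptions made for illustration, so compare the blocked version against the one in the textbook before relying on it.

```c
#include <stdio.h>
#include <stdlib.h>

/* Naive triple loop: MC = MA x MB, all matrices n x n, stored row-major. */
void matmul_naive(int n, const double *ma, const double *mb, double *mc)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += ma[i * n + k] * mb[k * n + j];
            mc[i * n + j] = sum;
        }
    }
}

int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Blocked version with blocking factor b (assumed b >= 1).
 * Each (jj, kk) pair selects a b x b sub-block of MB that is reused
 * for every row i before the loops move on to the next sub-block.
 * Partial sums are accumulated into MC, so MC is cleared first. */
void matmul_blocked(int n, int b, const double *ma, const double *mb, double *mc)
{
    for (int i = 0; i < n * n; i++)
        mc[i] = 0.0;

    for (int jj = 0; jj < n; jj += b) {
        for (int kk = 0; kk < n; kk += b) {
            for (int i = 0; i < n; i++) {
                for (int j = jj; j < min_int(jj + b, n); j++) {
                    double sum = 0.0;
                    for (int k = kk; k < min_int(kk + b, n); k++)
                        sum += ma[i * n + k] * mb[k * n + j];
                    mc[i * n + j] += sum;
                }
            }
        }
    }
}
```

A driver would read N and B from the command line (for example with atoi), allocate the three matrices with malloc, fill MA and MB, and call one of the two functions. Treating B=0 as a request for the naive version is one possible convention that matches the "base" column of the results table.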

Turn in: source files for your programs and results from VTune.

N=128         | B=0 (base) | B=4 | B=8 | B=16 | B=32 | B=64 | B=128 | B=256
#Clock cycles |            |     |     |      |      |      |       |
#Instructions |            |     |     |      |      |      |       |
#Load&Stores  |            |     |     |      |      |      |       |
#L1 Misses    |            |     |     |      |      |      |       |
#L2 Misses    |            |     |     |      |      |      |       |

Additional credit will be awarded if this analysis is also done for N = 256, 768, or 1024, or for different computing systems.

Q#3. Simulate at least two SPEC CPU2000 benchmarks using SimpleScalar’s functional simulator sim-cache. Consider the following cache configurations:
L1I (8KB, direct-mapped, 64B cache line) + L1D (8KB, direct-mapped, 64B cache line), L2U (32-512KB, 64B cache line, LRU replacement policy, 4-way set-associative).
As the workload, use the reference inputs of the applications you selected. To make the simulations feasible, use command-line options to skip the first 500 million instructions (-fastfwd 500000000, as in the example command in the notes below) and then simulate 500 million instructions (-max:inst 500000000).
Plot a graph showing the number of L2 misses per 1000 instructions as a function of the L2 cache size (the Y-axis shows the number of misses per 1K instructions; the X-axis shows the L2 cache size).
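
The metric for the plot is simple arithmetic: misses per 1000 instructions = 1000 x (L2 misses) / (instructions simulated). A small sketch of that calculation follows; the function names are hypothetical, and the miss and instruction counts are whatever you read from your own sim-cache output files.

```c
#include <stdio.h>

/* L2 misses per 1000 simulated instructions.
 * Returns 0.0 when no instructions were simulated to avoid dividing by zero. */
double misses_per_kilo_inst(unsigned long long misses, unsigned long long insts)
{
    if (insts == 0)
        return 0.0;
    return 1000.0 * (double)misses / (double)insts;
}

/* Print one data point of the plot: L2 size in KB and misses per 1K instructions. */
void print_plot_point(unsigned l2_size_kb, unsigned long long misses,
                      unsigned long long insts)
{
    printf("%u KB\t%.3f\n", l2_size_kb, misses_per_kilo_inst(misses, insts));
}
```

Calling print_plot_point once per L2 size produces a two-column listing that can be pasted into a spreadsheet or plotting tool.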

Turn in: results, command files, and output files from the sim-cache simulator.

Notes on using SimpleScalar

(1) Read the SimpleScalar Tutorial.

(2) Download 631ssAlpha.tgz (for Linux, 20,033,842 bytes) or 631ssAlpha-Cygwin.tgz (for Cygwin, 19,844,137 bytes). Unpack it (tar xvzf 631ssAlpha.tgz, or the corresponding Cygwin archive name).
This archive includes all necessary simulators from the SimpleScalar tool suite and Alpha binaries of the SPEC CPU2000 benchmarks. Some of the simulators have been modified by members of our research group; for example, sim-cache has been modified to allow you to skip a specified number of instructions.

If you have a PC running Linux, you might want to install the full SimpleScalar suite, which includes a program development environment for the PISA instruction set architecture (MIPS-like) and the ARM instruction set architecture. Links are on the course Web site.

(3) Be sure that you have SPEC CPU2000 (the SED contact person is Mr. David Austin). You can install it or just copy it. Let’s say that the home directory of SPEC CPU2000 is $SPEC_HOME.

(4) Steps to do:

# create a working directory

mkdir work

cd work

mkdir 172.mgrid # e.g., you want to simulate 172.mgrid application

cd 172.mgrid

# now you can copy inputs for this application
# into your working directory;

# with Cygwin you can use Explorer to copy the necessary input file mgrid.in

cp $SPEC_HOME/spec_cpu2000/benchspec/CFP2000/172.mgrid/data/ref/input/mgrid.in .

# let’s say $SS_HOME is where you unzipped 631ssAlpha

# to run simulation type in (one line command)

$SS_HOME/631ssAlpha/mysimplesim_pff_log/sim-cache -fastfwd 500000000 -max:inst 500000000 -redir:sim u2_32KB.txt -cache:il1 il1:128:64:1:f -cache:dl1 dl1:128:64:1:f -cache:il2 dl2 -cache:dl2 ul2:128:64:4:l $SS_HOME/631ssAlpha/spec2000binaries/mgrid00.peak.ev6 < mgrid.in

# this will run the sim-cache functional cache simulator for the mgrid00 SPEC CPU application;
# input for this application is given in the file mgrid.in

# cache arguments have the form <name>:<number of sets>:<line size>:<associativity>:<replacement>,
# so the tested cache configuration is 8KB L1I, 8KB L1D, and 32KB L2U;
# the first 500M instructions will be skipped, and then 500M simulated.

# you can prepare a command file, e.g., 172mgrid.sh to include command lines for
# all runs for your homework (u2 is 32KB, 64KB, ...).
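
When preparing the command file, the only argument that changes between runs is the L2 configuration. The sketch below computes the number of sets for each L2 size, assuming the usual SimpleScalar cache argument format <name>:<nsets>:<bsize>:<assoc>:<repl>, so that nsets = cache size / (line size x associativity). The function names are hypothetical, and sweeping the sizes in powers of two from 32KB to 512KB is an assumption about how to cover the 32-512KB range.

```c
#include <stdio.h>

/* Number of sets for a cache of size_bytes with the given line size and associativity. */
unsigned cache_nsets(unsigned size_bytes, unsigned line_bytes, unsigned assoc)
{
    return size_bytes / (line_bytes * assoc);
}

/* Write the L2 argument for a 4-way, 64B-line, LRU cache of size_kb kilobytes,
 * e.g. "ul2:128:64:4:l" for 32KB. Returns the value returned by snprintf. */
int l2_config(char *buf, size_t len, unsigned size_kb)
{
    return snprintf(buf, len, "ul2:%u:64:4:l",
                    cache_nsets(size_kb * 1024u, 64u, 4u));
}

/* Print the arguments that differ between runs, one line per L2 size. */
void print_l2_sweep(void)
{
    char cfg[64];
    for (unsigned kb = 32; kb <= 512; kb *= 2) {
        l2_config(cfg, sizeof cfg, kb);
        printf("-redir:sim u2_%uKB.txt -cache:dl2 %s\n", kb, cfg);
    }
}
```

Under the same assumption, an 8KB direct-mapped L1 cache with 64B lines has cache_nsets(8 * 1024, 64, 1) = 128 sets. The printed fragments are meant to be pasted into the full sim-cache command line; the rest of the command stays unchanged.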

(5) To learn more about:
- SimpleScalar (

- SPEC (

- Inputs for SPEC CPU applications
(