There Are Two Parts in This Assignment

Advanced Computer Architecture
HW #1

Author: Andreas Moshovos

There are two parts in this assignment

NOTE: This assignment refers to several source and executables that you need to complete it. Most are available through the web site. The gcc port for simplescalar is available on the CD we provided (if you untar it, it will install on /usr/local under CYGWIN). We will provide the same tarball on the web site but it is rather large.

In this assignment you will be exposed to using and modifying Simplescalar’s functional simulator

In this assignment you will modify the functional simulator from the Simplescalar simulation suite. You are asked to collect certain statistics about program execution. There are other, more detailed simulators in the Simplescalar suite. We will start with the functional simulation not only because it’s the simplest, but also because it forms the building-block for all other simulators. A functional simulator simulates the function of the program. It executes one instruction after the other with no notion of time. Instructions execute one after the other, as assumed by sequential execution semantics. In contrast, a timing simulator is in addition concerned with time, that is how long it takes to complete each instruction. The core of the functional simulator is basically a loop with a huge switch statement where every case corresponds to a different instruction type. The register file is simulated via an array and memory is simulated via another array (actually implemented as a two-level array: first level is indexed by page number, second level is dynamically allocated only for pages actually touched by the simulated program). System calls are forwarded to the actual host OS (more info in class).

While Simplescalar is a fairly elaborate and lengthy piece of software, still there is no need to not panic. A detailed, step-by-step description of all the necessary modifications for collecting the first set of statistics is given in this homework. This is why there are so many pages.

Equipment Required: You will need access to an x86 linux or Cygwin machine to complete this assignment. This assignment requires the use of benchmarks that are compiled for a little-endian machine such as an x86 one. You may be able to complete this assignment on other little-endian architectures. However, this may require various changes in the Simplescalar code (which of course, you will have to figure out on your own).

Structure: First, this assignment shows you how to modify the simulator using an example. Then, you are asked to develop code for collecting a different set of statistics. You will have to figure out how to do the latter on your own. Please plan ahead. This is not something you want to attempt the night before the due date. Minor details not necessarily related to the core of this assignment may take more time than expected.

Here’s the statistics we are interested in. A detailed description of all required modifications follows:

Assume an ideal 5-stage pipeline (similar to the DLX pipeline in H&P ch. 3). In this pipeline accessing memory for instructions or data takes a single cycle. Moreover, an instruction access and a data access can proceed at the same cycle. For a very good reason we need to measure how often a load use hazard occurs in typical programs. What is a load hazard? It occurs when an instruction that immediately follows a load has to use the value read by the load. Recall that in a typical 5-stage pipeline memory accesses occur in the MEM stage (4th stage). In the same pipeline, registers are used in the beginning of the EX stage (3rd stage). An instruction following a load has to wait for a cycle before the load value is read. The following figure illustrates a load use hazard:

Cycle / N / N+1 / N+2 / N+3 / N+4 / N+5
ld $1, 10($2) / FETCH / DECODE / EXEC / MEM / WB
add $3, $1, 1 / FETCH / DECODE / STALL / EXEC / MEM

Here’s the main idea on how to collect the desired statistic. Every time an instruction is executed we can simply check whether one of its inputs is the target of an immediately preceding load. If so, we simply count one more load use hazard. To get this done, we will modify the functional simulator to collect the source register numbers for all instructions and to also identify loads and their target registers.

Here’s a step-by-step description on how this can be done (there are other ways this is one that works):

a) Create a directory where you will copy the simulator files. For clarity, let’s name this directory ~/mysim. The commands to do so are:

cd ~

mkdir mysim

b) Copy the simulator files using:

cd ~/mysim

gunzip –c simplesim-3v0c.tgz | tar xvf -

c) Before you attempt any changes, first make sure that your simulator copy compiles and works correctly without any modifications. Use the following command to compile the functional simulator:

cd simplesim-3.0

make config-pisa

make sim-safe

The first command configures simplescalar to use the PISA instruction set. There are other instruction set supported such as ALPHA. For most of our assignments we will use PISA since it is somewhat simpler while still being a representative of commercial load/store instruction sets.

There many simulators in the Simplescalar suite. We will be working with “sim-safe” which is a functional simulator. “Sim-fast” is pretty much the same simulator, however, it does not include many of the sanity checks included in sim-safe. For example, sim-fast would never complaint if your code jumps into illegal memory addresses.

We can use sim-safe to measure load use hazards without having to write the complete code for a pipeline. This is a standard trick in simulation. In general, simulating is different than building hardware. If you are interested in studying a phenomenon you may be able to do so (and save a lot of time and effort) by thinking whether it is possible to collect the necessary statistics without necessarily simulating how the hardware would do something but instead simulating the resulting behavior. For example, in a typical five stage pipeline writeback would occur in the 5th stage. In hardware, each instruction would go through every stage with the appropriate information stored in a set of inter-stage latches. In simulation, this behavior can be simulated simply by delaying the writeback for five “cycles” (i.e., we do not need to have code for the latches and copy values every “cycle”). Also note that Simplescalar performs execution-driven simulation. That is, it actually simulates the execution of each instruction. This is in contrast to trace-driven simulation where a trace of a previous execution is used to retrace the path followed when the program executed in the past.

If you get “warnings” while compiling “sim-safe” you can safely ignore them. If you get errors, then something went terribly wrong (this is not supposed to happen).

d) Check that the compiled simulator works. You can invoke it as follows:

./sim-safe [arguments] executable [executable arguments]

A precompiled, short running program is provided through the website for your convenience. Run it as follows:

./sim-safe msgs

You should see a nice message along with numerous statistics about the program. Try running count also.

e) Now it’s time to start making modifications. We will be working with “sim-safe.c” which is the core implementation of the sim-safe simulator. Much of the code in “sim-safe.c” can be ignored for completing this assignment. Because Simplescalar was written to facilitate multiple simulators there is a set of functions that have to be provided by any simulator implementation even though many of them are not always used.

The core of instruction execution occurs in function “sim_main()”. Go ahead and edit “sim-safe.c” with your favorite editor and locate “sim_main()”. After a couple of statements in “sim_main()” you will find an infinite loop starting with “while (TRUE)” statement. Each iteration of this loop corresponds to the execution of one instruction. The first statement in this loop is (ignore all code that is within an #ifdef TARGET_ALPHA…#endif):

regs.regs_R[MD_REF_ZERO] = 0;

This statement sets register $0 to 0. In the MIPS instruction set architecture (Simplescalar uses a derivative of the MIPS ISA) register $0 is always 0. The table “regs.regs_R[]” implements the general purpose register file. This structure has another array “regs.regs_F[]” which implements the floating point register file. Look in “regs.h” and “regs.c” if you wish to learn more (but, leave this for later and this is not necessary – actually, it is good practice to abstract away parts of the code that are not of immediate interest. This is the only hope one has to work with existing, lengthy and complex code).

The next statement reads an instruction from memory:

MD_FETCH_INST (inst, mem, regs.regs_PC);

This macro is defined in “machine.h” which contains many, PISA specific functions. Suprisingly, variable “mem” implements main memory. Again, you may look in “memory.c” if you care for details. You should probably avoid the temptation and assume that mem is simply a gargantuan array (it’s almost that, as it is implemented as via an array of pointers pointing to 4K chunks of simulated memory. This is similar to page tables as we will learn later on in the course). The above statement reads an instruction (which is of size “sizeof(md_inst_t)” from address “regs.regs_PC”. “Regs.regs_PC” is an integer simulating the program counter.

Following this statement is one that keeps a tally of executed instructions followed with some bookeeping and instruction opcode determination statements:

MD_SET_OPCODE (op, inst);

This command, extracts the opcode field out of the raw instruction. This field specific what the instruction does (e.g., addition, load, etc.). It is defined in “machine.h”.

Instruction execution takes place in the switch statement that follows. You will see that this switch statement uses the instruction’s opcode (switch (op)) to determine what needs to be done. If at first the code within the switch statement looks weird don’t worry. You are absolutely right! Just keep reading on and soon all will make sense. The idea is that the switch statement oddly enough contains one “case” statement per instruction opcode. Within each of this “case” statements are command that simulate instruction execution.

In the switch statement you will find a couple of “#define” statements defining various macros. The one to pay attention to is the:

#define DEFINST(OP, MSK, NAME, OPFORM, RES, FLAGS, O1, O2, I1, I2, I3)

This is defined to be “SYM_CAT(OP,_IMPL)”. “SYM_CAT(X,Y)” is a macro that expands to XY which is then also treated as a macro that is expanded by the preprocessor (I think this is gcc specific). As a result, whenever DEFINST is called, it expands into OP_IMPL where “OP” is the first argument passed into DEFINST. For example, the statement “DEFINST(ADD,….)” expands into “ADD_IMPL” which is also treated as macro and expanded. After these definitions, there is an “#include “machine.def”” statement. View this file and you will see that it contains a macro definition and a macro call per instruction opcode. The first macro is of the form “X_IMPL” where “X” is an opcode such as ADD or SUB. The second macro, is a call to “DEFINST(X,…)” Returning back to “sim_main()” and the switch statement, we can now understand what happens when this code passes through the preprocessor. As a result of the definition of “DEFINST” in sim-safe.c, every call to DEFINST from machine.def expands into the following code:

Case OP: OP_IMPL; break;

Next the preprocessor replaces each of these OP_IMPL, with the appropriate statements as defined in the “machine.def”. As a result, after preprocessing the “switch (op)” statement expands into a huge switch statement with a case statement for every possible opcode. Within each case, the code provided in “machine.def” is included. You can see the expanded code (doesn’t look pretty) via the following command:

gcc –E sim-safe.c –o sim-safe.pcc.

This should produce a “sim-safe.pcc” file which you may view. Search for
“sim_main()” in that file and then take a look at the expanded switch statement. On another note, if you want to see what gcc does to compile your programs use the “-v” option. In fact, gcc is just a front-end that calls a variety of other programs.

The “OP_IMPL” macros utilize other macros to access machine state including the register file and the memory. These have to be provided in the simulator. “Sim-safe.c” includes appropriate definitions for our purposes. For example, to read a register, OP_IMPL uses the “GPR(N)” macro, while it uses the “SET_GPR(N)” to write to a register. “GPR(N)” is defined simply as “regs.regs_R[N]” in “sim-safe.c”. That is, to read register N we just access the N element of the array regs.regs_R[] which implements the register file. There are other macros, that deal with floating point registers and memory which you can find in “sim-safe.c” starting at line 239.

A more detailed description of how to use the “DEFINST” macro is given at the beginning of “machine.def”. For our purposes, suffices to note that “O1” and “O2” are the target register numbers, while “I1”, “I2” and “I3” are the source register numbers for each instruction. There are 2 targets and 3 sources as in PISA an instruction may write up to 2 registers (load double or multiply) and read up to 3 (e.g., store double). Most instructions read up to 2 registers and write just one. The macro DNA is used to indicate that the instruction does not write or read the corresponding target or source register (i.e., if the instruction only writes one target register).