ELEG 652 Principles of Parallel Computer Architectures Handout # 6Problem Set # 4

Issued: Wednesday, October 19, 2005

Due: Wednesday, November 2, 2005

Please begin your answer to every problem on a new sheet of paper. Be as concise and clear as you can. Make an effort to be legible. To avoid misplacement of the various components of your assignment~ make sure that all the sheets are stapled together. You may discuss problems with your classmates, but all solutions must be written up independently.

The goal of this assignment is to gain experience in fine tuning code for performance on a particular platform. At this point we are still working with single processors.

The link where you will go to retrieve files used in this assignment. In this link, you should download the files matrix.c.

Your assignment, basically, is to make this code run as fast as possible on one platform: a Sun Ultra-10. You should consider both compiler switches and code changes. Write a report describing your efforts. You should show evidence of a comprehensive search, e.g., describe each optimization or code change you try and why you think it will improve performance, and then show the actual results.

First, compile the code with base optimization, using cc (Sun's compiler) on the Sun. You are also encouraged to try GCC.

Experiment with various compiler options. See which combination produces the best results. For SUN's cc compiler, the options are available in the materials you can get from TA. If you also determine to try gcc, you should read the man page of gcc.

Now, experiment with modifying the code itself, e.g., to make better use of the memory hierarchy. Again, try out various compiler options to improve your results. You are required to get at least 180 MFLOPS for matrix multiply of 1024 by 1024 size matrix on the SUN workstation. To get higher performance than 280 MFLOPS is encouraged. And please put the highest number of MFLOPS you get on the cover of your homework report.

In order to make meaningful comparisons of results, I would like you to produce your numbers on one machine types only: the Suns in 133.

The machines in 133 are:

Magnemitepikachulickitungwigglytuffmagnetonzapdos

rattata snorlax jolteon electabuzzraichujigglypuff

tauroselectrode voltorb pidgey

All hostnames end with .acad.ece.udel.edu (e.g., pidgey.acad.ece.udel.edu).

You can SSH into these machines from any other machine. Be sure to type ``finger'' to see if anyone else is on the machine. Try to find unoccupied machines as this will give more meaningful numbers (and you will get them faster).

For completing this assignment, you should read the materials you copied from TA, and you may need to read the contents about memory hierarchy in textbooks.

Produce a table that shows that IPC of several modern architectures. Instructions per Clock cycles isa measurements of how many instructions can be executed in one cycle which is a very good indication of how well the multi-issue hardware is doing. This quantity can be produced by taking the ratio of a MIPS (Millions of Instruction per Second) by the frequency of the processor. Your assignment is:

Give the frequencies of 5 types of modern processors
Give the corresponding MIPS values for the five types of processors. Be aware, find the benchmark that produced this result and make sure that all your MIPS values are calculated using this benchmark. Mention the benchmark as part of the answer for this question.
Create a table with all these values plus an extra column that have the IPC for each machine

It is critical that the scoreboard be able to distinguish RAW and WAR hazards, since a WAR hazard requires stalling the instruction doing the writing until the instruction reading an operand initiates execution, while a RAW hazard requires delaying the reading instruction until the writing instruction finishes - just the opposite. For example, consider the sequence:

MULTD F5, F6, F4

SUBD F8, F5, F3

ADDD F3, F10, F3

The SUBD depends on the MULTD (a RAW hazard) and thus the MULTD must be allowed to complete before the SUBD; if the MULTD were stalled for the SUBD due to the inability to distinguish between RAW and WAR hazards, the processor will deadlock. This sequence contains a WAR hazard between the ADDD and the SUBD, and the ADDD cannot be allowed to complete until the SUBD begins execution. The difficulty lies in distinguishing the RAW hazard between MULTD and SUBD, and the WAR hazard between the SUBD and ADDD. Describe how the scoreboard avoids this problem. (Hint: Think about how WAW hazards are prevented and what this implies about active instruction sequences.)

You have learned about a generic superscalar architecture in class. The reorder-buffer mechanism described in class is used in such a superscalar architecture to support the so-called “micro-dataflow” for out-of-order issuing/execution. Briefly describe how flow-, output- and anti- dependences are resolved in this scheme for the following example:

...

R1 = R7 + R0

R2 = 7 + R1

R3 = R2 * R1

R1 = 9*R2

You may want to explain your scheme by drawing a dependence graph and demonstrating the contents of the reorder buffer and the reservation stations for a few key steps as we did in class. State your assumptions on timing of instruction completion in your description.

Briefly compare reorder-buffer mechanism with the scoreboard mechanism in terms of their functions and cost.