ENGR 9861 HIGH PERFORMANCE COMPUTER ARCHITECTURE

Final examination :: RV :: Apr. 19, 2005 2 PM – 4.30 PM

Attempt all questions. Make suitable assumptions, where necessary, and clearly state them.

Total Marks: 100

1.Name the three major parallel programming models, and explain (using fewer than 100 words and diagrams) each of them.

2.Consider a simple 2D finite difference scheme where at each step every point in the matrix is updated by a weighted average of its four neighbours:

A[i, j] = A[i, j] – w(A[i-1, j] + A[i+1, j] + A[i, j-1] + A[i, j+1]).

All the values are 64-bit floating point numbers. Assuming one element per processor and 1024 x 1024 elements, how much data must be communicated per step? Explain how this computation could be mapped onto 64 processors so as to minimize the data traffic. Compute how much data must be communicated per step in this case.

3.Given the following code segments, say what results are possible (or not possible) under sequential consistency (SC). Assume that all variables are initialized to 0 before the code is reached.

a.

P1P2P3

------

A = 1u = Av = B

B = 1w = A

b.

P1P2P3P4

------.

A = 1u = AB = 1w = B

v = Bx = A

4.Describe the essential aspects of one other student project in the course (besides yours) using fewer than 100 words.

5.What are the two sufficient correctness requirements to guarantee program order in a shared memory multiprocessor (SMP) system? In one or two sentences, state how each of these two requirements is satisfied.

6.A modern microprocessor uses two L1 caches (IC, DC) and two TLBs (ITLB, DTLB) when the architecture permits virtual addressing. Many SMP architectures employ these processors to assemble a large system.

a)When we discuss cache coherency, we refer only to DC. Why?

b)What are the types of misses the data cache in an SMP system would suffer? It would be sufficient to name the major ones.

c)Give an example application where the data cache would be efficient in spite of the possibility of suffering all these misses.

d)We know that in certain problems where large multi-dimensional data arrays are used, the cache would become useless. Suppose you are determining the architecture of the processor used in an application-specific parallel computer for such an application. Would you still employ IC, ITLB and DTLB? Give reasons to support your answer.

7.What are the major considerations while determining the architecture for a massively parallel application-specific parallel computer?

8.Describe store-and-forward routing and cut-through routing in a multiprocessor interconnection network, and discuss why the latter is more attractive. Suppose the links are 1 byte wide and operating at 300 MHz in a network for which the average routing distance between nodes is log4 P for P nodes. Compare the unloaded latency for 80-byte packets under store-and-forward and cut-through routings, assuming 4 cycles of delay per hop to make the routing decision for the following values of P: 16, 32, 64, 128, 256, 512, and 1024.

9.Using diagrams where necessary, briefly describe two of the following interconnection topologies: torus, hypercube of dimension 4, butterflies.

10.Using a state transition diagram, describe how a basic three-state write-back invalidation protocol works. Using the above protocol, show how the state transitions and bus transactions for the scenario depicted in the figure on the following page. For each processor action, list the state of each processor, resulting bus action, and which entity supplies data.