April 13, 2004 Yu Hen Hu

Department of Electrical and Computer Engineering

University of Wisconsin – Madison

ECE 734 VLSI Array Structures for Digital Signal Processing

Homework #3 Solution

This homework consists of questions taken from the notes and open-ended questions. You must do the homework by yourself. No collaborations are allowed. Late homework will receive a 5% penalty per day. There are total 100 points. This homework is worth 10% of your overall grades.

  1. (10 points) Consider the following Fortran-like program:

C This is an implementation of back-substitution algorithm

DO 10 I=N,1,-1

S(I)=B(I)

IF I<N THEN

S(I)=S(I)-U(I,J)*X(J)

10 X(I)=S(I)/U(I,I)

(a)  (4 points) Convert this program into single assignment format with localized data communication.

Answer:

S(I,N)=B(I), 1 £ I £N

X1(I,I)=S(I,I)/U(I,I), 1 £ I £N

S(I,J-1)=S(I,J)-U(I,J)*X1(I,J), 1 £ I £ N-1, I £ J £N

X1(I-1,J)=X1(I,J), 1 £ I £ N-1, I £ J £N

X(J)=X1(1,J), 1 £ J £ N

(b) 
(3 points) Plot the iteration DG for N =5.


(c)  (3 points) Project this DG onto a linear systolic array by choosing a projection vector d and a linear schedule vector s. Perform node mapping, arc mapping and I/O mapping.

Answer: Let us assume d = [-1 0]T, s = [-1 -1]T. The dependence vectors are

. The processor space is P = [0 -1]T.

Node mapping: PT[i j]T = -j.

Arc mapping:

I/O mapping:

  1. (10 points) Consider the iteration DG given below:

For each pair of scheduling vector s and projection (assignment) vector d, (i) determine if they are permissible, (ii) if they are permissible, compute the processor space P, and perform the node mapping, arc mapping, and I/O mapping.

Answer: Note that the dependence matrix is: .

(a)  (3 points)

Answer: (i) sTD = [1 0 1], and sTd = 1. permissible. (ii) P = [0 1]T.

Node mapping: .

Arc mapping:

I/O mapping: Three inputs, one along [i, 1]T, one is [1, j]T, and the third one appear in both of them.

Input:

Output:

(b)  (2 points)

Answer: (i) sTD = [1 -1 0], and sTd = 1. NOT permissible.

(c)  (3 points)

Answer: (i) sTD = [1 1 2], and sTd = 1. permissible. (ii) P = [0 1]T.

Node mapping: .

Arc mapping:

I/O mapping: Three inputs, one along [i, 1]T, one is [1, j]T, and the third one appear in both of them.

Input:

Output:

(d)  (2 points)

Answer: Since sTd = 0, it is not permissible.

  1. (10 points) Given that the dependence vectors of a shift-invariant DG are:

(a)  (3 points) Determine which of the following schedule vectors are permissible linear schedules.

(i) s = [1 0]T, (ii) s = [1 -1]T, (iii) s = [1 1]T.

Answer: (i) and (iii) are permissible. (ii) is not permissible.

(b)  (5 points) Suppose that the DG consists of index points {(i, j); 0 £ i£ 5, 0 £ j £ 5, i + 2j £ 12}. Find a permissible schedule (needs not be the same as any of the schedule vector in part (a)) that minimizes the total computing time.

Answer:

Þ s1 > 0, s2 > 0, and s1+s2 >0.

Computing time

T = max. sT(p - q) +1

over all nodes p and q. Since s1 > 0, s2 > 0, and 0 £ i£ 5, 0 £ j £ 5, to maximize sT(p - q), one may set q = [0 0]T. Only 3 vertices of the dependence graph need to be examined: [5 3]T, [4 4]T, and [2 5]T. Choose the minimum possible values of s1, and s2, we set s1 = s2 = 1. Thus, the minimum computing time is 9. Note that the answer is NOT UNIQUE.

(c)  (2 points) Find a permissible schedule s and a project vector d that minimizes the pipelining period a = sTd.

Answer: Since a ³ 1, and s1 > 0, s2 > 0, the choice is sT = [1 1], and dT = [1 0] that yields a = 1.

  1. (10 points) Consider the correlation of two 1-dimensional sequences {x(n)} and {y(n)} with 0 £ n £ N-1.

(a)  (2 points, CC) Write a C program (or use any high level language) source code that compute {r(n)} for N =5, and n = 0, 1, 2.

Answer: any HLL will be acceptable.

For n = 0:2,

r(n) = 0;

For k = 0:5-n-1,

r(n) = r(n) + x(k)*y(k+n);

end

end

(b)  (3 points) Convert this program into single assignment format.

Answer:

For n = 0:2,

r1(n,0) = 0;

For k = 0:5-n-1,

r1(n,k) = r1(n,k-1) + x(k)*y(k+n);

end

r(n) = r1(n,5-n-1)

end

(c)  (3 points) Convert this program so that it has local data communications.

Answer:

x1(0,k)=x(k), x1(n,k)=x(n-1,k), 0£k£5, 0£n£2

y1(0,k)=y(k), y1(n,k)=y1(n-1,k+1), 0£k£5, 0£n£2

r1(n,-1)=0, r1(n,k)=r1(n,k-1)+x1(n,k)*y1(n,k), 0£k£5, 0£n£2

r(n) = r1(n,5-n-1) , 0£n£2

(d)  (2 points, CC) Plot the corresponding iterative DG.

Answer:

  1. (20 points) (Binary number multiplication) Consider two unsigned, n-bit binary numbers

A = An-1An-2 … A1A0, and B = Bn-1Bn-2 … B1B0, where Ai, Bi Î {0, 1}, 0 £ i £ n-1. Denote C = A ´ B = C2n-2C2n-3 … C1C0 to be an 2n-1 bit binary number representing the product of A and B.

(a)  (5 points) Derive an mathematical expression of Ck in terms of Ais and Bis.

Answer: Note that , and . Hence,

. By comparison of coefficients, we have

However, the range of may be greater than 1. Thus, a final evaluation of

with Ck Î {0, 1}

will be needed. To incorporate carry propagation, finer granularity of expressions will be needed. Consider the addition of two binary number p and q, with a carry-in cin. There will be two outputs:

Sum s = pÅ qÅ cin, and carry-out cout = pq + qcin + cinp.

Instead of computing Ck at once, we adopt a partial-product notation.

(b)  (3 points) Write a C or Matlab® program to compute Ck for 0 £ k £ 2n-2. Make sure this program has single-assignment format. Use variable indices as A(I), B(I), C(K), etc.

Answer:

FOR K=0:2N-2,

C1(K,-1)=0;

FOR I=MAX(0,K-N+1):MIN(K,N)

C1(K,I)=C(K,I-1)+A(I)*B(K-I);

END

C(K)=C1(K,MAX(K,N));

END

(c)  (3 points) Localize the program in part (b) by converting all variables that need to be broadcast (used by more than one iterations) during the execution into transmittal variables that will be passed along from an iteration to the next.

Answer:

C1(K,-1)=0; 0 £ K £ 2N-2

A1(K,0)=A(I); 0 £ K £ 2N-2, 0 £I £N-1

B1(K,0)=B(K); 0 £ K £ N-1

C1(K,I)=C1(K,I-1)+A1(K,I)*B1(K,I); 0 £ K £ 2N-2, 0 £I £N-1

A1(K,I)=A1(K,I-1); 0 £ K £ 2N-2, 0 £I £N-1

B1(K,I)=B1(K-1,I-1); 0 £ K £ 2N-2, 0 £I £N-1

C(k) = C1(k,N)

(d)  (3 points, CC) Plot the logic schematic diagram of the architecture to implement the operations of the loop body as specified in part (c). Note that A(I), B(I) are both binary variables.

(e)  (3 points) For n = 4, plot the DG of the localized program in part (c), and identify the critical path of this DG. A hardware implementation of this DG is called an array multiplier.

Answer:

(f)  (3 points) Use cut-set retiming to convert the DG directly into a 2D systolic array that implements a fully pipelined version of the array multiplier.

Answer: Use to transform the DG in part(e) so that it can be directly converted into a 2D systolic array.

  1. (10 points) Textbook, Chapter 7, Problem 10, pp. 214-215.

Answer:

Figure 7.15

Figure 7.16

  1. (10 points) Saturation arithmetic

Saturation arithmetic is a convenient, nonlinear method of handling overflow of fixed-point computation in DSP applications. Based on the types of two operands and one results, there are 8 possible combinations. In practice, the three commonly used options are sss, uuu, and uus where u stands for unsigned, and s stands for signed. In Intel MMX and SSE-2 instruction set, only uuu, and sss are implemented. The former is called unsigned saturation arithmetic, and the latter is called signed saturation arithmetic. Suppose now there are two packed operands each consists of four 16-bit words as shown below

Ra: / 58 / 14 / 12 / 77
Rb: / 22 / 192 / 118 / 36

The content is given in decimal format for convenience. Refer to MMX instruction set,

(a)  (5 points) Write a MMX assembly code segment to compute the maximum of each pair of integers at the corresponding position. Assuming the result is stored in register Rc. The final content of Rc should be

Rc: / 58 / 192 / 118 / 77

You should prove that your program is correct. (Hing: Use unsigned saturation arithmetic)

Answer: Let us consider on one pair of these integers a and b. If we perform unsigned subtraction

PSUBUSW Rc, Ra, Rb

(packed subtraction with unsigned saturation arithmetic on each 16-bit word unit), then the content of the Rc register will be

Rc: / 36 / 0 / 0 / 41

Note that Now adding b to the result, we have

This can be performed with usual packed addition since we are adding two unsigned numbers, and the result will not have overflow! Thus the program looks like this:

Operation / Content of Rc AFTER operation
PSUBUSW Rc, Ra, Rb / 36 / 0 / 0 / 41
PADDW Rc, Rc, Rb / 58 / 192 / 118 / 77

(b)  (5 points) Still given Ra and Rb with their original content shown in part (a) of this problem. Now, our goal is to perform absolute difference between each pair of the words in Ra and Rb. Using registers Re, Rf, …to store intermediate results. The final result should be stored in Rd. The absolute difference between a and b is |a - b|.

Answer:

PADDUSW Re, Ra, Rb /
PADDUSW Rf, Rb, Ra /
PADDW Rd, Re, Rf / Regular packed addition = |a - b|
  1. (10 points)

Consider the RIA implementation of the FIR filter

y1(n,-1)=0, n = 0,1,2, …

h1(0,k)=h(k), k=0,…,N-1

n=0,1,2,…, and k=0,…,N-1

y1(n,k)=y1(n,k-1)+h1(n,k)*x1(n,k)

h1(n,k)=h1(n-1,k)

x1(n,k)=x1(n-1,k-1)

y(n)=y1(n,N-1), n=0,1,2,

And the corresponding DG:

(a)  (2 points) Suppose the projection vector d = [1 0]T. Choose a scheduling vector such that the resulting DFG implementation is the same as the data broadcasting FIR structure as discussed in chapter 3.

Answer: Let s = [s1 s2]T. Due to dependence constraints, and resource conflict constraints, we have

For data broadcasting structure, the # of delay element on the dependence vector that propagate x(i), [1 1]T is 0. That implies s1+s2 = 0. Since the constrain sTD >0 holds iff the data is localized, finding the scheduling vector that result in data broadcasting structure will violate this constraint (which is Ok in this case since we want the data broadcast structure). And since sTd >0 must hold, s1 >0. Hence, one may choose s = [ 1 -1 ]T.

(b)  (2 points) Plot the projected DFG

Answer: See below.

(c)  (4 points) Find a uni-modular transform matrix M to transform the original DG into a new DG such that the same data broadcasting DFG can be obtained if the projection vector and scheduling vector are chosen as d = s = [ 1 0 ]T. Give the (i) M matrix, and (ii) plot the transformed DG.

Answer:

The DG and projected DFG are shown below:

(d)  (2 points) Write down the corresponding RIA formulation of the transformed DG.

Answer: The RIA equations are of the following form:

y1(n,-1)=0, n = 0,1,2, …

h1(0,k)=h(k), k=0,…,N-1

n=0,1,2,…, and k=0,…,N-1

y1(n,k)=y1(n+1,k-1)+h1(n,k)*x1(n,k)

h1(n,k)=h1(n-1,k)

x1(n,k)=x1(n,k-1)

y(n)=y1(n,N-1), n=0,1,2,

  1. (10 points)

Assume that four 64-bit wide registers R1, R2, R3, and R4. Each of them stores a packed 4 16-bit words as shown below:

R1 / A3 / A2 / A1 / A0
R2 / B3 / B2 / B1 / B0
R3 / C3 / C2 / C1 / C0
R4 / D3 / D2 / D1 / D0

(a)  (5 points) Consider the execution of the following MMX instructions. Fill out the content of the destination registers AFTER the execution of an instruction shown to the left.

1 / MOVQ R5, R1 / R5 / A3 / A2 / A1 / A0
2 / PUNPCKHWD R5, R3 / R5 / C3 / A3 / C2 / A2
3 / MOVQ R6, R2 / R6 / B3 / B2 / B1 / B0
4 / PUNPCKHWD R6, R4 / R6 / D3 / B3 / D2 / B2
5 / MOVQ R7, R5 / R7 / C3 / A3 / C2 / A2
6 / PUNPCKHWD R7, R6 / R7 / D3 / C3 / B3 / A3
7 / MOVQ R8, R5 / R8 / C3 / A3 / C2 / A2
8 / PUNPCKLWD R8, R6 / R8 / D2 / C2 / B2 / A2
9 / MOVQ R9, R1 / R9 / A3 / A2 / A1 / A0
10 / PUNPCKLWD R9, R3 / R9 / C1 / A1 / C0 / A0
11 / MOVQ R10,R2 / R10 / B3 / B2 / B1 / B0
12 / PUNPCKLWD R10, R4 / R10 / D1 / B1 / D0 / B0
13 / MOVQ R11, R9 / R11 / C1 / A1 / C0 / A0
14 / PUNPCKHWD, R11, R10 / R11 / D1 / C1 / B1 / A1
15 / MOVQ R12, R9 / R12 / C1 / A1 / C0 / A0
16 / PUNPCKLWD, R12, R10 / R12 / D0 / C0 / B0 / A0

(b)  (2 points) If the data stored in R1, R2, R3, R4 represent a 4 ´ 4 matrix

Discuss what above MMX program have achieved?

Answer: the result is the transpose of the matrix M, it is stored in registers R7, R8, R11, and R12.

(c)  (3 points) In above program, total 12 registers are used. Optimize the program by minimizing the total number of different registers that are needed. You must discuss your approach to receive credit.

Answer: Other than R1 to R4, other registers are used to store temporary results. We may create a demand table as follows:

1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15 / 16
R1 / S / S
R2 / S / S
R3 / S / S
R4 / S / S
R5 / D / D / S / S
R6 / D / D / S / S
R7 / D / D
R8 / D / D
R9 / D / D / S / S
R10 / D / D / S / S
R11 / D / D
R12 / D / D

Thus, R9, R10 can be replaced with R5, R6, and R11, R12 can occupy R3, R4.

Page 1 of 12