Implementing Multicast in a Software Emulation

VLSI Implementation of Differencing Multistage Detector for CDMA

Gang Xu, Praful Kaul, Sridhar Rajagopal

Department of Electrical and Computer Engineering

Rice University

Houston, Texas

{gxu, pkaul, sridhar}@rice.edu

Table of Content

1. Introduction.

2. Functional Description

3. Circuit Design and Layout..

4. PLA Description.

5. Performance Analysis.

6. Summary.

References.

1.1 Introduction

The chip we have made implements a CDMA related algorithm. It is called the Differencing Multistage detector for CDMA communications. In the future CDMA communication systems, sophisticated algorithms must be deployed in order to increase the capacity. The Multistage detector has proven to be one of the most efficient interference suppression algorithms because:

Compared to the conventional detector, Multistage detector approaches the error rate bound for the single user communication system which means almost all the interference is eliminated.
Multistage detector has O(K2) computation complexity (where K is the number of users), which is much less than the optimum multi-user detector and most of the proposed sub-optimum multi-user detectors.

The Differencing Multistage detector is the improvement of the conventional multistage detector. It maintains the performance and explicit data flow of the Multistage detector, while saving computations. The good performance of the Multistage algorithm together with the computation efficiency of the Differencing method makes this algorithm very promising.

1.2 Algorithm Description

The Multistage detection algorithm uses an iterative method to solve the linear equation

where y is the matched filter output vector, R is the cross-correlation matrix (self-correlation's are zero), A is the amplitude matrix, d is the vector we are going to solve, and  is the noise component added to the signal by the channel. In the proposed Wide-band CDMA communications, Binary Phase Shift Keying (BPSK) is used as the modulation scheme. Thus the contents of each bit would be either +1 or -1.

The lth stage of iteration will be:

In the Differencing method, we just use the differencing vector to get the iterative solution, because the difference of two consecutive stages would be:

In the differencing vectors, which are expressed as the difference of two consecutive hard decisions, there are only +2, -2 or 0s. Bypassing all the zeros, we are going to save the total computations. +2s and -2s can be easily implemented as arithmetic right shift once. So dedicated multipliers are not required.

To simplify the hardware design, we have focussed on fixed-point implementation and synchronized system only. The matched filter output has 10-bit precision. Due to the synchronization and the low cross-correlation of the Gold Code 31, the cross correlation matrix (RA) has the same elements in each row. So for eight-user case, we only need to store 7 elements (the self-correlation element is neglected). The ALU has 12-bit precision in order to match the dynamic range of the input data.

2.1 Functional Description

Our chip implements single stage of the Multistage detection algorithm. We can implement the Multistage detection algorithm by cascading chips together. The number of chips that are cascaded depends on the number of the stages in the detector. If the chip is used as the first stage, then we directly feed the inputs into the chip, which should have been otherwise supplied by the matched filter bank. But if the chip is not used as the first stage, then it receives inputs from the chip that implements the previous stage. The flow of data between the chips is controlled by a hand shaking mechanism. As soon as the next stage is ready, the current stage starts transmitting until all the bits are sent, which is indicated by the hand shaking's finished signal.

Each chip takes as input the soft decision bits, cross correlation matrix bits and previous hard decision input. The soft decision inputs are 10 bits wide and are loaded in parallel for each user. The cross correlation input is 8 bits wide and also loaded in parallel.

The standard cells include: 12-bit full adder/subtractor, 12-bit mux, 3:8 decoder, 3-bit comparator, 12x8 4x8 8x8 register files, shift register and PLAs. In the MAGIC layout design of these standard cells, we particularly pay attention to saving the space and hierarchy design. We divide the chip into two major blocks: the register block and the ALU block. The bounding box of these two blocks is 2300 by 2700 (lambda).

2.2 Block description

The register block imports and stores the input data and prepares the data for the ALU block. There are three major functional units in this block:

Register file A: has 12x8 register block, "read" decoder and "write" decoder and corresponding PLAs. It has the capability to be read and written at the same time (with different addresses). In the first step, it stores the input soft decision values. During the accumulation stage, it outputs the soft decision values to the ALU and stores the accumulation result.

Register file B: has 4x8 register block, decoder and the corresponding PLA. While we storing the soft decision values, it detects the position of all the non-zero elements in the differencing vector and stores their addresses and sign bits. During the accumulation stage, it outputs these addressed to fetch the proper cross-correlation constants in register C to the ALU.
Register file C: has 8x8 register block the decoder and mux's. Before we load the soft decision values to the register file A, we could load the cross-correlation constants to this register. The write of this register is controlled by PLA1 in register file A and the read of this register is controlled by the output of the register file B.We select the proper cross-correlation constants by determining if it is in the first stage and if the address is the same as that of register A (since we do not cancel the desired user's signal).

ALU: The ALU is a 12-bit adder/subtracter logic block, which uses 6 2-bit carry look-ahead addition to reduce the propagation delay time. The subtraction is done by adding the 2's complement input of the subtrahend. Each 2-bit carry look-ahead adder is made up of 2 Ex-ORs, 6 NANDs, 5 NORs and 6 NOTs. Compared to this, a 2-bit ripple carry adder is made up of 4 Ex-ORs and 8 NAND's. Thus, also, the Ex-ORs which contribute to the delay at the output are minimized though the interconnection logic becomes more complicated. Also, the carry is generated at the output much before the sum and is available to the next stage immediately, thereby reducing the overall delay. The timing analysis of the ALU is discussed in detail later. The crystal simulation showed that the ALU is approx. 1.94 times faster than it would have been if implemented in a ripple carry fashion. The logic diagram of a 2-bit carry lookahead adder is enclosed. For the 2's complement of the input (needed for subtraction), we invert every input bit (1's complement) and add 1 to the carry input bit of the lowest 2-bit adder. The invert/non-invert selection is done by using transmission gates which select the input for addition and the complemented input for subtraction. The 12-bit ALU consists of 298 p-channel and n-channel transistors each and occupies 556x1427 micron space in the chip.

3.1 Circuit Design & Layout

Logic Diagram of MUX-2

MUX-2 is a 2-1 multiplexer cell. We used the compound gate design for this cell.

Logic diagram of 3-bit Comparator

We use 3 XORs to get the equal and not equal information from the input.

Logic Diagram for a Latch

In the latch design, we focused on the protection of the latched information when several latches are connected by the same bus. The extra inverter in the circuits achieves this goal.

Logic Diagram for the Shift Register

In the shift register, two separated stages of the simple latch are used in order to prevent the transparent transmission of the input data.

Logic diagram of 3-8 decoder

A three-stage Differencing Multistage detector is shown above. Three ASICs are cascaded in a chain, driven by the same clock. The first stage is different because the differencing algorithm requires two initial inputs. Presumably, the second and third stage would be less busy than the first stage due to bypassing some zeros in their calculations. The throughput is determined by the slowest stage (which is the first stage obviously) and the delay is governed by all the three stages. Thus, using differencing method will not improve the throughput, but reduce the system delay dramatically.

4.1 PLA Description

In our chip we have used four PLA's in all. There is one main PLA that controls the other

three PLA's. The primary purpose that the PLA's serve in our chip is to generate addresses for read and write operations on the register files. There are three register files in our chip each of them storing some input bits. We need addresses so that we can access these bits. So the PLA's have been designed to generate these addresses. Since each register file has eight registers, we have 3 bit addresses. The address generated by the PLA is forwarded to a 3-8 decoder, which enables the line corresponding to the input address. The bits controlled by the enabled line are read or written if the enable line is high.

4.2 Main PLA

The main PLA is used in the chip to control the other three PLA's. It takes as input the

Restart and Load signal which are both external signals, and generates three Restart signals (one for each PLA), WB signal and the REGACTL signal.

Load / Restart1 Restart2

Restart3 REGACTL WB

Meg file for Main_PLA

-- Meg file for PLA1

INPUTS : RESTART LOAD;

OUTPUTS : RESTART1 RESTART2 RESTART3 REGACTL WB;

-- describe the reset logic

reset on RESTART to state0(RESTART1 RESTART2 RESTART3);

-- wait for input signal

-- idle state 0

state0 : if LOAD then state0 else state1;

state1 : goto state2(RESTART2 WB REGACTL);

state2 : goto state3(WB);

state3 : goto state4(WB);

state4 : goto state5(WB);

state5 : goto state6(WB);

state6 : goto state7(WB);

state7 : goto state8(WB);

state8 : goto state9(WB);

state9 : goto state10;

state10 : goto state11;

state11 : goto state11(REGACTL WB);

4.3 PLA 1

PLA 1 is used to generate the addresses for write operations into the soft decision and the

cross correlation register file. It takes as input the Restart1, WB and the Load signal which are generated by the Main PLA, and generates the 3-bit address, WR1 and WR3 which are write enable signals and Change signal which is used by PLA 2.

Load WB/ addr WR1 WR3

Meg file for PLA 1

-- Meg file for PLA1

INPUTS: RESTART LOAD WRITEBACK;

OUTPUTS: ADDR0 ADDR1 ADDR2 WR1 WR3 CHANGE;

-- describe the reset logic

reset on RESTART to state0;

-- wait for input signal

-- idle state 0

state0 : case(LOAD WRITEBACK)

0 0 => state0;

1 0 => state1(WR3);

0 1 => state1(WR1);

1 1 => state0;

endcase => ANY;

state1 : case(LOAD WRITEBACK)

0 0 => state1;

1 0 => state2(WR3 ADDR0);

0 1 => state2(WR1 ADDR0);

1 1 => state1;

endcase => ANY;

state2 : case(LOAD WRITEBACK)

0 0 => state2;

1 0 => state3(WR3 ADDR1);

0 1 => state3(WR1 ADDR1);

1 1 => state2;

endcase => ANY;

state3 : case(LOAD WRITEBACK)

0 0 => state3;

1 0 => state4(WR3 ADDR0 ADDR1);

0 1 => state4(WR1 ADDR0 ADDR1);

1 1 => state3;

endcase => ANY;

state4 : case(LOAD WRITEBACK)

0 0 => state4;

1 0 => state5(WR3 ADDR2);

0 1 => state5(WR1 ADDR2);

1 1 => state4;

endcase => ANY;

state5 : case(LOAD WRITEBACK)

0 0 => state5;

1 0 => state6(WR3 ADDR0 ADDR2);

0 1 => state6(WR1 ADDR0 ADDR2);

1 1 => state5;

endcase => ANY;

state6 : case(LOAD WRITEBACK)

0 0 => state6;

1 0 => state7(WR3 ADDR1 ADDR2);

0 1 => state7(WR1 ADDR1 ADDR2);

1 1 => state6;

endcase => ANY;

state7 : case(LOAD WRITEBACK)

0 0 => state7;

1 0 => state0(WR3 ADDR0 ADDR1 ADDR2);

0 1 => state0(WR1 ADDR0 ADDR1 ADDR2 CHANGE);

1 1 => state7;

endcase => ANY;

4.4 PLA 2

PLA 2 is used to generate the addresses for read and write operations into the state register

file. It takes as input the Restart2, ABS and Change signals and generates the 3-bit address, RD2 and WR2 which are the read and the write enable signals for state register file, and the Finish signal.

Change ABS/ addr Finish RD2 WR2

Meg file for PLA 2

-- Meg file for PLA 2

INPUTS: RESTART CHANGE ABS;

OUTPUTS: ADDR0 ADDR1 ADDR2 FINISH WR2 RD2;

-- describe the reset logic

reset on RESTART to state0;

-- wait for input signal

-- idle state 0

state0 : case(CHANGE ABS)

0 0 => state0(RD2);

1 0 => state0(FINISH);

0 1 => state1(WR2);

1 1 => state0(FINISH);

endcase => ANY;

state1 : case(CHANGE ABS)

0 0 => state1(RD2 ADDR0);

1 0 => state0(RD2 FINISH);

0 1 => state2(WR2 ADDR0);

1 1 => state0(RD2 FINISH);

endcase => ANY;

state2 : case(CHANGE ABS)

0 0 => state2(RD2 ADDR1);

1 0 => state1(RD2 ADDR0);

0 1 => state3(WR2 ADDR1);

1 1 => state1(RD2 ADDR0);

endcase => ANY;

state3 : case(CHANGE ABS)

0 0 => state3(RD2 ADDR0 ADDR1);

1 0 => state2(RD2 ADDR1);

0 1 => state4(WR2 ADDR0 ADDR1);

1 1 => state2(RD2 ADDR1);

endcase => ANY;

state4 : case(CHANGE ABS)

0 0 => state4(RD2 ADDR2);

1 0 => state3(RD2 ADDR0 ADDR1);

0 1 => state5(WR2 ADDR2);

1 1 => state3(RD2 ADDR0 ADDR1);

endcase => ANY;

state5 : case(CHANGE ABS)

0 0 => state5(RD2 ADDR0 ADDR2);

1 0 => state4(RD2 ADDR2);

0 1 => state6(WR2 ADDR0 ADDR2);

1 1 => state4(RD2 ADDR2);

endcase => ANY;

state6 : case(CHANGE ABS)

0 0 => state6(RD2 ADDR1 ADDR2 );

1 0 => state5(RD2 ADDR0 ADDR2);

0 1 => state7(WR2 ADDR1 ADDR2);

1 1 => state5(RD2 ADDR0 ADDR2);

endcase => ANY;

state7 : case(CHANGE ABS)

0 0 => state7(RD2 ADDR0 ADDR1 ADDR2);

1 0 => state6(RD2 ADDR1 ADDR2);

0 1 => state8(WR2 ADDR0 ADDR1 ADDR2);

1 1 => state6(RD2 ADDR1 ADDR2);

endcase => ANY;

state8 : case(CHANGE ABS)

0 0 => state8(RD2 ADDR0 ADDR1 ADDR2);

1 0 => state7(RD2 ADDR0 ADDR1 ADDR2);

0 1 => state8(RD2 ADDR0 ADDR1 ADDR2);

1 1 => state7(RD2 ADDR0 ADDR1 ADDR2);

endcase => ANY;

4.5 PLA 3

PLA 3 is used to generate the addresses for read operations into the soft decision register

file. It takes as input the Restart3 signal and generates the 3-bit address (the state bits are used for the address, so no separate address signals have to be generated by the PLA), RD1 which is the read enable signal for the soft decision register file and the Change signal.

Restart3/RD1 Change

Meg file for PLA 3

-- Meg file for PLA 3

INPUTS: RESTART;

OUTPUTS: RD1 CHANGE;

-- describe the reset logic

reset on RESTART to state0;

-- wait for input signal

-- idle state 0

state1 : goto state2(RD1);

state2 : goto state3(RD1);

state3 : goto state4(RD1);

state4 : goto state5(RD1);

state5 : goto state6(RD1);

state6 : goto state7(RD1);

state7 : goto state0(RD1 CHANGE);

state0 : goto state1(RD1);

5. Performance Analysis of the Chip

The heart of the chip is the ALU, which performs the calculations in the nested loop and stores them back in registers. Hence, the speed of the ALU operating on the input and storing them back in the register file constitutes the critical path in the circuit. Hence, a detailed timing analysis of the ALU was done. To perform the worst case timing analysis, we should ensure that

a change in the carry of one stage causes a change in the carry of the next stage, so that the carry bit ripples all the way to the top and produces a change in the output of the last stage.

The CRYSTAL and IRSIM analysis showed that the worst case could occur when all the inputs are high(carry is high) and all the carry's change to low (due to all the inputs changing to low). i.e on a 1-to-0 transition of all the carry's.

5.1 IRSIM Analysis

The IRSIM analysis of the ALU showed a worst case timing delay of around 25ns, for all the carry input's falling from 1-to-0 during addition. The IRSIM plot for the worst case is attached.

5.2 CRYSTAL Analysis

The Crystal simulation for the 12-bit ALU showed a worst case delay of 42.63ns for the output of the 12th bit to appear after the inputs have been changed during addition. To verify the results, we also simulated for the worst case of the 2-bit carry look-ahead adder. Quite interesting observations were seen. The 2nd bit of the sum took a worst case of 9.06ns to appear while the 1st bit of the sum took a worst case of 8.04ns to appear. However, for the carry of the 2-bit addition took only 7.04ns to appear while the carry of the addition of the 1st bit took 6.84ns to appear. This meant that hardly 0.2ns were needed for the 2nd stage of the 2-bit adder to generate its carry after the previous carry had been computed. Hence, if this adder had been used in a ripple carry fashion, we would be losing 6.84*2 - 7.04 = 6.64ns per 2-bits. Thus, using a 2-bit carry lookahead adder, we are able to increase the speed of the chip by 1.94.

C2 (7.04ns)

A1,B1 2nd bit S1 (9.06ns)

C1 (6.84ns)

A0,B0 1st bit S2 (8.04ns)

Worst case analysis of a 2-bit carry look-ahead Adder

5.3 Spice Analysis

The Spice Simulation for the worst case showed the maximum delay to be around 17ns. The ripples could be observed as all the inputs except the LSB changed from 1-to-0, on changing all 1's to 0's and adding them. (The LSB is 0 because on adding 2 inputs with all 1's, the LSB is 0)

The reason for the difference in timing between the various methods of simulation is due to the fact that Crystal, Spice and IRSIM use different parameter models for their simulation.

As the model was extracted for the 1.2 micron process, the results obtained by Spice should be taken as the most accurate.

5.4 Maximum Frequency

Hence, the theoretical maximum frequency obtained = 100MHz.

But, the speed of operation is limited by the IO Pads to around 25MHz.

Hence, Operating Max. Frequency for the chip = 25MHz.

6. Summary

We have successfully tested the functionality of our chip for each stage of the multistage case and also for all the 3 chips cascaded together. We found that we need some small glue logic like that of a NOR gate for connecting the reset signals and the finish-out signals together at a pin. We had 6 unused pins, which we used to bring out internal write signals of the register files and the upper 2 bits of the ALU. We are eagerly waiting for the chip to be fabricated so that we can start our testing.

6.1 Improvements

If there were enough space we would have liked to have 2 ALUs switched by a demultiplexer to improve the speed of the circuit, as the ALU was the bottleneck i our circuit. However, as the chip sped is limited by the IO pads increasing the complexity of the design did not make much sense.

As the Register file containing the soft decision inputs is accessed successively, we could have implemented that register as a shift register and saved a PLA for that register block.

6.2 Comments on CAD tools

The CAD tools were fine, except that all the CAD tools were for different models and the IRSIM, Crystal and Spice all showed largely different results.

6.3 Division of Labor

Tasks / Gang / Sridhar / Praful
Project discussion /  /  / 
The major standard cells / 
The ALU block / 
The registers, shift register / 
Main PLA / 
Overall debugging / 
Pad frame and 3-chip simulation /  / 
Crystal and Spice simulation / 
Report /  /  / 

References

1. Weste and Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, Second Edition, 1994.

2. J. Cavallaro,Elec 422 Course Lectures and Handouts.

3. J. Cavallaro VLSI Design I Packet,

4. William I Fletcher, An Engineering Approach to Digital Design, Englewood Cliffs, N.J. Prentice-Hall,1980.

5. G. Xu, J. Cavallaro, and B. Aazhang, VLSI Implementation of the Differencing Multistage Detection in CDMA Communication Systems,