Comparison and Analysis of BZ-FAD Based On Shift-Add Architecture With Various Multiplier

K.Martin Sagayam#1, N. Sri Ramalinga Ganesa Perumal*2

1.# PG Scholar, Sudharsan Engineering College, Sathiyamangalam, Pudukkottai, Tamilnadu, India.

, Contact : +91 9791435429

2.* Assistant Professor / HOD of ECE, Sudharsan Engineering College, Sathiyamangalam, Pudukkottai, Tamilnadu, India.

, Contact : +91 9894581169

Abstract

In the majority of digital signal processing (DSP) applications the critical operations are the multiplication and accumulation (MAC). Real time signal processing requires less delay, high speed and high throughput multiplier unit that consumes low power, which is always a key to achieve a high performance of the system. The purpose of this work is to design and analysis of low-power structure, which is called Bypass Zero Feed A Direct (BZ-FAD) based on shift and add multiplier. The architecture considerably lowers the switching activity of conventional multipliers, in terms of speed, area and delay. The result show that, Among other conventional multipliers the proposed architecture works faster of 154.20 MHz, less delay of 9.217 ns and area is optimized in terms of transitions, slices and LUTs.

Keywords: MAC, high throughput, BZ-FAD, Switching activity

1. Introduction

Multiplication is a less common operation than addition, but is still essential for microprocessors, digital signal processors, and graphics engines. Multiplication algorithms will be used to illustrate methods of designing different cells so that they fit into a larger structre. The most basic form of multiplication consists of forming the product of two unsigned (positive) binary numbers. M × N-bit multiplication can be viewed as forming N partial products of M bit each, and then summing appropriately shifted partial products to produce an M+N-bit result P. Binary multiplication is equivalent to a logical AND operation. Therefore, generating partial products consists of the logical ANDing of the appropriate bits of the multiplier and multiplicand. Each column of partial products must then be added and, if necessary, any carry values passed to the next column.

There are a number of techniques that can be used to perform multiplication. In general, the choice is based upon factors such as latency, throughput, area and design complexity. An approach is to use an M+1-bit carry propagate adder (CPA) to add the first two partial products, then another CPA to add the third partial product to the running sum, and so forth. Such an approach requires N-1 CPAs and is slow, even if a fast CPA is employed. More efficient parallel approaches use some sort of array or tree of full adders to sum the partial products. There are some of the multiplication techniques such as shift and add multiplication, SPST multiplier and BZ-FAD multiplier are discussed in more detail in further context.

2. Shift-and-Add Multiplication

Shift-and-add multiplication is similar to the multiplication performed by paper and pencil. This method adds the multiplicand X to itself Y times, where Y denotes the multiplier. To multiply two numbers by paper and pencil, the algorithm is to take the digits of the multiplier one at a time from right to left, multiplying the multiplicand by a single digit of the multiplier and placing the intermediate product in the appropriate positions to the left of the earlier results. As an example, consider the multiplication of two unsigned 4-bit numbers, 8 (1000) and 9 (1001).

Multiplicand 1000 ×

Multiplier 1001__

1000

0000

0000

1000 _____

Product 1001000 (7210)

In the case of binary multiplication, since the digits are 0 and 1, each step of the multiplication is simple. If the multiplier digit is 1, a copy of the multiplicand (1 × multiplicand) is placed in the proper positions; if the multiplier digit is 0, a number of 0 digits (0 × multiplicand) are placed in the proper positions.Consider the multiplication of positive numbers. The first version of the multiplier circuit, which implements the shift-and-add multiplication method for two n-bit numbers, is shown in Fig. 1.

Fig. 1. First version of the multiplier circuit

The 2n-bit product register (A) is initialized to 0. Since the basic algorithm shifts the multiplicand register (B) left one position each step to align the multiplicand with the sum being accumulated in the product register, we use a 2n-bit multiplicand register with the multiplicand placed in the right half of the register and with 0 in the left half.

Fig. 2. The first version of the multiplication algorithm

Fig. 2 shows the basic steps needed for the multiplication. The algorithm starts by loading the multiplicand into the B register, loading the multiplier into the Q register, and initializing the A register to 0. The counter N is initialized to n. The least significant bit of the multiplier register (Q0) determines whether the multiplicand is added to the product register. The left shift of the multiplicand has the effect ofshifting the intermediate products to the left, just as when multiplying by paper and pencil. The right shift of the multiplier prepares the next bit of the multiplier to examine in the following iteration. Example : Using 4-bit numbers, perform the multiplication 9 × 12 (1001 × 1100). Table I shows the value of registers for each step of the multiplication algorithm.

Table I. Multiply example using the first version of the algorithm

The original algorithm shifts the multiplicand left with zeros inserted in the new positions, so the least significant bits of the product cannot change after they are formed. Instead of shifting the multiplicand left, we can shift the product to the right. Therefore the multiplicand is fixed relative to the product, and since we are adding only n bits, the adder needs to be only n bits wide. Only the left half of the 2n-bit product register is changed during the addition.

Another observation is that the product register has an empty space with the size equal to that of the multiplier. As the empty space in the product register disappears, so do the bits of the multiplier. In consequence, the final version of the multiplier circuit combines the product (A register) with the multiplier (Q register). The A register is only n bits wide, and the product is formed in the A and Q registers. The final version of the multiplication algorithm is shown in Fig. 3.

Fig. 3. Final version of the multiplier circuit

Figure 4. The final version of the multiplication algorithm

Example : Perform the multiplication 9 × 12 (1001 × 1100) using the final version of the multiplication algorithm. Table II shows the revised multiplication example for the final version ofthe algorithm.

Table II. Multiply example using the final version of the algorithm

3. Switching Activities

The architecture of a conventional shift-and-add multiplier, which multiplies A by B is shown in Figure.5[1]. There are six major sources of switching activity in the multiplier. These sources, which are marked with dashed ovals in the figure, are: (a) shifts of the B register; (b) activity in the counter; (c) activity in the adder; (d) switching between “0” and A in the multiplexer; (e) activity in the mux-select controlled by B(0); and (f) shifts of the partial product (PP) register. Note that the activity of the adder consists of required transitions (when B(0) is nonzero) and unnecessary transitions (when B(0) is zero). By removing or minimizing any of these switching activity sources, one can lower the power consumption. Since some of the nodes have higher capacitance, reducing their switching will lead to more power reduction.

Fig. 5. Major sources of switching activity in the conventional shift-and-add multiplier

4. SPST multiplier

This concept provides the experience of applying an advanced version called spurious power suppression technique (SPST) on multipliers for high-speed and low-power purposes. To filter out the useless switching power, there are two approaches, i.e., using registers and using AND gates, to assert the data signals of multipliers after the data transition. The SPST has been applied on both the modified Booth decoder and the compression tree of multipliers to enlarge the power reduction.

Fig. 6. Detection logic circuits using registers to assert the control signals

Fig. 7. Detection logic circuits using an AND gate to assert the control signal

The main contribution of this concept is exploring two implementing approaches for the SPST and comparing their efficiency, which provide diverse reference materials for applying the SPST. The first implementing approach of the control signal assertion circuits is using registers, which is illustrated in the shadow area in Fig. 6. The three output signals of the detection logic are given a certain amount of delay before they assert, demonstrated in the timing diagram shown in Fig. 8. The delay Φ, used to assert the three output signals, must be set in the range of ψ Φ < Δ to filter out the glitch signals as well as to keep the computation results correct, where Δ and ψ, respectively, denote the data transient period and the earliest required time of all the inputs. The range of Φ is also shown as the shadow area in Fig. 8. Using registers to control the signal assertions can obviously reduce the spurious power dissipation of adders/subtractors. However, the restriction that Φ must be greater than ψ to guarantee the registers from latching the wrong values of control signals usually decreases the overall speed of the applied designs. This issue should be noticed in high-end applications which demands both high-speed and low-power requirements. To solve the previously mentioned problem, we adopt the other implementing approach of the control signal assertion circuits illustrated in the shadow area in Fig. 8, using an AND gate in place of the registers to control the signal assertion. The timing control of the delay Φ in this implementation is slightly different from the one in the first implementation. That is, the range of Φ can be set as 0 < Φ < Δ to filter out the glitch signals and to keep the computation results correct, as well.

Fig. 8 Timing diagram of the control signals of detection logic circuits after assertions

This feature allows upper-level systems to assert the close signal with an arbitrarily short delay closing to the positive edge of the clock signal, which provides a more flexible controlling space for the delay Φ. When speed is seriously concerned, this implementing approach enables an extremely high flexibility on adjusting the data asserting time of the SPST-equipped multipliers. Therefore, the proposed advanced SPST can benefit multipliers on both high-speed and low-power features.

5. Low Area, Delay and Power Multiplier: BZ-FAD

To derive a low power architecture, concentrate efforts on eliminating or reducing the sources of the switching activity in the conventional multipliers. A low-power structure called by pass zero, feed A directly (BZ-FAD) for shift-and-add multipliers is proposed. The modifications to the multiplier which multiplies by include the removal of the shifting the register, direct feeding of to the adder, bypassing the adder whenever possible, using a ring counter instead of a binary counter and removal of the partial product shift. The architecture makes use of a low-power ring counter proposed in this work. The proposed multiplier can be used for low-power applications where the speed is not a primary design parameter.

Fig. 9. Proposed low area, delay and power multiplier architecture

In this Fig.9 the Feeder and Bypass are used to bypass the adder in the cycles where B(n) is zero. In each cycle, the hot bit of the next cycle (i.e.,B(n+1) is 1 ) is checked. If it is 0, i.e., the adder is not needed in the next cycle, the Bypass register is clocked to store the current partial product. If B(n+1) is 1, i.e., the adder is really needed in the next cycle, the Feeder register is clocked to store the current partial product which must be fed to the adder in the next cycle. Note that to select between the Feeder and Bypass registers we have used NAND and NOR gates which are inverting logic, therefore, the inverted clock Fig.9 is fed to them. Finally, in each cycle, B(n) determines if the partial product should come from the Bypass register or from the Adder output.

6. Result and Discussion

In this section, compared analysis of various multipliers with the proposed architecture. The tool used is XILINX 9.1i and Quartus II 9.0. The results showed for various multiplier schemes – speed, area and delay.

Table III: Synthesis result of various multiplier design using Quartus II 9.0

S.No. / Multiplier / Logical elements / Speed (MHz) / No of Transitions
1 / BZ-FAD / 70 / 154.20 / 985
2 / SPST / 66 / 153.25 / 11028
3 / Conventional / 65 / 59.68 / 17435
4 / Array multiplier / 63 / 133.46 / 1032

Table IV: Synthesis result of various multiplier design using XILINX 9.1i

S.no. / Scheme / Area in terms of Slices / Area in terms of LUTS / Delay (ns)
1 / BZ – FAD / 85 out of 3584 / 156 out of 7168 / 9.217
2 / SPST / 8 out of 768 / 15 out of 1536 / 10.824
3 / Conventional shift-add multiplier / 60 out of 3584 / 108 out of 7168 / 26.511
4 / Wallace tree multiplier / 88 out of 3584 / 155 out of 7168 / 34.161
5 / Booth encoding / 28 out of 3584 / 45 out of 7168 / 13.829

Fig. 10. Output waveform for BZ-FAD mulitiplier in Modelsim 6.4a

7. Summary and Conclusion

In this paper, a low-area and low-delay architecture for shift-and-add multipliers was proposed. The modifications to the conventional architecture included the removal of the shift of the B register (in A X B ), direct feeding of A to the adder, bypassing the adder whenever possible, use of a ring counter instead of the binary counter, and removal of the partial product shift. The result show that, Among other conventional multipliers the proposed architecture works faster of 154.20 MHz, less delay of 9.217 ns and area is optimized in terms of transitions, slices and LUTs. From this result, BZ-FAD multiplier is an excellent choice for real time signal processing applications.

8. References

[1] M.Mottaghi-Dastjerdi, A.Afzali-Kusha, and M.Pedram, “BZ_FAD: A Low-power Low-Area Multiplier Based on Shift and Add Architecture,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 2, pp. 302–306, Feb.2009.