Lucite Project

A report

On

Implementation of Multipliers and Elliptic Curve Addition on Viva

Siddaveerasharan Devarkal

Dec 9th 2003

Multipliers

We have looked into basically three types of multipliers namely Divide-and-Conquer, Broadcast, and Hybrid models. Each of them is described in the following three sub-sections of the document.

Divide and Conquer Multiplier (D&Q):

This is Karatsuba’s algorithm for divide and conquer approach that requires only three multipliers instead of four as would be required in conventional D&Q multipliers. This eliminates one multiplication at the expense of two adders.

The algorithm* is as follows:

A = Ah . 2n/2 + Al, B = Bh . 2n/2 + Bl

T0 = Ah Bh

T1 =Al Bl

T2 = (Ah + Al ) + (Bh +Bl)

A . B = T0 . 2n + (T2 – (T1 + T0)) . 2n/2 + T1

This algorithm is recursive, which enables us to use 18-bit hardware multipliers. For example, a 32-bit multiplier requires three 16-bit multipliers for which it uses 3 hardware multipliers. Therefore, in a single Xilinx Vertex-II 6000 FPGA chip, which has 144 hardware multipliers, we can fit a maximum of one 256-bit D&Q multiplier.


How to force the viva synthesizer to use hardware multiplier? To find hardware multiplier or MULT18x18 object click on the system tab and in there go to /HC/HC_PEx directory. 18-bit input data is exposed to bits and then fed to the MULT18x18 object. Before exposing 18-bit data the bits have to be reversed, and again the product has to be reversed after collecting. This is because in MULT18x18 object LSB-MSB is from top to

Fig. 1 Hardware Multiplier

bottom and for Expose/Collect object it is reverse. One more point that has to be kept in mind is that if inputs are less than 18-bit they must be converted to 18-bit. In the case of not doing that LUTs will be used instead of MULT18x18 blocks. For example, 32-bit

D&Q requires three 16-bit multipliers for which we use hardware multiplier (MULT18x18). Before the 16-bit operands are given to MULT18x18 they must to be converted to 18-bit. Following design will enforce the use of hardware multiplier blocks.

Design Implementation & Results: D&Q multiplier was designed bottom-up. First, D&Q32 was designed using three hardware multipliers, and three of these multipliers were used in designing D&Q64, which in turn were used for D&Q128. This recursion can be easily extended to 256 or higher bit lengths. For each multiplier the basic design remains same and we need to only replace the three base multipliers.

Table 1 shows the resource utilization for D&Q multipliers. For D&Q64, adders and other logic take 730 slices and the same for D&Q128 is 1827. From this we can estimate that for D&Q256 slice usage for adders and other logic will be 4568, and for the whole design approximately 32,414 slices. This would barely fit in Vertex II 6000 that has 33,792 slices. To be on the safer side resource usage for adders and other logic could be tightened up which I plan to do in coming months.

Slices / MULT18x18 / Clock Cycles / Clock Period (ns)
D&Q32 / 585 / 3 / 6 / 15
D&Q64 / 2485 / 9 / 16 / 20
D&Q128 / 9282 / 27 / 21 / 30

Table 1 Resource usage for D&Q multipliers

To show how good D&Q is in terms of speed we can compare it against the Viva Mul object, which is a Shift-Add multiplier. Table 2 gives resource utilization for Viva Mul object. 128-bit version is notoriously slow at 120ns where as D&Q version is at a reasonable speed of 30ns.

Slices / Clock Cycles / Clock Period (ns)
ShiftAdd32 / 361 / 34 / 20
ShiftAdd64 / 721 / 66 / 30
ShiftAdd128 / 1493 / 131 / 120

Table 2 Resource utilization for Viva Mul object

Broadcast multiplier:

Figure 2 gives a pictorial description of the operation of a broadcast multiplier. This particular example uses four base multipliers each of bit-width equal to (1/4) of the bit-width of the operands to be multiplied. For example, to multiply two 192-bit operands we would require four 64-bit base multipliers. You can have any number of base multipliers and the results for some selections are shown in Table 3. The base multipliers themselves are implemented in this fashion, and this continues until you reach a bit-width of less than 18 for which use can use the hardware multipliers available in Vertex-II 6000 FPGA chip.

Fig. 2 Broadcast Multiplier
Bit-width / Number of blocks / Area (Slices)
32 / 2 / 425
48 / 3 / 611
64 / 2 / 1442
64 / 4 / 795
80 / 5 / 1073
96 / 6 / 1276

Table. 3 Area consumption for different bit-widths with different number of base multipliers

Hybrid Multipliers:

Elliptic curves are defined for 192, 224, 256, 384, and 521 bits, and arithmetic operations over such wide-bit operands are of major concern. The dominating factor in the cost of arithmetic is the cost of multiplication, both in terms of area and timing. So, there is an absolute need for a reasonably fast multiplier that can fit in the available resources. Divide-and-conquer (D&Q) multiplier is fast but takes a lot of area and on the other hand Broadcast multiplier (BC) takes less area but is comparatively slower. These two multipliers can be combined to form, what we call a “Hybrid Multiplier” to make a good trade off between area and time. This hybrid model is hierarchical with either D&Q or BC (with 1-6 blocks) used at each level and 18-bit hardware multipliers (HM) used as base unit for the lowest level multiplier. For details on the implementation refer [3].

Elliptic Curve Addition (ECCAdd)

Here we implemented a design to add two points on an elliptic curve. Addition of two points on an elliptic curve requires a sequence of arithmetic operations like addition/subtraction, multiplication and squaring, and all these operations are performed modulo a large prime number. The only arithmetic operation of much concern is multiplication because it is expensive both in terms of area and speed. As of now we are not constrained by area as we have a set of four Vertex II 6000 FPGA chips to place our design on and for this reason D&Q is used as the base multiplier in the for ECC addition to achieve good results in terms of speed.

Design Implementation and Results: The building blocks for ECCAdd are AddModN (modular addition), SubModN (modular subtraction), LeftShiftModN (modular left-shift) and MontMul (Montgomery multiplier). All the arithmetic operations are modulo a prime N. The inputs to these objects are assumed to be multiplied by R modulo N, where R is a power of 2 greater than N. In this version of ECC addition module, we have used as many arithmetic blocks as required. There is another version in which arithmetic blocks are shared among different operations, and that involves a lot of control logic, which is very difficult to implement in Viva. One-way to come over this is to implement the control logic in VHDL and import it into Viva. Currently, the process of importing VHDL modules into Viva is still in early stages and have wait for some time before this can be realized.

a) AddModN: This is addition modulo a prime N. The sum is reduced only if carry bit is set, and the second adder carries out reduction. Mux selects one of the two values based on the carry bit.


Fig. 3. AddModN

b) SubModN: This is subtraction modulo a prime N. AddSub object does 2’s complement arithmetic so MSB of the sum is checked to see if it is set. If set reduced sum is selected as output.


Fig. 4. SubModN

c) LeftShiftModN: This performs single-bit left shift and if there is an overflow the left shifted result is subtracted by N. Basically, we try to keep the bit length of the result to within the range of N and this holds for all other building blocks of ECCAdd.

Fig. 5. LeftShiftModN

d) Montgomery Multiplier (MontMul): In all previous operations the result could be at most 1-bit more than the input operands. But, here the result will be double the length of the operands. We could directly divide the result by N to get the remainder, but this would be expensive in hardware. So we use an algorithm proposed by Peter Montgomery that uses two multiplications, two additions and few shift operations in place of division to reduce double-length product to single-length.

Fig. 6. Montgomery Multiplier

The algorithm** is as follows

A = A * R mod N, B = B * R mod N

T = A * B * R2

m = ((T mod R) . N’) mod R

t = ((T + m * N) / R) mod N

where A & B are inputs , T is the product, m is the intermediate result in reduction and t is the reduced value.

Since R is a power of 2, modulo R operation is equivalent to stripping off the upper half bits of T and division by R is same as stripping off the lower half bits. If the final value t falls outside the range of N then N subtracts it.

e) Complete Design:The total sequence of arithmetic operations can be broken into seven steps based on the data dependency of the operations. For a single point addition on an elliptic curve we could require 14 multiplications, 7 addition/subtractions and 2 left-shift operations. Table 3 gives the resource utilization for 14, 17, 24 and 30-bit versions of ECC addition. To do one Montgomery multiplication we require three ordinary multipliers, and there are 14 such Montgomery multipliers. So, we would require a total of 42 ordinary multipliers for which we can use D&Q or Broadcast or Hybrid multipliers. For example, in 24 and 30-bit versions of ECCAdd we use 32-bit D&Q as the base multiplier. 32-bit D&Q uses three hardware multipliers and for the complete ECCAdd we would require a total of 126 hardware multipliers. For more information on the implementation refer to [4].

We used Viva 2.2 and one of the most annoying parts during the design process was the synthesis time that used to go for nearly 24 hours for designs as big as 30-bit version of ECCAdd. For 14 and 17-bit versions synthesis time used to be around 1-2 hours. We also ran these designs on the recently released beta version of Viva 2.3 and observed a considerably reduction in the synthesis time up to orders of 5x in some cases.

Looking at the figures for 30-bit version it is apparent that for 64-bit version we run out of both slices and hardware multipliers. So, we partition the design onto multiple chips and this is what I am working on currently.

ECCAdd / Slices / MULT18x18 / Clock Cycles / Clock Period (ns)
14-Bit / 2969 / 42 / 34 / 15
17-Bit / 4390 / 42 / 34 / 15
24-Bit / 20649 / 126 / 147 / 20
30-Bit / 23599 / 126 / 147 / 25

Table 4 Resource utilization for ECC Addition

Multi-chip design:

For D&Q multipliers we have seen that 256-bit version barely fits. Obviously for higher bit versions we have to place the design on multiple chips. Also, for ECC addition of 64-bit and above we need to partition the design across multiple chips.

Hypercomputer provides 50-bit wide communication between each of the four chips. To move data larger than 50-bit we need to multiplex. Currently, we are able to send a list of any number of words from one chip to another. We have conducted simple tests to be satisfied that this really works. What we did was to send a list of 32-bit numbers from PE5-PE6 increment them on PE6 and send to PE7, again increment them and send back to PE5.

For 64-bit ECC addition we need to partition the design across three chips because the design requires 378 hardware multipliers. In a single chip a maximum of five 64-bit Montgomery multipliers can be fitted. Taking into account these constraints here is my scheme for partitioning 64-bit ECC addition. Step1 has five multipliers so it can be placed on one chip, steps 3, 4 and 5 have combined five multipliers, so those can be placed on second chip and the rest of the design can be placed on the third.

If we are to use D&Q for 128-bit or higher bit versions of ECC addition we have to reuse the resources. Or else we have to use Broadcast or some of the hybrid multiplier models.

Future Work:

For the 192-bit version of the ECCAdd we will not be able to fit in as many multiplier modules as required by the design in the available resources. This means we will have to allocate some modules and re-use them for multiple operations. Other operations like addition, subtraction and left-shift take minimal resources and so we can allocate as modules as required by the design.

So, the first question is how many modules to allocate? Also we have different types of multipliers to choose from like Broadcast, Divide-and-Conquer and various Hybrid models. With the given allocation for each type of multiplier we next have the question of what assignment to have i.e. what operations to run on each of the allocated modules. Given the allocation and assignment we also need to think of the scheduling technique so as to minimize the latency of the design. Currently, we are looking for answers to these questions.

References:

1. Reference: James Ross Goodman, Energy Scalable Reconfigurable Cryptographic1 Hardware for Portable Applications, PhD Thesis, MIT.

2. Reference: Duncan A. Buell, Elliptic Curves and the NIST Standards, May 2002.

3. Estimation of Area and Timing for a Hybrid multiplier.

4. Elliptic Curve Arithmetic on a Reconfigurable Hardware