Literature Review Summary for EE-800

Name: Mehrnoosh Janbakhsh

ID: 11076994

ECE, U of S

Literature Review: 1

Title: Novel Table Lookup-Based Algorithms for High-Performance CRC Generation

Authors: Michael E. Kounavis and Frank L. Berry

CRCs are error-detecting codes capable of detecting accidental alteration of data. Data in computer systems can be corrupted for many reasons, such as hard-drive malfunctions, Gaussian noise, and faulty physical connections. The high-performance CRC generators in this paper are built on two ideas: first, slicing the remainder computation into smaller terms, and then replacing bit-by-bit computation with table lookups.
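As a concrete illustration (my own sketch, not code from the paper), a minimal bit-at-a-time CRC-32 in Python shows how an accidental single-bit flip changes the checksum; 0xEDB88320 is the reflected CRC-32 generator polynomial:

```python
def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32 (reflected form, polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Divide by the generator polynomial one bit at a time.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

msg = b"hello, CRC"
corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]   # flip a single bit
assert crc32_bitwise(msg) != crc32_bitwise(corrupted)
```

The table-based algorithms this paper studies replace the 8-iteration inner loop with precomputed lookups.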

This algorithm can read a large amount of data at a time and can be implemented in software on commodity processors. The authors observe that the current remainder and the next data bits read from a stream can be expressed as sums (XORs) of smaller terms.

Compared with the Sarwate algorithm, this method needs less memory when reading a large number of bits at a time. It improves CRC performance by reading 32 bits per iteration in slicing-by-4 and 64 bits in slicing-by-8, using parallel lookup tables to generate the CRC values; the next remainder is computed by performing parallel lookups into smaller tables.

It also needs fewer operations per loop iteration than the Sarwate algorithm. For each byte of an input stream, Sarwate performs an XOR between the byte read and the least significant byte of the current CRC value, a table lookup, a shift of the current CRC value, and an XOR between the shifted CRC value and the word read from the table. In contrast, for every byte of an input stream, the slicing-by-4 algorithm performs only a table lookup and an XOR operation. This is why slicing-by-4 is faster: the Sarwate algorithm requires 35 instructions to process 32 bits, whereas the slicing-by-4 algorithm requires only 16.
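To make the loop-structure difference concrete, here is a Python sketch of both inner loops for CRC-32 (my own illustration, not the authors' code): table 0 is the classic Sarwate table, and tables 1 to 3 extend each entry by one extra zero byte so that four lookups can proceed in parallel.

```python
POLY = 0xEDB88320  # reflected CRC-32 polynomial

# T[0] is the Sarwate table; T[i][b] is the CRC of byte b followed by
# i zero bytes, which enables four parallel lookups per iteration.
T = [[0] * 256 for _ in range(4)]
for b in range(256):
    crc = b
    for _ in range(8):
        crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    T[0][b] = crc
for i in range(1, 4):
    for b in range(256):
        T[i][b] = (T[i - 1][b] >> 8) ^ T[0][T[i - 1][b] & 0xFF]

def crc32_sarwate(data: bytes) -> int:
    """One XOR, one lookup, one shift, one XOR per byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

def crc32_slicing_by_4(data: bytes) -> int:
    """Reads 32 bits per iteration: four parallel lookups, one XOR fold."""
    crc = 0xFFFFFFFF
    i = 0
    while len(data) - i >= 4:
        crc ^= data[i] | data[i+1] << 8 | data[i+2] << 16 | data[i+3] << 24
        crc = (T[3][crc & 0xFF] ^ T[2][(crc >> 8) & 0xFF]
               ^ T[1][(crc >> 16) & 0xFF] ^ T[0][(crc >> 24) & 0xFF])
        i += 4
    for b in data[i:]:          # leftover bytes fall back to Sarwate steps
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

Both functions produce the standard CRC-32 result; slicing-by-4 simply consumes the stream four bytes at a time.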

Literature Review: 2, 3

Title: A combined decimal and binary floating-point multiplier

Authors: Charles Tsen, Sonia González-Navarro, Michael Schulte, Brian Hickmann, Katherine Compton

In this paper the authors describe the first hardware unit that supports both 64-bit DFP (BID encoding) and 64-bit BFP multiplication. The two formats have the same shape, (-1)^s * C * b^(E - bias), with base b = 10 for DFP and b = 2 for BFP. For decimal64 the significand C can be any integer in [0, 10^p - 1] with p = 16 digits, s is the sign, and the bias is 398. All rounding modes can be used; however, RTA (round to nearest, ties away from zero) is required only for DFP.
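As a small numeric illustration of the decimal64 interpretation (my own example; the function name and the decoded-operand interface are assumptions, not the paper's):

```python
def dfp_value(s: int, C: int, E: int, bias: int = 398) -> int:
    """Value of a decoded decimal64 operand: (-1)^s * C * 10^(E - bias).
    Restricted to E >= bias here so the result stays an exact integer."""
    assert 0 <= C <= 10**16 - 1, "significand must fit in p = 16 digits"
    assert E >= bias, "kept integral for this illustration"
    return (-1) ** s * C * 10 ** (E - bias)

# C = 15 with exponent field E = 399 encodes 15 * 10^1 = 150:
assert dfp_value(0, 15, 399) == 150
assert dfp_value(1, 25, 398) == -25
```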

In this design, many hardware blocks, such as the 54x54-bit multiplier, the right shifter, the sticky-bit calculation, the exponent calculation, and the incrementer and increment-decision logic in the rounding stage, can be shared between BID and BFP, because the exponent and significand widths differ by only one bit and the detection of special cases (infinity, sNaN, qNaN, and exceptions) is the same.

The combined BID/BFP multiplication algorithm is based on these similarities. First, it decodes the inputs to obtain the sign, exponent E, and significand C, and detects special input operands. Second, it computes the intermediate product C_IP = C_A * C_B using a binary multiplier; in parallel, it calculates the intermediate exponent E_IP = E_A + E_B - bias and the final sign sign_Z = sign_A XOR sign_B.

In the third step, it examines C_IP to determine whether rounding is needed, which is the case when C_IP exceeds p digits. C_Z is then created via a conditional increment of the truncated product C_TP, based on the round and sticky bits. If rounding causes a carry out, the final exponent E_Z is adjusted; finally, the output is encoded from sign_Z, E_Z, and C_Z.
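The three steps above can be sketched in Python (a simplified model written from this summary, operating on already-decoded (sign, C, E) triples with p = 16 and RTA rounding; the real unit works on encoded 64-bit operands and also handles special cases):

```python
P = 16  # decimal64 precision in digits

def bid_multiply(a, b, bias=398):
    """Multiply two decoded decimal64 operands given as (s, C, E) triples."""
    (sa, ca, ea), (sb, cb, eb) = a, b
    # Steps 1-2: final sign, intermediate product, intermediate exponent.
    s_z = sa ^ sb
    c_ip = ca * cb              # binary multiplier output C_IP = C_A * C_B
    e_ip = ea + eb - bias       # E_IP = E_A + E_B - bias
    # Step 3: round only if C_IP exceeds p digits (RTA: ties away from zero).
    extra = len(str(c_ip)) - P
    if extra > 0:
        divisor = 10 ** extra
        c_tp, rem = divmod(c_ip, divisor)   # truncated product C_TP
        if 2 * rem >= divisor:              # round/sticky decision
            c_tp += 1
        if c_tp == 10 ** P:                 # carry out of rounding
            c_tp //= 10
            extra += 1
        return (s_z, c_tp, e_ip + extra)
    return (s_z, c_ip, e_ip)
```

For example, squaring the largest significand forces the rounding path, and the result is renormalized back to 16 digits with the exponent bumped accordingly.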

The combined multiplier's control logic handles both formats. If a BID multiply enters the unit while it is idle, the operation begins immediately; otherwise it must wait until the current operation finishes, which takes five or fifteen cycles for BID depending on whether rounding is needed, and always five cycles for BFP.

The results show that this design reduces area by 58% compared with separate BID and BFP multipliers; its delay is slightly longer than the BID-only design and about 37.8% longer than the BFP-only design.

P.S. This paper was presented twice in class, as my second and third lectures.

Literature Review: 4

Title: A Novel Directory-Based Non-Busy, Non-Blocking Cache Coherence

Authors: Huang Yongqin, Yuan Aidong, Li Jun, Hu Xiangdong

Jiangnan Institute of Computing Technology, Wuxi 214083, China

Implementing multiprocessor cache coherence and memory consistency helps the homemade CPUs support a wide range of system designs. The NB2CC protocol divides serial processing into a conflict-detection step and a conflict-resolution step: conflict detection is completed at the home node, while conflict resolution is distributed to the owners.

There are two broad ways to solve the cache coherence problem: direct protocols and indirection.

Direct protocols pay a performance penalty on sharing misses, while indirection accepts higher sharing-miss latency in order to obtain scalability; the authors adopt the indirection approach. To achieve high efficiency and concurrency at small cost, NB2CC reuses many characteristics of traditional protocols, such as a relaxed memory model, the method of avoiding protocol deadlock, and the basic handling of request races.

In this prototype, everything is based on ordering: the first local request is answered according to the first home request, so requests are serviced in order.

This method avoids creating hot zones: each part contributes to keeping the network free of unnecessary messages, and no more than two forward requests are sent to an owner, which reduces request traffic.
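To make the detection/resolution split concrete, here is a toy Python model I constructed from this summary (the class, methods, and message names are my assumptions; NB2CC's real states and message types are not described here): the home node detects conflicts per cache line and forwards resolution to the current owner.

```python
from collections import deque

class HomeNode:
    """Toy directory: detects conflicts per line, delegates resolution to the owner."""
    def __init__(self):
        self.owner = {}      # line -> node id that currently owns it
        self.pending = {}    # line -> queue of requesters waiting on a conflict

    def request(self, line, requester):
        if line not in self.owner:
            # No conflict: the home node grants ownership immediately.
            self.owner[line] = requester
            return ("grant", requester)
        # Conflict detected at home: forward the request to the owner,
        # which resolves it; the home node is not blocked meanwhile.
        self.pending.setdefault(line, deque()).append(requester)
        return ("forward_to_owner", self.owner[line])

    def owner_done(self, line):
        # Owner finished resolving; hand the line to the next requester in order.
        q = self.pending.get(line)
        if q:
            self.owner[line] = q.popleft()
            return ("grant", self.owner[line])
        del self.owner[line]
        return ("idle", None)
```

Queuing requesters per line preserves the first-come ordering described above while the home node stays free to process requests for other lines.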