1

Locke/Fynn



Intelligent Data Fuzzing in Hardware as a Means of Security Auditing

Dustin C. Locke, Anthony Fynn, WashingtonUniversity in St. Louis

Abstract—Fuzzing is a common technique used in penetration testing and security auditing. Many things can be fuzzed (network protocols, file data, APIs, etc.), and it is a generally effective technique for finding security flaws. Existing tools are primarily implemented as software solutions. We have developed a proof-of-concept hardware fuzzing appliance as an extension of a general-purpose processor which can perform intelligent (using knowledge of data structure) fuzzing at high speeds. Applications could include large-scale security auditing of high-bandwidth network gateways, or high-speed application auditing using clustering.

Index Terms—fuzzing, security auditing, hardware, flaws, testing

I.INTRODUCTION

T

hereare many different tools employed by security auditors. Static analysis usually involves poring over source code and manually checking for vulnerabilities in the implementation or design flaws. Dynamic analysis is generally more interesting, yields quicker initial results, and is a good method for finding the proverbial “low-hanging fruit.” Fuzzing is a common type of dynamic analysis which involves sending corrupt data to an application and then observing how the application handles that data. The desired result is for the application to crash in a certain way. For example, if the application crashes with a segmentation violation while reading the next instruction and the value in the instruction pointer came from the input data, there is a usable vector for injecting arbitrary code into the application and pointing the processor at it (the cause of the fault could be a buffer overflow, format string error, integer overflow, or any number of other types of vulnerabilities). In short, by randomizing the input data, sending it to the application, and recording results (whether it crashes or not, what the state was if it did crash, etc.) many times, it is very probable that exploitable vulnerabilities will be found. Many published security warnings and vulnerabilities have been found using fuzzing.

There are a number of ways in which data can be fuzzed, but they are generally broken into two categories: dumb fuzzing and intelligent fuzzing. Dumb (or naïve) fuzzing is the blind mangling of input data without regard for what the data means. This type of fuzzing is still a surprisingly effective tool for finding latent vulnerabilities. However, the desired method is referred to as “intelligent” or “structural” fuzzing, and involves selectively corrupting data based on its structure. For example, if you are fuzzing a network protocol such as IP or TCP, there are certain fields which you cannot corrupt, such as the source and destination addresses/ports, and checksums. Intelligent fuzzing is the process of applying specific knowledge such as this about the input data format while mangling the data. This is a more effective and generally more accepted method of fuzzing.

Many different types of fuzzers exist: network fuzzers, file fuzzers, API fuzzers, etc. But as far as the authors know, all existing fuzzers are software-based. One particular product billed as a “network fuzzing appliance,” called Hydra, is produced by a company called Security Innovations in Melbourne, Florida. However, it is simply a machine with two network cards that is running Linux. The most common fuzzing tool is called SPIKE. SPIKE, which is released under the GPL, is basically an API and a set of related tools that allow you to quickly perform network “stress-testing.” It has a unique method of “data marshalling” that allows intelligent fuzzing to be performed on virtually any network protocol. It is, however, a software-based tool.

Our goal was to create a hardware data fuzzer with the versatility of an application such as SPIKE, but with advantages (such as speed and efficiency) that can only come from a hardware application.

II.Design Goals and Project Overview

The canonical example in the minds of the authors when describing the goals of this project is a network fuzzing appliance, similar to the aforementioned Hydra, but with all of the logic built entirely into hardware. The appliance would sit, for example, on a network line between two nodes, taking data in on one end and sending the data out again on the other. It should be a transparent “node” which divides the network link. As the data passes through it is mangled according to the protocol, and the results can be observed on the destination node.

Our goal was not to create this entire solution, but to design and simulate the hardware which would be necessary to make this example a reality. We set out to design a piece of hardware that could perform data fuzzing according to some user-supplied input (based on structural knowledge of the data) at a high rate. Once this proof-of-concept is complete, it is a matter of exercise to add the necessary hardware for network I/O.

So, explicitly stated, here were our main goals:

-Design a data fuzzing device entirely in hardware

-The device should be able to perform intelligent fuzzing

-The device should be robust as to the type of data it is fuzzing (i.e., we did not want to create strictly a TCP fuzzer, for example)

-The device should perform at a rate sufficient to keep up with data traversing over a high-speed network link

-Once created, the device should be easy to “configure” for a new protocol/data type

-The device should be able to be produced in high volume at a minimal cost

These were our main goals for the project. As we completed our design and began simulations, some of these goals forced us to make design and implementation changes, as we will describe later.

The design was completed in VHDL using Xilinx’s tools, and simulated using ModelSim.

III.General Architecture

A.Basic Processor

The fuzzer was implemented on top of an existing design for a 5-stage pipelined RISC microprocessor with a custom Instruction Set Architecture and assembler, designed and completed by the authors. New modules were added to the processor to facilitate the fuzzing process, and new instructions were added to the ISA and assembler to allow user-specified control of the fuzzing process.

The original processor design was fairly standard. It consisted of a 5-stage pipeline (instruction fetch, instruction decode, instruction execute, memory access, and writeback), a program counter and accumulator, separate 16-bit addressable 32-bit instruction and data memory (Harvard architecture), a simple 32-bit ALU, and the glue logic necessary to complete the pipeline.

There were 22 separate instructions in the ISA, all supported by the custom assembler (which was written in Perl). The instructions were grouped into three basic types: R-type (register), I-type (immediate) and J-type (branching). The ISA and accompanying assembler is rich enough to support test programs that perform fairly complex operations, such as computing a Fibonacci sequence and sorting an array of numbers.

The SRAM modules used are based off of the Samsung K7A203200A 64Kx32-bit Synchronous Pipelined Burst SRAM module. This specific SRAM module supports a 32-bit data word, 16-bit addressable memory, and has a maximum lookup time of 4.5ns (depending on the input voltage), making it perfect for our applications. Since a Harvard architecture isused, there will not be conflicts between the instruction fetch and memory access stages.

The register file contains 32 separate 32-bit general purpose registers (R0 through R31). Similar to the MIPS architecture, R0 is a special register which always contains the value 0.

The 32-bit ALU is fairly standard, supporting addition, subtraction, less than, greater than, equal to, rotate left, rotate right, shift left, and shift right operations. It takes two 32-bit operands and a 12-bit control code, and produces a 32-bit result as well as a zero flag and overflow flag as its output. The ALU is synchronous, and will produce the result within a single clock cycle (the maximum gate delay is less than 20ns).

B.Fuzzing Operation

Fuzzing itself is a very simple process, usually accomplished by performing an XOR of the data to be fuzzed with some random value. Our fuzzing unit is no different, but has some extra functionality to enable intelligent fuzzing.

To perform intelligent fuzzing, we take not only the data to be fuzzed as input, but a mask value which we use to selectively fuzz certain parts of the input data. The mask is essentially a bitmap—each bit of the mask marks a full byte of the input data to either be fuzzed or left alone. Thus, for a 256-bit data word, a 32-bit mask is necessary. A random number is XORed with the input data for bytes where the corresponding mask bit is set to 1, and the result is a “fuzzed” number.

C.Hardware to Support Fuzzing

In order to support the fuzzing operation in hardware, several new pieces of logic had to be added to the processor. A special register file, a fuzzing ALU, and a special SRAM module were created and integrated into the original processor pipeline. The specifics of these devices will be discussed.

IV.Fuzzing Hardware

A.Special Register File

The register file used for fuzzing operations is similar to the general purpose register file, with a couple of significant enhancements. Firstly, it consists of 8 256-bit data registers instead of 32-bits, which allows higher data throughput by fuzzing more data at once. Secondly, each of the 256-bit data

Figure 1. Special masking large-word register file

registers contains a parallel 32-bit mask register which is used to specify which bytes in the data register are to be fuzzed. So, for example, the mask value 0011(for an arbitrary four-byte number) would mean: fuzz the two least significant bytes, but leave the two most significant bytes alone. 32 bits in the mask, with each bit representing a full byte (8 bits), results in a full 256-bit data word. Data registers are populated only through loads from the special SRAM module. Mask registers are set through immediate values in assembly instructions by the programmer, since these will be data-specific (the mask will depend on the type of data being fuzzed). When the fuzz instruction is executed, the selected register data and its mask are sent out on the register file’s output lines to the fuzzing ALU. The result from the fuzzing ALU is then written back into the correct register in the special register file. Store instructions move values from the data registers back into the special SRAM module.

B.Fuzzing ALU

The actual fuzzing operation is performed in the fuzzing ALU, or fuzzing unit (FU). Other than containing an internal pseudo-random number generator, the fuzzing unit is fairly simple. The FU takes as input a 256-bit data word and a 32-bit mask. It then expands the 32-bit mask into a 256-bit value by essentially repeating each bit 8 times, then performing an AND of this value with the generated pseudo-random number. This result is then XORed with the data word to produce the final, 256-bit fuzzed result.

The random number generator effectively produces a 256-bit random number, but it does it in a quick and efficient manner which will be described in a later section. Other than generating the random number, the only operations required are an AND and an XOR. Thus, though the FU is the bottleneck of the entire processor, it has a gate delay of only 21ns (which is only slightly larger than the maximum gate delay of the general purpose ALU).

C.Special SRAM

The SRAM used in fuzzing is exactly the same as the Samsung module used in the general-purpose processor, but with a 256-bit wide word. It is 16-bit addressable, and supports both reading and writing with a maximum response time of 5ns.

V.Extended ISA

In addition to our original 22 general-purpose instructions, five more were added to support the fuzzing operations: fzlw, fzsw, fuzz, mskh, and mskl.

The first two are used to load/store a 256-bit data word to/from the special memory module, respectively. The load grabs a word from memory and places it into the specified special data register, and the store removes the word from the specified data register and writes it to the specified address in the vector SRAM.

The fuzz instruction takes a special register as an operand, and causes the specified register data and mask to be sent from the special register file to the fuzzing unit. The FU performs the fuzzing operation, and the result is sent back to the special register file and placed back into the data register from which it originated.

The mskh and mskl instructions are used to set the high and low nibbles of the mask value, respectively. Since the mask value is 32 bits, the entire immediate value will not fit in a single instruction, which has exactly 32 bits available for both the control code, register number, and immediate data. Thus, two instructions are required to set the mask for a specific fuzzing register.

VI.Optimizations

One of our main goals was to produce a hardware data fuzzer that would realize high throughput. During the simulation process, several bottlenecks were discovered that would prevent this from happening, and thus required a change from our original design and/or implementation. Two of the more important issues are discussed here.

The first issue we encountered was the setting of the mask value. Our original design specified a parallel mask with a bit-to-bit mapping, which would require a 256-bit mask value. Since the mask is set through immediate values from the instruction memory, this would require a prohibitively large number of instructions to set the mask value 16 bits at a time. Obviously, this was unacceptable. Upon further reflection, it was realized that most data is structured at the byte granularity, meaning that our mask could also be based on the granularity of a byte. Thus, we modified our mask to be a bit-to-byte mapping (as in a bitmap), and the result was a 32-bit mask which could be set using immediate values in at most two instructions.

The second issue was the implementation of the random number generator. The random number generator was built using the same methodology as glibc’s rand() function. It performs a multiplication of a seed value with a large number, then adds a constant value. Unfortunately, to generate a 256-

Figure 2. Processor diagram

bit random number requires a very expensive multiply operation, and resulted in maximum combinational gate delays of well over 300ns. This was prohibitively slow, and would require the entire processor to be clocked at a sluggish speed. To combat this, we instead implemented the random number generators as parallel 8-bit RNGs, which combined through two stages to produce the final 256-bit random number. The result was an equally functional RNG, but with a maximum combinational gate delay of 21ns, a marked improvement.

VII.Performance

One of our stated goals was to produce a hardware fuzzing device which would be useful in live network scenarios. The example used was a transparent node that would bridge a standard network link, mangling data as it passed through. For this to be successful, the device must be able to keep up with the native line speed of a network, preferably a high-speed link to make it more applicable.

The maximum gate delay of our entire processor, as mentioned before, lies in the fuzzing ALU at 21ns. Some calculations show this produces a maximum clock speed of 48 MHz. It takes a maximum of five clock cycles to fuzz a single 256-bit word of data (one load, two set-masks, the fuzzing operation, and a store), resulting in a minimum throughput of slightly over 2.5 Gbps. This exceeds the native line speed of an OC-48 fiber optic line, which clocks in at just under 2.5 Gbps.

It is important to realize that this is a lower bound on the processor’s fuzzing rate. Most data will consist of a header, which will be selectively fuzzed using the mask registers, and a data payload which will be completely fuzzed (i.e., all mask register values will have all bits set to 1, and will not be changed). For bulk fuzzing, where the mask is not changed, this removes two clock cycles from the fuzzing process, resulting in a required time of 3 clock cycles to fuzz 256 bits of data. This results in an upper bound of more than 4 Gbps.

VIII.Conclusion

We were able to successfully create a data fuzzing device entirely in hardware, which is able to perform intelligent fuzzing at high speeds.

In addition, the hardware is fairly simple and low-cost (and is only clocked at 48 MHz), which makes it easy to mass-produce and market.