Proposal for efficient hashes effectively differentiating application-generated data

Abstract

Hash functions are widely used in computer applications to optimize various computing resources. In many situations (such as duplicate detection, synchronization/mirroring, indexing, and many other storage settings), hashes are used to represent large amounts of data with a significantly smaller number of bits. In order for these hashed representations to be suitable, it is important that the likelihood of two different data streams hashing to the same value (called the collision probability) be negligible. Given this property, one also wishes to use hashes that can be promptly computed, and whose hash values are represented with a small number of bits.

Yet programs often use hash functions that were not designed for the situation at hand. For example, in the context of duplicate detection and synchronization/mirroring, CRC and MD5 hashes are frequently employed. These hashes were designed for purposes other than distinguishing data streams generated by computer applications. The collision probabilities claimed by systems using these hashes are those inferred from the assumption that all bit-streams are equiprobable (i.e. occur with the same probability).

In situations where this assumption is valid, it would be more advantageous to use faster hashes than CRC and MD5, since the computational complexity of these comes from properties sought by their design that are irrelevant in the present situation. Yet in many cases computer applications will not produce all possible bit-streams with equal probability. Rather, typical bit sequences will occur frequently, others less often, and some never.

Thus there is an opportunity to create hash functions better suited to differentiating data generated by typical computer applications. Further, by exploring a wide solution space, we believe it is likely that we can find effective hashes with a smaller hash size and lower computational complexity than the aforementioned customary hashes.

In short, we believe that we can find hash functions which will be, in most storage management circumstances, shorter, faster, and more effective than commonly used hash functions.

Table of Contents

1. Introduction

1.1 Hash functions

1.2 Balanced hash functions: An example

1.3 Hashes in storage problems: Duplicate detection and synchronization as examples

2. Problems with current use of hash functions in the context of duplicate detection and synchronization

2.1 Customary Hash Functions: CRC and MD5

2.2 Purpose of CRC and MD5 hash functions

2.3 Differing Purposes of data and storage management

2.4 Probabilistically balanced hash functions

2.5 Problems in using CRC and MD5 for distinguishing computer generated bit-streams

2.6 The case of two Avamar Technologies patents

2.7 Hash size and computational complexity considerations

2.8 Conclusion

3. Proposed approach for finding better hash functions

3.1 An overview of the problems and solution approaches

3.2 Work plan

3.3 Deliverables

3.4 Further details

1. Introduction

1.1 Hash functions

A hash function is a function that converts an input from a (typically) large domain into an output in a (typically) smaller range (the hash value, often a subset of the integers). Hash functions vary in the domain of their inputs and the range of their outputs and in how patterns and similarities of input data affect output data. Hash functions are used in hash tables, cryptography and data processing. A good hash function is one that experiences few hash collisions in the expected domain of values it will have to deal with; that is, it would be possible to uniquely identify most of these values using this hash. (from Wikipedia)

Many computer applications use hash functions to optimize storage and speed when performing certain tasks. This is the case in synchronization/mirroring and duplicate detection, where the hash values of documents or parts of documents (generically, “bit-streams”) are calculated continuously in order to represent these with a small number of bits. These hashes can then be used to compare bit-streams with each other by comparing their smaller hash representations.

The idea here is to design the hash function so that the hash values are small (thus requiring little memory) and fast to compute, and so that if two bit-streams have the same hash, the probability that the bit-streams are different is very small: we then say the collision probability is small. It is in this latter sense that the hash value is said to “represent” the bit-stream.
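For concreteness, here is a minimal sketch of this usage pattern, using Python's standard hashlib purely for illustration (MD5 stands in for whatever hash is ultimately chosen):

    import hashlib

    def digest(data: bytes) -> bytes:
        """Return a short, fixed-size representation of an arbitrary bit-stream."""
        return hashlib.md5(data).digest()  # 16 bytes, regardless of input size

    a = b"contents of one large document ..."
    b = b"contents of another document ..."

    # Unequal hashes prove the bit-streams differ; equal hashes only suggest
    # equality, with an error rate given by the collision probability.
    print("possibly identical" if digest(a) == digest(b) else "certainly different")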

1.2 Balanced hash functions: An example

Fig. 1

Fig. 1 illustrates a hash function mapping nine possible bit-streams to three distinct hash values. If every bit-stream has the same probability of being present, then each has a 1/9 chance of being present. It may be verified that the hash function represented in Fig. 1 has a 2/9 = 0.222 collision probability. This is the lowest collision probability one can achieve with a hash function having 3 hash values when all bit-streams have an equal probability of being present.

In general, the lowest collision probability one can achieve when hashing B equiprobable bit-streams with H hash values is 1/H - 1/B (for Fig. 1: 1/3 - 1/9 = 2/9).

A balanced hash function is one whose inverse images are all more or less the same size: that is, one where the sets of bit-streams mapping to a same hash value are of roughly equal size. If all bit-streams are equiprobable, then a balanced hash function achieves the lowest collision probability.
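To make the quantity being optimized concrete: for two bit-streams drawn independently according to their probabilities, the collision probability equals the sum over hash values of the squared probability mass mapping to each value, minus the sum of squared bit-stream probabilities. A small sketch (Python, for illustration) that reproduces the 2/9 figure of Fig. 1:

    from collections import defaultdict

    def collision_probability(hash_of, prob):
        """P(two independently drawn bit-streams differ yet hash alike):
        sum over hash values of P(hash)^2, minus sum over streams of P(stream)^2."""
        mass = defaultdict(float)
        for s, p in prob.items():
            mass[hash_of[s]] += p
        return sum(m * m for m in mass.values()) - sum(p * p for p in prob.values())

    # Fig. 1: nine equiprobable bit-streams, three per hash value (balanced).
    prob = {s: 1 / 9 for s in range(9)}
    hash_of = {s: s // 3 for s in range(9)}
    print(collision_probability(hash_of, prob))  # 0.2222... = 2/9 = 1/H - 1/B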

1.3 Hashes in storage problems: Duplicate detection and synchronization as examples

In many storage problems, it is useful to have hashes that differentiate bit-streams from each other. In the case of duplicate detection, hashes are used to separate a collection of documents into probable-duplicate groups, allowing a subsequent process to decide whether the documents in those groups (having at least two elements) are actually duplicates. Dividing the collection of documents into groups accelerates duplicate detection.

File synchronization in computing is a two-way synchronization of files and directories: there is a left side and a right side, and changes on either side are propagated to the other.

A mirror in computing is a direct copy of a data set. On the Internet, a mirror site is an exact copy of another Internet site (often a web site). Mirror sites are most commonly used to provide multiple sources of the same information, and are of particular value as a way of providing reliable access to large downloads. Mirroring is a one-way operation, whereas file synchronization is two-way.

In the case of synchronization/mirroring, hashes are again used to compare two specific documents (or, often, portions of these documents) stored in different locations to determine whether they differ and therefore need to be synchronized. Sending whole documents over the network on a regular basis would be infeasible, so the smaller hashes are sent instead. If the hashes are the same, the documents are considered to be the same, and no synchronization takes place. Hence it is important here to have a very low collision probability.
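As a sketch of how hashes enter the synchronization picture (the helper names here are hypothetical; production systems such as rsync add rolling checksums and other machinery):

    import hashlib

    CHUNK = 4096  # bytes per block; an arbitrary illustrative choice

    def block_hashes(data: bytes) -> list:
        """Hash each fixed-size block so only short hashes cross the network."""
        return [hashlib.md5(data[i:i + CHUNK]).digest()
                for i in range(0, len(data), CHUNK)]

    def blocks_to_send(local: bytes, remote_hashes: list) -> list:
        """Indices of local blocks whose hashes differ from the remote side's."""
        local_hashes = block_hashes(local)
        return [i for i, h in enumerate(local_hashes)
                if i >= len(remote_hashes) or h != remote_hashes[i]]

Note that if two corresponding blocks differ yet collide, the differing block is silently never sent; this is exactly why the collision probability must be negligible.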

2. Problems with current use of hash functions in the context of duplicate detection and synchronization

2.1 Customary Hash Functions: CRC and MD5

Most current duplicate detection processes and synchronization schemes use CRC and MD5 hash functions. The Wikipedia definitions of these are provided below.

A cyclic redundancy check (CRC) is a type of hash function used to produce a checksum, which is a small number of bits, from a large block of data, such as a packet of network traffic or a block of a computer file, in order to detect errors in transmission or storage. A CRC is computed and appended before transmission or storage, and verified afterwards to confirm that no changes occurred. CRCs are popular because they are simple to implement in binary hardware, are easy to analyze mathematically, and are particularly good at detecting common errors caused by noise in transmission channels.

In cryptography, MD5 (Message-Digest algorithm 5) is a widely-used cryptographic hash function with a 128-bit hash value. As an Internet standard (RFC 1321), MD5 has been employed in a wide variety of security applications, and is also commonly used to check the integrity of files.

2.2 Purpose of CRC and MD5 hash functions

These two classes of hash functions are ingenious and very appropriate in some situations, and have thus enjoyed great popularity in the IT world. Yet they were designed for specific purposes, and are often used in situations for which they are not suitable.

CRC hashes were designed to detect burst errors, the type of data corruption or loss that typically occurs when data is transmitted or stored. A burst error is a corruption of a consecutive run of bits in a bit-stream. Since such clustered errors are much more likely than corruption of bits scattered throughout the bit-stream, it makes sense to tailor a hash that will distinguish a bit-stream from those obtained from it by introducing burst errors. The CRC hash functions do exactly this.
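A quick illustration using the standard 32-bit CRC (zlib.crc32 in Python): CRC-32 is guaranteed to detect any single error burst of 32 bits or fewer, so the assertion below always holds.

    import zlib

    data = bytearray(b"a perfectly ordinary stored record " * 10)
    original = zlib.crc32(bytes(data))

    # Introduce a burst error: corrupt a consecutive run of 24 bits.
    for i in range(100, 103):
        data[i] ^= 0xFF

    assert zlib.crc32(bytes(data)) != original  # the burst is always detected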

MD5 hash functions, on the other hand, are designed specifically so that, knowing a hash value, it is very hard (intractable) to create a bit-stream that would produce this hash value. Hence MD5 hash functions are useful when privacy/security is at stake.

2.3 Differing Purposes of data and storage management

Yet, in the case of data and storage management in general, and duplicate detection and synchronization in particular, we are not concerned with burst errors or privacy issues. Our goal here is only to distinguish computer-generated bit-streams from each other. Using CRC and MD5 hashes for this purpose has several drawbacks:

1) It is computationally intensive to compute the hash values (especially in the case of MD5 hashes); a rough timing sketch follows below.

2) They are not adapted to the probability distribution of typical bit-streams produced by computer applications.
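For a feel of drawback 1, a rough timing sketch; absolute numbers vary by machine, and Python's crc32 and md5 are both C-backed, so the gap here is smaller than a fair pure-software comparison would show:

    import hashlib, timeit, zlib

    payload = b"x" * (1 << 20)  # 1 MiB of data

    for name, fn in [("crc32", lambda: zlib.crc32(payload)),
                     ("md5", lambda: hashlib.md5(payload).digest())]:
        ms = timeit.timeit(fn, number=100) / 100 * 1e3
        print(name, round(ms, 3), "ms per MiB")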

2.4 Probabilistically balanced hash functions

In order to convey some understanding of this latter issue, we will illustrate the problem of adapting a hash function to the probability distribution of its domain (here, bit-streams).

Fig. 2 Fig. 3

Consider Fig. 2, where the same hash function as in Fig. 1 is represented, but where the probabilities of the bit-streams are not equal. Note that the probability that a bit-stream hashes to the first hash value is 16/36 = 0.444, while the probabilities for the second and third hash values are respectively 11/36 = 0.306 and 9/36 = 0.250. The hash function is balanced with respect to the number of bit-streams, but not with respect to their probabilities. In this particular case, the collision probability is 0.233.

Methods using balanced hashes usually claim the collision probability corresponding to the case where all bit-streams are equiprobable. In the case of Fig. 2, one would then claim a collision probability of 0.222 (the Fig. 1 value). Yet, if the bit-stream probabilities are as indicated in Fig. 2, the claimed 0.222 collision probability is clearly an underestimate, the actual collision probability being 0.233.

Not only can one achieve a lower collision probability than the balanced hash of Fig. 2; one can moreover achieve a lower collision probability than 0.222, the theoretical lower bound for balanced hash functions in the equiprobable case (with three hash values). For example, the hash function depicted in Fig. 3 achieves a collision probability of 0.213. Note that this hash function is not balanced in the conventional sense, but is what we will call “probabilistically balanced”, since the probability of obtaining any given hash value is uniform.
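One simple (heuristic, not necessarily optimal) way to construct such a probabilistically balanced hash is greedy assignment: take the bit-streams in decreasing order of probability and always assign the next one to the hash value currently carrying the least probability mass. A sketch, reusing collision_probability from the earlier sketch:

    def probabilistically_balanced(prob, num_hashes):
        """Greedily equalize the probability mass mapping to each hash value."""
        mass = [0.0] * num_hashes
        hash_of = {}
        for s in sorted(prob, key=prob.get, reverse=True):
            h = min(range(num_hashes), key=mass.__getitem__)
            hash_of[s] = h
            mass[h] += prob[s]
        return hash_of

    # A skewed distribution over nine bit-streams (probabilities 1/45 .. 9/45):
    prob = {s: (s + 1) / 45 for s in range(9)}
    balanced = {s: s // 3 for s in range(9)}  # equal-sized inverse images
    print(collision_probability(balanced, prob))                              # higher
    print(collision_probability(probabilistically_balanced(prob, 3), prob))   # lower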

Note that the collision probabilities of the illustrative hash functions mentioned above (Figs. 1, 2, and 3) are not far apart. This is only because the examples are small. If one increases the ratio of the number of bit-streams to the number of hash values and/or the skew of the bit-stream probability distribution, these differences become more pronounced.

We conducted experiments to examine how the collision probabilities of balanced and probabilistically balanced hash functions differ under various bit-stream probability skews. Some results are presented here. In this experiment, we consider a group of 1000 bit-streams hashed to 10 hash values. We defined a family of Poisson-based distributions with a skew factor varying from 0 to 1, which we use as the bit-stream probabilities. Five such distributions are graphed in Fig. 4, with skew k = 0, 0.25, 0.50, 0.75, 1.

Fig. 4
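The exact parameterization of the skew factor is not reproduced here; as one hedged reading, the following sketch interpolates between a uniform distribution (k = 0) and a truncated Poisson shape (k = 1), and its output can be fed to the collision-probability sketches above to reproduce the qualitative trend:

    import math

    def skewed_probs(n=1000, k=0.5, lam=100.0):
        """Blend uniform with a truncated Poisson(lam) shape; k = 0 is uniform.
        (A hypothetical reading of the 'skew factor'; the original is unspecified.)"""
        logs = [i * math.log(lam) - lam - math.lgamma(i + 1) for i in range(n)]
        pois = [math.exp(v) for v in logs]
        total = sum(pois)
        return [(1 - k) / n + k * p / total for p in pois]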

Fig. 5 displays, for each hash value, the total probability of the bit-streams mapping to it, in the balanced case (green) and the probabilistically balanced case (yellow), for skews 0.25, 0.50, 0.75, and 1.00. Fig. 6 graphs the collision probabilities (in the balanced and probabilistically balanced cases) as the bit-stream distribution skew increases.

Fig. 5

Fig. 6

2.5 Problems in using CRC and MD5 for distinguishing computer generated bit-streams

We thus see that when the bit-streams being hashed are weighted by a non-uniform probability distribution, balanced hash functions do not yield optimal collision probability. And when these bit-streams are documents or portions of documents produced by computer applications, the domain probability distribution is in fact not uniform.

Many studies have been made of the structure of (natural and computer) languages. Some language models are deterministic (e.g. Chomsky grammars) and some are statistical (e.g. Markov models, Bayesian inference, entropy theory). All of these imply skewed bit-stream probabilities for computer data: whether entered by a human or generated by a program, such data follows patterns rather than being completely random.

Some authors/inventors make the mistake of claiming theoretically optimal collision probabilities when using CRC, MD5, or other balanced hashes on computer-generated bit-streams. In reality, when the actual skewed distribution of the domain is taken into account, the collision probabilities are higher. How much higher depends on how skewed the domain probability distribution is.

This underestimation of the collision probability can lead to faulty systems when non-collision is a critical property of the system. In that case, one could increase the hash size (hence the number of possible hash values) sufficiently to compensate for the underestimation. Yet even this won't help in some situations, as we exemplify below.
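The compensation itself is simple arithmetic under the uniform model, where a b-bit hash gives a collision probability of roughly 1/H = 2^-b: if the actual collision probability is a factor s above the claimed one, about log2(s) extra hash bits restore the intended level. A sketch of this heuristic:

    import math

    def extra_bits(claimed_p, actual_p):
        """Additional hash bits needed to bring the actual collision probability
        back down to the claimed level (uniform-model heuristic)."""
        return math.ceil(math.log2(actual_p / claimed_p))

    print(extra_bits(0.222, 0.233))  # the Fig. 2 underestimation costs about 1 bit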

2.6 The case of two Avamar Technologies patents

Consider the twin patents assigned to Avamar Technologies: “Hash file system and method for use in a commonality factoring system” and “System and method for unorchestrated determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences”. These describe methods to represent the contents of the files of a file system as a sequence of hashes associated with commonly encountered bit sequences. This is an ingenious idea that could yield considerable storage savings. However, the hashes proposed to implement the scheme might impede its functionality.

Indeed, since bit sequences are represented by their hash values, it is imperative that no two distinct sequences hash to the same value. Hence, for this scheme not to lead to a disastrous loss of data, it is essential that the collision probability of the hash function be very close to zero. To claim the optimal collision probability implied by balanced hash functions, one must assume that all bit sequences are equiprobable. Yet if that were the case, the scheme would not be advantageous, since there would then not be many common bit sequences throughout the file system to yield storage savings.
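To see why “very close to zero” is so demanding here, consider the birthday-style estimate: among n distinct chunks hashed to b bits, the probability that at least one pair collides somewhere in the file system is roughly n^2 / 2^(b+1), and even this assumes the optimistic equiprobable model. A quick check with illustrative numbers:

    def any_collision_prob(n_chunks, hash_bits):
        """Birthday approximation: P(some collision) ~ n^2 / 2^(b+1),
        assuming uniformly distributed hash values."""
        return n_chunks ** 2 / 2 ** (hash_bits + 1)

    print(any_collision_prob(10 ** 9, 128))  # ~1.5e-21 for a 128-bit hash
    print(any_collision_prob(10 ** 9, 64))   # ~0.027 for a 64-bit hash: risky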

2.7 Hash size and computational complexity considerations

We have seen the importance of minimizing the actual collision probability when designing a hash function. Further, when designing any algorithm, one wishes to minimize its time and space complexity. In the case of hash functions, this translates to:

  • Minimizing the hash size, that is, the number of bits used to represent the hash values.
  • Minimizing the time (and space) required to compute the hash value of a bit-stream.

One wishes to minimize the hash size because

  • If the hash values are stored in RAM or on disk, this minimizes the space required to store them.
  • More critically, if the hash values are transmitted (as is the case in mirroring, synchronization, and distributed duplicate detection), this minimizes the amount of data to be communicated, and thus the time needed to communicate the hash values.

Of course, lowering the hash size increases the collision probability, so an appropriate tradeoff must be struck.

One wishes to minimize the time (and space) required to compute the hash value for obvious reasons. Yet the more we restrict the number of operations used to compute hash values, the more we restrict which hash functions we can compute, and thus the less we can lower the collision probability. So again, an appropriate tradeoff must be struck.

Even if we assume that the bit-stream probability distribution is indeed nearly uniform, any balanced hash function suffices to obtain minimal collision probability, since we are not concerned with the circumstances CRCs and MD5s are designed for. In this case, one should choose hash functions that are faster to compute than CRCs and MD5s.
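As one example of the kind of cheap hash meant here, a sketch of FNV-1a, a well-known fast non-cryptographic hash (one candidate among many, not the design this proposal will ultimately pursue):

    def fnv1a_64(data: bytes) -> int:
        """64-bit FNV-1a: one XOR and one multiply per input byte."""
        h = 0xcbf29ce484222325                 # FNV-1a 64-bit offset basis
        for byte in data:
            h ^= byte
            h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
        return h

    print(hex(fnv1a_64(b"some bit-stream")))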

This threefold tradeoff is depicted in Fig. 7.