In This Presentation, We Will See

Synopsis

The hash functions (or “hashes”) used in Storage Management are typically “off the shelf” hashes that were not designed for the purpose at hand. As a result, storage management applications are less efficient and less effective than they would be if they used hashes tailored to their purpose.

In this presentation, we will see

What the properties of customary hashes are.
How the storage management objectives differ from the intent of these popular hashes.
How specialized hashes would increase the efficiency and effectiveness of storage management systems. (Ask me about the mathematical blunder that will eventually lead to a disastrous “Y2K” for some storage management systems.)

Below, you will find a few sample slides from my presentation and paragraphs from my proposal.

Hash functions (or “hashes”) are widely used in computer applications to optimize various computing resources. A hash function of a bit-stream (array of bits) acts as a digital fingerprint of this bit-stream, intended to characterize the latter with a small number of bits.

Different hashes have been designed for different purposes and situations (types of data, implementation environment, etc.). Often, industrial applications use hashes that were not intended for the situation at hand. Consequently, the resulting product is less effective and less efficient than it could have been if the hashes had been chosen, or created, for the intended purpose.

Hashes are used in many situations arising in storage and data management. In order for these hash representations to be suitable, it is important that the likelihood of two different data streams hashing to the same value (called collision probability) is negligible. The collision probability of a hash measures how effectively the hash differentiates the bit-streams.
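To give a feel for how collision probability depends on hash length, here is a minimal sketch in Python. It assumes an idealized hash that spreads inputs uniformly over all 2^b possible values (real hashes only approximate this) and uses the standard "birthday" approximation. The function name and the sample sizes are illustrative choices, not taken from the presentation.

```python
import math


def birthday_collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items distinct inputs collide.

    Assumption: the hash is uniform over 2**hash_bits values.
    Approximation used: 1 - exp(-n(n-1) / (2 * 2**hash_bits)).
    """
    if n_items < 2:
        return 0.0
    exponent = n_items * (n_items - 1) / (2 * 2 ** hash_bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


# Hypothetical workload: one billion distinct bit-streams.
for bits in (32, 128, 160):
    p = birthday_collision_probability(10 ** 9, bits)
    print(f"{bits:>3}-bit hash, 1e9 items: p = {p:.3e}")
```

Under this model, the probability depends on the number of items squared, so a hash length that looks ample for a small file system can become inadequate as the amount of stored data grows.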

Programs often use popular hashes such as checksums, CRC, MD5, and SHA for the sole purpose of differentiating files or portions of files. Checksum and CRC hashes were devised to address the problem of data reliability, whereas MD5 and SHA hashes are aimed at data security/privacy (cryptography). These hashes fulfill their intended purpose effectively, but in many storage settings, using them is akin to using a crane to pick up a needle.
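The following sketch computes the kinds of hashes named above for two byte strings that contain the same bytes in a different order. It uses Python's standard zlib and hashlib modules; the 16-bit additive checksum is a toy stand-in written for this example, and the sample inputs are arbitrary.

```python
import hashlib
import zlib


def additive_checksum(data: bytes) -> int:
    """Toy 16-bit checksum: sum of the bytes modulo 2**16 (illustrative only)."""
    return sum(data) % 65536


def fingerprints(data: bytes) -> dict:
    """Return hex digests of several common hashes for the same input."""
    return {
        "checksum16": format(additive_checksum(data), "04x"),
        "crc32": format(zlib.crc32(data), "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


a, b = b"storage", b"storaeg"  # same bytes, last two swapped
fp_a, fp_b = fingerprints(a), fingerprints(b)
for name in fp_a:
    # Each hex character carries 4 bits of the digest.
    print(f"{name:<10} {len(fp_a[name]) * 4:>3} bits  differ={fp_a[name] != fp_b[name]}")
```

The additive checksum cannot tell the two inputs apart because addition ignores byte order, while the longer digests distinguish them at the cost of more computation and more bits per fingerprint.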

“It took 300,000 years for the human race to accumulate 12 exabytes of information. That's 12 billion gigabytes. With the digital age and numerous software applications, e-business and other enterprise technology trends such as ERP, CRM, and data warehousing are doubling the amount of corporate data every 6 months. Businesses are looking for technologies to help them manage and store the information flowing in and out of their computer systems at an alarming rate, 12 times the data stored in 1998. Data Storage systems have always been a critical part of any business IT infrastructure, whether it is a server's own direct access storage hard disks, or shared network storage.” (

While storage needs are increasing exponentially, improvements in storage technology are leveling off as devices approach the theoretically achievable storage densities. Hence the need for new solutions, many of them offered by the numerous emerging storage management companies.

Three (important) aspects of storage/data management are:

Reliability (backing up files)
Efficiency (reducing storage space)
Effective retrieval of desired information (data mining/warehousing)

We will give four examples of techniques that are used to achieve the goals of storage/data management, namely:

Mirroring/Synchronization, which indicates the (automatic) copying of files to a secondary location
Duplicate detection, which denotes finding, or “catching on-the-fly”, duplicate files in order to avoid redundant backups or to consolidate stored data and so reduce storage requirements
Compressed Representation, which designates the technique of representing the files of the file system efficiently, for example by identifying common bit-streams contained in those files so as to guide compression schemes or reduced storage representations. These alternate representations may also be designed to facilitate data mining/warehousing.
Indexing, which denotes the casting of contents to a data structure (an index) that allows for efficient lookup
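As an illustration of the first technique in this list, the sketch below decides which files need to be copied to a mirror by comparing digests instead of full contents. It is a simplified in-memory model: the function names, the dictionaries standing in for a file system, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def files_to_copy(source: dict, mirror_digests: dict) -> list:
    """Return names whose content is missing from, or differs at, the mirror.

    source maps file name -> contents; mirror_digests maps file name -> digest
    previously recorded for the mirrored copy.
    """
    return sorted(
        name
        for name, data in source.items()
        if mirror_digests.get(name) != digest(data)
    )


source = {"a.txt": b"alpha", "b.txt": b"beta v2", "c.txt": b"gamma"}
mirror = {"a.txt": digest(b"alpha"), "b.txt": digest(b"beta v1")}
print(files_to_copy(source, mirror))  # ['b.txt', 'c.txt']
```

Only the short digests need to be exchanged between the two locations, which is why the collision probability of the chosen hash matters: a collision would cause a changed file to be skipped.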

All four of these techniques employ hash functions as an important, if not crucial, element of their implementation.
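Duplicate detection gives a concrete picture of the role a hash plays. The sketch below groups files by digest and then confirms each candidate group byte for byte, so a hash collision cannot cause two different files to be reported as duplicates. The in-memory dictionary, the function name, and the use of SHA-256 are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict) -> list:
    """Return sorted groups of file names that have identical contents.

    files maps file name -> contents (bytes).
    """
    # Pass 1: bucket files by digest; only files sharing a digest can be equal.
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).digest()].append(name)

    # Pass 2: within each bucket, confirm equality on the actual contents.
    groups = []
    for names in by_digest.values():
        confirmed = defaultdict(list)
        for name in names:
            confirmed[files[name]].append(name)
        groups.extend(sorted(g) for g in confirmed.values() if len(g) > 1)
    return sorted(groups)


files = {"x.bin": b"same", "y.bin": b"same", "z.bin": b"other"}
print(find_duplicates(files))  # [['x.bin', 'y.bin']]
```

The hash narrows the expensive full comparison down to a handful of candidates; a cheaper hash with an adequate collision probability would do this job equally well, which is the kind of trade-off the presentation is concerned with.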