Algorithms to Improve the Efficiency of Data Compression and

Caching on Wide-Area Networks

Amar Mukherjee, Principal Investigator

School of Electrical Engineering and Computer Science

University of Central Florida

Orlando, FL 32816

Contact Information

Name of PI: Amar Mukherjee
School of Electrical Engineering and Computer Science

University of Central Florida

Orlando, FL 32816
Phone: (407) 823-2763
Fax: (407) 823-5419
Email:

WWW Page

Project URL (To be added)

List of Supported Students

Nan Zhang, Graduate Research Assistant

Raja Iqbal, Graduate Research Assistant

Nitin Motgi, Graduate Research Assistant

Research Collaborators

Dr. Robert Franceschini

Mr. Holger Kruse

Project Award Information

·  Award Number: IIS-9977336

·  Period of Performance of the entire project: 10/01/1999 – 9/30/2002 (award period of 3 years)

·  Current Reporting Period: 10/01/1999 – 6/30/2000

·  Title of the Project: Algorithms to Improve the Efficiency of Data Compression and Caching on Wide-Area Networks

Keywords

Data Compression, Lossless Compression, Text Compression, Wide-Area Networks, Caching.

Project Summary

The goal of this research project is to develop new lossless text compression algorithms and software tools to incorporate compression in the MIME/HTML standards. The approach consists of encoding the text to exploit the natural redundancy of a language via the use of a dictionary and then compressing it with a pre-existing compression algorithm. The encoding scheme depends on the specific characteristics of the compression algorithm, and several algorithms, including the Bzip2 algorithm based on the Burrows-Wheeler transform, are used as backend compression algorithms. A basic understanding of the interaction between the encoding schemes and the compression algorithms is being developed. The performance of the algorithms is being measured taking into account both compression and communication metrics. Infrastructure tools are being developed, using dynamic caching of dictionaries, to embed compression into the MIME/HTML standards. The impact of this research on the future of information technology is to enable data delivery systems where communication bandwidth is at a premium and archival storage is an exponentially costly endeavor. It is expected that the new lossless text compression algorithms will achieve a 3% to 30% improvement in compression ratio over the best known pre-existing compression algorithms, which might translate into a reduction of more than 70% of the text traffic on the Internet. The experimental research is linked to educational goals through rapid dissemination of results in reports, conference and journal papers, doctoral dissertations and master's theses, and through transfer of the research knowledge into the graduate curriculum. Software tools developed under this grant will be shared via a web site.

Goals and Objectives

The major goal of the project is to develop data delivery systems where communication bandwidth is at a premium and archival storage is an exponentially costly endeavor. Specific objectives for this period were:

·  Development of new lossless text compression algorithms and software tools to incorporate compression in text transmission over the Internet.

·  Basic understanding of the interaction of encoding schemes and compression algorithms.

·  Measurement of the performance of the algorithms taking into account both compression and communication metrics.


Method of Approach

The basic philosophy of our compression approach is to transform the text into an intermediate form that can be compressed with better efficiency. The transformation is designed to exploit the natural redundancy of the language. We have developed a class of such transformations, each giving better compression performance than the previous one, and all giving better compression than most of the current and classical compression algorithms (viz. Bzip2, based on the Burrows-Wheeler Transform; the class of PPM (Prediction by Partial Matching) algorithms such as PPMA, PPMC, PPMD, PPMZ and PPM*; Gzip, based on the LZ77 family of algorithms; and the arithmetic and Huffman coders). We also measured the execution time needed for the pre-processing; its impact on the total execution time is negligible. The algorithm has a fixed storage overhead in the form of a word dictionary for the particular corpus of interest, which must be shared by all users. The typical size of the dictionary for the English language is about 1 MB, and it can be downloaded using the caching and memory management techniques that are being developed for use in the context of Internet technologies. If the compression algorithms are used over and over again, which is true in all practical applications, the amortized storage overhead is negligibly small and no more than that of the backend compression algorithm used after the transformation. Our recommendation is to use the Bzip2 algorithm for backend compression along with our proposed transformation. The combined effect is the best average compression ratio of 2.14 BPC (bits per character) over the Calgary and Canterbury corpora, with no appreciable degradation in execution time or storage overhead. We have also measured the impact of using these algorithms on transmission time and bandwidth over the Internet for text file transfer using both the HTTP and FTP protocols. On average we observed over 700% improvement in transmission time and a 70% reduction in bandwidth requirements for streamed text data of size around 500,00 Kbytes.
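To make the measurement pipeline concrete, the following minimal sketch (ours, not part of the project software) applies a pre-processing transform, compresses the result with a Bzip2 backend via Python's standard bz2 module, and reports compression in BPC, i.e. bits of compressed output per character of the original text. The transform argument here is an identity stand-in; the actual *-encoding and the LPT family are described below and require the shared word dictionary.

```python
import bz2

def bpc(original: bytes, compressed: bytes) -> float:
    # Bits of compressed output per character of the original text.
    return 8.0 * len(compressed) / len(original)

def compare(text: bytes, transform):
    # Backend-only compression vs. transform followed by the same backend.
    baseline = bz2.compress(text, compresslevel=9)          # plain Bzip2 (-9)
    transformed = transform(text)                           # dictionary-based pre-processing
    combined = bz2.compress(transformed, compresslevel=9)   # Bzip2 on the transformed text
    # Both ratios are measured against the length of the original text.
    return bpc(text, baseline), bpc(text, combined)

if __name__ == "__main__":
    sample = open("plrabn12.txt", "rb").read()               # any file from the corpus
    plain, star = compare(sample, lambda t: t)               # identity stand-in for the transform
    print(f"Bzip2 only: {plain:.3f} BPC   transform + Bzip2: {star:.3f} BPC")
```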

The basic idea underlying the transformations we invented is to replace letters in a word with a special placeholder character (*), retaining at most two characters of the original word. Given an encoding, the original word can be retrieved from a dictionary that contains a one-to-one mapping between encoded words and original words. The encoding produces an abundance of * characters in the transformed text, making * the most frequently occurring character. The transformed text compresses better under all the available compression algorithms except one: the Bzip2 algorithm, whose initial run-length encoding step destroys the benefit of *-encoding. For Bzip2, we therefore propose three new transformations: the Fixed Context Length Preserving Transformation (LPT), in which strings of * characters are replaced by fixed sequences of characters in alphabetic order sharing common suffixes that depend on the lengths of the strings (viz. 'stuvw' or 'qrstuvw', etc.); the Fixed Context Reverse LPT (RLPT), which is the same as LPT with the sequence of characters reversed; and the Shortened-Context LPT (SCLPT), in which only the first character of the LPT sequence is kept, since it uniquely identifies the sequence. All of these improve compression performance and uniformly beat the best of the available compression algorithms over an extensive text corpus.
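The following is a minimal sketch of the *-encoding idea using a tiny hypothetical dictionary (the project's real dictionary covers the English language and has its own codeword construction; capitalization, punctuation, and words outside the dictionary are handled by the actual scheme and only hinted at here):

```python
# A hypothetical miniature dictionary: each word maps to a codeword of the same
# length built mostly from '*', retaining at most two of the original characters.
STAR_DICT = {
    "the": "**e",
    "and": "a**",
    "compression": "c*********n",
}
INVERSE_DICT = {code: word for word, code in STAR_DICT.items()}

def star_encode(words):
    # Replace dictionary words by their *-codewords; words outside the dictionary
    # are left unchanged here (the real scheme uses an escape mechanism for them).
    return [STAR_DICT.get(w, w) for w in words]

def star_decode(tokens):
    # Recover the original words through the inverse one-to-one mapping.
    return [INVERSE_DICT.get(t, t) for t in tokens]

words = "the compression and the dictionary".split()
encoded = star_encode(words)
assert star_decode(encoded) == words       # the transform is lossless and reversible
print(" ".join(encoded))                   # **e c*********n a** **e dictionary
```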


Targeted Activities, Summary of Results and Indication of Success

Our activities have been directed toward all three of the objectives stated earlier. The major results can be summarized as follows.

1.  Performance Measurements of the Compression Algorithms

We have implemented the four lossless, reversible text compression schemes discussed above (*-encoding, LPT, RLPT, and SCLPT), measured their compression performance on a combined Canterbury and Calgary corpus (see Table 1), and compared them with the existing algorithms. The results are summarized as follows.

1.  When compared to the standard compression algorithms, the *-encoded compression algorithms produce compression improvements of as high as 33% over the Huffman and arithmetic coders, about 10% over Unix compress, 9% to 19% over Gzip, and on average 1.7% over Bzip2 and 0.3% over PPMD. The last two are the best-performing compression algorithms reported in the literature so far. The improvement over Bzip2 is not always guaranteed because of the effect explained earlier. Fig. 1 illustrates typical performance results.


Figure 1. Comparison of BPC on plrabn12.txt (Paradise Lost) for the original file and the *-converted file using different compression algorithms.

2.  The three new transformations that we proposed (LPT, RLPT, and SCLPT) produce further improvement uniformly over the corpus. LPT gives an average improvement of 4.4% over Bzip2 and 1.5% over PPMD; RLPT gives 4.9% over Bzip2 and 3.4% over PPMD; and SCLPT gives 7.1% over Bzip2 and 3.8% over PPMD. The compression ratios are given as the average BPC (bits per character) over our test corpus. Only the comparison with Bzip2 and PPMD is shown (see Table 2 and the sketch that follows it). As can be seen from the table, our best average BPC is 2.147, which beats the best results available in the literature. See also Fig. 2 and Fig. 3.

File Name / File Size (bytes)
1musk10.txt / 1344739
alice29.txt / 152087
anne11.txt / 586960
asyoulik.txt / 125179
bib / 111261
book1 / 768771
book2 / 610856
crowd13 / 777028
dracula / 863326
franken / 427990
ivanhoe / 1135308
lcet10.txt / 426754
mobydick / 987597
news / 377109
paper1 / 53161
paper2 / 82199
paper3 / 46526
paper4 / 13286
paper5 / 11954
paper6 / 38105
plrabn12.txt / 481861
twocity / 760697
world95.txt / 2736128

Table 1. Files in the test corpus and their sizes.

Encoding / Bzip2 / PPMD
Original / 2.411 / 2.229
*-encoded / 2.377 / 2.223
LPT / 2.311 / 2.195
RLPT / 2.300 / 2.155
SCLPT / 2.251 / 2.147

Table 2. Average BPC over the corpus with Original, Star-encoded, LPT, RLPT, and SCLPT using Bzip2 and PPMD.
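For reference, the relative improvement implied by the corpus-wide averages in Table 2 can be computed as (baseline BPC - new BPC) / baseline BPC. Note that these corpus-wide figures need not coincide exactly with the average per-file improvements quoted in item 2 above. A small sketch using the Table 2 numbers:

```python
# Average BPC over the corpus, copied from Table 2.
AVG_BPC = {
    "Original":  {"Bzip2": 2.411, "PPMD": 2.229},
    "*-encoded": {"Bzip2": 2.377, "PPMD": 2.223},
    "LPT":       {"Bzip2": 2.311, "PPMD": 2.195},
    "RLPT":      {"Bzip2": 2.300, "PPMD": 2.155},
    "SCLPT":     {"Bzip2": 2.251, "PPMD": 2.147},
}

def improvement(encoding: str, backend: str) -> float:
    # Relative reduction in average BPC versus compressing the original text.
    base = AVG_BPC["Original"][backend]
    return 100.0 * (base - AVG_BPC[encoding][backend]) / base

for encoding in ("*-encoded", "LPT", "RLPT", "SCLPT"):
    print(f"{encoding:10s}  Bzip2: {improvement(encoding, 'Bzip2'):4.1f}%   "
          f"PPMD: {improvement(encoding, 'PPMD'):4.1f}%")
```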


2. Execution Time Measurements

In making average timing measurements, we chose to compare with the Bzip2 algorithm, which so far has given one of the best compression ratios with the lowest execution time. We also compare with the family of PPM algorithms, which give the best compression ratios but are very slow. (A simple timing-harness sketch is given after the lists below.)

·  The *-transform plus Bzip2 on our corpus is 6.32 times slower than Bzip2 without the transform. However, as the file size increases, the difference becomes significantly smaller: it is 19.5 times slower than Bzip2 for a file of 11,954 bytes and only 1.21 times slower for a file of 2,736,128 bytes.

·  The LPT plus Bzip2 on our corpus is 6.7 times slower than Bzip2 without the transform. Again, the difference shrinks with file size: it is 19.5 times slower than Bzip2 for a file of 11,954 bytes and 1.39 times slower for a file of 2,736,128 bytes.

·  The RLPT plus Bzip2 on our corpus is 6.44 times slower than Bzip2 without the transform. It is 19.67 times slower than Bzip2 for a file of 11,954 bytes but takes only 0.99 times the Bzip2 time for a file of 2,736,128 bytes.

·  The SCLPT plus Bzip2 on our corpus is 4.69 times slower than Bzip2 without the transform. It is 14.5 times slower than Bzip2 for a file of 11,954 bytes but takes only 0.61 times the Bzip2 time (i.e., it is actually faster) for a file of 2,736,128 bytes.

Note that the above times are for encoding only; one can afford to spend more time encoding files off line, particularly since the difference in execution time becomes negligibly small with increasing file size, as it does in our case. Our initial measurements of decoding times show no significant differences.

When we compared the average execution times with PPMD, the results are as follows:

·  The *-transform plus PPMD on our corpus is 18% slower than without the transform.

·  The LPT plus PPMD on our corpus is 5% faster than without the transform.

·  The RLPT plus PPMD on our corpus is 2% faster than without the transform.

·  The SCLPT plus PPMD on our corpus is 14% faster than without the transform.

·  In general, SCLPT is the fastest of the PPM-based combinations.
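The execution-time ratios reported above can be gathered with a simple wall-clock harness of the following form (a sketch only; it uses Python's bz2 module in place of the bzip2 command line and an identity function in place of the real transform, so the absolute numbers will differ from ours):

```python
import bz2
import time

def timed(fn, *args):
    # Wall-clock time of a single call; a real benchmark would repeat the call,
    # average the results, and time encoding and decoding separately.
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result

def time_ratio(path, transform):
    # Ratio of (pre-processing + Bzip2) time to plain Bzip2 time on one file.
    data = open(path, "rb").read()
    t_plain, _ = timed(bz2.compress, data, 9)        # Bzip2 alone (-9)
    t_pre, transformed = timed(transform, data)      # the pre-processing step
    t_post, _ = timed(bz2.compress, transformed, 9)  # Bzip2 on the transformed text
    return (t_pre + t_post) / t_plain

# Identity stand-in for the transform; paper5 and world95.txt are the smallest
# and largest files cited in the measurements above.
print(time_ratio("paper5", lambda d: d))
print(time_ratio("world95.txt", lambda d: d))
```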

3. Memory Usage Estimation

Regarding memory usage, the *-transform, LPT and RLPT each need to load two dictionaries of about 55K bytes each, requiring about 110K bytes; SCLPT takes about 90K bytes because of its smaller dictionary. Bzip2 is claimed to use 400K + (8 × blocksize) bytes for compression. We used the -9 option, i.e., a 900K block size, for the test, so about 7600K is needed. For PPM, the implementation allocates about 5100K + file size. The *-transform family therefore adds insignificant memory overhead compared to Bzip2 and PPM. It should be pointed out that none of the above programs has yet been well optimized; there is potential for both faster execution and smaller memory usage.
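As a quick sanity check of these estimates, the arithmetic can be restated as follows (a sketch; the formulas simply restate the figures quoted above, and world95.txt from Table 1 is used as an illustrative file size for the PPM case):

```python
# All figures are in kilobytes, following the estimates quoted above.
def bzip2_compress_memory(block_size_kb):
    # Claimed Bzip2 compression footprint: 400K + 8 * blocksize.
    return 400 + 8 * block_size_kb

def ppm_memory(file_size_kb):
    # Approximate footprint of the PPM implementation: about 5100K + file size.
    return 5100 + file_size_kb

print(bzip2_compress_memory(900))       # -9 option (900K blocks) -> 7600
print(ppm_memory(2736128 // 1024))      # world95.txt (~2672K)    -> 7772
# Transform dictionaries: two of ~55K each (~110K); about 90K for SCLPT.
```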

Figure 2. Compression BPC over the corpus with different encoding schemes using Bzip2.

Figure 3. Compression BPC over the corpus with different encoding schemes using PPMD.

Original:
Order / Count / % / Bits / % / Bytes / Ratio / BPC
5 / 114279 / 0.751 / 226297 / 0.682 / 28287 / 4.040 / 1.980
4 / 18820 / 0.875 / 47015 / 0.823 / 5876 / 3.202 / 2.498
3 / 12306 / 0.956 / 35077 / 0.929 / 4384 / 2.807 / 2.850
2 / 5395 / 0.992 / 18118 / 0.984 / 2264 / 2.382 / 3.358
1 / 1213 / 1.000 / 4972 / 0.999 / 621 / 1.952 / 4.099
0 / 75 / 1.000 / 465 / 1.000 / 58 / 1.290 / 6.200
Totals: 152088 (count) / 331944 (bits)
*-transformed:
Order / Count / % / Bits / % / Bytes / Ratio / BPC
5 / 134377 / 0.860 / 280203 / 0.857 / 35025 / 3.837 / 2.085
4 / 10349 / 0.926 / 22823 / 0.927 / 2852 / 3.628 / 2.205
3 / 6346 / 0.967 / 12911 / 0.966 / 1613 / 3.932 / 2.035
2 / 3577 / 0.989 / 6791 / 0.987 / 848 / 4.214 / 1.899
1 / 1579 / 0.999 / 3789 / 0.999 / 473 / 3.334 / 2.400
0 / 79 / 1.000 / 409 / 1.000 / 51 / 1.545 / 5.177
Totals: 156307 (count) / 326926 (bits)
LPT:
Order / Count / % / Bits / % / Bytes / Ratio / BPC
5 / 123537 / 0.790 / 250873 / 0.769 / 31359 / 3.939 / 2.031
4 / 13897 / 0.879 / 29749 / 0.860 / 3718 / 3.737 / 2.141
3 / 10323 / 0.945 / 21804 / 0.927 / 2725 / 3.788 / 2.112
2 / 6548 / 0.987 / 15861 / 0.976 / 1982 / 3.303 / 2.422
1 / 1923 / 0.999 / 7459 / 0.998 / 932 / 2.062 / 3.879
0 / 79 / 1.000 / 513 / 1.000 / 64 / 1.232 / 6.494
Totals: 156307 (count) / 326259 (bits)
RLPT:
Order / Count / % / Bits / % / Bytes / Ratio / BPC
5 / 124117 / 0.794 / 247458 / 0.767 / 30932 / 4.013 / 1.994
4 / 12880 / 0.876 / 26738 / 0.850 / 3342 / 3.854 / 2.076
3 / 10114 / 0.941 / 20159 / 0.912 / 2519 / 4.014 / 1.993
2 / 7199 / 0.987 / 20312 / 0.975 / 2539 / 2.835 / 2.822
1 / 1918 / 0.999 / 7549 / 0.999 / 943 / 2.033 / 3.936
0 / 79 / 1.000 / 474 / 1.000 / 59 / 1.333 / 6.000
Totals: 156307 (count) / 322690 (bits)
SCLPT:
Order / Count / % / Bits / % / Bytes / Ratio / BPC
5 / 113141 / 0.777 / 235115 / 0.735 / 29389 / 3.850 / 2.078
4 / 15400 / 0.883 / 42847 / 0.869 / 5355 / 2.875 / 2.782
3 / 8363 / 0.940 / 17225 / 0.922 / 2153 / 3.884 / 2.060
2 / 6649 / 0.986 / 17140 / 0.976 / 2142 / 3.103 / 2.578
1 / 1943 / 0.999 / 7248 / 0.999 / 906 / 2.145 / 3.730
0 / 79 / 1.000 / 458 / 1.000 / 57 / 1.380 / 5.797
Totals: 145575 (count) / 320033 (bits)

Table 3. Compression statistics at different context orders for alice29.txt with PPMD.