Project 1. Image compression system
The Bitmap (BMP) representation is the most commonly used format for storing halftone image information in the Windows 95, Windows 98, Windows 2000, Windows NT, and Windows XP operating systems. A particular case of it is the representation of color images by their RGB components.
A BMP file consists of either 3 or 4 parts, as shown in the diagram. The first part is a header, which is followed by an information section; if the image uses indexed colour, a palette follows; and last of all comes the pixel data. The information header contains the image width and height as well as the type of compression (or its absence) and the number of colours.
Images in BMP representation can be color (4, 8, 16, 24, or 32 bits/pixel) or monochrome (1 bit/pixel). The data can be compressed by run-length coding or stored without compression.
The header is 14 bytes in length and the information section is 40 bytes in length. The useful fields of the header are the type field (should be 'BM'), the file size, and the offset field, which gives the number of bytes before the actual pixel data. The most important fields of the image info data are: the image width and height, the number of bits per pixel (should be 1, 4, 8, or 24), the number of planes (assumed to be 1 here), and the compression type (assumed to be 0 here).
The RGB format, the simplest 24-bit true-colour representation we deal with, consists of a 54-byte header followed by the R, G, and B components of the image. In this case the image data follows immediately after the information header, that is, there is no colour palette. Each pixel of each component takes 8 bits (has a value in the range 0…255), so each component is a byte array of the same size as the image. In other words, the data consists of three bytes per pixel in B, G, R order, each byte giving the saturation of that colour component. For example, a colour image of size 120×160 pixels represented in RGB format takes 54 + 120·160·3 = 57654 bytes.
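For illustration, the header fields described above can be read with a short Python sketch. This is a minimal example assuming an uncompressed 24-bit file; the function and dictionary key names are ours, not part of any standard API.

```python
import struct

def read_bmp_header(path):
    """Read the 14-byte file header and 40-byte info header of a BMP file.

    Sketch for uncompressed 24-bit BMPs; the field layout follows the
    standard BITMAPFILEHEADER / BITMAPINFOHEADER structures.
    """
    with open(path, "rb") as f:
        header = f.read(54)                      # 14 + 40 bytes
    if header[:2] != b"BM":
        raise ValueError("not a BMP file")
    # file size at offset 2, pixel-data offset at offset 10
    file_size, data_offset = struct.unpack_from("<I4xI", header, 2)
    width, height = struct.unpack_from("<ii", header, 18)
    planes, bpp, compression = struct.unpack_from("<HHI", header, 26)
    return dict(file_size=file_size, data_offset=data_offset,
                width=width, height=height,
                planes=planes, bits_per_pixel=bpp, compression=compression)
```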
Color representations such as RGB are not always the most convenient. The R, G, and B components of an image are highly correlated, yet they are processed independently. Since all components are equally important from the image-reproduction point of view, we have to apply the same compression algorithm to each component and cannot favour any of them. On the other hand, other colour formats are available that use colour components closely related to the criteria used to describe colour perception: brightness, hue, and saturation. Brightness describes the intensity of the light (revealing whether it is white, gray, or black) and can be related to the luminance of the source. Hue describes what color is present (red, green, yellow, etc.) and can be related to the dominant wavelength of the light source. Saturation describes how vivid the color is (very strong, pastel, nearly white) and can be related to the purity or narrowness of the spectral distribution of the source.
Color spaces or color coordinate systems in which one component is the luminance and the other two components are related to hue and saturation are called luminance-chrominance representations. The luminance provides a grayscale version of the image (such as the image on a monochrome receiver), and the chrominance components provide the extra information that converts the grayscale image to a color image. Luminance-chrominance representations are particularly important for good image compression. One of the luminance-chrominance representations is called the YUV format and another is called the YCbCr format. We can convert RGB format to YUV format and to YCbCr format using the following linear transforms, respectively:
$$Y = 0.299R + 0.587G + 0.114B,$$
$$U = -0.147R - 0.289G + 0.436B,$$
$$V = 0.615R - 0.515G - 0.100B,$$
$$C_b = -0.169R - 0.331G + 0.500B + 128,$$
$$C_r = 0.500R - 0.419G - 0.081B + 128.$$
The inverse transforms can be described as follows:
$$R = Y + 1.140V,$$
$$G = Y - 0.395U - 0.581V,$$
$$B = Y + 2.032U;$$
$$R = Y + 1.402(C_r - 128),$$
$$G = Y - 0.344(C_b - 128) - 0.714(C_r - 128),$$
$$B = Y + 1.772(C_b - 128).$$
The components Y, U, and V (or Y, Cb, Cr) are almost uncorrelated. Moreover, the most important information is concentrated in the luminance component. Thus we do not lose much information if we decimate the chrominance components. Usually the U and V components are decimated by a factor of 2. In other words, 4 neighboring pixels forming a 2×2 square are described by 4 values of the component Y, one value of the component U, and one value of the component V. Each chrominance value is computed as the rounded-off arithmetic mean of the corresponding 4 pixel values belonging to the considered square. As a result we obtain the so-called YUV 4:1:1 standard video format, which is usually used as the input format for most video codecs. It is easy to compute that using this format we spend only 6 bytes for each 2×2 square instead of the 12 bytes spent by the original YUV format. Thus we have already compressed the image by a factor of two without any visible distortions.
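A minimal NumPy sketch of the 2×2 chrominance decimation described above, assuming even image dimensions:

```python
import numpy as np

def decimate_2x2(c):
    """Replace each 2x2 square of a chrominance component by the
    rounded arithmetic mean of its 4 pixels (H and W assumed even)."""
    h, w = c.shape
    blocks = c.reshape(h // 2, 2, w // 2, 2)    # group pixels into 2x2 squares
    return np.round(blocks.mean(axis=(1, 3)))
```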
The JPEG standard is based on the DCT coding technique. The component Y and the decimated components U and V are processed in blocks. Each block of Y contains 8×8 pixels and the corresponding blocks of U and V contain 4×4 pixels. The 2-D DCT is applied to each block of pixels.
The one-dimensional DCT for a sequence of length 8 is given by the formula
$$y(k) = \frac{c(k)}{2}\sum_{n=0}^{7} x(n)\cos\frac{(2n+1)k\pi}{16}, \quad k = 0,1,\dots,7,$$
where
$$c(k) = \begin{cases} 1/\sqrt{2}, & k = 0,\\ 1, & k = 1,2,\dots,7. \end{cases}$$
The inverse transform can be written as
$$x(n) = \sum_{k=0}^{7} \frac{c(k)}{2}\, y(k)\cos\frac{(2n+1)k\pi}{16}, \quad n = 0,1,\dots,7.$$
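The 8-point transform pair above can be implemented directly; the following sketch builds the cosine basis as a matrix. With this normalization the transform is orthonormal, so idct_1d(dct_1d(x)) recovers x up to rounding.

```python
import numpy as np

def dct_1d(x):
    """8-point DCT: y(k) = c(k)/2 * sum_n x(n) cos((2n+1) k pi / 16)."""
    n = np.arange(8)
    basis = np.cos((2 * n[None, :] + 1) * n[:, None] * np.pi / 16)
    c = np.full(8, 1.0)
    c[0] = 1 / np.sqrt(2)
    return (c / 2) * (basis @ x)

def idct_1d(y):
    """Inverse 8-point DCT; the transform is orthonormal, so the
    inverse uses the transposed basis."""
    n = np.arange(8)
    basis = np.cos((2 * n[None, :] + 1) * n[:, None] * np.pi / 16)
    c = np.full(8, 1.0)
    c[0] = 1 / np.sqrt(2)
    return basis.T @ (c / 2 * y)
```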
We decompose the image using a set of eight different cosine waveforms sampled at eight points. The coefficient that scales the constant basis function ($k = 0$) is called the DC coefficient. The other coefficients are called AC coefficients.
Since the DCT is a separable transform, the two-dimensional DCT of an 8×8 block can be performed by first applying the 1-D transform to the rows of the block and then applying the 1-D transform to the columns of the resulting block, that is,
$$y(k,l) = \frac{c(k)\,c(l)}{4}\sum_{m=0}^{7}\sum_{n=0}^{7} x(m,n)\cos\frac{(2m+1)k\pi}{16}\cos\frac{(2n+1)l\pi}{16},$$
where $x(m,n)$ denotes a pixel of the image component (Y, U, or V) and $y(k,l)$ is the transform coefficient.
Notice that performing the 2-D DCT is equivalent to decomposing the original image block over a set of 64 2-D cosine basis functions. These functions are created by multiplying a horizontally oriented set of 1-D 8-point basis functions by a vertically oriented set of the same functions. The horizontally oriented set of basis functions represents horizontal frequencies and the other set represents vertical frequencies. By convention, the DC term of the horizontal basis functions is at the left, and the DC term of the vertical functions is at the top. Because the 2-D basis functions are products of two 1-D DCT basis functions, the only constant basis function is in the upper left corner of the array. The coefficient for this basis function is called the DC coefficient, whereas the rest of the coefficients are called AC coefficients. The horizontal DCT frequency of the basis functions increases from left to right and the vertical DCT frequency increases from top to bottom.
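Because of separability, the 2-D transform reduces to two matrix products with the 8×8 DCT matrix defined by the 1-D formula; a sketch:

```python
import numpy as np

# 8x8 transform matrix: C[k, n] = c(k)/2 * cos((2n+1) k pi / 16)
_n = np.arange(8)
C = np.cos((2 * _n[None, :] + 1) * _n[:, None] * np.pi / 16) / 2
C[0, :] = 1 / (2 * np.sqrt(2))      # row k = 0 uses c(0) = 1/sqrt(2)

def dct_2d(block):
    """2-D DCT of an 8x8 block via separability: C @ block @ C.T
    applies the 1-D transform along both dimensions."""
    return C @ block @ C.T

def idct_2d(coef):
    """Inverse 2-D DCT; C is orthogonal, so its inverse is C.T."""
    return C.T @ coef @ C
```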
The obtained transform coefficients are quantized by a uniform scalar quantizer. The quantization is implemented as rounding-off of the DCT coefficients divided by a quantization step. The step values are set individually for each DCT coefficient, using criteria based on the visibility of the basis functions. Thus the quantized coefficient is
$$\hat{y}(k,l) = \operatorname{round}\left(\frac{y(k,l)}{q(k,l)}\right),$$
where $q(k,l)$ is the $(k,l)$th entry of the quantization matrix of size $8\times 8$. The JPEG standard uses two different quantization matrices. The first matrix is used to quantize the luminance component (Y) and has the form
$$Q_Y = \begin{pmatrix} 16 & 11 & 10 & 16 & 24 & 40 & 51 & 61\\ 12 & 12 & 14 & 19 & 26 & 58 & 60 & 55\\ 14 & 13 & 16 & 24 & 40 & 57 & 69 & 56\\ 14 & 17 & 22 & 29 & 51 & 87 & 80 & 62\\ 18 & 22 & 37 & 56 & 68 & 109 & 103 & 77\\ 24 & 35 & 55 & 64 & 81 & 104 & 113 & 92\\ 49 & 64 & 78 & 87 & 103 & 121 & 120 & 101\\ 72 & 92 & 95 & 98 & 112 & 100 & 103 & 99 \end{pmatrix}.$$
The second matrix is used to quantize the chrominance components and looks like
$$Q_C = \begin{pmatrix} 17 & 18 & 24 & 47 & 99 & 99 & 99 & 99\\ 18 & 21 & 26 & 66 & 99 & 99 & 99 & 99\\ 24 & 26 & 56 & 99 & 99 & 99 & 99 & 99\\ 47 & 66 & 99 & 99 & 99 & 99 & 99 & 99\\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99\\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99\\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99\\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99 \end{pmatrix}.$$
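A sketch of the quantization step using the luminance matrix above (the names Q_LUMA, quantize, and dequantize are ours):

```python
import numpy as np

# JPEG luminance quantization matrix, as given above
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def quantize(coef, q=Q_LUMA):
    """Round each DCT coefficient divided by its quantization step."""
    return np.round(coef / q).astype(int)

def dequantize(qcoef, q=Q_LUMA):
    """Approximate reconstruction: multiply back by the step."""
    return qcoef * q
```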
The quantized DCT coefficients are coded by a variable-length coder. The coding procedure is performed in two steps. At the first step the DC coefficient is DPCM coded using a first-order predictor; that is, the DC value of the previously coded 8×8 block is subtracted from the current DC coefficient. The AC coefficients are coded by a run-length coder. At the second step the obtained values are coded by a Huffman code.
Let $DC_i$ and $DC_{i-1}$ denote the DC coefficients of the $i$th and $(i-1)$th blocks, respectively. Due to the high correlation between DC coefficients they are DPCM coded, that is, their difference $\Delta = DC_i - DC_{i-1}$ is computed and then coded. For gray-scale images (or one of the Y, U, V components of a color image) a pixel is represented by 8 bits. Thus, the difference takes values in the range [-2047, 2047]. This range is split into 12 categories, where the $k$th category includes the differences whose absolute value has a binary representation of length $k$ bits. These categories are the first 12 categories shown in Table 1.
Each DC coefficient is described by a pair (category, amplitude). If $\Delta > 0$, then the amplitude is the binary representation of $\Delta$, of length equal to the category. If $\Delta < 0$, then the amplitude is the codeword of the complement binary code (one's complement) of the absolute value of $\Delta$, which also has length equal to the category. The category value is then coded by the Huffman code.
Example. Let the difference of the DC coefficients be $\Delta = -11$. It follows from Table 1 that this value belongs to category 4. The binary representation of the value 11 is 1011 and the codeword of the complement code is 0100. Thus, the value $\Delta$ is represented as (4, 0100). If the codeword of the Huffman code for category 4 is 110, then $\Delta$ is coded by the codeword 1100100 of length 7. The decoder first processes the category value (in our case it is 4); then the next 4 bits correspond to the value of $\Delta$. Since the most significant bit is equal to 0, the value is negative. Inverting the bits we obtain the binary representation of 11. Notice that using the categories simplifies the Huffman code. Without categories we would need a Huffman code for an alphabet of much larger size, that is, coding and decoding would be much more complicated.
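The category/amplitude mapping for DC differences can be sketched as follows; it reproduces the example above, where Δ = −11 gives category 4 and amplitude 0100 (the Huffman coding of the category is a separate step, not shown):

```python
def dc_category(delta):
    """Category = number of bits in the binary representation of |delta|."""
    return abs(delta).bit_length()

def dc_amplitude(delta):
    """Amplitude bits: plain binary for delta > 0, one's complement
    of |delta| for delta < 0 (as in the example above)."""
    cat = dc_category(delta)
    if cat == 0:
        return ""                     # category 0 carries no amplitude bits
    if delta > 0:
        return format(delta, "0{}b".format(cat))
    # one's complement: invert the bits of |delta|
    return format((1 << cat) - 1 - abs(delta), "0{}b".format(cat))

# dc_category(-11) -> 4, dc_amplitude(-11) -> '0100'
```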
Table 1. Categories of integer numbers.
Category / Numbers
0 / 0
1 / -1, 1
2 / -3, -2, 2, 3
3 / -7, ..., -4, 4, ..., 7
4 / -15, ..., -8, 8, ..., 15
5 / -31, ..., -16, 16, ..., 31
6 / -63, ..., -32, 32, ..., 63
7 / -127, ..., -64, 64, ..., 127
8 / -255, ..., -128, 128, ..., 255
9 / -511, ..., -256, 256, ..., 511
10 / -1023, ..., -512, 512, ..., 1023
11 / -2047, ..., -1024, 1024, ..., 2047
12 / -4095, ..., -2048, 2048, ..., 4095
13 / -8191, ..., -4096, 4096, ..., 8191
14 / -16383, ..., -8192, 8192, ..., 16383
15 / -32767, ..., -16384, 16384, ..., 32767
16 / 32768
For gray-scale images (or the Y, U, or V components) the AC coefficients can take values in the range [-1023, 1023]. After quantization many of these coefficients become zeros, so it is necessary to code only a small number of non-zero coefficients together with an indication of their positions. To do this efficiently, the 2-D array of DCT coefficients is rearranged into a 1-D linear array by scanning in the zigzag order shown in Fig. 2. This zigzag index sequence creates a 1-D vector of coefficients where the lower DCT frequencies tend to be at lower indices. The zigzag sequence is an important part of the coding model, as it affects the statistics of the symbols: when the coefficients are ordered in this fashion, the probability of a coefficient being zero is an approximately monotonically increasing function of the index.
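One way to generate the zigzag order of Fig. 2 is to sort the 64 index pairs by anti-diagonal, alternating the traversal direction on even and odd diagonals; a sketch:

```python
import numpy as np

def zigzag_order():
    """Indices of an 8x8 block in zigzag order (low frequencies first).

    Within anti-diagonal s = i + j: odd diagonals run top-to-bottom
    (sort by i), even diagonals bottom-to-top (sort by j)."""
    return sorted(((i, j) for i in range(8) for j in range(8)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def zigzag_scan(block):
    """Rearrange an 8x8 coefficient block into a 1-D vector of 64 values."""
    return np.array([block[i, j] for i, j in zigzag_order()])
```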
The run-length coder generates a codeword (run-length, category, amplitude), where run-length is the length of the zero run preceding the given non-zero coefficient, amplitude is the value of this non-zero coefficient, and category is the number of bits needed to represent the amplitude. The pair (run-length, category) is coded by a 2-D Huffman code and the amplitude is coded as in the case of DC coefficients and appended to the codeword (a sketch combining these steps is given after the special cases below).
Example. Let the non-zero coefficient preceded by 6 zeros be equal to -18. It follows from Table 1 that -18 belongs to category 5. The binary representation of 18 is 10010, so the codeword of the complement code is 01101. Thus, the coefficient is represented by ((6, 5), 01101). The pair (6, 5) is coded by the Huffman code and the value 01101 is appended to the codeword. If the codeword of the Huffman code for (6, 5) is 1101, then the codeword for -18 is 110101101.
There are two special cases when we encode the AC coefficients.
1. After a non-zero coefficient all remaining AC coefficients are zero. In this case the special symbol EOB is transmitted, which codes the end-of-block condition.
2. A pair (run-length, category) appears which is not included in the Huffman code table. In this case a special codeword called the escape code is transmitted, followed by uniform (fixed-length) codes for the run-length and the non-zero value.
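The sketch below turns a zigzag-ordered AC vector into (run-length, category, value) symbols. Instead of the escape mechanism of case 2, it splits zero runs longer than 15 with the JPEG-style (15, 0) symbol, and for simplicity it always appends EOB; the amplitude bits of each value would be formed exactly as for the DC differences.

```python
def rle_ac(ac):
    """Run-length code a 63-element zigzag AC vector into
    (run-length, category, value) triples followed by 'EOB'."""
    out, run = [], 0
    # index of the last non-zero coefficient; everything after it
    # is covered by the single EOB symbol
    last = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    for v in ac[:last + 1]:
        if v == 0:
            run += 1
        else:
            while run > 15:           # split long zero runs: (15, 0) = 16 zeros
                out.append((15, 0, 0))
                run -= 16
            out.append((run, abs(v).bit_length(), v))
            run = 0
    out.append("EOB")
    return out
```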
Coding efficiency is usually evaluated by the compression ratio, i.e. the original file size in bits divided by the compressed file size in bits. The quality of the synthesized image is characterized by the signal-to-noise ratio (SNR) at the output of the decoder:
$$\mathrm{SNR} = 10\log_{10}\frac{E_s}{E_n}\ \text{(dB)},$$
where $E_s$ denotes the energy of the original signal and $E_n$ is the energy of the quantization noise, i.e. the energy of the difference between the original and the reconstructed images. More often, to characterize the synthesized image quality, the peak signal-to-noise ratio (PSNR) is used. It is defined as follows:
$$\mathrm{PSNR} = 10\log_{10}\frac{255^2}{\mathrm{MSE}}\ \text{(dB)},$$
where 255 is the maximal pixel value and MSE is the mean squared error between the original and the reconstructed images.
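The PSNR computation is a one-liner in NumPy; a sketch for 8-bit images:

```python
import numpy as np

def psnr(original, reconstructed):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")           # identical images
    return 10 * np.log10(255.0 ** 2 / mse)
```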
1. Transform the image given in BMP format into YUV format.
2. Decimate the U and V components and compute the PSNR for the reconstructed U and V components. Plot the PSNR as a function of the decimation factor. Perform the YUV to RGB transform. Compare the reconstructed and the original images for different decimation factors.
3. Compute DCT coefficients for Y component. Uniformly scalar quantize the obtained coefficients. Reconstruct Y component using IDCT for quantized coefficients. Plot PSNR (for Y) as a function of quantization step.
4. Code the quantized transform coefficients of the Y component using the run-length variable-length encoder. Estimate the entropy of the obtained streams of run-lengths and levels.
5. Plot PSNR (for Y) as a function of estimated compression ratio. Compare the reconstructed and the original images for different compression ratios.
6. Use standard archivers to compress the original file. Compare compression ratio with entropy estimates.
7. Compress the original file using any available image editor. Compare compression ratios.
Project 2. Wavelet image coder.
The DFT and DCT are linear transforms based on the decomposition of the input signal over a system of orthogonal harmonic functions. The main shortcoming of these transforms is that the basis functions are uniformly distributed over the frequency axis. This means that all frequencies of the input signal are considered equally important for recovering the original signal from the transform coefficients. On the other hand, it is clear that the low-frequency components of the signal are more important than the high-frequency components, that is, the resolution of the system of basis functions should be non-uniform over the frequency axis. The problem of constructing such a transform is solved by using filter banks. One of the most efficient transforms is based on wavelet filter banks and is called wavelet filtering.
Wavelet filter banks have special properties. Their most important feature is their hierarchical structure. The input signal is decomposed by two filters into low-frequency and high-frequency parts. Then each part is decimated, that is, only the even-numbered samples are kept. The downsampled high-frequency part represents a final output because it is not transformed again. Since this part of the signal contains a rather insignificant part of the signal energy, it can be encoded with a small number of bits. The decimated low-frequency component contains the main part of the signal energy, and it is filtered again by the same pair of filters. The decimated high-frequency part of the low-frequency component is not transformed again, but the decimated low-frequency part of the low-frequency component can be filtered again, and so on. By choosing the filter bank properly it is possible to obtain roughly twice the compression ratio of DCT-based codecs at the same quality of the synthesized signal.
Thus the main idea of the wavelet transform is a hierarchical decomposition of the input sequence into so-called reference (low-frequency) subsequences of diminishing resolution and the related so-called detail (high-frequency) subsequences. At each level of decomposition the wavelet transform is invertible, that is, the reference signal of this level together with the corresponding detail signal provides perfect reconstruction of the reference signal of the next level (with higher resolution).
Fig. 3 illustrates one level of wavelet decomposition followed by reconstruction. The input sequence $x(n)$ is filtered by a lowpass filter with impulse response $h(n)$ and by a highpass filter with impulse response $g(n)$. The downsampling step is symbolized by $\downarrow 2$. The sequence $s(n)$ is the reference signal (the decimated result of lowpass filtering), and $d(n)$ is the detail signal (the decimated result of highpass filtering). It is evident that this scheme transforms one sequence of length $N$ into two subsequences of length $N/2$ each.
In the theory of wavelet filter banks, pairs of filters $h(n)$ and $g(n)$ are found for which there exist pairs of inverse filters $\tilde{h}(n)$ and $\tilde{g}(n)$ providing perfect reconstruction of the input signal. To reconstruct the input signal from the signals $s(n)$ and $d(n)$, these signals are first upsampled by a factor of 2. In Fig. 3 upsampling is symbolized by $\uparrow 2$. Then the upsampled low-frequency and high-frequency components are filtered by the inverse lowpass filter with impulse response $\tilde{h}(n)$ and the inverse highpass filter with impulse response $\tilde{g}(n)$, respectively. The sum of the results of filtering is the output signal $\hat{x}(n)$. The wavelet transform (wavelet filtering) provides perfect reconstruction of the input signal, that is, the output signal is determined as
$$\hat{x}(n) = x(n).$$
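As a minimal illustration of one decomposition level, the sketch below uses the Haar filter pair (h = [1, 1]/√2, g = [1, −1]/√2), the simplest wavelet filters; practical coders use longer filters (e.g., the biorthogonal 9/7 pair), but the perfect-reconstruction structure is the same.

```python
import numpy as np

def haar_analysis(x):
    """One decomposition level with the orthonormal Haar pair,
    followed by downsampling by 2 (length of x assumed even)."""
    x = np.asarray(x, dtype=float)
    s = (x[0::2] + x[1::2]) / np.sqrt(2)   # reference (low-frequency) part
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail (high-frequency) part
    return s, d

def haar_synthesis(s, d):
    """Perfect reconstruction: interleave the filtered, upsampled parts."""
    x = np.empty(2 * len(s))
    x[0::2] = (s + d) / np.sqrt(2)
    x[1::2] = (s - d) / np.sqrt(2)
    return x
```

Applying haar_analysis repeatedly to the reference part s yields the multi-level hierarchical decomposition described above.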