
Compression of 3D Scientific Data for Archival, Distribution and Fast Classification



Giovanni Motta, Francesco Rizzo, James A. Storer, and Bruno Carpentieri

Abstract—We propose a compression algorithm based on Partitioned Vector Quantization targeted to 3D scientific data, remote sensing and hyperspectral imagery. Along one of its dimensions, the input data is treated as a single vector that is partitioned into subvectors of (possibly) different lengths, each quantized separately. Quantization indices and residual errors are entropy coded. Partitioning serves several purposes: it counters the loss of efficiency that vector quantization suffers in high dimension, it allows a better decorrelation of the data along the given dimension, and it improves compression. The concatenation of the quantization indices can be seen as a projection onto a lower dimensional subspace that preserves the original statistical properties of the data; partition indices can be used for fast (remote) browsing and classification. Experimental results with NASA AVIRIS hyperspectral images show that the proposed algorithm achieves state-of-the-art lossless compression on such data, provides near-lossless and lossy capabilities (that could be used for fast browsing and broadcasting on limited bandwidth communication channels), scales to arbitrary data sizes, and allows fast classification and target detection in the compressed domain.

Index Terms — Image coding, Image processing, Pattern classification, Hyperspectral imaging.

I. INTRODUCTION

An increasing number of scientific instruments produce high volumes of data in the form of two-dimensional matrices of high dimensionality vectors, also called data cubes. Typical examples are biomedical images; multi-, hyper-, and ultraspectral imagery; and measurements of volumetric parameters like atmospheric temperature and pressure via microwave sounders. Compressing these data presents a challenge for the currently standardized general-purpose compressors. Besides the three-dimensional nature of the measurements, which exhibit correlation along each dimension, the individual measurements have values of 16 bits or more, and current state-of-the-art compressors do not perform well on data sources with large alphabets. Furthermore, it is desirable to have a single compressor that supports lossless, near-lossless, and lossy compression modes and that can scale to data cubes of arbitrary sizes. Lossless compression is often required for data collection and archiving due to the cost of the acquisition and transmission process and due to the fact that the original data may be needed for unforeseen analyses or further elaboration. However, when data is downloaded by users, a lossy mode can be employed to speed up the transmission, since the final users may not need all measurements at their full precision and, in any case, can request further precision only when needed.

Traditional approaches to the compression of hyperspectral imagery are based on differential prediction via DPCM ([1]-[3]), vector quantization ([4]-[7]), or dimensionality reduction through principal component analysis ([8]-[9]). An inter-band linear prediction approach based on least square optimization is presented in [10]; this compression method optimizes, for each sample, the parameters of a linear predictor with spatial and spectral support. A more complex least square optimized scheme combined with a vector quantizer has been recently presented in [11].

In [12] we propose a compression algorithm based on Partitioned Vector Quantization, where one dimension of the cube is treated as a single vector, partitioned and quantized. Partitioning the vector into subvectors of (possibly) different lengths is necessary because of the possibly large number of components. Subvectors are individually quantized with appropriate codebooks. The adaptive partitioning uses a novel locally optimal algorithm and provides a tool to reduce the size of the source alphabet. Correlation along the other two dimensions is exploited with methods derived from the image coding domain. Lossless or near-lossless entropy coding can be applied to the residual error. An additional feature of this compression method is the possibility to tightly bound the error on a pixel-by-pixel basis. Due to the hierarchical structure of the data compressed with the proposed algorithm, it is possible to perform classification and target detection directly in the compressed domain, with considerable speed-up and memory savings. The high-speed classification and target detection algorithm that we describe here can process 90 percent of the compressed pixels with a simple table lookup; full decompression is required only when exact classification is necessary, and it is limited to the remaining 10 percent of the pixels. The proposed algorithms have been verified experimentally on the lossless and near-lossless compression of NASA AVIRIS images where, under conservative assumptions, they are able to speed up the naïve classification by a factor of five or more.

Section II discusses the communication paradigm under consideration. Section III introduces the necessary notation and briefly reviews the proposed algorithm. Section IV discusses compression experiments with LPVQ.

The problem of classification and the necessary notation are introduced in Section V. In the same section, experimental results on the classification error introduced by the LPVQ lossy and near-lossless compression modes are discussed first. Building on these experimental observations, we describe an algorithm that speeds up the classification by accepting or rejecting vectors based only on bounds of the VQ error. These error bounds are available at design time or can be extracted from the compressed image. The asymptotic behavior is also derived. Given a compressed image and a classification threshold, the speed-up achieved over the naïve algorithm depends on the acceptance and rejection rates. Experiments are performed on typical AVIRIS images in order to assess these rates for a number of classification thresholds. The acceptance and rejection rates depend only on the bounds on the quantization error and can be easily derived during the design of the LPVQ. So, these considerations also provide a tool useful to select the quantization parameters (codebook entries and number of partitions) in order to match a desired range of classification speeds.

II. The Communication Model

An application for which our algorithms are designed is the encoding of information collected by a remote sensing device having limited computational power (satellite, airplane, biomedical imager) and losslessly transmitted to a powerful central server (see Figure 1). The central server decompresses and elaborates the data, performing all necessary adjustments and corrections. Then, it re-codes the data in a more compact representation and distributes the result among a number of users that may subscribe to the service at different quality levels. A common example is weather information collected by geostationary satellites. The satellites observe events evolving in a specific geographic area, such as hurricanes, storms, flooding and volcanic eruptions. The collected data have to be transmitted to a land-based central station that analyzes the data, applies some form of enhancement, further compresses the data and sends the result back to the satellite in order to broadcast it to the final users. In this setting, the same satellite is used for both acquisition and broadcasting. More common is the case of the central server performing the transmission. While the compression hardware used in the satellite must be simple and must require very little power, there is typically little or no limitation on the computational power of the base station.

III. Definitions

We model a three-dimensional data cube as a discrete-time, discrete-valued, bidimensional random source that emits pixels $\mathbf{x}_{r,c}$ that are $D$-dimensional vectors. Each vector component $x_{r,c,i}$, $0 \le i < D$, is drawn from a finite alphabet $\mathcal{A}_i$ and is distributed according to a space-variant probability distribution that may depend on other components. We assume that each alphabet has the canonical form $\mathcal{A}_i = \{0, 1, \dots, |\mathcal{A}_i| - 1\}$.

Our encoder is based on a Partitioned Vector Quantizer (or PVQ), which consists of $N$ independent, $L$-levels, $d_i$-dimensional exhaustive search vector quantizers $Q_i = (\mathcal{C}_i, \mathcal{P}_i, \phi_i)$, such that $\sum_{i=0}^{N-1} d_i = D$ and:

  • $\mathcal{C}_i = \{\mathbf{y}_{i,0}, \dots, \mathbf{y}_{i,L-1}\}$ is a finite indexed subset of $\mathbb{R}^{d_i}$ called codebook. Its elements $\mathbf{y}_{i,j}$ are the code vectors.
  • $\mathcal{P}_i = \{S_{i,0}, \dots, S_{i,L-1}\}$ is a partition of $\mathbb{R}^{d_i}$ and its equivalence classes (or cells) $S_{i,j}$ satisfy:

$$\bigcup_{j=0}^{L-1} S_{i,j} = \mathbb{R}^{d_i} \quad \text{and} \quad S_{i,j} \cap S_{i,k} = \emptyset \ \text{for} \ j \neq k.$$

  • $\phi_i : \mathbb{R}^{d_i} \rightarrow \mathcal{C}_i$ is a function defining the relation between the codebook and the partition as $\phi_i(\mathbf{v}) = \mathbf{y}_{i,j}$ if and only if $\mathbf{v} \in S_{i,j}$.

The index $j$ of the codeword $\mathbf{y}_{i,j}$, result of the quantization of the $d_i$-dimensional subvector $\mathbf{v}$, is the information that is sent to the decoder.

With reference to the previously defined vector quantizers $Q_i$, a Partitioned Vector Quantizer can be formally described by a triple $Q = (\mathcal{C}, \mathcal{P}, \phi)$ where:

  • $\mathcal{C} = \mathcal{C}_0 \times \mathcal{C}_1 \times \dots \times \mathcal{C}_{N-1}$ is a codebook in $\mathbb{R}^{D}$;
  • $\mathcal{P} = \mathcal{P}_0 \times \mathcal{P}_1 \times \dots \times \mathcal{P}_{N-1}$ is a partition of $\mathbb{R}^{D}$;
  • $\phi$ is computed on an input vector $\mathbf{x}$ as the concatenation of the independent quantization of the $N$ subvectors of $\mathbf{x}$. Similarly, the index vector sent to the decoder is obtained as a concatenation of the $N$ indices.

The design of a Partitioned VQ aims at the joint determination of the partition boundaries $b_0 = 0 < b_1 < \dots < b_N = D$ and at the design of the $N$ independent vector quantizers $Q_i$ having dimension $d_i = b_{i+1} - b_i$, $0 \le i < N$.

In the following, given a source vector $\mathbf{x}$, we indicate the subvector of boundaries $b_i$ and $b_{i+1}$ with the symbol $\mathbf{x}^{(i)} = (x_{b_i}, \dots, x_{b_{i+1}-1})$ (for simplicity, the row and column spatial coordinates are omitted when clear from the context). The mean squared quantization error between the vector $\mathbf{x}$ and its quantized representation $\hat{\mathbf{x}} = \phi(\mathbf{x})$ is given by

$$MSE(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{D} \sum_{i=0}^{N-1} \left\| \mathbf{x}^{(i)} - \mathbf{y}_{i,j_i} \right\|^2 = \frac{1}{D} \sum_{i=0}^{N-1} \sum_{k=b_i}^{b_{i+1}-1} \left( x_k - y_{i,j_i,k-b_i} \right)^2$$

where $\mathbf{y}_{i,j_i}$ is the codeword of the codebook $\mathcal{C}_i$ minimizing the reconstruction error on $\mathbf{x}^{(i)}$, and:

$$j_i = \arg\min_{0 \le j < L} \left\| \mathbf{x}^{(i)} - \mathbf{y}_{i,j} \right\|^2.$$
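To make the notation concrete, the following sketch (in Python/NumPy, with function and variable names of our own choosing rather than taken from the LPVQ implementation) quantizes one pixel vector with a partitioned VQ and evaluates the resulting mean squared error:

    import numpy as np

    def pvq_encode(x, codebooks, boundaries):
        # codebooks[i]: (L, d_i) array of code vectors for partition i
        # boundaries:   [b_0 = 0, b_1, ..., b_N = D]
        indices = []
        for i, cb in enumerate(codebooks):
            sub = x[boundaries[i]:boundaries[i + 1]]        # subvector x^(i)
            dist = np.sum((cb - sub) ** 2, axis=1)          # exhaustive search over the L levels
            indices.append(int(np.argmin(dist)))            # index j_i sent to the decoder
        return indices

    def pvq_decode(indices, codebooks):
        # The reconstruction is the concatenation of the selected codewords.
        return np.concatenate([codebooks[i][j] for i, j in enumerate(indices)])

    def pvq_mse(x, x_hat):
        return float(np.mean((x - x_hat) ** 2))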

A. A Locally Optimal PVQ Design

Building a quantizer that scales to vectors of arbitrary dimension is known to be computationally prohibitive. Furthermore, the efficiency of a VQ decreases with its dimensionality: the fraction of the volume of a hypersphere inscribed in a hypercube of the same dimension d goes to zero when d goes to infinity ([13]-[14]). Partitioning the vectors into consecutive, non-overlapping subvectors is a common solution to this problem (see the book of Gersho and Gray [15]). While a partitioned VQ leads to a suboptimal solution in terms of Mean Squared Error (MSE) because it does not exploit correlation among subvectors, the resulting design is practical, and coding and decoding present a number of advantages in terms of speed, memory requirements and exploitable parallelism.
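To make the dimensionality effect concrete, the volume of a $d$-dimensional ball of radius $1/2$ inscribed in the unit hypercube, relative to the volume of the hypercube itself, is

$$\frac{V_{\mathrm{ball}}(d)}{V_{\mathrm{cube}}(d)} = \frac{\pi^{d/2}}{2^{d}\,\Gamma\!\left(\tfrac{d}{2}+1\right)} \;\longrightarrow\; 0 \quad \text{as } d \to \infty,$$

so in high dimension almost all of the volume of a cubic cell lies far from its center and an unstructured quantizer spends its codewords inefficiently.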

In a Partitioned VQ, we divide the input vectors into $N$ subvectors and quantize each of them with an $L$-levels, exhaustive search VQ. Having subvectors of the same dimension (except perhaps one, if $N$ does not divide $D$) is generally not a possibility since, in this context, we assume that the components of $\mathbf{x}$ may be drawn from different alphabets. In this case their distributions may be significantly different and partitioning the components uniformly into blocks may not be optimal. We wish to determine the sizes of the subvectors (possibly different from one another) adaptively, while minimizing the quantization error, measured for example in terms of MSE. Once the codebooks are designed, input vectors are encoded by partitioning them into subvectors of appropriate length, each of which is quantized independently with the corresponding VQ. The index of the partitioned vector is given by the concatenation of the indices of the subvectors.

Given the number of partitions N and the number of levels per codebook L, it is possible to find the partition boundaries achieving minimum distortion with a brute-force approach: first determine, for every pair of boundaries $0 \le i < j \le D$, the distortion $D_{i,j}$ that an $L$-levels vector quantizer achieves on the input subvectors with boundaries $i$ and $j$. Then, with a dynamic programming approach, traverse the matrix $D_{i,j}$ and find the $N$ costs, corresponding to a partition with boundaries $b_0 < b_1 < \dots < b_N$, whose sum is minimal. (A more sophisticated approach to partitioned vector quantization is discussed by Matsuyama in [16]. It uses dynamic programming at each step to decide the current optimal partition, instead of designing one vector quantizer for each possible partition configuration first and then applying dynamic programming. Nevertheless, even this approach is computationally intensive.)
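Assuming the interval distortions have already been measured (one L-level VQ trained on each candidate subvector and its distortion stored in a (D+1) x (D+1) table), the dynamic programming pass that selects the minimum-cost boundaries can be sketched as follows; names and interfaces are ours, for illustration only:

    import numpy as np

    def optimal_boundaries(D_cost, N):
        # D_cost[i][j]: distortion of an L-level VQ on components i..j-1 (0 <= i < j <= D),
        # assumed precomputed; returns boundaries [b_0 = 0, ..., b_N = D] of minimum total cost.
        D = len(D_cost) - 1
        best = np.full((N + 1, D + 1), np.inf)      # best[n][j]: cost of covering 0..j-1 with n parts
        back = np.zeros((N + 1, D + 1), dtype=int)  # back[n][j]: start of the last part
        best[0][0] = 0.0
        for n in range(1, N + 1):
            for j in range(n, D + 1):
                for i in range(n - 1, j):           # the last part spans components i..j-1
                    c = best[n - 1][i] + D_cost[i][j]
                    if c < best[n][j]:
                        best[n][j], back[n][j] = c, i
        bounds, j = [D], D
        for n in range(N, 0, -1):                   # walk back to recover the boundaries
            j = back[n][j]
            bounds.append(j)
        return bounds[::-1]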

The locally optimal partitioning algorithm that we propose in this paper (LPVQ) provides an efficient alternative to dynamic programming, while performing comparably in most applications. Our partitioning algorithm is based on a variation of the Generalized Lloyd Algorithm (or GLA, see Linde, Buzo and Gray [17]).

Unconstrained vector quantization can be seen as the joint optimization of an encoder (the function $\phi$ described before) and a decoder (the determination of the codewords for the equivalence classes of the partition). GLA is an iterative algorithm that, starting from the source sample vectors, chooses a set of codewords (also called centroids) and optimizes encoder and decoder in turn until improvements on the predefined distortion measure are negligible. In order to define our PVQ, the boundaries of the vector partition have to be determined as well. Our design follows the same spirit as the GLA. The key observation is that, once the partition boundaries are kept fixed, the Mean Squared Error (the distortion we use in our application) is minimized independently for each partition by applying the well-known optimality conditions on the centroids and on the cells. Similarly, when the centroids and the cells are held fixed, the (locally optimal) partition boundaries can be determined in a greedy fashion. The GLA step is independently applied to each partition. The equivalence classes are determined as usual but, as shown in Figure 2, the computation keeps a record of the contribution to the quantization error of the leftmost and rightmost components of each partition:

$$LE_i = \sum_{\mathbf{x}} \left( x_{b_i} - \hat{x}_{b_i} \right)^2 \quad \text{and} \quad RE_i = \sum_{\mathbf{x}} \left( x_{b_{i+1}-1} - \hat{x}_{b_{i+1}-1} \right)^2$$

Except for the leftmost and rightmost partition, two extra terms are also computed: $LE'_i$, the error that the leftmost component of partition $i$ would incur if it were quantized as part of partition $i-1$, and $RE'_i$, the error that the rightmost component of partition $i$ would incur if it were quantized as part of partition $i+1$:

$$LE'_i = \sum_{\mathbf{x}} \left( x_{b_i} - \tilde{x}_{b_i} \right)^2 \quad \text{and} \quad RE'_i = \sum_{\mathbf{x}} \left( x_{b_{i+1}-1} - \tilde{x}_{b_{i+1}-1} \right)^2$$

The reconstruction values $\tilde{x}$ used in the expressions for $LE'_i$ and $RE'_i$ are determined by the classification performed on the components of the adjacent partition. The boundary $b_{i+1}$ between the partitions $i$ and $i+1$ is changed according to the criteria:

$$\Delta_i = \min\left( RE_i + LE_{i+1},\; RE'_i + LE_{i+1},\; RE_i + LE'_{i+1} \right)$$

if ($\Delta_i = RE'_i + LE_{i+1}$) then $b_{i+1} \leftarrow b_{i+1} - 1$;

else if ($\Delta_i = RE_i + LE'_{i+1}$) then $b_{i+1} \leftarrow b_{i+1} + 1$.
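The greedy boundary update can then be sketched as follows; this is our own reading of the rule above, with illustrative variable names, and the surrounding GLA loop is omitted:

    def adjust_boundary(LE, RE, LEp, REp, bounds, i):
        # LE[i], RE[i]:   accumulated error of the leftmost / rightmost component of partition i
        # LEp[i], REp[i]: error of the same components when re-quantized with the adjacent partition
        keep       = RE[i]  + LE[i + 1]     # leave boundary b_{i+1} where it is
        move_left  = REp[i] + LE[i + 1]     # rightmost component of partition i joins partition i+1
        move_right = RE[i]  + LEp[i + 1]    # leftmost component of partition i+1 joins partition i
        best = min(keep, move_left, move_right)
        if best == move_left:
            bounds[i + 1] -= 1
        elif best == move_right:
            bounds[i + 1] += 1
        return bounds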

IV. Data Compression with LPVQ

The partitioned vector quantizer is used here as a tool to reduce the dimensionality of the source vectors. Dependence between the other two dimensions has to be exploited by means of an entropy coder. Furthermore, when lossless coding is required, quantization residuals have to be entropy coded with a conditioning on the subvector indices.

The index vectors $\mathbf{j}(r,c) = (j_0, \dots, j_{N-1})$ and the codewords $\mathbf{y}_{i,j}$, with $0 \le i < N$ and $0 \le j < L$, are sufficient for a lossy reconstruction of the data. When higher accuracy is needed, the compressed data can be augmented with the quantization error:

$$\mathbf{e}(r,c) = \mathbf{x}(r,c) - \hat{\mathbf{x}}(r,c)$$

where, for each $0 \le i < N$:

$$\hat{\mathbf{x}}^{(i)}(r,c) = \mathbf{y}_{i,\,j_i(r,c)}$$

The unconstrained quantizers work independently of each other and independently on each source vector, so an entropy encoder must be used to exploit this residual redundancy. In particular, each VQ index $j_i(r,c)$ is encoded by conditioning its probability with respect to a set of causal indices that are spatially and spectrally adjacent. The components of the residual vector $\mathbf{e}(r,c)$ are entropy coded with their probability conditioned on the VQ index $j_i(r,c)$. In near-lossless applications, a small, controlled error can be introduced at this stage. Figure 3 depicts a diagram of the LPVQ encoder.
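As an illustration of how a tight per-component error bound can be enforced at this stage, the residuals can be uniformly quantized with step $2\delta+1$ before entropy coding, which guarantees a maximum absolute error of $\delta$; this particular quantizer is a standard near-lossless device and is our assumption, not necessarily the exact mechanism used in LPVQ:

    import numpy as np

    def near_lossless(residual, delta):
        # residual: integer quantization errors; delta: maximum tolerated absolute error.
        # delta = 0 keeps the residual exactly (lossless mode).
        step = 2 * delta + 1
        q = np.round(residual / step).astype(np.int64)   # symbols passed to the entropy coder
        rec = q * step                                   # residual reconstructed by the decoder
        assert np.all(np.abs(residual - rec) <= delta)
        return q, rec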

A. Computational Complexity

Vector Quantization is known to be computationally asymmetrical, in the sense that encoding has a computational complexity that is orders of magnitude higher than decoding. Even more time consuming is the design of the dictionary, which is also not guaranteed to converge within an easily predictable number of iterations. Numerous methods have been designed to speed up the convergence, for example by constraining the search for the closest codeword as in [18], or by seeding the dictionary with vectors having special properties [19]. Adding structure to the VQ can also simplify encoding at the price of a limited degradation of the output quality [15]. All these improvements clearly apply to our method as well; furthermore, practical scenarios arise in which our algorithm can be applied as is, without the computational complexity being an obstacle.

A possible way of mapping the LPVQ algorithm to the communication model described in Section II is to equip the platform used for the remote acquisition with an encoder that is able to download a dictionary transmitted from the central server (see Figure 1). This dictionary can be, for the very first scene, bootstrapped as described in [19], with a number of vectors randomly selected from the image itself, with a dictionary designed for another image, or with any dictionary that is believed to be statistically significant. Due to the nature of our algorithm, the choice of the first dictionary will only impact the compression ratio of the very first transmission.

Once the central server receives the scene acquired at time instant t, a few iterations of LBG may be necessary to refine the dictionary, adjust the partition boundaries and improve compression. The refined version of the dictionary is transmitted back to the remote platform and is used for the encoding of the scene acquired at time t+1. Later in this section we report on experiments intended to assess the performance degradation resulting from a mismatched dictionary.

It has been observed in [1] that if the codewords in each codebook are sorted according to their energy, the planes constituted by the indices retain important information on the structure of the original image. These planes alone can be used to browse and select areas of interest and, as we show in Section V, to extract important information on the nature of the original pixel vectors.
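For illustration, the reordering can be done once after training, by sorting each codebook by codeword energy and remapping the stored index planes; this is a sketch with names of our choosing:

    import numpy as np

    def sort_codebook_by_energy(codebook, index_plane):
        # codebook: (L, d_i) code vectors; index_plane: 2D array of quantization indices.
        energy = np.sum(codebook.astype(np.float64) ** 2, axis=1)
        order = np.argsort(energy)                 # new position -> old index
        remap = np.empty_like(order)
        remap[order] = np.arange(len(order))       # old index -> new position
        return codebook[order], remap[index_plane]

After this remapping, each index plane behaves roughly like a grayscale rendering of the scene and can be browsed directly.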

B. Experimental results

LPVQ has been tested on a set of five AVIRIS images ([20]). AVIRIS images are obtained by flying a spectrometer over the target area. They are 614 pixels wide and typically have a height on the order of 2,000 pixels, depending on the duration of the flight. Each pixel represents the light reflected by a 20 x 20 meter area (high altitude) or a 4 x 4 meter area (low altitude). The spectral response of the reflected light is decomposed into 224 contiguous bands (or channels), approximately 10 nm wide, spanning from visible to near infrared light (400 nm to 2500 nm). Each of these vectors provides a spectral signature of the area. Spectral components are acquired in floating point with 12-bit precision and then scaled and packed into signed 16-bit integers. After acquisition, AVIRIS images are processed to correct various physical effects (geometrical distortion due to the flight trajectory, time of day, etc.) and stored in scenes of 614 by 512 pixels each (when the image height is not a multiple of 512, the last scene will be smaller). All files for each of the 5 test images can be downloaded from the NASA web site (JPL [20]) and the scenes belonging to the same flight are merged together to form a complete image.
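For reference, a raw scene of 16-bit samples can be loaded, and the scenes of a flight merged along the flight direction, with a few lines of NumPy; the byte order and interleave assumed below (big-endian, band-interleaved-by-pixel) are our assumptions about the distributed files and must be adapted to the actual product layout:

    import numpy as np

    COLS, BANDS = 614, 224                      # standard AVIRIS scene width and band count

    def load_scene(path):
        # Assumes big-endian signed 16-bit samples, band-interleaved-by-pixel;
        # the number of rows is inferred from the file size (the last scene may be shorter).
        data = np.fromfile(path, dtype='>i2')
        rows = data.size // (COLS * BANDS)
        return data.reshape(rows, COLS, BANDS).astype(np.int16)

    def merge_scenes(paths):
        # Concatenate the scenes of one flight along the spatial (row) axis.
        return np.concatenate([load_scene(p) for p in paths], axis=0)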

Several experiments have been performed for various numbers of partitions and for different codebook sizes. The results that we describe here were obtained for a fixed number of partitions and for codebooks of 256 levels. This choice of the number of levels also makes practical the use of off-the-shelf image compression tools that are fine-tuned for 8-bit data. The LPVQ algorithm trains on each image independently and the codebooks are sent to the decoder as side information. The size of the codebook is negligible with respect to the size of the compressed data (256 x 224 x 2 bytes = 112 KB) and its cost is included in the reported results.