BDM — A Bit Level Decomposition Storage Model
Qiang Ding, William Perrizo and Victor Shi
North Dakota State University
IACC 258, C.S. Department
Fargo, ND USA
{Qiang.Ding,William.Perrizo,Victor.Shi}@ndsu.nodak.edu
1
Abstract
Query Optimization is one of the most important traditional core areas in database research. In order to obtain the best performance advantages, exceptional selection, project, and join (SPJ) strategies need be developed. In addition it is no longer satisfactory to separate data mining operations from query processing. Data mining is at one end of the query spectrum and standard SPJ queries are at the other. Furthermore, often a user request combines both activities. It is the aim of this paper to introduce a bit-level decomposition storage model (BDM) facilitating efficient SPJ query processing and data mining in a unified manner. Through complete vertical decomposition of data files, field-level access is possible, in which only those participating fields are retrieved. Using a compression method called Peano-Trees or P-trees, query optimisation, data mining and concurrency control can be unified in an optimal way. In addition, the approach taken facilitates snapshot concurrency control at the field level. Using the BDM model for query processing and data mining, optimistic snapshot concurrency control providing full serializable execution is automatically facilitated.
1. Introduction
The P-tree and bSQ technologies discussed in this paper are patent pending by North Dakota State University.
This work was partially supported by GSA grant number ACT# K96130308.
Vertical partitioning has been studied within the context of both centralized and distributed database system. It is a good strategy when small numbers of columns are retrieved by most queries. The decomposition of a relation also permits a number of transactions to execute concurrently. Copeland et al presented an attribute level Decomposition Storage Model called DSM [CK85], storing each column of a relational table into a separate table. DSM was shown to perform well. Attribute level Vertical decomposition is also used in Remotely Sensed Imagery (e.g., Landsat Thematic Mapper Imagery), where it is called Band Sequential (BSQ) format.
Beyond attribute level decomposition, Wong et al took advantage of encoded attribute values using a small number of bits to reduce the storage space [WLO+85]. In this paper, we will decompose attributes of relational tables into separate files by bit position. The model is called bit Sequential (bSQ). The bSQ model is similar to Wong et al Bit Transposed File model (BTF). In addition, we will compress the vertical bit files using a query-and-data-mining-ready structure called the Peano Tree.
Our method offers these advantages:
(1) By vertical partitioning, we only need to read what we need.
(2) We encode attribute values into bit vector format, which makes compression easy to do.
(3) SPJ queries can be formulated as Boolean expressions, which facilitates fast hardware implementation.
(4) Our model works well not only for query processing but for data mining as well.
The rest of this paper is organizes as follows. Section 2 gives the various encoding strategies used in BDM. Section 3 presents P-tree compression of vertical bit files that also facilitates fast Boolean operations. In Section 4 the query optimization strategy is presented. Some interesting future research directions are given in Section 5. Some other related work is given in Section 6. Section 7 contains the summary and concludes the paper.
2. Encoding Strategies
We use different encoding strategies on different types of attributes. For easy retrieval, we limit them to fixed length encoding. Below we will describe each of them with examples.
2.1 Binary Encoding
For most numeric values (exclude floating point values), we can use bits to represent values between 0 and n. This strategy is also very suitable for attributes with fixed possible values. For example, gender attributes can be encodes as 0 or 1; months of a year can be encoded as 4 bits values range from 0000 to 1011. Floating point values can also be coded using combinations of the fraction and exponent.
2.2 Lookup-table Encoding
For most non-numeric discrete and categorical values, a lookup table can be maintained for all possible values. There are several standard such encodings, such as EBCDIC and ASCII, however these can be inefficient of long strings. An example of lookup-table encoding is given in Figure 1.
Attribute
a 0 0 0 Lookup
b 0 0 1 Table
c 0 1 0 a 000
a 0 0 0 b 001
d 0 1 1 c 010
c 0 1 0 d 011
a 0 0 0 e 100
e 1 0 0
Figure 1. An example using lookup-table encoding.
2.3 Bitmap Encoding
For numeric values, especially sparse values with a wide range, bitmap encoding is very useful. If m is the cardinality of the relational table, n is the number of different values for an attribute, then the corresponding column of a table can be encoded by a m by n matrix, where the ith record has a value of v if and only if the ith bit in the bitmap associated with the attribute value v is set to 1, and the ith bit in each of the other bitmaps is set to 0. An example using bitmap encoding is given in Figure 2.
Attribute 90 80 70 60
90 1 0 0 0
80 0 1 0 0
70 0 0 1 0
90 1 0 0 0
60 0 0 0 1
90 1 0 0 0
80 0 1 0 0
70 0 0 1 0
Figure 2. An example using bitmap encoding.
3. Compression Using P-trees
In [PDD+01], a quadrant-based tree structures, called the Peano Trees or P-tree, was developed to facilitate compression and very fast processing (logical ANDing) of bit sequential (bSQ) data. P-trees can be 1-dimensional, 2-dimensional, 3-dimensional, etc. If the data has a natural dimension (e.g., spatial data) the P-tree dimension is matched to the data dimension. Otherwise, the dimension can be chosen to optimize compression. In this paper we will use 2-dimensions throughout.
The Peano Tree is designed specifically to facilitate very fast logical AND operations used in data mining and query processing. The most useful form of a Ptree is the predicate-P-tree in which a 1-bit appears at those tree nodes corresponding to quadrants for which the predicate holds. In Figure 3, a bSQ file with 64 rows is on the left, the file is rearranged into 2-D Peano or Z order in the middle and the P-tree is on the right.
Figure 3. bSQ file, 2-D Peano order bSQ file, and P-tree.
In this example, The count of 1-bits in the entire file is called root count of the P-tree (equals 39 in this example). The root count or any other quadrant count can be computed quickly by summing from the bottom up. A P-tree is a type of quadrant tree.
If we compute all quadrant counts and place them at the nodes of a P-tree, it is called a Peano Count tree. In a Peano Count tree, the leaf sequence (depth-first) is a partial run-length compressed version of the original bit vector [DKR+02]. Therefore, P-trees can save substantial amounts of storage. Furthermore, P-tree Boolean operations (AND, OR, and NOT) can be conducted directly without decompression, eliminatig a high CPU cost required in most compression algorithms.
Three encoding strategies are described in section 2. For binary encoding and lookup-table encoding, each bit vector is compressed and stored as a basic P-tree. As discussed in previous work [DKR+02], basic P-trees of different bit positions can be ANDed together resulting in a value P-tree (1 for each quadrant that has that value throughout) or tuple P-tree (1 for each quadrant that has that tuple throughout). The collection of possible value P-trees is precisely a P-tree representation of the bit vectors in bitmap encoding.
4. SPJ Query Optimization Strategies
4.1 One-table Selections
There are two categories of queries in one-table selections: Equality Queries and Range Queries. Most techniques [WLO+85, OQ97, CI98, CI99] used to optimize them employ encoding schemes – equality encoding and range encoding.
We defined interval P-trees in [DKR+02], in which an tree node has a 1-bit iff all values in the corresponding quadrant are from the interval. So for any one-table selction, we have one corresponding interval P-tree. The ANDing result of all the corresponding interval P-trees represents all the rows satisfy the conjunction of all the restriction in the one-table where clause.
4.2 Select-Project-StarJoin (SPSJ) Queries
A Select-Project-StarJoin query is a SPJ query in which there is one multiway join along with selections and projections. Typically there is a central fact relation to which several dimension relations are joined. The dimension relations can be viewed as points on a star centered on the fact relation. Consider, for example, the Student (S), Course (C), and Enrollment (E) database shown below (note a bit encoding is shown in italics for certain attributes), and the SPSJ query,
SELECT S.s, S.name, C.name
FROM S, C, E
WHERE S.s=E.s AND C.c=E.c AND S.gen='M' AND
E.grade='A' AND C.term='S';
_______________ __________________
S|s |name |gen| C|c |name|st|term|
|0 000|CLAY |M 0| |0 000|BI |ND|F 0|
|1 001|THAIS|M 0| |1 001|DB |ND|S 1|
|2 010|GOOD |F 1| |2 010|DM |NJ|S 1|
|3 011|BARD |F 1| |3 011|DS |ND|F 0|
|4 100|PERRY|M 0| |4 100|SE |NJ|S 1|
|5 101|JOAN |F 1| |5 101|AI |ND|F 0|
___________________
E|s |c |grade |
|0 000|1 001|B 10|
|0 000|0 000|A 11|
|3 011|1 001|A 11|
|3 011|3 011|D 00|
|1 001|3 011|D 00|
|1 001|0 000|B 10|
|2 010|2 010|B 10|
|2 010|3 011|A 11|
|4 100|4 100|B 10|
|5 101|5 101|B 10|
The Decomposition Storage Model (DSM) [CK85], Attribute Transpose File model (ATF) [Bat79] and the Band Sequential Model (BSQ) all refer to the vertical partitioning of the original relations into single attribute relations with a consistent surrogate ordering. We further vertical partition some attributes to the bit level - referred to in [WLO+85] as the Bit Transpose Model (BTF) and in [PDD+01, DKR+02] as bit-Sequential (bSQ) storage model. In this paper, form this point forward, we will use BSQ instead of DSM or ATF, and we will use bSQ instead of BTF.
We organize those attributes expected to be involved in frequent join or select operations into the bSQ format and leave the others (typically large character string attributes) in the BSQ format. Most join and selection attributes are numeric (e.g., S.s, C.c, E.s, E.c), but even if they are categorical, they can first be coded numeric (e.g., S.gender, C.term, E.grade). In our BDM model, the bSQ attributes are then compressed into P-trees. The P-tree dimension is a parameter which can be chosen to optimize compression. We will assume 2-dimensional P-trees throughout this paper. P-tree format provides considerable compression and acceleration of the logical operations used to answer the query or data mine.
However, since this example is artificially small, we show the bSQ attributes as uncompressed, Peano-ordered, 2-D matrixes rather than compressed P-trees so that the presentation will be easy to follow. The 2-D matrix forms will always be chosen to be 2n´2n for some n=1,2,3… This will accommodate later inserts (n is incremented by one each time the matrix saturates). We note that the quadrant-wise compression provided by P-trees will fully compress most of the as-yet unused 4n entries. In our simplified presentation as 2n´2n maticies, we simply leave the unused entries blank. The bSQ attributes for our example are stored as shown below (note, e.g., bit-1 of S.s has been labeled Ss1 for simplicity of notation, etc.). The reader is reminded again that the matrix ordering is Peano or Z ordering., not raster ordering.
Ss1 Ss2 Ss3 Sgen Cc1 Cc2 Cc3
0011 0000 0101 0001 0011 0000 0101
00 11 01 11 00 11 01
Cterm Es1 Es2 Es3
0110 0000 0000 0011
10 0000 1111 1100
11 00 01
Ec1 Ec2 Ec3 Egrade1 Egrade2
0000 0010 1010 1101 0100
0000 0111 1101 1011 1001
11 00 01 11 00
The BSQ attributes are stored as single attribute tables as follows.
S.name C.name C.st
|CLAY | |BI | |ND|
|THAIS| |DB | |ND|
|GOOD | |DM | |NJ|
|BARD | |DS | |ND|
|PERRY| |SE | |NJ|
|JOAN | |AI | |ND|
For BSQ character string attributes, LZW [Wel84] (or some other run-length compression) would further reduce storage requirements. The compression scheme should be chosen so that any range of offset entries can be uncompressed independently. Each of these BSQ files would then require only a few pages, allowing the entire file to be brought into memory whenever any portion of it is needed, eliminating the need for disk-based access paths.
A bit mask is formed for each selection as follows. The bit mask for the selection, S.gen='M', is just the complement of S.gen (since M has been coded as 0). We denote this fact by mS=Sgen'. Similarly, the selection mask for the Course relation is mC=Cterm and for the Enrollment relations is mE=Eg1 AND Eg2.
mS mC mE
1110 0110 0100
00 10 1001
00
The selection masks are applied and the resulting reduced relations are semijoined to the central fact relation (in this case, E). The query diagram for this approach is a wheel with E at the center and S, C at the rim.
Logically ANDing mE into the E.s and E.c attributes, reduces E.s and E.c as follows.