Vertical Set Square Distance:
A Fast and Scalable Technique to Compute Total Variation in Large Datasets
Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{taufik.abidin, amal.perera, md.serazi, william.perrizo}@ndsu.edu
Abstract
In this paper, we introduce the vertical set square distance (VSSD) technique, designed to efficiently and scalably measure the total variation of a set about a fixed point in large datasets. The set can be any projected subspace of any vector space, including oblique subspaces (not just dimensional subspaces). VSSD can determine the closeness of a point to a set of points in a dataset, which is very useful for classification, clustering and outlier detection tasks. The technique employs a vertical data structure called the Predicate-tree (P-tree) [1]. Performance evaluations based on both synthetic and real-world datasets show that the VSSD technology is fast, accurate and scales well to very large datasets, compared to a similar technique that uses a horizontal record-based data structure.
Keywords
Vertical Set Square Distance, P-trees, Total Variation.
1. INTRODUCTION
The determination of the closeness or distance of a point to another point, or to a set of points, is often required in data mining tasks. For example, in partitioning-based clustering, in which similar points (objects) are grouped together, the closeness of points is used to express similarity (e.g., k-means and k-medoids [2][3]); the same is true for classification tasks [4][6]. Similarly, in outlier detection, the closeness of a point to the other points in a dataset is often used as the basis for determining outliers [5].
One way to measure the closeness of a point to a predefined collection of points is to compute the total variation of that set of points about the point in question. The analysis of total variation can reveal how well or how poorly the point is distinguished from that set of points [3].
In this paper, we introduce a new technique called the vertical set square distance (VSSD) that scalably and accurately computes the total variation of a set of points about a fixed point. The scalability of the technique to very large datasets is our main concern, because data mining tasks frequently deal with large datasets. The VSSD technique employs the P-tree vertical representation as its underlying data structure, which makes it very fast and scalable for computing total variation in large datasets compared to a similar technique that uses horizontal record-based data, as shown in our performance evaluations. The comparison technique, which operates on horizontal records, is called the horizontal set square distance (HSSD) in this paper.
The rest of the paper is organized as follows. Section 2 provides a short review of the P-tree vertical representation. Section 3 presents the vertical set formulas with a simple example. Section 4 discusses the performance evaluations. Section 5 presents conclusions with some directions for future work.
2. VERTICAL REPRESENTATION
Vertical data representation consists of set structures representing the data column-by-column rather than row-by-row (relational data). P-trees are one choice of vertical data representation, which can be used for data mining instead of the more common sets of relational records. P-trees were initially developed for mining spatial data [4][7]. However, since then this vertical data representation has been adapted for mining many types of data [6][8].
P-tree vertical data structures are particularly well suited for data mining without subsampling. The creation of P-trees typically starts by converting a relational table of horizontal records into a set of vertical, compressed P-trees, decomposing each attribute in the table into separate bit vectors (e.g., one for each bit position of a numeric attribute, or one bitmap for each category of a categorical attribute). Such vertical partitioning guarantees that no information is lost.
The complete set of (uncompressed) vertical bit slices (for numeric attributes) or bitmaps (for categorical attributes) will be called the 0-dimensional P-trees for the given table. 1-dimensional P-trees involve a binary-tree compression of those 0-dimensional P-trees, as shown in the next figure. 2-dimensional, 3-dimensional or multi-dimensional P-trees are also possible, depending upon the nature of the data and the compression achieved for the particular dataset. Roughly, the q-dimensional P-tree set, which losslessly represents a tabular dataset, is constructed by recursively partitioning into 2^q pieces and recording the truth of the given predicate for each piece in the appropriate node of a fan-out = 2^q tree. In this paper, we focus on 1-dimensional P-trees, in which each piece is equi-width (recursive halving, i.e., partitioning into 1/2^1-pieces). However, everything done with 1-dimensional P-trees could equally be done with 2-dimensional P-trees (in which each equi-width piece is recursively quartered, i.e., partitioned into 1/2^2-pieces), etc.
For example, let R be a relational table consisting of three numeric attributes, R(A1, A2, A3). Assume binary representation of the values (e.g., (5)_10 in binary is (101)_2). In 0-dimensional P-trees, each bit position is treated as a vertical bit slice and is stored in a separate file. For 1-dimensional Predicate-trees using the universal predicate (purely 1-bits), each bit slice is converted into a compressed tree structure by recursively halving and recording the truth of the predicate in a binary tree. Figure 1 depicts the conversion of a numerical attribute, A1, into 1-dimensional P-trees.
Figure 1. 1-dimensional P-tree of attribute A1.
The P-trees are built from the top down, stopping any branch as soon as purity (either purely 1-bits or purely 0-bits) is reached (giving the compression).
Logical operations AND (∧), OR (∨) and NOT or complement (') are the main operations used to horizontally process these vertical P-trees. The operations are performed level-by-level, starting from the root level. They are associative, commutative and distributive, since they are simply pruned bit-by-bit operations. Refer to [1] for an excellent overview of the P-tree algebra and its logical operations. We note here only that the AND operation, for instance, can be accelerated by observing that any operand P-tree with a pure-0 terminal node at any level forces the result P-tree to have a pure-0 terminal node at the same position, regardless of the other operands. This and other operational speed-ups contribute strongly to the effectiveness of the approach.
The count of 1-bits in the P-tree resulting from these logical operations is called the root count. It can be computed quickly by summing contributions from the bottom up. For example, the root count of P_{1,1} ∧ P_{1,2} is equal to 7, computed as 1·2^0 + 1·2^1 + 1·2^2, since exactly one pure-1 node is found at each of those levels of the resulting tree.
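To make these operations concrete, the following is a minimal Python sketch (our illustration, not the authors' implementation). We assume a simple nested representation in which a node is either the string 'pure0' or 'pure1', or a pair of child nodes; real P-tree implementations are considerably more compact. The sketch builds a 1-dimensional P-tree from a bit slice by recursive halving with purity stopping, ANDs two trees using the pure-0 short-circuit noted above, and computes the root count from the bottom up. The two 8-bit slices in the usage example are illustrative only and are not the slices of Figure 1.

def build_ptree(bits):
    """Build a 1-D P-tree by recursive halving, stopping a branch as soon as
    the segment is purely 1-bits or purely 0-bits (this gives the compression)."""
    if all(b == 1 for b in bits):
        return 'pure1'
    if all(b == 0 for b in bits):
        return 'pure0'
    half = len(bits) // 2
    return (build_ptree(bits[:half]), build_ptree(bits[half:]))

def ptree_and(p, q):
    """Level-by-level AND: a pure-0 operand forces a pure-0 result at the same
    position regardless of the other operand, so that subtree is never visited."""
    if p == 'pure0' or q == 'pure0':
        return 'pure0'
    if p == 'pure1':
        return q
    if q == 'pure1':
        return p
    return (ptree_and(p[0], q[0]), ptree_and(p[1], q[1]))

def root_count(p, width):
    """Count of 1-bits, summed from the bottom up: a pure-1 node covering a
    segment of 'width' bits contributes that power of two to the count."""
    if p == 'pure1':
        return width
    if p == 'pure0':
        return 0
    return root_count(p[0], width // 2) + root_count(p[1], width // 2)

# Usage with two illustrative 8-bit slices:
pa = build_ptree([1, 1, 1, 1, 0, 1, 1, 0])
pb = build_ptree([1, 0, 1, 1, 1, 1, 0, 0])
print(root_count(ptree_and(pa, pb), 8))   # 4 positions where both slices are 1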
3. VERTICAL SET FORMULAS
In this section, we define the vertical set formulas, computed over a training set with n dimensions in which each value is represented in b bits. At the end of the section, the horizontal set square distance formula is also provided.
3.1 Binary Representation
Binary representation is intrinsically a fundamental concept in vertical data structures. Let x be a numeric value of attribute A1. Then the representation of x in b bits is written as:
x = x_{b-1} 2^{b-1} + x_{b-2} 2^{b-2} + \cdots + x_1 2^1 + x_0 2^0 = \sum_{j=0}^{b-1} x_j 2^j

where x_{b-1} and x_0 are the highest and lowest order bits, respectively.
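As a small illustration (our sketch, using the convention that the bit at position j carries weight 2^j, which matches the bit-position numbering used in the rest of this section), the following Python lines decompose a value into its b-bit vector and reconstruct it:

b = 5
x = 9                                       # an example value; 9 = (01001)_2
bits = [(x >> j) & 1 for j in range(b)]     # bits[j] is the bit of weight 2^j
assert x == sum(bits[j] * 2 ** j for j in range(b))
print(bits)                                 # [1, 0, 0, 1, 0], lowest-order bit first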
3.2 Vertical Set Inner Product (VSIP)
Let X be any set of vectors in R(A1, ..., An) with P-tree class mask P_X, where each x = (x_1, x_2, ..., x_n) in X is represented in b bits per attribute, and let a = (a_1, a_2, ..., a_n) be a target vector. Then the vertical set inner product (X ∘ a) is defined as:

(X \circ a) = \sum_{x \in X} x \cdot a = \sum_{i=1}^{n} a_i \sum_{j=0}^{b-1} 2^{j} \, rc(P_X \wedge P_{i,j})

where P_{i,j} denotes the P-tree (bit slice) of attribute A_i at bit position j.
Suppose there is a dataset with three attributes A1, A2 and A3, each with a numerical domain, and another attribute Rank with a categorical domain, as illustrated in table 1. P-tree class masks (bitmaps), PXi, are obtained by creating one vertical bit vector per class, in which bit 1 is assigned to every tuple belonging to that class and bit 0 to all other tuples. Let Rank be the class attribute; since it contains two distinct values, two P-tree class masks are created, one for each value.
Table 1. Training set example.
A1 / A2 / A3 / Rank
9 / 31 / 6 / Low
11 / 20 / 5 / Low
11 / 21 / 4 / High
7 / 23 / 3 / High
7 / 27 / 1 / High
8 / 31 / 0 / High
Subsequently, we convert each numerical domain into a base-two representation with a uniform width of b bits. The maximum width is determined by the largest value in the training set. After that, we create the P-tree class masks, as illustrated in table 2.
Table 2. P-tree class masks of attribute Rank.
A1 / A2 / A3 / PX1 (High) / PX2 (Low)
01001 / 11111 / 00110 / 0 / 1
01011 / 10100 / 00101 / 0 / 1
01011 / 10101 / 00100 / 1 / 0
00111 / 10111 / 00011 / 1 / 0
00111 / 11011 / 00001 / 1 / 0
01000 / 11111 / 00000 / 1 / 0
For any attribute whose values can be represented in fewer than b bits, we pad with leading zeros to obtain a uniform bit width; zero padding is a prerequisite for the formulas to produce correct results. In table 2, attribute A3 has been padded with two additional zero bits to reach a uniform width of 5 bits, because 31, or (11111)_2, is the largest value found in the training set (attribute A2 of tuples 1 and 6).
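As an aside, the encoding step itself is straightforward. The following few Python lines (our sketch, not the authors' code) reproduce the 5-bit encodings and the two class masks of table 2 from the raw values of table 1, including the zero padding of A3:

rows = [(9, 31, 6, 'Low'), (11, 20, 5, 'Low'), (11, 21, 4, 'High'),
        (7, 23, 3, 'High'), (7, 27, 1, 'High'), (8, 31, 0, 'High')]
b = max(max(a1, a2, a3) for a1, a2, a3, _ in rows).bit_length()   # 5, from the value 31
encoded = [tuple(format(v, '0{}b'.format(b)) for v in (a1, a2, a3))
           for a1, a2, a3, _ in rows]                              # zero-padded bit strings
px1 = [int(rank == 'High') for *_, rank in rows]                   # class mask PX1
px2 = [int(rank == 'Low') for *_, rank in rows]                    # class mask PX2
print(encoded[0], px1, px2)   # ('01001', '11111', '00110') [0, 0, 1, 1, 1, 1] [1, 1, 0, 0, 0, 0]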
As mentioned before, a root count is the total number of 1-bits in the P-tree that results from operations on P-tree operands. The root count (rc) of PX1 ∧ P_{1,3}, written rc(PX1 ∧ P_{1,3}), is equal to 2, where PX1 is the P-tree class mask for class High and P_{1,3} is the P-tree of attribute A1 at bit position 3 (the bit of weight 2^3). With a target vector a = (14, 10, 19) = (01110, 01010, 10011) in binary, the vertical set inner product for class High is (X1 ∘ a) = 1,634 and for class Low is (X2 ∘ a) = 999.
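The following short Python sketch reproduces this example. It is our illustration under the assumption that the root counts rc(PX ∧ P_{i,j}) are available; for brevity they are obtained here by counting bits directly in the horizontal table rather than by actual P-tree AND operations, but the VSIP evaluation itself uses only those counts and the target vector.

b, n = 5, 3
rows = [(9, 31, 6), (11, 20, 5), (11, 21, 4), (7, 23, 3), (7, 27, 1), (8, 31, 0)]
px1 = [0, 0, 1, 1, 1, 1]          # class mask for Rank = High
px2 = [1, 1, 0, 0, 0, 0]          # class mask for Rank = Low
a = (14, 10, 19)

def bit(value, j):
    """Bit of 'value' at position j (weight 2^j)."""
    return (value >> j) & 1

def rc(mask, i, j):
    """rc(PX ^ P_{i,j}): number of class tuples whose attribute i has bit j set."""
    return sum(m and bit(r[i], j) for m, r in zip(mask, rows))

def vsip(mask):
    """(X o a) = sum_i a_i * sum_j 2^j * rc(PX ^ P_{i,j})."""
    return sum(a[i] * sum(2 ** j * rc(mask, i, j) for j in range(b))
               for i in range(n))

print(vsip(px1), vsip(px2))       # 1634 999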
3.3 Vertical Set Difference (VSD)
Vertical set difference, denoted (X - a), computes the sum of the vector differences between a set of vectors X and a target vector a. Let x = (x_1, x_2, ..., x_n) be vectors belonging to the same class and let X be any set of such vectors in R(A1, ..., An) with class mask P_X; the vertical set difference is defined as:

(X - a) = \sum_{x \in X} (x - a)

whose i-th component is \sum_{j=0}^{b-1} 2^{j} \, rc(P_X \wedge P_{i,j}) - rc(P_X) \, a_i.
The formula returns a single vector representing the cumulative difference of a set of vectors about a. However, since vectors have direction, differences of opposite sign cancel in the summation, so the result can misrepresent the actual separation. Squaring the differences avoids this cancellation; we describe the resulting formula thoroughly in the next subsection.
3.4 Vertical Set Square Distance (VSSD)
To alleviate the cancellation that occurs when a set of vector differences is summed, due to the existence of direction in a vector, we introduce the vertical set square distance (VSSD). VSSD measures the total variation and the variance of a set of points about a fixed point. The VSSD (total variation) is defined as follows:

(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \cdot (x - a) = T_1 + T_2 + T_3

where

T_1 = \sum_{i=1}^{n} \sum_{j=0}^{b-1} \sum_{l=0}^{b-1} 2^{j+l} \, rc(P_X \wedge P_{i,j} \wedge P_{i,l})

T_2 = -2 \sum_{i=1}^{n} a_i \sum_{j=0}^{b-1} 2^{j} \, rc(P_X \wedge P_{i,j})

T_3 = rc(P_X) \sum_{i=1}^{n} a_i^{2}

The quantity ((X - a) \circ (X - a)) / N, where N refers to the total number of vectors in X, measures the variance of X about a. Notice that N can be computed easily as rc(P_X), the total number of 1-bits in the P-tree class mask of X.
The advantage of VSSD is that the root counts are independent of a, so they can be pre-computed and stored. These root counts are rc(P_X), rc(P_X ∧ P_{i,j}) and rc(P_X ∧ P_{i,j} ∧ P_{i,l}), where P_{i,j} and P_{i,l} are the corresponding bit-slice P-trees of the training set.
We calculate the total variation and variance using the same training example found in table 2 and a target vector a = (14, 10, 19). Using the VSSD, the total variation of X1 about a is T1 + T2 + T3. We get T1 = 2,969, T2 = -3,268 and T3 = 2,628. Therefore, the total variation of X1 about a is 2,329, and the variance of X1 about a is 2,329 / 4 = 582.25.
Similarly, the total variation and the variance of X2 about a are 940 and 470, respectively. Since the variance of X1 about a is greater than that of X2, we conclude that vector a is closer to X2 than to X1.
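A sketch of this computation follows, again ours, and again deriving the root counts by direct bit counting rather than real P-tree ANDs. It pre-computes the three families of root counts once per class, independently of a, and then evaluates T1, T2 and T3 for any target vector from the cached counts alone, reproducing the values above.

b, n = 5, 3
rows = [(9, 31, 6), (11, 20, 5), (11, 21, 4), (7, 23, 3), (7, 27, 1), (8, 31, 0)]
masks = {'High': [0, 0, 1, 1, 1, 1], 'Low': [1, 1, 0, 0, 0, 0]}

def bit(value, j):
    return (value >> j) & 1

def precompute(mask):
    """One-time root counts, independent of the target vector a:
    N = rc(PX), rc1[i][j] = rc(PX ^ P_{i,j}), rc2[i][j][l] = rc(PX ^ P_{i,j} ^ P_{i,l})."""
    N = sum(mask)
    rc1 = [[sum(m and bit(r[i], j) for m, r in zip(mask, rows))
            for j in range(b)] for i in range(n)]
    rc2 = [[[sum(m and bit(r[i], j) and bit(r[i], l) for m, r in zip(mask, rows))
             for l in range(b)] for j in range(b)] for i in range(n)]
    return N, rc1, rc2

def vssd(cached, a):
    """Total variation (X - a) o (X - a) = T1 + T2 + T3, and the variance."""
    N, rc1, rc2 = cached
    T1 = sum(2 ** (j + l) * rc2[i][j][l]
             for i in range(n) for j in range(b) for l in range(b))
    T2 = -2 * sum(a[i] * sum(2 ** j * rc1[i][j] for j in range(b)) for i in range(n))
    T3 = N * sum(a[i] ** 2 for i in range(n))
    total = T1 + T2 + T3
    return total, total / N

cache = {cls: precompute(m) for cls, m in masks.items()}
a = (14, 10, 19)
print(vssd(cache['High'], a))     # (2329, 582.25)
print(vssd(cache['Low'], a))      # (940, 470.0)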
3.5 Horizontal Set Square Distance (HSSD)
Let X be a set of vectors in R(A1, ..., An), let x = (x_1, x_2, ..., x_n) be a vector belonging to class X, and let a = (a_1, a_2, ..., a_n) be a target vector. Then the horizontal set square distance is defined as:

HSSD(X, a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^{2}
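For comparison, a minimal sketch of HSSD (our illustration) is a plain scan over the horizontal records of the class, accumulating squared differences; on the table 1 example it returns the same total variations as the VSSD, 2,329 for class High and 940 for class Low.

def hssd(records, a):
    """Horizontal scan: sum of squared differences between each record and a."""
    return sum(sum((x_i - a_i) ** 2 for x_i, a_i in zip(x, a)) for x in records)

high_rows = [(11, 21, 4), (7, 23, 3), (7, 27, 1), (8, 31, 0)]
low_rows = [(9, 31, 6), (11, 20, 5)]
print(hssd(high_rows, (14, 10, 19)), hssd(low_rows, (14, 10, 19)))   # 2329 940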
4. EXPERIMENTAL RESULTS
This section reports the experiments performed to evaluate the VSSD algorithm. The experiments were conducted using both real and synthetic datasets. The objective was to compare the execution time and scalability of our algorithm, which employs a vertical approach (vertical data structure and horizontal bitwise AND operations), with a horizontal approach (horizontal data structure and vertical scan operation). We report execution time results with respect to dataset size. The performance of both algorithms was observed on machines with different specifications, including an SGI Altix CC-NUMA machine. Table 3 summarizes the machines used for the experiments.
Table 3. The specification of machines used.
Machine / Specification
AMD1GB / AMD Athlon K7 1.4GHz, 1GB RAM
P42GB / Intel P4 2.4GHz processor, 2GB RAM
SGI Altix / SGI Altix CC-NUMA 12 processor shared memory (12 x 4 GB RAM).
4.1 Datasets
The experimental data was generated from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, located at longitude 97°42'18"W; the photographs were taken in 1998. The image contains three bands: red, green, and blue reflectance values. We use the original image of size 1024x1024 pixels (a cardinality of 1,048,576). Corresponding synchronized data for soil moisture, soil nitrate and crop yield were also used in the experimental evaluations. Combining all bands and the synchronized data, we obtained a dataset with 6 dimensions.
Additional datasets of different sizes were synthetically generated from the original dataset to study the timing and scalability of the VSSD technique presented in this paper. Both timing and scalability were evaluated with respect to data size. Because the cardinality of the original dataset is relatively small (1,048,576 records), we super-sampled it with a simple image processing tool to produce five larger datasets, with cardinalities of 2,097,152, 4,194,304 (2048x2048 pixels), 8,388,608, 16,777,216 (4096x4096 pixels) and 25,160,256 (5016x5016 pixels).
4.2 Timing and Scalability Results
The first performance evaluation was done on the P42GB machine (a P4 with 2 GB RAM). We used the synthetic datasets with 4.1 and 8.3 million rows to evaluate the execution time of the algorithms when computing the total variation for 100 different test cases. Datasets larger than 8.3 million rows could not be processed on this machine because HSSD ran out of memory. Figures 2 and 3 depict the execution time comparison between VSSD and HSSD.
Figure 2. Time comparison on 4.1 million rows dataset.
Figure 3. Time comparison on 8.3 million rows dataset.
The figures show that, up to 8.3 million rows, both algorithms scale; however, VSSD is significantly faster than HSSD. It requires only 0.0003 and 0.0004 seconds on average to complete the calculation on the respective datasets, far less than HSSD, which needs 7.3800 and 15.1600 seconds on average. These significant disparities arise because the VSSD algorithm reuses the same root count values that were pre-computed and stored during P-tree creation, no matter how many different test case vectors are fed into the calculation. Referring back to the VSSD formula defined in section 3.4, the test case vector a appears only in the calculation of T2 and T3, and the root count operations rc(PX ∧ P_{i,j}) are independent of it. This allows us to pre-compute those root counts and reuse them repeatedly, regardless of how many different target vectors are used, as long as the dataset and the set of classes remain unchanged.
Notice also that VSSD tends to have a constant execution time even as the dataset size increases, whereas HSSD requires significantly more execution time. One may argue that the pre-computation of the root counts makes this comparison misleading. However, compare the time required for VSSD to load the vertical data structures into memory and perform the one-time root count operations with the time required for HSSD to load the horizontal records into memory, given in table 4. With respect to this loading time, VSSD is comparable to HSSD: loading the horizontal records takes slightly longer than loading the P-trees and computing the root counts. This illustrates the ability of the P-tree data structure to efficiently load and compute the simple counts. These timings were obtained on a P4 with 2 GB of memory.
Table 4. Time for loading and computing root count.
Cardinality of Dataset / VSSD: Root Count Pre-Computation and P-tree Loading (seconds) / HSSD: Horizontal Dataset Loading (seconds)
1,048,576 / 3.900 / 4.974
2,097,152 / 8.620 / 10.470
4,194,304 / 18.690 / 19.914
8,388,608 / 38.450 / 39.646
Our next experiment observed the algorithms' timing and scalability on machines with different specifications. This is especially relevant for HSSD, which is very sensitive to the amount of memory available. This sensitivity was evident when we ran HSSD on the AMD machine with 1 GB of memory: HSSD successfully completed the total variation computation for the datasets with cardinality 1,048,576, 2,097,152 and 4,194,304, but ran out of memory for datasets with cardinality greater than 4.1 million. Similarly, on the P42GB machine, HSSD scaled only up to datasets with cardinality of 8.3 million. HSSD performed better in terms of scalability on the SGI Altix and successfully computed the total variation for all datasets, but it too ran out of memory when trying to load a dataset with more than 25 million rows. However, the timing of HSSD on this machine degrades significantly compared to the timing of HSSD on the P4 with 2 GB RAM, because HSSD does not utilize the full capability of the machine's shared-memory 12-processor parallel architecture; parallelization is beyond the scope of this paper. This machine, with 12 x 4 GB of RAM, was used in the performance study because it was the only machine capable of loading the entire dataset for HSSD for the larger datasets.
On the other hand, our VSSD technique was successful with respect to both time and scalability. There were no memory problems, even when computing the total variation on large datasets with more than 25 million rows; we were able to compute the VSSD for the largest dataset on the smallest machine (AMD1GB). Even faster results are evident when running VSSD on the other, more powerful machines. We report only the average time for VSSD, which was extremely stable at around 0.0003 to 0.0004 seconds even on the smallest machine in terms of compute power and memory. Table 5 presents the average times for the two techniques on the different machines, and figure 4 further illustrates the performance with respect to scalability.