Derive High Confidence Rules for Spatial Data using Tuple Count Cube
William Perrizo1, Qin Ding1, Qiang Ding1, and Amalendu Roy1
1 Department of Computer Science, North Dakota State University,
Fargo, ND 58105-5164, USA
{William_Perrizo, Qin_Ding, Qiang_Ding, Amalendu_Roy}@ndsu.nodak.edu
Abstract. The traditional task of association rule mining is to find all rules with high support and high confidence. In some applications, such as mining spatial datasets for natural resources, the task is to find high confidence rules even though their supports may be low. In still other applications, such as the identification of agricultural pest infestations, the task is to find high confidence rules preferably while the support is still very low. The basic Apriori algorithm cannot be used to solve these problems efficiently. In this paper, we propose a new model to derive high confidence rules for spatial data. A new data structure, the Peano Count Tree (PC-tree), is used in our model to represent all the information we need. PC-trees represent spatial data bit-by-bit in a recursive quadrant-by-quadrant arrangement. Based on the PC-tree, we build a special data cube, the Tuple Count Cube (TC-cube), to derive high confidence rules. Our algorithm for deriving confident rules is fast and efficient. In addition, we discuss some strategies for avoiding over-fitting (removing redundant and misleading rules).
1 Introduction
Association rule mining [1,2,3,4,5], proposed by Agrawal, Imielinski and Swami in 1993, is one of the important tasks of data mining. The original application of association rule mining is on market basket data. A typical example is “customers who purchase one item are very likely to purchase another item at the same time”. There are two accuracy measures, support and confidence, for each rule. The problem of association rule mining is to find all the rules with support and confidence exceeding some user specified thresholds. The basic algorithms, such as Apriori [1] and DHP [4], use the downward closure property of support to find frequent itemsets, whose supports are above the threshold. After obtaining all frequent itemsets, which is very time consuming, high confidence rules are derived in a very straightforward way.
However, in some applications, such as spatial data mining, we are also interested in rules with high confidence that do not necessarily have high support. In still other applications, such as the identification of agricultural pest infestations, the task is to find high confidence rules preferably while the support is still very low. In these cases, the traditional algorithms are not suitable. One may think that we can simply set the minimal support to a very low value, so that high confidence rules with almost no support limit can be derived. However, this will lead to a huge number of frequent itemsets, and is, thus, impractical.
In this paper, we propose a new model, including new data structures and algorithms, to derive “confident” rules (high confidence only rules), especially for spatial data. We use a data structure, called the Peano Count Tree (PC-tree), to store all the information we need. A PC-tree is a quadrant based count tree. From the PC-trees, we build a data cube, the Tuple Count Cube or TC-cube which exposes the confident rules. We also use the attribute precision concept hierarchies and a natural rule ranking to prune the complexity of our data mining algorithm.
The rest of the paper is organized as follows. In section 2, we provide some background on spatial data. In section 3, we describe the data structures we use for association rule mining, including PC-trees and TC-cubes. In section 4, we detail our algorithms for deriving confident rules. Performance analysis and implementation issues are given in section 5, followed by related work in section 6. Finally, the conclusion is given.
2 Formats of Spatial Data
There are huge amounts of spatial data on which we can perform data mining to obtain useful information[16]. Spatial data are collected in different ways and are organized in different formats. BSQ, BIL and BIP are three typical formats.
An image contains several bands. For example, TM6 (Thermatic Mapper) scene contains 6 bands, while TM7 scene contains 7 bands, including Blue, Green, Red, NIR, MIR, TIR, MIR2, each of which contains reflectance values in the range, 0~255.
An image can be organized into a relational table in which each pixel is a tuple and each spectral band is an attribute. The primary key can be latitude and longitude pairs which uniquely identify the pixels.
BSQ (Band Sequential) is a similar format, in which each band is stored as a separate file. Raster order is used for each individual band. TM scenes are in BSQ format. BIL (Band Interleaved by Line) is another format in which all the bands are organized in one file and bands are interleaved by row (the first row of all bands are followed by the second row of all bands, and so on). For example, SPOT data from French satellites are in BIL format. In the BIP (Band Interleaved by Pixel) format, there is also just one file in which the first pixel-value of the first band is followed by the first pixel-value of the second band, ..., the first pixel-value of the last band, followed by the second pixel-value of the first band, and so on. For example, TIFF images are in BIP format. Fig. 1 gives an example of using BSQ, BIL and BIP formats.
In this paper, we propose a new format, called bSQ (bit Sequential), to organize images. The reflectance values of each band range from 0 to 255, represented as 8 bits. We split each band into a separate file for each bit position. Fig. 1 also gives an example of bSQ format.
There are several reasons to use the bSQ format. First, different bits have different degrees of contribution to the value. In some applications, we do not need all the bits because the high order bits give us enough information. Second, the bSQ format facilitates the representation of a precision hierarchy. Third, and most importantly, bSQ format facilitates the creation of an efficient, rich data structure, the PC-tree, and accomodates algorithm pruning based on a one-bit-at-a-time approach.
We give a very simple illustrative example with only 2 data bands for a scene having only 2 rows and 2 columns (both decimal and binary representation are shown).
Fig. 1. Two bands of a 2-row-2-column image and its BSQ, BIP, BIL and bSQ formats
3 Data Structures
3.1 Basic PC-trees
We organize each bit file in the bSQ format into a tree structure, called a Peano Count Tree (PC-tree). A PC-tree is a quadrant based tree. The idea is to recursively divide the entire image into quadrants and record the count of 1-bits for each quadrant, thus forming a quadrant count tree. PC-trees are somewhat similar in construction to other data structures in the literature (e.g., Quadtrees[10] and HHcodes [14]).
For example, given an 8-row-8-column image, the PC-tree is as shown in Fig. 2.
Fig. 2. 8*8 image and its PC-tree (PC-tree and PM-tree)
In this example, 55 is the number of 1’s in the entire image. This root level is labeled as level 0. The numbers at the next level (level 1), 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrant are composed entirely of 1-bits (called a “pure 1 quadrant”), we do not need subtrees for these two quadrants, so these branches terminate. Similarly, quadrants composed entirely of 0-bits are called “pure 0 quadrants” which also terminate these tree branches. This pattern is continued recursively using the Peano or Z-ordering of the four subquadrants at each new level. Every branch terminates eventually (at the “leaf” level, each quadrant is a pure quadrant). If we were to expand all subtrees, including those for pure quadrants, then the leaf sequence is just the Peano-ordering (or, Z-ordering) of the original raster image. Thus, we use the name Peano Count Tree.
We note that, the fan-out of the PC-tree need not be limited to 4. It can be any power of 4 (effectively skipping that number of levels in the tree). Also, the fanout at any one level need not coincide with the fanout at another level. The fanout pattern can be chosen to produce maximum compression for each bSQ file.
For each band (assuming 8-bit data values), we get 8 basic PC-trees, one for each bit position. For band B1, we will label the basic PC-trees, P11, P12, …, P18. Pij is a lossless representation of the jth bits of the values from the ith band. In addition, Pij provides the 1-bit count for every quadrant of every dimension. Finally, we note that these PC-trees can be generated quite quickly and can be viewed as a “data mining ready”, lossless format for storing spatial data.
The 8 basic PC-Trees defined above can be combined using simple logical operations (AND, NOT, OR, COMPLEMENT) to produce PC-Trees for the original values in a band (at any level of precision, 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band, b, and value, v, where v can be expressed in 1-bit, 2-bit,.., or 8-bit precision. Pb,v is called a value PC-tree. Using the full 8-bit precision (all 8 –bits) for values, value PC-tree Pb,11010011 can be constructed from the basic PC-trees by ANDing basic PC-trees (for each 1-bit) and their complements (for each 0 bit):
PCb,11010011 = PCb1 AND PCb2 AND PCb3’ AND PCb4 AND PCb5’ AND PCb6’ AND PCb7 AND PCb8
where ‘ indicates the bit-complement (which is simply the PC-tree with each count replaced by its count complement in each quadrant).
From value PC-trees, we can construct tuple PC-trees. Tuple PC-tree for tuple (v1,v2,…,vn), denoted PC (v1, v2, …,vn), is:
PC(v1,v2,…,vn) = PC1,v1 AND PC2,v2 AND … AND PCn,vn
where n is the total number of bands.
Fig. 3. Basic PC-trees, Value PC-trees (for 3-bit values) and Tuple PC-trees
The AND operation is simply the pixel-wise AND of the bits. Before going further, we note that the process of converting the BSQ data for a TM satellite image (approximately 60 million pixels) to its basic PC-trees can be done in just a few seconds using a high performance PC computer. This is a one-time process. We also note that we are storing the basic PC-trees in a “breadth-first” data structure which specifies the pure-1 quadrants only. Using this data structure, each AND can be completed in a few milliseconds and the result counts can be accumulated easily once the AND and COMPLEMENT program has completed.
3.2 Variations of PC-tree
In order to optimize the AND operation, we use a variation of the PC-tree, called PM-tree (Pure Mask tree). In the PM-tree, we use a 3-value logic to represent pure-1, pure-0 and mixed quadrant. To simplify the exposition, we use 1 for pure 1, 0 for pure 0, and m for mixed quadrants. Thus, the PM-tree for the previous example is also given in Fig. 2.
The PM-tree specifies the location of the pure-1 quadrants of the operands, so that the pure-1 quadrants of the AND result can be easily identified by the coincidence of pure-1 quadrants in both operands and pure-0 quadrants of the AND result occur wherever a pure-0 quadrant occurs on at least one of the operands.
3.3 Value Concept Hierarchy
Using bSQ format, we can easily represent the value concept hierarchy of spatial data. For example, for band n, we can use from 1 bit up to 8 bits to represent the reflectances (Fig. 4).
[0,0] [1,1] ------1 bit
( 0~127 ) (128~255)
[00,01) [01,10) [10,11) [11,11] ------2 bits
(0~63) (64~127) (128~191) (192~255)
[000, [001, [010, [011, [100, [101, [110, [111, ------3 bits
001) 010) 011) 100) 101) 110) 111) 111]
(0~31) (32~63) (64~95) (96~127) (128~159) (160~191) (192~223) (224~255)
Fig. 4. Value Concept Hierarchy
3.4Tuple Count Cube
For most spatial data mining, the root counts of the tuple PC-trees (e.g., PC(v1,v2,…,vn) = PC1,v1 AND PC2,v2 AND … AND PCn,vn), are the numbers required, since root counts tell us exactly the number of occurrences of that particular pattern over the space in question. These root counts can be inserted into a data cube, called the Tuple Count cube (TC-cube) of the spatial dataset. Each band corresponds to a dimension of the cube, the band values labeling that dimension. The TC-cube cell at location, (v1,v2,…,vn), contains the root count of PC(v1,v2,…,vn). For example, assuming just 3 bands, the (v1,v2,v3)th cell of the TC-cube contains the root count of PC(v1,v2,v3) = PC1,v1 AND PC2,v2 AND PC3,v3. The cube can be contracted or expanded by going up [down] in the value concept hierarchy.
4 Confident Rule Mining Algorithm
4.1 PC-tree ANDing algorithm
We begin this section with a description of the AND algorithm. This algorithm is used to compose the value PC-trees and to populate the TC-cube. The approach is to store only the basic PC-trees and then generate value PC-tree root counts “on-the-fly” when needed (in Section 5 we show this can be done in about 50ms). In this algorithm we will assume the PC-tree is coded in its most compact form, a depth-first ordering of the paths to each pure-1 quadrant.
Let’s look at the operand 1 first (Fig. 5). Each path is represented by the sequence of quadrants in Peano order, beginning just below the root. Therefore, the depth-first pure1 path code for this example is: 0 100 101 102 12 132 20 21 220 221 223 23 3 (0 indicates the entire level 1 upper left quadrant is pure 1s, 100 indicates the level 3 quadrant arrived at along the branch through node 1 (2nd node) of level 1, node 0 (1st node) of level 2 and node 0 of level 3, etc.). We will take the second operand (Fig. 5), with depth-first pure1 path code: 0 20 21 22 231. Since a quadrant will be pure 1’s in the result only if it is pure1’s in both operands (or all operands, in the case there are more than 2), the AND is done by: scan the operands; output matching pure1 paths. Therefore we get the result (Fig. 5).
Fig. 5. Operand 1, Operand 2, AND Result and AND Process
The pseudo code for the ANDing algorithm is given below.
Fig. 6. PC-tree ANDing algorithm
4.2Mining Confident Rules from Spatial Data Using TC-cubes
In this section a TC-cube based method for mining non-redundant, low-support, high-confidence rules is introduced. Such rules will be called confident rules. The main interest is in rules with low support, which are important for many application areas such as, natural resource searches, agriculture pest infestations identification, etc. However, a small positive support threshold is set, in order to eliminate rules that result from noise and outliers (similar to [7], [8] and [15]). A high threshold for confidence is set in order to find only the most confident rules.
To eliminate redundant rules resulting from over-fitting, an algorithm similar to the one introduced in [8] is used. In [8] rules are ranked based on confidence, support, rule-size and data-value ordering, respectively. Rules are compared with their generalizations for redundancy before they are included in the set of confident rules. In this paper, we use a similar rank definition, except that we do not use support level and data-value ordering. Since support level is expected to be very low in many spatial applications, and since we set a minimum support only to eliminate rules resulting from noise, it is not used in rule ranking. Rules are declared redundant only if they are outranked by a generalization. We choose not to eliminate a rule which is outranked only by virtue the specific data values involved.
A rule, r, ranks higher than rule, r', if confidence[r] > confidence[r'], or if confidence[r] = confidence[r'] and the number of attributes in the antecedent of r is less than the number in the antecedent of r'.
A rule, r, generalizes a rule, r’, if they have the same consequent and the antecedent of r is properly contained in the antecedent of r’. The algorithm for mining confident rules from spatial data is given in Fig. 7.
Fig. 7. Algorithm for mining confident rules
The following example contains 3 bands of 3-bit spatial data in bSQ format.
Fig. 8. 88 image data and its PM-trees
Assume minimum confidence threshold of 80% and minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. The TC-cube values (root counts from the PC-trees) are given in Fig. 9, while the rolled-up sums and confidence thresholds are given in Fig. 10.
Fig. 9. TC-cube for band 1 and band 2 Fig. 10. Rolled-up sums and confidence thresholds
All sums are at least 10% support (6.4). There is one confident rule:
C: B1={0} => B2={0} with confidence = 83.3%
Continue with 1-bit values and the 2 bands, B1 and B3, we can get the following TC-cube with rolled-up sums and confidence thresholds (Fig. 11). There are no new confident rules. Similarly, the 1-bit TC-cube for band B2 and B3 can be constructed (Fig. 12).
Fig. 11. TC-cube for band 1 and band 3 Fig. 12. TC-cube for band 2 and band 3
All sums are at least 10% of 64 (6.4), thus, all rules will have enough support. There are two confident rule, B2={1} => B3={0} with confidence = 100% and B3={1} => B2={0} with confidence = 100%. Thus,
C: B1={0} => B2={0} c = 83.3%
B2={1} => B3={0} c = 100%
B3={1} => B2={0} c = 100%
Next consider 1-bit values and bands, B1, B2 and B3. The counts, sums and confidence thresholds are given in Fig. 13:
