Bayesian Classification on Spatial Data Streams Using P-Trees

Mohammad Hossain, Amal Shehan Perera and William Perrizo

Computer Science Department, North Dakota State University

Fargo, ND 58105, USA

{Mohammad_Hossain, Amal_Perera, William_Perrizo}@ndsu.nodak.edu

Abstract:

Classification of spatial data can be difficult with existing methods due to the large number and size of spatial data sets. The task becomes even more difficult when we consider continuous spatial data streams. Data streams require a classifier that can be built and rebuilt repeatedly in near real time. This paper presents an approach to deal with this challenge. Our approach uses the Peano Count Tree (P-tree), which provides a lossless, compressed, and data-mining-ready representation (data structure) for spatial data. This data structure can be used in many classification techniques; in this paper we focus on Bayesian classification. A Bayesian classifier is a statistical classifier that uses Bayes' theorem to predict class membership as a conditional probability that a given data sample falls into a particular class. We demonstrate how P-trees can improve the classification of spatial data with a Bayesian classifier, and we introduce the use of information gain calculations with Bayesian classification to improve its accuracy. A P-tree-based Bayesian classifier not only makes classification more effective on spatial data, but also reduces the build time of the classifier considerably. This improvement in build time makes it feasible for use with streaming data.

Keywords: Data Mining, Bayesian Classification, P-tree, Spatial Data

1. Introduction:

Classification is an important data mining technique that predicts the class of a given data sample. Consider a relation, R(k, A1, …, An, C), where k is the key of the relation R, A1, …, An are descriptive attributes, and C is the class label attribute [1]. Given an unclassified data sample (having a value for every attribute except C), a classification technique predicts the C-value for the sample and thus determines its class. In the case of spatial data, the key k in R usually represents a location (or pixel) in a space, and each Ai is a descriptive attribute of the locations. A typical example of such spatial data is an image of the earth collected as a satellite image or aerial photograph. The attributes may be different reflectance bands such as red, green, blue, infrared, near infrared, thermal infrared, etc. [2]. The attributes may also include ground measurements such as yield, soil type, zoning category, or a weather attribute. A classifier may, for example, predict the yield value from the different reflectance band values extracted from a satellite image.

Many classification techniques have been proposed by statisticians and machine learning experts [3]. They include decision trees, neural networks, Bayesian classification, and k-nearest neighbor classification. Bayesian classification is a statistical technique based on Bayes' theorem [4], as follows. Let X be a data sample whose class label is unknown, and let H be the hypothesis that X belongs to class C. P(H|X) is the posterior probability of H given X, and P(H) is the prior probability of H. Then

P(H|X) = P(X|H) P(H) / P(X),

where P(X|H) is the probability of X given H (the likelihood) and P(X) is the prior probability of X.

Bayesian classification uses this theorem in the following way. Each data sample is represented by a feature vector X = (x1, …, xn), giving the measurements made on the sample for attributes A1, …, An, respectively. Given classes C1, …, Cm, the Bayesian classifier predicts that an unknown data sample X (with no class label) belongs to the class Cj having the highest posterior probability conditioned on X, i.e.,

P(Cj|X) > P(Ci|X) for all i ≠ j ... (1)

P(X) is constant for all classes, so it suffices to maximize P(X|Cj)P(Cj). Bayesian classification can be naive or based on Bayesian belief networks. Naive Bayesian classification makes the assumption of 'class conditional independence' to reduce the computational complexity of calculating all the P(X|Cj) values: it assumes that the value of an attribute is independent of the values of all other attributes. Thus,

P(X|Ci) = P(x1|Ci) * … * P(xn|Ci) ... (2)

For categorical attributes, P(xk|Ci) = si,xk / si, where si is the number of training samples in class Ci and si,xk is the number of training samples of class Ci having Ak-value xk.

A Bayesian belief network is a graphical model [5]. It represents dependencies among subsets of attributes in a directed acyclic graph. A table called the conditional probability table (CPT) is constructed, and from the CPT the probabilities needed to estimate P(X|C) are established.

Bayesian classification techniques have some problems. Naive Bayesian classification depends on the assumption of class conditional independence, which is not satisfied in all cases. For belief networks, it is computationally very complex to build the network and to create the CPT [4,5]. It is also time consuming to calculate the probabilities in the traditional way, because the whole database must be scanned again and again. This often makes their use with large spatial data sets and streaming spatial data computationally expensive.

Our approach is to use the P-tree data structure to reduce the computational expense of classifying a new sample. The use of P-trees gives us a simple solution to the problem of scanning the database. Using a simple P-tree algebra, we can determine the number of occurrences of a sample and the number of co-occurrences of a sample and a class label, providing the necessary components of the probability calculation.

2. Bayesian Classification Using P-tree:

In the case of spatial data, the Ai's and C in the relation R(k, A1, …, An, C) can be viewed as different bands (denoted B1, …, Bn in this paper). For example, in a remotely sensed image they may be the red, green, and blue reflectance values of the pixel identified by k. These bands are converted to P-trees and stored in that form.

Most spatial data comes in a format called BSQ, for Band Sequential (or can easily be converted to BSQ). BSQ data has a separate file for each attribute or band. The ordering of the data values within a band is raster ordering with respect to the spatial area represented in the dataset. This ordering is assumed and therefore is not explicitly indicated as a key attribute in each band (bands have just one column). In this paper, we divide each BSQ band into several files, one for each bit position of the data values. We call this format bit Sequential, or bSQ. A Landsat Thematic Mapper satellite image, for example, is in BSQ format with 7 bands, B1, …, B7 (Landsat-7 has 8), and ~40,000,000 8-bit data values. In this case, the bSQ format consists of 56 separate files, B11, …, B78, each containing ~40,000,000 bits. A typical TIFF image aerial digital photograph is in what is called Band Interleaved by Bit (BIP) format, in which there is one file containing ~24,000,000 bits ordered by bit position, then band, and then raster-ordered pixel location. A simple transform can be used to convert TIFF images to bSQ format.
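To make the BSQ-to-bSQ conversion concrete, the sketch below splits one 8-bit BSQ band file into eight bSQ bit files, one per bit position. The file layout, NumPy usage, and naming scheme (Bi1 for the most significant bit of band i) are our own assumptions for illustration, not part of the original format specification.

```python
import numpy as np

def bsq_band_to_bsq_bits(band_path, band_id, out_prefix="B"):
    """Split one 8-bit BSQ band file into 8 bSQ bit files (a sketch).

    band_path -- raw band file, one unsigned byte per pixel in raster order
                 (assumed layout; adjust for real data).
    band_id   -- band number i, used to name the output files Bi1..Bi8.
    """
    band = np.fromfile(band_path, dtype=np.uint8)     # raster-ordered pixel values
    for bit in range(8):                              # bit 0 = most significant bit
        plane = (band >> (7 - bit)) & 1               # extract one bit position
        np.packbits(plane).tofile(f"{out_prefix}{band_id}{bit + 1}.bsq")
```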

We organize each bSQ bit file, Bij, into a tree structure called a Peano Count Tree (P-tree). A P-tree is a quadrant-based tree. The root of a P-tree contains the 1-bit count of the entire bit-band. The next level of the tree contains the 1-bit counts of the four quadrants in raster order. At the next level, each quadrant is partitioned into sub-quadrants and their 1-bit counts, in raster order, constitute the children of the quadrant node. This construction continues recursively down each tree path until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level (the 1-by-1 sub-quadrant level). P-trees are related in various ways to many other data structures in the literature (see the appendix for details).

For example, the P-tree for an 8-row-by-8-column bit-band is shown below.

In this example, 55 is the count of 1's in the entire image; the numbers at the next level, 16, 8, 15, and 16, are the 1-bit counts of the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits, we do not need sub-trees for these two quadrants. This pattern is continued recursively. Recursive raster ordering is called Peano or Z-ordering in the literature, hence the name Peano Count Tree. The recursive process terminates at the leaf level (level 0), where each quadrant is a 1-row-by-1-column quadrant. If we were to expand all sub-trees, including those for the pure quadrants, the leaf sequence would be just the Peano space-filling curve of the original raster image.
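A minimal sketch of the quadrant-recursive construction described above is given below, assuming a square bit-band whose side is a power of two; the class name and in-memory representation are our own illustration rather than the authors' implementation.

```python
import numpy as np

class PTreeNode:
    """One node of a Peano Count Tree: a 1-bit count plus (optionally) four children."""
    def __init__(self, count, children=None):
        self.count = count          # number of 1-bits in this quadrant
        self.children = children    # None for pure quadrants, else 4 sub-quadrant nodes

def build_ptree(bits):
    """Recursively build a P-tree from a 2^k x 2^k numpy array of 0/1 values."""
    count = int(bits.sum())
    n = bits.shape[0]
    if count == 0 or count == n * n or n == 1:
        return PTreeNode(count)     # pure (or 1-by-1) quadrant: no sub-trees needed
    h = n // 2
    quads = [bits[:h, :h], bits[:h, h:], bits[h:, :h], bits[h:, h:]]  # raster order
    return PTreeNode(count, [build_ptree(q) for q in quads])
```

For the 8-by-8 example above, the root node would hold the total 1-bit count (55) and its four children would hold the quadrant counts 16, 8, 15, and 16, with the pure first and last quadrants carrying no sub-trees.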

A P-tree is a lossless data structure that maintains the raster spatial order of the original data. There are many variations of P-trees. P-trees built from bSQ files are called basic P-trees; the symbol Pi,j denotes the P-tree built from band i, bit j. The function COMP(Pi,j) gives the complement of the P-tree Pi,j. We can build a value P-tree, Pi,v, for band i and value v, which is the P-tree for the specific value v in the original band data. The tuple P-tree Px1..xn denotes the P-tree of the tuple (x1, …, xn). The function RootCount gives the number of occurrences of a particular pattern in a P-tree: for a basic P-tree it gives the number of 1's in the tree, while for a value P-tree or tuple P-tree it gives the number of occurrences of that value or tuple. A detailed description of P-trees and related structures is included in the appendix.
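To illustrate the semantics of this P-tree algebra, the following sketch realizes COMP, AND, value P-trees, and RootCount on plain (uncompressed) 0/1 arrays; an actual implementation operates on the compressed trees directly, so the helper names and representation here are assumptions made for clarity.

```python
import numpy as np

def p_comp(p):
    """COMP: complement of a bit P-tree (represented here as a flat 0/1 array)."""
    return 1 - p

def p_and(*ptrees):
    """AND of several bit P-trees."""
    out = ptrees[0]
    for p in ptrees[1:]:
        out = out & p
    return out

def value_ptree(bit_planes, v):
    """Value P-tree P_{i,v}: 1 exactly where band i has value v.

    bit_planes -- list of the band's basic P-trees, most significant bit first.
    """
    nbits = len(bit_planes)
    result = np.ones_like(bit_planes[0])
    for j, plane in enumerate(bit_planes):
        want = (v >> (nbits - 1 - j)) & 1                 # j-th bit of value v
        result = p_and(result, plane if want else p_comp(plane))
    return result

def root_count(p):
    """RootCount: number of 1's in the P-tree."""
    return int(p.sum())
```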

2.1 Calculating the probabilities using P-tree:

We showed earlier that to classify a tuple X = (x1, …, xn) we need to find P(xk|Ci) = si,xk / si, where si is the number of samples in class Ci and si,xk is the number of training samples of class Ci having Ak-value xk. To find the values of si,xk and si we need two value P-trees, Pk,xk (the value P-tree of band k, value xk) and Pc,ci (the value P-tree of the class label band C, value Ci), and

si,xk = RootCount[(Pk,xk) AND (Pc,ci)],

si = RootCount[Pc,ci] ... (3)

In this way we find the values of all the probabilities in (2) and can use (1) to classify a new tuple, X.
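As an illustration of this procedure, a naive Bayesian prediction built from value P-trees might look like the following sketch. It reuses the hypothetical value_ptree, p_and, and root_count helpers from the earlier sketch and is not the authors' implementation.

```python
def naive_bayes_predict(x, band_planes, class_planes, class_values):
    """Predict the class of tuple x = (x1, ..., xn) using equations (1)-(3).

    band_planes  -- band_planes[k] is the list of bit planes for band k+1.
    class_planes -- bit planes of the class label band C.
    class_values -- the distinct class labels C1, ..., Cm.
    """
    s = band_planes[0][0].size                       # total number of samples (pixels)
    best_class, best_score = None, -1.0
    for c in class_values:
        pc = value_ptree(class_planes, c)            # value P-tree for class c
        si = root_count(pc)                          # s_i in equation (3)
        if si == 0:
            continue
        score = si / s                               # P(Ci)
        for k, xk in enumerate(x):
            pk = value_ptree(band_planes[k], xk)     # value P-tree P_{k,xk}
            si_xk = root_count(p_and(pk, pc))        # s_i,xk in equation (3)
            score *= si_xk / si                      # P(xk|Ci)
        if score > best_score:                       # maximize P(X|Ci)P(Ci), as in (1)
            best_class, best_score = c, score
    return best_class
```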

To improve the accuracy of the classifier, we can instead find the value in (2) directly by calculating the tuple P-tree Px1..xn (the tuple P-tree for values x1, …, xn), which is simply

(P1,x1) AND … AND (Pn,xn)

Now P(X|Ci) = P(x1, …, xn|Ci) = si,x1..xn / si, where

si,x1..xn = RootCount[(Px1..xn) AND (Pc,ci)] ... (4)

By doing this we are finding the probability of occurrence of the whole tuple in the data sample, so we do not need to be concerned with the interdependence of the different bands (attributes). In this way, we can find the value of P(X|Ci) without the naive assumption of class conditional independence, which improves the accuracy of the classifier. Thus we keep the simplicity of the naive Bayesian approach while obtaining higher accuracy.
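Under the same assumptions as the previous sketches, the tuple P-tree variant of equation (4) replaces the product of per-band probabilities with a single AND across all bands:

```python
def tuple_ptree_predict(x, band_planes, class_planes, class_values):
    """Predict the class of x using the tuple P-tree P_{x1..xn} (equation (4))."""
    # One AND across all bands gives the tuple P-tree; no independence assumption needed.
    ptuple = p_and(*[value_ptree(band_planes[k], xk) for k, xk in enumerate(x)])
    s = class_planes[0].size                          # total number of samples
    best_class, best_score = None, -1.0
    for c in class_values:
        pc = value_ptree(class_planes, c)
        si = root_count(pc)
        if si == 0:
            continue
        si_x = root_count(p_and(ptuple, pc))          # s_i,x1..xn in equation (4)
        score = (si_x / si) * (si / s)                # P(X|Ci) * P(Ci)
        if score > best_score:
            best_class, best_score = c, score
    # If the tuple never occurs, every score is zero; the fallback based on
    # information gain (discussed below) then applies.
    return best_class
```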

One problem with this approach is that if the tuple X that we want to classify is not present in our data set, the root count of the tuple P-tree (TP-tree) TPx1..xn will be zero, and the value of P(X|Ci) will be zero as well. In that case we will not be able to classify the tuple. To deal with this problem we introduce the information gain of the attributes into our classification.

2.2 Introduction of information gain.

Not all attributes carry the same amount of information useful for the classification task. For example, in the case of a bank loan, the attribute INCOME might carry more information than the attribute AGE of the client. This significance can be measured mathematically by calculating the information gain of an attribute.

2.2.1 Calculation of Information Gain:

In a data sample, let the class label attribute C have m different values or classes, Ci, i = 1, …, m, and let si be the number of samples in class Ci.

Information needed to classify a given sample is [4]:

I(s1, …, sm) = -SUM(i=1..m)[ pi * log2(pi) ]

where pi = si/s is the probability that a sample belongs to Ci and s is the total number of samples.

Let attribute A have v distinct values, {a1, …, av}. The entropy, or expected information based on the partitioning into subsets by A, is:

E(A) = SUM(j=1..v)[ (SUM(i=1..m)[sij] / s) * I(s1j, …, smj) ]

where sij is the number of samples of class Ci in the subset of samples having A-value aj.

Now the information gain can be calculated by:

Gain(A) = I(s1, …, sm) - E(A)

(expected reduction of entropy caused by knowing the values of A)
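Since every quantity in the information gain formula is a count, it too can be obtained from RootCount applied to ANDed value P-trees. The sketch below computes Gain(A) this way, again reusing the hypothetical helpers introduced in the earlier sketches.

```python
import math

def entropy(counts):
    """I(s1, ..., sm) = -sum pi*log2(pi) over the class counts."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def information_gain(attr_planes, attr_values, class_planes, class_values):
    """Gain(A) = I(s1..sm) - E(A), with all counts taken via RootCount."""
    class_ptrees = [value_ptree(class_planes, c) for c in class_values]
    s_i = [root_count(p) for p in class_ptrees]              # class counts s1..sm
    s = sum(s_i)
    e_a = 0.0
    for a in attr_values:
        pa = value_ptree(attr_planes, a)                     # value P-tree for A = aj
        s_ij = [root_count(p_and(pa, pc)) for pc in class_ptrees]
        if sum(s_ij) > 0:
            e_a += (sum(s_ij) / s) * entropy(s_ij)           # weighted subset entropy
    return entropy(s_i) - e_a
```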

2.2.2 Use of information gain to handle the situation where the TP-tree RootCount = 0

Returning to the problem in our classification technique: when the root count of TPx1..xn is zero, we can form a smaller TP-tree. We form TPx1..xi..xn with i = 1, …, n but i ≠ k, where band k has lower information gain than any other band. Then (2) becomes

P(X|Ci) = P(x1, …, xk-1, xk+1, …, xn | Ci) * P(xk|Ci) ... (5)

P(x1, …, xk-1, xk+1, …, xn | Ci) can be calculated by (4) without using band k, and P(xk|Ci) can be calculated by (3). If the root count of the reduced TP-tree is still zero, we remove the band with the second lowest information gain and proceed in the same way. The general equation of (5) is