CLASSIFICATION

An important premise of data mining is that databases are rich with hidden information that can be useful for decision making. This information can take many forms and can be used for a variety of decision-making tasks. One important type of decision-making task is pattern recognition. Pattern recognition covers a wide range of problems of practical significance, from speech recognition to the classification of characters in postal addresses. Pattern recognition tasks are of two types: classification and prediction. Given a set of values describing an object or phenomenon, classification assigns it to one of two or more classes. Prediction estimates a value for an unknown variable given the values of some known variables. Here we provide a brief description of the classification task.

1.  Classification model

A classification model is built from available data using the known values of the variables and the known class. The available data is called the training set, and the class value is called the class label. Constructing the classification model is called supervised learning. Then, given previously unseen data about an object or phenomenon whose class label is not known, we use the classification model to determine its class.
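As a concrete illustration of these terms (the one-rule classifier, the attribute names, and the toy data below are assumptions made for this sketch, not part of C4.5 itself), a minimal example of supervised learning:

```python
# Sketch: "train" a one-rule classifier from a labeled training set,
# then use it to classify previously unseen data.
# All names and data here are illustrative assumptions.
from collections import Counter, defaultdict

# Training set: known attribute values plus the known class label.
training_set = [
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
    ({"outlook": "rain", "windy": "false"}, "yes"),
    ({"outlook": "rain", "windy": "true"}, "no"),
]

def train_one_rule(data, attribute):
    """Supervised learning: learn the majority class for each value of one attribute."""
    by_value = defaultdict(Counter)
    for attrs, label in data:
        by_value[attrs[attribute]][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}

model = train_one_rule(training_set, "outlook")

# Classify an unseen case whose class label is not known.
print(model["overcast"])  # -> yes
```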

There are many reasons why we may wish to set up a classification procedure or develop a classification model.

-- Such a procedure may be much faster than humans (e.g., postal code reading).

-- The procedure needs to be unbiased (e.g., credit applications, where humans may have biases).

-- Accuracy (e.g., medical diagnosis).

-- The supervisor may be the verdict of experts, used for developing the classification model.

2.  Classifier Issues

Some issues that must be considered in developing a classifier are listed below:

Accuracy: typically reported as the proportion of correct classifications. However, some errors may be more serious than others, and it may be necessary to control the different error rates separately.

Speed: the speed of classification is important in some applications (e.g., real-time control systems). There is sometimes a tradeoff between accuracy and speed.

Comprehensibility: the classification model should be understandable, especially when humans are involved in decision making (e.g., medical diagnosis).

Time to learn: the time required to learn the classification model from the training data matters in some settings, e.g., on a battlefield where correct and fast classification based on the available data is necessary.

3.  Classification by Decision Tree Induction

A decision tree or classification tree is a tree structure in which the root node is an attribute (or feature). Internal nodes represent attributes, and branches represent attribute values. The leaf nodes are classes. A path in the tree is a traversal from the root to a leaf node and represents a decision predicate.
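This structure can be sketched directly in code (the nested-dict representation, attribute names, and tree below are illustrative assumptions): an internal node maps an attribute to its value branches, and a leaf is simply a class label.

```python
# Sketch of a decision tree as nested dicts (an assumed representation):
# internal node = {attribute: {attribute value: subtree, ...}}, leaf = class label.
tree = {
    "outlook": {                         # root node: an attribute
        "sunny": {"windy": {             # internal node: another attribute
            "true": "no", "false": "yes"}},
        "overcast": "yes",               # leaf node: a class
        "rain": "no",
    }
}

def classify(node, case):
    """Follow one root-to-leaf path (the decision predicate) for a case."""
    while isinstance(node, dict):
        attribute = next(iter(node))             # attribute tested at this node
        node = node[attribute][case[attribute]]  # take the branch for its value
    return node                                  # leaf: the class

print(classify(tree, {"outlook": "sunny", "windy": "false"}))  # -> yes
```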

Steps in developing a decision tree classifier:

i. Start with all data – if all data belong to one class, label it as a leaf node with that class.

ii. If not, use some attribute selection criterion. Suppose we use information gain. Compute the information gain for each attribute and select that attribute which has the highest information gain. This becomes the root node.

iii. Divide the data according to the attribute values of the root node.

iv. Repeat the above for each branch (for the data at that branch) from the root node.

v. Stop the process according to the following rules:

--When all data at a node belong to one class; that class becomes the label of the node.

--When all the attributes have been used, i.e., when no more attributes are available for further partitioning. Use the majority criterion for determining the leaf label; sometimes other criteria may be more appropriate for determining the class label. Since such a node is not pure, the node has a classification error. The sum of such errors is the total classification error of the tree.
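The steps above can be sketched as a recursive procedure using entropy-based information gain (this is an ID3-style sketch, not C4.5's actual implementation; the data and attribute names are assumptions for illustration):

```python
# Sketch of steps i-v: recursive tree induction with information gain.
import math
from collections import Counter

def entropy(rows):
    """Entropy of the class distribution in rows of (attributes, label)."""
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr):
    """Reduction in entropy obtained by splitting rows on attr (step ii)."""
    total = len(rows)
    remainder = 0.0
    for value in {attrs[attr] for attrs, _ in rows}:
        subset = [r for r in rows if r[0][attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

def build_tree(rows, attributes):
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:                    # step v: all data in one class
        return labels[0]
    if not attributes:                           # step v: no attributes left --
        return Counter(labels).most_common(1)[0][0]  # majority criterion
    best = max(attributes, key=lambda a: info_gain(rows, a))  # step ii
    rest = [a for a in attributes if a != best]
    node = {best: {}}
    for value in {attrs[best] for attrs, _ in rows}:          # step iii
        subset = [r for r in rows if r[0][best] == value]
        node[best][value] = build_tree(subset, rest)          # step iv
    return node

data = [
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
    ({"outlook": "rain", "windy": "false"}, "yes"),
    ({"outlook": "rain", "windy": "true"}, "no"),
]
tree = build_tree(data, ["outlook", "windy"])   # step i: start with all data
print(tree)
```

On this toy data, "outlook" has the highest information gain and becomes the root; the "rain" branch is then split further on "windy".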

Classification accuracy

The C4.5 algorithm includes functions to determine the classification accuracy of the constructed tree. The accuracy figures generated by the algorithm are based on the training data, and accuracy can be reported for the training samples as well as for test data.
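Measuring accuracy as the proportion of correct classifications can be sketched as follows (this is a generic illustration with an assumed nested-dict tree and toy data, not C4.5's own accuracy routines):

```python
# Sketch: accuracy = proportion of correctly classified cases,
# evaluated on training data and on held-out test data (data assumed).
def classify(node, case):
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][case[attr]]
    return node

def accuracy(tree, rows):
    """Proportion of rows whose predicted class matches the known label."""
    correct = sum(1 for attrs, label in rows if classify(tree, attrs) == label)
    return correct / len(rows)

tree = {"outlook": {"sunny": "no", "overcast": "yes", "rain": "yes"}}
training = [({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes")]
test = [({"outlook": "overcast"}, "yes"), ({"outlook": "rain"}, "no")]

print(accuracy(tree, training))  # -> 1.0 on the samples the tree was built from
print(accuracy(tree, test))      # -> 0.5 on unseen test data
```

Accuracy on the training data is typically optimistic; test-data accuracy is the more honest estimate.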

4.  Tree size

A decision tree constructed this way may contain too many branches, some of which lead to poor classification accuracy. The C4.5 algorithm has functions for pruning the tree (reducing the total number of nodes in the tree), and it also reports the performance of both the original and the pruned trees.

5.  Classification Rules from Decision Trees

The C4.5 algorithm can also generate classification rules. It re-expresses a classification model as production rules, a format that is often more intelligible than a tree. The left-hand side of a rule tests attribute values, and the right-hand side gives the resulting class (the decision result).
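The idea can be sketched by walking every root-to-leaf path and emitting one rule per path (the nested-dict tree format and rule syntax below are assumptions for illustration; C4.5 additionally simplifies the generated rules):

```python
# Sketch: re-express a nested-dict decision tree (assumed format) as
# production rules -- each root-to-leaf path becomes one IF-THEN rule.
def tree_to_rules(node, conditions=()):
    if not isinstance(node, dict):           # leaf reached: emit one rule
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node}"]
    attribute = next(iter(node))
    rules = []
    for value, subtree in node[attribute].items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = {"outlook": {"sunny": {"windy": {"true": "no", "false": "yes"}},
                    "overcast": "yes",
                    "rain": "no"}}
rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
# e.g. IF outlook = sunny AND windy = true THEN class = no
```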

6.  Key Requirements for the C4.5 Approach (from Quinlan's book)

Attribute value description

The data must be in a flat file; all information about an object must be expressible in terms of a fixed collection of properties or attributes. Each attribute may be discrete or numeric, but every object must have the same attributes; the attributes should not vary from object to object.

Predefined class

The categories to which cases are to be assigned must be prespecified.

Discrete Classes

The classes must be sharply delineated; a case either does or does not belong to a class. For example, C4.5 cannot be used to predict a continuous value such as temperature, nor to assign cases to vaguely bounded categories such as hard, quite hard, flexible, and soft.

Sufficient Data

Inductive generalization proceeds by identifying patterns in the data. The approach fails if robust patterns cannot be distinguished from chance occurrences. Since this differentiation depends on statistical tests, there must be sufficient data for these tests to be effective.

Logical Classification Model:

C4.5 constructs only classifiers that can be expressed as decision trees or sets of production rules.