Assignment 3

Overview

The purpose of this assignment is to carry out some classification exercises. Classification is the process of associating an instance with a pre-existing category or class. Classification has two phases: training and testing. The training phase requires a set of instances that are already labelled with the correct class, and builds a model representing the relationship between the instances and their classes. The testing phase uses this model to assign a class to instances whose class is unknown. The classification algorithms provided with Weka include NaiveBayes, IBk, J48, and Decision Table. In this assignment you will compare these methods. First, a brief reminder of how each method works.
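If you are curious how the two phases look in code, here is a minimal sketch using Weka's Java API (not required for the assignment; the iris.arff path is an assumption): buildClassifier is the training phase, classifyInstance is the testing phase.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainThenTest {
        public static void main(String[] args) throws Exception {
            // Training phase: load labelled instances and build a model from them.
            Instances labelled = DataSource.read("iris.arff");   // path is an assumption
            labelled.setClassIndex(labelled.numAttributes() - 1); // last attribute is the class
            J48 model = new J48();
            model.buildClassifier(labelled);

            // Testing phase: ask the model for the class of an instance
            // (here the first stored instance, treated as if its class were unknown).
            double predicted = model.classifyInstance(labelled.instance(0));
            System.out.println(labelled.classAttribute().value((int) predicted));
        }
    }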

Decision tree using C4.5

The idea of this algorithm is to build a decision tree by splitting the dataset into smaller and smaller subsets. You can then classify an instance by following the tree from the root until you reach a leaf, which tells you the class. Each split should be the one that makes the most progress towards correctly classifying the data instances; this is measured by the amount of information gained by making the split. Refer to section 4.3, page 97 of the textbook.
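As a concrete illustration of the split criterion, the short sketch below computes the information gain of one hypothetical split (the class counts are made up for illustration; entropy is measured in bits):

    public class InfoGain {
        // Entropy of a class distribution given as counts, in bits.
        static double entropy(int... counts) {
            int total = 0;
            for (int c : counts) total += c;
            double h = 0.0;
            for (int c : counts) {
                if (c == 0) continue;
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        public static void main(String[] args) {
            // Hypothetical parent node: 9 instances of class A, 5 of class B.
            double before = entropy(9, 5);
            // Hypothetical split into two children: (6 A, 2 B) and (3 A, 3 B).
            double after = (8.0 / 14) * entropy(6, 2) + (6.0 / 14) * entropy(3, 3);
            // Information gained by the split = entropy before - weighted entropy after.
            System.out.printf("gain = %.3f bits%n", before - after);
        }
    }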

Decision Table

The Decision Table method is not covered in the textbook; you do not need to know the details of this algorithm.

Naïve Bayes

This is probably the simplest classifier algorithm. The training phase consists of estimating conditional probabilities from the example instances. Refer to section 4.2, page 88 of the textbook.
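To make the idea concrete, the sketch below uses made-up training counts for a hypothetical two-class problem, estimates the class priors and conditional probabilities from them, and combines them with Bayes' rule under the independence assumption:

    public class NaiveBayesSketch {
        public static void main(String[] args) {
            // Hypothetical training counts for a two-class problem with two nominal attributes.
            // P(class) and P(attribute value | class) are estimated from these counts.
            double pYes = 9.0 / 14, pNo = 5.0 / 14;                   // class priors
            double pSunnyGivenYes = 2.0 / 9, pSunnyGivenNo = 3.0 / 5; // P(outlook=sunny | class)
            double pHighGivenYes  = 3.0 / 9, pHighGivenNo  = 4.0 / 5; // P(humidity=high | class)

            // Testing phase: multiply the prior by the conditional probabilities of the
            // observed attribute values (the "naive" independence assumption).
            double scoreYes = pYes * pSunnyGivenYes * pHighGivenYes;
            double scoreNo  = pNo  * pSunnyGivenNo  * pHighGivenNo;

            // Normalise so the scores sum to 1 and pick the larger one.
            double pYesGivenX = scoreYes / (scoreYes + scoreNo);
            System.out.printf("P(yes | sunny, high) = %.3f -> predict %s%n",
                    pYesGivenX, scoreYes > scoreNo ? "yes" : "no");
        }
    }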

Nearest Neighbours (IB1)

This algorithm has no real training phase; the example instances are simply stored. What is the main advantage and the main disadvantage of doing this? In the testing phase, an unclassified instance is compared to all the stored instances until the nearest one is found, and the new instance is assigned to the same class as that nearest neighbour. A variation takes a vote over more than one neighbour (possibly weighted by distance), but in this exercise we will use just 1 neighbour. Refer to section 4.7, page 128 of the textbook.
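The sketch below shows the whole method for numeric attributes, assuming Euclidean distance and made-up instances: storing the examples is the only "training", and classification is a scan for the closest stored instance.

    public class OneNearestNeighbour {
        // Squared Euclidean distance between two numeric instances.
        static double dist2(double[] a, double[] b) {
            double d = 0.0;
            for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
            return d;
        }

        // "Training" is just keeping the stored instances; classification scans them all.
        static String classify(double[][] stored, String[] labels, double[] query) {
            int best = 0;
            for (int i = 1; i < stored.length; i++) {
                if (dist2(stored[i], query) < dist2(stored[best], query)) best = i;
            }
            return labels[best];
        }

        public static void main(String[] args) {
            // Made-up two-attribute instances and labels for illustration.
            double[][] stored = { {1.0, 1.0}, {1.2, 0.9}, {5.0, 5.1}, {4.8, 5.3} };
            String[] labels   = { "a", "a", "b", "b" };
            System.out.println(classify(stored, labels, new double[] {4.9, 5.0}));  // prints "b"
        }
    }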

Questions

Algorithm  | Kappa | Accuracy
J48        |       |
Dec.Table  |       |
NaiveBayes |       |
IB1        |       |
  1. Record the value of the Kappa statistic and the Accuracy for each algorithm in the table above.
  2. What conclusions can be drawn from comparing the confusion matrices produced by each algorithm?
  3. Record the Precision and Recall for each algorithm in the table below.

Algorithm  | Precision | Recall
J48        |           |
Dec.Table  |           |
NaiveBayes |           |
IB1        |           |
  4. Which algorithm best describes the structure of the data? Why?
  5. Can you understand that description?

Instructions

·  Start the Weka Explorer.

·  Load the iris.arff file using the ‘preprocess’ tab.

·  Switch tabs to ‘classify’ and, using the ‘choose’ button, select:

o  For the C4.5 Decision Tree: the J48 algorithm under ‘trees’

o  For the Decision Table: the Decision Table algorithm under ‘rules’

o  For Naïve Bayes: the NaiveBayes algorithm under ‘bayes’

o  For Nearest Neighbour: the IBk algorithm under ‘lazy’ (leave the number of neighbours, KNN, at its default of 1 so that it behaves as IB1, which the questions refer to)

·  Set the ‘test options’ to ‘Percentage split’ (the default 66% split is fine)

·  Make sure that ‘class’ is the attribute that shows below the test options panel.

·  Hit the ‘start’ button; the large output pane should display the model and the evaluation results for the run.
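If you would rather script the comparison than use the GUI, the sketch below reproduces the experiment with Weka's Java API (the iris.arff path is an assumption, and the 66% split matches the Explorer's default percentage split):

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.rules.DecisionTable;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");   // path is an assumption
            data.setClassIndex(data.numAttributes() - 1);

            // 66% training / 34% testing, as in the Explorer's default percentage split.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            Classifier[] classifiers = { new J48(), new DecisionTable(), new NaiveBayes(), new IBk(1) };
            for (Classifier c : classifiers) {
                c.buildClassifier(train);                 // training phase
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(c, test);              // testing phase
                System.out.println(c.getClass().getSimpleName());
                System.out.printf("  kappa = %.3f, accuracy = %.2f%%%n", eval.kappa(), eval.pctCorrect());
                // Per-class precision/recall; class 0 is the first class value.
                System.out.printf("  precision(0) = %.3f, recall(0) = %.3f%n",
                        eval.precision(0), eval.recall(0));
                System.out.println(eval.toMatrixString("  Confusion matrix:"));
            }
        }
    }

The Kappa statistic, Accuracy, Precision, Recall, and confusion matrix it prints are the same quantities asked for in the Questions section.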

Grading

Each question is of equal value.