Assignment #3

CSc 4740/6740 Data Mining

(email your solution to David Bolding [,

email subject is “Data Mining Assignment #3”)

Due 3/27/2013 (Wednesday)

For this assignment, students will be responsible for preparing datasets and training SVM classifiers.

Students may use any SVM implementation they like; that said, students may wish to consider libsvm, a freely available and easy-to-use SVM implementation. Whichever SVM implementation the student selects, the student will be responsible for preparing a dataset in the appropriate input format, for selecting a training set and a verification set from that dataset, for training a Support Vector Machine, and for verifying its performance on unseen data.
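As a rough illustration of what "the appropriate input format" can look like: libsvm's command-line tools read plain-text files with one example per line, written as a label followed by index:value pairs whose indices start at 1. The sketch below assumes Python and that text format. The helper names to_libsvm_line and write_libsvm_file are hypothetical, and other SVM packages will expect other formats.

```python
def to_libsvm_line(label, features):
    """Format one example as a libsvm text line: '<label> 1:<x1> 2:<x2> ...'."""
    # libsvm feature indices are 1-based, so enumeration starts at 1.
    pairs = " ".join(f"{i}:{value!r}" for i, value in enumerate(features, start=1))
    return f"{label} {pairs}"


def write_libsvm_file(path, examples):
    """Write an iterable of (label, features) pairs to `path`, one per line."""
    with open(path, "w") as handle:
        for label, features in examples:
            handle.write(to_libsvm_line(label, features) + "\n")
```

For example, to_libsvm_line(1, [0.25, 0.5, 0.75]) produces the line "1 1:0.25 2:0.5 3:0.75". Numeric labels such as +1 and -1 are one common encoding for a two-class problem; the mapping from class names to numbers is the student's choice and is worth stating in the write-up.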

The student should submit a zip-file[1] containing a write-up, any code that was used and the resulting SVM inputs. Students are encouraged to include any other supporting material that helps show their work, such as sample program outputs or intermediate data files.

The included write-up should briefly describe the process by which the student produced their data and trained their SVM, and should present their classification results. Students should note and explain any assumptions or engineering decisions that were made, such as what standard the student elected to use for “good” or “high” accuracy.

For undergraduate students:

1. (100 points) 3-D Classification

For this assignment, you will be responsible for creating a dataset representing points inside a shape in a 3-D space, and then training a Support Vector Machine to recognize points inside the shape.

Consider a unit cube: a cube in 3-D space with one corner at the origin, extending in the +x, +y and +z directions, with sides of length 1. This will be the volume within which we place points.

Begin by creating a dataset consisting of 200 random points in the cube. For each point, if it is in the bottom half of the cube, give it the label “bottom”; otherwise (it is in the top half of the cube), give it the label “top”.
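One possible way to generate such a dataset is sketched below in Python, using only the standard library. It rests on two assumptions that the assignment does not state: the points are drawn uniformly at random, and "bottom half" means z < 0.5 (that is, z is taken as the vertical axis). The function name make_cube_dataset and the fixed seed are illustrative choices.

```python
import random


def make_cube_dataset(n_points=200, seed=0):
    """Return a list of (label, (x, y, z)) pairs with points in the unit cube."""
    rng = random.Random(seed)  # fixed seed so the dataset can be regenerated
    dataset = []
    for _ in range(n_points):
        # Each coordinate is drawn uniformly from [0, 1).
        point = (rng.random(), rng.random(), rng.random())
        # ASSUMPTION: z is the vertical axis, so "bottom" means z < 0.5.
        label = "bottom" if point[2] < 0.5 else "top"
        dataset.append((label, point))
    return dataset


cube_points = make_cube_dataset()
```

Whatever convention is chosen for "bottom" and "top", it should be recorded in the write-up along with the seed, so that the data can be reproduced.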

Train an SVM to recognize the difference between “bottom” and “top”. You should be able to train this machine to have very high accuracy. Remember to use separate training and verification sets.
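Keeping the training and verification sets separate can be sketched as a shuffle followed by a split, with accuracy then measured only on the held-out part. This is a minimal Python illustration, not a required procedure: the 70/30 ratio, the seed, and the helper names split_dataset and accuracy are all assumptions made for the example.

```python
import random


def split_dataset(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples` and return (training, verification) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # avoid any ordering bias in the data
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)
```

With 200 points and a 0.7 training fraction, this gives 140 training examples and 60 verification examples. The SVM should see only the first list during training; the predictions passed to accuracy should come from the second.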

For graduate students:

1. (100 points)

Graduate students, in addition to completing the undergraduate portion, will experiment with using kernels to attempt to train an SVM to recognize a non-linear boundary.

After completing the undergraduate portion of the assignment, create a second dataset of 200 points. This time, if a point has a distance of 0.6 or less from the origin, label it as “near”; otherwise, label it as “far”. Train an SVM to recognize this boundary.
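The labelling rule for the second dataset might be sketched as follows in Python. It assumes, as the assignment appears to intend, that the points are again drawn uniformly from the same unit cube, and that "distance" means Euclidean distance. The name make_sphere_dataset is hypothetical.

```python
import math
import random


def make_sphere_dataset(n_points=200, radius=0.6, seed=0):
    """Return (label, (x, y, z)) pairs labelled by distance from the origin."""
    rng = random.Random(seed)  # fixed seed so the dataset can be regenerated
    dataset = []
    for _ in range(n_points):
        # ASSUMPTION: points are uniform in the unit cube, as in the first part.
        point = (rng.random(), rng.random(), rng.random())
        distance = math.sqrt(sum(c * c for c in point))  # Euclidean norm
        label = "near" if distance <= radius else "far"
        dataset.append((label, point))
    return dataset


sphere_points = make_sphere_dataset()
```

Under the uniform assumption the two classes are not balanced: the "near" region is one octant of a sphere of radius 0.6, with volume (1/8)(4/3)π(0.6)³ ≈ 0.113, so only about 11% of points are expected to be "near". A classifier that always answers "far" would therefore already score close to 89%, which is worth keeping in mind when deciding what counts as "high" accuracy.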

You should not be able to achieve high classification accuracy with a linear separation; you will need to experiment with different kernels. Can you find one that achieves high classification accuracy?
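For reference while experimenting, the kernel families that libsvm offers alongside the linear one include the polynomial and radial basis function (RBF) kernels. The Python sketch below writes them out in the forms libsvm documents, with parameters named gamma, coef0 and degree. The default values shown here are placeholders for illustration only, and which kernel and parameter settings work best for this dataset is left to the student's experiments.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    """K(u, v) = (gamma * u . v + coef0) ** degree"""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * |u - v|^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

A kernel acts as a similarity measure between two points. The RBF kernel, for instance, equals 1 when the two points coincide and decays toward 0 as they move apart, at a rate controlled by gamma.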

[1] Or a gzipped tarball, if the student is so inclined.