Dr. Eick
COSC 4335 “Data Mining” Spring 2017
Assignment1: (Exploratory) Data Analysis
for a Vehicle Silhouette Dataset
Group Project (typically 2-3 students per group)
Due: Saturday, February 18, 11 p.m. (electronic submission)
Last Updated: January 24, 2016, 11:11a
Learning Objectives:
1. Learn how to manage and preprocess datasets, compute basic statistics, and create basic data visualizations (using R)
2. Learn how to interpret popular displays, such as histograms, scatter plots, box plots, density plots,…
3. Get some practical experience in exploratory data analysis
4. Learn how to create background knowledge for a dataset
5. Learn to distinguish expected from unexpected results in data analysis and data mining—in general, this task is quite challenging, as it requires background knowledge with respect to the employed data mining technique, and also practical experience.
Download the Statlog (Vehicle Silhouettes) Data Set from http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes) and limit your analysis to the following subset of the dataset. Use all examples to create the subset, and do not change the order of the examples in the dataset:
i. Groups 1-5 analyze the COMPACTNESS ((average perim)**2/area), RADIUS RATIO ((max.rad-min.rad)/av.radius), ELONGATEDNESS (area/(shrink width)**2), and SCALED VARIANCE ALONG MAJOR AXIS ((2nd order moment about minor axis)/area) attributes (1st, 4th, 8th, and 11th attribute) and the class attribute.
ii. Groups 6 and higher analyze the COMPACTNESS ((average perim)**2/area), CIRCULARITY ((average radius)**2/area), SCALED VARIANCE ALONG MAJOR AXIS ((2nd order moment about minor axis)/area), and HOLLOWS RATIO ((area of hollows)/(area of bounding polygon)) attributes (1st, 2nd, 11th, and 18th attribute) and the class attribute.
5 Examples in the raw Vehicle Silhouette Dataset:
96 55 103 201 65 9 204 32 23 166 227 624 246 74 6 2 186 194 opel
89 36 51 109 52 6 118 57 17 129 137 206 125 80 2 14 181 185 van
99 41 77 197 69 6 177 36 21 139 202 485 151 72 4 10 198 199 bus
104 54 100 186 61 10 216 31 24 173 225 686 220 74 5 11 185 195 saab
101 56 100 215 69 10 208 32 24 169 227 651 223 74 6 5 186 193 opel
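The five raw rows above are enough to sketch how a group's subset might be built in R. The snippet below is a minimal sketch, not a reference solution: it reads the rows from an inline string (in practice the text would come from the downloaded data files), and the column names compactness, radius_ratio, elongatedness, scaled_var_major, and class are labels assumed here for illustration, since the example rows carry no header. It selects the 1st, 4th, 8th, and 11th attributes plus the class column, i.e. the Groups 1-5 subset.

```r
# Five example rows copied from the handout; the full dataset would be read
# from the downloaded files instead of this inline string.
raw_text <- "
96 55 103 201 65 9 204 32 23 166 227 624 246 74 6 2 186 194 opel
89 36 51 109 52 6 118 57 17 129 137 206 125 80 2 14 181 185 van
99 41 77 197 69 6 177 36 21 139 202 485 151 72 4 10 198 199 bus
104 54 100 186 61 10 216 31 24 173 225 686 220 74 5 11 185 195 saab
101 56 100 215 69 10 208 32 24 169 227 651 223 74 6 5 186 193 opel
"

# Whitespace-separated, no header line: columns arrive as V1..V19.
vehicles <- read.table(text = raw_text, header = FALSE,
                       stringsAsFactors = FALSE)

# Columns 1, 4, 8, 11 are the Groups 1-5 attributes; column 19 is the class.
# (Groups 6 and higher would use c(1, 2, 11, 18, 19) instead.)
keep <- c(1, 4, 8, 11, 19)
subset_df <- vehicles[, keep]

# Column names are assumed labels, chosen here only for readability.
names(subset_df) <- c("compactness", "radius_ratio", "elongatedness",
                      "scaled_var_major", "class")

# Selecting columns leaves the row order untouched, as the assignment requires.
print(subset_df)
```

Selecting columns by position rather than by name keeps the sketch independent of whatever header convention a group adopts for the raw files.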
Assignment1 Tasks:
Apply the following exploratory data analysis techniques using R to your dataset:
0. Compute the mean value and standard deviation of the 4 numerical attributes[1]. 1 point
1. Compute the covariance matrix of your 4 numerical attributes; next, compute the correlations for each of the 6 pairs of the 4 attributes. Interpret the statistical findings! 6 points
2. Create a scatter plot for COMPACTNESS and SCALED VARIANCE of your dataset. Interpret the scatter plot! 3 points
3. Create histograms for each of the 4 attributes. Then create the same histograms for the 4 attributes for the instances of each of the 4 classes; interpret the obtained 20 histograms. 10 points
4. Create box plots for the first and last numerical attribute of your dataset for the instances of the 4 classes and the whole dataset. Interpret and compare the obtained 5 boxplots for each of the two attributes! 8 points
5. Create supervised scatter plots/supervised density plots for all pairs of your numerical attributes. Next create a 3D-scatterplot using the first 3 numerical attributes and the last 3 numerical attributes of your dataset—that is two 3D-scatterplots have to be created. Interpret the obtained plots; in particular address what can be said about the difficulty in predicting the correct class of the vehicle silhouette. Assess the usefulness of the 3D scatterplot compared to the 2D plots! 10 points
6. Create star plots for the first 10 instances of class OPEL and the first 10 instances of class VAN (based on the order in the file); interpret the 20 star plots. Star plots should be constructed for the 4 continuous attributes! 3 points
7. Create a new dataset ZVS from your original dataset by transforming the 4 continuous attributes into z-scores; next convert the class attribute as follows: OPEL→1, SAAB→2, VAN→3, BUS→4. Next, fit a linear model that predicts the modified class attribute using the four z-scored, continuous attributes as independent variables. Report the R² of the linear model and the coefficients of each attribute in the obtained regression function. Do the coefficients tell you anything about the importance of the attribute in predicting four classes of vehicle silhouettes? 8 points
8. Create 3 decision tree models with 20 or fewer nodes for your dataset (the total number of nodes should be less than 21; do not submit models with more than 20 nodes!). Explain how the 3 decision tree models were obtained. Report the training accuracy and the testing accuracy of the decision trees and interpret the learnt decision trees. What do they tell you about the importance of the 4 continuous attributes for the classification problem? 6 points
9. Write a conclusion (at most 13 sentences!) summarizing the most important findings of the assignment; in particular, address the findings obtained related to predicting the class attribute. 4 points (and up to 4 extra points)
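Tasks 0, 1, and 7 come down to a handful of base R calls. The snippet below is a minimal sketch under stated assumptions, not a reference solution: it uses only the five example rows from the handout (Groups 1-5 subset), so the printed numbers are not the values for the full dataset. The column names, the lookup vector class_code, and the function fit_zvs are hypothetical names introduced for illustration. The linear model is wrapped in a function and not called here, because five rows cannot support a regression with four predictors.

```r
# Five example rows from the handout (stand-in for the full dataset).
raw_text <- "
96 55 103 201 65 9 204 32 23 166 227 624 246 74 6 2 186 194 opel
89 36 51 109 52 6 118 57 17 129 137 206 125 80 2 14 181 185 van
99 41 77 197 69 6 177 36 21 139 202 485 151 72 4 10 198 199 bus
104 54 100 186 61 10 216 31 24 173 225 686 220 74 5 11 185 195 saab
101 56 100 215 69 10 208 32 24 169 227 651 223 74 6 5 186 193 opel
"
vehicles <- read.table(text = raw_text, header = FALSE,
                       stringsAsFactors = FALSE)
df <- vehicles[, c(1, 4, 8, 11, 19)]
# Assumed column labels; the example rows have no header.
names(df) <- c("compactness", "radius_ratio", "elongatedness",
               "scaled_var_major", "class")
num_cols <- names(df)[1:4]

# Task 0: per-attribute mean and sample standard deviation.
means <- sapply(df[num_cols], mean)
sds   <- sapply(df[num_cols], sd)

# Task 1: 4x4 covariance matrix and Pearson correlation matrix; the six
# distinct attribute pairs are the off-diagonal entries of one triangle.
cov_mat <- cov(df[num_cols])
cor_mat <- cor(df[num_cols])

# Task 7: z-score the attributes and recode the class as in the assignment
# (class labels appear in lower case in the example rows).
class_code <- c(opel = 1, saab = 2, van = 3, bus = 4)  # hypothetical name
zvs <- as.data.frame(scale(df[num_cols]))
zvs$class_num <- unname(class_code[df$class])

# Hypothetical helper: fit the linear model on a ZVS-style data frame and
# return R-squared and coefficients. Intended for the full dataset only.
fit_zvs <- function(z) {
  fit <- lm(class_num ~ ., data = z)
  list(r_squared = summary(fit)$r.squared, coefficients = coef(fit))
}

print(round(means, 2))
print(round(sds, 2))
print(round(cor_mat, 2))
print(zvs)
```

Because the predictors are z-scored before fitting, the coefficients returned by a function like fit_zvs are on a common scale, which is what makes comparing their magnitudes meaningful at all; whether that comparison says much about attribute importance is the question the task asks you to discuss.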
Remark: About 25-33% of the Assignment1 points will be allocated to interpreting statistical findings and visualizations.
[1] This is more of a verification that you have the correct dataset!