ISC471/HCI571 Fall 2012
Assignment 4
Prediction
Due date: Wednesday, November 28, 2012, midnight
The goal of this assignment is to practice prediction methods by applying them to a dataset using the SPSS data analysis tool.
Heart disease dataset
The dataset studied is the Cleveland dataset from the UCI repository. It describes numeric factors of heart disease. It can be downloaded from http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html and is contained in the datasets-numeric.jar archive.
The goal of this study is to predict the severity of heart disease in the Cleveland dataset (variable num) from the other attributes:
a. Convert the cleveland file into a comma-delimited file, where the first line lists the attribute names separated by commas. Save it as cleveland.txt.
Import cleveland.txt into SPSS, answering the import wizard's questions to convert the comma-delimited file into SPSS format. Check that the file contains 303 rows / cases.
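The files in the Weka datasets-numeric.jar archive are in ARFF format, so the conversion in step a amounts to extracting the @attribute names as a header line and keeping the already comma-delimited data rows. A minimal Python sketch (the sample text below is invented for illustration, not the real cleveland file):

```python
# Hypothetical sketch: convert ARFF-style text (as distributed in Weka's
# datasets-numeric.jar) into the comma-delimited format for cleveland.txt,
# with the attribute names as the first line.

def arff_to_csv(arff_text: str) -> str:
    attributes, data_lines, in_data = [], [], False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attributes.append(line.split()[1])    # second token is the attribute name
        elif low.startswith("@data"):
            in_data = True                        # everything after @data is a data row
        elif in_data:
            data_lines.append(line)               # ARFF data rows are already comma-delimited
    return "\n".join([",".join(attributes)] + data_lines)

# Invented two-attribute sample standing in for the real file.
sample = """@relation cleveland
@attribute age numeric
@attribute num numeric
@data
63,0
67,2"""
print(arff_to_csv(sample))
```

Running this on the real file and writing the result to cleveland.txt would produce the header-plus-303-rows layout that the SPSS text import wizard expects.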
b. How many variables are in this dataset?
c. What types of variables are in this dataset (numeric / ordinal / categorical)?
d. Choose Analyze → Regression → Linear. Select num as the dependent variable, all the other variables as independent variables, and the method Enter (the default), which forces all the variables into the model together. The other methods choose the order of entry of the variables based on mathematical criteria, which is generally not recommended, except when one has prior knowledge about the relative importance of the variables studied. It is also possible to enter the independent variables hierarchically, on separate screens, with the Next button. In the Statistics dialog, check all options except the covariance matrix, in general. It is possible to run the multiple regression first to detect which variables have the most effect, then to rerun the model hierarchically, entering those variables first. Run the analysis by clicking OK.
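Conceptually, the Enter method fits a single ordinary least-squares model with every predictor included at once. A stdlib-only Python sketch of that fit, via the normal equations, on invented toy data (with the real dataset, num would be y and the remaining attributes the columns of X):

```python
# Illustrative OLS "Enter" fit (not SPSS itself): all predictors enter the
# model together, and the coefficients solve the normal equations (X'X)b = X'y.

def ols_enter(X, y):
    """Fit y = b0 + b1*x1 + ... ; returns [intercept, slope1, slope2, ...]."""
    rows = [[1.0] + list(r) for r in X]            # prepend an intercept column
    p = len(rows[0])
    # Build X'X and X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, p):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, p):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    b = [0.0] * p                                  # back-substitution
    for i in reversed(range(p)):
        b[i] = (xty[i] - sum(xtx[i][j] * b[j] for j in range(i + 1, p))) / xtx[i][i]
    return b

# Toy data generated as y = 1 + 2*x1 + 3*x2 exactly, so OLS should recover it.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
y = [1 + 2 * a + 3 * b for a, b in X]
print([round(c, 6) for c in ols_enter(X, y)])      # → [1.0, 2.0, 3.0]
```

The coefficient list corresponds to the B column of the SPSS Coefficients table: the intercept (Constant) followed by one slope per predictor.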
e. Analyze the results obtained. Is there any multicollinearity between the variables? Multicollinearity means that there is a strong correlation between two or more of the predictors in the regression model, which should be avoided. To answer this question, look at the correlations between the variables, and state whether there are very strong correlations (R close to 1 or -1) between any pair of predictors. If you find multicollinearity, you may want to redo the model after eliminating the redundant predictors.
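The check behind step e is Pearson's r between each pair of predictors, as reported in the SPSS correlations table. A hedged sketch on invented data, where one predictor is nearly a multiple of another:

```python
# Pearson correlation between predictor pairs; |r| near 1 signals
# multicollinearity. The three predictors below are invented for illustration.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]      # almost exactly 2*x1: near-collinear
x3 = [5, 1, 4, 2, 3]                # unrelated ordering
print(round(pearson_r(x1, x2), 3))  # → 0.999: drop one of this pair
print(round(pearson_r(x1, x3), 3))  # → -0.3: no multicollinearity concern
```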
f. What is the equation of the regression line found, i.e., its coefficients?
g. The model summary provides the success measures R and R Square. The latter represents the proportion of variation in the target variable accounted for by the model. The Adjusted R Square indicates how well the model generalizes to new data, and should be as close to R Square as possible. How much of the variability of num can be predicted from the other attributes? How well would the model generalize to new data?
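The two model-summary statistics can be computed directly from observed values and model predictions. A sketch on invented numbers (the y_hat values stand in for SPSS output; n is the number of cases and k the number of predictors):

```python
# R Square = 1 - SS_residual / SS_total; Adjusted R Square penalizes for the
# number of predictors. Toy y and y_hat below are invented for illustration.

def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """n = number of cases, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

y     = [0, 1, 2, 3, 4]
y_hat = [0.2, 0.9, 2.1, 2.8, 4.0]
r2 = r_squared(y, y_hat)
print(round(r2, 4))                                  # → 0.99
print(round(adjusted_r_squared(r2, n=5, k=1), 4))    # → 0.9867
```

The small gap between the two values here is the behavior step g asks about: the closer Adjusted R Square is to R Square, the better the model should generalize.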
h. The ANOVA table tells whether the model is a good fit for the data. It compares the ability of the model to predict the data against simply using the mean of the data. The F-ratio indicates how much the prediction improves by using the model. The larger the F-ratio, the better (greater than 1 is a minimum), and most importantly the improvement has to be significant, meaning that it is unlikely to have happened by chance. Significance is indicated by Sig. < 0.05. Is the model a good fit for the data?
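The quantities in the ANOVA table can be sketched the same way, reusing invented observed and predicted values: the total sum of squares is the error of predicting with the mean alone, the residual sum of squares is the error left after fitting the model, and the F-ratio compares the explained mean square with the residual mean square.

```python
# F-ratio from the regression ANOVA decomposition, on toy numbers.

def anova_f(y, y_hat, k):
    """F-ratio for a regression with k predictors fitted to len(y) cases."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((a - mean_y) ** 2 for a in y)          # prediction by the mean alone
    ss_resid = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_model = ss_total - ss_resid                        # improvement due to the model
    return (ss_model / k) / (ss_resid / (n - k - 1))

y     = [0, 1, 2, 3, 4]
y_hat = [0.2, 0.9, 2.1, 2.8, 4.0]
print(round(anova_f(y, y_hat, k=1), 1))    # → 297.0, far above the minimum of 1
```

SPSS pairs this F value with its Sig. column; the sketch above computes only the ratio itself.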
i. Perform a prediction with Nearest-neighbor (Analyze → Classify → Nearest neighbor). Let the system choose the optimal K (between 1 and 5), and select an 80% training partition. What is the optimal value of K found by the system, and the corresponding error rate (it can be read from the K-selection chart, opened by double-clicking the chart)?
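What SPSS automates in step i can be approximated by evaluating K = 1..5 on a held-out 20% of the cases and keeping the K with the lowest error rate. A stdlib-only sketch on a tiny invented two-class dataset (not the cleveland data):

```python
# Rough stand-in for SPSS's automatic K selection: try K = 1..5 against a
# 20% holdout and keep the K with the lowest error rate. Data is invented.
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (features, label); majority vote among the k nearest."""
    ranked = sorted(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], query)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def error_rate(train, test, k):
    wrong = sum(knn_predict(train, x, k) != label for x, label in test)
    return wrong / len(test)

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0),
        ((5, 5), 1), ((5, 6), 1), ((6, 5), 1), ((6, 6), 1),
        ((0.5, 0.5), 0), ((5.5, 5.5), 1)]
split = int(len(data) * 0.8)                 # 80% training, 20% holdout
train, test = data[:split], data[split:]
errors = {k: error_rate(train, test, k) for k in range(1, 6)}
best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])
```

On the real dataset, SPSS performs the partitioning randomly and reports the per-K error in the K-selection chart; this sketch only illustrates the selection logic.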
j. Overall, which method(s) provided the most useful information? Explain your answer.