ICS 491 Spring 2004 Midterm Exam Solution
1. (5 pts) List and briefly describe the steps in data mining.
pp. 6-8: 1. State the problem and formulate the hypothesis; 2. Collect the data; 3. Preprocess the data (clean up, reduce, fill in missing values, etc.); 4. Estimate the model (apply the data-mining technique); 5. Interpret the results and draw conclusions (estimate errors).
2. (20 pts) Using entropy, rank features X, Y and Z using the input data shown below.
Sample / X / Y / Z
1 / 1 / 2 / 2
2 / 1 / 4 / 2
Total:
Sij / 1 / 2
1 / - / 2/3
2 /   / -
Etotal = -S12*log(S12) - (1-S12)*log(1-S12) = -(2/3)*log(2/3) - (1/3)*log(1/3) ≈ 0.27 (with log base 10)
------
1. Eliminate X:
Sij / 1 / 2
1 / - / 1/2
2 /   / -
Ex = -S12*log(S12) - (1-S12)*log(1-S12) = -2*(1/2)*log(1/2) ≈ 0.3
2. Eliminate Y:
Sij / 1 / 2
1 / - / 2/2
2 /   / -
Ey = -S12*log(S12) - (1-S12)*log(1-S12) = -1*log(1) - 0*log(0) = 0 (taking 0*log(0) = 0)
3. Eliminate Z:
Sij / 1 / 2
1 / - / 1/2
2 /   / -
Ez = -S12*log(S12) - (1-S12)*log(1-S12) = -2*(1/2)*log(1/2) ≈ 0.3
------
Etotal - Ex = -0.03 = Etotal - Ez
Etotal - Ey = 0.27
The smallest change in entropy comes from removing X or Z, so X and Z are the least important features and are ranked last (their mutual order does not matter). The order of increasing importance is: X, Z, Y.
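(For reference, a minimal Python sketch of this entropy-based ranking, assuming the nominal similarity measure and base-10 logarithm used above; the function and variable names are illustrative, not part of the exam.)

from math import log10

# Entropy-based feature ranking (sketch): similarity between two samples with nominal
# features = fraction of matching feature values; the feature whose removal changes the
# total entropy the least is the least important one.
def similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def entropy(s):
    # -s*log(s) - (1-s)*log(1-s), base 10, with 0*log(0) taken as 0
    return -sum(p * log10(p) for p in (s, 1 - s) if p > 0)

samples = {"X": [1, 1], "Y": [2, 4], "Z": [2, 2]}   # feature -> values for samples 1 and 2

def dataset_entropy(features):
    rows = list(zip(*(samples[f] for f in features)))
    return entropy(similarity(rows[0], rows[1]))     # only one sample pair here

e_total = dataset_entropy(list(samples))
for f in samples:
    e_without = dataset_entropy([g for g in samples if g != f])
    # smallest |difference| -> least important feature (X and Z here; Y is most important)
    print(f, round(abs(e_total - e_without), 3))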
3. Construct and draw the decision tree and its rules for the data shown below. Assume that feature X is categorical and feature Y is numerical.
Sample / X / Y / Class
1 / 1 / 2 / 1
2 / 1 / 4 / 1
3 / 1 / 3 / 2
|S| = 3
Info(S) = -(2/3)*log(2/3) - (1/3)*log(1/3) ≈ 0.27
1. If we split on X:
|T1| = 3
Info(Tx) = (3/3)*(-(2/3)*log(2/3) - (1/3)*log(1/3)) ≈ 0.27
2. If we split on Y:
2.1 split on Y<=3:
|T1| = 2, |T2| = 1 (T1 is for Y<=3, T2 for Y>3)
Info(Ty) = (2/3)*(-(1/2)*log(1/2) - (1/2)*log(1/2)) + (1/3)*(-1*log(1)) ≈ 0.2
2.2. split on Y<=2
|T1| = 1, |T2| = 2 (T1 is for Y<=2, T2 for Y>2)
Info(Ty) = (1/3)*(-1*log(1)) + (2/3)*(-(1/2)*log(1/2) - (1/2)*log(1/2)) ≈ 0.2
2.3 A split on Y<=4 is meaningless, since it puts all samples in one branch.
So we can split on either Y<=2 or Y<=3; it does not matter. Let's use Y<=2.
Info(Ty) = 0.2
Info(S) - Info(Tx) = 0 < Info(S) - Info(Ty) = 0.07, so we will split on Y.
For Y>2, we are left with samples:
X / Y / Class
1 / 4 / 1
1 / 3 / 2
So it is obvious that we can split on Y<=3. Splitting on X would not help, since both remaining samples have X=1.
If Y <= 2
    Class = C1
Else if Y > 2
    If Y <= 3
        Class = C2
    Else
        Class = C1
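(A small Python sketch of the information-gain calculation and the resulting rules; the helper names are my own and only illustrate the hand calculation above.)

from math import log10

data = [(1, 2, 1), (1, 4, 1), (1, 3, 2)]   # (X, Y, Class)

def info(classes):
    n = len(classes)
    return -sum((classes.count(c) / n) * log10(classes.count(c) / n) for c in set(classes))

def info_after_split(rows, threshold):
    # weighted class entropy after the binary split Y <= threshold
    left = [c for _, y, c in rows if y <= threshold]
    right = [c for _, y, c in rows if y > threshold]
    return sum(len(part) / len(rows) * info(part) for part in (left, right) if part)

info_s = info([c for _, _, c in data])            # ~0.27
gain_y2 = info_s - info_after_split(data, 2)      # information gain for the split Y <= 2
print(round(info_s, 2), round(gain_y2, 2))

def classify(x, y):
    # the decision rules derived above (X is never tested, since it has a single value)
    if y <= 2:
        return 1
    return 2 if y <= 3 else 1

print([classify(x, y) for x, y, _ in data])       # [1, 1, 2]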
4. (15 pts) Find frequent itemsets and association rules for the following transactions. The minimum support is 2.
1 2 3
1 2
1 3
Candidates for frequent itemsets of size 1: 1, 2, 3
Frequent itemsets of size 1: 1, 2, 3
Candidates for frequent itemsets of size 2: 12, 23, 13
Frequent itemsets of size 2: 12, 13
Candidates for frequent itemsets of size 3: 123
Frequent itemsets of size 3: none
Association rules candidates:
All combinations of 1, 2, 3, 12, 13:
1->2
1->3
2->3
1->12
1->13
2->12
2->13
3->12
3->13
12->13
Association rules are the strong rules, i.e. the union of all itemsets used in a rule must be frequent, and the rule must have confidence >= min_confidence. Since no min_confidence was specified, the confidence requirement is not applied here.
So association rules are:
1->2
1->3
1->12
1->13
2->12
3->13
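(A brute-force Python sketch of the frequent-itemset step, counting support for every candidate itemset with min_support = 2. The rule printout at the end uses the standard antecedent -> consequent form and shows confidences only for reference, since no min_confidence was given; names are illustrative.)

from itertools import combinations

transactions = [{1, 2, 3}, {1, 2}, {1, 3}]
MIN_SUPPORT = 2

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    level = {frozenset(c) for c in combinations(items, size) if support(set(c)) >= MIN_SUPPORT}
    if not level:
        break
    frequent[size] = level
print(frequent)   # size 1: {1}, {2}, {3}; size 2: {1, 2}, {1, 3}

# standard rules antecedent -> consequent from each frequent itemset of size 2
for itemset in frequent.get(2, set()):
    for a in itemset:
        rest = set(itemset - {a})
        conf = support(itemset) / support({a})
        print(f"{a} -> {rest} (confidence {conf:.2f})")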
5. (10 pts) For the data below, predict values of Y using linear regression. What will be value of Y for X=3?
Sample / X / Y
1 / 1 / 2
2 / 1 / 0
B = [(1-1)(2-1) + (1-1)(0-1)] / [(1-1)^2 + (1-1)^2]; the denominator is 0 because X has no variation, so the slope carries no information and we take B = 0
A = 1-0*1 = 1
Regression line: Y = 1
Therefore, Y = 1 for any value of X, including X=3.
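(A minimal least-squares sketch for this data; the zero-variance guard reflects the fact that both X values are equal, in which case the prediction is just the mean of Y. The helper name is illustrative.)

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0   # slope B; 0 when all X values are equal
    a = my - b * mx                 # intercept A
    return a, b

a, b = fit_line([1, 1], [2, 0])
print(a + b * 3)   # prediction for X = 3 -> 1.0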
6. (20 pts) For the following data, predict the classification of sample C=(1,1) using:
- Bayesian classifier
- Clustering
Sample / X / Y / Class
1 / 1 / 2 / 1
2 / 1 / 0 / 2
a. Bayesian classifier:
P(C=1) = 1/2
P(C=2) = 1/2
P(x=1|C=1) = 1, P(y=1|C=1) = 0, so P(C=1)*P(x=1|C=1)*P(y=1|C=1) = 1/2 * 1 * 0 = 0
P(x=1|C=2) = 1, P(y=1|C=2) = 0, so P(C=2)*P(x=1|C=2)*P(y=1|C=2) = 1/2 * 1 * 0 = 0
We don’t have enough data for probabilistic analysis.
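(An illustrative naive-Bayes sketch for sample C = (1, 1), estimating the conditional probabilities by simple frequency counts over the two training samples; it reproduces the zero scores above.)

train = [((1, 2), 1), ((1, 0), 2)]   # ((X, Y), class)

def naive_bayes_score(sample, cls):
    rows = [feats for feats, c in train if c == cls]
    score = len(rows) / len(train)                    # prior P(class)
    for i, value in enumerate(sample):
        score *= sum(f[i] == value for f in rows) / len(rows)
    return score

for cls in (1, 2):
    print(cls, naive_bayes_score((1, 1), cls))        # prints 1 0.0 and 2 0.0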
b. Clustering:
Class 1: M1 = (1, 2), only point A
Class 2: M2 = (1, 0), only point B
Distance(C, M1) = sqrt((1-1)^2 + (2-1)^2) = 1
Distance(C, M2) = sqrt((1-1)^2 + (0-1)^2) = 1
Since the distances are equal, C can be assigned to either class 1 or class 2.
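(A nearest-centroid sketch for the same sample, using Euclidean distance; the centroids are the single training points above.)

from math import dist   # Euclidean distance (Python 3.8+)

centroids = {1: (1, 2), 2: (1, 0)}
sample = (1, 1)
print({cls: dist(sample, m) for cls, m in centroids.items()})   # {1: 1.0, 2: 1.0} -> tie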
7. (10 pts) Calculate the output of the neuron shown below, using hard limit activation function and input vector X=<-1, -1>.
y = hardlimit(-1*1 + (-1)*2 + 3) = hardlimit(0) = 1 (the hard limit function outputs 1 for net input >= 0)
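(A sketch of the neuron computation, assuming from the figure that the weights are 1 and 2 and the bias is 3, with hardlimit(net) = 1 for net >= 0.)

# weights and bias are assumptions read off the figure referenced in the problem
def neuron(x, weights=(1, 2), bias=3):
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net >= 0 else 0   # hard limit activation

print(neuron((-1, -1)))   # net = 1*(-1) + 2*(-1) + 3 = 0 -> output 1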