ICS 491 Spring 2004 Midterm Exam Solution

1. (5 pts) List and briefly describe the steps in data mining.

pp. 6-8: 1. State the problem and formulate the hypothesis; 2. Collect the data; 3. Preprocess the data (clean it, reduce its dimensionality, fill in missing values, etc.); 4. Estimate the model (apply the data-mining algorithm); 5. Interpret the results and draw conclusions (including error analysis).

2. (20 pts) Using entropy, rank features X, Y and Z using the input data shown below.

Sample / X / Y / Z
1 / 1 / 2 / 2
2 / 1 / 4 / 2

Total (all three features X, Y, Z kept):

S12 = 2/3 (samples 1 and 2 agree on two of the three features: X and Z match, Y does not)

Etotal = -S12 log S12 - (1-S12) log(1-S12) = -2/3 log(2/3) - 1/3 log(1/3) ≈ 0.276 (log base 10)

------

1. Eliminate X:

S12 = 1/2 (of the remaining features Y and Z, only Z matches)

Ex = -S12 log S12 - (1-S12) log(1-S12) = -2 * 1/2 log(1/2) ≈ 0.301

2. Eliminate Y:

S12 = 2/2 = 1 (the remaining features X and Z both match)

Ey = -S12 log S12 - (1-S12) log(1-S12) = -1 log 1 - 0 log 0 = 0 (using the convention 0 log 0 = 0)

3. Eliminate Z:

S12 = 1/2 (of the remaining features X and Y, only X matches)

Ez = -S12 log S12 - (1-S12) log(1-S12) = -2 * 1/2 log(1/2) ≈ 0.301

------

Etotal - Ex = Etotal - Ez ≈ 0.276 - 0.301 = -0.025

Etotal - Ey ≈ 0.276 - 0 = 0.276

Removing X or Z changes the entropy the least (|Etotal - Ex| = |Etotal - Ez| ≈ 0.025), so X and Z are the least important features and are ranked last (their relative order does not matter). Removing Y changes the entropy the most, so Y is the most important feature. Order of increasing importance: X, Z, Y.
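As a quick sanity check on the numbers above, a minimal Python sketch of the same calculation, assuming the similarity between the two samples is the fraction of the (categorical) features on which they agree:

    import math

    # Question 2 data: two samples, three features treated as categorical values
    samples = [
        {"X": 1, "Y": 2, "Z": 2},
        {"X": 1, "Y": 4, "Z": 2},
    ]

    def entropy(features):
        # Similarity S12 = fraction of the kept features on which the two samples agree
        s12 = sum(samples[0][f] == samples[1][f] for f in features) / len(features)
        # Entropy measure, log base 10, with the convention 0 * log(0) = 0
        return sum(-p * math.log10(p) for p in (s12, 1 - s12) if p > 0)

    e_total = entropy(["X", "Y", "Z"])
    for f in ["X", "Y", "Z"]:
        remaining = [g for g in ["X", "Y", "Z"] if g != f]
        e = entropy(remaining)
        print(f"remove {f}: E = {e:.3f}, Etotal - E = {e_total - e:.3f}")
    # remove X: E = 0.301, Etotal - E = -0.025
    # remove Y: E = 0.000, Etotal - E = 0.276
    # remove Z: E = 0.301, Etotal - E = -0.025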

3. Construct and draw the decision tree and its rules for the data shown below. Assume that feature X is categorical and feature Y is numerical.

Sample / X / Y / Class
1 / 1 / 2 / 1
2 / 1 / 4 / 1
3 / 1 / 3 / 2

|S| = 3

Info(S) = -2/3 log(2/3) - 1/3 log(1/3) ≈ 0.276 (log base 10, as in question 2)

1. If we split on X:

X = 1 for every sample, so the split produces a single branch with |T1| = 3.

Info(Tx) = 3/3 (-2/3 log(2/3) - 1/3 log(1/3)) ≈ 0.276

2. If we split on Y:

2.1 Split on Y <= 3:

|T1| = 2, |T2| = 1 (T1 holds the samples with Y <= 3, T2 the sample with Y > 3)

Info(Ty) = 2/3 (-1/2 log(1/2) - 1/2 log(1/2)) + 1/3 (-1 log 1) ≈ 0.20

2.2 Split on Y <= 2:

|T1| = 1, |T2| = 2 (T1 holds the sample with Y <= 2, T2 the samples with Y > 2)

Info(Ty) = 1/3 (-1 log 1) + 2/3 (-1/2 log(1/2) - 1/2 log(1/2)) ≈ 0.20

2.3 A split on Y <= 4 is meaningless: it puts all samples into one branch.

Both thresholds give the same information, so we can bin on Y <= 2 or Y <= 3; let's use Y <= 2.

Info(Ty) ≈ 0.20

Info(S) - Info(Tx) = 0 < Info(S) - Info(Ty) ≈ 0.08, so we split on Y at the root.

For Y > 2, we are left with the samples:

X / Y / Class
1 / 4 / 1
1 / 3 / 2

Splitting on X gives no information here, since both remaining samples have X = 1, so we split on Y <= 3, which separates the two classes.

If Y <= 2: Class = C1
Else (Y > 2):
    If Y <= 3: Class = C2
    Else: Class = C1
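A short Python sketch of the root gain calculation and the resulting rules (an illustrative re-computation, using log base 10 to match the values above):

    import math

    # Question 3 data: (X, Y, Class)
    data = [(1, 2, 1), (1, 4, 1), (1, 3, 2)]

    def info(rows):
        # Entropy of the class labels (log base 10, to match the hand calculation)
        n = len(rows)
        result = 0.0
        for c in set(r[2] for r in rows):
            p = sum(1 for r in rows if r[2] == c) / n
            result -= p * math.log10(p)
        return result

    def info_after_split(rows, test):
        # Weighted entropy after partitioning the rows with a boolean test
        parts = [[r for r in rows if test(r)], [r for r in rows if not test(r)]]
        return sum(len(p) / len(rows) * info(p) for p in parts if p)

    print(round(info(data), 3))                                   # 0.276
    print(round(info_after_split(data, lambda r: r[0] == 1), 3))  # split on X: 0.276 (gain 0)
    print(round(info_after_split(data, lambda r: r[1] <= 2), 3))  # split on Y <= 2: 0.201

    def classify(y):
        # The rules read off the tree above
        if y <= 2:
            return 1
        return 2 if y <= 3 else 1

    print([classify(y) for _, y, _ in data])  # [1, 1, 2] -- matches the Class column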

4. (15 pts) Find frequent itemsets and association rules for the following transactions. The minimum support is 2.

1 2 3

1 2

1 3

Candidates for frequent itemsets of size 1: 1, 2, 3

Frequent itemsets of size 1: 1, 2, 3 (supports 3, 2, 2, all >= 2)

Candidates for frequent itemsets of size 2: 12, 23, 13

Frequent itemsets of size 2: 12, 13 (support 2 each; 23 has support 1 and is dropped)

Candidates for frequent itemsets of size 3: 123

Frequent itemsets of size 3: none (123 has support 1)

Association rule candidates (all combinations of 1, 2, 3, 12, 13): 1->2, 1->3, 2->3, 1->12, 1->13, 2->12, 2->13, 3->12, 3->13, 12->13

Association rules are the strong rules: the union of all itemsets appearing in a rule must be frequent, and the rule must have confidence above min_confidence. Since no min_confidence was specified, the confidence requirement is ignored here.

So the association rules are: 1->2, 1->3, 1->12, 1->13, 2->12, 3->13
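A minimal Python sketch of the frequent-itemset step, assuming minimum support 2; it simply enumerates candidates of each size rather than using Apriori's join-and-prune, which gives the same result on this small data set:

    from itertools import combinations

    # Question 4: transactions and minimum support
    transactions = [{1, 2, 3}, {1, 2}, {1, 3}]
    min_support = 2

    def support(itemset):
        # Number of transactions that contain every item of the itemset
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    frequent = {}
    for size in (1, 2, 3):
        for candidate in combinations(items, size):
            s = support(set(candidate))
            if s >= min_support:
                frequent[candidate] = s

    print(frequent)
    # {(1,): 3, (2,): 2, (3,): 2, (1, 2): 2, (1, 3): 2}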

5. (10 pts) For the data below, predict values of Y using linear regression. What will be value of Y for X=3?

Sample / X / Y
1 / 1 / 2
2 / 1 / 0

B = [(1-1)(2-1) + (1-1)(0-1)] / [(1-1)^2 + (1-1)^2] = 0/0: X has no variance, so the least-squares slope is undefined and we take B = 0 (the fit reduces to the mean of Y).

A = mean(Y) - B * mean(X) = 1 - 0*1 = 1

Regression line: Y = 1

Therefore, Y = 1 for any value of X, including X = 3.
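A minimal Python sketch of the same least-squares fit; it makes the zero-variance fallback (B = 0) explicit:

    # Question 5 data
    xs = [1, 1]
    ys = [2, 0]

    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)

    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)

    # X has no variance here, so the slope is undefined; fall back to B = 0 (fit = mean of Y)
    b = num / den if den != 0 else 0.0
    a = y_mean - b * x_mean

    print(f"Y = {a} + {b} * X")   # Y = 1.0 + 0.0 * X
    print(a + b * 3)              # prediction for X = 3 -> 1.0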

6. (20 pts) For the following data, predict the classification of sample C=(1,1) using:

  1. Bayesian classifier
  2. Clustering

Sample / X / Y / Class
1 / 1 / 2 / 1
2 / 1 / 0 / 2
1. Bayesian classifier:

P(Class=1) = 1/2, P(Class=2) = 1/2

P(x=1 | Class=1) = 1, P(y=1 | Class=1) = 0, so P(Class=1 | C) ∝ 1/2 * 1 * 0 = 0

P(x=1 | Class=2) = 1, P(y=1 | Class=2) = 0, so P(Class=2 | C) ∝ 1/2 * 1 * 0 = 0

Both posteriors are 0 (no training sample has y = 1), so we don't have enough data for a probabilistic decision.

2. Clustering:

Class 1 centroid: M1 = (1, 2) (only sample 1)

Class 2 centroid: M2 = (1, 0) (only sample 2)

Distance(C, M1) = sqrt((1-1)^2 + (2-1)^2) = 1

Distance(C, M2) = sqrt((1-1)^2 + (0-1)^2) = 1

Since the distances are equal, C can be assigned to either class 1 or class 2.
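A minimal Python sketch of the nearest-centroid comparison in part 2:

    import math

    # Question 6: one training sample per class, test point C = (1, 1)
    centroids = {1: (1, 2), 2: (1, 0)}   # class label -> centroid (here just the single sample)
    c = (1, 1)

    for label, m in centroids.items():
        print(f"distance from C to class {label} centroid: {math.dist(c, m)}")
    # Both distances are 1.0, so the assignment is a tie between the two classes.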

7. (10 pts) Calculate the output of the neuron shown below, using hard limit activation function and input vector X=<-1, -1>.

y = hardlimit(-1*1 + (-1)*2 + 3) = hardlimit(0) = 1 (the hard-limit function outputs 1 when the net input is >= 0)
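A minimal Python sketch of this neuron, assuming weights 1 and 2 and bias 3 as used in the calculation above (the figure itself is not reproduced here):

    def hardlimit(net):
        # Hard-limit activation: 1 for net >= 0, otherwise 0
        return 1 if net >= 0 else 0

    x = [-1, -1]
    w = [1, 2]   # weights from the figure, as used in the calculation above
    bias = 3

    net = sum(xi * wi for xi, wi in zip(x, w)) + bias
    print(net, hardlimit(net))   # net input 0 -> output 1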