CS121
Introduction to Artificial Intelligence
Winter 2004
Homework #8
(Probabilistic Inference and Inductive Learning)
Out: 3/3/04 — Due: 3/10/04
How to complete this HW: First copy this file; then insert your answers in the file immediately below each question; finally submit the completed file to the submission webpage before 3/3at midnight.
Note on Honor Code: You must NOT look at previously published solutions of any of these problems in preparing your answers. You may discuss these problems with other students in the class (in fact, you are encouraged to do so) and/or look into other documents (books, web sites), with the exception of published solutions, without taking any written or electronic notes. If you have discussed any of the problems with other students, indicate their name(s) here:
………………………………………………………………………………………………
Any intentional transgression of these rules will be considered an honor code violation.
General information: In the following, whenever an explanation is required (“Why?”), there will not be full-credit without an explanation. Keep explanations short and to the point. Excessive verbosity will be penalized. If you have any doubt on how to interpret a question, either tell us in advance, so that we can help you understand the question, or tell us how you understand it in your returned solution.
Grading:
Problem# / Max. grade / Your gradeI / 5
II / 5
III / 5
IV / 5
Total / 20
I. (5 points)Logic sentenceAB (read “A implies B”) is defined as the equivalent of AB (read “not A or B”). So, AB is True if and only if either A is False or both A and B are True.
Consider the following three sentences: AB, BC, and AC. Given a sentence S, Pr(S) denotes the probability that S is True. We are told that: Pr(AB) = 0.7 and Pr(BC) = 0.8. Given this information, we want to determine Pr(AC). In fact, Pr(AC) is not uniquely determined by the given information. Instead, it can only range over a sub-interval of [0,1]. The goal of this problem is to determine this interval.
- We define abc, ab, ac, …, as follows
abc = Pr(ABC)
ab = Pr(ABC)
ac = Pr(ABC)
...
a = Pr(ABC)
x = Pr(ABC)
Express Pr(AB), Pr(BC), and Pr(AC) using abc, ab, ac, bc, a, b, c, and x.
- Express all constraints on abc, ab, ac, bc, a, b, c, and x, using the axioms of probability and the given information?
- Derive from these constraints the range of values of Pr(AC).
- We have: Pr(AB) = Pr(AB).
We also have: AB  (ABC)(ABC)(ABC)(ABC)(ABC)(ABC).
Hence: Pr(AB) = abc + ab + bc + b + c + x.
Similarly:Pr(BC) = abc + bc + ac + c + a + x.
Pr(AC) = abc + ac + bc + c + b + x.
- Each is a real number that is greater than or equal to 0, and less than or equal to 1.
abc + ab + ac + bc + a + b + c + x = 1.
abc + ab + bc + b + c + x = 0.7 (equ. 1)
ac + a = 0.3.(equ. 2)
abc + bc + ac + c + a + x = 0.8(equ. 3)
ab + b = 0.2.(equ. 4)
Equ. 2 and 4 are derived from the 2nd constraint and equ. 1 and 3. They are used below.
- Pr(AC) = abc + ac + bc + c + b + x.
By substracting equ. 4 from equ. 1, we get:
abc + bc + c + x = 0.5.
So:
Pr(AC) = ac + b + 0.5.
Pr(AC) takes its smallest value when both ac + b is minimal, and its largest value when ac + b is maximal. Equ. 1-4 allow ac and b to be both 0 (minimal values), and they constrain them to be less than 0.3 (equ. 2) and 0.2 (equ. 4), respectively. Therefore, we have:
0.5  Pr(AC)  1.
II. (5 points) Suppose we generate a training set from a decision tree and then apply decision-tree learning to that training set. Is it the case that the learning algorithm will eventually return the correct tree as the training set goes to infinity? Why or why not?
The algorithm may not return the same tree, but it will return a tree that is logically equivalent, assuming that the method for generating the training set eventually generates all possible combinations of attributes. For example, if the method picks the value of each attribute uniformly at random, the probability that it generates all possible combinations goes to one when the training set goes to infinity. The form of the tree may not be the same, because there are multiple ways of representing the same logical function.
III. (5 points) Most likely, at the beginning of an exam, you try to predict whether each problem is easy or difficult (D = + if it is difficult and – if it is easy). Let us assume that you use two observable problem attributes (predicates):
- The text length L of the problem (L = 1 if it is long, 0 otherwise)
- The amount M of math in the text (M = 1 if there is a lot of math, 0 otherwise)
For training data, assume that you have examined 12 previous problems from the homeworks, and have collected the following data:
L / M / D / #0 / 0 /  / 4
0 / 0 / + / 1
0 / 1 /  / 0
0 / 1 / + / 3
1 / 0 /  / 1
1 / 0 / + / 2
1 / 1 /  / 1
1 / 1 / + / 0
The first line of this table reads as follows: 4 problems for which L = 0 and M = 0 were not difficult (D = ). The second line says: 1 problem for which L = 0 and M = 0 was difficult (D = +). Etc… Note that you observed no problem for which L = 0 and M = 1, or L = 1 and M = 1.
Based on this training data, you want to compute a representation of a difficult problem (D) in the form of a decision tree using the two binary attributes L and M.
- Construct the best decision tree you can for the training data. [As presented in class, choose the best attribute to test next by minimizing the number of misclassified examples in the training set. In addition, use no stopping criterion or pruning, so that the tree is grown until you run out of attribute; then, if needed, use majority rule.]
Deciding which attribute to test first, between L and M, we note that L has a 0.5 error rate since in both L=0 and L=1 the training examples are equally distributed between + and -. M has a (4/12)(1/4) + (8/12)(3/8) = 1/3 error rate, so we choose to split on M first.
On M’s 1 branch, L=1 has one – and no +’s, and L=0 has no ‘s and three +’s, so it’s easy to see that we split on L and assign + to the L=0 branch, and – to the L=1 branch.
On M’s 0 branch, L=1 has one – and two +’s, while L=0 has four –‘s and one plus. So to minimize error, we split on L and assign + to the L=1 branch and – to the L = 0 branch.
The final tree is:
(M(0 L (0[] 1[+]))
(1 L(0[+] 1[])))
- Suppose we have 3 test examples, whose correct classification we know, as follows:
- L
- M
- D
- 0
- 0
- 
- 1
- 1
- +
- 0
- 1
- +
Does the decision tree agree with each of these?
The tree gets the second example wrong, but gets the first and third examples correctly.
IV. (5 points) A program uses the version space method to determine the concept of a rewarded card. It is assumed that the concept to be learned is of the form:
REWARDED(x)  r(x) s(x)
where r is a rank predicate and s a suit predicate.
A rank predicate is one of the following: 1, 2, …, 10, J, Q, K, NUM (1, 2, …, 10), FACE (J, Q, K), ODD (1, 3, 5, …, 9, J, K), EVEN (2, 4, …, 10, Q), HONOR (1, J, Q, K), TEN-VALUE (10, J, Q, K), ANY-RANK (1, …, K).
A suit predicate is one of the following: , , , , BLACK, RED, MAJOR(that is,  or), MINOR (that is,  or ), ANY-SUIT.
For simplification we represent the concept REWARDED by such expressions as (FACE, BLACK), (TEN-VALUE, MAJOR), or (ACE, ANY-SUIT).
- Draw the directed acyclical graphs (one for ranks and one for suits) that represent the hierarchy of generality of the predicates.
- Give the specific and general boundaries (respectively, the S-boundary and the G-boundary) of the version space after each of the following three series of examples (each example is a card followed by “+” is the card is a rewarded card and “–” otherwise):
- (K +)(10 -) (8 +)
- (8 +)(J +)
- (8 -)(3 -)(2 -)(9 -)(10 -)(1 -)(J +)(Q +)
- Suit hierarchy:
ANY-SUITRED
BLACK
MAJOR
MINOR
Rank hierarchy:
ANY-RANKODD1
3
5
7
9
J
K
EVEN2
4
6
8
10
Q
NUM1
2
3
4
5
6
7
8
9
10
HONOR1
FACEJ
Q
K
TEN-VALUE10
FACEJ
Q
K
- G = (ANY-RANK, MINOR) and S = (ANY-RANK, CLUB)
- G = (ANY-RANK, ANY-SUIT) and S = (ANY-RANK, ANY-SUIT)
- G = (FACE, ANY-SUIT) and S = (FACE, MAJOR)
