Name: ______
University of Wisconsin – Madison
Computer Sciences Department
CS 760 - Machine Learning
Spring 2001
Exam
7:15-9:15pm, May 2, 2001
Room 2317 Engineering Hall
CLOSED BOOK
(one sheet of notes and a calculator allowed)
Write your answers on these pages and show your work. If you feel that a question is not fully specified, state any assumptions you need to make in order to solve the problem. You may use the backs of these sheets for scratch work.
Write your name on this and all other pages of this exam. Make sure your exam contains
6 problems on 11 pages.
Name ______
Student ID ______
Problem Score Max Score
1 ______25
2 ______10
3 ______23
4 ______17
5 ______15
6 ______10
TOTAL ______100
Problem 1 – Learning from Labelled Examples (25 points)
Imagine that you are given the following set of training examples. All the features are Boolean-valued.
F1 F2 F3 Result
T T F +
F T T +
T F T -
F T F -
F F T -
a) How would a Naive Bayes approach classify the following test example?
Be sure to show your work.
F1 = T F2 = F F3 = F
b) Show the calculations that ID3 would perform to determine the root node of a decision tree using the above training examples.
c) Show what the perceptron learning rule, with the "step function" as the activation function of its output unit, would do on the first and the last training examples (i.e., show the calculations involved in just these two training examples). Initialize all the perceptron's parameters to 0 and let the learning rate equal 0.1. The perceptron should output zero when its weighted input is less than or equal to its threshold. Be sure to draw the initial neural network, as well as the network state after each of the two training examples.
Briefly explain how one might augment the representation of the above training examples in order for a perceptron to be able to separate them (assume that there are not separable using the current representation).
d) Assume that we are using a very stupid learner within the AdaBoost algorithm. This simplistic learner simply chooses for its learned model the lowest-numbered feature that has not yet been used. Its only intelligent aspect is that it decides whether or not to negate this feature, depending on which option works best. I.e., the first time called it will return either F1 or NOT(F1) as its model.
Show the steps that result from using this learner within AdaBoost on the dataset above (in addition to the standard termination criterion for AdaBoost, stop training if all features have been chosen).
Show how the final model will classify the test example from Part a.
Problem 2 – Feature Space (10 points)
Imagine we need to learn some Boolean-valued function from labelled examples that are represented using N numeric features. The graphs below show the Venn diagram view of these training examples plotted in feature space. Each copy of the Venn diagram has associated with it an inductive-learning method. Show how each learning algorithm might partition feature space based on these examples. Briefly explain, to the right of each diagram, why this algorithm might partition the data in this manner. (Qualitative answers are fine; no need for exact calculations.)
Problem 3 – Reinforcement Learning (23 points)
Consider the deterministic reinforcement environment drawn below. The numbers on the arcs indicate the immediate rewards.
a) For what range of values for the discount rate (γ) would a reinforcement learner end up preferring going directly from a to end, rather than going via b? Justify your answer.
For the remainder of Problem 2, let γ = 0.9. Also, break ties when choosing actions in favor of going to states that appear earlier in alphabetic order.
b) Represent the Q table by placing Q values on the arcs on the environment's state-action graph; initialize all of the Q values to 3 except initialize the arc between a and b to 4. For Step 1, do an exploration action. Show on the graph below the full Q table after Step 1. Specify the action chosen and display the calculations involved in altering the Q table.
c) Assume that after Step 1, the RL agent is magically transported back to the state start. Show the resulting Q table after the learner takes its Step 2 from the starting state. Step 2 should be an exploitation action. Be sure to again state the action chosen and display your calculations.
d) Again the learner is magically transported back to the starting state and it takes Step 3 from there. Step 3, and any subsequent steps, should be exploration actions. This time, however, use the SARSA algorithm (but still use your Q table from Part c). Show all the actions and calculations leading up to the first Q table update due to SARSA.
e) Assume that instead of using a Q table, you decide to switch (starting after Part d's calculations) to using the nearest-neighbor approach to represent the Q function. Show the training examples produced by the actions performed in Parts b-d above. (Do not redo Parts b-d using the nearest-neighbor method; simply show the nearest-neighbor training examples.)
What do you feel is the most important "design decision" you would need to make in order to apply the nearest-neighbors approach to this task?
Problem 4 – Experimental Methodology (17 points)
a) You use machine learning to create a model for a Boolean-valued task, and then evaluate this model on 100 new examples. The model correctly classifies 80 of them. In what interval can you say with 95% confidence that the true accuracy of this model lies?
b) On another Boolean-valued task you use algorithm A to learn modelA and you use algorithm B to learn modelB. On 11 randomly selected testsets, the two models produce these error rates:
modelA: 7 6 8 4 3 9 2 0 1 5 7
modelB: 5 7 8 0 2 7 5 2 1 7 4
Can you say one model is statistically significantly better than the other? Why or why not?
c) On a third Boolean-valued task, assume that model1 and model2 produce the following numeric outputs (there is one output per example; i.e., there are 5 positive and 5 negative testset examples):
Positive examples in the testset Negative examples in the testset
model1: 9 3 5 7 6 model1: 4 2 8 1 0
model2: 5 6 7 4 3 model2: 5 3 5 2 1
Draw an ROC curve for each of these two models (on the same graph).
Based on these curves, under which conditions would each of the two models be the preferred one? Explain your answer.
Problem 5 – Determining What to Optimize (15 points)
a) One can interpret the numeric output in a regression-learning task as parameterizing a probability distribution. Explain.
b) According to the probabilistic analysis done in class, what is the proper error function for a regression task? You do not need to show a full derivation, but briefly explain the connection to your answer to Part a.
c) Briefly explain the connection between Bayes Rule and the MDL principle.
Problem 6 – Miscellaneous Short Answers (10 points)
Briefly discuss the importance in machine learning of each of the following:
feature selection
hierarchically valued features
temporal-difference methods
effective number of hidden units
cross-entropy error function
Have a good summer!
7