CIS526: Machine Learning

Lecture 7 (Oct 14, 2003)

Prepared by Troy Schrader and Tom Gradel

Administrative Notes…

Completion of Decision Tree Lecture…

Decision Trees

Decision trees can be transformed into a set of rules, which are then pruned, in order to prevent overfitting.

C4.5 Algorithm (C++ Open Source Freeware)

  • Grows a decision tree
  • Converts the decision tree into a set of rules (1 rule = 1 leaf), as sketched below
  • Prunes each rule separately

/ 2  ””
1,3  ”+”
1,4,5  ”+”
1,4,6  ””
Continuous Attributes
For continuous attributes, there is a large choice of splitting thresholds.
  • Question: Is information gain a concave function of the splitting threshold?
  • Answer: No, as the counterexample sketched below shows.
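A small illustrative sketch (not from the lecture) that evaluates the information gain of the binary split x ≤ t at every candidate threshold t of a continuous attribute. The toy data values are made up; printing the gains shows that they do not form a concave sequence in t.

    # Sketch: information gain of the split x <= t for every candidate threshold t.
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n)
                    for c in (labels.count(l) for l in set(labels)) if c)

    def info_gain(xs, ys, t):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        return entropy(ys) - (len(left) / len(ys)) * entropy(left) \
                           - (len(right) / len(ys)) * entropy(right)

    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = ['+', '-', '+', '+', '-', '-']        # toy labels chosen to break concavity

    # Candidate thresholds: midpoints between consecutive sorted attribute values.
    thresholds = [(a + b) / 2 for a, b in zip(sorted(xs), sorted(xs)[1:])]
    for t in thresholds:
        print(f"t = {t:.1f}   gain = {info_gain(xs, ys, t):.3f}")

The gains first fall, then rise, then fall again as t increases, so the gain is not a concave function of the threshold and every candidate threshold has to be checked.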

Non-Binary Splits

  • Using the information gain measure favors attributes that produce a larger number of splits
  • But we do not want many-way splits to be preferred over binary splits merely because of their branching factor. Therefore, a new measure called GainRatio is introduced:

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = −Σ_i (|S_i|/|S|) log2(|S_i|/|S|) and S_i is the subset of S for which attribute A has value v_i.
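A minimal sketch of the GainRatio computation just defined; the function and variable names are my own, and the toy data merely illustrates that a many-valued split is penalized relative to a binary one.

    # Sketch: GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A).
    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain_ratio(values, labels):
        """values[i] = value of attribute A on example i, labels[i] = its class."""
        n = len(labels)
        gain, split_info = entropy(labels), 0.0
        for v in set(values):
            subset = [l for a, l in zip(values, labels) if a == v]   # S_i
            p = len(subset) / n
            gain -= p * entropy(subset)                              # Gain(S, A)
            split_info -= p * log2(p)                                # SplitInformation(S, A)
        return gain / split_info if split_info else 0.0

    # A many-valued attribute (e.g., an ID) gets a high Gain but also a large
    # SplitInformation, so its GainRatio is penalized.
    print(gain_ratio(["a", "b", "c", "d"], ["+", "-", "+", "-"]))   # 4-way split: 0.5
    print(gain_ratio(["x", "x", "y", "y"], ["+", "+", "-", "-"]))   # binary split: 1.0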

Regression Trees

  • Each leaf predicts a continuous value instead of a class.
  • See the CART book (Breiman, Friedman, Olshen, and Stone, 1984), which helped give Machine Learning its start.
  • Regression trees are generally not as accurate as neural networks.

Regression trees are not smooth; they contain a large number of discontinuities. This is undesirable, since a slight change in the input can produce a large change in the output.
The best split is the one that reduces the MSE the most (the criterion is maximizing the MSE gain).
The prediction at a leaf is the average output of all training examples that reach that leaf.
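A minimal sketch (illustrative, not CART's actual implementation) of a single regression-tree split: every candidate threshold is tried, the one with the largest MSE gain is kept, and each resulting leaf predicts the mean output of its examples.

    # Sketch: choose the single split x <= t that maximizes the MSE reduction.
    def mse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys) / len(ys)

    def best_split(xs, ys):
        n, parent = len(ys), mse(ys)
        best = None
        for t in sorted(set(xs))[:-1]:                    # candidate thresholds
            left  = [y for x, y in zip(xs, ys) if x <= t]
            right = [y for x, y in zip(xs, ys) if x > t]
            gain = parent - (len(left) / n) * mse(left) - (len(right) / n) * mse(right)
            if best is None or gain > best[1]:
                best = (t, gain,
                        sum(left) / len(left),            # left-leaf prediction (mean)
                        sum(right) / len(right))          # right-leaf prediction (mean)
        return best

    xs = [1, 2, 3, 4, 5, 6]                               # toy input values
    ys = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]                   # toy outputs with a jump at x = 4
    t, gain, left_pred, right_pred = best_split(xs, ys)
    print(f"split at x <= {t}: MSE gain {gain:.2f}, leaf predictions {left_pred:.2f} / {right_pred:.2f}")

Because each leaf outputs a constant (the mean), the fitted function is piecewise constant, which is exactly the source of the discontinuities mentioned above.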

Support Vector Machines (SVM)

  • Used mostly for classification (they can also be modified for regression and even for unsupervised learning applications).
  • Achieve accuracy comparable to Neural Networks

To understand SVMs, we first need to define the VC dimension.

VC Dimension

The VC Dimension measures how powerful the learning algorithm is (see Chap. 7.4 of Mitchell).

  • A machine learning algorithm produces one of the available hypotheses h ∈ H.
  • h = one hypothesis
  • H = hypothesis space (all possible hypotheses)
  • The VC dimension measures the representational power of a given machine learning algorithm by measuring the expressive power of its hypothesis space.
  • Smaller VC dimension = less power
  • Larger VC dimension = more power

Definition. A set of examples S is shattered by H if and only if for every dichotomy of S there exists some hypothesis h ∈ H consistent with it.

Dichotomy. A partition of a dataset S into a positive and a negative subset. Example: if N = 4, there are 2^4 = 16 dichotomies.

Definition. VC(H) is the largest size N of a dataset S that can be shattered by H.

[Figure: a linear classifier in 2-D space — three points (in general position) can be shattered, but four points cannot, so its VC dimension is 3.]

If H can shatter at least one set of N points, then VC(H) ≥ N.

Example. Assume H is the set of threshold classifiers on a 1-D line (h(x) = sign(x − t), with either orientation allowed): any two distinct points can be shattered, but no three points can (the labeling +, −, + is not realizable), so VC(H) = 2. This is verified by the sketch below.
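A brute-force sketch (my own, for illustration) that tests these definitions on the 1-D threshold class: it enumerates every dichotomy of a point set and checks whether some threshold (in either orientation) realizes it, confirming that two points can be shattered but three cannot.

    # Sketch: brute-force shattering test for 1-D threshold classifiers
    # h(x) = +1 if x > t else -1, with both orientations allowed.
    from itertools import product

    def realizable(points, labels):
        """Can some threshold classifier produce exactly these labels?"""
        pts = sorted(points)
        # Thresholds to try: below all points, between consecutive points, above all points.
        candidates = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
        for t in candidates:
            for sign in (+1, -1):
                if all((sign if x > t else -sign) == y for x, y in zip(points, labels)):
                    return True
        return False

    def shattered(points):
        """True if every dichotomy of the points is realizable by some hypothesis."""
        return all(realizable(points, labels)
                   for labels in product([-1, +1], repeat=len(points)))

    print(shattered([0.0, 1.0]))        # True:  any labeling of two points is realizable
    print(shattered([0.0, 1.0, 2.0]))   # False: e.g. the labeling (+1, -1, +1) is not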

Example. A fully grown decision tree has an effectively unlimited VC dimension, since it can represent any Boolean function and therefore shatter the entire instance space.

Back to Support Vector Machines…

Assume training data (x_i, y_i), i = 1, …, N, with y_i ∈ {−1, +1}, that is separable by a hyperplane.

Question: What is the best linear classifier of the type f(x) = w^T x + b?

While there can be an infinite number of hyperplanes that achieve 100% accuracy on the training data, the question is: which hyperplane is optimal with respect to accuracy on test data?

Common-sense solution: we want to make the gap (margin) between the positive and negative examples as large as possible.
The best linear classifier is then the hyperplane in the middle of this gap.

Given f(x), the classification is obtained as y = sign(f(x)) = sign(w^T x + b).

Note: Different w and b can result in identical classifications. For example, we can multiply both w and b by any scalar λ > 0, since sign(λ·(w^T x + b)) = sign(w^T x + b).

Therefore there are many equivalent solutions.
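A quick numeric check of this scaling equivalence (the numbers are toy values of my own):

    # Sketch: sign(w.x + b) is unchanged when (w, b) are scaled by any lambda > 0.
    import numpy as np

    w, b, lam = np.array([2.0, -1.0]), 0.5, 3.7              # toy classifier and scale factor
    X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])     # toy points

    print(np.sign(X @ w + b))                  # original classification
    print(np.sign(X @ (lam * w) + lam * b))    # scaled classifier: same signs

The margin constraints below remove this redundancy by fixing the scale so that the points closest to the hyperplane satisfy y_i (w^T x_i + b) = 1.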

Definitions of SVM and Margin

Find w and b with maximal margin, such that for the points closest to the separating hyperplane,

y_i (w^T x_i + b) = 1

(these closest points are also called the support vectors)

and for all other points,

y_i (w^T x_i + b) > 1.

Illustration: [figure of the separating hyperplane, the margin, and the support vectors lying on the planes w^T x + b = ±1]

Question: How can we calculate the length of the margin as a function of w? The following diagram shows a point x and its projection x_p onto the separating hyperplane; r is defined as the distance between the data point x and the hyperplane.

Note that w is a vector perpendicular to the hyperplane, so we can write x = x_p + r·(w/||w||). Then

f(x) = w^T x + b = (w^T x_p + b) + r·(w^T w)/||w|| = r·||w|| (since w^T x_p + b = 0).

Therefore: r = f(x)/||w||.

Now, solve for the margin length ρ: the closest points on the two sides satisfy f(x) = +1 and f(x) = −1, so each lies at distance 1/||w|| from the hyperplane, and the margin is the sum of these two distances.

Conclusion: the margin is ρ = 2/||w||, so maximizing the margin is equivalent to minimizing ||w|| (since we can ignore the constant 2 above).
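A quick numeric check of ρ = 2/||w|| on a toy hyperplane (the numbers are my own): the distance between the planes w^T x + b = +1 and w^T x + b = −1 is exactly 2/||w||.

    # Sketch: the gap between the planes w.x + b = +1 and w.x + b = -1 equals 2/||w||.
    import numpy as np

    w, b = np.array([3.0, 4.0]), -2.0         # toy hyperplane, ||w|| = 5
    x_plus  = (1 - b) * w / (w @ w)           # a point on w.x + b = +1
    x_minus = (-1 - b) * w / (w @ w)          # a point on w.x + b = -1

    print(np.linalg.norm(x_plus - x_minus))   # 0.4
    print(2 / np.linalg.norm(w))              # 0.4  (= rho = 2/||w||)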

Some Theory

Expected Loss: L(f) = E[ l(y, f(x)) ], the average loss over the true (unknown) data distribution.

Empirical Loss: L_emp(f) = (1/N) Σ_{i=1..N} l(y_i, f(x_i)), the average loss over the N training examples.

NOTE: In support vector machines we assume the 0-1 loss, i.e., classification accuracy, as the measure.

We would like to know the relationship between L(f) and L_emp(f).

NOTE: Generally, it is not a good idea to predict accuracy based on accuracy on the training data. However, this relationship will turn out to be very useful in justifying support vector machines.

It can be shown that, with probability at least 1 − δ,

L(f) ≤ L_emp(f) + sqrt( [ h·(ln(2N/h) + 1) − ln(δ/4) ] / N ),

where N = size of the training data,

h = VC dimension of the functional class F (i.e., of our ML algorithm), and

δ = some small number (a constant).

Let's look at L(f) as a function of the complexity h (the VC dimension).

If the problem is linearly separable, the empirical loss of a linear predictor can be L_emp(f) = 0. Therefore, to minimize the bound on the true error, we should minimize h!

The following inequality can be derived:

h ≤ R²/ρ² + 1,

where R is the radius of the smallest sphere containing the data (some constant for a given dataset) and ρ is the margin.

Conclusion: In order to minimize h (and thus the bound on L(f)) we need to maximize the margin ρ.

Support Vector Machines: Learning Problem

Assuming a linearly separable dataset, the task of learning the coefficients w and b of a support vector machine reduces to solving the following constrained optimization problem:

find w and b that minimize (1/2)·||w||²

subject to y_i (w^T x_i + b) ≥ 1, for i = 1, …, N.

This optimization problem can be solved by using the Lagrangian function defined as:

L(w, b, α) = (1/2)·||w||² − Σ_{i=1..N} α_i [ y_i (w^T x_i + b) − 1 ], such that α_i ≥ 0,

where α_1, α_2, …, α_N are the Lagrange multipliers and α = [α_1, α_2, …, α_N]^T.

The solution of the original constrained optimization problem is determined by the saddle point of L(w, b, α), which has to be minimized with respect to w and b and maximized with respect to α.

Comments about Lagrange multipliers:

  • If y_i (w^T x_i + b) − 1 > 0, the value of α_i that maximizes L(w, b, α) is α_i = 0.
  • If y_i (w^T x_i + b) − 1 < 0, the value of α_i that maximizes L(w, b, α) is α_i = +∞. However, since w and b are trying to minimize L(w, b, α), they will be changed in such a way as to make y_i (w^T x_i + b) at least equal to +1.
  • From this brief discussion, the so-called Karush-Kuhn-Tucker (KKT) conditions follow: α_i [ y_i (w^T x_i + b) − 1 ] = 0 for every i.

Notation:

Data points x_i with α_i > 0 are called the support vectors.

Optimality conditions:

The necessary conditions for the saddle point of L(w, b, α) are

∂L(w, b, α)/∂w = 0 and ∂L(w, b, α)/∂b = 0,

or, stated a different way, the gradient of L with respect to w and b must vanish at the solution.

Solving the necessary conditions results in

w = Σ_{i=1..N} α_i y_i x_i   and   Σ_{i=1..N} α_i y_i = 0.   (***)

By substituting w = Σ_i α_i y_i x_i back into the Lagrangian function and by using Σ_i α_i y_i = 0 as a new constraint, the dual optimization problem can be constructed as:

Find α that maximizes

W(α) = Σ_{i=1..N} α_i − (1/2) Σ_{i=1..N} Σ_{j=1..N} α_i α_j y_i y_j (x_i^T x_j)

subject to α_i ≥ 0 for i = 1, …, N, and Σ_{i=1..N} α_i y_i = 0.

This is a convex quadratic programming problem, so it has a single global optimum. There are a number of optimization routines capable of solving it. The optimization can be solved in O(N³) time (cubic in the size of the training data) and in time linear in the number of attributes. (Compare this to neural networks, which are trained in O(N) time.)
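A sketch of setting the dual up as a quadratic program on a tiny hand-made dataset. It assumes the cvxopt package is available (any QP solver would do); the small ridge added to P is only for numerical stability, and the expected alphas in the comment were worked out by hand for this toy set.

    # Sketch: linear SVM dual as a quadratic program, solved with cvxopt.
    #   maximize   sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
    #   subject to alpha_i >= 0  and  sum_i alpha_i y_i = 0
    import numpy as np
    from cvxopt import matrix, solvers

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0]])   # toy, linearly separable
    y = np.array([-1.0, -1.0, 1.0, 1.0])
    N = len(y)

    Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j (x_i . x_j)
    P = matrix(Q + 1e-8 * np.eye(N))                 # tiny ridge for numerical stability
    q = matrix(-np.ones(N))                          # minimize 1/2 a^T Q a - 1^T a
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # encodes -alpha_i <= 0
    A, b0 = matrix(y.reshape(1, -1)), matrix(0.0)    # encodes sum_i alpha_i y_i = 0

    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b0)['x'])
    print("alphas:", np.round(alpha, 4))             # roughly [0, 1.25, 1.0, 0.25]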

Support Vector Machine: Final Predictor

Given the values α_1, α_2, …, α_N obtained by solving the dual problem, the final SVM predictor can be expressed from (***) as

f(x) = Σ_{i ∈ I_support} α_i y_i (x_i^T x) + b,

where b is obtained from y_i (w^T x_i + b) = 1 for any support vector x_i, and I_support is the set of support vectors.

Important comments:

  • To obtain a prediction, the sum formally runs over all the training data points
  • Since α_i > 0 only for the support vectors, in practice only the support vectors contribute to the prediction (see the sketch below)
  • Note that the inner product x_i^T x is a scalar
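Continuing the QP sketch above (it reuses alpha, X and y from that snippet): only the support vectors, i.e., the points with α_i > 0, enter the final predictor.

    # Sketch (continues the QP example): build the final SVM predictor from the alphas.
    support = alpha > 1e-6                            # support vectors have alpha_i > 0
    w = (alpha * y)[support] @ X[support]             # w = sum_i alpha_i y_i x_i, from (***)
    b = float(np.mean(y[support] - X[support] @ w))   # from y_i (w.x_i + b) = 1 on the SVs

    def predict(x_new):
        # Only the support vectors contribute to the sum.
        return np.sign(sum(alpha[i] * y[i] * (X[i] @ x_new)
                           for i in np.where(support)[0]) + b)

    print("support vectors:\n", X[support])
    print(predict(np.array([2.5, 0.5])), predict(np.array([0.2, 0.8])))   # 1.0  -1.0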