Draft Document Please Report Errors

In this first section we review some facts about planes and hyperplanes. We are specifically interested in the slope of the (hyper)plane in its steepest direction.

A plane has the form Y = b + w1X1 + w2X2. Figure 1 shows an example, Y = -1 + 3X1 + 2X2. Note that the set of points on such a plane where Y = b is a horizontal line. It is the set of points such that w1X1 + w2X2 = 0, or X2 = -(w1/w2)X1, which, at Y = b, is seen to be the equation of a line in the horizontal cyan-colored plane at height b. A second, vertical plane erected above the line w2X2 = -w1X1 will intersect the original plane in the horizontal line (see red arrow) at Y = b.

The equation of the black slanted plane in Figure 1 is Y = -1 + 3X1 + 2X2. The vertical plane (red) is the set of all (X1,X2,Y) points with 3X1 + 2X2 = 0, that is, X2 = -1.5X1, and Y any real number. The place on this vertical plane where Y = -1, that is, the place where the original plane and the vertical plane cross, is the bottom red line. The post in the middle of the plot is at (X1,X2) = (0,0), so you see that (0,0) is one point on the bottom red line of intersection. A third (cyan) horizontal plane at Y = -1 sits on that post, as do the original plane and the horizontal red line. All three planes, vertical, horizontal, and slanted, share the point (Y,X1,X2) = (-1,0,0), and in fact they share the entire red horizontal line. We are interested in this lower red horizontal line because the maximum slope of the original black plane is the slope of a line in the plane's surface perpendicular to the red line of intersection of these two planes. That perpendicular line is also shown in the plot as a thin black line extending across the face of the original plane from near the bottom front corner of the plot to near the back upper corner, and it is marked by a black arrow.

Figure 1: Original (black) plane and vertical (red) plane intersect in a horizontal line through which a horizontal (cyan) plane has also been passed. Slope of black line is (maximal) slope of plane.

A “vector” in the horizontal plane can be thought of either as an arrow from the origin (0,0) to a point (X1,X2) in the plane or, equivalently, as just the point X = (X1,X2). Another point W = (w1,w2) is orthogonal to the point X = (X1,X2) if X1w1 + X2w2 = 0. Orthogonal means that the corresponding arrow vectors are at a 90 degree angle to each other when they meet at the origin. Because the bottom red line of Figure 1 is the set of points with X1w1 + X2w2 = 0, we see that the thin black line perpendicular to it must, in its (X1,X2) coordinates, pass through the origin (0,0) and the point (w1,w2): every point on the red horizontal line is orthogonal to (w1,w2). Here is a graph of the points (2,3) and (-3,2), each connected to the origin (0,0). Since (2)(-3) + (3)(2) = 0, the two vectors are orthogonal, and the graph illustrates the right angle they form at the origin.

Figure 2: Orthogonal vectors (2,3) and (-3,2) meet in a right angle.

In Figure 1, the lower red line is horizontal and consists of all points with X1w1 + X2w2 = 0. To summarize, the slope of the plane in the steepest direction, which we will simply call the slope of the plane, is the slope of the thin black line perpendicular to that red horizontal line. Points (0,0) and (w1,w2) are on that perpendicular line. As we move from (0,0) to (w1,w2) we have moved a distance $\sqrt{w_1^2 + w_2^2}$ in the horizontal plane and, using the Y equation, Y has changed from Y = -1 + 0 + 0 to Y = -1 + w1(w1) + w2(w2), so Y has changed by $w_1^2 + w_2^2$. Dividing this change in Y by the change along the horizontal plane, we see that the plane's slope is $(w_1^2 + w_2^2)/\sqrt{w_1^2 + w_2^2} = \sqrt{w_1^2 + w_2^2}$.

**Main Result**: For a plane, when you form a point (w1,w2) from the slopes in the plane's equation, the distance of that point from the origin, $\sqrt{w_1^2 + w_2^2}$, is the slope of the plane.
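The main result can be checked numerically for the plane of Figure 1. The following is a minimal sketch, not part of the original derivation: it scans unit-length directions in the horizontal plane, records the rise in Y over one unit of horizontal travel, and compares the largest rise with the distance of (3,2) from the origin. The function names and the 0.1 degree scan grid are choices made here for illustration.

```python
import math

# Coefficients of the Figure 1 plane Y = -1 + 3*X1 + 2*X2.
B, W1, W2 = -1.0, 3.0, 2.0

def plane(x1, x2):
    """Height Y of the plane at the horizontal position (x1, x2)."""
    return B + W1 * x1 + W2 * x2

def slope_in_direction(angle):
    """Rise in Y after moving one unit from the origin at the given angle."""
    dx1, dx2 = math.cos(angle), math.sin(angle)
    return plane(dx1, dx2) - plane(0.0, 0.0)

# Scan directions in steps of 0.1 degree and keep the steepest one.
angles = [2 * math.pi * i / 3600 for i in range(3600)]
steepest_slope = max(slope_in_direction(a) for a in angles)

# Distance of the point (w1, w2) from the origin.
norm_w = math.hypot(W1, W2)

# Moving along the red line 3*X1 + 2*X2 = 0, for example toward (-2, 3),
# should not change Y at all.
level_angle = math.atan2(3.0, -2.0)
level_slope = slope_in_direction(level_angle)

print(steepest_slope, norm_w, level_slope)
```

Up to the resolution of the scan, the steepest slope agrees with the square root of 13, and the slope along the red line is zero.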

When the equation has more than 2 X inputs, that is, Y = b + w1X1 + w2X2 + … + wkXk with k > 2, the resulting surface is called a hyperplane. If we fix (plug in numbers for) all but one of these X values we get a line in the remaining X variable, such as Y = __ + w3X3, and if we fix all but 2 we get a plane, for example Y = __ + w3X3 + w5X5, so mathematically a hyperplane behaves like a plane. By the same arguments as above, there is a line within that hyperplane whose slope is the slope of the hyperplane. That line connects the origin to the point W = (w1,w2,…,wk) and its slope is $\|W\| = \sqrt{w_1^2 + w_2^2 + \dots + w_k^2}$, where the symbol || || is called the “norm” of the vector W and geometrically is the length of the vector (distance from the origin), as we have seen.
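The rise-over-run argument can be written out for general k as follows. This is only a restatement of the reasoning above; the unit vector u is notation introduced here, and the final inequality is the Cauchy-Schwarz inequality, which is what guarantees that no other direction is steeper.

```latex
\[
\Delta Y = \bigl(b + W'W\bigr) - b = \sum_{j=1}^{k} w_j^2 = \|W\|^2,
\qquad
\text{horizontal distance} = \|W\|,
\qquad
\text{slope} = \frac{\|W\|^2}{\|W\|} = \|W\| .
\]
\[
\text{For any direction } u \text{ with } \|u\| = 1:\quad
\bigl(b + W'u\bigr) - b = W'u \le \|W\|\,\|u\| = \|W\|,
\quad \text{with equality when } u = W/\|W\| .
\]
```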

In k-dimensional space with separable red and blue points, our goal is again to associate Z = 1 with one color and Z = -1 with the other, then find the hyperplane Y = b + W’X with minimum slope subject to ZY >= 1 for all points. Here X is the vector of “features” (inputs) and W is the coefficient vector. We want to minimize ||W|| subject to these restrictions.
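Written as a formal optimization problem, the goal just described might look like the following. The observation index i and the number of observations n are notation introduced here, not taken from the text.

```latex
\[
\min_{b,\,W} \ \|W\|
\quad \text{subject to} \quad
Z_i \bigl(b + W'X_i\bigr) \ge 1, \qquad i = 1, \dots, n .
\]
```

In practice the objective is often written as $\tfrac{1}{2}\|W\|^2$ instead. Squaring does not change which W is the minimizer, and it makes the objective a smooth quadratic function.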

Higher dimensional space:

Consider the points below, where the horizontal axis is the value of X and the color represents an event (red) or nonevent (blue). The data are made up, but suppose X is debt to income ratio and Z is whether the person pays only the interest on the credit card. There is no single number that separates the reds from the blues, but there are clearly groups.

Figure 3. Not separable in 1 dimension.

Suppose we go to a higher dimension by creating a variable X2 = X^2 that is the square of X. Plotting the red points at Z = 1 and the blue at Z = -1 we get a 3D plot, and we should then be able to insert a plane Y = b + w1X1 + w2X2 of minimum slope such that YZ >= 1 for all points, and thus decide red for any future point with b + w1X1 + w2X2 > 0 and blue otherwise. To see a little more rigorously how that might work, take any quadratic equation Y = c(X-r1)(X-r2) where the roots satisfy 4 < r1 < 5 and 7 < r2 < 8. By picking c positive we can assure that Y > 0 for X < r1 or X > r2 and Y <= 0 otherwise, for example; picking c negative reverses those signs. By picking the magnitude of c large enough, we can assure that ZY >= 1, where Z is 1 for the red and -1 for the blue points.

Now our quadratic Y = cX^2 - c(r1+r2)X + c r1 r2 that separates the reds from the blues can be embedded in a 3 dimensional (Z,X1,X2) space if we set X1 = X and X2 = X^2. The observed points are (Z,X1,X2) with Z = 1 or -1, and a few of these (the “support vectors”) are also on the separating plane. It is important to realize that the full space of (Y,X1,X2) points in which the plane will be fit contains many more points than those for which X2 is the square of X1 and Z is 1 or -1. The plane will contain the support vectors (points (Z,X1,X2)) as well as many other points where X2 is not necessarily the square of X1. The observed points are “embedded” in this higher dimensional space.
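The scaling argument for c can be sketched in a few lines. The data values below are hypothetical, since the values plotted in Figure 3 are not listed in the text. They are only assumed to have blues at or below 4 and at or above 8, and reds between 5 and 7. Because the reds in this assumed layout sit between the roots, c is taken negative here rather than positive.

```python
# Hypothetical 1-D data; the actual Figure 3 values are not given in the text.
blue_x = [2.0, 3.0, 4.0, 8.0, 9.0]   # nonevents, Z = -1
red_x = [5.0, 5.5, 6.5, 7.0]         # events, Z = +1

# Roots chosen inside the gaps: 4 < r1 < 5 and 7 < r2 < 8.
r1, r2 = 4.5, 7.5

def unscaled(x):
    """The quadratic (X - r1)(X - r2) before multiplying by c."""
    return (x - r1) * (x - r2)

points = [(x, 1) for x in red_x] + [(x, -1) for x in blue_x]

# The reds lie between the roots, where (X - r1)(X - r2) < 0, so c must be
# negative. Its magnitude is set so the closest point reaches |Y| = 1.
smallest = min(abs(unscaled(x)) for x, _ in points)
c = -1.0 / smallest

def quadratic(x):
    """Y = c (X - r1)(X - r2)."""
    return c * unscaled(x)

# Z*Y for every observed point; all of these should be at least 1.
margins = [z * quadratic(x) for x, z in points]

# The same quadratic viewed as a plane Y = b + w1*X1 + w2*X2
# with X1 = X and X2 = X^2.
b, w1, w2 = c * r1 * r2, -c * (r1 + r2), c

print(c, min(margins), (b, w1, w2))
```

This only shows that some separating plane exists. It does not find the plane of minimum slope, which requires the optimization discussed later.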

Now associating Z=1 with the red points and Z=-1 with the blue, we can plot in three dimensions and even figure out the equation of the separating plane. The only points that matter in determining that plane are those that have X in {4,5,7,8}, these being the support points of the plane. The green plane shown has equation Y =(-67+24X1-2X2)/3 and is easily evaluated at the 4 X values {4,5,7,8} getting Y in {-1,1,1,-1}.

Figure 4: Separable in higher dimensional space.

We decide Z=1 or -1 based on whether Y>0 or Y<=0. The division line as shown is X2=-33.5+12X1.
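The stated plane and decision line can be checked directly. The sketch below evaluates Y = (-67 + 24X1 - 2X2)/3 at the four support values with X1 = X and X2 = X^2, applies the decision rule, and finds where the division line X2 = -33.5 + 12X1 crosses the curve X2 = X1^2. The function names are illustrative only.

```python
import math

def plane_height(x1, x2):
    """Y on the green plane Y = (-67 + 24*X1 - 2*X2) / 3."""
    return (-67 + 24 * x1 - 2 * x2) / 3

def classify(x):
    """Decide Z = 1 if Y > 0 and Z = -1 otherwise, with X1 = x and X2 = x^2."""
    return 1 if plane_height(x, x * x) > 0 else -1

support_x = [4, 5, 7, 8]
heights = [plane_height(x, x * x) for x in support_x]
labels = [classify(x) for x in support_x]

# The division line X2 = -33.5 + 12*X1 meets the curve X2 = X1^2 where
# X1^2 - 12*X1 + 33.5 = 0, that is, at X1 = 6 +/- sqrt(2.5).
lower_cut = 6 - math.sqrt(2.5)
upper_cut = 6 + math.sqrt(2.5)

print(heights, labels, lower_cut, upper_cut)
```

The two cut points fall inside the gaps between 4 and 5 and between 7 and 8, so in the original one-dimensional scale the rule classifies a point as red when X lies between them.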

Another way to think about this is to compute the quadratic equation that runs through the 4 support vectors (points), as in Figure 5. Recall that the red points are plotted at Z = 1 and the blue at Z = -1. We see that the observed points lie along the subspace of (X1,X2) for which X2 is the square of X1 but (of course) there are many more points in the whole space of (X1,X2). The 4 support vectors (points) determining the quadratic equation are joined with lines, not curves, in the next plot to illustrate that the decision line where Y = 0 is determined by where the plane hits 0, not by where the quadratic is 0.

Figure 5: Looking into the floor and rotating.

If we have 4 groups, 2 red and 2 blue, it will take a cubic to separate them, Y = _ + _X + _X^2 + _X^3, which lies in a subspace of (X1,X2,X3), as illustrated here:

Note that for any roots r1, r2, r3 in the gaps separating the groups, we can find a cubic Y=c(X-r1)(X-r2)(X-r3) that will allow us to separate the reds (Z=-1) from the blues (Z=1 here) by seeing if Y is positive or negative. In the cubic above, YZ>=1 for all observed points because we picked c such that either Z=1 and Y>=1 or Z=-1 and Y<=-1. This is just a simple example starting in 1 dimension to show that separation is possible in higher dimensions. The idea would be to add a Z dimension and do a support vector machine in the space (Z,X1,X2,X3) where the observed points lie in a subspace for which the coordinates are powers of X. Admittedly, if we start in two dimensional space or higher, only a few examples are so easy to explain intuitively. One that is commonly shown is two concentric circular clusters around (0,0).

The upper left plot starts in 2 dimensions. Adding a third coordinate creates a parabolic cone (upper right or lower left panel) such that a plane can slice through it and separate the two cases in (X1,X2,X3) space. The plane slices through the conic subspace in a circle which, projected back into the original space, provides a separating circle.
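A minimal sketch of the concentric-circle case follows. The text does not say which third coordinate was used for the plots, so this sketch assumes X3 = X1^2 + X2^2. The radii 1 and 3 and the cutting height 4 are also hypothetical values chosen for illustration.

```python
import math

def ring(radius, n=12):
    """n evenly spaced points on a circle of the given radius around (0, 0)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

inner = ring(1.0)   # hypothetical inner cluster
outer = ring(3.0)   # hypothetical outer cluster

def lift(x1, x2):
    """Embed (X1, X2) in 3-D using the assumed coordinate X3 = X1^2 + X2^2."""
    return (x1, x2, x1 * x1 + x2 * x2)

# A horizontal plane X3 = 4 lies between the lifted clusters (X3 = 1 and X3 = 9).
threshold = 4.0

def side(point):
    """Return 1 below the cutting plane and -1 on or above it."""
    x3 = lift(*point)[2]
    return 1 if x3 < threshold else -1

inner_sides = [side(p) for p in inner]
outer_sides = [side(p) for p in outer]

# Projected back to (X1, X2), the plane X3 = 4 is the circle X1^2 + X2^2 = 4.
separating_radius = math.sqrt(threshold)

print(inner_sides, outer_sides, separating_radius)
```

In this sketch the cutting plane is horizontal because both clusters are centered at the origin. A tilted plane would project back to a circle that is not centered at the origin.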

Caveats and technical points not covered:

The minimization of the slope of the hyperplane Y in higher dimensions subject to ZY >= 1 for all points is a reasonably difficult quadratic programming problem that we have not addressed here.

The requirement that ZY >= 1 for all points is quite restrictive and requires complete separation of the points. The complete support vector machine algorithm relaxes that requirement so that a few red points can overlap into the blue area and vice versa. A penalty for such points controls the degree to which overlap is allowed.
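One common way to write the relaxed problem is shown below. The text does not give its own formulation, so this is an assumption: the slack variables $\xi_i$, the penalty constant C, and the index i are notation introduced here.

```latex
\[
\min_{b,\,W,\,\xi} \ \tfrac{1}{2}\|W\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
Z_i \bigl(b + W'X_i\bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\]
```

A point with $\xi_i = 0$ satisfies the original requirement ZY >= 1. A point with $\xi_i > 0$ is allowed to violate it, at a cost of $C\,\xi_i$ in the objective. A larger C permits less overlap.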

SVM computations: