Current Research on Improving Scalability for Kernel Based Methods

Alyssa Lees

New York University

For many applications, Support Vector Machines (and other kernel-based algorithms), alone or in combination with other methods, yield performance superior to other machine learning options. In general, SVMs work very well in practice, are modular, have a small number of tunable parameters (in comparison with neural networks), and tend toward global solutions. However, a significant disadvantage of kernel methods is the problem of scaling to a large number of data points. The problems associated with training SVMs on large data sets include a drastic increase in training time, increased memory requirements, and a prediction time that is proportional to the number of kernels (support vectors), which itself grows with the number of data points. The first part of the paper explores solutions aimed at improving generalization performance for larger problems: Burges and Schölkopf (1997) offer the virtual support vector method, which yields better accuracy for large problems, and Tresp compares his Bayesian Committee Machine (BCM) approximation with the Reduced Rank Approximation (RRA) and the Subset of Representers Method (SRM) as approximate methods for scaling to larger systems. The second part of the paper investigates past and current research on decreasing training time. The paper studies SMO, a specialized case of Osuna's decomposition, as well as current advances in optimizing decomposition.

In 1997, Burges and Schölkopf suggested exploiting knowledge of a problem's domain to achieve better accuracy for large-scale support vector machines. Their contributions, the "virtual support vector" method and the "reduced set" method, produced an SVM that was 22 times faster and yielded better generalization performance (on the 10,000-image NIST test set of digits, the error rate dropped from 1.4% to 1.0%). However, at the time of publication the authors readily acknowledged that boosted neural nets had achieved better error rates (as low as 0.7%) at considerably faster speeds (350K multiply-adds versus 650K for the modified SVM).

Despite the modest results, the methods are based on important principles: improve accuracy by incorporating knowledge about the invariances of the problem, and increase classification speed by reducing the complexity of the representation of the decision function. Problem domain knowledge can be exploited either directly in the design of an algorithm or by generating artificial ("virtual") training examples. Because the artificial data are highly correlated with the original data and the training set becomes larger, generating virtual examples can significantly increase training time, but the approach has the advantage of being usable with any learning machine. Support Vector Machines can combine both approaches. SVMs have the following characteristics: if all non-support-vector training data were removed and the system retrained, the solution would be unchanged; the support vectors lie close to the decision boundary; and different SVMs tend to produce the same set of support vectors. Burges and Schölkopf therefore propose training an SVM to obtain a set of support vectors, generating artificial examples by applying the desired invariance transformations to those support vectors, and then training another SVM on the new set, as sketched below.
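
The following Python code is a minimal sketch of the virtual support vector procedure, assuming scikit-learn's SVC and flattened 28x28 digit images; the pixel-shift transform and all parameter values are illustrative choices, not those of Burges and Schölkopf.

```python
# Minimal sketch of the virtual support vector idea.
import numpy as np
from sklearn.svm import SVC

def shift_image(x, dx, dy, shape=(28, 28)):
    """Translate a flattened image by (dx, dy) pixels -- one example invariance."""
    img = x.reshape(shape)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1).ravel()

def train_with_virtual_svs(X, y, kernel="rbf", C=10.0):
    # Step 1: train an initial SVM and keep only its support vectors.
    svm1 = SVC(kernel=kernel, C=C).fit(X, y)
    sv_X, sv_y = X[svm1.support_], y[svm1.support_]

    # Step 2: generate "virtual" examples by applying invariance transforms
    # to the support vectors only (not to the full training set).
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    virtual_X = np.vstack([np.apply_along_axis(shift_image, 1, sv_X, dx, dy)
                           for dx, dy in shifts])
    virtual_y = np.tile(sv_y, len(shifts))

    # Step 3: retrain a second SVM on the support vectors plus virtual examples.
    X2 = np.vstack([sv_X, virtual_X])
    y2 = np.concatenate([sv_y, virtual_y])
    return SVC(kernel=kernel, C=C).fit(X2, y2)
```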

The virtual support vector method, as described above, nominally improves accuracy, but it doubles the training time and decreases classification speed (because there are more support vectors). The reduced set method chooses the smallest set of examples such that the resulting loss in generalization performance remains acceptable. The method has a fundamental drawback: only in certain instances (e.g., quadratic kernels) can the reduced set be computed exactly and efficiently; in the general case it must be found by conjugate gradient optimization, which is computationally very expensive (a rough sketch of this general case is given below). Note that methods for improving training time that go beyond the reduced set method are discussed later in the paper.
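
As a rough illustration of the general (non-quadratic) case, the sketch below approximates an RBF support vector expansion by a smaller one, minimizing the feature-space distance with a general-purpose conjugate gradient routine from SciPy; the kernel, initialization, and optimizer settings are assumptions, not the authors' exact procedure.

```python
# Sketch of the reduced set idea: approximate sum_i a_i k(x_i, .) by a shorter
# expansion sum_j b_j k(z_j, .) with fewer centers z_j.
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=0.05):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def reduced_set(sv_X, sv_coef, n_reduced, gamma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    d = sv_X.shape[1]
    # Initialize the reduced vectors z and weights b from random support vectors.
    idx = rng.choice(len(sv_X), n_reduced, replace=False)
    x0 = np.concatenate([sv_X[idx].ravel(), sv_coef[idx]])

    K_ss = rbf(sv_X, sv_X, gamma)
    const = sv_coef @ K_ss @ sv_coef            # ||Psi||^2, a fixed term

    def objective(theta):
        z = theta[:n_reduced * d].reshape(n_reduced, d)
        b = theta[n_reduced * d:]
        # ||Psi - Psi'||^2 in feature space, written via kernel evaluations.
        return const - 2 * sv_coef @ rbf(sv_X, z, gamma) @ b + b @ rbf(z, z, gamma) @ b

    res = minimize(objective, x0, method="CG")  # conjugate gradient optimization
    z = res.x[:n_reduced * d].reshape(n_reduced, d)
    b = res.x[n_reduced * d:]
    return z, b
```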

In 2001, Tresp and Schwaighofer compared three methods designed to improve the scalability of kernel-based systems: the Subset of Representers Method (SRM), the Reduced Rank Approximation (RRA), and an improved BCM approximation.

SRM is based on a factorization of the kernel function. A set of base kernel points is selected (either a subset of the training data or of the test data), and the covariance between two points is approximated through the covariance matrix of the base kernel points. The Gram matrix is then approximated using these covariances. The key to this method is the reduction in rank of the Gram matrix, which will be of rank equal to or less than the number of base kernels (a low-rank sketch of this factorization is given below). The Reduced Rank Approximation (RRA) uses the SRM decomposition of the Gram matrix to calculate the kernel weights. The difference is that the SRM decomposition changes the covariance structure of the kernels, while RRA leaves it unchanged. In addition, with RRA the number of kernels with nonzero weights is identical to the number of training points, while in SRM it is identical to the number of base points.

Finally, Tresp offers his own method, the Bayesian Committee Machine (BCM) approximation, which uses an optimal projection of the data onto a set of base kernels. Using assumptions about conditional independencies, Tresp calculates the optimal prediction at the base kernel points. However, this calculation requires inverting the covariance of the labels y given the function f at the base kernels. To avoid inverting an NxN matrix, the BCM approximation uses a block-diagonal approximation of this covariance matrix, so that computing the associated weight vector requires only the inversion of a block of size B. The approximation improves as fewer (and therefore larger) blocks are used. In fact, the BCM approximation reduces to SRM when the covariance of y given f is set to alpha squared times the identity.
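
Returning to the SRM factorization, it can be sketched in a few lines of NumPy: the full Gram matrix is approximated from the covariances between all points and a set of B base kernel points, so its rank is at most B. The RBF kernel and the small ridge term below are illustrative choices rather than Tresp and Schwaighofer's exact setup.

```python
# Reduced-rank approximation of the Gram matrix from a set of base kernel points.
import numpy as np

def rbf(A, B, gamma=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def srm_gram(X, base_X, gamma=0.1, jitter=1e-8):
    K_nb = rbf(X, base_X, gamma)                # covariances to the base points
    K_bb = rbf(base_X, base_X, gamma)           # base-point Gram matrix
    # K approx = K_nb K_bb^{-1} K_bn, with rank at most the number of base points.
    return K_nb @ np.linalg.solve(K_bb + jitter * np.eye(len(base_X)), K_nb.T)

# Example: with 300 training points and 50 base points, the approximate
# 300 x 300 Gram matrix has rank at most 50.
X = np.random.default_rng(0).normal(size=(300, 5))
K_approx = srm_gram(X, X[:50])
print(np.linalg.matrix_rank(K_approx))         # <= 50
```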

According to the tests conducted by Tresp and Schwaighofer, the RRA method performed considerably worse than SRM and BCM. In many cases BCM and SRM were comparable, but when training was performed after the test inputs were known, the BCM yielded much better generalization performance.

Various methods have been suggested for improving training time on large datasets. The chunking algorithm is based on the observation that the value of the quadratic form is unchanged if the rows and columns of the matrix that correspond to zero Lagrange multipliers are removed (a small numeric check of this observation is given below). At each iteration, chunking solves a smaller QP problem that consists of every nonzero Lagrange multiplier from the previous step plus the M worst examples that violate the KKT conditions. At the final iteration the entire set of nonzero Lagrange multipliers has been identified, and hence the algorithm solves the full QP problem. While chunking reduces the size of the matrix from N^2 (where N is the number of training samples) to roughly the square of the number of nonzero Lagrange multipliers, it still cannot handle truly large-scale training problems, since even the reduced matrix may not fit in memory.
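
The observation underlying chunking is easy to verify numerically. In the toy check below, Q and the multiplier vector are arbitrary illustrative values rather than the output of an actual SVM solver.

```python
# Removing rows/columns of Q for zero Lagrange multipliers leaves the
# value of the dual quadratic form unchanged.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
Q = A @ A.T                                   # symmetric PSD "kernel" matrix
alpha = np.array([0.7, 0.0, 1.2, 0.0, 0.0, 0.3, 0.0, 0.9])

def dual_quadratic_form(Q, alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

nz = alpha > 0                                # keep only nonzero multipliers
full = dual_quadratic_form(Q, alpha)
reduced = dual_quadratic_form(Q[np.ix_(nz, nz)], alpha[nz])
print(np.isclose(full, reduced))              # True
```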

Osuna suggested a new method for solving the SVM QP problem by showing that a large QP problem can be broken into a series of smaller QP problems, provided that at each iteration one or more examples that violate the KKT conditions are added to the smaller QP problem. Osuna's decomposition algorithm uses a constant-size matrix for every sub-problem and adds/removes one example at each step. However, a numerical QP solver is still required, which raises numerical precision issues.

Sequential Minimal Optimization (SMO) is now a standard method for quickly training support vector machines. Ordinarily, training an SVM requires the solution of an extremely large quadratic programming problem. The idea behind SMO is to break this problem into a series of the smallest possible QP problems, each of which can be solved analytically. By avoiding large matrix computations, SMO can handle very large training sets in between linear and quadratic time in the training set size, using a linear amount of memory. The standard approaches, such as chunking and Osuna's algorithm, can be on the order of cubic time. The SMO algorithm performs especially well on sparse data sets (Platt, 1998).

Sequential Minimal Optimization solves the smallest possible optimization problem at every step, which involves exactly two Lagrange multipliers. At each step the two Lagrange multipliers are jointly optimized. The main advantage is that the solution for the pair at each step is analytic (i.e., no QP solver is used), as sketched below. SMO is essentially a special case of Osuna's decomposition algorithm.
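
The analytic pair update can be written compactly, following the update equations in Platt (1998); the clipping bounds and step are the standard ones, while error handling (a non-positive second derivative, threshold updates, and the multiplier-selection heuristics) is omitted for brevity.

```python
# Analytic SMO step for one pair of Lagrange multipliers (labels y1, y2 in {-1, +1}).
import numpy as np

def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K12, K22, C):
    # Box constraints for the new alpha2 depend on whether the labels agree.
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)

    eta = K11 + K22 - 2.0 * K12         # second derivative along the constraint line
    a2_new = a2 + y2 * (E1 - E2) / eta  # unconstrained optimum, solved analytically
    a2_new = np.clip(a2_new, L, H)      # clip back into the feasible box

    # alpha1 moves to keep the equality constraint sum_i alpha_i y_i unchanged.
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new
```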

In a recent paper (2002), Flake and Lawrence applied the SMO algorithm to regression training. The principal disadvantage of the SMO algorithm is that its rate of convergence slows considerably when the data are not sparse and when many support vectors appear in the solution. Flake and Lawrence introduce several modifications to improve performance on nonsparse datasets; the improvement was shown to be an order of magnitude when applied to regression problems.

Flake and Lawrence’s fundamental contribution is to cache the kernel output functions in an intelligent manner. Their cache data structure contains an inverse index and a 2-D array that store the cached values. The authors eliminate thrashing by using a hierarchy of selection methods to find the second optimal Lagrange multiplier using a heuristic that only searches the entire data set if the working set includes the entire data set. Finally they take optimal steps to exploit the cached output. The modified SMO was tested on dense regression datasets and yielded dramatic runtime improvements. In addition the authors claim their implementation of SMO can outperform the current SVM optimization packages that use a QP solver for decomposition.

Another recent improvement to the SMO algorithm was presented for a class of SVM algorithms by Keerthi and Gilbert. In their 2002 paper, "Convergence of a Generalized SMO Algorithm for SVM Classifier Design", the authors describe a generalization of the SMO algorithm; along with a proof of convergence, they report an algorithm significantly faster than the standard SMO.

Also in 2002, Hsu and Lin investigated ways of improving Osuna's decomposition method. They propose a simple solution to the problem of selecting a working set that leads to faster convergence in difficult cases. Their implementation, named BSVM, is based on two observations: the number of free variables should be kept small, and successive iterations tend to produce very similar working sets. The first observation is addressed by adding some free variables from the previous iteration to the current working set. In experiments, BSVM outperforms SVMlight. The advantages of the approach are that keeping the number of free variables small leads to faster convergence on difficult problems, and that free variables from the current iteration are considered again in future iterations, so the corresponding kernel columns are naturally cached. A simplified sketch of this working-set selection idea follows.
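
The sketch reserves part of the working set for free variables carried over from the previous iteration and fills the remaining slots with the indices that most strongly violate optimality, ranked here simply by the signed dual gradient. This is a simplified illustration of the idea, not Hsu and Lin's exact BSVM selection rule.

```python
# Simplified working-set selection: keep some free variables, then add violators.
import numpy as np

def select_working_set(grad, y, alpha, C, prev_ws, q=10, n_keep=4):
    violation = -y * grad                       # crude score: larger = worse violation
    # Carry over up to n_keep free variables (0 < alpha < C) from the last iteration.
    free_prev = [i for i in prev_ws if 0 < alpha[i] < C][:n_keep]

    # Fill the remaining slots with the highest-scoring indices not already kept.
    order = np.argsort(-violation)
    rest = [int(i) for i in order if i not in free_prev][: q - len(free_prev)]
    return free_prev + rest
```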

The scalability of kernel-based methods remains a serious problem if SVMs are to be used for large-scale problems. Boosted neural networks can in some cases achieve better accuracy at faster speeds. However, the advances described in this paper offer significant contributions for different classes of problems. SMO is an ideal solution for training on large sparse data sets, and current research on specialized cases such as regression problems yields results superior to other machine learning methods.

------

*C.J.C. Burges and B. Schölkopf. 'Improving the Accuracy and Speed of Support Vector Learning Machines'. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.

*Flake, Gary Wallace and Lawrence, Steve. ‘Efficient SVM Regression Training with SMO’. Machine Learning, 46, 271-290, 2002.

*Guyon, I. and Boser, B. and Vapnik, V. 'Automatic Capacity Tuning of Very Large VC-dimension Classifiers'. Advances in Neural Information Processing Systems, volume 5, pages 147-155. Morgan Kaufmann, San Mateo, CA, 1993.

*Hsu, Chih-Wei and Lin, Chih-Jen. 'A Simple Decomposition Method for Support Vector Machines'. Machine Learning, 46, 291-314, 2002.
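
*Keerthi, S. S. and Gilbert, E. G. 'Convergence of a Generalized SMO Algorithm for SVM Classifier Design'. Machine Learning, 46, 2002.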

*Li, Yi and Long, Philip M. 'The Relaxed Online Maximum Margin Algorithm'. Machine Learning, 46, 361-387, 2002.

*Osuna, E. and Girosi, F. 'Reducing Run-time Complexity in SVMs'. In Proceedings of the 14th International Conference on Pattern Recognition, Brisbane, Australia, 1998.

*Platt, J. 'Fast Training of Support Vector Machines using Sequential Minimal Optimization'. 1998.

*Tresp, Volker and Schwaighofer, Anton. 'Scalable Kernel Systems'. Proceedings of ICANN 2001.