Optimized Recommender Systems using Squared Error Gradient Descent and Vertically Structured Data
Dr. William Perrizo and Dr. Vasant A. Ubhaya
North Dakota State University
ABSTRACT:
This paper describes an approach for recommender systems using vertically structured data and squared error gradient descent optimization. Recommender systems are systems that recommend products or items to people based on a training dataset of preferences previously expressed by those people. Recommenders are very important tools for retail businesses today. They are especially popular in online sales businesses. Typically the training table is a matrix of ratings by customers of products or items. For example, Netflix Corporation uses the Cinematch recommender system to try to predict which movies to recommend to users for their next rental. Amazon uses a recommender extensively to try to predict which products buyers might be attracted to buy, based on their previous purchases and expressed preferences.
The methodology used in this paper is based on what has been called the singular value decomposition (SVD) approach to recommending products. In short, this approach (though it resembles classical SVD but little) replaces the usually massive ratings training matrix (e.g., millions of customers by millions of products, so trillions of ratings) with two much smaller matrixes, a customer-feature matrix, CF, and a feature-product matrix, FP. Typically, the number of features is taken to be 40 to 100. Thus, the number of cells is reduced from trillions to millions. A given customer-product rating is then estimated as the matrix product of that customer row of CF times that product column of FP. The method is only useful if that estimate is close to the actual rating that the customer would give to that product. Thus, a training step is required to train CF and FP so as to minimize the sum of the squared errors, sse. This training is done by using gradient descent of sse and line search. In general, the line search step is done best if there is a closed-form formula for the point on the gradient line which minimizes sse. However, in this case that formula involves a 5th-degree polynomial, and it is known that there is no closed-form formula for the roots of such equations. Thus, we use a fixed-increment line search instead.
For so-called “big data”, the speed at which a recommender model can be trained is a critical issue. Many very good recommenders are unusable in the big data environment because the training step takes an unacceptable amount of time. The more features used, the better the recommender (lower sse). However, 40 to 100 features are typically chosen, not because that number is optimal but because the training can be done in reasonable time with that number of features. By training with vertically structured data, we can choose the number of features to be much higher and therefore improve the accuracy markedly without requiring unacceptable training periods. To address the speed issue, in this paper we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data. We use pTree, bit-level, vertical data structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data technologies. In pTree technology, the data is structured column-wise (into bit slices) and the columns are processed horizontally (typically across hundreds to thousands of bit-level columns), while in horizontal technologies, data are processed cell by cell (often involving billions, even trillions, of matrix cells).
P-trees are lossless, compressed and data-mining ready data structures [9][10]. pTrees are lossless because the vertical bit-wise partitioning used in pTree technology guarantees that all information is retained completely; there is no loss of information in converting horizontal data to this vertical format. pTrees are compressed because, in this technology, segments of bit sequences which are either purely 1-bits or purely 0-bits are represented by a single bit. This compression saves a considerable amount of space but, more importantly, facilitates faster processing. pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without the need to decompress the structures first. pTree vertical data structures have been exploited in various domains and data mining algorithms, ranging from classification [1,2,3], clustering [4,7], and association rule mining [9] to other data mining algorithms. Speed improvements are very important in data mining because many quite accurate algorithms require an unacceptable amount of processing time to complete, even with today’s powerful computing systems and efficient software platforms. In this paper, we evaluate and compare the speed of various data mining algorithms when using pTree technology.
Introduction
Recommender systems fall in the area of supervised Machine Learning (Classification or Prediction). This is one of the important data mining technologies for mining information out of large data sets. The assumption is usually that there is a (usually very large) matrix of ratings data for customers and products available from which to train the smaller recommender matrixes. Therefore the recommender system problem is in the area of “supervised” machine learning (the training of the recommender is supervised by the matrix of ratings values provided).
Unsupervised Machine Learning, or Clustering, is also an important data mining technology for mining information out of new data sets. The assumption is usually that essentially nothing is yet known about the data set (therefore it is “unsupervised”). The goal is often to partition the data set into subsets of “similar” or “correlated” records [4, 7].
There may be various additional levels of supervision available in either classification or clustering and, of course, that additional information should be used to advantage during the classification or clustering process. That is to say, often the problem is neither purely supervised nor purely unsupervised. For instance, it may be known that there are exactly k similarity subsets, in which case a method such as k-means clustering may be productive. To mine an RGB image for, say, red cars, white cars, grass, pavement, bare ground, and other, k would be six. It would make sense to use that supervising knowledge by employing k-means clustering, starting with a mean set consisting of RGB vectors that approximate the clusters as closely as one can guess, e.g., red_car=(150,0,0), white_car=(85,85,85), grass=(0,150,0), etc. That is to say, we should view the level of supervision available to us as a continuum and not just the two extremes. The ultimate in supervising knowledge is a very large training set, which has enough class information in it to very accurately assign predicted classes to all test instances. We can think of a training set as a set of records that have been “classified” by an expert (human or machine) into similarity classes (and assigned a class or label).
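As an illustration of this kind of seeded clustering, the following is a minimal sketch (in Python) of k-means started from guessed means like those above; the tiny pixel array, the iteration count, and the function names are illustrative assumptions, not part of any experiment reported here.

import numpy as np

def kmeans(data, means, iterations=10):
    """Cluster `data` (N x n) around the supplied initial `means` (k x n)."""
    means = np.asarray(means, dtype=float)
    for _ in range(iterations):
        # Assign each row to the nearest mean (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean from its assigned rows (keep old mean if empty).
        for j in range(len(means)):
            members = data[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return labels, means

# Supervising knowledge: guessed RGB means for red car, white car, grass.
initial_means = [(150, 0, 0), (85, 85, 85), (0, 150, 0)]
pixels = np.array([[140, 10, 5], [90, 88, 80], [5, 160, 10], [145, 20, 15]])
labels, means = kmeans(pixels, initial_means)
print(labels)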
In this paper we assume the data set is a matrix of non-negative integers with n columns and N rows and that two rows are similar if they are close in the Euclidean sense. More general assumptions could be made (e.g., that there are categorical data columns as well, or that the similarity is based on L1 distance or some correlation-based similarity), but we feel that generalizing would only obscure the main points we want to make.
We structure the data vertically into columns of bits (possibly compressed into tree structures), called predicate Trees or pTrees. The simplest example of pTree structuring of a non-negative integer table is to slice it vertically into its bit-position slices. The main reason we do this is so that we can process across the (usually relatively few) vertical pTree structures rather than processing down the (usually very numerous) rows. Very often these days, data is called Big Data because there are many, many rows (billions or even trillions) while the number of columns, by comparison, is relatively small (tens, hundreds, or at most a few thousand). Therefore processing across (bit) columns rather than down the rows has a clear speed advantage, provided that the column processing can be done very efficiently. That is where the advantage of our approach lies: in devising very efficient (in terms of time taken) algorithms for horizontal processing of vertical (bit) structures. Our approach also benefits greatly from the fact that, in general, modern computing platforms can do logical processing of (even massive) bit arrays very quickly [9,10].
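To make the bit-position slicing concrete, the following minimal sketch (in Python) slices one non-negative integer column into its bit columns (uncompressed pTrees) and rebuilds the column to show that nothing is lost; the function name and the small sample column are illustrative, not the pTree API.

import numpy as np

def bit_slices(column, width):
    """Return `width` bit slices of a non-negative integer column.
    slices[i] holds bit i (i = 0 is the least significant) of every row."""
    column = np.asarray(column, dtype=np.uint64)
    return [((column >> i) & 1).astype(np.uint8) for i in range(width)]

ratings = np.array([5, 3, 0, 7, 2])          # one table column, values < 2**3
slices = bit_slices(ratings, width=3)
# Reconstruct the column from its slices to confirm the structuring is lossless.
rebuilt = sum(s.astype(np.uint64) << i for i, s in enumerate(slices))
print(np.array_equal(rebuilt, ratings))      # True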
Training the Customer-Feature and Feature-Product matrixes
The approach we use for recommender systems is to vertically structure the data and use squared error gradient descent to minimize the squared error of the recommendations made by multiplying the Customer-Feature (CF) and Feature-Product (FP) matrixes, as compared to the actual ratings given in the massive ratings matrix. Again, recommender systems are systems that recommend products or items to people based on a training dataset of preferences previously expressed by those people. Recommenders are very important tools for retail businesses today. They are especially popular in online sales businesses. Typically the training table is a matrix of ratings by customers of products or items. For example, Netflix Corporation uses the Cinematch recommender system to try to predict which movies to recommend to users for their next rental. Amazon uses a recommender extensively to try to predict which products buyers might be attracted to buy, based on their previous purchases and expressed preferences.
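As a concrete illustration of this multiplication, the following is a minimal sketch (in Python, not the system described in this paper) of the CF x FP rating estimate and of the sum of squared errors, sse, that training tries to minimize; the matrix sizes and random values are toy assumptions, and for simplicity every cell of the toy ratings matrix is treated as rated.

import numpy as np

n_customers, n_products, n_features = 4, 5, 2
rng = np.random.default_rng(0)
CF = rng.random((n_customers, n_features))                      # customer-feature matrix
FP = rng.random((n_features, n_products))                       # feature-product matrix
R = rng.integers(1, 6, size=(n_customers, n_products)).astype(float)  # toy ratings

def predict(c, p):
    """Estimated rating: customer row of CF times product column of FP."""
    return CF[c, :] @ FP[:, p]

def sse(CF, FP, R):
    """Sum of squared errors between estimated and actual ratings."""
    return float(((CF @ FP - R) ** 2).sum())

print(predict(0, 3), sse(CF, FP, R))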
The methodology used in this paper is based on what has been called the singular value decomposition (SVD) approach to recommending products. In short, this approach (though it resembles classical SVD but little) replaces the usually massive ratings training matrix (e.g., millions of customers by millions of products, so trillions of ratings) with two much smaller matrixes, a customer-feature matrix, CF, and a feature-product matrix, FP. Typically, the number of features is taken to be 40 to 100. Thus, the number of cells is reduced from trillions to millions. A given customer-product rating is then estimated as the matrix product of that customer row of CF times that product column of FP. The method is only useful if that estimate is close to the actual rating that the customer would give to that product. Thus, a training step is required to train CF and FP so as to minimize the sum of the squared errors, sse. This training is done by using gradient descent of sse and line search. In general, the line search step is done best if there is a closed-form formula for the point on the gradient line which minimizes sse. However, in this case that formula involves a 5th-degree polynomial, and it is known that there is no closed-form formula for the roots of such equations. Thus, we use a fixed-increment line search instead.
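The sketch below, under the same kind of toy assumptions as above, illustrates one reading of this training step: compute the gradient of sse with respect to CF and FP and move along the negative gradient in fixed increments, stopping as soon as sse stops decreasing. It is a plain dense-matrix illustration, not the authors' implementation (which works on vertically structured pTree data), and the increment and loop counts are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
CF = rng.random((4, 2))                               # toy customer-feature matrix
FP = rng.random((2, 5))                               # toy feature-product matrix
R = rng.integers(1, 6, size=(4, 5)).astype(float)     # toy ratings matrix

def sse(CF, FP, R):
    return float(((CF @ FP - R) ** 2).sum())

def gradients(CF, FP, R):
    E = CF @ FP - R                                   # error matrix
    return 2 * E @ FP.T, 2 * CF.T @ E                 # d(sse)/dCF, d(sse)/dFP

def descend_step(CF, FP, R, increment=0.001, max_steps=1000):
    """Move along the negative gradient in fixed increments while sse drops."""
    gCF, gFP = gradients(CF, FP, R)
    best, t = sse(CF, FP, R), 0.0
    for _ in range(max_steps):
        trial = sse(CF - (t + increment) * gCF, FP - (t + increment) * gFP, R)
        if trial >= best:                             # sse stopped decreasing
            break
        best, t = trial, t + increment
    return CF - t * gCF, FP - t * gFP

start = sse(CF, FP, R)
for _ in range(20):                                   # outer gradient descent loop
    CF, FP = descend_step(CF, FP, R)
print(start, sse(CF, FP, R))                          # sse after training is lower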
For so-called “big data”, the speed at which a recommender model can be trained is a critical issue. Many very good recommenders are unusable in the big data environment because the training step takes an unacceptable amount of time. The more features used, the better the recommender (lower sse). However, 40 to 100 features are typically chosen, not because that number is optimal but because the training can be done in reasonable time with that number of features. By training with vertically structured data, we can choose the number of features to be much higher and therefore improve the accuracy markedly without requiring unacceptable training periods. To address the speed issue, in this paper we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data. We use pTree, bit-level, vertical data structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data technologies. In pTree technology, the data is structured column-wise (into bit slices) and the columns are processed horizontally (typically across hundreds to thousands of bit-level columns), while in horizontal technologies, data are processed cell by cell (often involving billions, even trillions, of matrix cells).
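To give a flavor of horizontal processing of vertical structures, the sketch below computes a column total directly from 1-bit counts of the column's bit slices, so only a handful of bit columns are touched rather than every row being scanned; the function name is illustrative and the slices are left uncompressed.

import numpy as np

def bit_slices(column, width):
    column = np.asarray(column, dtype=np.uint64)
    return [((column >> i) & 1).astype(np.uint8) for i in range(width)]

values = np.array([5, 3, 0, 7, 2])
slices = bit_slices(values, width=3)

# Column sum = sum over slices of (count of 1-bits in slice i) * 2**i.
vertical_sum = sum(int(s.sum()) << i for i, s in enumerate(slices))
print(vertical_sum, int(values.sum()))        # both 17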
P-trees are lossless, compressed and data-mining ready data structures [9][10]. pTrees are lossless because the vertical bit-wise partitioning used in pTree technology guarantees that all information is retained completely; there is no loss of information in converting horizontal data to this vertical format. pTrees are compressed because, in this technology, segments of bit sequences which are either purely 1-bits or purely 0-bits are represented by a single bit. This compression saves a considerable amount of space but, more importantly, facilitates faster processing. pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without the need to decompress the structures first. pTree vertical data structures have been exploited in various domains and data mining algorithms, ranging from classification [1,2,3], clustering [4,7], and association rule mining [9] to other data mining algorithms. Speed improvements are very important in data mining because many quite accurate algorithms require an unacceptable amount of processing time to complete, even with today’s powerful computing systems and efficient software platforms. In this paper, we evaluate and compare the speed of various data mining algorithms when using pTree technology.
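The compression idea can be illustrated with a simple run-length form: maximal runs of pure 0-bits or pure 1-bits in a bit column are stored as (bit, length) pairs, and a 1-bit count can be read directly from the compressed form without decompressing. The actual pTree layout is a tree rather than a flat run list, so the sketch below only illustrates the principle; the function names are ours.

from itertools import groupby

def compress(bits):
    """Collapse a bit column into (bit value, run length) pairs."""
    return [(b, len(list(run))) for b, run in groupby(bits)]

def count_ones(compressed):
    """1-bit count computed directly on the compressed form."""
    return sum(length for bit, length in compressed if bit == 1)

column = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
packed = compress(column)
print(packed)               # [(1, 4), (0, 3), (1, 2), (0, 1)]
print(count_ones(packed))   # 6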