Movie Advisor
Tel Aviv University
Faculty of Engineering
M.Sc. Project
Doron Harlev
Supervisor: Dr. Dana Ron
October 4, 2018
Table of Contents
Introduction
Database Characteristics
Data Sets
Evaluation Criteria
Mean Average Error (MAE)
Coverage
Base Algorithms
All Average
Movie Average
User Average
User Movie Average
Summary
Prediction Methods
Pearson R Algorithm
Base Algorithm
Additions To The Base Pearson R Algorithm
Mean Square Difference (MSD) Algorithm
Genre Based Algorithms
Genre Statistics
Base Algorithm
Genre Algorithm
Hybrid Genre Algorithm
Algorithm Comparison
MAE Vs. Coverage For All Algorithms
Database Alteration
Conclusions
Reference
Appendix I: Implementation
Introduction
Matlab Code
General
Database Statistics
Base Algorithms
Pearson r Algorithm
MSD
Genre
Introduction
This paper will analyze existing and proposed algorithms designed to predict a user's rating of a particular movie. The predictions will rely on the user's own ratings of other movies as well as the scores provided by neighboring users. The ability to predict a user's rating may prove useful in many contexts, one of which is enhancing a user's experience in an online store by recommending some items while helping avoid others. The problem set described in this paper is studied in the field of Recommender Systems, or Collaborative Filtering.
This paper will start by examining the statistical properties of the chosen database. Once the base and test data sets have been characterized, evaluation criteria such as Mean Average Error (MAE) and coverage will be explained. These criteria will then be used to measure the performance of trivial prediction methods applied to the base and test data sets. The performance of trivial methods will provide a reference for more sophisticated algorithms.
The first of these algorithms makes use of the Pearson R coefficient. The Pearson R coefficient provides a correlation measure between two vectors and can be used as a distance metric between two users. Use of the Pearson R correlation coefficient is quite common in the field of collaborative filtering, and results obtained with this method will be used to gauge the performance of other algorithms. The Pearson R algorithm will be further enhanced in accordance with improvements proposed by other papers as well as by this one. Another baseline algorithm to be examined is the Mean Square Difference (MSD) algorithm. Although generally considered inferior to the Pearson R algorithm, the MSD algorithm contains elements that will prove valuable in the newly proposed algorithms.
The database used in this paper also provides genre information for all movies. The statistics of genre information in the database will be analyzed. Novel prediction algorithms relying on genre information will then be proposed. As a reference, a new trivial prediction algorithm will be presented and its performance compared to previous trivial methods. Several genre-based prediction algorithms will then be proposed and analyzed with respect to the test and base data sets.
Finally, all algorithms will be compared. This comparison will include a different instantiation of the initial data set, as well as altered versions of the entire database.
Database Characteristics
The database used for the project is the GroupLens database, available on the internet from GroupLens Research. The database contains 100,000 ratings on a scale of 1-5, made for 1682 movies by 943 users. If the database were viewed as a matrix with users designating rows and movies designating columns, the matrix would be extremely sparse, with values in merely 6% of its entries. Despite its sparseness, the database is sufficient for analyzing the prediction algorithms discussed in this paper.
The mean score over all entries in the database is 3.53, the median is 4, and the standard deviation is 1.13. Figure 1 depicts a histogram of all user ratings:
Figure 1: Histogram of all movie scores
From the histogram and the average score we learn that users tend to rate movies they liked more often than movies they disliked.
The average number of scores given to each movie is around 60 and the median number of scores is 57. Figure 2 depicts a histogram of the number of scores given to each movie:
Figure 2: Histogram of number of scores given to each movie
The histogram shows that while some movies have more than 500 ratings, many others have 10 ratings or fewer.
The average number of scores given by each user is 106 and the median is 65. The minimum number of scores given by each user is 20. Figure 3 depicts a histogram of the number of scores given by users:
Figure 3: Histogram of number of scores given by each user
This histogram shows a distribution similar to the ratings of movies. While some users rated as many as 700 movies, many others rated the minimum allowed.
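The database statistics quoted above can be reproduced with a short Matlab sketch. This is a minimal illustration, assuming the GroupLens ratings file u.data with one tab-separated record per rating (user ID, movie ID, rating, timestamp); the variable names are our own.

% Load the ratings file and reproduce the basic statistics.
raw     = dlmread('u.data', '\t');   % columns: user, movie, rating, timestamp
users   = raw(:,1);
movies  = raw(:,2);
ratings = raw(:,3);

fprintf('ratings: mean %.2f, median %g, std %.2f\n', ...
        mean(ratings), median(ratings), std(ratings));

% Number of scores per movie (Figure 2) and per user (Figure 3).
scoresPerMovie = accumarray(movies, 1);
scoresPerUser  = accumarray(users, 1);
fprintf('per movie: mean %.1f, median %g\n', ...
        mean(scoresPerMovie), median(scoresPerMovie));
fprintf('per user:  mean %.1f, median %g, min %g\n', ...
        mean(scoresPerUser), median(scoresPerUser), min(scoresPerUser));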
Figure 4 depicts a graphical representation of the database:
Figure 4: Graphical representation of rated movies
Each rated item in the database is designated by a blue dot. The parabolic decline of rated movies as a function of user ID implies that when the database was established, each new user was provided with a more elaborate list of movies. It is also evident that movies with higher IDs receive fewer ratings on average than those with lower IDs.
Data Sets
To assess the prediction methods, the data set is divided into two parts: base and test. The test data set contains 5 entries for 10% of the users, yielding a total of 470 entries. The base data set contains all remaining entries (99,530).
The test data set is used as a reference for the accuracy of the predictions. Since the division into base and test data sets is made randomly, the performance of the tested algorithms will vary depending on the specific division that was made. To stabilize the performance throughout the paper, all results are calculated for a specific base/test division. This issue will be further addressed in the final analysis.
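A minimal sketch of this division, continuing from the loading sketch in the previous section (the fixed random seed is illustrative and only serves to make the split repeatable):

% Randomly select 10% of the users and move 5 of each one's ratings
% into the test set; everything else remains in the base set.
rng(1);                                          % illustrative fixed seed
userIds   = unique(users);
testUsers = userIds(randperm(numel(userIds), round(0.1 * numel(userIds))));

isTest = false(size(users));
for u = testUsers'
    idx = find(users == u);                      % every user has >= 20 ratings
    isTest(idx(randperm(numel(idx), 5))) = true; % pick 5 of them at random
end
base = raw(~isTest, :);
test = raw(isTest, :);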
Evaluation Criteria
Two types of evaluation criteria will be used in this paper.
Mean Average Error (MAE)
Once predictions for the test data set are obtained, the error is defined as $E_i = S_i - R_i$, the difference between the prediction, $S_i$, and the actual score given by the user, $R_i$. The MAE is then

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|E_i\right|$$

where n is the length of the predicted data set. Other error measures such as standard deviation and ROC may be used; however, related papers show that MAE is consistent with other measures, and it remains the most widely used.
Coverage
As will later be shown, predictions cannot be made for all movies in the test data set. Coverage is a measure of the percentage of movies in the test data set that can be predicted. In general, the smaller the MAE the smaller the coverage becomes.
The importance of coverage depends on the application using the recommender system. Some applications may require predictions for most items in the database, while others may choose to compromise coverage for improved accuracy.
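Both criteria can be computed together. The following Matlab sketch assumes that a prediction method returns a vector in which NaN marks a test entry that could not be predicted:

% MAE over the predictable entries, and coverage as the percentage
% of test entries for which a prediction exists.
function [mae, coverage] = evaluate(predicted, actual)
    ok       = ~isnan(predicted);
    mae      = mean(abs(predicted(ok) - actual(ok)));
    coverage = 100 * nnz(ok) / numel(actual);
end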
Base Algorithms
Four base algorithms are analyzed in this section. These algorithms are “basic” in that they rely on trivial aspects of the user’s entries to provide predictions.
All Average
In this method the average rating of all entries in the base data set is calculated. This calculated value is then used as the predicted score for all values in the test data set. This method yields an MAE of 1.046 and coverage of 100%.
Movie Average
In this method the average score given to each movie in the base data set is calculated. This average score is then used to predict values for all occurrences of the movie in the test data set. This method yields a vast improvement over the previous one: the MAE drops to 0.887. Since not all movies have scores in the base data set, some values in the test data set cannot be predicted. The coverage in this method is 99.8%.
User Average
In this method the average score given by each user in the base data set is calculated. This average score is then used to predict values for the user in the test data set. This method yields an improvement over the previous one: the MAE drops to 0.849. The coverage in this method is 100%, since every user rated at least 20 movies.
Figure 5 depicts a histogram of the error using the user average base algorithm.
Figure 5: Histogram of error using User Average prediction
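The three simple predictors described so far can be sketched in Matlab as follows, reusing the base and test matrices and the evaluate function from the previous sections:

% Per-movie and per-user averages over the base set; movies or users with
% no base entries receive NaN, which evaluate() counts against coverage.
nUsers   = max(raw(:,1));
nMovies  = max(raw(:,2));
allAvg   = mean(base(:,3));
movieAvg = accumarray(base(:,2), base(:,3), [nMovies 1], @mean, NaN);
userAvg  = accumarray(base(:,1), base(:,3), [nUsers 1],  @mean, NaN);

predAll   = allAvg * ones(size(test,1), 1);   % All Average
predMovie = movieAvg(test(:,2));              % Movie Average
predUser  = userAvg(test(:,1));               % User Average
[mae, cov] = evaluate(predUser, test(:,3));   % e.g. evaluate the User Average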
User Movie Average
This method is the most sophisticated of the basic methods. It combines the user average and the movie average to produce a prediction. To produce a prediction for user a and movie i we perform the following calculation:

$$P_{a,i} = \bar{r}_a + \frac{1}{n}\sum_{u}\left(r_{u,i} - \bar{r}_u\right)$$

where $\bar{r}_a$ is user a's average rating, u is an index running over all users who rated the movie, $r_{u,i}$ is the score user u gave movie i, and n is the number of users who rated the movie.
This method yields yet another improvement over the previous one. The MAE drops to 0.83 while the coverage is 99.8%, identical to the movie average method.
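A sketch of this calculation in Matlab, with userAvg holding each user's average rating over the base set as computed above:

% User movie average: the user's own mean plus the mean deviation of the
% movie's raters from their personal means.
function p = userMovieAvg(base, userAvg, a, i)
    rows = base(:,2) == i;                       % base entries for movie i
    if ~any(rows), p = NaN; return, end          % movie unseen in the base set
    dev = base(rows,3) - userAvg(base(rows,1));  % r_{u,i} - rbar_u per rater
    p   = userAvg(a) + mean(dev);
end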
Summary
It is clear that any desirable prediction method must improve upon the results obtained in this section. Table 1 summarizes the results:
Method / MAE / Coverage [%]
All Average / 1.046 / 100
Movie Average / 0.887 / 99.8
User Average / 0.849 / 100
User Movie Average / 0.830 / 99.8
Table 1: Summary of base algorithms results
Prediction Methods
The following prediction methods attempt to provide further improvement over the base algorithms. The improvement is possible through the use of additional information. One form of additional information is to intelligently weight other users' ratings; the Pearson R and MSD methods use this approach to improve prediction. Another form is to make use of the genre information provided with the database. Methods using this approach will be presented in the ensuing discussion.
Pearson R Algorithm
The Pearson R algorithm relies on the Pearson R coefficient to produce a correlation metric between users. This correlation is then used to weight the score of each relevant user. The Pearson R algorithm is widely used in the study of recommender systems and serves as a reference in this paper.
Base Algorithm
The Pearson R correlation between users a and u is defined as:

$$r_{a,u} = \frac{\sum_{i=1}^{m}\left(r_{a,i}-\bar{r}_a\right)\left(r_{u,i}-\bar{r}_u\right)}{\sqrt{\sum_{i=1}^{m}\left(r_{a,i}-\bar{r}_a\right)^{2}}\,\sqrt{\sum_{i=1}^{m}\left(r_{u,i}-\bar{r}_u\right)^{2}}}$$

where m is the number of movies that both users rated, $r_{a,i}$ is the score user a gave movie i, and $\bar{r}_a$ is the average score user a gave all movies.
Since the base data set is sparse, most pairs of users have only a few overlapping scored movies when the Pearson R coefficient is calculated. In calculating the coefficient, only movies rated by both users are taken into account. This affects the sum in the numerator and the variance terms in the denominator.
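A Matlab sketch of this calculation, where ra and ru are full-length rating vectors for the two users with NaN marking unrated movies:

% Pearson R over the mutually rated movies only; each user's mean is taken
% over all the movies that user rated, as in the formula above.
function [r, m] = pearsonR(ra, ru)
    both = ~isnan(ra) & ~isnan(ru);        % mutually rated movies (MRM)
    m = nnz(both);
    if m < 2, r = 0; return, end           % too little overlap to correlate
    da  = ra(both) - mean(ra(~isnan(ra)));
    du  = ru(both) - mean(ru(~isnan(ru)));
    den = sqrt(sum(da.^2)) * sqrt(sum(du.^2));
    if den == 0, r = 0; else, r = sum(da .* du) / den; end
end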
Figure 6 depicts a histogram of the average number of mutually rated movies (MRM) for all users in the test data set:
Figure 6: Histogram of average number of MRM for all users in the test data set
The mean MRM is 17.6 and the median is 12. We note that many users have an average MRM below 10, while some have a relatively high average MRM of 60. Users who provide more entries will tend to have a higher average MRM.
Figure 7 depicts a histogram of the Pearson R coefficient between a particular user and all other users:
Figure 7: Histogram of Pearson R coefficient
It is evident that most users are slightly positively correlated. While two people may well have similar tastes, it is highly unlikely to find two users with truly opposite tastes; that would require each to consistently dislike what the other likes, and vice versa. This observation is further discussed by Shardanand [1].
Once the Pearson R correlation between a user and all other users is obtained, the predicted score for user a and movie i is calculated as:

$$P_{a,i} = \bar{r}_a + \frac{\sum_{u} r_{a,u}\left(r_{u,i}-\bar{r}_u\right)}{\sum_{u}\left|r_{a,u}\right|}$$

where the sums run over all users u who rated movie i.
This approach is very similar to that of the user movie average discussed earlier. The difference is that the Pearson R coefficient is used to weight the score given by each user, whereas in the user movie average algorithm all users are weighted equally.
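A sketch of the prediction step in Matlab, where R(u) holds the precomputed Pearson R correlation between the active user a and user u:

% Weighted prediction: each rater's deviation from his own mean is weighted
% by his correlation with user a, normalized by the sum of |correlations|.
function p = pearsonPredict(base, userAvg, R, a, i)
    rows   = base(:,2) == i;                 % base entries for movie i
    raters = base(rows, 1);
    w      = R(raters);
    if sum(abs(w)) == 0, p = NaN; return, end
    dev = base(rows, 3) - userAvg(raters);   % r_{u,i} - rbar_u
    p   = userAvg(a) + sum(w .* dev) / sum(abs(w));
end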
Applying the Pearson R base algorithm to the test data set yields an MAE of 0.79 with coverage of 99.8%. The coverage is slightly limited because not all movies in the test data set have a score in the base data set. Since no limitations have been placed on the predictions, the coverage is identical to that of the movie average and user movie average base algorithms.
Figure 8 depicts a histogram of the prediction error using the Pearson R base algorithm.
Figure 8: Histogram of prediction error for Pearson R base algorithm
The MAE obtained is clearly an improvement over all base methods.
Note that in order to normalize the weights, the user scores are divided by the sum of the absolute values of the correlations. Since the Pearson R correlation may be either positive or negative, summing the raw correlations may produce low values in the denominator. These low values may cause the prediction to be either too low or too high, at times exceeding a value of 5. Using the absolute value has the drawback of producing an unbalanced normalization factor; however, results obtained without the absolute value were extensively tested and shown to be substantially poorer. In further discussion, only the absolute value approach will be applied.
Additions To The Base Pearson R Algorithm
Several modifications can be made to the base Pearson R algorithm to improve its performance. These modifications have to do with setting thresholds in order to reduce the effects of ‘noisy’ data.
Pearson R Thresholding
One modification, initially suggested by Shardanand [1], is to avoid making use of users who are not highly correlated. To implement this approach we modify $r_{a,u}$ as follows:

$$r'_{a,u} = \begin{cases} r_{a,u} & \text{if } \left|r_{a,u}\right| \ge L \\ 0 & \text{otherwise} \end{cases}$$

where L is the Pearson R threshold. After calculating $r_{a,u}$ we simply replace it with $r'_{a,u}$ and produce the expected score.
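In Matlab this thresholding is a one-line operation on a vector R of correlations:

L = 0.1;             % Pearson R threshold (illustrative value)
R(abs(R) < L) = 0;   % discarded users contribute nothing to either sum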
Figure 9 shows the effect of increasing the Pearson R threshold on the MAE and coverage:
Figure 9: MAE and Coverage as a function of Pearson R threshold
The results were obtained with two additional thresholds: a users threshold and a Herlocker threshold. The users threshold is the number of correlated users required to make a prediction; when the threshold is not met, the algorithm will not make a prediction for the item in the test data set. For consistency, all methods in this paper will be examined with a threshold of 3 users. A Herlocker threshold of 1 is identical to no threshold at all; this threshold will be explained later.
The Pearson R threshold initially reduces the MAE slightly (MAE = 0.7867 at a Pearson R threshold of 0.1) without compromising coverage significantly. Beyond that point, increasing the Pearson R threshold increases the MAE and decreases coverage, both undesirable effects. The decreasing coverage is caused by a growing number of movies that cannot be predicted due to an insufficient number of correlated users.
The user average plot, present in the upper chart, is the MAE of the user average algorithm calculated on the predicted test data set. The predicted test data set changes as the coverage changes, and these changes cause slight variations in the results of the user average method. The user average plot adds information about the type of movies being omitted as coverage decreases. This measure will prove more significant in later discussion.
Significance Weighting
Herlocker et al. [2] suggested making use of the number of mutually rated movies (MRM). Recall Figure 6, which shows a histogram of the average MRM for all users in the test data set. Herlocker suggested applying a weighting that decreases the correlation coefficient of neighbors with a small MRM. The suggested weighting scheme alters the Pearson R coefficient as follows:

$$r'_{a,u} = \frac{\min\left(m_{a,u},\,H\right)}{H}\,r_{a,u}$$

where $m_{a,u}$ is the number of movies mutually rated by users a and u, and H is a given threshold (designated the Herlocker threshold in this paper).
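A sketch of this weighting in Matlab, where MRM(u) is the number of movies co-rated by the active user and user u; the value of H shown is illustrative:

H  = 50;                    % Herlocker threshold (illustrative value)
Rw = R .* min(MRM, H) / H;  % low-MRM neighbors are scaled down linearly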
Figure 10 depicts the effect of increasing the Herlocker threshold with a Pearson R threshold of 0.1.
Figure 10: MAE and coverage as a function of Herlocker threshold, Pearson threshold=0.1
As we can see, an increase in the threshold reduces coverage, while also slightly improving the MAE.
In an attempt to improve Herlocker's method, a slightly altered weighting scheme was applied:

$$r'_{a,u} = \left(\frac{\min\left(m_{a,u},\,H\right)}{H}\right)^{2} r_{a,u}$$

This alteration truncates users with low MRMs faster, while still making use of 'borderline' users.
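In code, the squared variant changes only the weighting line:

Rw = R .* (min(MRM, H) / H).^2;   % suppresses low-MRM neighbors faster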
Figure 11 depicts the effect of increasing the Herlocker threshold with a Pearson R threshold of 0.1. Both the squared and non-squared approaches are shown.
Figure 11: MAE Vs. Coverage with two types of Herlocker thresholding
As we can see, the performance of the squared and non-squared approaches is quite similar. In further discussion, only the non-squared approach will be used.
Another alteration to Herlocker's threshold was attempted. Instead of a rigid threshold, a per-user threshold was calculated as a constant multiplied by that user's average MRM. The logic behind this approach is that some users tend to rate more movies than others and therefore tend to have higher MRMs; a constant threshold applied to all users does not take this variance into account.
Figure 12 depicts the results obtained with this method.
Figure 12: MAE and coverage as a function of multiplication threshold
As we can see, this method performs worse than Herlocker’s initial proposal and will be discarded.
Mean Square Difference (MSD) Algorithm
The MSD algorithm is very similar to the Pearson R algorithm, except that it relies on the mean square distance between users as the distance metric, instead of the Pearson R correlation. The distance between two users a and u is calculated as:

$$d_{a,u} = \frac{1}{m}\sum_{i=1}^{m}\left(r_{a,i}-r_{u,i}\right)^{2}$$

where the sum runs over the m movies rated by both users.
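A Matlab sketch of this distance, in the same form as the Pearson R sketch above:

% Mean square difference over the mutually rated movies; a smaller distance
% indicates more similar users.
function [d, m] = msd(ra, ru)
    both = ~isnan(ra) & ~isnan(ru);   % mutually rated movies
    m = nnz(both);
    if m == 0, d = Inf; return, end   % no overlap: maximally distant
    d = mean((ra(both) - ru(both)).^2);
end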