Artificial Neural Network Prediction of Major League Baseball Teams Winning Percentages

Scott Wiese

12.18.2003

ECE 539

Professor Hu

Introduction

Baseball has long been known as America’s game; often called a game of inches, it is also a game of statistics. Fans attending games across the country often keep their own statistics using the scorecards that come with game day programs. Baseball teams have kept statistics since as early as the 1871 Boston Braves, who played 31 games that year. Since then, statistics have become a tool for tracking team and individual player success. Scouts now track players as early as high school, mainly by watching their statistics. If a high school player’s statistics are good enough to attract a professional scout’s attention, that player could be drafted as early as age 18 to play professional baseball. Statistics are part of the heritage of baseball: every game can be told by a box score, and every season can be explained by an in-depth review of a team’s statistics.

General Managers of current Major League Baseball teams are now examining statistics more than ever. In recent years, Billy Beane, the General Manager of the Oakland Athletics, has changed the way baseball teams are run. Instead of managing a team based on the players, he manages based purely on the players’ statistics. In the past, player personnel decisions were made on the basis of a player’s production and his relationship with the community and the team. If a player was doing well and the fans liked him, he was rewarded with a handsome contract. If a player performed very well, he could fetch tens of millions of dollars per year in salary from richer teams such as the New York Yankees. Billy Beane’s Oakland Athletics had a much more limited operating budget and therefore less money to spend on players’ contracts. He had to find a way to make the team more competitive with less money. He knew that if he spent a lot of money on one player who then did not produce as anticipated, it could devastate his team’s chances of succeeding, whereas a team like the Yankees could simply go out and buy more players. He needed a way to accurately predict a player’s performance and determine how much money that player was worth. If he could find a formula that worked, he figured he could put together a team that could compete with the bigger-spending clubs in the sport.

The methodology that Beane now uses to manage the Oakland Athletics is deeply rooted in the statistics of the game (for obvious competitive reasons, his algorithms have not been released). He claims to manage the team purely on statistics and to base personnel decisions on a projection of cost versus productivity. In contrast to the old way of managing a team, no matter how popular a player is, if he will cost too much in Beane’s estimation relative to his potential production, that player will no longer be a part of the team. His methodologies appear to work: over the last few years the Oakland Athletics have consistently either made the playoffs or been a major contender.

As Billy Beane has risen in prominence, a question began to circulate in baseball circles and eventually in the media: is it possible to manage a team based on statistics alone? The general consensus of the baseball community, which consulted statisticians on the matter, was that over the course of an entire season statistics could be used reliably to build a team. The full 162-game season is a long enough trial period for statistical patterns to become relevant. Where the theory fails, though, is in the post-season playoffs. Instead of a 162-game season to determine division champions, best-of-5 and best-of-7 series determine who moves on. In that case the sample is too small for statistical methods to be as significant as they are during the regular season.

Since it has been claimed that statistics can be a useful predictor of a team’s success over the course of a season, I wanted to see whether an artificial neural network could be used effectively to predict a team’s success based on its statistics. Given the above claims of statistical relevance to a team’s success, if a neural network is given a year’s worth of statistics for a team, it should be able to predict that team’s winning percentage within an acceptable error range. I would like to see whether a multi-layer perceptron network trained with back propagation can either accurately predict a team’s winning percentage or classify a team into a predicted winning percentage range.

Work Performed

Data Collection and Preprocessing

Before I could do any analysis I first had to gather enough data to train and test a neural network. Major League Baseball’s official website has a database of statistics for each baseball club from the 2003 season dating back to 1871. The statistics cover a wide array of batting, pitching, and fielding categories. There are a total of 74 independent statistics not directly related to the team’s winning percentage (i.e. wins and losses). I gathered statistics from the past three years (2002, 2001, and 2000) for all 30 teams and all 74 statistics, as well as each team’s winning percentage. I was able to gather the data efficiently using Microsoft Excel’s web query feature, which, given a web address, imports any tabular data on the page.

One of the first preprocessing tasks needed before training an MLP neural network is to normalize the data. Since I was planning on using each year as an independent set, I normalized each year’s statistics individually. I normalized the data using the following equation, where x and y are vectors of data:

y = (x - mean(x)) / dev(x)

where mean(x) is the mean value of the vector and dev(x) is the standard deviation of the vector. This was accomplished using the following Matlab code:

x = load('converted_2002');           % 30 x 75 array: 74 statistics plus winning percentage
for i = 1:74,
    avg = mean(x(:,i));               % mean of the i-th statistic
    st = std(x(:,i));                 % standard deviation of the i-th statistic
    x_n(:,i) = (x(:,i) - avg) / st;   % zero-mean, unit-variance normalization
end
x_n(:,75) = x(:,75);                  % carry the winning percentage through unchanged

This assumes that the file “converted_2002” contains a 30 x 75 array with each row being a different team. The first 74 columns are the statistics from MLB.com and the last column is the team’s winning percentage.

The second preprocessing step uses the Singular Value Decomposition (SVD). Projecting the data onto its leading singular vectors emphasizes the important features in the array of feature vectors and de-emphasizes the unimportant ones, which makes it easier to train a neural network accurately: the network can concentrate on the dominant structure in the data and ignore the rest. Matlab can perform SVD using the following code:

b = a(:,1:74);                 % the 74 normalized statistics (30 x 74)
[P, S, Q] = svd(b);            % singular value decomposition of the statistics
p_v = P(:,1:4);                % keep the 4 leading left singular vectors
Projection = p_v * p_v';       % projection matrix onto their span
result1 = Projection * b;      % project the statistics onto that subspace
win_pct = a(:,75);             % original winning percentages
result = [result1, win_pct];   % re-append the winning percentage column
svd_2002 = result;

Here we assume that a holds the normalized data from the 2002 season. This code emphasizes the 4 most important feature vectors, as can be seen in the third line of code. The final result, svd_2002, is a 30 x 75 array containing the modified feature vectors and the original team winning percentages. I performed this process on all three seasons of data individually, as discussed before. This concludes the data preprocessing; the training of the neural networks can now begin.
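
As a quick check on the choice of 4 components, one could look at how much of the data’s total energy the leading singular values capture. The sketch below is not part of the original procedure; it simply illustrates the idea, reusing the matrix S from the code above:

s = diag(S);                         % singular values, largest first
energy = cumsum(s.^2) / sum(s.^2);   % cumulative fraction of the total energy
energy(4)                            % fraction retained by the first 4 components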

Multi Layer Perceptron Neural Network

Multi-layer perceptron neural networks are best at dividing a feature space and classifying input feature vectors. The number of features is akin to the number of dimensions in a subspace: the more features present in a problem, the higher the number of dimensions. MLPs are good at subdividing this feature space into regions of classification, so that when a new feature vector is mapped into the feature space, it can be properly classified by the MLP based on its coordinates within a particular subspace. So is it possible to use an MLP to determine a baseball team’s winning percentage based on its statistics?

If we treat each possible winning percentage from 0.000 to 1.000 as a classification, then given the properties of an MLP it should be possible, provided the statistics are favorable, to predict a team’s winning percentage from its year-end statistics. This assumes that the statistics we are using are in fact relevant and correlated to the problem in question.

The next step was to determine what type of MLP to use, since the MLP technique is so flexible. Would it be best to attempt to predict the winning percentage exactly? Should there be an error margin? Or should a classification problem be created, where the MLP attempts to classify each team into one of a few classes of winning percentages? I will start by discussing my efforts to exactly predict the winning percentage of each team.

I wanted to come up with an MLP that could accurately predict the winning percentage of each team based on its season statistics. With artificial neural networks, however, one hardly ever arrives at a perfect solution. So the question is not whether you can create a perfect neural network, but whether your design is better than a baseline or previously established network. I searched through the annals of IEEE’s Transactions on Artificial Neural Networks to see if anyone else had attempted to solve this problem, but I could find no prior work. So I had to come up with a baseline MLP of my own. I chose the most basic MLP, a network with 1 hidden layer and 1 hidden neuron. Previous work in the semester showed the class that there is little appreciable improvement in performance when more than 5 hidden layers are added to an MLP, so I decided to test the range from 1 to 5 hidden layers. In the end, I created MLPs with 1 to 5 hidden layers and 1, 3, or 5 hidden neurons in each layer (each layer had the same number of neurons). This led to 15 different MLP configurations.
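
As a hedged illustration of this grid (not code from the original project files), the 15 configurations can be enumerated in Matlab as follows, where each entry of configs lists the number of neurons in each hidden layer:

layers = 1:5;        % candidate numbers of hidden layers
neurons = [1 3 5];   % candidate neurons per hidden layer
configs = {};
for L = layers
    for n = neurons
        configs{end+1} = repmat(n, 1, L);   % e.g. [3 3 3] = 3 layers, 3 neurons each
    end
end
% length(configs) is 15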

I trained each neural network three separate times with the data from the 2002 season, using the bp.m algorithm provided by Professor Hu to train the MLP. To test each training of the MLPs I used the data from 2001. I used a Matlab file to apply each team’s statistics and check the output value from the MLP against the actual winning percentage (this file can be seen in Appendix 1). After a short time I realized that very few outputs (1 or 2 out of 30) were coming out exactly correct, so I decided to establish a margin of error. I decided, arbitrarily, that if the output of the MLP was within plus or minus 0.15 of the actual winning percentage, the test would be counted as a success. When I found which configuration worked best, I tested it directly against the baseline MLP. I combined the statistics from 2002 and 2001 to create a larger training file, then retrained the baseline MLP and the best MLP from the initial trials three times each with this new data. I then tested each MLP with the data from 2000 and compared the two MLPs’ results to see which network performed better. Disappointed with the results, which I discuss in detail in the next section, I decided to see how the MLP would work as a classifier instead of a pure predictor.
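
For reference, the +/- 0.15 success criterion amounts to the following check, sketched here under the assumption that predicted and actual are 30 x 1 vectors of winning percentages (the variable names are illustrative, not taken from the Appendix 1 file):

correct = abs(predicted - actual) <= 0.15;            % within the error margin
success_rate = 100 * sum(correct) / length(actual);   % percentage of teams counted as successes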

Instead of training the MLP with just the winning percentage as the output for each set of input features, I decided to break the winning percentages into three groups. I looked at the winning percentages of the teams that won their respective divisions over the last three years and found that on average they had a winning percentage over 0.590, so I made this group one classification. Then I wanted to see if the MLP could properly classify winning and losing teams, so I set the next threshold at 0.500. I then created new data files in which I replaced the winning percentage vector with three mutually exclusive classification vectors. This procedure is akin to clustering, but adapted for MLPs. Following the same procedure as described above, I went about finding which MLP configuration worked best as a classifier. Again, for the initial trials I trained with data from 2002 and tested with data from 2001 to be consistent. Then, as before, I took the best performing configuration and tested it against the baseline, training on a combined data set from 2002 and 2001 and testing with data from 2000.
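
A minimal sketch of how these mutually exclusive targets could be built from the winning percentage column is shown below; the ordering of the three class columns is an assumption made for illustration, and a is the 30 x 75 normalized array from the preprocessing step:

win_pct = a(:,75);
targets = zeros(size(win_pct,1), 3);
targets(win_pct > 0.590, 1) = 1;                       % assumed division-winner class
targets(win_pct >= 0.500 & win_pct <= 0.590, 2) = 1;   % assumed winning-team class
targets(win_pct < 0.500, 3) = 1;                       % assumed losing-team class
classified_2002 = [a(:,1:74), targets];                % statistics plus class vectors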

Training Parameters

The bp.m algorithm written by Professor Hu has many parameters that need to be set before training and testing an MLP neural network. The parameters I was concerned about were those related directly to learning and data separation. Two parameters relate directly to learning: the learning rate and the momentum constant, both of which appear in the general equations for error back propagation training of an MLP neural network. I chose to use the default values of 0.1 for the learning rate and 0.8 for the momentum constant. Some preliminary testing showed that although the training classification rates improved slightly when these defaults were changed, the testing classification and success rates did not improve significantly. I always used 3500 epochs with random training subsets consisting of 2/3 of the available training data, and I stopped the algorithm after 350 epochs with no improvement. This allowed most training sets to converge, although surprisingly, convergence on the training set did not correlate with better testing performance.
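
The random 2/3 training subset described above can be pictured with the following sketch (bp.m handles this selection internally; the code here only illustrates the idea, using the svd_2002 array from the preprocessing step):

n = size(svd_2002, 1);                      % 30 teams
idx = randperm(n);                          % random ordering of the rows
n_train = round(2*n/3);                     % 2/3 of the data for training
train_set = svd_2002(idx(1:n_train), :);
hold_out = svd_2002(idx(n_train+1:end), :);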

Results

Winning Percentage Prediction

Average Success Rates (%) / 1 hidden layer / 2 hidden layers / 3 hidden layers / 4 hidden layers / 5 hidden layers
1 neuron / 33.33 / 56.67 / 50.00 / 60.00 / 50.00
3 neurons / 45.56 / 35.56 / 41.11 / 31.11 / 36.67
5 neurons / 32.22 / 45.56 / 43.33 / 40.00 / 25.56

As discussed above, I trained each individual trial MLP neural network three separate times, followed by testing. I used data from the 2002 season to train each network and tested with data from the 2001 season. An output was counted as a success if it was within +/- 0.15 of the actual winning percentage. The results from the original trials are listed in Table 1.

Table 1: Average success rates from the winning percentage prediction trials. The baseline result and the best trial result are in bold.

From the results we can see that the baseline neural network had an average success rate of 33.33%. Most of the trial neural networks (11 of 14) exceeded that level of success, with the best average success rate of 60% coming from an MLP with 4 hidden layers and 1 neuron in each hidden layer. We also see that, on average, the neural networks with only 1 neuron in each hidden layer outperformed the other neural networks, as seen in Figure 1. Adding more complexity to the neural network did not necessarily improve its prediction success rate. Next, I tested the best trial network against the baseline network with more training data and a different set of testing data. The results from this trial can be found in Table 2.

Figure 1: Trial results, average success rates versus number of hidden layers
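
For readers without the original figure, the comparison shown in Figure 1 can be reproduced directly from the Table 1 values with a short Matlab sketch like the one below (not part of the original project files):

rates = [33.33 56.67 50.00 60.00 50.00;    % 1 neuron per hidden layer
         45.56 35.56 41.11 31.11 36.67;    % 3 neurons per hidden layer
         32.22 45.56 43.33 40.00 25.56];   % 5 neurons per hidden layer
plot(1:5, rates', '-o');
xlabel('Number of hidden layers');
ylabel('Average success rate (%)');
legend('1 neuron', '3 neurons', '5 neurons');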

Final Testing Success Rates (%) / Trial 1 / Trial 2 / Trial 3 / Mean
Baseline / 23.33 / 16.67 / 33.33 / 24.44
Best MLP / 73.33 / 43.33 / 26.67 / 47.78

Table 2: Results from final testing of the baseline MLP against the MLP with 4 hidden layers and 1 neuron per layer.

We can see that, on average, the best MLP from the trial stage greatly outperforms the baseline MLP. I think it can be stated with a good degree of certainty that the MLP with 4 hidden layers and 1 neuron per layer was better at predicting a team’s winning percentage than the baseline MLP.

Classification

Following the results of the previous trials, I thought that a neural network would be better at classifying teams into one of three categories, since MLPs are more adept at classification problems than at pure prediction. As discussed earlier, I divided the teams into three categories: division winners, winning teams, and losing teams. I made the categories mutually exclusive, so although a division winner is always a winning team, I classified it exclusively as a division winner. The categories were set up in the following manner: each category was assigned one of the target codes 001, 010, and 100, where each code is a 3 x 1 vector, with one code for division winners (winning percentage above 0.590), one for winning teams (winning percentage between 0.500 and 0.590), and one for losing teams (winning percentage below 0.500).

This time, the classification was considered a success if a team was correctly placed into its category. The results from the original trials can be seen in Table 3.

Average Success Rates (%) / 1 hidden layer / 2 hidden layers / 3 hidden layers / 4 hidden layers / 5 hidden layers
1 neuron / 55.56 / 61.11 / 56.67 / 53.33 / 50.00
3 neurons / 58.89 / 61.11 / 66.67 / 57.78 / 63.33
5 neurons / 66.67 / 62.22 / 73.33 / 60.00 / 67.78

Table 3: Average classification trial results. The baseline result and the best MLP result are in bold.

In this set of trials the baseline neural network had an average success rate of approximately 56%, whereas the best network had an average success rate of approximately 73%. Once again, most of the advanced neural networks (12 of 14) improved on the baseline MLP. The best performance this time came from the network with 3 hidden layers and 5 hidden neurons per layer. Interestingly, in contrast to the last experiment, the networks with 5 hidden layers performed better relative to their counterparts in this experiment. This can be seen more easily in Figure 2.

Figure 2: Average classification rates versus number of hidden layers

Once again, even though the performance in this trial was significantly better, I tested the best trial network against the baseline network with a larger training data set and a new testing set. The results of this trial can be seen in Table 4.