Baseball Fundamentals:

Pitching, Hitting, Running and Fielding Your Way to Success

Statistics & Data Analysis

Data Analysis Project

Professor Jeffrey Simonoff

Overview of Analysis

Baseball is a “simple” game: you pitch, field, hit and run. For our project, we evaluate how each of those components contributes to the success of a team. Observations are taken for all Major League Baseball teams. The 6 variables analyzed for each team are:

Hitting

·  On Base Percentage (OBP) – This is a measure of a batter’s contribution to his team’s offense by the rate at which he reaches base.

·  Slugging Percentage (SLG) – This is a measure of a batter’s contribution to his team’s offense based on the total number of bases per at bat.

Running

·  Stolen Bases (SB) – This is a measure of a team’s ability to utilize its speed and base-running skill for offensive gain.

Fielding

·  Fielding Percentage (FP) – The proportion of fielding chances handled without an error. High fielding percentages indicate quality fielding and throwing.

Pitching

·  K/9IP – The average number of strikeouts per standard 9-inning game. This statistic is one of the ways to evaluate a team’s pitching and is commonly known as a measurement of “power pitching”, as strikeouts are indicative of a pitcher’s ability to overpower a batter.

·  Walks Plus Hits Per Inning Pitched (WHIP) – This statistic is useful when evaluating the effectiveness of a team’s pitching. It indicates how successful the pitchers are at keeping opposing batters off base.

This analysis evaluates how these 6 variables affect a team’s winning percentage. These 6 variables have been chosen because they best represent the fundamental aspects of the game mentioned above. In theory, a team that executes the fundamentals of baseball best will win the most games. The analysis lets us see which aspect has the greatest effect on a team’s ability to win.
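For reference, the rate statistics listed above can be sketched as simple functions of season totals. The formulas are the conventional scorekeeping definitions rather than anything specified in this project, and the function names and the sample totals are hypothetical illustrations, not data from our analysis. Stolen bases are a raw count and need no formula.

```python
def on_base_pct(h, bb, hbp, ab, sf):
    """OBP: hits, walks and hit-by-pitches per qualifying plate appearance."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def slugging_pct(singles, doubles, triples, hr, ab):
    """SLG: total bases per at bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab


def fielding_pct(putouts, assists, errors):
    """Fielding %: share of fielding chances handled without an error."""
    return (putouts + assists) / (putouts + assists + errors)


def whip(bb, h, ip):
    """WHIP: walks plus hits allowed per inning pitched."""
    return (bb + h) / ip


def k_per_9(so, ip):
    """K/9: strikeouts per nine innings pitched."""
    return 9 * so / ip


# Hypothetical season totals for one team (illustration only).
print(round(on_base_pct(h=1500, bb=550, hbp=50, ab=5600, sf=50), 3))
print(round(slugging_pct(1000, 300, 30, 170, 5600), 3))
print(round(fielding_pct(4350, 1700, 100), 3))
print(round(whip(bb=500, h=1450, ip=1450), 2))
print(round(k_per_9(so=1050, ip=1450), 2))
```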

Motivations for Analysis

This analysis is of interest because it considers the various aspects of the game and evaluates the impact each has on the overall performance of a team. Each team is effectively a company, competing in the industry of baseball. Ideally, each team determines and pursues a strategy that maximizes its resources (e.g. financial support) and capabilities (e.g. scouting and player development). By analyzing data for specific teams, our objective is to understand the aspects of the game that winning teams have emphasized.

Based upon this historical data, we will see the areas of the game in which winning teams have excelled. We expect this analysis to provide insight into winning strategies, which would enable us to forecast the winning percentage of future teams based upon basic measures of their hitting, running, fielding and pitching performance.

Overview of Data

The data analyzed in this project covers five years’ worth of season statistics (2003 through 2007) for all 30 teams in Major League Baseball, for a total of 150 observations. All of our data variables are numerical. Winning percentage, our response variable, and several predictors, including OBP, SLG, K/9IP, WHIP and fielding percentage, are continuous variables. Stolen bases are a discrete numerical variable. All data was obtained from www.ESPN.com.

In determining what data to use, we wanted to select statistical measures that covered the traditional “5 tools” of baseball: hitting for average, hitting for power, speed, fielding and throwing. These tools are the skills on which individual players have traditionally been evaluated. Although our analysis looks at the overall team rather than the individual player, we use the traditional list of statistical measures to provide a view of team performance. For example, to gain insight into batting for average and batting for power, we used the OBP and SLG batting statistics. Additionally, we used fielding percentage as a measure of both fielding and throwing for non-pitchers.

While these statistical values are useful in measuring teams’ performance in batting, running, fielding and pitching, there are some limiting factors to their ability to tell the whole story:

·  Fielding performance is a function of both the ability to execute a play without making an error and a player’s ability to reach a ball put in play, commonly known as “range”. While fielding percentage does not account for a range factor, range is a very subjective statistic. Teams composed of players with exceptional range may impact the game by taking away hits that otherwise may have occurred. The downstream impact of teams with exceptional range could therefore be measured in WHIP. Additionally, a fielder with limited range may make no attempt on a ball that a fielder with exceptional range would reach and possibly misplay, so greater range can actually lead to more errors. Therefore, since the impact of range may vary, we deemed it acceptable to exclude.

·  A team’s proficiency at executing its running game is measured in many statistical and non-statistical ways. In addition to stolen bases, the percentage of successful stolen base attempts and the ability to take an additional base on a play are key components. A team’s ability to take the extra base or break up a double play is a good indication of its ability to use good base running to its advantage. However, these plays are not recorded in any statistical numbers.

First Look at the Data:

Descriptive Statistics:

Variable / N / Mean / Median / SE Mean / StDev
OBP / 150 / 0.33388 / 0.333 / 0.000983 / 0.01205
SLG / 150 / 0.42446 / 0.423 / 0.0019 / 0.02329
SB / 150 / 89.42 / 85 / 2.49 / 30.53
Fielding % / 150 / 0.98311 / 0.983 / 0.000206 / 0.00253
WHIP / 150 / 1.3938 / 1.39 / 0.00699 / 0.0856
K/9 / 150 / 6.5297 / 6.44 / 0.0481 / 0.5887
Winning % / 150 / 0.49633 / 0.5015 / 0.00573 / 0.07023
Variable / Minimum / Q1 / Q3 / Maximum
OBP / 0.3 / 0.325 / 0.342 / 0.366
SLG / 0.368 / 0.40775 / 0.44325 / 0.491
SB / 31 / 66 / 109 / 200
Fielding % / 0.977 / 0.981 / 0.985 / 0.989
WHIP / 1.22 / 1.33 / 1.45 / 1.62
K/9 / 5.41 / 6.1175 / 6.8625 / 8.68
Winning % / 0.264 / 0.438 / 0.549 / 0.644

The initial analysis of our data highlights no unusual distributions. For the most part, the mean and the median of each variable are very close, which is consistent with a roughly symmetric distribution, although it does not by itself establish normality. We therefore plotted a histogram for each of our variables to check. The histogram for the K/9 variable showed a slight right tail. However, taking the log of K/9 produced no meaningful improvement, so we kept the variable on its original scale.
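As a quick arithmetic check on the descriptive statistics table, the sketch below recomputes the standard error of each mean as StDev / sqrt(N) and expresses the mean-median gap in standard deviation units. The numbers are copied from the table; the dictionary layout and helper names are our own illustration, and the gap measure is only a rough symmetry indicator, not a formal normality test.

```python
import math

# (N, mean, median, reported SE mean, StDev) copied from the table above.
stats = {
    "OBP":        (150, 0.33388, 0.333,  0.000983, 0.01205),
    "SLG":        (150, 0.42446, 0.423,  0.0019,   0.02329),
    "SB":         (150, 89.42,   85.0,   2.49,     30.53),
    "Fielding %": (150, 0.98311, 0.983,  0.000206, 0.00253),
    "WHIP":       (150, 1.3938,  1.39,   0.00699,  0.0856),
    "K/9":        (150, 6.5297,  6.44,   0.0481,   0.5887),
    "Winning %":  (150, 0.49633, 0.5015, 0.00573,  0.07023),
}


def se_mean(stdev, n):
    """Standard error of the mean."""
    return stdev / math.sqrt(n)


def mean_median_gap(mean, median, stdev):
    """Mean minus median, in standard deviation units."""
    return (mean - median) / stdev


for name, (n, mean, median, reported_se, sd) in stats.items():
    print(f"{name:11s} SE={se_mean(sd, n):.6f} (reported {reported_se}) "
          f"gap={mean_median_gap(mean, median, sd):+.3f} SD")
```

With these table values, every gap is well under 0.2 standard deviations; the largest positive gaps belong to K/9 and SB, which is in line with the slight right tail noted for K/9.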

Next we graphed our response (winning %) in a box plot to highlight any outlying data points. The only outlier observed was the winning % of the 2003 Detroit Tigers, which was one of the 10 lowest winning percentages in baseball history. We will take this into consideration as we evaluate the quality of our model.
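The box plot result can be reproduced approximately from the quartiles in the descriptive statistics table. The sketch below assumes the box plot flags points beyond 1.5 times the interquartile range from the quartiles, which is the common convention but is an assumption on our part; the variable names are illustrative.

```python
# Winning % summary values copied from the descriptive statistics table.
q1, q3 = 0.438, 0.549
minimum, maximum = 0.264, 0.644

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # points below this are flagged (assumed rule)
upper_fence = q3 + 1.5 * iqr  # points above this are flagged (assumed rule)

print(f"IQR = {iqr:.3f}")
print(f"fences = [{lower_fence:.4f}, {upper_fence:.4f}]")
print("minimum flagged as outlier:", minimum < lower_fence)
print("maximum flagged as outlier:", maximum > upper_fence)
```

Under this rule the minimum winning % of 0.264 falls just below the lower fence, while the maximum of 0.644 is well inside the upper fence, which matches a single low outlier.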

We then looked at the fitted line plot of each variable against winning % to get a better understanding of the relationship between each predictor and the response. These plots isolate each predictor and do not take into account the combined effect of all variables on the response.

From these plots we do not see an overwhelmingly strong correlation between any individual variable and team winning %, as each R-Sq is below 50%. The variables with the highest R-Sq are WHIP and OBP, and the variable with the lowest R-Sq is SB. Although no individual variable shows a strong correlation with team winning %, this is not surprising, since a team’s success depends on execution of all the fundamentals of the game of baseball.

There appear to be potential outliers and/or leverage points in the fitted line plots; however, the impact of these points will be further evaluated after analyzing the best subsets regression.


Preliminary Multiple Regression Model:

Regression Analysis: Winning % versus OBP, SLG, ...
The regression equation is
Winning % = - 3.02 + 1.67 OBP + 0.891 SLG + 0.000098 SB + 3.35 Fielding % - 0.510 WHIP - 0.00089 K/9
Predictor / Coef / SE Coef / T / P
Constant / -3.020 / 1.107 / -2.73 / 0.007
OBP / 1.6664 / 0.3117 / 5.35 / 0.000
SLG / 0.8914 / 0.1569 / 5.68 / 0.000
SB / 0.00009752 / 0.00008467 / 1.15 / 0.251
Fielding % / 3.345 / 1.123 / 2.98 / 0.003
WHIP / -0.50953 / 0.03450 / -14.77 / 0.000
K/9 / -0.00089 / 0.004759 / -0.19 / 0.851
S = 0.0309459 R-Sq = 81.4% R-Sq(adj) = 80.6%
Analysis of Variance
Source / DF / SS / MS / F / P
Regression / 6 / 0.597891 / 0.099649 / 104.06 / 0.000
Residual Error / 143 / 0.136944 / 0.000958
Total / 149 / 0.734835
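The summary figures in this output are tied together by standard identities, and the sketch below recomputes them from the sums of squares and the coefficient table: T = Coef / SE Coef, F = MS Regression / MS Error, R-Sq = SS Regression / SS Total, and S = sqrt(MS Error). All inputs are copied from the output above; the variable names are our own.

```python
import math

n, k = 150, 6  # observations and predictors
ss_reg, ss_err, ss_tot = 0.597891, 0.136944, 0.734835

ms_reg = ss_reg / k
ms_err = ss_err / (n - k - 1)

f_stat = ms_reg / ms_err
r_sq = ss_reg / ss_tot
r_sq_adj = 1 - ms_err / (ss_tot / (n - 1))
s = math.sqrt(ms_err)

# (Coef, SE Coef) pairs from the coefficient table.
coefs = {
    "Constant":   (-3.020, 1.107),
    "OBP":        (1.6664, 0.3117),
    "SLG":        (0.8914, 0.1569),
    "SB":         (0.00009752, 0.00008467),
    "Fielding %": (3.345, 1.123),
    "WHIP":       (-0.50953, 0.03450),
    "K/9":        (-0.00089, 0.004759),
}
t_stats = {name: coef / se for name, (coef, se) in coefs.items()}

print(f"F = {f_stat:.2f}, R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, S = {s:.6f}")
for name, t in t_stats.items():
    print(f"{name:11s} T = {t:6.2f}")
```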

The multiple regression model highlights the importance of considering several fundamentals together, as it accounts for approximately 81% of the variability in team winning percentage. As expected, increases in OBP, SLG and Fielding % are associated with higher winning percentages. Holding all other variables constant, the model indicates that a team that gives up one additional hit or walk per 9-inning game (an increase in WHIP of 0.1111) can be expected to have a winning percentage that is lower by 0.057, or about 0.8 standard deviations of winning percentage. This result underscores the baseball adage that “pitching wins games.” In contrast, our model reveals that the impact of stolen bases on team winning percentage is negligible. Even across the full range of 169 stolen bases (the difference between the team with the most stolen bases and the team with the fewest), the predicted difference in winning percentage is only 0.017 (169 x 0.000098). This is consistent with the high P value for stolen bases of 0.251, which indicates insufficient evidence to reject the null hypothesis that, given the other predictors, stolen bases are unrelated to team winning percentage.

One point of interest in the model is that when comparing two teams with identical statistics other than K/9, the model predicts that the team with the lower K/9 will actually have a slightly higher winning percentage. However, K/9 appears to have a minimal impact on team winning percentage. The difference between the teams with the highest and lowest K/9 results in a predicted difference in winning percentage of only 0.003 (3.27 x 0.00089). This is consistent with the extremely high P value for K/9 of 0.851, which indicates insufficient evidence to reject the null hypothesis that, given the other predictors, K/9 is unrelated to team winning percentage. The P values for stolen bases and K/9 indicate that including these variables in our model does not add to its predictive power. This will be further analyzed in the “Model Improvement” section below.
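The effect sizes quoted in the two paragraphs above are simple products of a coefficient and a change in the predictor. The sketch below reproduces them using the rounded coefficients from the regression equation and the minimum and maximum values from the descriptive statistics; the variable names are illustrative.

```python
# Rounded coefficients from the regression equation.
coef = {"WHIP": -0.510, "SB": 0.000098, "K/9": -0.00089}
sd_win_pct = 0.07023  # StDev of winning % from the descriptive statistics

# One extra hit or walk per 9-inning game raises WHIP by 1/9.
whip_delta = 1 / 9
whip_effect = coef["WHIP"] * whip_delta

# Full observed range of each weak predictor (maximum minus minimum).
sb_range = 200 - 31
k9_range = 8.68 - 5.41
sb_effect = coef["SB"] * sb_range
k9_effect = coef["K/9"] * k9_range

print(f"WHIP +{whip_delta:.4f}: {whip_effect:+.3f} "
      f"({abs(whip_effect) / sd_win_pct:.2f} SD of winning %)")
print(f"SB over range of {sb_range}: {sb_effect:+.3f}")
print(f"K/9 over range of {k9_range:.2f}: {k9_effect:+.3f}")
```

Seen side by side, a change of one baserunner allowed per game moves the prediction more than three times as far as the entire observed range of stolen bases.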

The standard error of the estimate of approximately 0.031 implies the model can predict winning percentage to within ±0.062 (2 x 0.031) about 95% of the time. To put this further into perspective, over the course of a 162 game season, this translates into an error of the estimate of approximately ±10 wins (±0.062 x 162).
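To make the ±2S rule of thumb concrete, the sketch below plugs the sample means of the six predictors into the fitted equation and attaches the rough 95% band. The coefficients, S and the means are copied from the output and tables above; the function and variable names are our own, and the band is the rough approximation used in the text rather than an exact prediction interval.

```python
# Coefficients and S copied from the regression output.
coef = {
    "Constant": -3.020, "OBP": 1.6664, "SLG": 0.8914, "SB": 0.00009752,
    "Fielding %": 3.345, "WHIP": -0.50953, "K/9": -0.00089,
}
S = 0.0309459
GAMES = 162


def predict_win_pct(team):
    """Fitted winning % for a dict of predictor values."""
    return coef["Constant"] + sum(coef[name] * value for name, value in team.items())


# A team sitting exactly at the sample mean of every predictor.
average_team = {
    "OBP": 0.33388, "SLG": 0.42446, "SB": 89.42,
    "Fielding %": 0.98311, "WHIP": 1.3938, "K/9": 6.5297,
}

pred = predict_win_pct(average_team)
half_width = 2 * S
wins_half_width = half_width * GAMES

print(f"predicted winning % = {pred:.3f}")
print(f"rough 95% band = [{pred - half_width:.3f}, {pred + half_width:.3f}]")
print(f"band in wins = +/- {wins_half_width:.1f} over {GAMES} games")
```

Because a least squares fit passes through the point of means, the prediction for this average team lands very close to the mean winning % of 0.496; the small gap comes from rounding in the reported coefficients.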

Checking Assumptions

In order to evaluate the validity of the model assumptions, we must analyze the model errors through the use of several residual plots. We will begin with the plot of residuals versus the fitted values as well as residuals versus each predictor. These plots will be evaluated to identify any structure which may indicate that the model assumptions are invalid.

The above plots reveal no apparent structure in the residuals, suggesting that our assumptions regarding the distribution of errors are reasonable. That is, there are no well-defined subgroups, and the variance of the errors appears constant (homoscedastic).

Next we evaluate the normal probability plot of the residuals to ensure errors are normally distributed.

This plot indicates that the residuals roughly follow a normal distribution. As a further step to ensure our assumptions hold, we will run a time-series plot of residuals, which will indicate any auto-correlation of results across seasons.

Our data was ordered from 2007 down to 2003, with 30 observations from each season. The time-series plot of residuals does not reveal any apparent patterns to indicate auto-correlation.
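A numerical complement to the time-series plot, which was not part of this project, is the Durbin-Watson statistic: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation. The sketch below is a minimal implementation; the residual lists are hypothetical and are not our model's residuals, and the statistic is only as meaningful as the ordering of the observations.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in observation order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences (illustration only).
persistent = [0.02, 0.02, 0.02, 0.02]     # no change from one point to the next
alternating = [0.02, -0.02, 0.02, -0.02]  # sign flips at every step

print(durbin_watson(persistent))
print(durbin_watson(alternating))
```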

Model Improvement

Our multiple regression model provided a substantially greater ability to predict the winning percentage of a baseball team than any simple regression model with an individual predictor. However, we will now evaluate opportunities to improve upon our model.

Simplifying the Model

As previously indicated, we believe that stolen bases and K/9 are the weakest predictors of winning percentage. We will evaluate the best subsets regression to determine which combination of predictors provides the strongest ability to predict winning percentage.

Best Subsets Regression: Winning % versus OBP, SLG, ...
Response is Winning %
Vars / R-Sq / R-Sq(adj) / Mallows C-p / S / Predictors included (columns OBP, SLG, SB, Fielding %, WHIP, K/9)
1 / 49.1 / 48.7 / 244.7 / 0.050282 / X
1 / 35.4 / 35.0 / 349.3 / 0.056613 / X
2 / 76.3 / 76.0 / 37.7 / 0.034401 / X X
2 / 74.8 / 74.5 / 49.1 / 0.035467 / X X
3 / 80.1 / 79.7 / 10.8 / 0.031661 / X X X
3 / 77.3 / 76.9 / 31.8 / 0.033765 / X X X
4 / 81.2 / 80.7 / 4.3 / 0.030874 / X X X X
4 / 80.2 / 79.6 / 12.0 / 0.031683 / X X X X
5 / 81.4 / 80.7 / 5.0 / 0.030842 / X X X X X
5 / 81.2 / 80.5 / 6.3 / 0.030981 / X X X X X
6 / 81.4 / 80.6 / 7.0 / 0.030946 / X X X X X X

These results seem to support our initial conclusion that stolen bases and K/9 have negligible impact on the model’s ability to predict team winning percentage. By eliminating these two variables from the model, we reduce its complexity while slightly improving the fit after adjusting for the number of predictors: adjusted R-Sq rises from 80.6% to 80.7%, and the four-variable model has the lowest Mallows C-p (4.3). While the model that eliminates only K/9 provides a slightly higher R-Sq and a lower standard error of the estimate, the benefits (R-Sq higher by 0.2 percentage points and S lower by 0.000032) are nearly inconsequential compared to the simplicity of modeling based upon fewer variables.
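The comparison criteria in the best subsets output can be recovered from each model's S value alone. The sketch below uses the standard definitions, with p predictors plus a constant: SSE = S^2 x (n - p - 1), Mallows C-p = SSE / MSE(full) - n + 2(p + 1), and adjusted R-Sq = 1 - S^2 / (SS Total / (n - 1)). The inputs are copied from the outputs above; the function names are our own.

```python
n = 150
ss_tot = 0.734835            # Total SS from the Analysis of Variance table
mse_full = 0.136944 / 143    # Residual MS of the full six-predictor model

# S for the best model of each size, from the best subsets output.
best_s = {4: 0.030874, 5: 0.030842, 6: 0.030946}


def sse_from_s(s, p):
    """Residual sum of squares implied by S for a model with p predictors."""
    return s ** 2 * (n - p - 1)


def mallows_cp(s, p):
    """Mallows C-p relative to the full model's residual mean square."""
    return sse_from_s(s, p) / mse_full - n + 2 * (p + 1)


def adj_r_sq(s):
    """Adjusted R-Sq implied by S."""
    return 1 - s ** 2 / (ss_tot / (n - 1))


cp = {p: mallows_cp(s, p) for p, s in best_s.items()}
adj = {p: adj_r_sq(s) for p, s in best_s.items()}

for p in best_s:
    print(f"{p} predictors: C-p = {cp[p]:.1f}, R-Sq(adj) = {adj[p]:.1%}")
```

A C-p close to or below the number of parameters (p + 1) indicates little bias from the omitted predictors, which is why the four-variable model, with C-p below 5, is an attractive choice.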