A Statistical Analysis of Basketball Comebacks

Adam C. Benoit, Jessica L. Jenkins, and Dr. R. Mitchell Parry

RET Program, Computer Science Department, Appalachian State University

Abstract – This paper reports a study of the likelihood of an NBA comeback based on the time remaining and the point differential. Using several data mining strategies to process the data and analyze the results, the study attempts to illustrate the likelihood of a team coming back from a deficit to win the game. Data mining and visualization led to an attempt at modeling the empirical data with a function that could potentially be useful in predicting future outcomes.

Index Terms – Data mining, basketball, sports history, modeling, root-mean-square deviation, confidence interval

I. INTRODUCTION

Most people have watched a sporting event and wondered, “Is there even a chance of a comeback?” Sports analysts have examined the likelihood of basketball comebacks based on multiple variables: possession, home-court advantage, and current ranking within a division. Statistical reports have been published that reference seasons prior to the year 2000. These reports show that basketball scores are normally distributed and that there is a home-team advantage [1,2]. On average, home teams win 50-65% of games in the regular season [1]. There is evidence that as a game progresses, bolder predictions can be made. Stern suggests that 75% of games are won by the team leading at half-time and 80% of games are won by the team leading at the beginning of the fourth quarter [1]. Formulas have been developed and tested that predict fairly accurately whether or not a comeback is likely. Bill James’ formula (1) is influenced by possession, adding a half-point if the team that is ahead has the ball and subtracting a half-point if the other team has the ball.

$\text{seconds} = (\text{lead} - 3 \pm 0.5)^2$ (1)

If the result is greater than the number of seconds left in the game, then the lead is safe [3]. The formula (1) assumes that once a lead is safe, it is always safe, meaning that even if the score tightens, the predicted winner still wins. Although this formula has only been tested on college basketball games, there is only one known case of it failing: in 1974, the University of North Carolina overcame an 86-78 deficit against Duke University with 17 seconds left to play and won [3].
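As a minimal sketch of how formula (1) could be applied, the Python below computes the threshold and compares it with the time remaining. The function names are hypothetical, and clamping a non-positive adjusted lead to zero before squaring is an assumption made here so that small leads are never reported as safe; the text above does not state how that case is handled.

```python
def safe_lead_seconds(lead, leader_has_ball):
    """Threshold in seconds from formula (1): (lead - 3 +/- 0.5)^2."""
    adjusted = lead - 3 + (0.5 if leader_has_ball else -0.5)
    # Assumption: a non-positive adjusted lead is clamped to zero,
    # otherwise squaring would make a tiny lead look safe.
    adjusted = max(adjusted, 0.0)
    return adjusted ** 2


def lead_is_safe(lead, seconds_left, leader_has_ball):
    """The lead is called safe when the threshold exceeds the seconds left."""
    return safe_lead_seconds(lead, leader_has_ball) > seconds_left


# The 1974 game cited above: an 8-point lead with 17 seconds left.
# Possession is not stated in the text, so both cases are shown.
with_ball = safe_lead_seconds(8, leader_has_ball=True)      # 30.25
without_ball = safe_lead_seconds(8, leader_has_ball=False)  # 20.25
```

Both thresholds exceed 17 seconds, so the formula labels that lead safe regardless of possession, which is why the game counts as a failure of the formula.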

The sample sizes in these studies varied. Stern [1] used a sample of 493 games from the 1992 season, Gill [2] used the 1997-1998 regular season, and James [3] did not report the sample size for his studies.

This study focuses on all regular season games from 2002-2013 and the likelihood of a National Basketball Association (NBA) comeback based on two factors: time remaining and point differential. A regulation NBA game is played over forty-eight minutes divided into four twelve-minute quarters. If the score is tied at the end of regulation, the teams play five-minute overtime periods until the tie is broken. This study analyzes over ten thousand NBA games and twenty-eight million seconds of play. The research found that an exponential function models the probability of a future comeback with a root-mean-square deviation (RMSD) of 0.0920 [4].

II. METHODOLOGY

This research began with a focus on basketball, but not necessarily on comeback percentage. The original aim was to analyze free throw techniques and present findings that would enable athletes to adjust a specific part of their free throw routine to produce better results. This idea had already been researched and reported on in numerous articles, with few definite conclusions. Every player shoots differently, every player has a different routine, and many techniques can produce desirable results.

The research was modified to examine each shot during a basketball game more closely. The Shot Charts provided by ESPN [5] were used in the analysis of games since 2002. Using a Perl script, Shot Charts were extracted for individual games. The Shot Charts recorded every shot attempted during the game, whether the shot was made or missed, and the location on the floor from which the shot was taken. ImageJ software was used to convert each Shot Chart image to a binary image and to adjust the threshold to minimize background noise. The process left an image full of x’s and o’s to be counted and recorded in an Excel file.

The Play-by-Play record [5] available for each game provides a wealth of data for this research and for future studies. Another Perl script was used to extract each of the Play-by-Play files [5], along with the time remaining on the clock and the current score for each second of each game. MATLAB was used to compile the data and produce multiple visualizations for analysis.
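The step from per-second records to empirical comeback likelihoods can be sketched as follows. This is an illustration in Python rather than the Perl and MATLAB code used in the study. The record layout (seconds remaining, away score, home score), the function name, and the choice to skip tied seconds are all assumptions made for the example.

```python
from collections import defaultdict


def tally_comebacks(games):
    """Estimate comeback likelihood per (seconds remaining, deficit) cell.

    games: list of games, each a list of hypothetical per-second records
           (seconds_remaining, away_score, home_score); the last record
           of a game is taken to hold the final score.
    """
    trials = defaultdict(int)
    comebacks = defaultdict(int)
    for records in games:
        _, final_away, final_home = records[-1]
        if final_away == final_home:
            continue  # assumption: a tied final score means incomplete data
        away_won = final_away > final_home
        for seconds_remaining, away, home in records:
            if away == home:
                continue  # assumption: tied seconds have no trailing team
            cell = (seconds_remaining, abs(away - home))
            trials[cell] += 1
            away_trailing = away < home
            # A comeback occurs when the trailing team is the eventual winner.
            if away_trailing == away_won:
                comebacks[cell] += 1
    return {cell: comebacks[cell] / count for cell, count in trials.items()}


# Two toy games, three recorded seconds each (made-up scores).
toy_games = [
    [(2, 0, 2), (1, 3, 2), (0, 3, 2)],  # away trails by 2, then wins
    [(2, 0, 2), (1, 0, 2), (0, 0, 4)],  # away trails by 2 and loses
]
likelihood = tally_comebacks(toy_games)
```

In the toy data, the cell for 2 seconds remaining and a 2-point deficit has two samples and one comeback, so its estimate is 0.5.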

III. RESULTS

Based on the empirical data, the likelihood of a comeback when a team is down by 10 points decreases from 30.79% at the beginning of the second quarter to 16.20% at the beginning of the third quarter and 11.9% at the beginning of the fourth quarter. The empirical data also show that after the beginning of the second quarter, a 20-point deficit is overcome only a fraction of a percent of the time. The visualization of the empirical data in Fig. 1 has several noticeably unusual features. Numerous overtimes have occurred during the last ten seasons, increasing the number of data points with 500 seconds or less remaining. The dark red regions in Fig. 1 and the extreme peaks in Fig. 2 represent insignificant sample sizes, which were removed from our analysis. Fig. 1 also shows an extensive dark blue region, which can be misleading. The top left area of the chart represents times and point differentials for which there were no empirical data, but the bottom right area of the chart represents situations where the empirical likelihood of a comeback truly is zero.

Fig. 1. Empirical Data in 2-Dimensions [5]

Fig. 2. Empirical Data in 3-Dimensions [5]

It is important to determine a confidence interval, because likelihoods based on empirical data will not produce perfect predictions, only reasonable ones. Fig. 3 illustrates the probability density based on the empirical likelihood of a comeback. The curve represents the distribution of probabilities with 500 seconds remaining and an 8-point differential. The mode of the probability of a comeback is 0.1420, and the 95% confidence interval is from 0.1126 to 0.1757.
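One way such an interval could be obtained is sketched below in Python, under assumptions that are not confirmed by the text: a Beta(k + 1, n - k + 1) density for k comebacks in n samples (which has its mode at k/n) and an equal-tailed 95% interval. The counts in the example are hypothetical, not the study's data, and the function names are illustrative. Only the standard library is used, so the cumulative distribution is integrated numerically.

```python
import math


def beta_pdf(p, a, b):
    """Beta(a, b) density, computed in log space for numerical stability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def beta_cdf(x, a, b, steps=2000):
    """Cumulative probability on [0, x] by composite Simpson integration."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps  # steps must be even for Simpson's rule
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3


def beta_quantile(q, a, b):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def comeback_interval(comebacks, samples, level=0.95):
    """Return (mode, lower, upper) for an equal-tailed interval."""
    if not 0 < comebacks < samples:
        raise ValueError("sketch assumes 0 < comebacks < samples")
    a, b = comebacks + 1, samples - comebacks + 1
    tail = (1 - level) / 2
    return (comebacks / samples,
            beta_quantile(tail, a, b),
            beta_quantile(1 - tail, a, b))


# Hypothetical counts for one (time remaining, deficit) cell.
mode, lower, upper = comeback_interval(comebacks=71, samples=500)
```

The interval is not symmetric about the mode because the Beta density is skewed when the probability is far from 0.5, which is consistent with the asymmetric interval reported above.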

Fig. 3. Beta Distribution Probability of a Comeback

Perhaps one of the most noticeable features of the empirical data is the apparent curve, which suggests that a function might be able to model the data. The 3-dimensional view of the original data (Fig. 2) provided a new perspective and was the key to choosing a function. Data analysis led to the determination that an exponential function may model the data. The model needed to satisfy the extreme cases, meaning both teams would have a 50% likelihood of a comeback at the beginning of the game and a 0% likelihood of a comeback at the end of the game. The original function is shown in (2), where z is the likelihood of a comeback, x is the time remaining in seconds, y is the point differential, R is a coefficient, and C is a constant. The best-fit function (3) was created by performing a nonlinear regression in MATLAB. The command nlinfit returns an estimated coefficient and constant for the nonlinear regression of the responses in Y on the predictors in X using the exponential function specified in (2). The coefficient and constant were estimated using iterative least squares estimation.

$z(x,y) = 0.5\,e^{-Ry/(x+C)}$ (2)

$z(x,y) = 0.5\,e^{-178.3099\,y/(x+457.8600)}$ (3)
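To make (3) concrete, the sketch below evaluates it in Python using the reported coefficient and constant. The second function is a different and simpler technique than the nlinfit regression used in the study. It relies on the fact that (2) can be rearranged to -y / ln(2z) = x/R + C/R, which is linear in x, so an ordinary straight-line fit recovers R and C from points that follow the model. Such a fit might serve as a starting guess for an iterative method. Function names are hypothetical.

```python
import math

R_FIT = 178.3099  # coefficient reported in (3)
C_FIT = 457.8600  # constant reported in (3)


def comeback_model(seconds_remaining, deficit, r=R_FIT, c=C_FIT):
    """Evaluate z(x, y) = 0.5 * exp(-R * y / (x + C))."""
    return 0.5 * math.exp(-r * deficit / (seconds_remaining + c))


def linearized_fit(points):
    """Recover (R, C) from (x, y, z) points by a straight-line fit.

    Uses w = -y / ln(2z) = x / R + C / R, so slope = 1/R and
    intercept = C/R. Requires y > 0 and 0 < z < 0.5 for every point.
    This is NOT the nonlinear least squares fit described in the paper.
    """
    xs = [x for x, _, _ in points]
    ws = [-y / math.log(2.0 * z) for _, y, z in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_w = sum(ws) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxw = sum((x - mean_x) * (w - mean_w) for x, w in zip(xs, ws))
    slope = sxw / sxx
    intercept = mean_w - slope * mean_x
    return 1.0 / slope, intercept / slope


# Start of the fourth quarter: 12 minutes = 720 seconds remaining.
z_down_1 = comeback_model(720, 1)    # about 0.4298
z_down_10 = comeback_model(720, 10)  # about 0.1100

# Synthetic points generated from (3) itself, then refit.
synthetic = [(x, y, comeback_model(x, y))
             for x in (120, 720, 1440, 2160) for y in (2, 5, 10)]
r_est, c_est = linearized_fit(synthetic)
```

The two evaluated values agree with the z(x,y) column of Table 1 for deficits of 1 and 10 points, which also confirms that x is measured in seconds. On real, noisy data the straight-line fit would weight errors differently from a least squares fit on z, so its estimates would not generally match those in (3).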

Fig. 4 is a visual representation of the function (3). It is important to take into consideration the different color scales of the empirical data and the function. The range of the empirical data is 0 to 1, inclusive, while the range of the function is 0 to 0.5, inclusive. The range of the function was chosen on the assumption that the highest possible chance of a comeback at any point during a game is 50%, which occurs when the score is tied. Visualizing the function in three dimensions in Fig. 5 shows a close resemblance to Fig. 2.

Fig. 4. Function in 2-Dimensions

Fig. 5. Function in 3-Dimensions

Fig. 6 overlays the function on the original data. The black region shows the time periods and point differentials for which the function over-predicts the observed likelihood. Where the function underestimates the likelihood, the original data appear as a colored surface on top of the function. Unfortunately, the exponential function does not model the data sufficiently, especially at the critical time periods during a basketball game.

Fig. 6. Graph of the Function and Empirical data [5]

The function (3) makes reasonable predictions of future values based on the empirical data collected, but minimizing the error proved difficult. After attempts to minimize the error, the exponential function gave an RMSD of 0.0920. In other words, across all points in time and all point differentials, the likelihood produced by the function typically deviates from the empirical likelihood by about 0.092, or 9.2 percentage points. Table 1 is a sample of the likelihoods of a comeback at the beginning of the fourth quarter as predicted by the function and the empirical data.

TABLE 1. A Comparison of Sample Likelihoods

Comparison of Comeback Probabilities at the start of the 4th Quarter for Deficits up to 20 Points
Point Differential / Empirical Estimate / z(x,y)
1 / 0.4235 / 0.4298
2 / 0.3975 / 0.3694
3 / 0.3581 / 0.3175
4 / 0.3108 / 0.2729
5 / 0.2531 / 0.2346
6 / 0.2722 / 0.2016
7 / 0.1859 / 0.1733
8 / 0.1522 / 0.1489
9 / 0.1210 / 0.1280
10 / 0.1119 / 0.1100
11 / 0.0882 / 0.0946
12 / 0.0382 / 0.0813
13 / 0.0485 / 0.0699
14 / 0.0260 / 0.0601
15 / 0.0126 / 0.0516
16 / 0.0383 / 0.0444
17 / 0.0221 / 0.0381
18 / 0.0000 / 0.0328
19 / 0.0189 / 0.0282
20 / 0.0066 / 0.0242
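The RMSD calculation can be illustrated on the rows of Table 1. This Python sketch covers only this single slice of the data (the start of the fourth quarter), so its result need not equal the overall RMSD of 0.0920, which was computed over all times and point differentials. The variable and function names are illustrative.

```python
import math

# (point differential, empirical estimate, z(x, y)) copied from Table 1
TABLE_1 = [
    (1, 0.4235, 0.4298), (2, 0.3975, 0.3694), (3, 0.3581, 0.3175),
    (4, 0.3108, 0.2729), (5, 0.2531, 0.2346), (6, 0.2722, 0.2016),
    (7, 0.1859, 0.1733), (8, 0.1522, 0.1489), (9, 0.1210, 0.1280),
    (10, 0.1119, 0.1100), (11, 0.0882, 0.0946), (12, 0.0382, 0.0813),
    (13, 0.0485, 0.0699), (14, 0.0260, 0.0601), (15, 0.0126, 0.0516),
    (16, 0.0383, 0.0444), (17, 0.0221, 0.0381), (18, 0.0000, 0.0328),
    (19, 0.0189, 0.0282), (20, 0.0066, 0.0242),
]


def rmsd(observed, predicted):
    """Root-mean-square deviation between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


empirical = [row[1] for row in TABLE_1]
modeled = [row[2] for row in TABLE_1]
table_rmsd = rmsd(empirical, modeled)

# Row with the largest absolute gap between the data and the function.
largest_gap = max(TABLE_1, key=lambda row: abs(row[1] - row[2]))
```

In this slice, the largest gap between the empirical estimate and the function occurs at a 6-point differential.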

While Table 1 provides a single instant in time, Fig. 7 illustrates the difference in predictions at each instant in time and each point differential. The dark red regions represent large discrepancies between the function and the data. The greatest absolute deviations occur at the beginning and end of each game.

Fig. 7. Absolute Deviation

Fig. 8 displays the regions in which the function (3) predicts a likelihood that falls outside of the 95% confidence interval for the empirical data.

Fig. 8. Points falling outside of confidence interval

IV. CONCLUSION

Many other factors may come into play in predicting the likelihood of a comeback. In Fig. 9, the data are represented in a histogram to illustrate the home-court advantage factor. The data appear to form a normal distribution, as cited in [2]. The original data were extracted as Away Score – Home Score, meaning that a negative point differential indicates the home team was winning. The histogram, built from over 9,000 sample games (some of the original games were rejected due to incomplete play-by-play data), shows that the home team is winning more than half of the time. Home teams win approximately 50-65% of games in most sporting events, and the home advantage is 5-6 points in basketball [1]. The center of the histogram is shifted approximately 5 points to the left of zero, which supports this claim. A natural next step is to add home-team advantage as an additional variable in the analysis.

Although the function reasonably models the data, it predicts future comebacks better at some points in the game than at others, as shown in Table 1, Fig. 7, and Fig. 8. A large percentage of the predictions falling outside of the confidence interval in Fig. 8 appear in the final minutes of the game. The low level of confidence in the predictions made by the function (3) at this crucial time in the game, combined with the RMSD of 0.0920, indicates that the exponential function models the data only moderately well. However, the observed data from 2002-2013 represent only experimental probability, and a better model based on only the two factors of time remaining and point differential may still exist.

Fig. 9. All point differentials for the original data

V. ACKNOWLEDGEMENTS

This study was supported by Dr. Rahman Tashakkori, who offered training and support through the duration of the project. We would like to thank Dr. Mary Beth Searcy for her assistance in data analysis during this study. This research opportunity was made possible by the National Science Foundation’s Research Experience for Teachers Program and Appalachian State University’s Computer Science Department. We would also like to thank Watauga High School in Boone, NC and Lincolnton High School in Lincolnton, NC for continued support of higher education and research opportunities.

VI. REFERENCES

[1] H. S. Stern, “A Brownian Motion Model for the Progress of Sports Scores,” J. Amer. Stat. Assoc., vol. 89, no. 427, pp. 1128-1134, Sep. 1994.

[2] P. S. Gill, “Late-Game Reversals in Professional Basketball, Football, and Hockey,” The Amer. Stat., vol. 54, no. 2, pp. 94-99, May 2000.

[3] B. James. (2008, March 17). The Lead is Safe. [Online]. Available: http://www.slate.com/articles/sports/sports_nut/2008/03/the_lead_is_safe.3.html

[4] (2013, May 29). Root-mean-square deviation. [Online]. Available: http://en.wikipedia.org/wiki/Root_mean_square_deviation

[5] (2013, July 17). ESPN NBA. [Online]. Available: http://espn.go.com/nba/

VII. AUTHORS

Adam Benoit

Adam joined the National Science Foundation’s Research Experience for Teachers (RET) program in 2013, partnering with Appalachian State’s Computer Science Department. Through the RET program, his work focuses on bridging the gap between computer science and the high school curriculum in North Carolina. He conducted research at Princeton University in plasma science applications in 2012, funded by the Department of Energy. He currently teaches AP Physics and Honors Chemistry at Lincolnton High School in Lincolnton, NC, where he has been teaching since 2005. Adam was Teacher of the Year at Lincolnton High School in 2011-2012. Adam is also a member of the North Carolina Association of Educators and the National Science Teachers Association.