5.1
5.Building and applying logistic regression models
We will start with Section 5.2 and then go back for Section 5.1.
5.2Model Checking
Determine how well the model fits the data.
Explanatory variable pattern (EVP) form
In the placekicking data set (consider ONLY distance and good1 variables), each of the 1,425 rows represented one observation. For example, the first observation was a placekick attempted at 21 yards. The “good1” variable represents a Bernoulli observation – 0 for a failure and 1 for a success. When examining goodness-of-fit measures and residuals, it is often better to examine the data in “explanatory variable pattern” form. This format has one row in the data set for each of the unique sets of explanatory variables. For example, there are 20 observations with distance=21 in the placekicking data set. Of the 20, 19 are successes. Therefore, the response variable is now binomial instead of Bernoulli.
Hosmer and Lemeshow (2000) call explanatory variable pattern form “covariate pattern form”.
Example: Placekicking (placekick_ch5_pattern.R, place.s.csv)
R code:
library(nlme)
place.small<-data.frame(good=place.s$good1, dist=place.s$dist)
place.sum<-gsummary(object = place.small, FUN = sum, groups=place.small$dist)
place.length<-gsummary(object = place.small, FUN = length, groups=place.small$dist)
place.pattern <- data.frame(y = place.sum$good, fail =
place.length$good -place.sum$good, n = place.length$good,
distance = place.sum$dist)
The explanatory variable pattern form of the data set is below. Y is the observed binomial value and n is the total number of trials.
> place.pattern
y fail n distance
1 2 1 3 18
2 7 0 7 19
3 776 13 789 20
4 19 1 20 21
5 12 2 14 22
6 26 1 27 23
7 7 0 7 24
8 12 1 13 25
9 8 1 9 26
10 24 5 29 27
11 20 2 22 28
12 16 1 17 29
13 12 2 14 30
14 10 1 11 31
15 23 7 30 32
16 20 1 21 33
17 16 3 19 34
18 12 2 14 35
19 18 4 22 36
20 22 7 29 37
21 23 5 28 38
22 22 6 28 39
23 13 6 19 40
24 7 3 10 41
25 20 7 27 42
26 12 10 22 43
27 13 5 18 44
28 11 7 18 45
29 11 3 14 46
30 14 9 23 47
31 10 8 18 48
32 8 4 12 49
33 10 9 19 50
34 11 4 15 51
35 5 8 13 52
36 5 4 9 53
37 1 6 7 54
38 2 1 3 55
39 1 0 1 56
40 1 0 1 59
41 0 1 1 62
42 0 1 1 63
43 0 1 1 66
Think of this as a 432 contingency table of responses. The 43 is the number of different distances and the 2 is the number of possible outcomes for the placekick.
The logistic regression model can be fit to the data set in this form also. Below is the glm() function used to fit the model. Notice the response variable used is y/n and the weight option specifies n. The exact same estimates for and result. Notice the different values for the “null deviance” and the “residual deviance”.
> mod.fit <- glm(formula = y/n ~ distance, data = place.pattern, weight = n,family = binomial(link = logit), na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 47.12782 Iterations - 1
Deviance = 44.52325 Iterations - 2
Deviance = 44.49945 Iterations - 3
Deviance = 44.49945 Iterations - 4
> summary(mod.fit)
Call:
glm(formula = y/n ~ distance, family = binomial(link = logit), data = place.pattern, weights = n, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0373 -0.6449 -0.1424 0.5004 2.2758
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.812080 0.326245 17.82 <2e-16 ***
distance -0.115027 0.008338 -13.80 <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 282.181 on 42 degrees of freedom
Residual deviance: 44.499 on 41 degrees of freedom
AIC: 148.46
Number of Fisher Scoring iterations: 4
Pearson residuals and measures for goodness of fit
A simple measure of goodness of fit is to plot the estimated proportions versus the explanatory variable. The plotting point can be made to be proportional to the sample size for each explanatory variable pattern. See Chapter 4 for an illustration.
In the placekicking example, the explanatory variable pattern form of the data creates a 432 contingency of responses. The residuals for each cell can be found. The residuals result from the observe cell count minus the estimated cell count using the model fit.
For example, the predicted probability at distance=21 is
For 20 placekicks, the estimated number that will be successful is 0.967620 = 19.35. The estimated number of failures is 20-19.35 = 0.65.
The residual for the successes is 19 – 19.35 = -0.35 and the residual for the failures is 1 – 0.65 = 0.35
Instead of using both columns (success and failure) of the table, only the success column is used[CB1].
As defined in Chapter 4, the Pearson residual[b2]is
In logistic regression with a binomial response, the Pearson residual is
where yj denotes the number of successes for the jth explanatory variable pattern, nj denotes the number of trials for the jth explanatory variable pattern, and is the estimated probability of success for the jth explanatory variable pattern.
Be very careful with the notation here. Agresti (2007) uses a little bit different notation on p. 148. The subscript j here is used to help differentiate the explanatory variable pattern form (binomial) version of the data set from the Bernoulli form of the data set. Hosmer and Lemeshow (2000) use a similar form of notation.
For the distance of 21 yards,
Why should the Pearson residual be defined as shown here instead of defined using the Bernoulli form of the data set?
In order to find outliers, one would like to use the standard normal approximation to the distribution of the Pearson residual. One could conclude outliers are outside of -1.96 or 1.96 (for 95% confidence) or
-2.576 to 2.576 (for 99% confidence). This works fine provided a moderate to large sample size is available for each explanatory variable pattern. With the Bernoulli data, the sample size is 1 trial for each residual.
With the explanatory variable pattern form (binomial) version of the data, there is at least a possibility the sample size (nj)[CB3] may be large enough for the normal approximation to work. Unfortunately with a “continuous” explanatory variable, this will usually not be the case. Therefore, one needs be careful with interpreting Pearson residuals.
Pearson residuals outside of 2.576 may need to be investigated for being potential outliers. However, care needs to be used in saying which explanatory variable patterns are actually detrimental to the model.
In the derivation of the Pearson residual, the estimated standard deviation of yj is used in the denominator. This derivation treats as a constant, not as a random variable (which it is). The standardized residual treats the as a random variable which allows for it to be more closely approximated by a standard normal distribution. The standardized residual is:
where hj is the jth diagonal value from the hat matrix[CB4].
Let X be the matrix of the explanatory variable patterns with 1’s in the first column. Create a diagonal matrix, , with diagonal elements of in the same order as the corresponding explanatory variable patterns listed in X. The hat matrix is H = X(XX)-1X. Note that this is similar to the hat matrix used when fitting a regression model by weighted least squares.
From Chapter 2: The Pearson residual and the standardized residual are the equivalent of semistudentized residuals and studentized residuals discussed in a regression analysis course (see my Section 10.1-10.3 lecture notes from STAT 870 at
schedule.htm).
An overall measure of goodness of fit is the sum of the Pearson residuals squared. This is called a Pearson statistic:
where J is the number of explanatory variable patterns. Provided each nj is large, this statistic can be approximated by a distribution where k+ 1 is the # of parameters estimating.
The Pearson statistic is testing the following hypotheses
Ho: logit() = + 1x1 + … + kxk (k+1 parameters)
Ha: Saturated model (J parameters)
The “saturated” model is defined as a model where a parameter is estimated for EACH explanatory variable pattern (J different parameters). This means j is estimated by the sample proportion, yj/nj. How could you write the saturated model here?[b5]
The LRT can also be used to test for goodness of fit as well. The test statistic used is
.
Compare this to past presentations of the LRT. Note that you will generally NOT get the same value when the Bernoulli format of the data (yi = 0 or 1) is used. Provided each nj is large, this statistic can be approximated by a distribution. Obviously when there is a continuous explanatory variable, this 2 approximation may be poor. See Simonoff (American Statistician, 1998) for more information.
This statistic is often called the “deviance”. It is called this because the statistic measures how much the observed data “deviates” from the model’s fit.
Additional notes about the “deviance”
- Deviance residual: This measures the contribution of the jthexplanatory variable pattern to the G2. It can be calculated using the resid(object = model fit object, type=”deviance”) function in R. Also, R gives a summary of these in the summary(mod.fit) output. We will not discuss the residual here due to time restrictions, but the additional Chapter 5 notes give some details on them.
- The change in G2 when an explanatory variable pattern is removed from the data set can also be calculated. The statistic is the squared deviance residual divided by (1-hj). This measure will not be discussed here.
Measures of influence
For a review of influence measures used in regression analysis, see my Section 10.1-10.3 and 10.4-10.6 notes for the UNL STAT 870 course. These are accessible from
To measure the influence of a particular explanatory variable pattern, a model could be refit without this explanatory variable pattern and X2 could be recalculated. The amount of change from the Pearson statistic with all J explanatory variable patterns and with just J-1 would be of interest. Large values would indicate that the particular explanatory variable pattern is “influential”.
As in regression analysis, only one model actually needs to be fit and the diagonal of the hat matrix is used to calculate the statistic of interest. The statistic that measures the change in X2 is (squared standardized residual)! Thus, the standardized residual plays two different roles - it detects outliers and influential explanatory variable patterns. Because of this relationship, I will often just examine instead of . Since a square of standard normal random variable is a random variable, values of greater than or may indicate an outlier or influential EVP.
A measure similar to theCook’s distance in [CB6]regression analysis can also be calculated. The measure is
Notice a subscript j is used for the measure to indicate one value is calculated for EACH explanatory variable pattern. Thus, is an overall measure of influence on the in the model. Note that these are the same as what Agresti (2007) calls “confidence interval displacement diagnostic”[b7] on p. 150.
Note that there is not a nice distributional approximation for as there is for . “Large” values indicate an explanatory variable pattern may be influential. To assess how much an effect it has on the , the pattern should be TEMPORARILYremoved from the data set and the model refit. The percent change in the can be calculated as an assessment of influence.
Page 176 of Hosmer and Lemeshow (2000) has a nice discussion on likely values for these influence measures.[CB8]
Diagnostic plots
Plots can be used to simultaneously examine all of the standardized residuals and ’s. These can help when deciding which observations are possible outliers or influential. Below are the descriptions of a few different plots:
When there is only one explanatory variable, plotting , , and/or versus the explanatory variable is often helpful. Obviously, it would not be helpful if the explanatory variable is binary!
A simple plot is the ,, and/or versus the observation number.
Plot the and versus the estimated probabilities.
The previous plot can also be done with the plotting point proportional to nj.
A plot of versus the estimated probabilities with the plotting point proportional to can be helpful as a way to combine the different influence measures.
In all of these plots using or , horizontal lines at Z0.975=1.96 and Z0.995=2.576 when a standard normal is appropriate and=3.84 and when a chi-square is appropriate are helpful in order to view the standard normal or 2 approximation.
The third, fourth, and fifth plots listed above are especially helpful to diagnose if an explanatory variable pattern is “really” something to be concerned about. For example, some squared standardized residuals will often be larger than 3.84 when is small or large. This may be due to the binary nature of the data.
Suppose =0.9 for an explanatory variable pattern with 1 success out of 3 trials. Then Pearson residual is:
.
Is this explanatory variable pattern really an outlier or just a product of the data being binary? If the model fits well for other explanatory variable patterns with similar explanatory variable values, then it probably is just a product of the data being binary.
Example: Placekicking (placekick_ch5_pattern.R, place.s.csv, examine_residuals.R)
To help automate investigating the residuals, I have written a function called examine.resid(). This function produces the following plots:
- Pearson residual vs. EVP number(in the explanatory variable pattern data set)
- Standardized residual vs. EVP number
- Standardized residual vs. predicted probabilities
- Standardized residual vs. predicted probabilities with the plotting point proportional to nj.
- vs. EVP number
- vs. predicted probabilities
- Standardized residual vs. predicted probabilities with the plotting point proportional to .
The function also calculates the Pearson and LRT (to be discussed later) statistics.
The function serves as an excellent example of how R can be customized for your needs! The results of the function will be first discussed and then the code in the function will be discussed.
> mod.fit <- glm(formula = y/n ~ distance, data = place.pattern, weight = n,family = binomial(link = logit), na.action = na.exclude, control =
list(epsilon = 0.0001, maxit = 50, trace = T))
GLM linear loop 1: deviance = 173.2089
GLM linear loop 2: deviance = 60.7265
GLM linear loop 3: deviance = 45.1433
GLM linear loop 4: deviance = 44.5011
GLM linear loop 5: deviance = 44.4994
> source("C:/chris/unl/STAT875/Chapter5/
examine_residuals.R")#Change location on hard drive
for your own computer
> save <- examine.resid(mod.fit)
The Pearson statistic is 56.1136 with p-value = 0.0581
The G^2 is 44.4994 with p-value = 0.3267
> names(save)
[1] "h" "pearson" "sq.stand.resid" "delta.beta" "pear.stat" "dev"
The p-value for the Pearson statistic indicates the model may not fit the data well. Since there are many explanatory variable patterns with small sample sizes, the chi-square approximation may not be appropriate.
There are a few EVPs that have standardized residuals outside of 1.96. The versus the estimated probabilities plots help indicate why this is happening. Many of the EVPs occur when the estimated probability of success is large. Since there is only one explanatory variable, a plot of the standardized residuals versus the distance of the placekick can also be constructed:
The code to construct this plot is similar to the code used in examine_residuals.R. The explanatory variable patterns that are possible outliers are at 18, 20, 22, 27, 32, and 51 yards. Below are the observed values for these explanatory variable patterns:
> data.frame(place.pattern,
pi.hat = round(mod.fit$fitted.values,2),
stand.resid = round(stand.resid,2))
y fail n distance pi.hat stand.resid
1 2 1 3 18 0.98 -3.58
2 7 0 7 19 0.97 0.43
3 776 13 789 20 0.97 3.63
4 19 1 20 21 0.97 -0.45
5 12 2 14 22 0.96 -2.15
6 26 1 27 23 0.96 0.09
7 7 0 7 24 0.95 0.58
8 12 1 13 25 0.95 -0.44
9 8 1 9 26 0.94 -0.72
10 24 5 29 27 0.94 -2.48
11 20 2 22 28 0.93 -0.39
12 16 1 17 29 0.92 0.29
13 12 2 14 30 0.91 -0.76
14 10 1 11 31 0.90 0.05
15 23 7 30 32 0.89 -2.30
16 20 1 21 33 0.88 1.01
17 16 3 19 34 0.87 -0.37
18 12 2 14 35 0.86 0.01
19 18 4 22 36 0.84 -0.31
20 22 7 29 37 0.83 -0.97
21 23 5 28 38 0.81 0.18
22 22 6 28 39 0.79 -0.06
23 13 6 19 40 0.77 -0.91
24 7 3 10 41 0.75 -0.36
25 20 7 27 42 0.73 0.16
26 12 10 22 43 0.70 -1.67
27 13 5 18 44 0.68 0.40
28 11 7 18 45 0.65 -0.39
29 11 3 14 46 0.63 1.25
30 14 9 23 47 0.60 0.09
31 10 8 18 48 0.57 -0.15
32 8 4 12 49 0.54 0.88
33 10 9 19 50 0.52 0.10
34 11 4 15 51 0.49 2.00
35 5 8 13 52 0.46 -0.55
36 5 4 9 53 0.43 0.79
37 1 6 7 54 0.40 -1.43
38 2 1 3 55 0.37 1.06
39 1 0 1 56 0.35 1.37
40 1 0 1 59 0.27 1.63
41 0 1 1 62 0.21 -0.52
42 0 1 1 63 0.19 -0.49
43 0 1 1 66 0.14 -0.41
The 18 and 22 yard placekicks being potential outliers is partially attributable to the data being binary – especially the 18 yard kick. The 22, 27, 32, and 51 yard standardized residuals are not too far past their borderlines. The main concern for this data set is the 20 yard placekicks. Why do you think the model is not fitting the data well at this distance?[CB9]
The plots are indicating that explanatory variable pattern #3 is very influential. This pattern contains all 789 PATs. Since this is about ½ of the total placekicks, it is to be expected that the pattern is influential. Unfortunately, the large distorts the plot. No remedial action is taken here since a model which incorporates more explanatory variables will be discussed later.
Another good plot to examine when there is one explanatory variable is shown below. This plot was constructed in Chapter 4. Remember that the very large distance placekicks do not have many observations. Compare this plot to the table of observations given previously.
Below is the code in examine_residuals.R. I recommend downloading the program from the course website instead of copying and pasting what is below into R. A number of students have tried the copying and pasting and then had many problems running the program.
############################################################
# NAME: Chris Bilder #
# DATE: #
# Purpose: Automate part of the logistic regression #
# diagnostic procedures into one function. #
# #
# NOTES: This function needs to be run just once before it #
# used. #
# One only needs to call the function with the model #
# fit object. For example, #
# #
# mod.fit<-glm(formula = y/n ~ distance, data = #
# place.pattern, weight=n, family = binomial(link#
# = logit), control = list(epsilon = 0.0001, #
# maxit = 50, trace = T)) #
# examine.resid(mod.fit) #
# #
# Note that the data needs to be in explanatory variable #
# pattern form. #
#############################################################
examine.resid<-function(mod.fit.obj=mod.fit){
h<-influence(mod.fit.obj)$h
pearson<-resid(mod.fit.obj, type="pearson")
stand.resid<-pearson/sqrt(1-h)
sq.stand.resid <-stand.resid^2
pred<-mod.fit.obj$fitted.values
n<-mod.fit.obj$prior.weights
df<-mod.fit.obj$df.residual
delta.beta<-sq.stand.resid *h/(1-h)
#Pearson statisic
pear.stat<-sum(pearson^2)
cat("The Pearson statistic is", round(pear.stat,4), "with p-
value =", round(1-pchisq(pear.stat, df),4))
#Put in a blank line
cat("\n")
dev<-summary(mod.fit.obj)$deviance
cat("The G^2 is", round(dev, 4), "with p-value =", round(1-
pchisq(dev, df),4), "\n")
#Open a new plotting window
win.graph(width = 6, height = 6, pointsize = 10)
#Split plot into four parts
par(mfrow=c(2,2))
#Pearson residual vs observation number plot
plot(x = 1:length(pearson), y = pearson, xlab="j
(explanatory variable pattern number)", ylab="Pearson
residuals", main = "Pearson residuals vs. j")
abline(h = c(qnorm(0.975), qnorm(0.025)), lty = 3, col =
"blue")
#Standardized residual vs observation number plot
plot(x = 1:length(stand.resid), y = stand.resid, xlab="j
(explanatory variable pattern number)", ylab=
“Standardized residuals", main ="Standardized residuals