Handout 15 – Introduction to Logistic Regression

This handout covers material found in Section 13.8 of your text.
You may also want to review regression techniques in Chapter 11.

These data are taken from the text “Applied Logistic Regression” by Hosmer and Lemeshow. Researchers are interested in the relationship between age and presence or absence of evidence of coronary heart disease (CHD).

How do we develop a parametric model for a dichotomous response like CHD using a person's Age as the covariate? We might try a linear regression model with Age as a predictor and CHD as the response. Before we do this in SAS, consider our linear regression model:

CHDi = η0 + η1·Agei + εi,

where CHDi = 1 if the ith person shows evidence of CHD and CHDi = 0 otherwise; that is, CHDi is a Bernoulli random variable with success probability θ(Agei).

Note that the mean function is given by E(CHDi) = η0 + η1·Agei. As we saw above, we can find the expected value of the Bernoulli random variable as follows:

E(CHDi) = 1·P(CHDi = 1) + 0·P(CHDi = 0) = P(CHDi = 1) = θ(Agei).

Why is this important? It shows that the mean response from the linear regression model is a probability: η0 + η1·Agei = E(CHDi) = θ(Agei) = P(CHDi = 1 | Agei).

That is, the regression line gives an estimate of the probability of having CHD for a given Age.

In SAS… (using file CHD.sas)

proc reg data=CHD;
  model CHD = age;
  plot CHD*age;
run;

Some Problems that Arise Using this Model

1.  Non-normality of the error terms. Only two different error values are possible for each Agei: εi = −θ(Agei) if the response is 0, and εi = 1 − θ(Agei) if the response is 1. The errors therefore cannot be normally distributed.

2.  Non-constant variance of the error terms. Since CHDi is a Bernoulli random variable, we know that Var(CHDi) = θ(Agei) × [1 − θ(Agei)]. This then implies that

Var(εi) = [η0 + η1·Agei] × [1 − (η0 + η1·Agei)].

That is, the variance function varies with Age and is NOT constant.

3.  Constraints on the response function. A linear representation permits estimates or predictions outside the range 0 to 1, which is not correct when modeling probabilities. For example, what is our estimate of θ(Age=20) if we use a linear regression model?


Comment: The constraint that the mean function fall between 0 and 1 frequently rules out a linear response function. For our CHD example, the use of a linear response function might require us to assume a probability of 0 for the mean response for all individuals beneath a certain age and a probability of 1 for all individuals over a certain age (see below). Such a model is often considered unreasonable, however.
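As an illustration of the constraint problem in item 3, here is a minimal R sketch. It assumes the CHD data have been read into a numeric predictor Age and a numeric 0/1 response chd01 (hypothetical names, not objects created elsewhere in this handout):

# Linear probability model: a straight-line fit to a 0/1 response
lin.fit <- lm(chd01 ~ Age)
summary(lin.fit)

# Predicted "probability" at Age = 20 -- nothing in the linear model
# prevents this estimate from falling below 0 (or above 1 at large ages)
predict(lin.fit, newdata = data.frame(Age = 20))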

Ideally, we’d like to find a model where the probabilities 0 and 1 are reached asymptotically. One such model is the logistic regression model.

Recall:

The Simple Logistic Mean Function

We parameterize this model as follows:

θ(x) = exp(η0 + η1·x) / [1 + exp(η0 + η1·x)].

Some examples of simple logistic mean functions are shown below:

[Plots of example logistic mean functions: one panel with η0 = 0 and several values of η1; one panel with η1 = −1 and several values of η0.]

Comments:

1.  The logistic mean function is always between 0 and 1.

2.  As |η1| increases, the S shape becomes more pronounced; that is, the function changes more rapidly in the center.

3.  When η1 is positive, the function is monotone increasing; when η1 is negative, the function is monotone decreasing.

4.  Changing η0 shifts the function horizontally.

5.  The logistic function possesses the property of symmetry. If the response variable is recoded by changing 1s to 0s and 0s to 1s, the signs of all coefficients will be reversed.
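As a quick numerical check of these properties, the following R sketch evaluates the logistic mean function for a few choices of η0 and η1 (no data set is assumed; plogis(q) in base R equals exp(q)/[1 + exp(q)]):

# Logistic mean function: theta(x) = exp(eta0 + eta1*x) / (1 + exp(eta0 + eta1*x))
theta <- function(x, eta0, eta1) plogis(eta0 + eta1 * x)

x <- seq(-6, 6, by = 0.1)

# With eta0 = 0: larger |eta1| gives a steeper S shape;
# a negative eta1 gives a monotone decreasing curve.
plot(x, theta(x, eta0 = 0, eta1 = 0.5), type = "l", ylim = c(0, 1), ylab = "theta(x)")
lines(x, theta(x, eta0 = 0, eta1 = 2), lty = 2)
lines(x, theta(x, eta0 = 0, eta1 = -1), lty = 3)

# Changing eta0 shifts the curve horizontally.
lines(x, theta(x, eta0 = 2, eta1 = 2), lty = 4)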

To fit the logistic regression model in SAS, you can use the following programming statements:

ods html;
ods graphics on;

proc logistic data=CHD descending;
  model CHD = age / link=logit;
  graphics estprob;
run;

ods graphics off;
ods html close;

Questions:

1.  Based on the plot, find

2.  Based on the plot, find

Interpreting the Model Parameters

Mean function:

θ(Age) = exp(η0 + η1·Age) / [1 + exp(η0 + η1·Age)].

Fitted Model Equation (or Fitted Probabilities):

θ̂(Age) = exp(η̂0 + η̂1·Age) / [1 + exp(η̂0 + η̂1·Age)], where η̂0 and η̂1 are the estimates from the output above.

Note that in the mean function, the probabilities θ(Agei) are nonlinear functions of η0 and η1. However, a simple transformation results in a linear model. That is, we can show that the logit (log odds) is linear in Age:

ln{ θ(Age) / [1 − θ(Age)] } = η0 + η1·Age.

Proof of the previous claim:
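A sketch of the algebra: since θ(Age) = exp(η0 + η1·Age) / [1 + exp(η0 + η1·Age)],

1 − θ(Age) = 1 / [1 + exp(η0 + η1·Age)],

so

θ(Age) / [1 − θ(Age)] = exp(η0 + η1·Age),

and taking logarithms of both sides gives

ln{ θ(Age) / [1 − θ(Age)] } = η0 + η1·Age.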

Fitting the Model in JMP

Select Analyze > Fit Y by X and place CHD (y/n) in the Y box and age in the X box.

The resulting output is shown below. Because the response is a dichotomous categorical variable, logistic regression is performed.

Example:

Interpretation of Model Parameters

P(CHD = 1 | Age) = θ(Age) = exp(η0 + η1·Age) / [1 + exp(η0 + η1·Age)]

Odds for Success:

θ(Age) / [1 − θ(Age)] = exp(η0 + η1·Age)

Thus the log odds, ln{ θ(Age) / [1 − θ(Age)] } = η0 + η1·Age, is linear in Age: η1 is the change in the log odds of CHD associated with a one-year increase in Age, and η0 is the log odds of CHD when Age = 0.

Suppose we contrast individuals who are Age = x to those who are Age = x + c. What can we say about the increased risk associated with a c-year increase in age? The logistic model gives us a means to do this through the odds ratio (OR):

ln(OR) = ln[ odds(x + c) / odds(x) ] = [η0 + η1·(x + c)] − [η0 + η1·x] = c·η1.

Exponentiating both sides gives

OR = exp(c·η1).

Thus the multiplicative increase (or decrease, if η1 < 0) in odds associated with a c-year increase in age is exp(c·η1).

Example: Interpreting a c-year increase in age.

Question: Is it reasonable to assume that the effect of a c-unit increase in a continuous predictor is constant regardless of starting point? For example, does the risk associated with a 5-year increase in age remain constant throughout one's life?

Statistical Inference for the Logistic Regression Model

Given estimates of the model parameters and their estimated standard errors, what types of statistical inferences can be made? One approach is to use the normal-theory-based methods outlined below.

Hypothesis Testing

For testing H0: η1 = 0 versus HA: η1 ≠ 0, we can use the large-sample (Wald) test for the significance of the "slope" parameter:

z = η̂1 / SE(η̂1),

which is compared to the standard normal distribution.

Confidence Intervals for Parameters and Corresponding OR’s

For dichotomous categorical predictors (i.e., 0/1 predictors):

CI for η1:  η̂1 ± z(1 − α/2) · SE(η̂1)

CI for the OR associated with η1:  exp[ η̂1 ± z(1 − α/2) · SE(η̂1) ]

If η1 corresponds to a continuous predictor and we wish to examine the OR associated with a c-unit increase, the CI for the OR becomes

exp[ c·η̂1 ± z(1 − α/2) · c · SE(η̂1) ].

Oftentimes categorical predictors have more than two levels; we will see how to handle that case later in the notes.

Example:

What is the OR for CHD associated with a 10-year increase in age? Give a 95% confidence interval for this OR.
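A sketch of this computation in R, assuming the continuous-age model has been fit as chd.glm with a coefficient named Age (as in the R output near the end of this handout):

# OR for a c-year increase in age: exp(c * eta1_hat)
delta <- 10
est <- coef(chd.glm)["Age"]
se  <- sqrt(vcov(chd.glm)["Age", "Age"])

exp(delta * est)                                         # estimated OR
exp(delta * est + c(-1, 1) * qnorm(0.975) * delta * se)  # 95% Wald CI for the OR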

Some Mathematical Details: Estimation of the Model Parameters

In a linear regression analysis, the regression coefficients are estimated based on the least squares method. That is, the estimates are obtained by minimizing the sum of the squared residuals. In a logistic regression analysis, the model parameters are estimated through a process called the maximum likelihood method. The basic principle of maximum likelihood is to choose as estimates those parameter values which, if true, would maximize the probability of observing what we have actually observed. This involves:

1.  Finding an expression (i.e., the likelihood function) for the probability of the data as a function of the unknown parameters.

For the logistic model, the binary response variable is assumed to follow a binomial distribution with a single trial (n = 1) and probability of "success" equal to θ(xi). Therefore, for the ith observed pair (xi, yi), the contribution to the likelihood is

θ(xi)^yi · [1 − θ(xi)]^(1 − yi),

where

θ(xi) = exp(η0 + η1·xi) / [1 + exp(η0 + η1·xi)]   and   yi = 0 or 1.

Then, since we assume independence across observations, the likelihood function is given by

L(η0, η1) = ∏(i = 1 to n) θ(xi)^yi · [1 − θ(xi)]^(1 − yi).

2.  Finding the values of the unknown parameters which make the value of this expression as large as possible.

For computational purposes it is usually easier to maximize the logarithm of the likelihood function rather than the likelihood function itself. This works because the logarithm is a monotonic increasing function; therefore, the maximizing parameter values are the same for the likelihood and log-likelihood functions. The log-likelihood function is given by

ln L(η0, η1) = Σ(i = 1 to n) { yi·ln θ(xi) + (1 − yi)·ln[1 − θ(xi)] }.

To find the parameter estimates, we solve simultaneously the equations obtained by setting the partial derivatives with respect to each parameter equal to 0:

∂ ln L/∂η0 = Σ(i = 1 to n) [yi − θ(xi)] = 0   and   ∂ ln L/∂η1 = Σ(i = 1 to n) xi·[yi − θ(xi)] = 0.

Several different nonlinear optimization routines are used to find solutions to such systems. This process gets increasingly computationally intensive as the number of terms in the model increases.
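To make the idea concrete, here is a minimal R sketch of maximizing the log-likelihood numerically (not the algorithm any particular package uses); it assumes numeric vectors x and y holding the predictor and the 0/1 response:

# Negative log-likelihood for the simple logistic model
negloglik <- function(eta, x, y) {
  theta <- plogis(eta[1] + eta[2] * x)            # theta(x_i)
  -sum(y * log(theta) + (1 - y) * log(1 - theta))
}

# Numerical minimization of the negative log-likelihood
fit <- optim(par = c(0, 0), fn = negloglik, x = x, y = y, hessian = TRUE)
fit$par                          # MLEs of eta0 and eta1
sqrt(diag(solve(fit$hessian)))   # approximate standard errors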

Example: Estimating Model Parameters with a Single Dichotomous Predictor


CHD and Indicator of Age Over 55

Computed using the standard 2 × 2 table approach:

Logistic Model

There are two different ways to code dichotomous variables: (0,1) coding or (−1,+1) (i.e., contrast) coding. JMP uses contrast coding by default, whereas other packages generally use (0,1) coding by default. The two coding schemes are shown below.

(0,1) coding:      Age 55+ = 1, otherwise 0
Contrast coding:   Age 55+ = +1, otherwise −1

For the purposes of discussion we will consider the (0,1) coding.

Recall

θ(x) = exp(η0 + η1·x) / [1 + exp(η0 + η1·x)];

where x = the Age 55+ indicator ((0,1) coding), we have the following:

             Age ≥ 55 (x = 1)                          Age < 55 (x = 0)
CHD = 1      θ(1) = exp(η0 + η1)/[1 + exp(η0 + η1)]    θ(0) = exp(η0)/[1 + exp(η0)]
CHD = 0      1 − θ(1) = 1/[1 + exp(η0 + η1)]           1 − θ(0) = 1/[1 + exp(η0)]

Estimating the model parameters "by hand":

odds(x = 1) = θ(1)/[1 − θ(1)] = exp(η0 + η1)   and   odds(x = 0) = θ(0)/[1 − θ(0)] = exp(η0),

so

OR = odds(x = 1) / odds(x = 0) = exp(η1).

Therefore η̂0 = ln[observed odds of CHD when x = 0] and η̂1 = ln(OR̂), where OR̂ is the usual cross-product ratio from the 2 × 2 table.
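As a numerical check, a short R sketch using the 2 × 2 counts for these data (the same counts appear in the R table() output later in this handout):

#            Over55 = 0   Over55 = 1
# CHD = 0        51            6
# CHD = 1        22           21

eta0_hat <- log(22 / 51)                 # log odds of CHD when x = 0
eta1_hat <- log((21 / 6) / (22 / 51))    # log odds ratio
OR_hat   <- exp(eta1_hat)                # = (21 * 51) / (6 * 22)

eta0_hat; eta1_hat; OR_hat               # agree with the glm() estimates shown later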

EXAMPLE: Using the data in the file CHD.sas, we will create a dummy variable indicating whether or not the subject is age 55 or over. Then, instead of examining the relationship between the CONTINUOUS variable age and the presence or absence of evidence of coronary heart disease (CHD), we can consider the dichotomous predictor:

data CHD;
  input ageGrp$ age CHD;
  if (age ge 55) then Over55=1; else Over55=0;
  datalines;
1 20 0
. . .
. . .

proc sort data=CHD;
  by descending CHD descending Over55;
run;

proc freq order=data;
  tables CHD*Over55 / all;
run;

Standard OR output from SAS:

Using PROC LOGISTIC to fit the model:

proc logistic data=CHD descending;
  model CHD = Over55 / link=logit;
  output out=probs predicted=predicted_probabilities;
run;

Questions:

1.  Use the model parameters to predict the probability of having CHD for a person who is 55 or over and for a person who is younger than 55.

2.  Given only the estimates of the model parameters, could you find the odds ratio for having CHD associated with being 55 or over?
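A sketch of these calculations in R, using the (0,1)-coding estimates (the objects eta0_hat and eta1_hat from the "by hand" sketch above):

# Fitted probability of CHD for a given value of the Over55 indicator
theta_hat <- function(x) plogis(eta0_hat + eta1_hat * x)

theta_hat(1)    # predicted P(CHD = 1) for a person age 55 or over
theta_hat(0)    # predicted P(CHD = 1) for a person younger than 55

exp(eta1_hat)   # odds ratio for CHD associated with being 55 or over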

Verify these values from the SAS output:

proc print data=probs;
run;

Analysis in JMP (CHD55.JMP)

To fit a logistic regression model, it is best to use the Analyze > Fit Model option.

We place CHD (1 = Yes, 2 = No) in the Y box and Age > 55 (1 = Yes, 2 = No) in the model effects box. The key is to have "Yes" for both the risk factor and the disease sort alphanumerically before "No"; hence the use of 1 for "Yes" and 2 for "No".

The summary of the fitted logistic model is shown below. Notice that the parameter estimates are not the same as those obtained from SAS. This is because JMP uses contrast coding for the Age > 55 predictor (+1 = Age ≥ 55 and −1 = Age < 55).
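For reference, the two parameterizations are related as follows. If the contrast code is z = 2x − 1, where x is the (0,1) code, then

η0 + η1·z = (η0 − η1) + 2·η1·x,

so the (0,1)-coding intercept equals (contrast intercept − contrast slope), and the (0,1)-coding slope equals twice the contrast slope.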

OR’s and Fitted Probabilities

Using JMP to Compute OR’s, CI’s, and Fitted Probabilities

Because the disease and the risk factor are alphanumerically ordered, the ORs are correct as given.

By selecting Save Probability Formula we can save the fitted probabilities to the spreadsheet.

Example: CHD and Age Over 55 in R

> Over55

[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[53] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Levels: 0 1

> chd

[1] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1 1

[53] 0 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1

Levels: 0 1

> table(chd,Over55)

Over55

chd 0 1

0 51 6

1 22 21

> chd55 = glm(chd~Over55,family="binomial")

> summary(chd55)

Call:

glm(formula = chd ~ Over55, family = "binomial")

Deviance Residuals:

Min 1Q Median 3Q Max

-1.734 -0.847 -0.847 0.709 1.549

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.8408 0.2551 -3.296 0.00098 ***

Over55 2.0935 0.5285 3.961 7.46e-05 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 136.66 on 99 degrees of freedom

Residual deviance: 117.96 on 98 degrees of freedom

AIC: 121.96

Number of Fisher Scoring iterations: 4

How do we measure discrepancy between observed and fitted values?


In OLS regression with a continuous response we used the residual (error) sum of squares,

SSE = Σ(i = 1 to n) (yi − ŷi)².

In logistic regression modeling we can use the deviance (typically denoted D or G²), which is defined as

D = −2 ln[ (likelihood of the fitted model) / (likelihood of the saturated model) ].

Because the likelihood function of the saturated model is equal to 1 when the response (yi) is 0 or 1, the deviance reduces to

D = −2 ln(likelihood of the fitted model)
  = −2 Σ(i = 1 to n) { yi·ln θ(xi) + (1 − yi)·ln[1 − θ(xi)] }.
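As a quick check in R, using the chd55 fit from above (a sketch; for 0/1 responses the residual deviance is just −2 times the maximized log-likelihood):

# Deviance of the fitted model
-2 * as.numeric(logLik(chd55))
deviance(chd55)    # matches the residual deviance reported by summary(chd55)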

The deviance can be used to compare two potential models, where one model is nested within the other, by using the "General Chi-Square Test" for comparing rival logistic regression models. We will see more applications of this when we discuss multiple logistic regression and model development; however, we demonstrate the process below when considering a single predictor x.

The general nested model concept:

General Chi-Square Test

Consider comparing two rival models in which the null hypothesis (reduced) model is nested within the alternative hypothesis (full) model.

General Chi-Square Statistic

χ² = (residual deviance of reduced model) − (residual deviance of full model)

   = D(reduced) − D(full)

If the full model is needed, χ² is BIG and the associated p-value = P(χ²(df) ≥ observed χ²) is small, where df equals the number of additional parameters in the full model.
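In R this comparison can be carried out directly from the residual deviances or with anova(); a sketch, assuming nested glm fits named chd.reduced and chd.full (hypothetical names):

# General chi-square statistic and p-value
chisq <- deviance(chd.reduced) - deviance(chd.full)
df    <- df.residual(chd.reduced) - df.residual(chd.full)
1 - pchisq(chisq, df)

# Equivalently, the likelihood ratio (deviance) test:
anova(chd.reduced, chd.full, test = "Chisq")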

Example: CHD and Age ~ a single numeric predictor

From JMP

From R

> summary(chd.glm)

Call:

glm(formula = chd ~ Age, family = "binomial")

Deviance Residuals:

Min 1Q Median 3Q Max

-1.9718 -0.8456 -0.4576 0.8253 2.2859