SW 983 - LOGISTIC REGRESSION

Context

Logistic regression (LR) is used to predict whether or not an event will occur and to identify the variables useful in making the prediction. Events of interest to social work include exits or discharges from some treatment or intervention (e.g., reunification from foster care, hospital discharge, high school graduation) and recidivism of various types (e.g., return to foster care, hospital readmission, completion of GED).

The situation here is directly analogous to our previous use of OLS multiple regression (MR) for prediction. The need for a new statistical technique arises from the fact that the dependent variable here is categorical, not continuous. When the dependent variable can have only two values, the assumptions necessary for hypothesis testing in regression analysis are necessarily violated. For example, it is unreasonable to assume that the distribution of errors is normal. Another difficulty with multiple regression analysis is that predicted values cannot be interpreted as probabilities, because they are not constrained to fall in the interval between 0 and 1.
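To illustrate the last point, here is a minimal sketch that fits an ordinary least squares line to a 0/1 outcome and then predicts at a few values of the predictor. The data are hypothetical and chosen only for illustration; the helper name ols_predict is likewise an assumption, not part of any package.

```python
# Hypothetical data: x is a single predictor, y is a 0/1 (dummy) outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares slope and intercept for one predictor.
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum_xy / sum_xx
intercept = mean_y - slope * mean_x


def ols_predict(value):
    """Predicted Y from the fitted straight line."""
    return intercept + slope * value


# Predictions at the low and high ends fall outside the 0-1 interval,
# so they cannot be read as probabilities.
for value in (0, 4.5, 10):
    print(value, round(ols_predict(value), 3))
```

With these made-up data the fitted line gives a negative prediction at x = 0 and a prediction above 1 at x = 10.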

Logistic regression is used in situations similar to those where discriminant analysis (DA) could be used. Logistic regression is favored for two reasons. First, it provides more interpretable information about the relative contribution of predictor variables. Second, it requires fewer assumptions about the underlying distributions of the independent variables and is less sensitive to extremely skewed distributions of the dependent variable. When the dependent variable is skewed beyond an 80-20 split, logistic parameters provide consistently better estimates.

Analysis and Interpretation

Commands for executing logistic analysis follow the same logic as MR. The output is also similar; however, logistic regression also provides a classification table of the same type as that found in discriminant analysis (DA) output.

In logistic analysis the predicted values (Y’) take on values between 0 and 1. These predicted values are interpreted as the probability of being in the category designated by the value “1” in the dummy dependent variable.
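The handout does not write out the formula behind these predicted values. In the standard logistic model the predicted probability is 1 / (1 + e^-(B0 + B1*X1 + ...)), which always lies between 0 and 1. A minimal sketch, assuming hypothetical coefficients (the function name and the numbers below are illustrative only):

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Return P(Y = 1) under the standard logistic model."""
    # Linear combination of predictors (the "logit" or log odds).
    logit = intercept + sum(b * x for b, x in zip(coefficients, values))
    # The logistic function maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical intercept and coefficients for two predictors.
b0 = -2.0
b = [0.8, -0.5]

for case in ([0, 0], [3, 1], [10, 0]):
    print(case, round(predicted_probability(b0, b, case), 3))
```

However extreme the predictor values, the result stays strictly between 0 and 1, which is what allows it to be read as a probability.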

Output from LR includes an overall test of the significance or “fit” of the regression model. The test statistic is the chi-square statistic.

Coefficients for each of the predictor (independent) variables are also provided, along with a test of the statistical significance of each.

Each coefficient can also be converted to an “odds ratio,” which has a convenient interpretation: it is the multiplicative change in the odds of the event associated with a unit change in a predictor (e.g., X1), controlling for all other variables in the model.

Let P = Prob (event) = probability of the event happening

Odds = Prob (event) / Prob (no event)

e^B = Exp (B) = Odds Ratio = Odds2 / Odds1 = Ratio of one odds to another, where

Odds2 = Odds after a unit change in the predictor variable

Odds1 = Odds before the change in the predictor variable
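The definitions above can be checked numerically. The sketch below assumes the standard logistic model with one predictor and hypothetical values for the intercept and coefficient; it computes the odds before and after a unit change in the predictor and compares their ratio with Exp(B).

```python
import math


def probability(b0, b1, x):
    """P(event) under the standard logistic model with one predictor."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def odds(p):
    """Odds = Prob(event) / Prob(no event)."""
    return p / (1.0 - p)


# Hypothetical intercept and coefficient, for illustration only.
b0, b1 = -1.0, 0.7

odds1 = odds(probability(b0, b1, 2))  # odds before the change (X = 2)
odds2 = odds(probability(b0, b1, 3))  # odds after a unit change (X = 3)
odds_ratio = odds2 / odds1

# The two printed numbers agree: Odds2 / Odds1 equals Exp(B).
print(round(odds_ratio, 4), round(math.exp(b1), 4))
```

The same ratio is obtained for any starting value of X, which is why a single odds ratio can summarize the effect of a predictor.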

Consider some numerical examples:

P      Odds         Odds Ratio (e^B)      Odds Ratio (e^B)
                    Row k / Row k-1       Row k-1 / Row k
.10    1/9 = .11    --                    .11/.25 = .44
.20    2/8 = .25    .25/.11 = 2.27        .25/.67 = .37
.40    4/6 = .67    .67/.25 = 2.68        .67/4 = .17
.80    8/2 = 4      4/.67 = 5.97          --
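The numerical examples above can be reproduced with a few lines of code. This sketch uses the four probabilities from the table and keeps the odds at full precision; the variable names are illustrative.

```python
# Probabilities taken from the numerical examples above.
probabilities = [0.10, 0.20, 0.40, 0.80]

# Odds = P / (1 - P), kept at full precision.
odds = [p / (1 - p) for p in probabilities]

# Ratio of each row's odds to the previous row's odds, and the reverse.
ratios_up = [odds[k] / odds[k - 1] for k in range(1, len(odds))]
ratios_down = [odds[k - 1] / odds[k] for k in range(1, len(odds))]

for p, o in zip(probabilities, odds):
    print(f"P = {p:.2f}  odds = {o:.2f}")
for up, down in zip(ratios_up, ratios_down):
    print(f"row k / row k-1 = {up:.2f}   row k-1 / row k = {down:.2f}")
```

Computed without intermediate rounding, the ratios are 2.25, 2.67, and 6.00 going up the table. The slightly different values shown above (2.27, 2.68, 5.97) come from dividing odds that were first rounded to two decimals. Each ratio in one direction is the reciprocal of the ratio in the other direction.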

The odds ratio is a measure of association, but, unlike other measures of association, “1.0” means that there is no relationship between the variables. The size of any relationship is measured by the difference (in either direction) from 1.0. An odds ratio less than 1.0 indicates an inverse or negative relation; an odds ratio greater than 1.0 indicates a direct or positive relation.

Odds ratios of about 2.5 or 3.0 are generally taken as the lower limit of a strong association [1]. Rosenthal (1996) has suggested the following qualitative descriptors of effect size for the odds ratio: about 1.5 (or the inverse value .67) = small effect (or weak association); about 2.5 (or .40) = medium (or moderate); about 4 (or .25) = large (or strong); about 10 (or .10) = very large (or very strong).[2]
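As a rough aid, the descriptors above can be turned into a lookup. Note the assumptions: Rosenthal's values are approximate ("about"), so treating them as hard cutoffs is a simplification made here, and the label "below small" for values closer to 1.0 is not from the source. The function name is hypothetical.

```python
# Cutoffs from the descriptors above, treated as hard boundaries
# (an assumption; the source describes them as approximate values).
CUTOFFS_ABOVE_ONE = [(10.0, "very large"), (4.0, "large"),
                     (2.5, "medium"), (1.5, "small")]
CUTOFFS_BELOW_ONE = [(0.10, "very large"), (0.25, "large"),
                     (0.40, "medium"), (0.67, "small")]


def describe_odds_ratio(odds_ratio):
    """Return a qualitative effect-size label for an odds ratio."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    if odds_ratio >= 1.0:
        # Positive relation: larger values are further from 1.0.
        for cutoff, label in CUTOFFS_ABOVE_ONE:
            if odds_ratio >= cutoff:
                return label
    else:
        # Inverse relation: smaller values are further from 1.0.
        for cutoff, label in CUTOFFS_BELOW_ONE:
            if odds_ratio <= cutoff:
                return label
    return "below small"  # label assumed here, not given in the source


for value in (5.97, 2.27, 0.44, 0.17, 1.0):
    print(value, describe_odds_ratio(value))
```

Applied to the odds ratios from the numerical examples, 5.97 and 0.17 both come out as "large", which reflects the point that strength is judged by distance from 1.0 in either direction.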