Logistic Regression and Logistic Analysis
(Alan Pickering, 2nd December 2003)
- To understand these notes you should first have understood the material in the notes on “Associations in two-way categorical data” (henceforth referred to as the A2WCD notes), which are available electronically in the usual places.
MORE GENERAL PROCEDURES
FOR ANALYSING CATEGORICAL DATA
There are a number of techniques that are more general than the Pearson chi-squared test (χ²), likelihood ratio chi-squared test (G²), and odds-ratio (OR) methods reviewed in the A2WCD notes. In particular, these general methods can be used to analyse contingency tables with more than 2 variables. Next we consider the range of these procedures available in SPSS.
Choosing A Procedure In SPSS
There are lots of different ways to analyse contingency tables and categorical dependent variables within SPSS. Each of the following procedures can sometimes be used:
Analyze > Regression > Binary Logistic
Analyze > Regression > Multinomial Logistic
Analyze > Loglinear > General
Analyze > Loglinear > Logit
Analyze > Loglinear > Model Selection
Each procedure works best for a particular type of statistical question, although the procedures can often be “forced” to carry out analyses for which they were not specifically designed. The outputs of each procedure look quite different, even though many of the results, buried within the printout, will be identical. However, some of the output information is unique to each procedure.
My general advice is that the Multinomial Logistic Regression procedure is by far the most user-friendly and will deal with the most common data analyses of this type that we wish to carry out.
The SPSS Help menu gives advice on choosing between procedures (select Help > Topics and then select the Index tab). Unfortunately, SPSS uses slightly different names for the procedures in its Help menu (and on printed output) than those that appear as the options in the Analyze menu. The following table clarifies the names that SPSS uses:
Procedure name in Analyze menu       Name used in Help menu                Name used in output
Regression > Binary Logistic         Logistic Regression                   Logistic Regression
Regression > Multinomial Logistic    Multinomial Logistic Regression       Nominal Regression
Loglinear > General                  General Loglinear Analysis           General Loglinear (Analysis)
Loglinear > Logit                    Logit Loglinear Analysis             General Loglinear (Analysis)
Loglinear > Model Selection          Model Selection Loglinear Analysis   HiLog (Hierarchical Log Linear)
Table 1. The varying names used by SPSS to describe its categorical data analysis procedures
Finally, note that the SPSS printed output, for the various types of categorical data analysis, is fairly confusing because it contains a lot of technical detail and jargon. That is why it is important to have a clear understanding of some of the basic issues covered below.
Three Types of Analysis
- Logistic (or Logit) Regression: describes a general procedure in which one attempts to predict a categorical dependent variable (DV) from a group of predictors (IVs). These predictors can be categorical or numerical variables (the latter are referred to in SPSS as covariates). The DV can have two or more levels (binary or multinomial, respectively). This analysis can be thought of as analogous to (multiple) linear regression, but with categorical DVs. It is most easily carried out in SPSS using the following procedures:
Analyze > Regression > Binary Logistic
Analyze > Regression > Multinomial Logistic
- Logistic (or Logit) Analysis: describes a special case of logistic regression in which all the predictor variables are categorical, and these analyses often include interaction terms formed from the predictor variables. This analysis can be thought of as analogous to ANOVA, but with categorical DVs. It is most easily carried out in SPSS using the following procedures:
Analyze > Regression > Binary Logistic
Analyze > Regression > Multinomial Logistic
Analyze > Loglinear > Logit
- (Hierarchical) Loglinear Modelling, Loglinear Analysis, or Multiway Frequency Table Analysis: describes a procedure in which there is no separation into DVs and predictors, and one is concerned with the interrelationships between all the categorical variables in the table. It is most easily carried out in SPSS using the following procedures:
Analyze > Loglinear > General
Analyze > Loglinear > Model Selection
Example of Logistic Analysis Using SPSS
Logistic analysis may be the most straightforward place to start looking at the more general contingency table analysis methods available in SPSS. This is because: (a) logistic analysis resembles ANOVA (which is familiar to psychologists); (b) categorical data from psychological experiments are probably most often in a form requiring logistic analysis (i.e., there is a DV and one or more categorical IVs); and (c) the relevant SPSS procedures are probably easier to execute and interpret than the other types of contingency table analysis methods. Before analysing multiway tables, we start with an example of a two-way analysis.
Logistic Analysis Example: small parks data
The data concern subjects with Parkinson’s disease (PD): the dataset contains disease status (PDstatus: 1 = has disease; 2 = no disease) and smoking history (Smokehis: 1 = is or was a smoker; 2 = never smoked). Table 2 shows the key contingency table:
                                 PDstatus
                        yes (=1)   no (=2)   Row totals
Smokehis    yes (=1)        3         11          14
            no (=2)         6          2           8
Column totals               9         13     Grand total = 22
Table 2. Observed frequency counts of current Parkinson’s disease status by smoking history.
The data were analysed using the Analyze > Regression > Multinomial Logistic procedure. The presence or absence of PD (PDstatus) was selected into the “Dependent variable” box and cigarette smoking history (Smokehis) was selected into the “Factor(s)” box. The Statistics button was selected and, in the resulting subwindow, the “Likelihood ratio test” and “Parameter estimates” options were checked. The key resulting printed output was as follows:
Understanding the Printout
(The key jargon from the printout is highlighted in bold below.) In this analysis, there is only one effect of interest: the effect of smoking history Smokehis on PDstatus. As a result, the first two output tables (“Model Fitting Information”; “Likelihood Ratio Tests”) are completely redundant. Later, when we look at a 3-way example, these two tables provide different information. The final model is a model containing all the possible effects. As will be explained in more detail later, this model has two free parameters (parameters are just the independent components in the mathematical formula for the model). The final model proposes that the probability of PD is different in each of the two samples with differing smoking histories (i.e., differing values of Smokehis). Therefore, the final model needs 2 parameters: effectively these parameters correspond to the probability of having PD in each of the 2 samples. As there are only 2 outcomes (has PD vs. doesn’t have PD), we do not need to specify the probability of not having PD, because that is simply 1 minus the probability of having PD.
The final model has a likelihood which we can denote with the symbol Lfinal. This likelihood is just the probability that exactly the observed data would have been obtained if the final model were true. These analyses use the natural logarithms (loge) of various values. Those who are not very familiar with logarithms should review the quick tutorial on logarithms that was given in the A2WCD notes. The value of −2*loge(Lfinal) is 5.087 (see “Model Fitting Information” in the above printout). The likelihood of getting exactly the data in Table 2, if the final model were true, is therefore given by e^(−5.087/2) (≈0.08). Although this probability may seem low, the final model is the best possible model one could specify for these data. Later in these notes we consider how these likelihoods are calculated.
The analysis also produces a reduced model that is simpler than the final model. The reduced model is called an intercept only model in the Model Fitting Information output table. Because the reduced model is formed from the final model by removing the effect of Smokehis on PDstatus, the reduced model appears in the row labelled SMOKEHIS in the table of Likelihood Ratio Tests. (The row labelled Intercept in this table should be ignored.) The reduced model has only a single parameter because it proposes only one rate of PD occurrence (i.e., the rate is the same for both samples differing in smoking history under this model).
The reduced model has a likelihood that can be represented Lreduced. From the printout shown above, the value of −2*loge(Lreduced) is 11.308, which corresponds to a likelihood of e^(−11.308/2) (≈0.0035). Given the data, this model is less likely than the final model. This will be the case for any reduced model (because reduced models have fewer parameters than the more complete models from which they are formed). In general, scientific modelling attempts to find the simplest model (i.e., the one with fewest parameters) that provides an adequate fit to the data. The key decision in this kind of analysis is therefore whether the lower likelihood of the reduced model is a statistically acceptable “trade” for the reduced number of parameters involved in the reduced model. The likelihood ratio test compares the likelihoods of the two models in order to make this decision.
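(For those who like to check such figures, the two likelihoods quoted above can be recovered from the −2*loge(L) values with a couple of lines of Python. This is purely an illustrative sketch; it is not something SPSS produces.)

import math

# The -2*loge(L) values reported by SPSS for the smoking/PD example
minus2_loglik_final = 5.087
minus2_loglik_reduced = 11.308

# Undo the -2*loge() transformation: L = e**(-value/2)
print(math.exp(-minus2_loglik_final / 2))    # roughly 0.08
print(math.exp(-minus2_loglik_reduced / 2))  # roughly 0.0035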
If the reduced model were true then it turns out that a function of the ratio of the likelihoods (specifically −2*loge[Lreduced/Lfinal]) would have a distribution that is approximated by the χ² distribution, with degrees of freedom given by the difference in the number of free parameters between the final and reduced models (2 − 1 = 1 in this case). From the properties of logs (see the A2WCD notes) we know that the log likelihood ratio statistic given above is equivalent to:
2*loge(Lfinal) − 2*loge(Lreduced).
The value of the statistic in the present example is (−5.087 − (−11.308)) = 6.222. Note that this value appears under the “Chi-Square” column heading in the table of Likelihood Ratio Tests. The above notes (and the footnote on the SPSS output) should have made it clear why it is called a likelihood ratio test statistic and why it is tested against the chi-squared distribution. The value obtained (6.222) is considerably greater than the critical value for χ² with df=1 (for p=0.05, this is just under 4) and so the reduced model can be rejected in favour of the final model.
This result means that Smokehis and PDstatus are not independent in this dataset: i.e., there is a significant effect of Smokehis on PDstatus (p=0.013).
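The arithmetic of the test can be checked directly. The sketch below (Python, with SciPy assumed to be available for the χ² tail probability; again this is illustrative, not SPSS output) recomputes the statistic and its p value from the two −2*loge(L) values:

from scipy.stats import chi2  # assumed available; used only for the chi-squared tail probability

minus2_loglik_final = 5.087
minus2_loglik_reduced = 11.308

# Likelihood ratio statistic: (-2*loge(Lreduced)) - (-2*loge(Lfinal))
lr_stat = minus2_loglik_reduced - minus2_loglik_final   # about 6.22
df = 2 - 1   # final model parameters minus reduced model parameters

print(lr_stat, chi2.sf(lr_stat, df))   # p is about 0.013, as reported by SPSS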
We can compare the value of the likelihood ratio test statistic in this analysis with the value of the likelihood ratio statistic (G²) for two-way tables, which we can obtain using the SPSS Crosstabs procedure (see A2WCD notes). The two values are identical: the G² statistic is the special case of logistic analysis that arises when there are only 2 variables in the contingency table.
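To see this equivalence concretely, the illustrative Python sketch below computes G² directly from the observed counts in Table 2, using the expected counts under independence (row total × column total / grand total); it reproduces the likelihood ratio statistic of about 6.22.

import math

# Observed counts from Table 2 (rows = Smokehis yes/no; columns = PDstatus yes/no)
observed = [[3, 11],
            [6, 2]]

row_totals = [sum(row) for row in observed]          # 14, 8
col_totals = [sum(col) for col in zip(*observed)]    # 9, 13
grand_total = sum(row_totals)                        # 22

# Expected counts under independence of the two variables
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# G2 = 2 * sum over all cells of O * loge(O/E)
g2 = 2 * sum(o * math.log(o / e)
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))
print(g2)   # about 6.22, matching the likelihood ratio test statistic above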
We should also note that the Parameter Estimates output table shows estimates for the two parameters of the final model (B values). Note that the B parameter with a value of −2.398 represents the natural logarithm of the odds ratio {loge(OR)}. Once again, we can calculate this for a 2x2 table using Crosstabs (see A2WCD notes). The value Exp(B) “undoes” the effect of taking the logarithm, because the exponentiation function {Exp()} is the inverse operation to taking a logarithm (in much the same way as dividing is the inverse of multiplying). Exp(B) therefore gives us the odds ratio itself (OR) and its associated 95% confidence interval (CI; also available from Crosstabs). Later on in these notes, we will discuss how these parameter estimates are constructed, and thus how they can be interpreted. When we move on to tables bigger than 2x2, the B parameter values shown in the output will each be different loge(OR) values calculated from 4 cells within the larger table.
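The B value can likewise be checked by hand from Table 2: the odds ratio is (3×2)/(11×6) ≈ 0.091, and its natural logarithm is about −2.398. The short Python sketch below (illustrative only) spells out this calculation.

import math

# Cell counts from Table 2
a, b = 3, 11   # Smokehis = yes: PD yes, PD no
c, d = 6, 2    # Smokehis = no:  PD yes, PD no

odds_ratio = (a * d) / (b * c)   # about 0.091
B = math.log(odds_ratio)         # about -2.398, matching the B value in the output
print(odds_ratio, B, math.exp(B))   # Exp(B) simply recovers the odds ratio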
PART II – UNDERSTANDING THE STATISTICAL MODELLING TECHNIQUES USED IN LOGISTIC REGRESSION
Sample Estimates and Population Probabilities
This section is quite simple conceptually (and the technical bits are in boxes). If you grasp what’s going on, even if only roughly, then it will really help you to understand: (a) the process of executing logistic regression; (b) what to look at in the printed output; and (c) the jargon used in the printout.
Imagine we tested a random sample of 100 female subjects in their twenties in a dart-throwing experiment. We got each subject to take one throw at the board. We scored the data very simply: did the subject hit the scoring portion of the board? This generated a categorical variable hitboard with values: 1=yes; 2=no.
hitboard
1 = yes    2 = no    Total
   60        40       100
Table 3. The summary data for the hitboard variable in the dart-throwing data
So the overall probability of hitting the dartboard was 0.6 (60/100; let’s call that probability q). The measured value of 0.6, from this particular measurement sample, might be used to estimate a particular population probability that we are interested in (e.g., the probability with which women in their twenties would hit a dartboard given a single throw; this population probability will be denoted by the letter p). We might ask what is the most likely value of the population probability given the sample value (q=0.6) that we obtained. For any hypothetical value of p, we can easily calculate the likelihood of getting exactly 60 hits from 100 women using probability theory. The underlying theory and mechanics of the calculation are described in the box below.
Using Probability Theory To Derive Likelihoods
Take tossing a coin as an easy example, which involves all the processes we are interested in. What is the likelihood of getting exactly 2 heads in 3 tosses of a completely fair coin? This is a binomial problem as there are just two outcomes for each trial (Heads, H; Tails, T). We can find the answer by counting. There are 8 (i.e., 2³) possible sequences of 3 tosses, which are all equally likely: TTT; TTH; THT; HTT; HHT; HTH; THH; HHH
Only 3 of the sequences have exactly 2 Heads (THH; HTH; HHT), so the likelihood is 3/8 (=0.375). It is important that the outcome on each toss is independent of the outcome on every other toss. Independence means, for example, that tossing a H on one trial does not change the chance of getting a H on the next trial. In this way the 8 possible sequences shown above are equally likely. This binomial problem, for 2 possible outcomes, can be described generally as trying to find the likelihood, L, of getting exactly m occurrences of outcome 1 in a total of N independent trials, when the probability of outcome 1 on a single trial is p. For our example above: N=3; m=2; p=0.5. (The value of p is 0.5 because it is a fair coin.) The general formula is:
L = NCm * p^m * (1 − p)^(N−m)
Where NCm is the number of ways (combinations) that m outcomes of a particular type can be arranged in a series of N outcomes (answer=3 in our example). NCm is itself given by the following formula:
NCm = N!/(m!*[N-m]!)
Where the symbol ! means the factorial function: X! = X*(X−1)*(X−2)*…*2*1. Thus, for example, 3! = 3*2*1 = 6. Check that the above formulae generate 0.375 as the answer to our coin-tossing problem. The multinomial formulae are an extension of the above to deal with cases where there are more than 2 types of outcome.
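The formulae in the box can be checked with a few lines of Python (an illustrative sketch only):

from math import comb   # comb(N, m) gives NCm

def binomial_likelihood(N, m, p):
    # L = NCm * p**m * (1 - p)**(N - m)
    return comb(N, m) * p ** m * (1 - p) ** (N - m)

# Coin example from the box: exactly 2 heads in 3 tosses of a fair coin
print(binomial_likelihood(N=3, m=2, p=0.5))   # 0.375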
Applying Probability Theory to the Dart-Throwing Data
For our darts data, the values to plug into the formulae are as follows: there are 100 trials (i.e., N=100), and we are interested in the case where we obtained exactly 60 “hitboard=yes” outcomes (i.e., m=60). We can allow the value of p to vary in small steps from 0.05 to 0.95 and calculate a likelihood for each value of p. Putting these values into the formulae, we get the likelihoods that are shown in the following graph:
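(For readers who want to recompute the values plotted in the graph, the binomial formula from the box can be applied over a range of candidate p values; the Python sketch below is illustrative only.)

from math import comb

def binomial_likelihood(N, m, p):
    return comb(N, m) * p ** m * (1 - p) ** (N - m)

# Likelihood of exactly 60 hits in 100 throws for a range of candidate values of p
for p in [0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8]:
    print(p, binomial_likelihood(100, 60, p))
# The likelihood peaks at p = 0.6 (at roughly 0.08) and falls away on either side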
It is fairly clear from the graph that the likelihood of getting the result q=0.6 is at a maximum for the value p=0.6. In fact, this might seem intuitively obvious: if the true value of p were, say, 0.5 then to get a sample estimate of 0.6 must mean that the random sample used was slightly better than expected. This seems less likely to occur than getting a sample which performs exactly as expected. A coin-tossing example may help: I think many people would intuitively “know” that the likelihood of getting 10 heads in 20 tosses of a fair coin is greater than the likelihood of getting 8 (or 12) heads (and much greater than the likelihood of getting 2 or 18 heads). This means that our sample value (q=0.6) is the best estimate for p we can make, given the data.
Another, possibly surprising, point that one might notice from the graph is that the likelihoods are all quite low. Even the maximum likelihood (for p=0.6) is only around 0.08. Even if the population probability really were 0.6, we would get sample values which differ from 0.6 about 92% of the time. The likelihoods are low because we are talking about the probability of the sample value being exactly 0.6 and, in psychology, we are more used to giving ranges of values. For example, we might more usefully give the 95% confidence intervals (CIs) around our sample estimate of q=0.6. These CIs give a range of values which, with 95% confidence, would be expected to contain the true value of p. (How to calculate such CIs is not discussed here.)
Maximum Likelihood Estimation
In general, if one has frequency data of this kind, and an underlying hypothesis (or model) that can be expressed in terms of particular probabilities, then one can create a computer program to estimate the values of those probabilities which are associated with the maximum likelihood of leading to the data values obtained. This is the process called maximum likelihood estimation. In the darts example, we would therefore say that 0.6 is the maximum likelihood estimate (MLE) of the underlying population probability parameter (p), given the data obtained. It can also be said that the value p=0.6 provides the best fit to the data obtained in the experiment.

For the simple dart-throwing example it was possible to work out the MLE for p by logic/intuition. For a more complex model, with several probabilities, numerical estimation by computer is often the only way to derive MLEs. Statistical packages, such as SPSS, use numerical methods to generate MLEs in several different kinds of analyses, including those involved in logistic regression.
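The idea behind such numerical estimation can be illustrated with a deliberately crude grid search (the Python sketch below; SPSS of course uses far more efficient estimation algorithms than this):

from math import comb

def binomial_likelihood(N, m, p):
    return comb(N, m) * p ** m * (1 - p) ** (N - m)

# Try many candidate values of p and keep the one giving the highest
# likelihood of the darts data (60 hits out of 100 throws)
candidates = [i / 1000 for i in range(1, 1000)]
mle = max(candidates, key=lambda p: binomial_likelihood(100, 60, p))
print(mle)   # 0.6 -- the maximum likelihood estimate of p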
Comparing Likelihoods