Analyses Involving Categorical Dependent Variables
When Dependent Variables are Categorical
Examples: Dependent variable is simply Failure vs. Success
Dependent variable is Lived vs. Died
Dependent variable is Passed vs Failed
You ignored everything I’ve said about not categorizing and dichotomized a DV.
Chi-square analysis is frequently used.
Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?
Dependent variable is Death – No (0) vs. Yes (1). Independent variable is Helmet – No (0) vs. Yes (1).
Crosstabs
So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.
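For concreteness, here is a minimal sketch of how such a 2x2 chi-square test could be run in R. The cell counts below are invented for illustration (they are not the actual accident data); they happen to give a non-significant result, like the analysis above.

# Hypothetical counts: rows = Helmet (No, Yes), columns = Death (No, Yes)
accident <- matrix(c(60, 12,
                     55,  9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Helmet = c("No", "Yes"),
                                   Death  = c("No", "Yes")))
accident
# Pearson chi-square test of independence, without the continuity correction,
# to match the uncorrected Pearson chi-square that SPSS CROSSTABS reports
chisq.test(accident, correct = FALSE)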
Comments on Chi-square analyses
What’s good?
1. The analysis is appropriate. It hasn’t been supplanted by something else.
2. The results are usually easy to communicate, especially to lay audiences.
3. A DV with a few more than 2 categories can be easily analyzed.
4. An IV with only a few more than 2 categories can be easily analyzed.
What’s bad?
1. Incorporating more than one independent variable is awkward, requiring multiple tables.
2. Certain tests, such as tests of interactions, can’t be performed easily when you have more than one IV.
3. Chi-square analyses can’t be done when you have continuous IVs unless you categorize the continuous IVs, which goes against recommendations to NOT categorize continuous variables because you lose power.
Alternatives to the Chi-square test. We'll focus on Dichotomous (two-valued) DVs.
1. Linear Regression techniques
a. Multiple Linear Regression. Stick your head in the sand and pretend that your DV is continuous and regress the (dichotomous) DV onto the collection of IVs.
b. Discriminant Analysis (equivalent to MR when DV is dichotomous)
Problems with regression-based methods, when the dependent variable is dichotomous and the independent variable is continuous.
1. Assumption is that underlying relationship between Y and X is linear.
But when Y has only two values, how can that be?
2. Linear techniques assume that variability about the regression line is homogeneous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.
3. Residuals will probably not be normally distributed.
4. The regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction, resulting in Y-hats that are impossible values (illustrated in the sketch below).
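The following small simulation (invented data, not from the course files) makes these problems concrete: it fits an ordinary linear regression to a 0/1 outcome and inspects the fitted values and residuals.

# Simulate a 0/1 outcome whose true relation to X is ogival
set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- rbinom(200, size = 1, prob = plogis(1.5 * x))

fit <- lm(y ~ x)        # "pretend the DV is continuous"
range(fitted(fit))      # fitted values typically fall below 0 and above 1 at the extremes
plot(x, resid(fit))     # residuals form two bands whose spread changes with x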
2. Logistic Regression
3. Probit analysis
Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We’ll focus on it.
The Logistic Regression Equation
Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.
Conceptualizing Y-hat. When Y is GPA, for example, actual GPAs and predicted GPAs are the same kind of quantity; they can even be identical. However, when Y is a dichotomy, it can take on only 2 values, 0 and 1, while predicted Ys can be any value. So how do we reconcile that discrepancy?
When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We'll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1.
The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we’re conceptualizing Y-hat as the probability that Y is 1.
The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression)
Y-hat = P(Y=1) = 1/(1 + e^-(B0 + B1*X)) = e^(B0 + B1*X)/(1 + e^(B0 + B1*X))
The logistic regression equation defines an S-shaped (ogive) curve that rises from 0 to 1 as X ranges from -∞ to +∞. P(Y=1) is never negative and never larger than 1.
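A minimal R sketch of the equation, with assumed values B0 = 0 and B1 = 1, shows the bounded S-shape:

B0 <- 0          # assumed intercept, for illustration only
B1 <- 1          # assumed slope, for illustration only
x  <- seq(-5, 5, by = 0.1)
p  <- 1 / (1 + exp(-(B0 + B1 * x)))        # same as plogis(B0 + B1 * x)
range(p)                                   # stays strictly between 0 and 1
plot(x, p, type = "l", ylab = "P(Y = 1)")  # the ogive curve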
The curve of the equation . . .
B0: B0 is analogous to the linear regression "constant", i.e., the intercept parameter. Although B0 defines the "height" of the curve at a given X, note that the most vertical part of the curve moves to the right as B0 decreases. For the graphs below, B1 = 1 and X ranged from -5 to +5.
For equations for which B1 is the same, changing B0 only changes the location of the curve over the range of X-axis values.
The “slope” of the curve remains the same.
B1: B1 is analogous to the slope of the linear regression line. B1 defines the “steepness” of the curve. It is sometimes called a discrimination parameter.
The larger the value of B1, the “steeper” the curve, the more quickly it goes from 0 to 1. B0=0 for the graph.
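The sketch below, with made-up coefficient values, illustrates both points: decreasing B0 slides the curve to the right along the X axis, and increasing B1 makes the rise from 0 to 1 steeper.

x <- seq(-5, 5, by = 0.1)

# Changing B0 with B1 fixed at 1: the curve's location shifts
plot(x, plogis(0 + 1 * x), type = "l", ylab = "P(Y = 1)")
lines(x, plogis(-2 + 1 * x), lty = 2)   # B0 = -2: curve moves to the right
lines(x, plogis( 2 + 1 * x), lty = 3)   # B0 =  2: curve moves to the left

# Changing B1 with B0 fixed at 0: the curve's steepness changes
plot(x, plogis(0 + 0.5 * x), type = "l", ylab = "P(Y = 1)")
lines(x, plogis(0 + 2 * x), lty = 2)    # larger B1: steeper curve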
Note that there is a MAJOR difference between the linear regression curves we’re familiar with and logistic regression curves - - -
The logistic regression lines asymptote at 0 and 1. They’re bounded by 0 and 1.
But the linear regression lines extend below 0 on the left and above 1 on the right – the predicted Ys range from -∞ to +∞.
If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.
Why we must fit ogival-shaped curves – the curse of categorization
Here’s a perfectly nice linear relationship between score values, from a recent study.
This relationship is of ACT Comp scores to Wonderlic scores. It shows that as intelligence gets higher, ACT scores get larger.
[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav
Here’s the relationship when ACT Comp (vertical axis) has been dichotomized at 23, into Low vs. High.
When proportions of High scores are plotted vs. WPT value, we get the following
So, to fit the above curve relating proportions of persons with High ACT scores to WPT within successive groups, we need a model that is ogival.
This is where the logistic regression function comes into play.
This means that even if the “underlying” true values are linearly related, proportions based on the dichotomized values within successive groups will not be linearly related to the independent variable.
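The simulation sketch below (invented data, not the ACT/Wonderlic file) reproduces the phenomenon: an underlying linear relation, once the criterion is dichotomized, yields proportions that trace an ogive rather than a line.

set.seed(2)
wpt  <- round(runif(2000, 10, 40))            # hypothetical predictor scores
act  <- 5 + 0.6 * wpt + rnorm(2000, sd = 3)   # linear "true" relation to the criterion
high <- ifelse(act >= 23, 1, 0)               # dichotomize the criterion at 23

prop_high <- tapply(high, wpt, mean)          # proportion High at each predictor value
plot(as.numeric(names(prop_high)), prop_high,
     xlab = "Predictor score", ylab = "Proportion High")   # S-shaped, not linear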
Crosstabs and Logistic Regression
Applied to the same 2x2 situation
The FFROSH data.
The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables – first semester GPA excluding the seminar course and whether a student continued into the 2nd semester.
The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who retained to the immediately following spring semester and 0 for those who did not. Retention is good.
The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.
After examining the distribution of the times students registered prior to the first day of class we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG – for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150 day value was chosen after inspection of the 1st semester GPA data.)
I know, I know. This is a violation of the “Do not categorize!” rule. There were technical reasons this time.
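In R, the computation would look something like the following. The name daysbefore is a hypothetical stand-in for whatever the registration-timing variable is actually called in the FFROSH file.

# 1 = registered 150 or more days before the first day of class, 0 otherwise
# (daysbefore is a placeholder name, not the variable's actual name)
ffroshnm$earlireg <- ifelse(ffroshnm$daysbefore >= 150, 1, 0)
table(ffroshnm$earlireg)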
So the analysis that follows examines the relation of RETAINED to EARLIREG, the relation of retention to the 2nd semester to early registration.
The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.
First, univariate analyses . . .
GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.
Fre var=retained earlireg.
retained
          Frequency   Percent   Valid Percent   Cumulative Percent
  .00          552      11.6         11.6             11.6
 1.00         4201      88.4         88.4            100.0
Total         4753     100.0        100.0

earlireg
          Frequency   Percent   Valid Percent   Cumulative Percent
  .00         2316      48.7         48.7             48.7
 1.00         2437      51.3         51.3            100.0
Total         4753     100.0        100.0
crosstabs tables = retained by earlireg /cells=cou col /sta=chisq.
Crosstabs
The relation is significant. Students who registered early were more likely to sustain directly into the 2nd semester than those who registered late.
The same analysis using Logistic Regression: Analyze -> Regression -> Binary Logistic
logistic regression retained WITH earlireg.
Logistic Regression
The Logistic Regression procedure applies the logistic regression model to the data. It estimates the parameters of the logistic regression equation.
That equation is P(Y=1) = 1/(1 + e^-(B0 + B1*X))
The LOGISTIC REGRESSION procedure performs the estimation in two stages.
The first stage estimates only B0. So the model fit to the data in the first stage is simply
P(Y=1) = 1/(1 + e^-B0)
SPSS labels the various stages of the estimation procedure “Blocks”. In Block 0, a model with only B0 is estimated
The second stage estimates both B0 and B1. So the model fit to the data in the second stage is
P(Y=1) = 1/(1 + e^-(B0 + B1*X))
SPSS labels this stage Block 1; in it, the model containing both B0 and B1 is estimated.
The first stage . . .
Block 0: Beginning Block (estimating only B0)
Explanation of the above table: The program estimated B0=2.030. The resulting P(Y=1) = .8839.
The program computes Y-hat = .8839 for each case using the logistic regression formula with the estimate of B0. If Y-hat is less than or equal to a predetermined cut value of 0.500, that case is recorded as a predicted 0. If Y-hat is greater than 0.5, the program records that case as a predicted 1. It then creates the above table of numbers of actual 1's and 0's vs. predicted 1's and 0's. Predicted Ys are all 1 in this particular example. Sometimes this table is more useful than it was in this case; it's typically most useful when the equation includes continuous predictors. Note that all students are predicted to sustain in this example.
The Variables in the Equation Table for Block 0.
The “Variables in the Equation” box is the Logistic Regression equivalent of the “Coefficients” box in regular regression analysis. The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). Only B0 is in the equation; B1 is not yet included.
The test statistic in the “Variables in the Equation” table is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/SE)^2. So (2.030/.045)^2 = 2,035, which would be 2009.624 if the two coefficients were represented with greater precision.
Exp(B) is the odds ratio, e^2.030. It is the ratio of the odds that Y=1 when the predictor equals 1 to the odds that Y=1 when the predictor equals 0. It's an indicator of strength of relationship to the predictor. It means nothing here since there is no predictor in the equation.
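As a check, the Block 0 numbers can be reproduced by hand from the frequency table above (4201 retained out of 4753); a quick R sketch:

p_hat <- 4201 / 4753                # .8839, the overall proportion retained
B0    <- log(p_hat / (1 - p_hat))   # 2.030: the intercept-only B0 is the sample log-odds
1 / (1 + exp(-B0))                  # .8839: the logistic formula recovers p_hat
(2.030 / 0.045)^2                   # roughly 2035: the Wald statistic (B/SE)^2
exp(2.030)                          # about 7.6: Exp(B0), the odds of being retained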
The Variables not in the Equation Table for Block 0.
The “Variables not in the Equation” gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be “significant” if it were added to the equation. In this case, it’s telling us that EARLIREG would contribute significantly to the equation if it were added to the equation, which is what SPSS does next . . .
The second stage . . .
Block 1: Method = Enter (Adding estimation of B1 to the equation)
Whew – three chi-square statistics. Note that the chi-square is identical to the Likelihood ratio chi-square printed in the Chi-square Box in the CROSSTABS output.
For Chi-square in this procedure, bigger is better. A significant chi-square means that your ability to predict the dependent variable is significantly different from chance.
“Step”: Compared to previous step in a stepwise regression. In this case the previous step is the equation with just B0.
“Block”: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block, in which just B0 was estimated.
“Model”: I believe this is analogous to the ANOVA F in REGRESSION – testing whether the model with all predictors fits better than a model with just B0 – an independence model.
All of these tell us that adding estimation of B1 to the equation resulted in a significant improvement in fit.
The value under “-2 Log likelihood” is a measure of how well the model fit the data in an absolute sense. Values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to “percent of variance accounted for”. All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.
The above table is the version of the table including Y-hats based on B0 and B1.
Note that since X is a dichotomous variable here, there are only two y-hat values. They are
P(Y=1) = 1/(1 + e^-(B0 + B1*0)) = .842 (see below)

and

P(Y=1) = 1/(1 + e^-(B0 + B1*1)) = .924 (see below)
In both cases, the y-hat was greater than .5, so predicted Y in the table was 1 for all cases.
The prediction equation is Y-hat = P(Y=1) = 1/(1 + e^-(1.670 + .830*EARLIREG)).
Since EARLIREG has only two values, students who registered early will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*1)) = .924. Those who registered late will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*0)) = 1/(1 + e^-1.670) = .842. That difference is statistically significant.
Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0.
Recall that the odds of Y=1 are P(Y=1)/(1 - P(Y=1)). The odds ratio is
Odds ratio = (odds when X = 1)/(odds when X = 0) = [.924/(1 - .924)]/[.842/(1 - .842)] = 12.158/5.329 = 2.29.
So a person who registered early had odds of being retained that were 2.29 times the odds of a person who registered late. An odds ratio of 1 means that the DV is not related to the predictor.
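A quick R check of these figures, using the rounded coefficients from the output:

B0 <- 1.670
B1 <- 0.830
p_late  <- 1 / (1 + exp(-(B0 + B1 * 0)))   # .842 for late registrants
p_early <- 1 / (1 + exp(-(B0 + B1 * 1)))   # .924 for early registrants
odds_late  <- p_late  / (1 - p_late)       # about 5.3
odds_early <- p_early / (1 - p_early)      # about 12.2
odds_early / odds_late                     # about 2.29, the odds ratio
exp(B1)                                    # same value: Exp(B) = e^B1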
Graphical representation of what we’ve just found.
The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur.) The curve is analogous to the straight line plot in a regular regression analysis.
The bottom line here is that the LOGISTIC REGRESSION results are the same as the CROSSTABS results – those who registered early are significantly more likely to sustain than those who register late.
CrossTabs in Rcmdr - Start here on 10/2/18
Start R, then Rcmdr, then import the ffrosh for P5100 data.
Crosstabs in Rcmdr requires that the variables to be crossed be factors.
First, convert the variables to factors
Data -> Manage variables in active data set -> Convert numeric variables to factors
Note that I created a new column in the Rcmdr data editor – so that I can use EARLIREG in procedures that analyze regular variables and earliregfact in procedures that require factors.
By the way: Rcmdr’s import automatically converts any variable whose values have labels into factors.
You can remove this tendency by
1) Removing value labels from the variable in the SPSS file prior to importing, or
2) Unchecking the “Convert value labels to factor levels” box in the Import SPSS Data set dialog.
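If you prefer typing to menus, the conversion can be done directly; the code below is roughly what Rcmdr writes for you (the exact generated code may differ):

# Create factor versions of the two 0/1 variables
ffroshnm$retainedfact <- factor(ffroshnm$retained, levels = c(0, 1))
ffroshnm$earliregfact <- factor(ffroshnm$earlireg, levels = c(0, 1))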
Statistics -> Contingency tables -> Two-way table . . .
Note that this procedure works only for factors.
The output.
Frequency table:
earliregfact
retainedfact 0 1
0 367 185
1 1949 2252
Pearson's Chi-squared test
data: .Table
X-squared = 78.832, df = 1, p-value < 2.2e-16
The chi-square value is the same as the Pearson chi-square in SPSS, p. 9.
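For reference, code along these lines (a sketch of what Rcmdr generates for the two-way table; details may differ) produces the table and test shown above:

.Table <- xtabs(~ retainedfact + earliregfact, data = ffroshnm)
.Table
chisq.test(.Table, correct = FALSE)   # Pearson chi-square, matching X-squared = 78.832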
Logistic Regression Analysis in Rcmdr.
Statistics -> Fit models -> Generalized linear model . . .
library(foreign, pos=14)
> ffroshnm <- read.spss("G:/MDBR/FFROSH/Ffroshnm for P595.sav",
+ use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
> colnames(ffroshnm) <- tolower(colnames(ffroshnm))
> library(abind, pos=15)
> GLM.1 <- glm(retained ~ earlireg, family=binomial(logit), data=ffroshnm)
> summary(GLM.1)
Call:
glm(formula = retained ~ earlireg, family = binomial(logit),
data = ffroshnm)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2708 0.3974 0.3974 0.5874 0.5874
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.66971 0.05690 29.343 <2e-16 ***
earlireg 0.82951 0.09533 8.702 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3414.1 on 4752 degrees of freedom
Residual deviance: 3334.2 on 4751 degrees of freedom
AIC: 3338.2
Number of Fisher Scoring iterations: 5
> exp(coef(GLM.1)) # Exponentiated coefficients ("odds ratios")
(Intercept) earlireg
5.310627 2.292191
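Note that the model (likelihood-ratio) chi-square is just the drop in deviance from the null (intercept-only) model to the model with earlireg, which should correspond to the Model chi-square in the SPSS Block 1 output and to the likelihood-ratio chi-square in CROSSTABS:

3414.1 - 3334.2                # about 79.9 on 1 df
anova(GLM.1, test = "Chisq")   # the same test, computed from the fitted model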
Discussion
1. When there is only one dichotomous predictor, the CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information.
BUT as mentioned above . . .
2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.
3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X’s, but the analysis IS rudimentary and is laborious. No tests of interactions are possible. The analysis involves inspection and comparison of multiple tables.
4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IV’s.
5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.
6. So CROSSTABS should be considered for the very simplest situations involving one categorical predictor. But LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors or when there are one or more continuous predictors.
Logistic Regression Example 1: One Continuous Predictor
The data analyzed here represent the relationship of Pancreatitis Diagnosis to measures of Amylase and Lipase. Both Amylase and Lipase levels are tests (blood test, I believe) that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis.
The objective here is to determine
1) which alone is the better predictor of the condition and
2) whether both are needed.
Note that this analysis could not be done using chi-square.
Because the distributions of both predictors were positively skewed, logarithms of the actual Amylase and Lipase values were used for this handout and for some of the following handouts.
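A sketch of the logistic regression models implied by these objectives; the data-frame and variable names (pancdata, diagnosis, logamylase, loglipase) are placeholders, not the names in the actual file:

# Each predictor alone, then both together
fit_amyl <- glm(diagnosis ~ logamylase, family = binomial(logit), data = pancdata)
fit_lip  <- glm(diagnosis ~ loglipase,  family = binomial(logit), data = pancdata)
fit_both <- glm(diagnosis ~ logamylase + loglipase, family = binomial(logit), data = pancdata)

summary(fit_both)                          # do both predictors contribute?
anova(fit_lip, fit_both, test = "Chisq")   # does adding amylase improve on lipase alone?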