SPS 580 Lecture 7 Data Mining Dummy Variables notes
I. THE LINEARITY ASSUMPTION
- It’s called multiple LINEAR regression because Y is assumed to be a linear function of each X variable: Y = a + B1(X1) + B2(X2) + B3(X3) . . .
- We like linear models because they give an intuitive way to talk about the effect of each variable (intervening, control) and about the difference between the zero order and the partial
- Violations of linear assumption: Curvilinearity
- look at it with the zero order relationship.
- if the relationship is curvilinear, then the linear slope doesn’t do as good a job at predicting Y as some other alternatives.
- In most cases the linear model is a pretty accurate predictor, so this is not usually the end of the world.
- We’re about to learn a way to deal with the situation when there is a curvilinear relationship.
- Violations of linear assumption: Interactions
- look at them by examining the conditional slopes in the three-variable graph.
- If there is an interaction then the effect of an X variable on Y is not linear, because the magnitude of the slope DEPENDS on a third variable. In most cases interactions are not significant.
- But when they are it IS the end of the world. You have to do the analysis separately for the groups involved in the interaction or incorporate an interaction term in the linear regression model.
- We’ll learn about how to deal with them in a couple of weeks.
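A minimal Python sketch of what "conditional slopes" means, using made-up numbers rather than course data (the course itself uses SPSS; the function name and both groups below are hypothetical). It computes the slope of Y on X separately within each group of a third variable. If the two slopes differ noticeably, the effect of X DEPENDS on the group, which is the interaction described above.

```python
def slope(xs, ys):
    """Least-squares slope of y on x within one group."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: x is a (0,1) predictor, y is a (0,1) outcome,
# listed separately for two groups of a third variable.
group_a = {"x": [0, 0, 0, 0, 1, 1, 1, 1], "y": [1, 1, 1, 0, 0, 0, 0, 1]}
group_b = {"x": [0, 0, 0, 0, 1, 1, 1, 1], "y": [1, 1, 0, 0, 1, 1, 0, 0]}

slope_a = slope(group_a["x"], group_a["y"])  # conditional slope in group A
slope_b = slope(group_b["x"], group_b["y"])  # conditional slope in group B

print("conditional slope, group A:", slope_a)
print("conditional slope, group B:", slope_b)
```

In this pretend example X matters a lot in group A and not at all in group B. Whether a difference like that is significant still has to be tested.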
II. SUPPRESSOR EFFECT
- Not a violation of linearity, rather an unusual outcome of causal analysis
← Three variable path diagram
- Happens when the SIGN of the indirect path B2 * B3 is opposite from the SIGN of the direct path B1.
- If this happens then B1 > ZERO ORDER and you get an estimate of a suppression effect rather than an explanation.
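A small arithmetic sketch in Python with hypothetical coefficients (not estimates from any data set). It assumes B2 labels the X1 → X2 path and B3 labels the X2 → Y path; the notes only say the indirect path is B2 * B3. In a linear path model the zero order equals the direct path plus the indirect path, so an indirect path with the opposite sign pulls the zero order below B1.

```python
def zero_order_from_paths(b1, b2, b3):
    """Zero-order effect of X1 on Y = direct path + indirect path (b2 * b3)."""
    return b1 + b2 * b3

# Hypothetical path coefficients, for illustration only.
b1 = 0.30    # direct path X1 -> Y (the partial, controlling X2)
b2 = 0.50    # assumed to be the X1 -> X2 path
b3 = -0.20   # assumed to be the X2 -> Y path

indirect = b2 * b3
zero_order = zero_order_from_paths(b1, b2, b3)

# Suppression: the indirect path and the direct path have opposite signs.
is_suppression = indirect * b1 < 0

print("indirect path:", round(indirect, 3))
print("zero order:", round(zero_order, 3))
print("suppression:", is_suppression)
```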
III. WHY IS CAUSAL ANALYSIS IMPORTANT?
⇒ Intervening variables often show points of policy input . . . Let’s say you knew higher income people were moving out of a neighborhood. And that they would often explain their reasons for doing so in terms of neighborhood pessimism.
⇒ You want to reduce neighborhood turnover.
Income → Neighborhood Pessimism → Move out
???
⇒ You can’t do much about income, but you might be able to find things that cause pessimism that you can affect.
IV. HOW TO FIND GOOD INTERVENING VARIABLES
X1 → Y
← pretend data, not PQ
How do you find variables that, when you put them inside the causal chain X1 → Y, make the partial less than the zero order?
- reflect on own experience or talk to people
- literature – an article or a report
- data mining – there will be certain statistical relationships between X1, X2 and Y
V. DATA MINING . . . For an intervening variable to explain part/all of the X1 → Y relationship, two conditions have to be met . . .
CONDITION 1: The explanatory variable X2 has a significant impact on Y
X2 → Y is significant
← pretend data, not PQ
Fear is a cause of pessimism
⇒ This is the intervening CAUSAL process, it comes from psychological theory, literature, observational studies, it is a reflection of social process ← this is the reason you like stx
CONDITION 2: Groups that differ on the independent variable X1 differ on the explanatory variable X2
X1 → X2
← pretend data, not PQ
Income groups differ on Fear
⇒ In order for fear to be a reason income causes pessimism, higher income people have to be less fearful than lower income people
⇒ In order for X2 to explain X1 → Y, X1 has to be a cause of X2
VI. OUTCOME OF SUCCESSFUL DATA MINING
A. Mechanically, when you control for X2 the partial is lower than the zero order
X1 → Y controlling X2
← pretend, not PQ
← in this case Partial = 0
B. Intuitively, in the X1 → Y relationship you think you’re looking at groups that differ on X1
← But actually we’re looking at groups that differ on X1 and also X2
← So you need to control for X2 to see the impact of X1 alone (Partial)
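The mechanics can be sketched in Python with pretend data (all counts below are invented, like the pretend tables in these notes, and the helper names are hypothetical). Income and fear are both (0,1); pessimism depends only on fear, and high income people are less fearful. The zero order slope for income comes out negative, but once fear is in the equation the partial for income drops to 0.

```python
def cross_dev(a, b):
    """Sum of cross-products of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))

def zero_order_slope(x1, y):
    """Slope of y on x1 alone."""
    return cross_dev(x1, y) / cross_dev(x1, x1)

def partial_slopes(x1, x2, y):
    """Slopes of y on x1 and x2 from a two-predictor least-squares fit."""
    s11, s22, s12 = cross_dev(x1, x1), cross_dev(x2, x2), cross_dev(x1, x2)
    s1y, s2y = cross_dev(x1, y), cross_dev(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Pretend cells: (income, fear, pessimism, number of respondents).
cells = [
    (0, 1, 1, 4), (0, 1, 0, 4), (0, 0, 0, 2),   # low income: mostly fearful
    (1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 0, 8),   # high income: mostly not fearful
]
income, fear, pessimism = [], [], []
for x1, x2, y, count in cells:
    income += [x1] * count
    fear += [x2] * count
    pessimism += [y] * count

zero_order = zero_order_slope(income, pessimism)
partial_income, partial_fear = partial_slopes(income, fear, pessimism)

print("zero order (income):", round(zero_order, 3))
print("partial (income, controlling fear):", round(partial_income, 3))
print("partial (fear, controlling income):", round(partial_fear, 3))
```

The data were built so that fear explains ALL of the income effect; with real data the partial usually shrinks without reaching 0.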
VII. SO HOW DO YOU DATA MINE FOR (OTHER) INTERVENING VARIABLES
A. Get a list of candidate intervening variables from the same survey years . . .
- Read a book in the past month -- readers less pessimistic
- Frequency of using the local park in the past month – park users less pessimistic
- Employment status -- unemployed pessimistic
(CODING EMPSTAT) ←
B. Recode the candidate variables, look at the xtabs to see if two conditions are met
1. First check X2 → Y to see if the explanatory variable actually causes pessimism
← doesn’t make the cut
← makes the cut weakly
← makes the cut
2. Then check X1 → X2 to see if income groups actually differ on it
← Park use might be OK
← Unemp fails
← Working v. Other seems important
Weak Sweet Fails
BOTTOM LINE: Go with LF status coded (1 = working, 0 = other)
← Xtab results -- since y= dichot(0,1)
Pessimism = .40 - .166 (Income) + .042 (Labor Force Status) ← slope for LF status is significant
← but the impact of the control variable isn’t very great
Worse yet . . . there might be an interaction effect
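The two screening checks in step B can be sketched in Python as percentage differences read off the xtabs. The counts, the function name and the 10-point cutoff are all hypothetical; the cutoff is only a stand-in for the significance test you would actually read from the xtab output.

```python
def pct_difference(yes_a, n_a, yes_b, n_b):
    """Percentage-point difference between two groups on a (0,1) variable."""
    return 100.0 * yes_a / n_a - 100.0 * yes_b / n_b

# Hypothetical xtab counts for one candidate X2, coded (0,1).
# Condition 1 (X2 -> Y): % pessimistic among X2 = 1 minus % among X2 = 0.
cond1 = pct_difference(yes_a=60, n_a=200, yes_b=90, n_b=200)
# Condition 2 (X1 -> X2): % with X2 = 1 among high income minus low income.
cond2 = pct_difference(yes_a=150, n_a=200, yes_b=50, n_b=200)

CUTOFF = 10.0  # arbitrary cutoff for this sketch, not a rule from the course
makes_the_cut = abs(cond1) >= CUTOFF and abs(cond2) >= CUTOFF

print("condition 1 difference:", cond1)
print("condition 2 difference:", cond2)
print("candidate makes the cut:", makes_the_cut)
```

A candidate has to pass BOTH checks; a variable that differs by income but does not affect pessimism (or the reverse) cannot explain the X1 → Y relationship.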
VIII. DEALING WITH CURVILINEARITY
A. Start by looking at how to deal with Ordinal (3+) variables
← Education is a very important variable for a lot of public policy analysis
← It’s not really usable as an interval variable, not across the full range, and not in the US context
← But you don’t want to lose the gradient, usually best to treat it as an ordinal variable
B. Recode the variable into (k) ordinal categories . . . as shown above
← do a xtab or table of means -- depending on whether Y is dichotomous(0,1) or interval(3+)
← Look at the pattern in the data
← line goes down, higher education → lower pessimism
← not much diff between some college vs. HSG/trade
C. Don’t think of the pattern in the data as a line
Think of the pattern in the data as (k-1) separate CONTRASTS . . .
[WARNING DATA ANALYSIS METHOD AHEAD]
D. Think of each of the (k-1) contrasts as something that is measured with a (0,1) dichotomous variable.
→ (0,1) dichotomies created this way are known as DUMMY VARIABLES
← With (k) categories of education, we need (k-1) dummy variables to estimate the available contrasts
The “left out” category is called the reference category (in this case 0-11 yrs of education)
A dummy var measures the difference between the contrast category and the reference category
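A short Python sketch of the (k-1) contrasts, using invented data (the education codes 0 to 3 and the pessimism values are hypothetical, as are the function names). Each contrast is the mean of Y in one category minus the mean of Y in the reference category. When the dummies are the only predictors in the regression, each dummy's slope equals this difference in means.

```python
def group_means(categories, ys):
    """Mean of y within each category."""
    totals, counts = {}, {}
    for c, y in zip(categories, ys):
        totals[c] = totals.get(c, 0) + y
        counts[c] = counts.get(c, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

def contrasts(means, reference):
    """Difference between each category mean and the reference category mean."""
    return {c: m - means[reference] for c, m in means.items() if c != reference}

# Hypothetical data: education coded 0..3 (0 = reference), pessimism coded (0,1).
education = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
pessimism = [1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0]

means = group_means(education, pessimism)
diffs = contrasts(means, reference=0)

for category, diff in sorted(diffs.items()):
    print("category", category, "vs reference:", diff)
```

With k = 4 categories there are 3 contrasts, one per dummy variable.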
E. Creating (K-1) Dummy Variables To Analyze The Impact Of An Ordinal Variable
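* Coding implied by the labels: 0 = 0-11 yrs (reference), 1 = HSG/trade, 2 = some college, 3 = college grad.
* Each dummy is 1 for its own category, 0 for the other valid categories, and 9 (missing) otherwise.
RECODE education (0=0) (1=1) (2=0) (3=0) (ELSE=9) INTO educHSG.
VARIABLE LABELS educHSG 'dummy var HSG vs 0-11'.
RECODE education (0=0) (1=0) (2=1) (3=0) (ELSE=9) INTO educANYCOLL.
VARIABLE LABELS educANYCOLL 'dummy any coll vs 0-11'.
RECODE education (0=0) (1=0) (2=0) (3=1) (ELSE=9) INTO educCOLLGRAD.
VARIABLE LABELS educCOLLGRAD 'dummy coll grad vs 0-11'.
MISSING VALUES educHSG educANYCOLL educCOLLGRAD (9).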
The result will be (k-1) variables, each of which codes the ENTIRE SAMPLE . . .
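For readers who want to see the same recode outside SPSS, here is a rough Python equivalent. It assumes education is coded 0 to 3 as the RECODE statements imply, and it uses None where SPSS uses the missing code 9; the function name and the sample values are hypothetical.

```python
MISSING = None  # stands in for the SPSS missing-value code 9

def make_dummies(education):
    """Return the (k-1) dummy codes for one respondent's education value."""
    if education not in (0, 1, 2, 3):
        return {"educHSG": MISSING, "educANYCOLL": MISSING, "educCOLLGRAD": MISSING}
    return {
        "educHSG": int(education == 1),
        "educANYCOLL": int(education == 2),
        "educCOLLGRAD": int(education == 3),
    }

# Hypothetical respondents; 8 is an out-of-range code that becomes missing.
sample = [0, 1, 2, 3, 8]
coded = [make_dummies(e) for e in sample]

for e, row in zip(sample, coded):
    print(e, row)
```

Note that the reference category (0) is coded 0 on all three dummies, so every valid respondent gets a value on every dummy.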
Regression works the same as before, except that instead of having one education variable there are now 3 dummy variables measuring the effects of education. Whenever you estimate the effect of education, put all (k-1) dummy vars in the regression equation together.
F. For the ZERO ORDER, there are now (k-1) slopes, t-tests
← All 3 are significant
← D1 and D2 are pretty similar to each other
G. To test Education as a control variable, enter ALL (k-1) dummy variables together in the multiple regression equation along with income . . .
← Income effect is reduced substantially
← education makes a pretty big difference as an explanatory variable
H. The prediction equation works the same way too
Regression equation . . .
Predicted avg(Y) = .513 - .130*(Income) - .077*(educHSG) - .093*(educANYCOLL) - .200*(educCOLLGRAD)
← predicted values
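A Python sketch of how predicted values come out of this equation. The coefficients are copied from the equation above; the coding of income as (0,1) is an assumption made only for this illustration, and education is assumed to be coded 0 to 3 as in the RECODE statements. The function name is hypothetical.

```python
# Coefficients copied from the prediction equation in the notes.
COEF = {
    "constant": 0.513,
    "income": -0.130,
    "educHSG": -0.077,
    "educANYCOLL": -0.093,
    "educCOLLGRAD": -0.200,
}

def predicted_pessimism(income, education):
    """Predicted avg(Y) for an income code (assumed 0/1) and an education code (0-3)."""
    return (
        COEF["constant"]
        + COEF["income"] * income
        + COEF["educHSG"] * int(education == 1)
        + COEF["educANYCOLL"] * int(education == 2)
        + COEF["educCOLLGRAD"] * int(education == 3)
    )

# One predicted value per combination of income and education.
table = {
    (inc, edu): round(predicted_pessimism(inc, edu), 3)
    for inc in (0, 1)
    for edu in range(4)
}

for (inc, edu), value in sorted(table.items()):
    print("income =", inc, "education =", edu, "predicted =", value)
```

For the reference category (education = 0) with income = 0, all the terms drop out and the prediction is just the constant.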
IX. DUMMY VARIABLES ARE THE MAIN TECHNIQUE FOR DEALING WITH CURVILINEARITY
A. Example: Client = WBEZ wants to target fundraising
Commission research to explore the extent to which Education → Listen to public radio, and reasons why this might be the case
B. ZERO ORDER RESULTS
← The listenership variable is nominal (4 cat), so to proceed with causal analysis, I’m going to recode it into a dichotomy
← Curvilinear relationship . . .
← . . . is significant Chi sq(3) = 135 p < .04 phi = .204
← Examine Contrasts
Minimal difference HSG/Trade vs. 0-11
Small difference Some College vs. 0-11
Large difference College Grad vs. 0-11
One of the DUMMIES is not significant
Conclusion . . .
C. INTERVENING VARIABLE:
Education → Politically Independent → Listen to Public radio
Theory ...... X1 → X2 → Y
← recode X2 to Independent vs. other
REGRESSION ANALYSIS
← Education effect is still curvilinear
← Independence isn’t significant
X. HOW TO SUMMARIZE THE ZERO ORDER AND PARTIAL EFFECTS OF AN ORDINAL/NOMINAL VARIABLE MEASURED WITH DUMMY VARIABLES
← Independence doesn’t explain much of the relationship between education and listenership