Topic 8. Nominal and Ordinal Independent Variables

Topics

·  Dummy coding for dichotomous and polychotomous nominal variables

·  Using ordinal independent variables in regression

Nominal Independent Variables

The population model: Y = α + β1X1 + β2X2 + ε / The multiple regression equation: Ŷ = a + b1X1 + b2X2

·  The interpretation of the partial slope: each one-unit increase in X1 is associated with a b1-unit increase or decrease in Y while controlling for (i.e., holding constant/net of) X2

·  Regression slopes are meaningless for untransformed nominal and ordinal variables because regression assumes interval-ratio measurement: the categories must be rank-ordered, and a ‘one-unit’ increase must mean the same thing across all categories of the variable.

·  Nominal variables – The categories of nominal variables are not and cannot be rank-ordered

o  Nominal variables are simply classification schemes – the numbers assigned to the categories do not tell us anything about the ranking of categories, they represent qualitative differences (e.g., sex and race)

o  If a nominal variable has more than two categories and is used in a regression model, the interpretation for the slope would be meaningless because you could reorder the categories any number of ways, each of these ways is legitimate, and the slope would change with each reordering of categories

o  Moreover, the regression would yield only one slope to describe all of the group differences

o  Take, for example, a four-category race variable where White=1, Black=2, Hispanic=3, and other=4. Recoding it as White=1, Hispanic=2, Black=3, and other=4 would be just as legitimate, but the slope would change even though nothing about the data has changed.
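To see why, here is a minimal sketch (Python, with made-up group mean incomes; not data from these notes) showing that the slope from regressing income on the raw race codes changes when two of the arbitrary codes are swapped:

```python
import numpy as np

# Hypothetical group mean incomes for White=1, Black=2, Hispanic=3, other=4
means = {1: 30000, 2: 25000, 3: 26000, 4: 27000}
race = np.repeat([1, 2, 3, 4], 100)                      # 100 cases per group
income = np.array([means[r] for r in race], dtype=float)

def slope(x, y):
    # bivariate OLS slope: cov(x, y) / var(x)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_original = slope(race, income)                         # -800.0

# Swap the codes for Black and Hispanic -- an equally 'legitimate' ordering
race_swapped = np.where(race == 2, 3, np.where(race == 3, 2, race))
b_swapped = slope(race_swapped, income)                  # -1000.0

print(b_original, b_swapped)  # same data, different slopes
```

Neither slope is ‘the’ effect of race; each is an artifact of the arbitrary coding.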

·  Ordinal variables – The difference between categories of ordinal variables is not measurable and may not be consistent

o  Ordinal variables are also classification schemes, but they provide the additional information of rank order for the categories

o  Ordinal variables are often included in regression models, but generally should not be, because the difference between each successive pair of categories is not measurable and is not necessarily the same

§  Self-reported health: 1=poor, 2=fair, 3=good, 4=excellent

§  Is the difference between ‘poor’ and ‘fair’ health really the same as the distance between ‘fair’ and ‘good’ health? Both represent differences of 1 in the data (‘1’ to ‘2’ and ‘2’ to ‘3’), but the numbers are meant to indicate nothing other than order

·  There is, however, a legitimate way to include nominal and ordinal variables in a regression model – we have to take advantage of the unique properties of dummy variables (dichotomous variables coded 0 and 1)

Dummy Coding for Two-category (i.e., dichotomous) Independent Variables

*Sex.

recode sex (2=0) (1=1) (else=sysmis) into male.

freq vars=sex male.

·  You should create one dummy variable when you have a dichotomous independent variable

·  It does not matter which group you code as 0 and which group you code as 1

·  I typically name the new dummy variable after the group coded 1 (to help me remember how it is coded)
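For comparison, a Python/pandas analogue of the SPSS recode above (hypothetical coding: 1=male, 2=female, 9=missing), naming the dummy after the group coded 1:

```python
import pandas as pd

# Hypothetical coding: 1=male, 2=female, 9=refused/missing
df = pd.DataFrame({"sex": [1, 2, 2, 1, 9]})
df["male"] = df["sex"].map({1: 1, 2: 0})   # any other code becomes NaN (like sysmis)
print(df["male"].tolist())
```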

·  The bivariate regression model: Ŷ = a + bX, where X is the dummy variable ‘male’ (0=female, 1=male) and Y is income

o  a is the predicted income for women – that is, an individual with a score of 0 on X

o  b is the difference in income between women and men – that is, the change in income when X changes from 0 to 1

o  You can think of X (the dummy variable ‘male’) as a switch

§  When a respondent is female, the switch is turned off because X=0 and, as a result, b drops out of the equation

·  When X=0 (female), b*X=0, so the regression equation simplifies from Ŷ = a + bX to Ŷ = a

§  When a respondent is male, the switch is turned on because X=1 and, as a result, b does not drop out of the equation; instead, the slope is added to the intercept

·  When X=1 (male), b*X=b, so the regression equation remains Ŷ = a + bX and b is added to a to find the predicted income: Ŷ = a + b


·  For women (X=0): Ŷ = a

·  For men (X=1): Ŷ = a + b
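The ‘switch’ logic can be checked numerically. A sketch with made-up incomes: the intercept from regressing income on the male dummy equals women’s mean income, and intercept plus slope equals men’s mean income.

```python
import numpy as np

male = np.array([0, 0, 0, 1, 1, 1], dtype=float)
income = np.array([20000, 22000, 18000, 35000, 40000, 33000], dtype=float)

X = np.column_stack([np.ones_like(male), male])   # intercept column + dummy
a, b = np.linalg.lstsq(X, income, rcond=None)[0]

print(a)      # intercept = women's mean income
print(a + b)  # intercept + slope = men's mean income
```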

·  The t test for the slope (b):

One-tailed:
H0: b≤0
H1: b>0
Critical t=1.645 (a=.05, one-tailed, d.f.=5,138)
Reject H0

Two-tailed:
H0: b=0
H1: b≠0
Critical t=±1.960 (a=.05, two-tailed, d.f.=5,138)
Reject H0

·  You could also use the p value (identified as ‘Sig.’ in the coefficients table) or the confidence interval to test the hypotheses

·  Notice that you are also given a t value for the intercept; this allows you to test the null hypothesis that the intercept is equal to zero (i.e., that the average income for women in the population is zero). That t test is not relevant for this example (though it can be in others), but the 95% confidence interval provides an interval estimate of the population mean income for women: $18,708.93 to $21,817.41.

The independent samples t test is equivalent to a bivariate regression with one dummy variable.

Don’t believe me? Look at the independent samples t test output for the same example below:
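The equivalence is easy to verify numerically (toy data, Python): the pooled-variance independent-samples t statistic equals the t statistic for the dummy’s slope in a bivariate regression.

```python
import numpy as np

women = np.array([20.0, 22.0, 18.0, 21.0])
men   = np.array([35.0, 40.0, 33.0, 36.0])

# Independent samples t test (pooled variance)
n1, n2 = len(women), len(men)
sp2 = ((n1 - 1) * women.var(ddof=1) + (n2 - 1) * men.var(ddof=1)) / (n1 + n2 - 2)
t_ttest = (men.mean() - women.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Bivariate regression of income on the 'male' dummy
y = np.concatenate([women, men])
x = np.array([0] * n1 + [1] * n2, dtype=float)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
mse = resid @ resid / (len(y) - 2)                 # residual variance
se_b = np.sqrt(mse * np.linalg.inv(X.T @ X)[1, 1]) # standard error of the slope
t_reg = beta[1] / se_b

print(t_ttest, t_reg)  # identical
```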

Dummy Coding for Independent Variables with More than Two Categories

*Race.

recode race (1=1) (2 3 4 5 6=0) (else=sysmis) into race_white.

recode race (2=1) (1 3 4 5 6=0) (else=sysmis) into race_black.

recode race (5=1) (1 2 3 4 6=0) (else=sysmis) into race_hispanic.

recode race (3 4 6=1) (1 2 5=0) (else=sysmis) into race_other.

freq vars=race race_white race_black race_hispanic race_other.

*You can also use cross-tabs to check your recodes.

CROSSTABS

/TABLES=race BY race_white race_black race_hispanic race_other

/FORMAT=AVALUE TABLES /CELLS=COUNT /COUNT ROUND CELL.
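A pandas analogue of the recodes above (hypothetical codes matching the SPSS syntax: 1=White, 2=Black, 5=Hispanic, 3/4/6=other): create one dummy per category, then omit one as the reference when fitting the model.

```python
import pandas as pd

df = pd.DataFrame({"race": [1, 2, 5, 3, 1, 6]})
labels = {1: "white", 2: "black", 5: "hispanic", 3: "other", 4: "other", 6: "other"}
dummies = pd.get_dummies(df["race"].map(labels), prefix="race", dtype=int)
df = df.join(dummies)
print(df)

# Every respondent falls in exactly one category:
assert (dummies.sum(axis=1) == 1).all()
```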

You should create one dummy variable for each category of the variable. One of these, however, must be left out of the regression equation! I create one dummy variable for each category because this gives me the flexibility to change the reference category without having to go back to create another dummy variable.

o  Incorrect: Ŷ = a + b1X1 + b2X2 + b3X3 + b4X4 (dummies for White, Black, Hispanic, and other all included)

§  This is incorrect because there is no reference category. With all four dummies included, every respondent has exactly one dummy equal to 1, so the dummies sum to the intercept column and the model cannot be estimated (perfect multicollinearity). The intercept (a) would be equal to the predicted income for a respondent who is not White, not Black, not Hispanic, and not in the other category, yet every respondent has to be in one of these four categories.

o  Correct: Ŷ = a + b1X1 + b2X2 + b3X3, where X1=Black, X2=Hispanic, and X3=other

§  One of the categories has to serve as the reference category; in this model, the ‘White’ dummy variable is excluded and serves as the point of reference to which the other groups are compared. The intercept (a) would be equal to the predicted income for a respondent who is not Black, not Hispanic, and not in the other category (i.e., a respondent who is White)

·  Theory typically dictates which group should be left out of the model to serve as the reference category; in this example, it would make sense to compare the other groups to Whites given racial income inequality in the US

·  Just as in the previous example involving gender, each dummy variable acts as a switch. In this example, they provide point estimates that indicate by what amount each group differs from Whites:

o  White respondents (X1=0, X2=0, X3=0): Ŷ = a

o  Black respondents (X1=1, X2=0, X3=0): Ŷ = a + b1

o  Hispanic respondents (X1=0, X2=1, X3=0): Ŷ = a + b2

o  ‘Other’ respondents (X1=0, X2=0, X3=1): Ŷ = a + b3

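A numerical sketch (made-up incomes in $1,000s) confirming the pattern above: with White as the reference, the intercept recovers the White mean and each slope is that group’s difference from Whites.

```python
import numpy as np

groups = np.array(["white"] * 3 + ["black"] * 3 + ["hispanic"] * 3 + ["other"] * 3)
income = np.array([30, 32, 28, 24, 26, 25, 26, 27, 25, 27, 28, 26], dtype=float)

X = np.column_stack([
    np.ones(len(groups)),
    (groups == "black").astype(float),     # X1
    (groups == "hispanic").astype(float),  # X2
    (groups == "other").astype(float),     # X3
])
a, b1, b2, b3 = np.linalg.lstsq(X, income, rcond=None)[0]

print(a)       # White mean income
print(a + b1)  # Black mean income
print(a + b2)  # Hispanic mean income
print(a + b3)  # 'Other' mean income
```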

·  Hypothesis testing works the same way. These results suggest that, without controlling for other variables, the incomes of Blacks and Hispanics probably differ from Whites in the population.

·  We are NOT able to evaluate whether the incomes of Blacks differ from those of Hispanics or the ‘other’ group; we would need to change the reference category to test those comparisons.

·  The F test is very helpful for this example because we have three variables measuring one concept (i.e., race). You can use the t tests in the coefficients table to test individual slopes. You can use the F test in the ANOVA box to test them collectively:

o  H0: bBlack=bHispanic=bOther=0; this is equivalent to testing the idea that ‘race’ is unrelated to income

o  Critical F= 2.60 (a=.05, d.f.=3, 5,132)

o  Observed F=28.475

o  We reject the null hypothesis and conclude that ‘race’ is related to income; at least one race dummy variable is significant

Analysis of variance (ANOVA) is equivalent to a regression with multiple dummy variables measuring the same concept. Don’t believe me? Look at the ANOVA output below:
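This equivalence can also be verified directly (toy data): the one-way ANOVA F statistic equals the overall regression F when the predictors are k−1 dummies for the k groups.

```python
import numpy as np

g1 = np.array([30.0, 32.0, 28.0])
g2 = np.array([24.0, 26.0, 25.0])
g3 = np.array([27.0, 28.0, 26.0])
y = np.concatenate([g1, g2, g3])
k, n = 3, len(y)

# One-way ANOVA by hand
ss_between = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in (g1, g2, g3))
ss_within = sum(((g - g.mean()) ** 2).sum() for g in (g1, g2, g3))
F_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression on two dummies (group 1 is the reference)
d2 = np.array([0.0] * 3 + [1.0] * 3 + [0.0] * 3)
d3 = np.array([0.0] * 3 + [0.0] * 3 + [1.0] * 3)
X = np.column_stack([np.ones(n), d2, d3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
F_reg = ((ss_tot - ss_res) / (k - 1)) / (ss_res / (n - k))

print(F_anova, F_reg)  # identical
```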

So why do we need to bother with regression with dummy variables if it is equivalent to a t test or ANOVA?

Because we can easily control for other variables using multiple regression!

One criterion for establishing that there is a causal relationship between two variables is to eliminate alternative explanations – that is, we must demonstrate that the relationship is not spurious. We do this by controlling for other variables in the same model.

Let’s look at an example…

I have regressed income on age, education, work experience, sex, and race. The reference categories for sex and race are female and White, respectively. All three interval-ratio level independent variables are mean-centered.

Remember that the mean value for a dummy variable (i.e., a variable coded 0 and 1) is equal to the proportion of cases coded 1. This means that males comprise 55.3% of our sample, Black respondents comprise 17.2% of the sample, Hispanic respondents comprise 7.8% of the sample, and respondents in the other category (i.e., Asian, Native American, and other) comprise 2.1% of our sample.
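The claim above is easy to verify with toy data: the mean of a 0/1 variable is the proportion of cases coded 1.

```python
import numpy as np

male = np.array([1, 0, 1, 1, 0])
print(male.mean())  # 0.6, i.e., 60% of cases are coded 1
```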

Bivariate correlations are displayed on the next page. Researchers typically examine these to look for evidence of multicollinearity – that is, strong bivariate correlations (i.e., that exceed ±.7) between two or more independent variables.

We want there to be strong bivariate correlations between our dependent and independent variables. We will talk about multicollinearity soon.

You could discuss and interpret all correlations between interval-ratio level variables, but should avoid discussing and interpreting those involving any of your dummy variables (because bivariate correlations assume the interval or ratio level of measurement).

Notice that I have entered the variables in two blocks.

The first block contains the male dummy variable, education, age, and experience.

The second block adds to these the three race dummy variables (White is the omitted reference category).

By entering the three race dummies in the second block, I can evaluate the null hypothesis that the slopes for all three race dummy variables are equal to zero in the population (H0: bBlack=bHispanic=bOther=0).
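The F Change statistic compares the nested models using their R² values: F = ((R²_full − R²_reduced)/q) / ((1 − R²_full)/(n − k_full − 1)), where q is the number of predictors added. A sketch with simulated data (not the data used in these notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 2 + 1.5 * x1 + 0.5 * x2 + rng.normal(size=n)

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

X_reduced = np.column_stack([np.ones(n), x1])        # block 1: 1 predictor
X_full = np.column_stack([np.ones(n), x1, x2, x3])   # block 2 adds x2, x3
r2_1, r2_2 = r_squared(X_reduced, y), r_squared(X_full, y)

q, k2 = 2, 3  # q predictors added; k2 predictors in the full model
F_change = ((r2_2 - r2_1) / q) / ((1 - r2_2) / (n - k2 - 1))
print(F_change)
```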

·  Model 1 F Change test:

o  H0: bMale=bEducation=bAge=bExperience=0

o  Critical F=2.37 (a=.05, d.f.=4, 5,131)

o  Observed F Change=184.991 (also ‘Sig. F Change’ or p=.000, which is less than alpha)

o  We reject the null hypothesis and conclude that Model 1 is an improvement over a null model (i.e., a model with no independent variables) and that at least one independent variable has a non-zero slope.

o  Our independent variables (sex, education, age, and experience) explain 12.5% of the variation in income.

·  Model 2 F Change test:

o  H0: bBlack=bHispanic=bOther=0

o  Critical F=2.60 (a=.05, d.f.=3, 5,128)

o  Observed F Change=1.782 (also ‘Sig. F Change’ or p=.148 which is greater than alpha)

o  We fail to reject the null hypothesis and conclude that Model 2 is not an improvement over Model 1; we do not expect any of the race dummies to be significant.

o  Our independent variables (sex, education, age, experience, and race) explain 12.6% of the variation in income.

·  Model 1 F test: Same as the Model 1 F Change test above

·  Model 2 F test:

o  H0: bMale=bEducation=bAge=bExperience=bBlack=bHispanic=bOther=0

o  Critical F=1.94 (a=.05, d.f.=7, 5,128)

o  Observed F=106.521 (also ‘Sig.’ or p=.000, which is less than alpha)

o  We reject the null hypothesis and conclude that Model 2 is an improvement over a null model (i.e., a model with no independent variables) and that at least one independent variable has a non-zero slope.

Research Hypotheses:

·  bAge≠0; bEducation>0; bExperience>0; bMale>0; bBlack<0; bHispanic<0; bOther<0.

I will focus on the results from Model 2:

·  The intercept=$21,495.08 – this is the predicted income for a White female of average age, education, and experience.

o  Notice that the intercept is no longer simply the mean income – this is due to the inclusion of the dummy variables. I hope that you can now see why it is helpful to center the interval-ratio level independent variables.
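A sketch (simulated data) of what centering does: the slope is unchanged, but the intercept moves to the predicted outcome at the predictor’s mean (in the bivariate case, the mean of Y).

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(20, 60, size=100)
income = 15000 + 300 * age + rng.normal(0, 2000, size=100)

def fit(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a_raw, b_raw = fit(age, income)
a_ctr, b_ctr = fit(age - age.mean(), income)

print(b_raw, b_ctr)          # slope is unchanged by centering
print(a_ctr, income.mean())  # centered intercept = mean of Y (bivariate case)
```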

·  The age slope=-$274.90

o  Critical t=±1.960 (2 tailed, a=.05; d.f.=5,136-7-1=5,128)

o  Observed t=-2.221

o  Reject the null hypothesis

o  Interpretation: each one year increase in age reduces income by $274.90, net of the other variables

·  The education slope=$3,238.61

o  Critical t=1.645 (1 tailed, a=.05; d.f.=5,136-7-1=5,128)

o  Observed t=18.183

o  Reject the null hypothesis

o  Interpretation: each one year increase in education increases income by $3,238.61, net of the other variables

·  The experience slope=$14.69

o  Critical t=1.645 (1 tailed, a=.05; d.f.=5,136-7-1=5,128)

o  Observed t=0.258

o  Fail to reject the null hypothesis

o  Interpretation: you should not interpret the coefficient; it is not likely to be different from 0 in the population

·  The sex slope=$17,538.55

o  Critical t=1.645 (1 tailed, a=.05; d.f.=5,136-7-1=5,128)

o  Observed t=15.490

o  Reject the null hypothesis

o  Interpretation: on average, men earn $17,538.55 more than women controlling for age, education, experience and race

·  The race/Black slope=-$2,342.16

o  Critical t=-1.645 (1 tailed, a=.05; d.f.=5,136-7-1=5,128)

o  Observed t=-1.676

o  Reject the null hypothesis (I prefer to use critical t values rather than p values because the p values are for two-tailed hypothesis tests)

o  Interpretation: on average, those who self-identify as Black earn $2,342.16 less than Whites controlling for age, education, experience and sex

·  The race/Hispanic slope=-$3,275.16

o  Critical t=-1.645 (1 tailed, a=.05; d.f.=5,136-7-1=5,128)

o  Observed t=-1.605

o  Fail to reject the null hypothesis

o  Interpretation: you should not interpret the coefficient

·  The race/other slope=-$3,662.83

o  Critical t=-1.645 (1 tailed, a=.05; d.f.=5,136-7-1=5,128)

o  Observed t=-1.028

o  Fail to reject the null hypothesis

o  Interpretation: you should not interpret the coefficient