Qualitative Variables in a Regression Model using Dummy Variables

  • Differences in Two Population Means
  • Differences Among More Than Two Means
  • Mixtures of Quantitative and Qualitative Independent Variables

1. Differences in Two Population Means

1.1 Ways to express the values of two means

  • Two means can be expressed as two values or as one mean and the difference between the means.
  • Example: You are measuring the starting pay of two majors. The average starting pay of major 1 is $25 per hour and that of major 2 is $15 per hour. Equivalently, the average for major 1 is $25, and moving from major 1 to major 2 reduces the mean by $10.

1.2 Using regression to model these values.

  • The intercept is the average value of y when x = 0, and the slope is the change in the mean of y when x increases by 1.
  • In the example above:

The intercept would be the average starting salary of major 1 and the slope is the change in the average starting salary when you go to major 2.

The following population model can be used to express this:

Mean of Y = 0 + 1X = $25 - $10X

where X = 0 for major 1 and X=1 for major 2

1.3 Interpretation of the coefficients

  • 0 + 1 = 2 which is the average value of y for the second population

0 = 1 which is the average value of y for the first population

1 = 2 - 1 the mean of second population minus the mean of the first

  • Example:

$25 - $10 = $15 is the average starting salary for major 2

$25 is the average starting salary for major 1

-$10 is the average starting salary of the second major minus the average starting salary of the first major
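
As a quick check of the two equivalent descriptions, the following minimal Python sketch converts the two means from the salary example into an intercept and a slope, then recovers the means from the model. The names `beta0`, `beta1`, and `mean_y` are illustrative and do not come from the notes.

```python
# Two equivalent ways to describe two population means.
mu1, mu2 = 25.0, 15.0      # means from the example (dollars per hour)

beta0 = mu1                # intercept: mean of the population coded X = 0
beta1 = mu2 - mu1          # slope: second mean minus first mean

def mean_y(x):
    """Population mean of y for dummy value x (0 = major 1, 1 = major 2)."""
    return beta0 + beta1 * x

print(beta0, beta1)            # 25.0 -10.0
print(mean_y(0), mean_y(1))    # 25.0 15.0
```

The sign of the slope depends only on which population is coded 0: swapping the coding would give an intercept of $15 and a slope of +$10.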

1.4 Estimates

  • Least squares equation

ŷ = b0 + b1 X, where X = 0 or 1

b0 is the estimated mean of the first population (the first sample mean)

b1 is the estimated difference in means (second sample mean minus the first)
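
The claim that b0 and b1 reduce to a sample mean and a difference of sample means can be checked numerically. The sketch below fits the simple least squares line with the usual formulas on made-up data; the samples `group1` and `group2` and the helper `least_squares` are hypothetical and are not taken from the notes.

```python
from statistics import mean

def least_squares(x, y):
    """Return (b0, b1) for the simple least squares line y-hat = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical samples (illustration only).
group1 = [24.0, 26.0, 25.0, 27.0]          # coded X = 0
group2 = [14.0, 16.0, 15.0, 17.0, 13.0]    # coded X = 1

x = [0.0] * len(group1) + [1.0] * len(group2)
y = group1 + group2
b0, b1 = least_squares(x, y)

print(b0, mean(group1))                  # intercept vs. first sample mean
print(b1, mean(group2) - mean(group1))   # slope vs. difference in sample means
```

The two numbers printed on each line should agree up to floating-point rounding, even though the group sizes differ.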

  • Example:

Suppose you want to compare the average time spent by males and females watching a particular cable channel. You found the following least squares line:

where X = 0 for males and X = 1 for females.

The sample intercept is the ______

While the sample slope is the ______

______

1.5 Inferences:

  • Requires the same assumptions and has the same degrees of freedom as simple linear regression
  • The test and confidence interval for the slope are identical in value and meaning to the test and confidence interval for the difference in means (independent samples case, with equal variances assumed so that the variance is pooled) found in an earlier chapter.
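
The equivalence between the slope test and the two-sample test can be illustrated with a short sketch. It computes the t statistic for the slope of a dummy-variable regression and the pooled-variance two-sample t statistic on the same made-up data. The data and the function names `slope_t` and `pooled_t` are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean

def slope_t(x, y):
    """t statistic for H0: slope = 0 in simple linear regression (n - 2 df)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                 # estimate of the error variance
    return b1 / sqrt(s2 / sxx)

def pooled_t(sample1, sample2):
    """Pooled-variance t statistic for H0: mu2 - mu1 = 0 (n1 + n2 - 2 df)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean(sample1), mean(sample2)
    ss1 = sum((v - m1) ** 2 for v in sample1)
    ss2 = sum((v - m2) ** 2 for v in sample2)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)  # pooled variance
    return (m2 - m1) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical samples (illustration only).
sample1 = [24.0, 26.0, 25.0, 27.0]         # coded X = 0
sample2 = [14.0, 16.0, 15.0, 17.0, 13.0]   # coded X = 1
x = [0.0] * len(sample1) + [1.0] * len(sample2)
y = sample1 + sample2

t_regression = slope_t(x, y)
t_two_sample = pooled_t(sample1, sample2)
print(t_regression, t_two_sample)
```

Both statistics share n - 2 = n1 + n2 - 2 degrees of freedom, so the p-values and confidence intervals agree as well.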

2. More than two means

2.1 Number of differences

  • Two means require one mean and one difference; i.e., one dummy variable
  • Three means require one mean and two differences; i.e., two dummy variables
  • k means require one mean and k-1 differences; i.e., k-1 dummy variables
  • Example: Average days sick per month is 4 for type 1 workers, 6 for type 2, and 1 for type 3. This can also be expressed as:

The mean for the first type of worker is 4,

when you go from the first worker population to the second the mean increases by 2, and when you go from the first to the third the mean decreases by 3.

2.2 Modeling using dummy variables

2.2.1 Notation and interpretation for three means

  • The intercept is the average value of y when x1 = 0 and x2 = 0; the first slope is the change in the mean of y when x1 increases by 1, and the second slope is the change in the mean of y when x2 increases by 1.

E(y) = 0 + 1X1 + 2X2

0 + 1 = 2 which is the average value of y for the second population

0 + 2 = 3 which is the average value of y for the third population

0 = 1 which is the average value of y for the first population

1 =2 - 1 the mean of second population minus the mean of the first

2 =3 - 1 the mean of third population minus the mean of the first

  • In the example above:

The following population model can be used to express this:

Mean days absent = β0 + β1 X1 + β2 X2 = 4 + 2 X1 - 3 X2

where

X1 = 1 for the second type of worker and 0 otherwise

X2 = 1 for the third type of worker and 0 otherwise

Mean for worker 1 is ______ (neither worker 2 nor 3)

Mean for worker 2 is ______ (worker 2 and not 3)

Mean for worker 3 is ______ (worker 3 and not 2)

Differences in means =

2.2.2 Notation and interpretation for k means

  • The intercept is the average value of y when all k-1 dummy variables are zero, and the slope of the ith (i = 1, 2, … , k-1) dummy variable is the change in the mean of y when Xi increases by 1.

E(y) = 0 + 1X1 + 2X2 + … + k-1Xk-1

0 + i = i which is the average value of y for the ith population

0 = 1 which is the average value of y for the first population

i =i - 1 the mean of ith population minus the mean of the first

2.3 Inferences

  • Requires the same assumptions and uses the same degrees of freedom as a regression model with k - 1 independent variables
  • The F test for regression tests the null hypothesis that all the slope coefficients (β1, …, β(k-1)) are zero. Here, if all the slope coefficients are zero then all the means are equal.
  • A t-test or a confidence interval for βi makes inferences about the difference between the mean of the level coded by Xi (level i+1) and the mean of the first level.
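
The F statistic for a dummy-only model can be computed directly from the groups, because the fitted value for every observation is its group mean. The sketch below assumes made-up data for k = 3 groups; `f_statistic` is an illustrative helper, not a function from any library.

```python
from statistics import mean

def f_statistic(groups):
    """Overall F statistic and degrees of freedom for a dummy-variable model."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    # Regression sum of squares: fitted values are the group means.
    ssr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error sum of squares: deviations of observations from their group mean.
    sse = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_reg, df_err = k - 1, n - k
    return (ssr / df_reg) / (sse / df_err), df_reg, df_err

# Hypothetical samples with means 4, 6, and 1 (illustration only).
groups = [[3.0, 4.0, 5.0], [5.0, 6.0, 7.0], [0.0, 1.0, 2.0]]
f_value, df_reg, df_err = f_statistic(groups)
print(f_value, df_reg, df_err)   # about 19.0 with 2 and 6 degrees of freedom
```

The null hypothesis of equal means is rejected when the F value exceeds the critical value from an F table with k - 1 and n - k degrees of freedom.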

Example: Suppose you want to compare the average time spent by adult males, adult females, and children watching a particular cable channel. From a sample of 30, you found the following least squares line, where X1 = 1 for adult males and 0 otherwise, and X2 = 1 for children and 0 otherwise:

The slope estimate of 1.5 could be interpreted as:

4 could be interpreted as:

Additionally, using multiple regression to test all the coefficients, you found an F test value of 4.5. Complete the following hypothesis test:

H0

H1

Test Statistic

Rejection Region

Conclusion:

3. Mixtures of Quantitative and Qualitative Variables

Consider the following: Y is the time spent watching a cable channel, X1 is the total time spent on all channels during the same time period, and category (adult males, adult females, or children) is a qualitative variable.

Examine the following model:

E(y) = 0 + 1X1 + 2X2+ 3X3

Where

X2 = 1 if males and 0 otherwise

X3= 1 if children and 0 otherwise

What is the equation for males?

What is the equation for children?

What is the equation for females?

What is the interpretation of β1?

How would you test for the effect of category?
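
One way to explore these questions is to evaluate the model for each category. The sketch below uses invented coefficient values, so the numbers carry no meaning beyond illustration; `expected_time` and `beta` are hypothetical names. It shows that, in a model without interaction terms, each category has its own intercept while all categories share the same slope on X1.

```python
def expected_time(total_time, category, beta):
    """E(y) = b0 + b1*X1 + b2*X2 + b3*X3, with adult females as the baseline."""
    b0, b1, b2, b3 = beta
    x2 = 1 if category == "male" else 0     # dummy for adult males
    x3 = 1 if category == "child" else 0    # dummy for children
    return b0 + b1 * total_time + b2 * x2 + b3 * x3

# Hypothetical coefficients (not estimates from the notes).
beta = (1.0, 0.25, 2.0, -0.5)

for category in ("female", "male", "child"):
    intercept = expected_time(0.0, category, beta)
    slope = expected_time(11.0, category, beta) - expected_time(10.0, category, beta)
    print(category, intercept, slope)
```

With these coefficients the three lines are parallel: only the intercept shifts by β2 for males and by β3 for children relative to females.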

4. Examples from Bureau of Labor Statistics:

Pricing of College Textbooks

Pricing of Microwave ovens

Creating Occupational Pay relatives