Ideal Answers to Chapter 9 (Multiple Regression) Questions

QUESTION 9.1. The correlation between Lisa’s presence in my kitchen on a given day (0 = absent, 1 = present) and the disappearance of cookies that day (0 = did not happen, 1 = did happen) is r = (10 – 2) / 12 = 8/12 = .67. This correlation is significant at p < .05. If these were the only data at my disposal, I would be forced to conclude that there is a very good chance that Lisa was guilty of stealing at least some of the cookies.
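If you want to check this arithmetic outside SPSS, here is a minimal sketch in Python. The 12-day pattern below is a hypothetical reconstruction that is consistent with the answers in this chapter (Lisa and the missing cookies agree on ten days and disagree on two, with balanced marginals); it is not the actual exercise file.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical 12-day reconstruction: Lisa and the missing cookies agree on
# 10 days and disagree on 2, so r (phi) works out to (10 - 2) / 12 = .67.
lisa    = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
cookies = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

r, p = pearsonr(lisa, cookies)
print(f"r = {r:.2f}, p = {p:.3f}")   # r = .67, p < .05
```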

QUESTION 9.2a. I’d now place all of the blame on Bart. Without exception, he was always in my kitchen when cookies disappeared. Moreover, when he did not visit my kitchen, cookies never disappeared. Lisa seems to have merely had the misfortune of usually being wherever Bart was.

QUESTION 9.2b. Based on the additional data, I’d say the beta weight reflecting the unique association between Bart’s presence in my kitchen and the disappearance of cookies should be β = 1.00, meaning that Bart is perfectly correlated with cookie disappearance and that Lisa seems to play no role.

QUESTION 9.2c. Because I would no longer place any blame on Lisa, I’d say the beta weight for Lisa would be β = 0.00, meaning that there is no unique association between Lisa’s presence in my kitchen and the disappearance of cookies. As I said before, Lisa just usually happened to visit when Bart visited.

QUESTION 9.3. The two days that were most informative with regard to who was truly responsible for the disappearance of the cookies were days 6 and 7. On Day 6, Lisa did not visit my kitchen, Bart did visit it, and cookies did disappear. On Day 7, Lisa did visit my kitchen, Bart did not visit it, and cookies did not disappear. In other words, if we isolate the two kids, it seems clear that Bart is the only one whose presence in my kitchen is uniquely associated with the disappearance of cookies. The other ten days are less informative because on those days, the two children were always jointly absent or present in my kitchen.
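To make the β = 1.00 and β = 0.00 pattern from 9.2b and 9.2c concrete, here is a minimal sketch using the same hypothetical 12-day reconstruction as above: Bart’s presence matches the cookie disappearances exactly, and he differs from Lisa only on days 6 and 7.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical 12-day reconstruction; the 6th entry is Day 6 (Lisa absent,
# Bart present, cookies gone) and the 7th is Day 7 (Lisa present, Bart
# absent, cookies intact).
lisa    = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
bart    = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
cookies = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# z-score everything so the slopes come out as beta weights.
z = lambda x: (x - x.mean()) / x.std()
X = sm.add_constant(np.column_stack([z(bart), z(lisa)]))
print(sm.OLS(z(cookies), X).fit().params)   # intercept ~ 0, Bart ~ 1.00, Lisa ~ 0.00
```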

QUESTION 9.4. The simultaneous multiple regression analysis in which I predicted income from both height and age revealed that height was still a significant predictor of income when we statistically equated taller and shorter people for age. Specifically, taller people still tended to make more money, β = .436, t (27) = 2.39, p = .024. In contrast, after controlling for height, there was no unique association between age and income, β = -.172, t (27) = -0.94, p = .355. Apparently, it is not simply the case that taller people tend to make more money because they tend to be younger.
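For anyone who wants to replicate this analysis outside SPSS, a minimal sketch follows. It assumes the exercise data have been exported to a CSV file with hypothetical column names (income, height, age); z-scoring all three variables makes the slopes beta weights.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names; the exercise data set is not reproduced here.
df = pd.read_csv("height_income.csv")      # assumed columns: income, height, age

# z-score all three variables so the slopes come out as beta weights
cols = ["income", "height", "age"]
z = (df[cols] - df[cols].mean()) / df[cols].std()

model = sm.OLS(z["income"], sm.add_constant(z[["height", "age"]])).fit()
print(model.summary())   # beta weights, t values, and p values for height and age
```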

QUESTION 9.5. The simultaneous multiple regression analysis in which I predicted income from (a) height, (b) age, and (c) gender revealed that height was only a marginally significant predictor of income when we statistically equated taller and shorter people for gender as well as for age. Taller people showed only a marginally significant tendency to make more money, β = .505, t (26) = 1.88, p = .072. After controlling for height and gender, there was no unique association between age and income, β = -.147, t (26) = -0.74, p = .465. Further, after controlling for height and age, there was no unique association between gender and income, β = .086, t (26) = 0.36, p = .725.

This pattern of findings tentatively suggests that height may play a role in income that goes beyond the well-documented tendency for men to make more money than women. Perhaps if we had a larger sample size, the unique height-income association would prove to be significant. In terms of effect sizes, it is somewhat reassuring, at least, that the unique association between height and income (β = .505) is about the same size as the zero-order correlation between height and income (r = .514).

QUESTION 9.6. I predicted life satisfaction from both income satisfaction and friendship quality. The beta weights for both income satisfaction and friendship quality were .50, which is exactly the same as the zero-order correlations for these two predictors. The reason the beta weights were exactly the same as the zero-order correlations is that the two predictors were already statistically independent of one another, by virtue of being completely uncorrelated. So no adjustment was necessary to figure out the unique association between each variable and life satisfaction. The multiple R was .707, which when squared is .500. Taken together, these two variables account for exactly half of the variance in life satisfaction. That, too, makes perfect sense because each variable alone accounted for 25% of the variance; for each variable the coefficient of determination was r² = .25. This is one of those highly unusual situations in which the contribution of each variable to the prediction of y can simply be added together (.25 + .25 = .50). Income satisfaction accounted for ¼ of the variance in life satisfaction, and friendship quality accounted for a unique, and thus additional, ¼ of the variance.
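This logic is easy to verify with a small simulation (simulated data below, not the exercise file): when two predictors are exactly uncorrelated, each beta weight equals its zero-order correlation, and the two squared correlations add up to R².

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two predictors made exactly uncorrelated in the sample (simulated data).
x1 = rng.standard_normal(n); x1 -= x1.mean()
x2 = rng.standard_normal(n); x2 -= x2.mean()
x2 -= x1 * (x1 @ x2) / (x1 @ x1)        # remove any incidental overlap

y = x1 + x2 + rng.standard_normal(n)    # both predictors matter equally

zscore = lambda v: (v - v.mean()) / v.std()
zx1, zx2, zy = zscore(x1), zscore(x2), zscore(y)

r1, r2 = (zx1 @ zy) / n, (zx2 @ zy) / n              # zero-order correlations
betas = np.linalg.lstsq(np.column_stack([zx1, zx2]), zy, rcond=None)[0]

yhat = np.column_stack([zx1, zx2]) @ betas
r_squared = 1 - ((zy - yhat) ** 2).sum() / (zy ** 2).sum()

print(r1, r2)                     # zero-order rs
print(betas)                      # beta weights equal the rs: no adjustment needed
print(r1**2 + r2**2, r_squared)   # the squared rs simply add up to R-squared
```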

QUESTION 9.7. As you can see from the screen capture I pasted in the Appendix, I corrected my SPSS data file so that for store 10, the value for “pop2miles” was 18814.00. In addition, I deleted the impossible employee satisfaction score for store 15. Give me my points.

QUESTION 9.8. In a simultaneous multiple regression analysis, I predicted annual sales in these 20 convenience stores from (a) average daily traffic, (b) population within a 2-mile radius of the store, (c) median family income for households within a 2-mile radius of the store, and (d) employee satisfaction levels at each store. This analysis revealed that each of these predictors was uniquely associated with annual sales levels.
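For completeness, here is a minimal sketch of how this simultaneous analysis could be reproduced outside SPSS. It assumes the corrected data file has been exported to CSV with hypothetical column names (sales, traffic, pop2miles, income, satisfact); patsy’s standardize() z-scores every variable so that the slopes are beta weights comparable to those reported below.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the corrected 20-store file; column names are assumptions.
stores = pd.read_csv("convenience_stores.csv")

# standardize() z-scores each variable, so the slopes are beta weights;
# the intercept of the standardized equation will be essentially zero.
model = smf.ols(
    "standardize(sales) ~ standardize(traffic) + standardize(pop2miles)"
    " + standardize(income) + standardize(satisfact)",
    data=stores,
).fit()
print(model.summary())   # beta weights and p values for the four predictors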

First, once we controlled for all of the other variables in the analysis, traffic levels were uniquely and negatively associated with annual sales. All else being equal, stores located in areas where there was more traffic had lower, not higher, annual sales, β = -.852, p = .032. This makes sense in that when people are stuck in traffic, they are probably more motivated to get home than to stop and grab a Big Gulp. Second, not surprisingly, population density was a positive unique predictor of sales, β = 1.03, p = .001. Stores in more heavily populated neighborhoods had higher sales. That is, all else being equal, more people living near a store makes for more customers. Third, median family income in the neighborhood surrounding the stores was also positively, and uniquely, associated with sales, β = 1.48, p < .001. Stores located in richer neighborhoods took in more money, presumably because people living nearby had more money to spend. Finally, after controlling for all of these unique effects, there was also a modest but significant effect of employee satisfaction, β = .23, p = .037. Ceteris paribus, stores with more satisfied employees had higher annual sales, perhaps because more satisfied employees somehow communicated their satisfaction to customers (e.g., by being friendlier or more professional).

The marketing implications of these findings are straightforward. If I were advising 7-11 or a similar company about where to locate their next store, I would encourage them to pick a location in a densely populated, wealthy neighborhood where there was as little traffic as possible. Once the store was built, I would also encourage the company to do whatever was possible to hire and retain satisfied, engaged employees. One caveat to this advice is that we cannot know for sure whether some of these variables are really drivers of sales. It seems unlikely that convenience store sales create neighborhood income, but it is conceivable that higher sales might increase employee satisfaction, or that the friendliness of a neighborhood might be the real reason for the unique connection between sales and satisfaction.

QUESTION 9.9. The table below includes the beta weights and p values from a simultaneous multiple regression analysis in which I used the original data without correcting the errors. If I had believed these erroneous results, I would have concluded that the only significant predictor of annual sales is average daily traffic. Thus, I would have erroneously concluded that, all else being equal, it is a great idea to build convenience stores in neighborhoods where there is a great deal of traffic. Based on the corrected data, we know that the opposite is actually true (and that the other three variables are also important).

Predictor                 beta weight    p value
------------------------------------------------
average daily traffic        .564          .020
population density           .239          .115
neighborhood income          .291          .195
employee satisfaction        .033          .831

QUESTION 9.10. I assume that this researcher conducted a study that is eerily similar to the one we conducted. In fact, in our study, the simple (zero-order) correlation between daily traffic levels and annual sales was r (18) = .77, p < .001, which is exactly what this researcher observed. Our regression analysis revealed, however, that when one controls for variables such as neighborhood population density and neighborhood income, the true, unique association between traffic levels and annual sales is negative. In short, this researcher needed to conduct a more sophisticated study to get a true picture of what predicts convenience store sales.

QUESTION 9.11. Multiple regression analysis cannot replace experimental research techniques. First, because laboratory experimenters usually have the luxury of independently manipulating (or holding constant) any independent variables in which they are interested, they do not have to worry about having so many correlated predictors that they cannot hope to separate them all. Thus, they never have to worry about multicollinearity. In the cookie example, we were barely able to separate Bart and Lisa because there were only two days on which only one of them was present. If we had been looking at two other children at the same time, it is unlikely that we would have gotten a clear answer to our question of who was uniquely responsible for the cookie thefts.

Second, in the example of cookie thefts, it is obvious that disappearing cookies cannot make children appear in my kitchen. However, in most cases involving correlational data, reverse causality is a much more serious worry. For example, suppose a multiple regression analysis showed that income uniquely predicted education even after controlling for achievement motivation. This finding might mean that there is something unique about income that contributes to educational achievement, but it might also mean that there is something unique about educational achievement that contributes to income. In short, multiple regression analyses can often address the third variable problem, but they cannot do anything to address the problem of reverse causality.

Finally, even if we focus solely on the third variable problem, multiple regression analyses can only address third variables (i.e., confounds) that a researcher was knowledgeable, sophisticated, or well-funded enough to measure. In contrast, in true experiments, random assignment equates two or more groups on every conceivable individual difference variable. In the long run, random assignment completely eliminates all confounds, measured or not.

QUESTION 9.12. The crucial variables in the World95 data set that allow a basic test of this idea are (a) lifeexpf (female life expectancy), (b) lit_fem (female literacy rates), and (c) fertilty (average number of children women have). Because the file also includes (d) gdp_cap (the per capita GDP of each country), it is also possible to control for national wealth. A multiple regression analysis that predicts female longevity from the two indicators of women’s social status (female literacy and fertility) while controlling for GDP will reveal that both female literacy rates and fertility do uniquely predict female life expectancy even after controlling for GDP. In an even more stringent test, female literacy and fertility both predict female life expectancy after controlling for male life expectancy. However, this last analysis is worrisome because male and female longevity are so very highly correlated (i.e., because of multicollinearity). In fact, a test for multicollinearity in this last analysis yields a very high VIF (whereas the analysis controlling for GDP does not). Although we will discuss multicollinearity later in the text, this data file could provide an opportunity to discuss it earlier. At any rate, regression analyses that include both male and female life expectancies as predictors should be interpreted very cautiously. At a minimum, it would be good to replicate the analysis using a more recent version of these data. It would be easy to create such a data file, by the way, from sources such as the UNDP (United Nations Development Programme) or the CIA World Factbook.

See https://www.cia.gov/library/publications/the-world-factbook/
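If you want to see the multicollinearity check concretely, here is a minimal sketch. It assumes the World95 file has been exported to CSV; the variable names follow the answer above, and lifeexpm (male life expectancy) is a guess at that variable’s name. Substituting gdp_cap for lifeexpm gives the less problematic analysis.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumed CSV export of the World95 file; "lifeexpm" is a guessed variable name.
world = pd.read_csv("world95.csv").dropna(
    subset=["lifeexpf", "lit_fem", "fertilty", "lifeexpm"]
)

X = sm.add_constant(world[["lit_fem", "fertilty", "lifeexpm"]])
print(sm.OLS(world["lifeexpf"], X).fit().summary())

# VIF for each predictor (column 0 is the constant); values far above 5 or so
# flag the multicollinearity problem described above.
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```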

QUESTION 9.13. The reason the logistic regression analysis yielded the same Exp(B) for Bart, with or without Fred in the analysis, is that Fred was absolutely uncorrelated with Bart (r = 0) and with the disappearance of cookies (also r = 0). That is, Bart’s regression coefficient didn’t change when Fred was added to the equation because there was no effect of Fred to adjust for in the first place. This is a nice, intuitive validation that regression does what it is supposed to do. In this case, logistic regression made no adjustment at all for a variable when the variable had no association at all with anything in the model. Presumably, if Fred had been correlated, even slightly, with either Bart or the disappearance of cookies, there would have been at least a slight adjustment in Bart’s value once we accounted for what was going on with Fred.

QUESTION 9.14. Because the data pattern did not change at all, it is not surprising that the odds ratio for Bart remains exactly 49:1, p < .001, meaning that cookies are way more likely to disappear on days when Bart visits than on days when he does not visit. Increasing the sample size by a factor of 5 (n = 80 rather than n = 16) without changing anything else didn’t reduce the 95% confidence interval for Exp(B) as much as I thought it would. The new confidence interval with n = 80 observations is 13.0 to 184.4. That is, the true population value for Bart’s effect on missing cookies could plausibly be as low as 13.0 or as high as 184.4. That’s still a very large range. I assume that this has something to do with the fact that both the predictors and the criterion in the model are dichotomous. Simply changing a couple of 0s to 1s would have quite a substantial effect on the estimates for the odds ratios.
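Here is a minimal sketch of how Exp(B) and its confidence interval could be checked outside SPSS. The 16-day pattern below is a hypothetical reconstruction that reproduces the 49:1 odds ratio described above; repeating each day five times gives the n = 80 version.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical 16-day pattern consistent with the numbers reported above:
# Bart present on 8 days (cookies gone on 7 of them), absent on 8 days
# (cookies gone on 1 of them), so the odds ratio is (7 * 7) / (1 * 1) = 49.
bart    = np.array([1] * 8 + [0] * 8)
cookies = np.array([1] * 7 + [0] * 1 + [1] * 1 + [0] * 7)

def exp_b(pred, outcome):
    """Fit a one-predictor logistic regression; return Exp(B) and its 95% CI."""
    fit = sm.Logit(outcome, sm.add_constant(pred)).fit(disp=0)
    return np.exp(fit.params[1]), np.exp(fit.conf_int()[1])

print(exp_b(bart, cookies))                           # n = 16: Exp(B) = 49, very wide CI
print(exp_b(np.tile(bart, 5), np.tile(cookies, 5)))   # n = 80: CI shrinks but is still wide
```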