Practice Problems on Correlation & Simple Regression

1. Suppose that, across a sample of stores, the correlation coefficient between beer prices and beer sales is -0.65. What does this number indicate?

(a) There is almost no variability in beer sales that is unexplained by beer price.

(b) More beer sales tend to go along with lower beer prices.

(d) All of the above are true.

2. The purpose of a scatterplot is:

(a) To test for the significance of association in bivariate data.

(b) To calculate the correlation coefficient.

(d) To determine a confidence interval for the regression slope.

3. The standard error of the sample regression slope tells you:

(a) Approximately how different the slope coefficient will be in different samples.

(b) Approximately how large the prediction errors are.

(d) Approximately how much of the variability of Y is explained by X.

4. The correlation coefficient describes the ______ between two variables.

(a) strength of curved association

(b) strength of random association

(d) American Marketing Association

5. R² is a measure used to describe the overall fit of the regression line. Which of the following statements is/are correct about R²?

(a) In general, the closer R² is to 1, the better the fit of the regression line to the points in the scatterplot.

(b) R² tells you the proportion of the points in the scatterplot that fall right on the regression line.

(d) All of the above are true statements about R².

6. A cost accountant is developing a regression model to predict the total cost of producing a batch of circuit boards as a function of the batch size. The independent and dependent variables for this regression would be:

(a) IV: circuit board DV: batch size

(b) IV: batch size DV: total cost

(d) IV: total cost DV: average cost

(The next 9 questions are based on the following information.)

Pete Estrian is looking to buy a used Honda Civic. He checks the Internet and finds a huge list of Civics for sale in his area. He selects a random sample of 10 cars, ranging in age from 2 years old to 15 years old. For each car, he enters the age (in years) and the offered sales price (in thousands of dollars) into Excel. He runs a regression predicting price from age and gets the following (edited) output:

ANOVA
              df      SS      MS       F   Significance F
Regression     1    93.5   93.51   117.0   0.000005
Residual       8     6.4    0.80
Total          9    99.9

               Coefficients   Standard Error   t Stat   P-value
Intercept             12.10             0.60     20.2   0.00000004
Age                   -0.80             0.07    -10.8   0.000005

7. What is the equation for the regression line?

(a) Predicted price = $12,100 - $800 * Age

(b) Predicted price = $12,100 - $600 * Age

(d) Predicted price = $12,100 - $70 * Age

8. Car #5 in the sample was 10 years old and cost $4,000. Determine the predicted price and the residual for this car.

(a) Predicted price = $11,300; residual = -$7,300

(b) Predicted price = $11,300; residual = $7,300

(d) Predicted price = $4,100; residual = -$100
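Note: the mechanics behind a question like this can be sketched in a few lines of Python. The line and the data point below are hypothetical (they are not the Civic numbers), and the function names are made up for illustration. The sketch assumes the usual convention that a residual is the actual value minus the predicted value.

```python
def predict(intercept: float, slope: float, x: float) -> float:
    """Predicted Y on the regression line at a given X."""
    return intercept + slope * x


def residual(actual: float, predicted: float) -> float:
    """Residual = actual - predicted (negative means the point is below the line)."""
    return actual - predicted


# Hypothetical line: predicted Y = 20.0 - 1.5 * X
predicted = predict(intercept=20.0, slope=-1.5, x=4.0)   # 14.0
resid = residual(actual=12.5, predicted=predicted)       # -1.5, so the point lies below the line
print(predicted, resid)
```

A negative residual means the observed point sits below the regression line, and a positive residual means it sits above.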

9. Construct a 95% confidence interval for the drop in price associated with an additional year of age.

(a) ($639, $961)

(b) ($667, $933)

(d) Cannot be determined from the information given
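Note: a confidence interval for a regression slope is usually built as estimate plus or minus (critical t value times standard error), with n - 2 degrees of freedom in simple regression. The sketch below uses hypothetical numbers (a sample of 12, so 10 degrees of freedom) rather than the Civic output, and the critical value 2.228 is read from a standard t table for 95% confidence and 10 degrees of freedom. The function name is an assumption for illustration.

```python
def slope_confidence_interval(slope: float, std_error: float, t_critical: float):
    """Return (lower, upper) bounds: slope +/- t_critical * std_error."""
    margin = t_critical * std_error
    return slope - margin, slope + margin


# Hypothetical regression: slope = -1.5, SE = 0.2, n = 12 (df = 10).
# t_critical = 2.228 is the two-sided 95% table value for df = 10.
low, high = slope_confidence_interval(slope=-1.5, std_error=0.2, t_critical=2.228)
print(round(low, 4), round(high, 4))   # -1.9456 -1.0544
```

If the interval excludes 0, a two-sided test at the matching significance level would reject the hypothesis of a flat population line.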

10. What is the correlation between Age and Price?

(a) r = 0.97

(b) r = -0.97

(d) r = -0.80
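Note: in simple regression, R² equals the regression sum of squares divided by the total sum of squares, and the correlation is the square root of R² carrying the sign of the slope. A minimal sketch with hypothetical sums of squares (not taken from the output above) and a made-up function name:

```python
import math


def correlation_from_anova(ss_regression: float, ss_total: float, slope: float) -> float:
    """r = sign(slope) * sqrt(SS_regression / SS_total); valid for one predictor only."""
    r_squared = ss_regression / ss_total
    return math.copysign(math.sqrt(r_squared), slope)


# Hypothetical ANOVA table: SS_regression = 64, SS_total = 100, negative slope.
r = correlation_from_anova(ss_regression=64.0, ss_total=100.0, slope=-1.5)
print(round(r, 2))   # -0.8
```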

(Honda Civics Prices and Ages, continued.)

11. What is the typical difference between the predicted prices (based on the regression line) and the actual prices for these cars?

(a) about $70

(b) about $600

(d) about $2,530

12. What does the p-value of 0.000005 tell us?

(a) It is not very plausible that the population regression line relating Price to Age is flat.

(b) There is strong evidence that the slope of the population regression line is not 0.

(d) All of the statements above are implied by the low p-value.

13. The average age of the cars in the sample is 7.1 years. What is the average price of the cars in the sample?

(a) $5,000

(b) $6,420

(d) Cannot be determined from the information given.
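Note: a least-squares line always passes through the point (mean of X, mean of Y), so plugging the mean of X into the fitted line returns the mean of Y. The sketch below checks this on a tiny made-up data set; the ages and prices are hypothetical, not Pete's sample, and fit_line is an illustrative helper.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: ages in years, prices in thousands of dollars.
ages = [2.0, 4.0, 6.0, 8.0]
prices = [11.0, 9.0, 8.0, 4.0]

intercept, slope = fit_line(ages, prices)
predicted_at_mean = intercept + slope * mean(ages)
print(predicted_at_mean, mean(prices))   # both are 8.0, up to floating-point rounding
```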

14. What is the best conclusion we can draw about Honda Civics that are 5 years old, based on the information we have?

(a) We conclude that the average price of 5-year-old Civics is about $8,100, but we expect to see some differences in prices for different 5-year-old Civics.

(b) We conclude that all 5-year-old Civics should cost the same, about $8,100.

(c) We conclude that all 5-year-old Civics should cost more than all 6-year-old Civics, although we can’t be completely sure by how much.

(d) All of the above are equally valid conclusions.

15. Suppose instead that Pete had taken a second sample, consisting of 6 cars that ranged in age from 4 to 8 years old, and suppose that he regressed Price on Age for this second sample. How would the standard error of the regression slope be different for this second sample (compared to the first sample described above)?

(a) The standard error of the regression slope would probably be larger for the second sample.

(b) The standard error of the regression slope would probably be about the same for both samples.

(The next 5 questions deal with the following information.)

Below is partial output from a regression predicting consumption of beef (called BeefConsumption, and measured in pounds of beef per person annually) from the price of beef (called BeefPrice, and measured in cents per pound). [The data are from the United States from 1925 to 1941. During this period, the price of beef ranged from about 55 to 80 cents per pound.]

ANOVA
              df      SS      MS       F   Significance F
Regression     1   166.0   166.0    19.6   0.0005
Residual      15   127.2     8.5
Total         16   293.1

               Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             85.24             7.30    11.67   6.3E-09       69.68      100.80
BeefPrice             -0.47             0.11    -4.42    0.0005       -0.69       -0.24

16. In 1941, beef cost 56 cents per pound, and annual consumption of beef was 60.0 pounds per person. Determine the predicted consumption of beef in 1941, and say whether the data point for 1941 is above or below the regression line.

(a) Predicted consumption of beef = 57.0; data point is above the regression line

(b) Predicted consumption of beef = 57.0; data point is below the regression line

(d) Predicted consumption of beef = 58.9; data point is below the regression line

17. What is the best interpretation of the y-intercept (85.24) for this regression?

(a) When beef was free in 1941, everyone consumed about 85 pounds of beef per year.

(b) The intercept doesn’t tell us much, because 0 isn’t a reasonable value to plug in for the price of beef.

(d) Both (a) and (c).

18. Determine the correlation between BeefPrice and BeefConsumption.

(a) r = +0.75

(b) r = -0.75

(d) r = -0.47

19. Determine the standard deviation of the residuals for this regression.

(a) 2.9 pounds/person

(b) 0.11 pounds/person

(d) 0.75 pounds/person
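Note: the standard deviation of the residuals (also called the standard error of the regression) is the square root of the residual mean square, that is, the residual sum of squares divided by n - 2 in simple regression. A minimal sketch with hypothetical values, not the beef output, and a made-up function name:

```python
import math


def residual_std_dev(ss_residual: float, n: int) -> float:
    """sqrt(SS_residual / (n - 2)): typical size of a prediction error, in units of Y."""
    return math.sqrt(ss_residual / (n - 2))


# Hypothetical regression: SS_residual = 90 from a sample of 12 observations.
s_e = residual_std_dev(ss_residual=90.0, n=12)
print(s_e)   # 3.0
```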

20. Can we reject the null hypothesis that BeefConsumption is unrelated to BeefPrice? (Use alpha = 0.05.)

(a) Yes, because the p-value for the slope is very small (p = 0.0005).

(b) Yes, because the 95% confidence interval for the slope does not include 0.

(d) Both (a) and (b).

21. Let X = the dosage of a stimulant (in milligrams) given to a patient, and let Y = the pulse rate of the patient (in beats per minute). Data are gathered for 100 patients, and the resulting regression equation is: predicted Y = 70 + 0.35 * X.

What is the best interpretation of the number 0.35?

(a) 35% of the variability in pulse rate is explained by the amount of stimulant.

(b) The correlation between amount of stimulant and pulse rate is 0.35.

(c) When a patient is given one additional milligram of stimulant, their pulse tends to increase by about 0.35 beats per minute.

(d) When a patient is given 0.35 additional milligrams of stimulant, their pulse tends to increase by about 1 beat per minute.

(The next 2 questions are based on the following information.)

22. Across a sample of data from New York City (one data point for each month), the correlation between average monthly temperature (X) and number of homicides per month (Y) is observed to be 0.40. The most appropriate interpretation is:

(a) As the temperature increases by 10 degrees, we expect to see 4 more homicides.

(b) For every additional 10 homicides, we expect temperature to increase by 4 degrees.

(d) Average temperature explains 40% of the variability in number of homicides per month.

23. If the standard deviation of the monthly temperature measurements is 15 degrees, and the standard deviation of number of homicides per month is 5, what is the slope of the regression line predicting number of homicides from monthly temperature? (calculate directly)
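Note: the slope of a simple regression line can be computed directly from the correlation and the two standard deviations, as slope = r * (SD of Y) / (SD of X). The sketch below uses hypothetical values rather than the temperature and homicide figures, and the function name is made up for illustration.

```python
def slope_from_correlation(r: float, sd_x: float, sd_y: float) -> float:
    """Least-squares slope for predicting Y from X: r * sd_y / sd_x."""
    return r * sd_y / sd_x


# Hypothetical summary statistics: r = 0.6, SD of X = 10, SD of Y = 4.
b = slope_from_correlation(r=0.6, sd_x=10.0, sd_y=4.0)
print(round(b, 2))   # 0.24
```

The slope is in units of Y per unit of X, which is why the ratio puts the SD of Y on top.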

24. Which of the following is not an assumption of the simple linear regression model?

(a) The population mean of Y (for each level of X) is linearly related to X.

(b) The population variance of Y is the same for each level of X.

(d) The population correlation between X and Y is equal to 1.

(The next 3 questions are based on the following information.)

25. A manager wants to predict the cost of travel (Y) for salespeople based on the number of days (X) spent on each trip. Based on a sample of data for trips ranging from 1 to 5 days, the following regression line is estimated: Y-hat = 180 + 130 X

What would be the best prediction for the cost for a trip lasting 3 days?

(a) $390

(b) $570

(d) Cannot be determined; the regression line shouldn't be used to predict costs for trips lasting that long.

26. What would be the best prediction for the cost for a trip lasting 10 days?

(a) $1,300

(b) $1,480

(d) Cannot be determined; the regression line shouldn't be used to predict costs for trips lasting that long.

27. Suppose the average length of the trips in the sample is 4 days. What is the average cost of travel for these trips? (calculate directly)

______

28. The standard error of the slope of the regression line depends on which of the following quantities?

(a) the number of observations in the sample

(b) the variance of the scores on the independent variable

(d) all of the above
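Note: one common way to write the standard error of the slope in simple regression is (residual SD) / (SD of X * sqrt(n - 1)), where SD of X is the sample standard deviation. The sketch below compares two hypothetical samples to show how the pieces interact; all numbers and the function name are made up for illustration.

```python
import math


def slope_standard_error(residual_sd: float, sd_x: float, n: int) -> float:
    """SE of the slope = residual_sd / (sd_x * sqrt(n - 1)), using the sample SD of X."""
    return residual_sd / (sd_x * math.sqrt(n - 1))


# Hypothetical sample A: 10 observations with X spread widely (SD of X = 4).
se_wide = slope_standard_error(residual_sd=3.0, sd_x=4.0, n=10)    # 3 / (4 * 3) = 0.25

# Hypothetical sample B: 5 observations with X spread narrowly (SD of X = 2).
se_narrow = slope_standard_error(residual_sd=3.0, sd_x=2.0, n=5)   # 3 / (2 * 2) = 0.75

print(se_wide, se_narrow)
```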

29. There is a relationship between caffeine consumption and performance on early-morning tests as follows: at very low and at very high levels of caffeine consumption, performance is worse than performance at moderate levels of caffeine consumption. Which of the following is the best estimate of the correlation coefficient that describes this relationship?

(a) r = -0.70

(b) r = 0.00

(d) r = +1.00

(The next 9 questions are based on the following information.)

Mark E. Ting, a market researcher, is analyzing household earnings and spending data. He is interested in relating monthly household grocery spending (in dollars) to annual household income (in thousands of dollars). He gathers data for a sample of 9 households. Below is partial output from his analysis.

         Annual Income (thousands of $)   Monthly Grocery Spending ($)
Mean                               52.8                          137.2
SD                                 29.9                           59.8

Source        df      SS      MS       F
Regression     1   25543   25543    57.4
Residual       7    3112     445
Total          8   28656

               Coefficients   Standard Error   t Stat   P-value
Intercept             37.51            14.92     2.51    0.0401
Income                 1.89             0.25     7.57    0.0001

30. What is the equation for the estimated regression line?

(a) Predicted monthly grocery spending = 37.51 + 1.89 Annual income

(b) Predicted monthly grocery spending = 1.89 + 37.51 Annual income