Scatterplots and Regression Models

Project AMP Dr. Antonio R. Quesada – Director, Project AMP Part 2 of 4

Curve Fitting

Name ______ Date ______ Period ______

1. The table below gives the number of drive-in movie screens in the United States for 1988 – 1999.

Year / 1988 / 1989 / 1990 / 1991 / 1992 / 1993 / 1994 / 1995 / 1996 / 1997 / 1998 / 1999
Screens / 1497 / 1014 / 910 / 899 / 870 / 837 / 859 / 848 / 826 / 815 / 750 / 683

Source: National Association of Theatre Owners

a. Enter the data into your calculator. Label your x-list “year” and your y-list “screens”. Make a scatterplot of the data.

b. What type of function do the points seem to represent?

c. Calculate a regression equation using the appropriate model to best fit the data.

i. Write the equation of the model (round to the nearest thousandth).

ii. What is the correlation coefficient, r?

iii. Remember: r measures the strength and direction of linear relations with -1 ≤ r ≤ 1. As r moves away from 0 toward ±1, the linear relation gets stronger. Based on the r value of this data, how would you describe the strength of this relation?

iv. The r² statistic describes how much of the variation in one variable can be accounted for by this straight-line relationship with another variable, with r² = 1 meaning 100%. What is the r² value for this data?

v. Superimpose the regression line onto the scatterplot. Are the points close to the line? Based on what you found in part iv, is this surprising?

vi. Use your regression equation to predict the number of drive-in movie screens in the years 2003, 2005 and 2007. How accurate do you think these predictions are? Why?

vii. According to this model, in what year will there be NO drive-in screens in the U.S.?

viii. Visit www.natoonline.org/statisticsscreens.htm to find the actual number of screens in 2003, 2005 and 2007.

Your findings in #1 should lead you to believe that a linear model may not have been the best choice for that set of data. It is normally best to perform more than one type of regression and use the r² values to decide which model best fits the data.
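For reference, the quantities a calculator reports for a linear regression can be written out as below. This is the standard least-squares formulation rather than anything specific to this worksheet; the symbols a (slope), b (intercept), x̄ and ȳ (the means of the two lists) and ŷ (the value predicted by the model) are notation introduced here. The last line is the form of r² that also applies to a quadratic or cubic fit, which is why r² values can be compared across models.

```latex
\begin{aligned}
a &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},
\qquad b = \bar{y} - a\,\bar{x},
\qquad \hat{y}_i = a\,x_i + b \\[4pt]
r &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
          {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}} \\[4pt]
r^{2} &= 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}
\end{aligned}
```

For a straight-line fit, the r² in the last line equals the square of the r in the second line.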

2. Weight lifting in the Summer Olympics is divided into classes according to how much the participants weigh. Winners at the 1996 Summer Olympics in Atlanta and the amounts lifted are shown below (the weight limits of the participants have been converted from kilograms to pounds):

Weight Limit (lb) / Winner / Country / Amount Lifted (lb)
119 / Halil Mutlu / Turkey / 633
130 / Tang Ningsheng / China / 678
141 / Naim Sulemanoglu / Turkey / 739
154 / Zhan Xugang / China / 787
167.5 / Pablo Lara / Cuba / 809
183 / Pyrros Dimas / Greece / 864
200.5 / Alexi Petrov / Russia / 886
218 / Kakhi Kakhiasvili / Greece / 926
238 / Timur Taimazov / Russia / 948

Source: Sports Illustrated Almanac, 1997

a. Enter the data into your calculator. Label your x-list “weight” and your y-list “lifted”. Make a scatterplot of the data.

b. What type of function do the points seem to represent?

c. Calculate a linear regression equation.

i. Write the equation of the model (round to the nearest thousandth).

ii. What is the value of r²?

iii. The r² statistic describes how much of the variation in one variable can be accounted for by this straight-line relationship with another variable, with r² = 1 meaning 100%. How can you interpret the r² value for this data?

d. Now calculate a Quadratic Regression.

i. Write the equation of the model (round to the nearest thousandth).

ii. What is the value of r²?

iii. The r² statistic describes how much of the variation in one variable can be accounted for by this quadratic relationship with another variable, with r² = 1 meaning 100%. How can you interpret the r² value for this data?

e. Which model is a better fit for this data? Why?

f. Superimpose the best regression model onto the scatterplot. Are the points close to the line/curve? Based on what you found in part iii, is this surprising?

g. Use this equation to predict the amount lifted for a person who weighs 35 pounds and a person who weighs 400 pounds. Do you think your model is a helpful predictor outside the range of weight limits given? Why or why not?

Although the data in #2 appeared to have a linear relationship, performing a different type of regression gave a better fitting curve. It is important to always perform at least two regression tests to decide which model is best. Some common curves are found below. Notice that on small intervals, these curves can look very similar. Knowing how these curves behave can help in choosing the best regression model.
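The remark that different curves can look alike on a small interval can be checked numerically. The sketch below is only an illustration: the three functions (a line, a quadratic and an exponential that agree at x = 0) and the helper name max_gap are chosen here and are not part of the worksheet. It measures the largest vertical gap between the curves on a narrow interval and on a wide one.

```python
import math

# Three hypothetical curves that agree at x = 0 (chosen only for illustration).
def line(x):
    return 1 + x

def quadratic(x):
    return 1 + x + x ** 2 / 2

def exponential(x):
    return math.exp(x)

def max_gap(f, g, lo, hi, steps=100):
    """Largest |f(x) - g(x)| over evenly spaced x values in [lo, hi]."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(abs(f(x) - g(x)) for x in xs)

for lo, hi in [(0.0, 0.1), (0.0, 3.0)]:
    print(f"interval [{lo}, {hi}]")
    print("  line vs exponential:     ", round(max_gap(line, exponential, lo, hi), 4))
    print("  quadratic vs exponential:", round(max_gap(quadratic, exponential, lo, hi), 4))
```

On the narrow interval the gaps are tiny, so a scatterplot alone could not separate the three models; on the wide interval they diverge clearly.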

3. The population per square mile in the United States has changed dramatically over a period of years. The table below shows the number of people per square mile (or population density) of the United States for several years between 1790 and 2000:

Year / 1790 / 1800 / 1810 / 1820 / 1830 / 1840 / 1850 / 1860 / 1870 / 1880 / 1890
People per square mile / 4.5 / 6.1 / 4.3 / 5.5 / 7.4 / 9.8 / 7.9 / 10.6 / 10.9 / 14.2 / 17.8
Year / 1900 / 1910 / 1920 / 1930 / 1940 / 1950 / 1960 / 1970 / 1980 / 1990 / 2000
People per square mile / 21.5 / 26.0 / 29.9 / 34.7 / 37.2 / 42.6 / 50.6 / 57.5 / 64.0 / 70.3 / 80.0

Source: U.S. Census Bureau, 2000

a. Enter the data into your calculator. Label your x-list “year” and your y-list “density”. Make a scatterplot of the data.

b. Calculate a regression of your choice.

i. What regression did you choose?

ii. Write the equation of this model (round to the nearest thousandth).

iii. What is the r² value? Interpret this value.

c. Calculate a different regression.

i. Which regression did you choose this time?

ii. Write the equation of this model (round to the nearest thousandth).

iii. What is the r² value? Interpret this value.

d. If you feel you need to calculate another regression, do so. Superimpose the best regression model onto the scatterplot. Write the best model below. Use this equation to predict the population density of the U.S. in 2030 and to predict the DATE when the population density of the U.S. will reach 200 people per square mile.

To find more interesting data about the United States, visit www.census.gov.

Scatterplots and Regression Models Teacher Solution Sheet

Curve Fitting

1. The table below gives the number of drive-in movie screens in the United States for 1988 – 1999.

Year / 1988 / 1989 / 1990 / 1991 / 1992 / 1993 / 1994 / 1995 / 1996 / 1997 / 1998 / 1999
Screens / 1497 / 1014 / 910 / 899 / 870 / 837 / 859 / 848 / 826 / 815 / 750 / 683

Source: National Association of Theatre Owners

a. Enter the data into your calculator. Label your x-list “year” and your y-list “screens”. Make a scatterplot of the data.

b. What type of function do the points seem to represent? a linear function

c. Calculate a regression equation using the appropriate model to best fit the data.

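i. Write the equation (round to the nearest thousandth). y = -23.855x + 48412.418 (Note: this equation and the r value in part ii match a fit of the 1989 – 1999 values only. Fitting all twelve years, including 1988, gives approximately y = -43.371x + 87360.016 with r ≈ -.764.)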

ii. What is the correlation coefficient, r? -.925

v. Superimpose the regression line onto the scatterplot. Are many points on the line? Based on what you found in part iv, is this surprising? Are there any points that seem to be outliers? If you eliminated these points from the set, what do you think would happen to the value of r? Of r²?

Not many of the points are on the line. This is not surprising as only about 86% of the variation can be accounted for by this model. The points (1988, 1497) and (1999, 683) seem like outliers. If these points were eliminated, the r value would get closer to -1 and the r² value would be closer to 1, so the new model would be a better predictor.

vi. Use your regression equation to predict the number of drive-in movie screens in the years 2003, 2005 and 2007. Do you think these predictions are accurate? Why? Not very accurate.

f(2003) ≈ 632 screens, f(2005) ≈ 584 screens, f(2007) ≈ 536 screens

vii. According to this model, in what year will there be NO drive-in screens in the U.S.?

Solve: 0 = -23.855x + 48412.418 Around the year 2029

viii. Visit www.natoonline.org/statisticsscreens.htm to find the actual number of screens in 2003, 2005 and 2007. 2003: 634; 2005: 648; 2007: 635
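The answers to #1 can be checked away from the calculator with a short script. The sketch below is one possible check, not the method the worksheet prescribes: it computes the least-squares line and r from the table in plain Python, and the helper names linear_fit and report are made up for this example. It runs the fit twice, once on all twelve years and once with 1988 left out, because the equation and r value printed in this key are the ones obtained from the 1989 – 1999 values.

```python
import math

years = list(range(1988, 2000))
screens = [1497, 1014, 910, 899, 870, 837, 859, 848, 826, 815, 750, 683]

def linear_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

def report(label, xs, ys):
    """Print the model, r, r^2, the three predictions and the zero crossing."""
    slope, intercept, r = linear_fit(xs, ys)
    print(label)
    print(f"  y = {slope:.3f}x + {intercept:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")
    for year in (2003, 2005, 2007):
        print(f"  predicted screens in {year}: {slope * year + intercept:.0f}")
    print(f"  model reaches 0 screens near x = {-intercept / slope:.1f}")

report("All years, 1988-1999:", years, screens)
report("Without 1988 (1989-1999):", years[1:], screens[1:])
```

Comparing the two reports also gives students a concrete answer to the outlier question in part v: removing the single 1988 point changes the slope, r and r² noticeably.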

2. Weight lifting in the Summer Olympics is divided into classes according to how much the participants weigh. Winners at the 1996 Summer Olympics in Atlanta and the amounts lifted are shown below (the weight limits of the participants have been converted from kilograms to pounds):

Source: Sports Illustrated Almanac, 1997

a. Enter the data into your calculator. Label your x-list “weight” and your y-list “lifted”. Make a scatterplot of the data.

b. What type of function do the points seem to represent? linear

c. Calculate a linear regression equation.

i. Write the equation of the model (round to the nearest thousandth). y = 2.617x + 356.848

ii. What is the value of r²? .952

iii. The r² statistic describes how much of the variation in one variable can be accounted for by this straight-line relationship with another variable, with r² = 1 meaning 100%. How can you interpret the r² value for this data? About 95% of the variation can be accounted for with this model.

d. Now calculate a Quadratic Regression.

i. Write the equation of the model (round to the nearest thousandth). y = -.016x² + 8.412x – 132.984

ii. What is the value of r²? .994

iii. The r² statistic describes how much of the variation in one variable can be accounted for by this quadratic relationship with another variable, with r² = 1 meaning 100%. How can you interpret the r² value for this data? About 99% of the variation can be accounted for with this model.

e. Which model is a better fit for this data? Why? Quadratic model; its r² value is closer to 1.

f. Superimpose the best regression model onto the scatterplot. Are the points close to the line/curve? Based on what you found in part iii, is this surprising?

The points are close to the curve. This is not surprising, as the r² value is close to 1.

g. Use this equation to predict the amount lifted for a person who weighs 35 pounds and a person who weighs 400 pounds. Do you think your model is a helpful predictor outside the range of weight limits given? Why or why not?

f(35) = 141.44 pounds, f(400) = 619.86 pounds. Not a good predictor outside the range of weight limits.
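The comparison of the linear and quadratic models in #2 can also be reproduced in code. The sketch below is a plain-Python illustration under stated assumptions, not the calculator's own algorithm: it fits a polynomial by solving the least-squares normal equations, with x centered on its mean to keep the sums well scaled. The helper names solve and poly_fit are invented for this example, and r² is computed as 1 minus (residual sum of squares) / (total sum of squares).

```python
weights = [119, 130, 141, 154, 167.5, 183, 200.5, 218, 238]
lifted = [633, 678, 739, 787, 809, 864, 886, 926, 948]

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def poly_fit(xs, ys, degree):
    """Least-squares polynomial fit; returns (predict function, r^2)."""
    center = sum(xs) / len(xs)            # fit in u = x - center
    us = [x - center for x in xs]
    size = degree + 1
    normal = [[sum(u ** (i + j) for u in us) for j in range(size)]
              for i in range(size)]
    rhs = [sum(y * u ** i for u, y in zip(us, ys)) for i in range(size)]
    coeffs = solve(normal, rhs)           # coefficients of 1, u, u^2, ...

    def predict(x):
        return sum(c * (x - center) ** i for i, c in enumerate(coeffs))

    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return predict, 1 - ss_res / ss_tot

linear, r2_linear = poly_fit(weights, lifted, 1)
quadratic, r2_quadratic = poly_fit(weights, lifted, 2)
print(f"linear r^2 = {r2_linear:.3f}, quadratic r^2 = {r2_quadratic:.3f}")
for w in (35, 400):
    print(f"weight limit {w} lb: linear {linear(w):.2f}, quadratic {quadratic(w):.2f}")
```

The predictions at 35 and 400 pounds lie far outside the 119 to 238 pound range of the data, which is the point of part g: a model with a high r² inside the data range can still give implausible values outside it.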

3. The population per square mile in the United States has changed dramatically over a period of years. The table below shows the number of people per square mile (or population density) of the United States for several years between 1790 and 2000:

Source: Northeast-Midwest Institute

a. Enter the data into your calculator. Label your x-list “year” and your y-list “density”. Make a scatterplot of the data.

b. Calculate a regression of your choice.

i. What regression did you choose?
ii. Write the equation of this model (round to the nearest thousandth).
iii. What is the r² value? Interpret this value.

c. Calculate a different regression.

i. Which regression did you choose this time?
ii. Write the equation of this model (round to the nearest thousandth).
iii. What is the r² value? Interpret this value.

d. If you feel you need to calculate another regression, do so. Superimpose the best regression model onto the scatterplot. Write the best model below. Use this equation to predict the population density of the U.S. in 2030 and to predict the DATE when the population density of the U.S. will reach 200 people per square mile.

Using the Cubic Model:

y = 2.817x³ – .014x² + 22.830x – 12332.887

f(2030) = 107.9 people per square mile

According to this model, the U.S. will reach a population density of 200 people per square mile around October 24, 2101.
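A cubic fit to the density table can be explored the same way. The sketch below makes several assumptions that are not in the worksheet: it uses plain Python with made-up helper names (solve, poly_fit, first_year_reaching), it rescales the years to decades from the mean year before fitting so the arithmetic stays well conditioned, and it looks for the 200-per-square-mile mark by stepping forward one year at a time. Because the coefficients printed above are rounded, and 2.817x³ taken literally would be on the order of 10¹⁰ at x = 2030 (so the leading coefficient appears to have lost a power-of-ten factor), the script's output is best compared with the key's predictions rather than with its coefficients.

```python
years = list(range(1790, 2001, 10))
density = [4.5, 6.1, 4.3, 5.5, 7.4, 9.8, 7.9, 10.6, 10.9, 14.2, 17.8,
           21.5, 26.0, 29.9, 34.7, 37.2, 42.6, 50.6, 57.5, 64.0, 70.3, 80.0]

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def poly_fit(xs, ys, degree, scale=10.0):
    """Least-squares polynomial fit in u = (x - mean) / scale; returns (predict, r^2)."""
    center = sum(xs) / len(xs)
    us = [(x - center) / scale for x in xs]
    size = degree + 1
    normal = [[sum(u ** (i + j) for u in us) for j in range(size)]
              for i in range(size)]
    rhs = [sum(y * u ** i for u, y in zip(us, ys)) for i in range(size)]
    coeffs = solve(normal, rhs)

    def predict(x):
        u = (x - center) / scale
        return sum(c * u ** i for i, c in enumerate(coeffs))

    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return predict, 1 - ss_res / ss_tot

def first_year_reaching(predict, target, start=2000, stop=2500):
    """First whole year in [start, stop) whose predicted value is at least target."""
    for year in range(start, stop):
        if predict(year) >= target:
            return year
    return None

models = {name: poly_fit(years, density, deg)
          for name, deg in [("linear", 1), ("quadratic", 2), ("cubic", 3)]}
for name, (predict, r2) in models.items():
    print(f"{name:9s} r^2 = {r2:.4f}, predicted density in 2030 = {predict(2030):.1f}")

cubic, _ = models["cubic"]
print("cubic model first reaches 200 per square mile in:", first_year_reaching(cubic, 200))
```

Printing all three r² values side by side supports the discussion in parts b through d about choosing between models.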