Math 160 - Cooley Intro To Statistics OCC
Section 14.2 – The Regression Equation
Scatterplot – A scatterplot is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point.
The least-squares criterion – The least-squares criterion is that the line that best fits a set of data points is the one having the smallest possible sum of squared errors.
Example:
Consider the following three points:
x / y
1 / 3
2 / 1
3 / 5
The scatterplot is shown to the right.
Our goal here is to find the line of best fit for the three points in the scatterplot. Many lines can be drawn through or near the data, but only one of them is the line of best fit. Let’s look at two candidates: Line A, $\hat{y} = -1 + 2x$, and Line B, $\hat{y} = 1 + x$.
To measure quantitatively how well a line fits the data, we first consider the errors, $e$, made in using the line to predict the y-values of the data points. For Line A, the predicted values for our data are $\hat{y} = 1, 3, 5$. For Line B, the predicted values are $\hat{y} = 2, 3, 4$. We calculate each error, $e = y - \hat{y}$, which is the signed vertical distance from the line to the data point. The errors are then squared and summed. The line with the smallest sum of squared errors is considered the line of best fit, which is why this process is called the least-squares criterion.
Line A: $\hat{y} = -1 + 2x$
x / y / $\hat{y}$ / $e$ / $e^2$
1 / 3 / 1 / 2 / 4
2 / 1 / 3 / -2 / 4
3 / 5 / 5 / 0 / 0
Sum of squared errors: 8

Line B: $\hat{y} = 1 + x$
x / y / $\hat{y}$ / $e$ / $e^2$
1 / 3 / 2 / 1 / 1
2 / 1 / 3 / -2 / 4
3 / 5 / 4 / 1 / 1
Sum of squared errors: 6
So, from the results above, Line B fits the data better than Line A, but is it the line of best fit? The definitions and formulas that follow show how to find the line of best fit.
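The comparison of the two candidate lines can be checked with a short script. This is only a sketch: the function name `sum_squared_errors` and the variable names are illustrative choices, not part of the course materials.

```python
# The three data points (x, y) from the example.
points = [(1, 3), (2, 1), (3, 5)]


def sum_squared_errors(intercept, slope, data):
    """Return the sum of squared errors for the line y-hat = intercept + slope * x."""
    total = 0
    for x, y in data:
        predicted = intercept + slope * x
        error = y - predicted  # signed vertical distance from the line to the point
        total += error ** 2
    return total


sse_a = sum_squared_errors(-1, 2, points)  # Line A: y-hat = -1 + 2x
sse_b = sum_squared_errors(1, 1, points)   # Line B: y-hat = 1 + x

print("Line A sum of squared errors:", sse_a)  # 8
print("Line B sum of squared errors:", sse_b)  # 6
```

The smaller total for Line B is what makes it the better of these two candidates under the least-squares criterion.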
Regression line – The line that best fits a set of data points according to the least-squares criterion.
Regression equation – The equation of the regression line.
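Notation Used in Regression and Correlation – For a set of n data points, the defining and computing formulas for $S_{xx}$, $S_{xy}$, and $S_{yy}$ are as follows.
Quantity / Defining formula / Computing formula
$S_{xx}$ / $\sum (x_i - \bar{x})^2$ / $\sum x_i^2 - \frac{(\sum x_i)^2}{n}$
$S_{xy}$ / $\sum (x_i - \bar{x})(y_i - \bar{y})$ / $\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$
$S_{yy}$ / $\sum (y_i - \bar{y})^2$ / $\sum y_i^2 - \frac{(\sum y_i)^2}{n}$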
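To see that the defining and computing formulas give the same values, here is a small sketch that evaluates both on the three points from the earlier example. It assumes the standard forms, for example $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - (\sum x_i)^2 / n$; the function names are made up for illustration.

```python
def s_values_defining(xs, ys):
    """Compute (Sxx, Sxy, Syy) from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xx, s_xy, s_yy


def s_values_computing(xs, ys):
    """Compute (Sxx, Sxy, Syy) from raw sums, as in a hand-built table."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xx, s_xy, s_yy


xs = [1, 2, 3]
ys = [3, 1, 5]
print(s_values_defining(xs, ys))   # (2.0, 2.0, 8.0)
print(s_values_computing(xs, ys))  # (2.0, 2.0, 8.0)
```

The computing formulas need only column totals, which is why they are usually more convenient when working by hand.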
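Regression Equation Formula – The regression equation for a set of n data points is $\hat{y} = b_0 + b_1 x$, where
slope is $b_1 = \frac{S_{xy}}{S_{xx}}$
and
y-intercept is $b_0 = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right) = \bar{y} - b_1 \bar{x}$.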
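As a rough illustration, the regression formulas can be applied to the three points from the earlier example. The sketch below assumes the usual least-squares formulas, slope $= S_{xy}/S_{xx}$ and intercept $= (\sum y - \text{slope} \cdot \sum x)/n$; the function name `regression_line` and the names `b0` and `b1` are chosen here for illustration.

```python
def regression_line(xs, ys):
    """Return (b0, b1) for y-hat = b0 + b1 * x using the computing formulas.

    Assumes the x-values are not all equal; otherwise Sxx is 0 and the
    slope is undefined.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = s_xy / s_xx               # slope
    b0 = (sum_y - b1 * sum_x) / n  # y-intercept
    return b0, b1


b0, b1 = regression_line([1, 2, 3], [3, 1, 5])
print(f"y-hat = {b0:g} + {b1:g}x")  # y-hat = 1 + 1x
```

For these three points the result coincides with Line B from the earlier example, so no other line has a sum of squared errors below 6.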
Response variable – The variable to be measured or observed.
Predictor variable – A variable used to predict or explain the values of the response variable.
Criterion for Finding a Regression Line
Before finding a regression line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a regression line.
Examples: Study Time vs. Test Score. Is there a relationship between the number of hours a student studies for a particular exam and the score that the student receives? Here are the data (to the right) for ten students from a particular class taking the same exam.
Study Hours / Test Score %
3 / 80
5 / 90
2 / 75
6 / 80
7 / 90
1 / 50
2 / 65
7 / 85
1 / 40
7 / 100
a) Graph the scatterplot based on the data points.
Study Hours $x$ / Test Score $y$ / $xy$ / $x^2$
3 / 80 / 240 / 9
5 / 90 / 450 / 25
2 / 75 / 150 / 4
6 / 80 / 480 / 36
7 / 90 / 630 / 49
1 / 50 / 50 / 1
2 / 65 / 130 / 4
7 / 85 / 595 / 49
1 / 40 / 40 / 1
7 / 100 / 700 / 49
$\sum x = 41$ / $\sum y = 755$ / $\sum xy = 3465$ / $\sum x^2 = 227$
b) Determine the regression equation for the data.
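Since the data points do appear to be scattered about a line, we will determine a regression line.
Here, we are going to use the computing formulas.
First, construct the table to the right, with columns for $x$, $y$, $xy$, and $x^2$ and a final row of column totals.
Second, compute the slope: $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n} = \frac{3465 - (41)(755)/10}{227 - (41)^2/10} = \frac{369.5}{58.9} \approx 6.27$.
Now, compute the y-intercept: $b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) = \frac{1}{10}\left(755 - (6.27)(41)\right) \approx 49.8$.
So, the regression line is $\hat{y} = 49.8 + 6.27x$. This is considered the best-fit line for the data.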
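The hand calculation in part (b) can be double-checked with a few lines of code. This is a sketch under the assumption that the slope is $S_{xy}/S_{xx}$ and the intercept is $(\sum y - \text{slope} \cdot \sum x)/n$; all variable names are illustrative.

```python
# Study hours (x) and test scores (y) for the ten students.
hours = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

n = len(hours)
sum_x = sum(hours)                                  # 41
sum_y = sum(scores)                                 # 755
sum_xy = sum(x * y for x, y in zip(hours, scores))  # 3465
sum_x2 = sum(x * x for x in hours)                  # 227

s_xy = sum_xy - sum_x * sum_y / n  # 369.5
s_xx = sum_x2 - sum_x ** 2 / n     # about 58.9

b1 = s_xy / s_xx               # slope
b0 = (sum_y - b1 * sum_x) / n  # y-intercept

print(f"slope       = {b1:.2f}")
print(f"y-intercept = {b0:.2f}")
```

Because the script carries the unrounded slope into the intercept calculation, it reports an intercept of about 49.78, which agrees with 49.8 to one decimal place.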
Examples (continued):
c) Graph the regression equation obtained in part (b) on the scatterplot.
Graphing from slope-intercept form can be tedious, since the slope and y-intercept are usually decimals. It is easier to use a table with just two points: the y-intercept, if convenient, and another point of your choice, preferably one toward the right end of the graph.
x / $\hat{y}$
0 / 49.8
6 / 87.42
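The two plotting points in the table come from substituting x = 0 and x = 6 into the rounded regression equation. A minimal sketch, with `predict` as a hypothetical helper name and the rounded coefficients 49.8 and 6.27 taken from part (b):

```python
def predict(x, intercept=49.8, slope=6.27):
    """Predicted test score from the rounded regression equation y-hat = 49.8 + 6.27x."""
    return intercept + slope * x


# Two points are enough to draw the line on the scatterplot.
for x in (0, 6):
    print(x, round(predict(x), 2))
```

Any two distinct x-values would work; choosing points far apart makes the drawn line less sensitive to small plotting errors.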
d) Describe the apparent relationship between the two variables under consideration.
Solution: Because the slope of the regression line is positive, test scores tend to increase as the number of study hours increases, which should come as no surprise.
e) Interpret the slope of the regression line.
Solution: The slope is 6.27, so the predicted test score increases by about 6.27 percentage points for each additional hour of studying.
f) Identify the predictor and the response variables.
Solution: Test score ($y$) is the response variable and study hours ($x$) is the predictor (sometimes called explanatory) variable.
Exercises:
g) Use the line of regression to estimate (approximate) the predicted test score for 4 hours of studying.
h) Now, use the regression equation from part (b) to formally calculate the predicted test score for 4 hours of studying. Compare your result with (g).