Math 155 - Cooley Finite Math with Applications OCC

Section 1.3 – The Least Squares Line

Scatterplot – A scatterplot is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point.

The least-squares criterion – The least-squares criterion states that the line that best fits a set of data points is the one having the smallest possible sum of squared errors.

 Example:

Consider the following three points:

x / y
1 / 3
2 / 1
3 / 5

The scatterplot is shown to the right.

Our goal here is to find the line of best fit for the three points in the scatterplot. Many lines can be drawn near the data, but there is only one line of best fit. Let’s look at two candidates for the line of best fit: Line A, $Y = 2x - 1$, and Line B, $Y = x + 1$.

To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points. For Line A, the predicted values for our data are $Y = 2x - 1$; for Line B, they are $Y = x + 1$. We calculate each error, $e = y - Y$, which is the signed vertical distance from the line to the data point. Those errors are then squared and summed. The line with the smallest sum of squared errors is considered the line of best fit, which is why this process is called the least-squares criterion.

Line A: Y = 2x - 1
x / y / Y / e = y - Y / $e^2$
1 / 3 / 1 / 2 / 4
2 / 1 / 3 / -2 / 4
3 / 5 / 5 / 0 / 0
Sum of squared errors: 8

Line B: Y = x + 1
x / y / Y / e = y - Y / $e^2$
1 / 3 / 2 / 1 / 1
2 / 1 / 3 / -2 / 4
3 / 5 / 4 / 1 / 1
Sum of squared errors: 6

So, from the results above, Line B fits the data better than Line A, but is it the best possible fit? The information below shows how to find the line of best fit.
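The comparison of Line A and Line B can be checked with a short script. This is only a sketch: the function name `sum_squared_errors` is an illustrative choice, not part of the course material, and the error is taken as e = y - Y, matching the table.

```python
# Data points from the example: (x, y).
points = [(1, 3), (2, 1), (3, 5)]

def sum_squared_errors(slope, intercept, data):
    """Sum of squared errors for the line Y = slope*x + intercept."""
    total = 0
    for x, y in data:
        predicted = slope * x + intercept  # Y, the value the line predicts
        error = y - predicted              # e, signed vertical distance
        total += error ** 2
    return total

sse_a = sum_squared_errors(2, -1, points)  # Line A: Y = 2x - 1
sse_b = sum_squared_errors(1, 1, points)   # Line B: Y = x + 1
print(sse_a, sse_b)  # 8 6
```

The smaller total belongs to Line B, which agrees with the hand calculation.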

Least squares line – The line that best fits a set of data points according to the least-squares criterion.

Least Squares Line Formula

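The least squares line $Y = mx + b$ that gives the best fit to the data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ has slope m and y-intercept b given by

$$m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \quad \text{and} \quad b = \frac{\sum y - m(\sum x)}{n}.$$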
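As a rough illustration, the slope and intercept can be computed in a few lines of Python. The sketch assumes the standard formulas m = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (Σy - mΣx) / n; the function name `least_squares_line` is hypothetical. It is applied here to the three points from the earlier example.

```python
def least_squares_line(data):
    """Return (m, b) for the least squares line Y = m*x + b.

    Assumes the x-values are not all equal, so the denominator is nonzero.
    """
    n = len(data)
    sum_x = sum(x for x, _ in data)
    sum_y = sum(y for _, y in data)
    sum_xy = sum(x * y for x, y in data)
    sum_x2 = sum(x * x for x, _ in data)

    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

m, b = least_squares_line([(1, 3), (2, 1), (3, 5)])
print(m, b)  # 1.0 1.0
```

For these three points the result is Y = x + 1, so Line B from the earlier comparison is in fact the least squares line.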

Criterion for Finding a Least Squares Line

Before finding a least squares line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a least squares line.

Consider the following three scatterplots.

Figure A Figure B Figure C

The objective is to determine how strong a relationship there is, if any, between the x- and y-values.

The answer is quite simple: the more linear the plot looks, the stronger the relationship. Recall that each of the figures above has a least squares line associated with it. Figure A looks more linear than Figure B, and Figure B looks more linear than Figure C.

Again, every scatterplot has a line of best fit, but how strong is that fit? The strength of the fit is measured by the correlation coefficient, r. We can calculate r without finding the least squares (or regression) equation. Here is the scale of the correlation coefficient, r. (These are only crude estimates for interpreting strengths of correlations):

Algebraically, these are the guidelines:

Weak or no correlation

Moderate (or moderately strong) correlation

Strong correlation

Perfect correlation

The correlation coefficient, r, always lies between –1 and 1. A value of r near 0 suggests that there is weak or no correlation between the x- and y-values, so the least squares equation is not very useful for making predictions. A value of r near –1 or 1 suggests that there is strong correlation between the x- and y-values, so the least squares equation is quite useful for making predictions. A negative correlation coefficient implies that the least squares line has negative slope, so the data have a negative association. Conversely, a positive correlation coefficient implies that the least squares line has positive slope, so the data have a positive association.

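Correlation Coefficient Formula

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2} \cdot \sqrt{n(\sum y^2) - (\sum y)^2}}$$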

 Examples: Study Time vs. Test Score. Is there a relationship between the number of hours a student studies for a particular exam and the score the student receives? Here are the data for ten students from a particular class taking the same exam.

Study Hours / Test Score %
3 / 80
5 / 90
2 / 75
6 / 80
7 / 90
1 / 50
2 / 65
7 / 85
1 / 40
7 / 100

a) Graph the scatterplot based on the data points.

Study Hours x / Test Score y / xy / $x^2$ / $y^2$
3 / 80 / 240 / 9 / 6400
5 / 90 / 450 / 25 / 8100
2 / 75 / 150 / 4 / 5625
6 / 80 / 480 / 36 / 6400
7 / 90 / 630 / 49 / 8100
1 / 50 / 50 / 1 / 2500
2 / 65 / 130 / 4 / 4225
7 / 85 / 595 / 49 / 7225
1 / 40 / 40 / 1 / 1600
7 / 100 / 700 / 49 / 10000
Totals: 41 / 755 / 3465 / 227 / 60175
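The column totals in this table can be reproduced from the raw data. The following is a minimal sketch; the variable names `hours`, `scores`, and `totals` are illustrative only.

```python
# Raw data for the ten students.
hours = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]               # x
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]   # y

# One total per column of the table.
totals = {
    "x": sum(hours),
    "y": sum(scores),
    "xy": sum(x * y for x, y in zip(hours, scores)),
    "x2": sum(x * x for x in hours),
    "y2": sum(y * y for y in scores),
}
print(totals)
# {'x': 41, 'y': 755, 'xy': 3465, 'x2': 227, 'y2': 60175}
```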

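b) Calculate the least squares line for the data.

First, construct the table above. (Note: The last column, $y^2$, is for the calculation of the correlation coefficient, r.)

Second, compute the slope m, using n = 10 and the column totals.

$$m = \frac{10(3465) - (41)(755)}{10(227) - (41)^2} = \frac{3695}{589} \approx 6.27$$

Now, compute the y-intercept b.

$$b = \frac{755 - 6.27(41)}{10} \approx 49.8$$

So, the least squares equation is $Y = 6.27x + 49.8$. This is considered the best fit line for the data.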
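The slope and intercept for the study data can be checked directly from the column totals. This is a sketch under the assumption that the standard formulas m = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (Σy - mΣx) / n are the ones intended; the variable names are illustrative.

```python
# Column totals from the study-time table (n = 10 students).
n = 10
sum_x, sum_y, sum_xy, sum_x2 = 41, 755, 3465, 227

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # 3695 / 589
b = (sum_y - m * sum_x) / n

print(round(m, 2), round(b, 1))  # 6.27 49.8
```

Rounding m to 6.27 before computing b, as one would by hand, gives the same intercept to one decimal place.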

 Examples:

c) Graph the least squares line on the scatterplot.

Graphing from slope-intercept form can be tedious, since the slope and y-intercept are usually decimals. It is easier to build a table with just two points: the y-intercept, if convenient, and one other point of your choice, preferably one toward the right end of the graph.

x / Y = 6.27x + 49.8
0 / 49.8
6 / 87.42
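The two plotting points can be generated from the equation itself. A small sketch, assuming the rounded equation Y = 6.27x + 49.8; the function name `predicted_score` is hypothetical.

```python
def predicted_score(x):
    """Predicted test score Y for x study hours, using Y = 6.27x + 49.8."""
    return 6.27 * x + 49.8

# Two points are enough to draw the line: the intercept and x = 6.
for x in (0, 6):
    print(x, round(predicted_score(x), 2))
# 0 49.8
# 6 87.42
```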

 Exercises:

d) Predict the test score for 4 hours of studying.

e) Now, use the least squares equation from part (b) to formally calculate the predicted test score for 4 hours of studying. Compare your result with (d).

 Examples:

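f) Now, calculate the correlation coefficient, r.

So, from the column totals in the table, with n = 10,

$$r = \frac{10(3465) - (41)(755)}{\sqrt{10(227) - (41)^2} \cdot \sqrt{10(60175) - (755)^2}} = \frac{3695}{\sqrt{589} \cdot \sqrt{31725}}$$

Thus, $r \approx 0.855$. So, the least squares line with the equation $Y = 6.27x + 49.8$ reflects a strong, positive correlation in the data.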
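The value of r can be verified from the same column totals. This sketch assumes the standard formula r = (nΣxy - ΣxΣy) / (√(nΣx² - (Σx)²) · √(nΣy² - (Σy)²)); the variable names are illustrative.

```python
import math

# Column totals from the study-time table (n = 10 students).
n = 10
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = 41, 755, 3465, 227, 60175

numerator = n * sum_xy - sum_x * sum_y                   # 3695
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)        # sqrt(589)
               * math.sqrt(n * sum_y2 - sum_y ** 2))     # sqrt(31725)
r = numerator / denominator

print(round(r, 3))  # 0.855
```

The numerator is the same one used for the slope m, which is why r and m always share the same sign.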

 Exercises:

1) Size of Hunting Parties. In the 1960s, the famous researcher Jane Goodall observed that chimpanzees hunt and eat meat as part of their regular diet. Sometimes chimpanzees hunt alone, while other times they form hunting parties. The following table summarizes research on chimpanzee hunting parties, giving the size of the hunting party and the percentage of successful hunts. Source: American Scientist and Mathematics Teacher.

Number of Chimps in Hunting Party / Percentage of Successful Hunts
1 / 20
2 / 30
3 / 28
4 / 42
5 / 40
6 / 58
7 / 45
8 / 62
9 / 65
10 / 63
12 / 75
13 / 75
14 / 78
15 / 75
16 / 82

a) Plot the data. Do the data points lie in a linear pattern?

b) Find the equation of the least squares line, and graph it on your scatterplot.

# of chimps x / % of successful hunts y / xy / $x^2$ / $y^2$
1 / 20 / 20 / 1 / 400
2 / 30 / 60 / 4 / 900
3 / 28 / 84 / 9 / 784
4 / 42 / 168 / 16 / 1764
5 / 40 / 200 / 25 / 1600
6 / 58 / 348 / 36 / 3364
7 / 45 / 315 / 49 / 2025
8 / 62 / 496 / 64 / 3844
9 / 65 / 585 / 81 / 4225
10 / 63 / 630 / 100 / 3969
12 / 75 / 900 / 144 / 5625
13 / 75 / 975 / 169 / 5625
14 / 78 / 1092 / 196 / 6084
15 / 75 / 1125 / 225 / 5625
16 / 82 / 1312 / 256 / 6724
Totals: 125 / 838 / 8310 / 1375 / 52558

c) Find the correlation coefficient. Combining this with your answer in part a), does the percentage of successful hunts tend to increase with the size of the hunting party?