Math 103 - Cooley, Statistics for Teachers, OCC
Activity #33 – Correlation and Regression
California State Content Standard - Statistics, Data Analysis, and Probability: N/A
In mathematical modeling, it is often necessary to deal with numerical data and to make assumptions regarding the relationship between two variables. For example, you may want to examine the relationship between IQ and salary, study time and grades, age and heart disease, or math grades in the 6th grade and amount of TV viewing. All are attempts to relate two variables in some way. If it is established that there is a correlation, then the next step in the modeling process is to identify the nature of the relationship. This is called regression analysis. In this handout, we will consider only linear relationships. It is assumed that you are familiar with the slope-intercept form of the equation of a line (namely, y = mx + b).
We are interested in finding a best-fitting line by a technique called the least squares method. The derivations of the results and formulas given in this handout are, for the most part, based on calculus, and are therefore beyond the scope of this handout. We will focus instead on how to use and interpret the formulas.
The first consideration is one of correlation. We want to know whether two variables are related. Let us call one variable x and the other y. Each pair of values can be plotted as an ordered pair (x, y) on a graph called a scatterplot, which should already be somewhat familiar.
Example:
Does the overall grade in a math class change as the number of absences increases? Here are the data on ten students from a particular class:
Absences / 0 / 2.5 / 2 / 1 / 1.5 / 0 / 5 / 3 / 4.5 / 3.5
Grade / 98 / 72 / 85 / 94 / 73 / 84 / 43 / 62 / 59 / 71
a) Draw a scatterplot to represent the data in the table above. Let x be the number of absences and let y be the overall grade (in %). The scatterplot is shown to the right.
Correlation measures whether there is a statistically significant linear relationship between two variables, as discussed in the prior activity. The measure is called the linear correlation coefficient, r, and has the following formula:
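Formula for Linear Correlation Coefficient, r
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}},$$
where n is the number of data points.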
b) Calculate the correlation coefficient, r.
Absences x / Grade y / xy / x^2 / y^2
0 / 98 / 0 / 0 / 9604
2.5 / 72 / 180 / 6.25 / 5184
2 / 85 / 170 / 4 / 7225
1 / 94 / 94 / 1 / 8836
1.5 / 73 / 109.5 / 2.25 / 5329
0 / 84 / 0 / 0 / 7056
5 / 43 / 215 / 25 / 1849
3 / 62 / 186 / 9 / 3844
4.5 / 59 / 265.5 / 20.25 / 3481
3.5 / 71 / 248.5 / 12.25 / 5041
Sums: Σx = 23 / Σy = 741 / Σxy = 1468.5 / Σx^2 = 80 / Σy^2 = 57449
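So, with n = 10 data points,
$$r = \frac{10(1468.5) - (23)(741)}{\sqrt{10(80) - (23)^2}\,\sqrt{10(57449) - (741)^2}} = \frac{-2358}{\sqrt{271}\,\sqrt{25409}} \approx -0.899.$$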
Thus, we have a very strong negative correlation.
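The arithmetic in part b) can be checked with a short script. The following is a minimal Python sketch, assuming the usual computational form of the linear correlation coefficient, r = (nΣxy - ΣxΣy) / (√(nΣx^2 - (Σx)^2) · √(nΣy^2 - (Σy)^2)). The function name `correlation` is an illustrative choice and is not part of the handout.

```python
import math

# Data from the example: absences (x) and overall grade in % (y)
absences = [0, 2.5, 2, 1, 1.5, 0, 5, 3, 4.5, 3.5]
grades = [98, 72, 85, 94, 73, 84, 43, 62, 59, 71]


def correlation(xs, ys):
    """Linear correlation coefficient r from the column sums (hypothetical helper)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(absences, grades)
print(f"r = {r:.3f}")  # r = -0.899
```

The intermediate sums inside `correlation` match the totals row of the table (23, 741, 1468.5, 80, 57449), so the script doubles as a check on the hand-computed columns.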
c) Find the best-fitting line for the data.
Here we want to find a line $\hat{y} = mx + b$ so that the sum of the vertical distances of the data points from this line is as small as possible. (We use $\hat{y}$ instead of y to distinguish between the actual second component, y, and the predicted y-value, $\hat{y}$.) Since some of these distances, $y - \hat{y}$, may be positive and some negative, and since we do not want large opposites to "cancel each other out," we minimize the sum of the squares of these distances. Therefore, the regression line is sometimes called the least squares line.
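To make the idea of minimizing squared distances concrete, the sketch below computes the sum of the squared vertical distances for any candidate line with slope m and intercept b, and compares two guesses. The function name `sum_squared_errors` and the two candidate lines are illustrative assumptions, not taken from the handout.

```python
# Data from the example: absences (x) and overall grade in % (y)
absences = [0, 2.5, 2, 1, 1.5, 0, 5, 3, 4.5, 3.5]
grades = [98, 72, 85, 94, 73, 84, 43, 62, 59, 71]


def sum_squared_errors(m, b, xs, ys):
    """Sum of (y - y_hat)^2 for the candidate line y_hat = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate lines, chosen only for illustration
sse_a = sum_squared_errors(-5, 90, absences, grades)
sse_b = sum_squared_errors(-9, 95, absences, grades)
print(f"y_hat = -5x + 90: SSE = {sse_a:.1f}")  # SSE = 1054.0
print(f"y_hat = -9x + 95: SSE = {sse_b:.1f}")  # SSE = 492.0
```

The second candidate has the smaller sum, so it fits these data better in the least squares sense. The least squares line is the one line for which this sum is smallest.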
Formula for the Least Squares Line (or Regression Line)
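$$\hat{y} = mx + b,$$
where
$$m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad\text{and}\quad b = \frac{\sum y - m\sum x}{n}.$$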
So, we will calculate m first and then b, using the sums from the table in part b):
$$m = \frac{10(1468.5) - (23)(741)}{10(80) - (23)^2} = \frac{-2358}{271} \approx -8.70$$
and
$$b = \frac{741 - (-8.70)(23)}{10} \approx 94.11.$$
So, the regression line is $\hat{y} = -8.70x + 94.11$. This is considered the best-fit line for the data on the scatterplot from part a). Now we draw this line on the scatterplot.
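The slope and intercept in part c) can be verified the same way. The following is a minimal Python sketch, assuming the standard least squares formulas m = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2) and b = (Σy - mΣx) / n. The function name `least_squares_line` and the two-absence prediction at the end are illustrative additions.

```python
# Data from the example: absences (x) and overall grade in % (y)
absences = [0, 2.5, 2, 1, 1.5, 0, 5, 3, 4.5, 3.5]
grades = [98, 72, 85, 94, 73, 84, 43, 62, 59, 71]


def least_squares_line(xs, ys):
    """Return (m, b) for the regression line y_hat = m*x + b (hypothetical helper)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n  # b is computed from m, so m comes first
    return m, b


m, b = least_squares_line(absences, grades)
print(f"y_hat = {m:.2f}x + {b:.2f}")  # y_hat = -8.70x + 94.11

# Predicted grade for a hypothetical student with 2 absences
predicted = m * 2 + b
print(f"Predicted grade for 2 absences: {predicted:.1f}%")  # 76.7%
```

The slope can be read as a drop of roughly 8.7 percentage points in the predicted grade for each additional absence, and the intercept as the predicted grade for a student with no absences.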
Exercises:
1) Find r as well as the regression line for the data points:
x / y
1 / 1
2 / 5
3 / 8
4 / 13
2) The following data are the number of years of full-time education (x) and the annual salary in thousands of dollars (y) for 15 persons. Is there a correlation between education and salary?
Education / 20 / 27 / 28 / 18 / 13 / 18 / 9 / 16 / 16 / 12 / 12 / 19 / 16 / 14 / 13
Salary / 35.2 / 24.6 / 23.7 / 33.3 / 24.4 / 33.4 / 11.2 / 32.3 / 25.1 / 22.1 / 18.9 / 37.8 / 25.9 / 28.4 / 29.6