Chapter 4
The student will be able to:
1. Use the scatter diagram and linear correlation coefficient to determine whether a linear relationship exists between two variables.
2. Determine the regression line for bivariate data.
3. Test hypotheses about correlation coefficients.
4. Understand that correlated data may not have a causal relationship.
5. Determine the best prediction relative to correlation.
Section 4.1 – Scatter Diagrams and Correlation
Objectives
- Draw and interpret scatter diagrams
- Describe the properties of the linear correlation coefficient
- Compute and interpret the linear correlation coefficient
- Determine whether a linear relation exists between two variables
- Explain the difference between correlation and causation
Objective 1 – Draw and interpret scatter diagrams
Univariate data – One variable
Bivariate data – Two variables
Credit Score / Interest Rate (%)
545 / 19
595 / 18
640 / 12
675 / 9
705 / 7
750 / 5
The response variable is the variable whose value can be explained by the value of the explanatory or predictor variable.
A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point. The explanatory variable is plotted on the horizontal axis, and the response variable is plotted on the vertical axis.
Correlation – There is a correlation between two variables when one of them is related to the other in some way.
Example
Temperature / Cricket Chirps
83 / 1025
72 / 960
88 / 1200
84 / 1100
80 / 900
76 / 860
70 / 880
93 / 1180
Enter the data above into lists L1 and L2 on the calculator and draw a scatter plot.
Looking at a scatter diagram can help you determine if the variables have a linear relationship.
(pg. 192)
Two variables that are linearly related are positively associated when above-average values of one variable are associated with above-average values of the other variable and below-average values of one variable are associated with below-average values of the other variable. That is, two variables are positively associated if, whenever the value of one variable increases, the value of the other variable also increases.
In other words, as the explanatory variable goes up, the response variable also goes up, and it does so at a roughly constant rate.
Two variables that are linearly related are negatively associated when above-average values of one variable are associated with below-average values of the other variable. That is, two variables are negatively associated if, whenever the value of one variable increases, the value of the other variable decreases.
Looking at a scatter plot of the data can help you determine if the two variables are positively associated, negatively associated, or have no association. (pg. 194)
Objective 2 - Describe the Properties of the Linear Correlation Coefficient
The linear correlation coefficient measures the strength and direction of the linear relation between two quantitative variables. The Greek letter ρ (rho) represents the population correlation coefficient, and r represents the sample correlation coefficient.
$$ r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$
(round to 3 decimal places)
where
$\bar{x}$ is the sample mean of the explanatory variable
$s_x$ is the sample standard deviation of the explanatory variable
$\bar{y}$ is the sample mean of the response variable
$s_y$ is the sample standard deviation of the response variable
$n$ is the number of individuals in the sample
Sample scatter plots with associated value for r
Objective 3 - Compute and Interpret the Linear Correlation Coefficient
By default, LinReg assumes that the explanatory variable is in L1 and the response variable is in L2. If the explanatory and response variables are stored in other lists, enter those lists after LinReg, for example:
LinReg (ax+b) L3, L4
NOTE: the formula chart does not mention LinReg. Also, r can be found using LinRegTTest (see instructions below)
Example
Find r for the following set of data.
Temperature / Cricket Chirps
83 / 1025
72 / 960
88 / 1200
84 / 1100
80 / 900
76 / 860
70 / 880
93 / 1180
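As a cross-check on the calculator result, r can also be evaluated directly from its definition: the sum of the products of the z-scores of x and y, divided by n − 1. The following is a minimal Python sketch, not part of the course materials. The function name `correlation_coefficient` is illustrative, and the sample standard deviations come from the standard library's `statistics.stdev`.

```python
from statistics import mean, stdev

def correlation_coefficient(x, y):
    """Sample linear correlation coefficient r for paired data x, y."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)
    # Product of the z-scores of each (x, y) pair
    z_products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                  for xi, yi in zip(x, y)]
    return sum(z_products) / (n - 1)

temperature = [83, 72, 88, 84, 80, 76, 70, 93]
chirps = [1025, 960, 1200, 1100, 900, 860, 880, 1180]

r = correlation_coefficient(temperature, chirps)
print(f"r = {r:.3f}")
```

The printed value should agree with the r reported by LinReg or LinRegTTest up to rounding.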
Objective 4 - Determine whether a linear relation exists between two variables
Method 1: P-Value approach
Since the formula chart specifically mentions LinRegTTest, we will prefer the P-value approach instead of the critical value approach.
Note this approach is not listed in the book.
Method 2: Critical value approach
The critical value approach, which is the approach given in the book, is as follows.
Critical Values for Correlation Coefficient (Table II Appendix A from book)
df = n − 2 / Critical Value / df = n − 2 / Critical Value
1 / 0.997 / 21 / 0.413
2 / 0.950 / 22 / 0.404
3 / 0.878 / 23 / 0.396
4 / 0.811 / 24 / 0.388
5 / 0.754 / 25 / 0.381
6 / 0.707 / 26 / 0.374
7 / 0.666 / 27 / 0.367
8 / 0.632 / 28 / 0.361
9 / 0.602 / 29 / 0.355
10 / 0.576 / 30 / 0.349
11 / 0.553 / 40 / 0.304
12 / 0.532 / 50 / 0.273
13 / 0.514 / 60 / 0.250
14 / 0.497 / 70 / 0.232
15 / 0.482 / 80 / 0.217
16 / 0.468 / 90 / 0.205
17 / 0.456 / 100 / 0.195
18 / 0.444
19 / 0.433
20 / 0.423
Example
Assume that 20 pairs of data result in a value of r = 0.855. Is there a linear relation between x and y?
Example
Assume that 10 pairs of data result in a value of r = 0.601. Is there a linear relation between x and y?
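The tabulated critical values are consistent with the relation r = t / sqrt(t² + df), where df = n − 2 and t is the two-tailed t critical value at α = 0.05. For example, 0.632 is the value for df = 8, that is, n = 10. The sketch below applies this to the two examples above. It is an illustration under stated assumptions rather than the book's procedure: the function names are hypothetical, and the t critical values 2.306 (df = 8) and 2.101 (df = 18) are taken from a standard t table, not from these notes.

```python
import math

def critical_r(t_crit, df):
    """Critical value for |r|, given the two-tailed t critical value with df = n - 2."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Two-tailed t critical values at alpha = 0.05 (assumed, from a standard t table)
T_CRIT_05 = {8: 2.306, 18: 2.101}

def linear_relation_exists(r, n):
    """Critical value approach: conclude a linear relation if |r| exceeds the critical value."""
    df = n - 2
    return abs(r) > critical_r(T_CRIT_05[df], df)

for n, r in [(20, 0.855), (10, 0.601)]:
    df = n - 2
    cv = critical_r(T_CRIT_05[df], df)
    verdict = "linear relation" if linear_relation_exists(r, n) else "no linear relation"
    print(f"n = {n}, r = {r}: critical value = {cv:.3f} -> {verdict}")
```

Only the two sample sizes used in the examples are covered; other sample sizes would need their own t critical values.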
Example
Is there a linear relationship between temperature and cricket chirps? Use the P-value approach and α = 0.05.
Temperature / 83 / 72 / 88 / 84 / 80 / 76 / 70 / 93
Cricket Chirps / 1025 / 960 / 1200 / 1100 / 900 / 860 / 880 / 1180
Example
American Black Bear – The American black bear is one of eight bear species in the world. It is the smallest North American bear and the most common bear species on the planet. In 1969, Dr. Michael Pelton of the University of Tennessee initiated a long-term study of the population in the Great Smoky Mountains National Park. One aspect of the study was to develop a model that could be used to predict a bear’s weight. One variable thought to be related was the length of the bear. The following data represent the lengths and weights of 12 American black bears.
Total Length (cm) / Weight (kg)
139 / 110
138 / 60
139 / 90
120.5 / 60
149 / 85
141 / 100
141 / 95
150 / 85
166 / 155
151.5 / 140
129.5 / 105
150 / 110
Does a linear relationship exist between the weight of the bear and its length? Use the P-value approach and α = 0.05.
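The P-value approach can be sketched without the calculator. The test statistic is t = r·sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom, and the two-tailed P-value is the area under the t curve beyond ±t. The code below is an illustrative sketch only. The helper names `t_density` and `two_tailed_p_value` are hypothetical, and the P-value is approximated with Simpson's rule so that no external library is needed. The calculator uses its own routine, so the last digits may differ slightly from LinRegTTest.

```python
import math

length_cm = [139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150]
weight_kg = [110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110]

def correlation_coefficient(x, y):
    """Sample correlation r, written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| >= |t_stat|) with Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * (h / 3) * total  # P(-|t| <= T <= |t|)
    return 1 - central_area

n = len(length_cm)
df = n - 2
r = correlation_coefficient(length_cm, weight_kg)
t_stat = r * math.sqrt(df / (1 - r ** 2))
p_value = two_tailed_p_value(t_stat, df)
alpha = 0.05

print(f"r = {r:.3f}, t = {t_stat:.3f}, df = {df}, P-value = {p_value:.4f}")
print("Linear relation" if p_value < alpha else "No evidence of a linear relation")
```

Because the P-value is compared with α directly, no table of critical values is needed.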
Objective 5 - Explain the difference between correlation and causation
Note: do not read “causal” as “casual”; they are not the same!
Causation can only be established by designed experiments, not by observational studies.
A lurking variable is related to both the explanatory and response variables. Two variables can be correlated through a lurking variable without there being a causal relationship between them.
Causation
If there is a significant linear correlation between two variables, then one of five situations can be true.
· There is a direct cause and effect relationship
· There is a reverse cause and effect relationship
· The relationship may be caused by a third variable
· The relationship may be caused by complex interactions of several variables
· The relationship may be coincidental
Common Errors
There are some common errors that are made when looking at correlation.
· Avoid concluding causation. Just because there is a linear relationship doesn't mean that one thing caused the other. It could be any of the five situations above.
· Avoid data based on rates or averages. Variation is suppressed when using a rate or an average. Remember the central limit theorem? The variance of the sample means was the variance of the population divided by the sample size. So, if you work with averages, the variances are smaller and you might be able to find linear relationships that are significant when they would not be if the original data were used.
· Watch out for linearity. All that we're testing here is the strength of a linear relationship. There are other kinds of relationships. In algebra, we talk about linear, quadratic, cubic, quartic, exponential, logarithmic, Gaussian (bell shaped), logistic, and power models. A scatter plot is a good way to look for patterns.
Section 4.2 – Least-Squares Regression
Objectives
- Find the least-squares regression line and use the line to make predictions
- Interpret the slope and the y-intercept of the least-squares regression line
- Compute the sum of squared residuals
Objective 1 - Find the least-squares regression line and use the line to make predictions
Once the linear correlation coefficient has indicated that a linear relationship exists between two variables, our next step is to find a linear equation that describes the relationship between the two variables.
The goal of this section is to find not just any linear equation, but the “best” linear equation that fits our data.
What does “best” mean?
We will define “best” in terms of residuals, or errors. A residual is the difference between an observed y-value ($y$) and the predicted y-value ($\hat{y}$). The predicted y-value comes from the line we chose to represent the data.
From an example in the book
Residual = observed y − predicted y = $y - \hat{y}$
Positive residuals indicate that a data point is above the line, i.e., above average.
Negative residuals indicate that a data point is below the line, i.e., below average.
So the definition of “best” is to minimize the sum of the squared residuals.
The line of best-fit or the least-squares regression line is the line that minimizes the sum of the squared residuals.
On the calculator it will be $y = ax + b$.
Relate this equation back to the slope-intercept form for a linear equation, $y = mx + b$, that you learned in Algebra, where m is the slope and b is the y-intercept.
x is called the predictor variable.
y is called the response variable.
The good news is that the calculator will do all of the work for us. Use either LinReg or LinRegTTest to get the a and b values to form the least-squares regression line. Calculator instructions for these functions were presented earlier.
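To see what "minimizing the sum of the squared residuals" means in practice, the sketch below compares two lines fitted to the credit score data from Section 4.1. One is a hand-picked candidate line through the first and last data points. The other is the least-squares line. This is an illustration only. The function names are hypothetical, and the slope is computed with the standard formula slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which these notes do not state explicitly.

```python
credit_score = [545, 595, 640, 675, 705, 750]
interest_rate = [19, 18, 12, 9, 7, 5]

def sum_squared_residuals(x, y, slope, intercept):
    """Sum of (observed y - predicted y)^2 for the line y-hat = slope*x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

# Candidate line: passes through the first and last data points
cand_slope = (5 - 19) / (750 - 545)
cand_intercept = 19 - cand_slope * 545

ls_slope, ls_intercept = least_squares_line(credit_score, interest_rate)

ssr_candidate = sum_squared_residuals(credit_score, interest_rate, cand_slope, cand_intercept)
ssr_ls = sum_squared_residuals(credit_score, interest_rate, ls_slope, ls_intercept)

print(f"candidate line:     SSR = {ssr_candidate:.3f}")
print(f"least-squares line: SSR = {ssr_ls:.3f}")
```

Any other line tried in place of the candidate should also give a sum of squared residuals at least as large as that of the least-squares line.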
Example
Use your calculator to find the least-squares regression line for the following set of data:
Temperature / 83 / 72 / 88 / 84 / 80 / 76 / 70 / 93
Cricket Chirps / 1025 / 960 / 1200 / 1100 / 900 / 860 / 880 / 1180
Predict the number of cricket chirps when the temperature is 90 degrees, 0 degrees, and 105 degrees.
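One standard way to obtain the line by hand uses the correlation coefficient: slope = r·(s_y / s_x) and intercept = ȳ − slope·x̄. The following sketch applies this to the cricket data and evaluates the three requested predictions. It is an illustration, not the calculator's method, and the names `least_squares_line` and `predict` are hypothetical. Temperatures of 0 and 105 degrees lie outside the observed range of 70 to 93 degrees, so the code flags those predictions rather than presenting them as trustworthy.

```python
from statistics import mean, stdev

temperature = [83, 72, 88, 84, 80, 76, 70, 93]
chirps = [1025, 960, 1200, 1100, 900, 860, 880, 1180]

def least_squares_line(x, y):
    """Return (slope, intercept) using slope = r * s_y / s_x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = least_squares_line(temperature, chirps)

def predict(temp):
    """Predicted number of chirps at the given temperature."""
    return slope * temp + intercept

print(f"y-hat = {slope:.3f}x + ({intercept:.3f})")
for temp in (90, 0, 105):
    in_range = min(temperature) <= temp <= max(temperature)
    note = "" if in_range else "  (outside the observed temperatures; do not rely on this value)"
    print(f"temperature {temp}: predicted chirps = {predict(temp):.1f}{note}")

predicted_90 = predict(90)
```

In the calculator's notation y = ax + b, `slope` corresponds to a and `intercept` to b.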
Example – American Black Bears
Total Length (cm) / Weight (kg)
139 / 110
138 / 60
139 / 90
120.5 / 60
149 / 85
141 / 100
141 / 95
150 / 85
166 / 155
151.5 / 140
129.5 / 105
150 / 110
Use your calculator to find the least-squares regression line.
Predict the weight of a bear if the length is 150 cm, 160 cm, and 200 cm.
Objective 2 - Interpret the slope and the y-intercept of the least-squares regression line
The y-intercept of any line is the point where the line intersects with the vertical axis. Find the y-intercept by letting x=0 in the equation and solving for y.
To interpret the y-intercept, first ask two questions:
- Is 0 a reasonable value for the explanatory variable?
- Do any observations near x = 0 exist in the data set?
If the answer is no to either question, do not interpret the y-intercept.
Do not use the regression model to make predictions outside the scope of the model. That is, do not use the regression model for values of the explanatory variable that are much smaller or larger than the observed data.
The slope is the rate of change: the amount by which the predicted response changes, on average, when the explanatory variable increases by one unit.
Example – Cricket Chirps
Temperature / 83 / 72 / 88 / 84 / 80 / 76 / 70 / 93
Cricket Chirps / 1025 / 960 / 1200 / 1100 / 900 / 860 / 880 / 1180
Interpret the slope and y-intercept of the least-squares regression line found earlier.
Example – American Black Bears
Total Length (cm) / Weight (kg)
139 / 110
138 / 60
139 / 90
120.5 / 60
149 / 85
141 / 100
141 / 95
150 / 85
166 / 155
151.5 / 140
129.5 / 105
150 / 110
Interpret the slope and y-intercept of the least-squares regression line found earlier.