Relations Between Variables
Scientists are forever trying to find relations between quantities:
- Does the number of minutes of exercise per week influence blood pressure?
- Does the amount of time required for a ball to roll down a ramp depend on the slope of the ramp?
- Does the amount of fertilizer I use on plants affect their size?
In each case above, the scientists run experiments and collect pairs of numbers for the quantities that they are trying to relate to each other:
- (weekly exercise minutes, systolic blood pressure), e.g. (35, 136), (0, 155), (200, 121), …
- (angle of ramp (deg.), time of ball (sec.)), e.g. (5, 17), (20, 6), (45, 3), …
- (cc fertilizer/sq. meter garden, plant height m.), e.g. (.5, .33), (.77, .02), (.01,.54), …
They might plot these number pairs on a graph and examine the graph for a trend. For example, in repeating the ball-rolling experiment pictured below, the student records the pairs of numbers representing the angle of the ramp and the corresponding “rolling times.” She then plots these numbers on a graph as follows:
What can you conclude about a relationship between the angle of the ramp and the time the ball rolls? Can you explain this?
Another student does a very careful experiment to relate the growth of plants with the amount of fertilizer used and comes up with the following graph:
Is there a relationship between plant height and the amount of fertilizer used? Can you explain this?
These examples of seeking a relationship between variables can be quantified by using methods of statistical analysis called Correlation and Regression.
Correlation
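Correlation analysis seeks to identify (by a single number) the degree to which there is a (linear) relation between the numbers in sets of data pairs. The correlation coefficient of a set of data pairs $(x_i, y_i)$ with x- and y-means $\bar{x}$ and $\bar{y}$ respectively is
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}.$$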
You don’t need to worry about computing this number; it’s easy to use a computer to calculate it. The interpretation of this number is more important – it is somewhere between –1 and 1. The closer r is to 1, the more positively correlated are the sets of numbers in the sense that an increase in x corresponds to a proportional increase in y; similarly with decreases in x corresponding to proportional decreases in y. On the other hand, if r is close to –1, then increases in x correspond to decreases in y and decreases in x correspond to increases in y, so we say that x and y are negatively correlated. Finally, if r is close to zero, there is little if any relationship between the variables – we say they are uncorrelated.
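To make the formula concrete, here is a minimal Python sketch of the computation, assuming the standard (Pearson) form of the correlation coefficient. The function name `correlation` is just illustrative, and the data are the three example (angle, time) pairs listed near the start of this handout, not the full data set behind the graphs, so the result will not match the value quoted for the student's experiment.

```python
import math

def correlation(pairs):
    """Correlation coefficient r for a list of (x, y) data pairs."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    # Sum of products of deviations from the means (numerator of r).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    # Sums of squared deviations (these go under the square root).
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    syy = sum((y - y_mean) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Example (angle of ramp, rolling time) pairs from the list above.
ramp_pairs = [(5, 17), (20, 6), (45, 3)]
r = correlation(ramp_pairs)
print(round(r, 3))
```

The result is negative, which matches the idea that rolling time goes down as the ramp gets steeper.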
Consider the earlier graphs from the “ball-rolling” and “fertilizer” experiments:
- In the graph of time of rolling vs. angle of ramp, as the angle increases, does the rolling time generally
- increase, decrease, or change in an unrelated fashion?
- Explain your answer from the graph.
- The correlation coefficient for this data turns out to be -.84. Does this agree with your answers above? Explain.
- In the graph of plant height vs. fertilizer concentration, as the amount of fertilizer per square meter increases, does plant height generally
- increase, decrease, or change in an unrelated fashion?
- Explain your answer from the graph.
- The correlation coefficient for this data turns out to be .37. Does this agree with your answers above? Explain.
- Consider the graph that represents the weight (lb.) vs. height (in.) for players on last year’s Cincinnati Bengals football team:
As the height of players increases, does the weight generally
- increase, decrease, or change in an unrelated fashion?
- Explain your answer from the graph.
- Guess which of the numbers below best represents the correlation coefficient for this data. Explain your guess.
- i. -.75 ii. .03 iii. .73 iv. .99
- Consider the graph generated from Dr. Denice Robertson’s research on lobsters and their production of eggs. She has measured the number of eggs produced by a lobster and the lobster’s length (mm.). Her data are graphed below:
As the length of the lobster increases, does the number of eggs produced generally
- increase, decrease, or change in an unrelated fashion?
- Explain your answer from the graph.
- Guess which of the numbers below best represents the correlation coefficient for this data. Explain your guess.
- i. -.89 ii. -.13 iii. .25 iv. .91
Regression
Regression analysis is the search for a line or curve that best fits a set of data pairs. We will use linear regression, which seeks a line with equation $y = mx + b$ that “best fits” the data. The term “best fits” has a precise mathematical meaning that we can think of as “minimizing the distances to the line” (more precisely, the sum of the squared vertical distances from the data points to the line). For example, if a computer program for doing regression is applied to the data from the “Ball rolling experiment”, the best-fitting line is shown on the graph below:
It will turn out that any other line will give a larger overall distance to the points than this line does.
You can frequently estimate the equation of the regression line by estimating its slope m (i.e. change in y divided by change in x) and its y-intercept b (i.e. the value of y where the line crosses the y-axis). From the graph above, we could estimate that the line has y-intercept close to 6 and, since it appears to run between the points (5, 6.5) and (65, .75), it has slope $m \approx \frac{0.75 - 6.5}{65 - 5} \approx -0.096$. Recognize that this is just a guess based on “eyeballing the graph”, but this would give the equation
y = -0.096x + 6.
You can use a computer statistics package to see that the “best fit” linear regression line has the equation $y = -0.0808x + 6.0527$, pretty close to what was “eyeballed” above. It is important to notice that the “slope” of this line is negative, m=-0.0808, and the “intercept” is b=6.0527. One connection between this equation and the correlation coefficient is that the “slope” and the correlation coefficient have the same sign; here they are both negative. This is because negatively correlated variables have a “downward trend” and the best fit line slopes down.
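If you are interested, the regression coefficients m and b that appear in the equation for the “best fit line”, $y = mx + b$, are determined by the following formulas:
$$m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\,\bar{x}.$$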
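You may notice that the formula for the slope m is close to the formula for the correlation coefficient r, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$, but don’t worry about calculating any of this stuff because we’ll use the computer to do the work.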
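For readers who want to see the computer's side of this, the following is a minimal Python sketch of a least-squares line fit, assuming the usual formulas (slope = sum of products of deviations divided by sum of squared x-deviations, intercept = y-mean minus slope times x-mean). The function name `fit_line` is made up for illustration, and the data are the three example (angle, time) pairs from the start of the handout rather than the student's full data set, so the coefficients will differ from those quoted above.

```python
def fit_line(pairs):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    # Slope: sum of products of deviations over sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    m = sxy / sxx
    # Intercept: the fitted line always passes through (x_mean, y_mean).
    b = y_mean - m * x_mean
    return m, b

# Example (angle of ramp, rolling time) pairs from the list above.
sample_pairs = [(5, 17), (20, 6), (45, 3)]
m, b = fit_line(sample_pairs)
print(f"y = {m:.4f} x + {b:.4f}")
```

As expected for negatively correlated data, the slope that comes out is negative.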
When you do regression analysis using a computer program, you’ll sometimes see some indication of “goodness of fit”, such as $R = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, where $y_i$ is the measured value, $\hat{y}_i$ is the value of the regression line evaluated at $x_i$, and n is the sample size. R is a type of “average deviation of measured from estimated values” so, when R is small, the approximation is more accurate. The “best fit” line gives the smallest R from among all lines. In the example we looked at earlier, if you were to compare the R-values for the “best fit” line to the “eye-balled” line, you’d see that the “best-fit” line had a slightly smaller R.
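The comparison described above can be tried out with a short Python sketch. It assumes R is the root-mean-square deviation between measured values and the values on the line, which is one reading of “average deviation”; your statistics package may report a different measure. The function names and the “hand-drawn” line are hypothetical, and the data are the three example (angle, time) pairs from the start of the handout.

```python
import math

def fit_line(pairs):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    m = sxy / sxx
    return m, y_mean - m * x_mean

def rms_deviation(pairs, m, b):
    """Root-mean-square gap between measured y and the line's y (assumed R)."""
    n = len(pairs)
    return math.sqrt(sum((y - (m * x + b)) ** 2 for x, y in pairs) / n)

sample_pairs = [(5, 17), (20, 6), (45, 3)]

# The "best fit" line from the least-squares formulas.
best_m, best_b = fit_line(sample_pairs)
best_R = rms_deviation(sample_pairs, best_m, best_b)

# A hypothetical hand-drawn line through the first and last points.
guess_m = (3 - 17) / (45 - 5)
guess_b = 17 - guess_m * 5
guess_R = rms_deviation(sample_pairs, guess_m, guess_b)

print(best_R < guess_R)  # True
```

Any other line you try in place of the hand-drawn one should also give an R at least as large as the best-fit value.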
If you are interested in “playing with” the placement of data points to see how their placement affects the regression line, go to the website for an interactive demonstration.
You might wonder, “Why would you care about the equation of the best fit line?” Well, if your line “fits the data” pretty well, then you could use the equation of the line to predict the relation between x and y, i.e., given a value of x, you could predict the value of y. For example, note that the “Ball rolling” experiment only measured the “rolling time” corresponding to angles of 5 deg, 10 deg, 15 deg, 20 deg, 25 deg, 30 deg, 40 deg, 65 deg. Suppose we wanted to predict the rolling time for a ball rolling down a 37 deg ramp. We could plug x=37 into the formula for the regression line to get the “predicted rolling time” $y = -0.0808(37) + 6.0527 \approx 3.06$ seconds. Scientists use the regression equation to predict y-values for given x-values.
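The same prediction can be done in a couple of lines of Python. This is only a sketch: the function name `predict` is made up for illustration, and the slope and intercept are the values m=-0.0808 and b=6.0527 quoted earlier for the ball-rolling data.

```python
def predict(x, m=-0.0808, b=6.0527):
    """Predicted rolling time (sec.) for a ramp angle x (deg.) using y = m*x + b."""
    return m * x + b

# Predicted rolling time for a 37 deg ramp.
print(round(predict(37), 2))  # 3.06
```

Changing the argument lets you predict the rolling time for any other angle, though predictions are most trustworthy for angles within the range that was actually measured.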
Indeed, scientists usually only use regression analysis when the variables x and y have a very special causal relationship. In particular, x should be treated as an independent variable that the experimenter can control, or at least select, and y is the dependent variable, whose values depend on the values of x. The variable “angle of ramp” in the Ball rolling experiment can be controlled by the experimenter, and thus is a good example of an independent variable. The corresponding “rolling time” depends on the choice of angle, so it must be a dependent variable. For this reason, it makes sense to seek a linear functional relation that approximates the dependence of rolling time on angle, namely the regression equation.
Which of the other examples displayed this causal relationship?
- In the fertilizer/plant size experiment, the scientist controlled the amount of fertilizer spread on the field, so this plays the role of an independent variable and the plant height is treated as the dependent variable. We saw earlier that this data was very weakly correlated. Nevertheless, we can compute the regression line to be y=1.804631x+1.535264. The difference from the previous example is that the “goodness of fit” value R will be relatively much larger in this case than in the previous example, meaning the line fits the data less well; see the graph below that includes the regression line.
- In the football player example, height doesn’t cause weight nor does weight cause height. Actually, both are related to a general quality that we could refer to as a person’s “size”. Thus, although the variables are well correlated, it doesn’t make much sense to apply regression analysis to this data.
- Dr. Robertson’s study that relates egg production to size in lobsters does allow for regression analysis. Once you realize that the female lobster distributes her eggs on her tail, it is logical that the larger lobster has more room to carry eggs and thus will produce more. The regression equation for this data is
y = 3758.525x - 106704
and the graph (with regression line) is drawn below:
Homework on Correlation and Regression:
Turn in answers to problems 1-4 from earlier and also to the following:
- Using the Lobster Fecundity graph above:
- Estimate the slope and y-intercept of the line in the graph above by the “eye-ball” method and create your equation for the linear regression line.
- Use your equation to estimate the number of eggs produced by a lobster that is 45 mm long. Use the “best fit” regression equation above to do this same estimate. Is the difference in these two estimates meaningful?
- You collect the data on minutes of weekly exercise vs. systolic blood pressure for a group of college students and plot the data:
- Draw what you think is the “best fit” line on the graph above. It’s best to use a clear ruler to draw the line that you think gives the minimum total distance from all the points.
- Estimate the slope and y-intercept for your line and create your regression equation.
- Use your regression equation to estimate the blood pressure for a student who does 180 min. exercise per week.