Statistics Chapter 4

Correlation and Regression

If we have two (or more) variables we are usually interested in the relationship between the variables.

Association between Variables

Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of the second variable.

Example

(a.)Height and weight are associated (positively)

(b.)High school GPA and college GPA are associated (positively)

(c.)Weight of a car and its MPG rating are associated (negatively)

(d.)Age and whether a person has had a heart attack are associated (positively)

When looking at the relationship between variables we can

(1)just look at the association (-or-)

(2)try to explain variation in one variable in terms of the other variable (that is, try to predict one variable from the other)

Response Variable

•denoted by y

•it is the variable that is to be predicted

•measure of the outcome of an experiment

•also called the dependent variable

Explanatory Variable

•denoted by x

•explains the change in the response variable

•also called independent variable

We want to explain or predict y based on the explanatory variable x.

In some cases there is no clear distinction between variables as far as which is the response and which is the explanatory variable.

Method of Understanding Relationship

(1)Construct numerical and graphical display of each variable

(2)construct a scatter plot of y versus x

(3)look for patterns and deviations from the pattern (e.g., outliers)

(4)When the overall pattern is quite regular, use a mathematical model to describe the pattern.

Scatter Diagrams

•A graph in which pairs of points, (x, y), are plotted with x on the horizontal axis and y on the vertical axis.

•The explanatory variable is x.

•The response variable is y.

•One goal of plotting paired data is to determine if there is a linear relationship between x and y.

Paired Data (x, y)

Let x and y be our variables of interest. Suppose we have observations

(x1, y1), (x2, y2)… (xn, yn)

A scatter plot is a plot of the n points (xi, yi).

Look for:

•pattern

•deviations from pattern

•describe the pattern in terms of strength, direction, and form

Some Patterns

•linear patterns = data points approximately follow a straight line

•curvilinear patterns = data points approximately follow some curve

How Strong Is the Linear Correlation?

Not all relationships between variables are linear.

Statisticians need a quantitative measure of the strength of the linear association. How to do this?

The Sample Correlation Coefficient r - measure of the strength of linear association given data (x1, y1), (x2, y2)… (xn, yn)

Statisticians use the sample correlation coefficient r to measure the strength of the linear correlation between paired data.

1)r has no units.

2)–1 ≤ r ≤ 1

3)r > 0 indicates a positive relationship between x and y; r < 0 indicates a negative relationship.

4)r = 0 indicates no linear relationship.

5)Switching the explanatory variable and response variable does not change r.

6)Changing the units of the variables does not change r.

A Computational Formula for r

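The correlation coefficient is

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $\bar{x} = \frac{1}{n}\sum x_i$, $\bar{y} = \frac{1}{n}\sum y_i$, $s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$, and $s_y = \sqrt{\frac{1}{n-1}\sum (y_i - \bar{y})^2}$

There is a positive linear association if r > 0.

There is a negative linear association if r < 0.

There is no linear association if r = 0.

The computational formula is

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$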
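As a minimal sketch, the two usual ways of computing r can be checked against each other in Python: the definition as a sum of products of standardized values divided by n - 1, and the equivalent form based on raw sums. The function names and the small data set are illustrative assumptions, not taken from the text.

```python
import math

def corr_standardized(xs, ys):
    """r as the sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def corr_computational(xs, ys):
    """r from the raw sums: n, sum x, sum y, sum x^2, sum y^2, sum xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data, chosen only to keep the arithmetic small.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_def = corr_standardized(x, y)
r_comp = corr_computational(x, y)
print(round(r_def, 4), round(r_comp, 4))
```

Both functions should agree up to floating-point rounding. The raw-sums form is convenient for hand calculation from a table of x, y, x², y², and xy columns.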

Properties of Correlation

(1)Correlation makes no distinction between explanatory and response variables. corr(x, y) = corr(y, x)

(2)Correlation requires that both variables are quantitative.

(3)Correlation does not change values if there is a change in the scale of measurement or a change in the location of the measurements of either variable.

(4)Correlations are always between -1 and +1.

(5)Positive values of r indicate positive linear association, and negative values indicate negative linear association. If x and y are perfectly positively linearly related then r = +1. If x and y are perfectly negatively linearly related then r = -1 exactly.

(6)Values of r near 0 indicate weak or no linear association between x and y.

(7)Correlation coefficient, r, measures the strength of linear association. Correlation is not a measure of strength of curved association.

(8)Correlation is not resistant to outliers

(9)Correlation does not provide a complete description of the relationship between variables.
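Several of these properties can be checked numerically. The sketch below uses a hypothetical data set and an illustrative helper `corr` (neither is from the text) to show properties (1), (3), and (8).

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data lying close to a rising straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = corr(x, y)

# Property (1): swapping the roles of x and y leaves r unchanged.
r_swapped = corr(y, x)

# Property (3): multiplying by a positive constant (change of scale) and
# adding a constant (change of location) leaves r unchanged.
x_rescaled = [100 * v + 7 for v in x]
r_rescaled = corr(x_rescaled, y)

# Property (8): one outlying point can change r substantially.
r_outlier = corr(x + [6], y + [0.0])

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3), round(r_outlier, 3))
```

Note that rescaling by a negative constant would flip the sign of r while keeping its magnitude, which is why the sketch uses a positive multiplier.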

NOTE: r is the sample correlation

ρ is the population correlation

Illustration

Caribou (x, in hundreds) and wolf (y) populations

Interpreting the Value of r

•r = 0 : There is no linear relation for the points of the scatter diagram.

•r = 1 or r = –1 : There is a perfect linear relation between x and y; all points lie on a straight line.

•0 < r < 1 : The x and y values have a positive correlation. As x increases, y tends to increase.

•–1 < r < 0 : The x and y values have a negative correlation. As x increases, y tends to decrease.
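The extreme cases can be illustrated with a short sketch. The data are hypothetical and `corr` is an illustrative helper; the parabola case shows that r = 0 rules out a linear relation only, not every relation.

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]                    # hypothetical x-values
rising_line = [3 + 2 * v for v in x]     # all points on a rising line
falling_line = [3 - 2 * v for v in x]    # all points on a falling line
parabola = [v ** 2 for v in x]           # exact curved relation, not linear

r_rising = corr(x, rising_line)
r_falling = corr(x, falling_line)
r_parabola = corr(x, parabola)
print(r_rising, r_falling, r_parabola)
```

Here y is completely determined by x in all three cases, yet only the two straight lines give |r| = 1.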

Example - Which of the following shows a strong negative correlation?

Critical Thinking

•Expect r to vary from sample to sample.

•So, consider the significance of r as well as its value when assessing the strength of a linear correlation. (Section 11.4)

•|r| ≈ 1 only implies a linear relationship between x and y.

•It does not imply a cause and effect relationship between x and y.

•The values of x and y may both depend linearly on some third lurking variable.
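A small sketch of the lurking-variable point, using entirely hypothetical numbers: two series are each built from a third variable (`wealth`) plus a small fixed disturbance, and neither is computed from the other. The names and values are assumptions for illustration only.

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical lurking variable and fixed small disturbances.
wealth = [1, 2, 3, 4, 5, 6, 7, 8]
noise_a = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.2]
noise_b = [-0.2, 0.4, 0.3, -0.5, 0.2, -0.1, -0.3, 0.2]

# Each series depends linearly on wealth only, not on the other series.
coffee = [10 + 2 * w + e for w, e in zip(wealth, noise_a)]
computers = [5 + 3 * w + e for w, e in zip(wealth, noise_b)]

r = corr(coffee, computers)
print(round(r, 3))
```

The two series are strongly correlated even though, by construction, neither one influences the other.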

Example - Over the past few years, there has been a strong positive relationship between the annual consumption of coffee and the number of computers sold per year. Which conclusion is the best one to draw from this strong correlation?

a). Coffee consumption stimulates computer sales.

b). Computer users are sophisticated and thus are inclined to drink coffee.

c). The correlation is purely accidental.

d). The responses of both variables probably reflect the increasing wealth of the citizenry.

Linear Regression

•Linear Regression - a mathematical technique for creating a linear model for paired data.

•Based on the “least-squares” criterion of best fit

Example - Caribou and wolf populations in Denali National Park

Questions

•Do the data points have a linear relationship?

•How do we find an equation for the best fitting line?

•Can we predict the value of the response variable for a new value of the predictor variable?

•What fractional part of the variability in y is associated with the variability in x?

Least-Squares Criterion - Method of Least Squares to fit Equations to Data

Let y=response variable and x=explanatory variable

Data: (x1, y1), (x2, y2)… (xn, yn)

Suppose the plot of y vs. x shows a straight line (linear) relationship. We want to determine the line of “best” fit.

A regression line is a straight line that describes how the average response value varies as the explanatory variable (x) changes. We can use the regression line to predict the value of y at a given x.

Equation of a straight line: $\hat{y} = a + bx$

where a = the y-intercept (value of y when x = 0) and b = slope of the line (change in y for a unit change in x)

If we know a and b we can predict y for a given value of x. How accurate the prediction is depends on how much scatter there is in the data about the line.

Methods of Fitting a Straight Line

(1.)“Eye Ball” method

(2.)Least Squares (1805)

The method of Least Squares finds the straight line that minimizes the sum of the squares of the vertical distances between the points and the line.

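If the line is $\hat{y} = a + bx$ then we want to minimize $\sum_{i=1}^{n} (y_i - a - b x_i)^2$ with respect to a and b. The least squares straight line is given by

$$\hat{y} = a + bx, \qquad b = r\,\frac{s_y}{s_x}$$

where $s_x$ is the standard deviation of x and $s_y$ is the standard deviation of y. And

$$a = \bar{y} - b\,\bar{x}$$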
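For readers who want the reasoning step behind these coefficients, here is a sketch of the standard derivation. It writes the line as ŷ = a + bx and uses S(a, b) as an illustrative name for the sum of squared vertical distances; the coefficients follow from setting both partial derivatives to zero.

```latex
\begin{aligned}
S(a,b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;\Longrightarrow\; a = \bar{y} - b\,\bar{x} \\
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} x_i\,(y_i - a - b x_i) = 0
  \;\Longrightarrow\; b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = r\,\frac{s_y}{s_x}
\end{aligned}
```

The last equality uses the definitions of r, the standard deviation of x, and the standard deviation of y, each with divisor n - 1, so the factors of n - 1 cancel.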

Properties of the Regression Equation

•The point $(\bar{x}, \bar{y})$ is always on the least-squares line.

•The slope tells us the amount that y changes when x increases by one unit.
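A minimal sketch of fitting the least-squares line from r and the two standard deviations, then checking that the point of means lies on the line. The function name `least_squares_line` and the data are illustrative assumptions.

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (
        (n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

print(f"y-hat = {a:.2f} + {b:.2f} x")
# The fitted value at x-bar should equal y-bar (up to rounding).
print(abs((a + b * x_bar) - y_bar) < 1e-9)
```

For this data set the slope works out to 0.6 and the intercept to 2.2, so each one-unit increase in x is associated with a predicted increase of 0.6 in y.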

Example - Caribou (x, in hundreds) and wolf (y) populations

Least-squares linear relationship between caribou and wolf populations:

Critical Thinking: Making Predictions

•We can simply plug x values into the regression equation to calculate predicted y values.

•Extrapolation may produce unrealistic forecasts
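One way to keep the extrapolation warning in view is to have a prediction helper report whether the new x falls outside the range of the observed x-values. This is a sketch under assumed names (`predict`, `x_min`, `x_max`) and a hypothetical fitted line, not a method from the text.

```python
def predict(a, b, x_new, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y-hat = a + b*x.

    x_min and x_max are the smallest and largest x-values used to fit
    the line; predictions outside that range are extrapolations.
    """
    y_hat = a + b * x_new
    is_extrapolation = not (x_min <= x_new <= x_max)
    return y_hat, is_extrapolation

# Hypothetical fitted line y-hat = 2.2 + 0.6x, fitted on x from 1 to 5.
a, b = 2.2, 0.6

inside = predict(a, b, 3.5, x_min=1, x_max=5)
outside = predict(a, b, 50, x_min=1, x_max=5)
print(inside)
print(outside)
```

The second call still returns a number, but the flag signals that the linear pattern has not been observed anywhere near x = 50, so the forecast may be unrealistic.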

Coefficient of Determination

•Another way to gauge the fit of the regression equation is to calculate the coefficient of determination, r².

1). Compute r. Simply square this value to get r².

2). r² is the fractional amount of total variation in y that can be explained using the linear model.

3). 1 – r² is the fractional amount of total variation in y that is due to random chance (or possibly due to lurking variables).

Example - The linear correlation coefficient for a set of paired data is r = 0.86.

What fractional amount of the total variation in y is due to random chance and/or to lurking variables?
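A worked answer, using the rule that the unexplained fraction is one minus the square of r:

```latex
r^2 = (0.86)^2 = 0.7396, \qquad 1 - r^2 = 1 - 0.7396 = 0.2604
```

So roughly 26% of the total variation in y is attributed to random chance and/or lurking variables, and about 74% is explained by the linear model.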

Comments about Least Squares Fit

•Since $b = r\,s_y/s_x$ and since b is the change in y for a unit change in x, if r is close to 0 there will be little change in y as x changes (b ≈ 0 gives an approximately horizontal regression line). For fixed $s_x$ and $s_y$, the larger r is in absolute value, the greater the change in y for a unit change in x.

•The least squares regression line passes through the point $(\bar{x}, \bar{y})$.

•The regression line of y regressed on x is different from the regression line of x regressed on y, so the choice of which variable is the response matters in least squares regression. (Recall: the correlation coefficient is not affected by interchanging x and y.)

•The square of the correlation coefficient, r², (called the coefficient of determination) is the proportion of variation in the values of y that can be accounted for (explained by) the least squares regression line of y on x. Furthermore, 1 – r² is the proportion of total variation in y that is due to random chance or to the possibility of lurking variables that influence y.
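Two of these comments can be checked numerically: that regressing y on x and x on y give different lines, and that the square of r equals the explained share of the variation in y. The sketch below uses hypothetical data and an illustrative helper `slope_intercept`.

```python
def slope_intercept(xs, ys):
    """Least-squares (intercept, slope) for predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_yx, b_yx = slope_intercept(x, y)   # y regressed on x: y-hat = a_yx + b_yx * x
a_xy, b_xy = slope_intercept(y, x)   # x regressed on y: x-hat = a_xy + b_xy * y

# The product of the two slopes equals r squared.
r_squared = b_yx * b_xy

# r squared as explained variation: 1 - (residual variation / total variation).
y_bar = sum(y) / len(y)
sse = sum((yi - (a_yx + b_yx * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)
explained = 1 - sse / sst

print(round(b_yx, 3), round(b_xy, 3), round(r_squared, 3), round(explained, 3))
```

Here the y-on-x line has slope 0.6, while the x-on-y line, rewritten with y on the vertical axis, has slope 1, so the two lines differ even though r is the same either way.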

Example – Suppose that five individuals with the same initial systolic blood pressure are given a blood pressure medication, each with a different dosage. After a period of time on the medication each individual has their systolic blood pressure measured. The data are:

Dosage (x) / Systolic Blood Pressure (y)
10 mg / 150
20 mg / 140
30 mg / 145
40 mg / 130
50 mg / 130

a) Make a scatter plot of the data

b) Find the correlation between Dosage and Blood Pressure

c) Find the Least squares regression line of the data

d) Find the coefficient of determination, that is, the proportion of variation in the values of y that can be accounted for (explained) by the least squares regression line of y on x.
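Parts (b) through (d) can be checked with a short Python sketch using the five data pairs above. The variable names are illustrative; the slope is computed as the ratio of the cross-product sum to the sum of squared x-deviations, which is algebraically the same as r times the ratio of the standard deviations of y and x.

```python
import math

dosage = [10, 20, 30, 40, 50]           # x, in mg
pressure = [150, 140, 145, 130, 130]    # y, systolic blood pressure

n = len(dosage)
x_bar = sum(dosage) / n
y_bar = sum(pressure) / n

# Sums of squared deviations and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in dosage)
syy = sum((y - y_bar) ** 2 for y in pressure)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosage, pressure))

r = sxy / math.sqrt(sxx * syy)   # part (b): correlation
b = sxy / sxx                    # part (c): slope
a = y_bar - b * x_bar            # part (c): intercept
r_squared = r ** 2               # part (d): coefficient of determination

print(f"r = {r:.4f}")
print(f"y-hat = {a:.1f} + ({b:.1f}) x")
print(f"r^2 = {r_squared:.4f}")
```

By hand, the means are 30 and 139, the deviation sums are 1000, 320, and -500, so r is about -0.88, the line is ŷ = 154 - 0.5x, and r² is about 0.78. Each additional mg of dosage is associated with a predicted drop of 0.5 in systolic blood pressure, and about 78% of the variation in blood pressure is explained by the line.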
