Math 211 Introduction to Statistics

Chapter 7 CORRELATION THEORY

Correlation and Regression

In the last chapter we considered the problem of regression, or estimation of one variable (the dependent variable) from one or more related variables (the independent variables). In this chapter we consider the closely related problem of correlation, or the degree of relationship between variables, which seeks to determine how well a linear or other equation describes or explains the relationship between variables.

If all values of the variables satisfy an equation exactly, we say that the variables are perfectly correlated. For example, the circumferences $C$ and radii $r$ of all circles are perfectly correlated since $C = 2\pi r$. If two fair dice are tossed 100 times, there is no relationship between the corresponding points on each die; that is, they are uncorrelated.

Linear Correlation

If $X$ and $Y$ denote the two variables under consideration, a scatter diagram shows the location of the points $(X, Y)$ on a rectangular coordinate system. If all points in this scatter diagram seem to lie near a line, as in figures (a) and (b), the correlation is called LINEAR.

(a) Positive linear correlation (b) Negative linear correlation

• If $Y$ tends to increase as $X$ increases, the correlation is called positive correlation.

• If $Y$ tends to decrease as $X$ increases, the correlation is called negative correlation.

• If all the points seem to lie near some curve, the correlation is called nonlinear, and a nonlinear equation is appropriate for regression, as we have seen in the previous chapter.

• If there is no relationship indicated between the variables, we say that there is no correlation between them. (They are uncorrelated.)

(c) No correlation

Measures of Correlation

We can determine in a qualitative manner how well a given line or curve describes the relationship between variables by direct observation of the scatter diagram itself.

For example, a straight line is far more helpful in describing the relation between $X$ and $Y$ for the data of figure (a) than for the data of figure (b), because there is less scattering about the line of figure (a).

If we are to deal with the problem of scattering of sample data about lines or curves in a quantitative manner, it will be necessary for us to define measures of correlation.

Standard Error of Estimate

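As we have seen, the least-squares regression line of $Y$ on $X$ is

$$Y = a_0 + a_1 X \qquad (1)$$

and the regression line of $X$ on $Y$ is given by

$$X = b_0 + b_1 Y \qquad (2)$$

If we let $Y_{\text{est}}$ represent the value of $Y$ for given values of $X$ as estimated from equation (1), a measure of the scatter about the regression line of $Y$ on $X$ is supplied by the quantity

$$s_{Y.X} = \sqrt{\frac{\sum (Y - Y_{\text{est}})^2}{N}} \qquad (3)$$

which is called the standard error of estimate of $Y$ on $X$. If the regression line (2) is used, an analogous standard error of estimate of $X$ on $Y$ is defined by

$$s_{X.Y} = \sqrt{\frac{\sum (X - X_{\text{est}})^2}{N}} \qquad (4)$$

In general, $s_{Y.X} \neq s_{X.Y}$.

Equation (3) can be written as

$$s_{Y.X}^2 = \frac{\sum Y^2 - a_0 \sum Y - a_1 \sum XY}{N}$$

which may be more suitable for computation. A similar expression exists for equation (4):

$$s_{X.Y}^2 = \frac{\sum X^2 - b_0 \sum X - b_1 \sum XY}{N}.$$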
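As a rough numerical check, the sketch below fits a least-squares line and evaluates the standard error of estimate twice: once from the residuals (the definition) and once from the computational form in terms of the sums of $Y^2$, $Y$ and $XY$. The function names and the five-point data set are illustrative assumptions, not part of these notes.

```python
import math

def fit_line(xs, ys):
    """Return (a0, a1) of the least-squares line Y = a0 + a1*X."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a0 = (sy - a1 * sx) / n
    return a0, a1

def std_error_from_residuals(xs, ys):
    """Standard error of estimate: sqrt(sum (Y - Y_est)^2 / N)."""
    a0, a1 = fit_line(xs, ys)
    squares = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squares / len(xs))

def std_error_shortcut(xs, ys):
    """Same quantity from (sum Y^2 - a0*sum Y - a1*sum XY) / N."""
    a0, a1 = fit_line(xs, ys)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return math.sqrt((syy - a0 * sum(ys) - a1 * sxy) / len(xs))

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_y_on_x = std_error_from_residuals(xs, ys)  # Y regressed on X
s_x_on_y = std_error_from_residuals(ys, xs)  # X regressed on Y
print(s_y_on_x, std_error_shortcut(xs, ys), s_x_on_y)
```

The first two printed values should agree up to rounding, while the third is generally different, which illustrates that the standard error of $Y$ on $X$ need not equal that of $X$ on $Y$.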

Explained and Unexplained Variation

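The total variation of $Y$ is defined as $\sum (Y - \bar{Y})^2$, the sum of the squares of the deviations of the values of $Y$ from the mean $\bar{Y}$. With $Y_{\text{est}}$ denoting the value of $Y$ estimated from the regression line of $Y$ on $X$, this can be written as

$$\sum (Y - \bar{Y})^2 = \sum (Y - Y_{\text{est}})^2 + \sum (Y_{\text{est}} - \bar{Y})^2.$$

The first term on the right of this equation is called the unexplained variation, while the second term is called the explained variation. Similar results hold for the variable $X$.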
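The decomposition of the total variation can be checked numerically. The following is a minimal sketch, assuming the estimated values come from the least-squares line of $Y$ on $X$ (the identity does not hold for an arbitrary line); the function name and the data are hypothetical.

```python
def variation_split(xs, ys):
    """Return (total, unexplained, explained) variation of Y about the
    least-squares line of Y on X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least-squares slope and intercept, written with deviations from the means.
    a1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    a0 = y_bar - a1 * x_bar
    y_est = [a0 + a1 * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum((y - ye) ** 2 for y, ye in zip(ys, y_est))
    explained = sum((ye - y_bar) ** 2 for ye in y_est)
    return total, unexplained, explained

# Hypothetical data, chosen only for illustration.
xs = [2, 4, 6, 8, 10]
ys = [3, 7, 5, 11, 14]
total, unexplained, explained = variation_split(xs, ys)
print(total, unexplained + explained)  # the two numbers agree up to rounding
```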

Coefficient of Correlation

The ratio of the explained variation to the total variation is called the coefficient of determination.

If there is zero explained variation (the total variation is all unexplained) this ratio is zero.

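If there is zero unexplained variation (the total variation is all explained), the ratio is one. In the other cases the ratio lies between $0$ and $1$.

Since the ratio is always nonnegative, we denote it by $r^2$:

$$r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (Y_{\text{est}} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}.$$

The quantity $r$, called the coefficient of correlation, is given by

$$r = \pm\sqrt{\frac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{\frac{\sum (Y_{\text{est}} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}}$$

and varies between $-1$ and $+1$. The $+$ and $-$ signs are used for positive linear correlation and negative linear correlation respectively.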

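The standard deviation of $Y$ is

$$s_Y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{N}}$$

and $s_{Y.X} = s_Y \sqrt{1 - r^2}$.

The standard deviation of $X$ is

$$s_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$

and $s_{X.Y} = s_X \sqrt{1 - r^2}$.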

Example: The following table shows the respective weights $X$ and $Y$ of a sample of 12 fathers and their oldest sons.

Weight X of father (kg) / 65 / 63 / 67 / 64 / 68 / 62 / 70 / 66 / 68 / 67 / 69 / 71
Weight Y of son (kg) / 68 / 66 / 68 / 65 / 69 / 66 / 68 / 65 / 71 / 67 / 68 / 70

(a) Find the least-squares regression line of $Y$ on $X$.

(b) Find the least-squares regression line of $X$ on $Y$.

(c) Compute the standard error of estimate, $s_{Y.X}$.

(d) Compute the total variation, the unexplained variation and the explained variation.

(e) Find the coefficient of determination and the coefficient of correlation.

Solution.

$X$ / $Y$ / $X^2$ / $XY$ / $Y^2$
65 / 68 / 4225 / 4420 / 4624
63 / 66 / 3969 / 4158 / 4356
67 / 68 / 4489 / 4556 / 4624
64 / 65 / 4096 / 4160 / 4225
68 / 69 / 4624 / 4692 / 4761
62 / 66 / 3844 / 4092 / 4356
70 / 68 / 4900 / 4760 / 4624
66 / 65 / 4356 / 4290 / 4225
68 / 71 / 4624 / 4828 / 5041
67 / 67 / 4489 / 4489 / 4489
69 / 68 / 4761 / 4692 / 4624
71 / 70 / 5041 / 4970 / 4900
$\sum X = 800$ / $\sum Y = 811$ / $\sum X^2 = 53418$ / $\sum XY = 54107$ / $\sum Y^2 = 54849$

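(a) The regression line of $Y$ on $X$ is given by $Y = a_0 + a_1 X$, where $a_0$ and $a_1$ are obtained by solving the normal equations

$$\sum Y = a_0 N + a_1 \sum X, \qquad \sum XY = a_0 \sum X + a_1 \sum X^2,$$

given by

$$12 a_0 + 800 a_1 = 811, \qquad 800 a_0 + 53418 a_1 = 54107,$$

from which we find that $a_0 = 35.82$ and $a_1 = 0.476$, and thus $Y = 35.82 + 0.476X$.

(b) The regression line of $X$ on $Y$ is given by $X = b_0 + b_1 Y$, where $b_0$ and $b_1$ are obtained by solving the normal equations

$$\sum X = b_0 N + b_1 \sum Y, \qquad \sum XY = b_0 \sum Y + b_1 \sum Y^2,$$

given by

$$12 b_0 + 811 b_1 = 800, \qquad 811 b_0 + 54849 b_1 = 54107,$$

from which we find that $b_0 = -3.38$ and $b_1 = 1.036$, and thus $X = -3.38 + 1.036Y$.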
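The two pairs of normal equations in parts (a) and (b) can be solved mechanically. The sketch below uses Cramer's rule on the 2-by-2 system; the helper name `solve_normal_equations` is an assumption made for illustration, and the data are the weights from the table above.

```python
def solve_normal_equations(us, vs):
    """Solve   sum(v)  = c0*N      + c1*sum(u)
               sum(uv) = c0*sum(u) + c1*sum(u^2)
    for (c0, c1) by Cramer's rule; the fitted line is v = c0 + c1*u."""
    n = len(us)
    su, sv = sum(us), sum(vs)
    suu = sum(u * u for u in us)
    suv = sum(u * v for u, v in zip(us, vs))
    det = n * suu - su * su
    c0 = (sv * suu - su * suv) / det
    c1 = (n * suv - su * sv) / det
    return c0, c1

X = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]  # fathers (kg)
Y = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]  # sons (kg)

a0, a1 = solve_normal_equations(X, Y)  # regression line of Y on X
b0, b1 = solve_normal_equations(Y, X)  # regression line of X on Y
print(f"Y = {a0:.2f} + {a1:.3f} X")    # Y = 35.82 + 0.476 X
print(f"X = {b0:.2f} + {b1:.3f} Y")    # X = -3.38 + 1.036 Y
```

Swapping the two arguments is all that is needed to move from the line of $Y$ on $X$ to the line of $X$ on $Y$, since the normal equations have the same form with the roles of the variables exchanged.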

(c)

$X$ / 65 / 63 / 67 / 64 / 68 / 62 / 70 / 66 / 68 / 67 / 69 / 71
$Y$ / 68 / 66 / 68 / 65 / 69 / 66 / 68 / 65 / 71 / 67 / 68 / 70
$Y_{\text{est}}$ / 66.76 / 65.81 / 67.71 / 66.28 / 68.19 / 65.33 / 69.14 / 67.24 / 68.19 / 67.71 / 68.66 / 69.62
$Y - Y_{\text{est}}$ / 1.24 / 0.19 / 0.29 / -1.28 / 0.81 / 0.67 / -1.14 / -2.24 / 2.81 / -0.71 / -0.66 / 0.38

Therefore, using formula (3) to compute the standard error of estimate, we obtain

$$s_{Y.X} = \sqrt{\frac{\sum (Y - Y_{\text{est}})^2}{N}} = \sqrt{\frac{19.70}{12}} = 1.28 \text{ kg}.$$

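(d) The total variation is

$$\sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{N} = 54849 - \frac{(811)^2}{12} = 38.92.$$

The unexplained variation is $\sum (Y - Y_{\text{est}})^2 = 19.70$ and the explained variation is $38.92 - 19.70 = 19.22$.

(e) Coefficient of determination $= r^2 = \dfrac{19.22}{38.92} = 0.4938$.

Coefficient of correlation $= r = \sqrt{0.4938} = 0.7027$, taken with the positive sign because $Y$ tends to increase as $X$ increases.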
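Parts (c) to (e) can be reproduced in a few lines. This is a sketch under the assumption that the regression coefficients are kept at full precision rather than rounded to 35.82 and 0.476, so the last digit may differ slightly from the hand computation; the variable names are illustrative.

```python
import math

X = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]  # fathers (kg)
Y = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]  # sons (kg)
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Least-squares line of Y on X, written with deviations from the means.
a1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
      / sum((x - x_bar) ** 2 for x in X))
a0 = y_bar - a1 * x_bar
y_est = [a0 + a1 * x for x in X]

total = sum((y - y_bar) ** 2 for y in Y)                     # about 38.92
unexplained = sum((y - ye) ** 2 for y, ye in zip(Y, y_est))  # about 19.70
explained = total - unexplained                              # about 19.21
s_yx = math.sqrt(unexplained / n)                            # about 1.28
r_squared = explained / total                                # about 0.494
r = math.copysign(math.sqrt(r_squared), a1)  # sign follows the slope

print(round(total, 2), round(unexplained, 2), round(explained, 2))
print(round(s_yx, 2), round(r_squared, 4), round(r, 4))
```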

Product-Moment Formula for the Linear Correlation Coefficient

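If a linear relationship between two variables is assumed, the equation

$$r = \pm\sqrt{\frac{\text{explained variation}}{\text{total variation}}}$$

becomes

$$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} \qquad (5)$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

This formula, which automatically gives the proper sign of $r$, is called the product-moment formula and clearly shows the symmetry between $X$ and $Y$.

If we write

$$s_{XY} = \frac{\sum xy}{N}, \qquad s_X = \sqrt{\frac{\sum x^2}{N}}, \qquad s_Y = \sqrt{\frac{\sum y^2}{N}}, \qquad (6)$$

then $s_X$ and $s_Y$ will be recognized as the standard deviations of the variables $X$ and $Y$ respectively, while $s_X^2$ and $s_Y^2$ are their variances. The new quantity $s_{XY}$ is called the COVARIANCE of $X$ and $Y$.

In terms of the symbols of formulas (6), formula (5) can be written $r = \dfrac{s_{XY}}{s_X s_Y}$.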
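The step from the ratio of explained to total variation to the product-moment formula can be sketched as follows. The derivation assumes the regression line of $Y$ on $X$ is written in deviation form, with $x = X - \bar{X}$, $y = Y - \bar{Y}$ and least-squares slope $a_1 = \sum xy / \sum x^2$; this notation is an assumption of the sketch.

```latex
\begin{aligned}
Y_{\text{est}} - \bar{Y} &= a_1 (X - \bar{X}) = a_1 x,
  \qquad a_1 = \frac{\sum xy}{\sum x^2}, \\
\text{explained variation} &= \sum (Y_{\text{est}} - \bar{Y})^2
  = a_1^2 \sum x^2 = \frac{(\sum xy)^2}{\sum x^2}, \\
r^2 &= \frac{\text{explained variation}}{\text{total variation}}
  = \frac{(\sum xy)^2}{(\sum x^2)(\sum y^2)}, \\
r &= \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} .
\end{aligned}
```

In the last line the square root is taken with the sign of $\sum xy$, which is also the sign of the slope $a_1$, so the formula yields a positive $r$ for positive linear correlation and a negative $r$ for negative linear correlation.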

Short Computational Formulas

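Formula (5) can be written in the equivalent form

$$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}.$$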

Example: Find the coefficient of linear correlation between the variables $X$ and $Y$ presented in the following table.

$X$ / 1 / 3 / 4 / 6 / 8 / 9 / 11 / 14
$Y$ / 1 / 2 / 4 / 4 / 5 / 7 / 8 / 9

Solution.

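Here $\bar{X} = 56/8 = 7$ and $\bar{Y} = 40/8 = 5$, with $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

$X$ / $Y$ / $x$ / $y$ / $x^2$ / $xy$ / $y^2$
1 / 1 / -6 / -4 / 36 / 24 / 16
3 / 2 / -4 / -3 / 16 / 12 / 9
4 / 4 / -3 / -1 / 9 / 3 / 1
6 / 4 / -1 / -1 / 1 / 1 / 1
8 / 5 / 1 / 0 / 1 / 0 / 0
9 / 7 / 2 / 2 / 4 / 4 / 4
11 / 8 / 4 / 3 / 16 / 12 / 9
14 / 9 / 7 / 4 / 49 / 28 / 16
$\sum X = 56$ / $\sum Y = 40$ / / / $\sum x^2 = 132$ / $\sum xy = 84$ / $\sum y^2 = 56$

The coefficient of linear correlation is

$$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{84}{\sqrt{(132)(56)}} = 0.977.$$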
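The same coefficient can be obtained either from deviations about the means or from the raw sums of $X$, $Y$, $X^2$, $Y^2$ and $XY$. The sketch below computes both for the data of this example; the function names are illustrative assumptions.

```python
import math

def r_product_moment(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), with x, y deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_raw_sums(xs, ys):
    """Equivalent form that uses only the raw sums (no means needed)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]
print(round(r_product_moment(X, Y), 3))  # 0.977
print(round(r_raw_sums(X, Y), 3))        # 0.977
```

The raw-sums form avoids computing the means first, which is convenient by hand, although with large values it can lose precision in floating-point arithmetic because it subtracts two nearly equal numbers.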

Example: For the data of the last example find

(a) the standard deviation of $X$.

(b) the standard deviation of $Y$.

(c) the variance of $X$.

(d) the variance of $Y$.

(e) the covariance of $X$ and $Y$.

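Solution. (a) The standard deviation of $X$ is $s_X = \sqrt{\dfrac{\sum x^2}{N}} = \sqrt{\dfrac{132}{8}} = 4.06$.

(b) The standard deviation of $Y$ is $s_Y = \sqrt{\dfrac{\sum y^2}{N}} = \sqrt{\dfrac{56}{8}} = 2.65$.

(c) The variance of $X$ is $s_X^2 = \dfrac{132}{8} = 16.50$.

(d) The variance of $Y$ is $s_Y^2 = \dfrac{56}{8} = 7.00$.

(e) The covariance of $X$ and $Y$ is $s_{XY} = \dfrac{\sum xy}{N} = \dfrac{84}{8} = 10.50$.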
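These five quantities can be checked with Python's standard `statistics` module. The sketch assumes the population forms (division by $N$), which is why `pstdev` and `pvariance` are used rather than the sample versions that divide by $N - 1$; the covariance is computed by hand so that it also divides by $N$.

```python
import statistics

X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

s_x = statistics.pstdev(X)       # population standard deviation of X
s_y = statistics.pstdev(Y)       # population standard deviation of Y
var_x = statistics.pvariance(X)  # population variance of X
var_y = statistics.pvariance(Y)  # population variance of Y

# Covariance with divisor N, matching the definition s_XY = sum(xy) / N.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / n

r = s_xy / (s_x * s_y)           # correlation from covariance and standard deviations
print(round(s_x, 2), round(s_y, 2), var_x, var_y, s_xy, round(r, 3))
```

Dividing the covariance by the product of the two standard deviations recovers the correlation coefficient of the previous example, so the choice of divisor does not affect $r$ as long as it is the same in all three quantities.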

Sonuc Zorlu, Lecture Notes