
1/22/07 252corr (Open this document in 'Outline' view!)

L. CORRELATION

1. Simple Correlation

The simple sample correlation coefficient is $r = \frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum(x_i-\bar x)^2\sum(y_i-\bar y)^2}}$ or, if the spare parts $S_{xy}=\sum xy - n\bar x\bar y$, $S_{xx}=\sum x^2 - n\bar x^2$ and $S_{yy}=\sum y^2 - n\bar y^2$ are available, we can say $r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$.

Of course, since the coefficient of determination is $R^2 = r^2$, it is often easier to compute $R^2$ and to give the correlation the sign of the slope $b_1$. But note that the correlation can range from +1 to -1, while the coefficient of determination can only range from 0 to 1. Also note that since the slope in simple regression is $b_1 = \frac{S_{xy}}{S_{xx}}$, $r = b_1\sqrt{\frac{S_{xx}}{S_{yy}}}$ or $r = b_1\frac{s_x}{s_y}$. The last equation has a counterpart in $\rho = \beta_1\frac{\sigma_x}{\sigma_y}$, where $\rho$ is the population correlation coefficient, so that testing $H_0: \rho = 0$ is equivalent to testing $H_0: \beta_1 = 0$, and the simple regression coefficient and the correlation will have the same sign.
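As a quick check of these formulas, here is a rough Python sketch of the 'spare parts' computation. The function name spare_parts_r and the data values are made up for illustration; they are not from any example in these notes.

import math

def spare_parts_r(x, y):
    # Correlation via the 'spare parts' Sxy, Sxx and Syy.
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    Sxx = sum(xi * xi for xi in x) - n * xbar * xbar
    Syy = sum(yi * yi for yi in y) - n * ybar * ybar
    b1 = Sxy / Sxx                    # simple-regression slope
    r = Sxy / math.sqrt(Sxx * Syy)    # correlation; same sign as b1
    return r, b1

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r, b1 = spare_parts_r(x, y)
print(r, r ** 2, b1)    # r, the coefficient of determination r**2, and the slope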

2. Correlation when x and y are both random variables

If we want to test $H_0: \rho = 0$ against $H_1: \rho \ne 0$ and $x$ and $y$ are normally distributed, we use $t_{n-2} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. But note that if we are testing $H_0: \rho = \rho_0$ against $H_1: \rho \ne \rho_0$, and $\rho_0 \ne 0$, the test is quite different. We need to use Fisher's z-transformation. Let $\tilde{z} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_z = \sqrt{\frac{1}{n-3}}$, so that the test ratio is $\frac{\tilde{z} - \mu_z}{s_z}$.

(Note: To get $\ln x$, the natural log, compute the log to the base 10 and divide by 0.434294482.)

Example: Test $H_0: \rho = 0$ against $H_1: \rho \ne 0$ for a given sample correlation $r$ and sample size $n$.

To solve this we first compute $t_{n-2} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ and compare it with $\pm t_{n-2}^{\alpha/2}$. Since the computed value is not between these two values of $t$, reject the null hypothesis. (Note that $t^2 = F$, so that this is equivalent to an F test on the slope in a regression.)
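A minimal Python sketch of this t ratio follows; the values of r and n fed in below are hypothetical, since the example's numbers are not reproduced here.

import math

def t_for_rho_zero(r, n):
    # t with n - 2 degrees of freedom for testing H0: rho = 0.
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_for_rho_zero(r=0.60, n=20)
print(t, t ** 2)    # t ** 2 is the F statistic for the simple regression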

Example: Test $H_0: \rho = \rho_0$ against $H_1: \rho \ne \rho_0$, where $\rho_0 \ne 0$, for a given $r$ and $n$.

This time compute Fisher's z-transformation (because $\rho_0$ is not zero).

Compute $\tilde{z} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$, $\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and $s_z = \sqrt{\frac{1}{n-3}}$; here the test ratio comes to $-0.591$. Finally, compare this with the two critical values from the table. Since $-0.591$ lies between these two values, do not reject the null hypothesis.

Note: To do the above with logarithms to the base 10, try $\tilde{z} = 1.1513\log_{10}\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_z = 1.1513\log_{10}\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_z = \sqrt{\frac{1}{n-3}}$, so that the test ratio is the same as before.
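The natural-log version of the procedure can be sketched in Python as below; the function name fisher_z_test is ours, and the r, rho0 and n fed in at the end are made-up numbers, not the values from the example above.

import math

def fisher_z_test(r, rho0, n):
    # Test ratio for H0: rho = rho0 (rho0 != 0) using Fisher's z-transformation.
    z_tilde = 0.5 * math.log((1 + r) / (1 - r))        # Fisher's z for the sample r
    mu_z    = 0.5 * math.log((1 + rho0) / (1 - rho0))  # its approximate mean
    s_z     = 1.0 / math.sqrt(n - 3)                   # its approximate standard deviation
    return (z_tilde - mu_z) / s_z                      # compare with the table values

print(fisher_z_test(r=0.40, rho0=0.50, n=25))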

3. Tests of Association

a. Kendall's Tau. (Omitted)

b. Spearman's Rank Correlation Coefficient.

Take a set of $n$ points $(x_i, y_i)$ and rank $x$ and $y$ separately from 1 to $n$ to get $(r_{x_i}, r_{y_i})$. Do not attempt to compute a rank correlation without replacing the original numbers by ranks. A correlation coefficient between the ranks can be computed as in point 1 above, but it is easier to compute $d_i = r_{x_i} - r_{y_i}$ and then $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$. This can be given a t test for $H_0: \rho_s = 0$ as in point 2 above, but for $n$ between 4 and 30, a special table should be used. For really large $n$, $z = r_s\sqrt{n-1}$ may be used.
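A short Python sketch of the rank-and-difference computation; the helper names ranks and spearman_rs are just for illustration, and the sketch assumes there are no ties.

def ranks(values):
    # Replace raw values by ranks 1..n (ties are not handled in this sketch).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rs(x, y):
    # r_s = 1 - 6*sum(d**2) / (n*(n**2 - 1)), after converting x and y to ranks.
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ratings by two officers (not the data from the example below)
officer1 = [1, 2, 3, 4, 5]
officer2 = [2, 1, 3, 5, 4]
print(spearman_rs(officer1, officer2))    # compare against the Spearman table for n = 5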

Example: 5 applicants for a job are rated by two officers, with the following results. Note that in this example the ranks are given initially. Usually the data must be replaced by ranks.

Test to see how well the ratings agree.

In this case, we have a 1-sided test of $H_0: \rho_s \le 0$ against $H_1: \rho_s > 0$ (the ratings agree). Arrange the data in columns for $r_x$, $r_y$, $d = r_x - r_y$ and $d^2$.

Note that $\sum d^2$ comes from the last column, so that $r_s = 1 - \frac{6\sum d^2}{n(n^2-1)}$ can be found. If we check the table 'Critical Values of the Spearman Rank Correlation Coefficient,' we find that the critical value for $n = 5$ and the chosen significance level is .8000, so, since our $r_s$ falls below this value, we must not reject the null hypothesis, and we conclude that we cannot say that the rankings agree.

Example: We find a value of $r_s$ for a larger sample of $n$ points. We want to do the same one-sided test as in the last problem ($H_1: \rho_s > 0$).

We can do a t-test by computing $t = \frac{r_s\sqrt{n-2}}{\sqrt{1-r_s^2}}$, which has $n-2$ degrees of freedom. This is compared with the one-sided critical value $t_{n-2}^{\alpha}$. Since the $t$ we computed is above the table value, we reject the null hypothesis.

Or we can compute a z-score, $z = r_s\sqrt{n-1}$. Since this is above $z_\alpha$, we can reject the null hypothesis.
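The two approximations can be sketched in Python as follows; the r_s and n fed in at the bottom are made-up values, since the example's numbers are not reproduced here.

import math

def spearman_t(rs, n):
    # Approximate t statistic with n - 2 degrees of freedom.
    return rs * math.sqrt(n - 2) / math.sqrt(1 - rs ** 2)

def spearman_z(rs, n):
    # Normal approximation z = r_s * sqrt(n - 1) for really large n.
    return rs * math.sqrt(n - 1)

print(spearman_t(0.45, 40), spearman_z(0.45, 40))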

c. Kendall's Coefficient of Concordance.

Take $k$ columns with $n$ items in each and rank each column from 1 to $n$. The null hypothesis is that the rankings disagree.

Compute a sum of ranks $SR_i$ for each row. Then $S = \sum (SR_i - \overline{SR})^2$, where $\overline{SR}$ is the mean of the $SR_i$s. If $H_0$ is disagreement, $S$ can be checked against a table for this test. If $S$ exceeds the table value, reject $H_0$. For $n$ too large for the table, use $\chi^2_{n-1} = k(n-1)W$, where $W = \frac{12S}{k^2 n(n^2-1)}$ is the Kendall Coefficient of Concordance and must be between 0 and 1.
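Here is a rough Python sketch of these computations; the function name kendall_w is ours, and the ranks fed in at the bottom are made up, not the example's data.

def kendall_w(rankings):
    # rankings is a list of k columns, each ranking the same n items from 1 to n.
    k = len(rankings)
    n = len(rankings[0])
    row_sums = [sum(col[i] for col in rankings) for i in range(n)]
    mean_sum = sum(row_sums) / n                       # equals k*(n + 1)/2
    S = sum((rs - mean_sum) ** 2 for rs in row_sums)   # spread of the rank sums
    W = 12 * S / (k ** 2 * n * (n ** 2 - 1))           # 0 = disagreement, 1 = agreement
    chi2 = k * (n - 1) * W                             # large-n statistic, n - 1 d.f.
    return S, W, chi2

print(kendall_w([[1, 2, 3, 4],
                 [2, 1, 3, 4],
                 [1, 3, 2, 4]]))    # 3 hypothetical officers ranking 4 applicants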

Example: $n$ applicants are rated by $k$ officers. The ranks are below.

Note that if we had complete disagreement, every applicant would have a rank sum of $\frac{k(n+1)}{2} = 10.5$, so $S = \sum (SR_i - 10.5)^2$. The Kendall Coefficient of Concordance says that the degree of agreement on a zero to one scale is $W = \frac{12S}{k^2 n(n^2-1)}$. To do a test of the null hypothesis of disagreement, look up the critical value in the table giving 'Critical Values of Kendall's $S$ as a Measure of Concordance' for this $n$ and $k$; since our $S$ is below the table value, we accept the null hypothesis of disagreement.

Example: For larger values of $n$ and $k$ we get a value of $S$, and wish to test the same null hypothesis of disagreement.

Since $n$ is too large for the table, use $\chi^2_{n-1} = k(n-1)W$. Using a chi-squared table, look up $\chi^{2(\alpha)}_{n-1}$. Since the computed value of 9 is below the table value, do not reject $H_0$.

4. Multiple Correlation

If $R^2$ is the coefficient of determination for a regression of $y$ against several independent variables, then its square root, $R$, is called the multiple correlation coefficient. Note that the explained (regression) sum of squares is $R^2 \cdot SST$, where $SST = (n-1)s_y^2$ and $s_y^2$ is the sample variance of $y$, and that for large $n$, the adjusted $\bar{R}^2$ is close to $R^2$.
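A minimal numpy sketch, with made-up data, showing that the multiple correlation coefficient is simply the square root of R-squared from the fitted regression; the function name multiple_R is ours.

import numpy as np

def multiple_R(X, y):
    # Least-squares fit of y on the columns of X, with an intercept added.
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    SSE = float(resid @ resid)                     # unexplained sum of squares
    SST = float(((y - y.mean()) ** 2).sum())       # total sum of squares, (n-1)*s_y**2
    R2 = 1 - SSE / SST
    return np.sqrt(R2), R2

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                       # two hypothetical independent variables
y = 1 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=20)
print(multiple_R(X, y))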

5. Partial Correlation (Optional)

If $Y$ is regressed against several independent variables, its multiple correlation coefficient can be written as $R_{Y \cdot 12\cdots k}$ or, for this problem, $R_{Y \cdot XWH}$. For example, in the multiple regression problem, we got three multiple correlation coefficients, $R_{Y \cdot X}$, $R_{Y \cdot XW}$ and $R_{Y \cdot XWH}$.

If we have run the regression of $Y$ against $X$, $W$ and $H$, and we compute the partial correlation of $Y$ and $H$, we compute $r_{YH \cdot XW}^2 = \frac{R_{Y \cdot XWH}^2 - R_{Y \cdot XW}^2}{1 - R_{Y \cdot XW}^2}$, the additional explanatory power of the third independent variable after the effects of the first two are considered. If we read $t$ from the computer printout, $r_{YH \cdot XW}^2 = \frac{t^2}{t^2 + df}$, where $df = n - k - 1$ and $k$ is the number of independent variables.

Example: In the multiple regression problem with which we have been working, $r_{YH \cdot XW}^2$ is the additional explanatory power of $H$ beyond what was explained by $X$ and $W$. It can be computed two ways. First, $r_{YH \cdot XW}^2 = \frac{R_{Y \cdot XWH}^2 - R_{Y \cdot XW}^2}{1 - R_{Y \cdot XW}^2}$. The partial correlation coefficient is actually $r_{YH \cdot XW} = -\sqrt{r_{YH \cdot XW}^2}$. The sign of the partial correlation is the sign of the corresponding coefficient in the regression. (For the regression equation see below.)

For the second method of computing $r_{YH \cdot XW}^2$, recall that the last printout for the regression with which we were working was

Y = 1.51 + 0.595 X - 0.698 W - 0.937 H

Predictor Coef Stdev t-ratio p

Constant 1.5079 0.2709 5.57 0.001

X 0.5952 0.1198 4.97 0.003

W -0.6984 0.4860 -1.44 0.201

H -0.9365 0.5239 -1.79 0.124

Thus the $t$ corresponding to $H$ is $-1.79$ and, since $df = n - k - 1$, $r_{YH \cdot XW}^2 = \frac{t^2}{t^2 + df}$ can be computed directly from the printout.
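The arithmetic can be checked with a small Python sketch. The t-ratio of -1.79 and the coefficient sign come from the printout above; the 6 residual degrees of freedom used below are an assumption for illustration (a value consistent with the printed p-values, but not stated in this excerpt).

import math

def partial_r_from_t(t, df, coef_sign):
    # Squared partial correlation r**2 = t**2 / (t**2 + df), with df = n - k - 1;
    # the partial correlation takes the sign of the regression coefficient.
    r2 = t ** 2 / (t ** 2 + df)
    return math.copysign(math.sqrt(r2), coef_sign), r2

print(partial_r_from_t(t=-1.79, df=6, coef_sign=-0.9365))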

6. Collinearity

If two independent variables, say $W$ and $H$, are highly correlated, then we have no real variation of $W$ relative to $H$. This is a condition known as (multi)collinearity. The standard deviations of the coefficients of both $W$ and $H$ will be large and, in extreme cases, the regression process may break down. Recall that in Section I, we said that small variation in $x$ can lead to large values of $s_{b_1}$ in simple regression and thus insignificant values of $b_1$.

Similarly, in multiple regression, lack of movement of the independent variables relative to one another leaves the regression process unable to tell what changes in the dependent variable are due to the various independent variables. This will be indicated by large values of the coefficient standard deviations $s_{b_j}$, which cause us to find the coefficients insignificant when we use a t-test.

A relatively recent method to check for collinearity is to use the Variance Inflation Factor, $VIF_j = \frac{1}{1 - R_j^2}$. Here $R_j^2$ is the coefficient of determination (squared multiple correlation) gotten by regressing the $j$th independent variable against all the other independent variables. The rule of thumb seems to be that we should be suspicious if any $VIF_j$ exceeds 5 and positively horrified if it exceeds 10. If you get results like this, drop a variable or change your model. Note that, if you use a correlation matrix for your independent variables and see a large correlation between two of them, putting the square of that correlation into the VIF formula gives you a low estimate of the VIF, since the R-squared that you get from a regression against all the other independent variables will be higher.
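A rough numpy sketch of the VIF computation, regressing each independent variable on all the others; the function name vifs and the data are made up, with the last two columns built to be nearly collinear.

import numpy as np

def vifs(X):
    # VIF_j = 1/(1 - R_j**2), where R_j**2 comes from regressing column j of X
    # on all the other independent variables (with an intercept).
    n, k = X.shape
    out = []
    for j in range(k):
        yj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, yj, rcond=None)
        resid = yj - others @ beta
        R2j = 1 - float(resid @ resid) / float(((yj - yj.mean()) ** 2).sum())
        out.append(1 / (1 - R2j))
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=30)
X = np.column_stack([rng.normal(size=30), a, a + rng.normal(scale=0.1, size=30)])
print(vifs(X))    # the last two VIFs should be far larger than the first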

Example: Note that in the printout in section 5, the standard deviations for the coefficients of $W$ and $H$ are quite large, resulting in small t-ratios and in p-values which lead us to believe that the coefficients are not significant even when the significance level is 10%. The data from section J4 is repeated at right:

Computation from these numbers reveals the spare parts $S_{WW}$, $S_{HH}$ and $S_{WH}$. Thus $r_{WH} = \frac{S_{WH}}{\sqrt{S_{WW}S_{HH}}}$, and this works out to about $0.802$, a relatively high correlation. This and the relatively small sample size account for the large standard deviations and the generally discouraging results. Though the regression against two independent variables has been shown to be an improvement over the regression against one independent variable, addition of the third independent variable, in spite of the high $R^2$, was useless. Preliminary use of the Minitab correlation command, as below, might have warned us of the problem.

MTB > Correlation 'X' 'W' 'H'.

Correlations (Pearson)

          X       W
W    -0.068
H    -0.145   0.802

Actually for this problem, the largest VIF, for H, is only about 2.86, but it seems to interact with the small sample size.
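As a rough check in Python, the 2.86 figure can be approximately reproduced from the printed correlations, using the standard formula for the R-squared of one variable regressed on two others in terms of the pairwise correlations; the variable names below are ours.

# Pairwise correlations from the Minitab printout above
r_xw, r_xh, r_wh = -0.068, -0.145, 0.802

# R**2 from regressing H on X and W, written in terms of pairwise correlations
R2_h = (r_xh ** 2 + r_wh ** 2 - 2 * r_xh * r_wh * r_xw) / (1 - r_xw ** 2)
VIF_h = 1 / (1 - R2_h)            # about 2.87, close to the value quoted above

# The shortcut the text warns about: squaring only the largest pairwise correlation
VIF_low = 1 / (1 - r_wh ** 2)     # about 2.80, a slight underestimate here
print(R2_h, VIF_h, VIF_low)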