BRII Correlations and Lab 3 F’17
Correlations
To this point in our review, we have dealt with single variables, learning how to describe their central tendency and distributions. Correlations deal with the relationship between two or more variables. With this measure, we can answer questions such as: “Are achievement test scores related to Grade Point Averages?”, “Is counselor empathy related to counseling outcome?”, and “Is student toenail length related to success in graduate school?” Correlational studies, however, are limited in that they do not allow us to make conclusions about causes. No matter how high a correlation we find between two variables, we can only say the two variables are related; we cannot say one is a cause of the other.
To begin, let’s review what we mean by “cause”. When we say one variable has a causal effect on another variable, we are saying that changing the first variable will produce a change in the second variable. We can only make this claim if we have, in fact, changed (manipulated) one variable and observed the effects of this change on a second variable (while holding all other variables constant). Correlational studies do not meet this standard for making causal conclusions. In a correlational study, we simply measure the level of two variables for the same individual and then statistically determine whether knowledge of a person’s score on one variable allows us to predict their score on the second variable better than if we did not have that knowledge. If we find a significant correlation between two variables, there are always three possible causal explanations for why that relationship exists. These three explanations can be summarized as two problems.
1) The Directional Problem – For example, there is a correlation between the birth rate in humans (variable A) and the number of storks nesting in a village (variable B). One possible explanation for this relationship is that A causes B (i.e., high birth rates in humans cause higher nesting rates of storks; perhaps the storks are attracted to babies). An equally good explanation of this correlation is that B causes A (i.e., higher rates of storks cause higher birth rates – perhaps my mother is correct and the storks really do bring babies).
2) The Third Variable Problem – Some outside variable (or set of variables) may affect both A and B in a manner such that they co-vary in a predictable way, but A does not cause B and B does not cause A. In our example, a possible explanation of the relationship between human birth rates and the number of nesting storks may be that both are affected by the success of the previous harvest. The more food available, the more likely storks are to nest in an area, and the more likely humans are to have offspring. So following a good harvest, stork nesting increases and so do human birth rates. They co-vary because they are both affected by a common underlying determinant, but A does not cause B, nor does B cause A.
What correlations do allow us to do is conclude that two variables are (or are not) related. This can be very interesting and valuable information – but it must be used with caution when making a prescription based on a correlation.
Dr. Oz said that having loving, monogamous sex regularly keeps you younger. Double the amount of sex you have and you can live longer.
“Being intimate with the people you love is critically important to longevity,” the doctor says. “We’ve got tons of data around it, but the basic rule of thumb is that if the average American has sex once a week, you want to have sex at least twice a week. It increases longevity by about three years. For women, it’s more about quality than quantity. If you don’t have that loving, conjugal relationship with someone you can grow with in life, then you’re not fun and fearless.”
Correlational studies are by definition ex post facto. In a correlational study the researcher measures variables as they naturally occur. They do not do anything to manipulate the variables. Many of the studies that members of this class are conducting are correlational. While there are (at least) two variables, neither one is an independent variable or a dependent variable.
The Pearson Correlation Statistic ( r ) provides two separate pieces of information. (1) The sign, negative or positive, tells us the direction of the relationship. If a correlation is positive, it indicates that higher levels on one variable predict higher levels of the second variable (and conversely that lower levels of one variable predict lower levels of the other). A negative correlation indicates that higher levels of one variable predict lower levels of the second variable (and vice versa). When interpreting a correlation it is important to keep in mind that a negative correlation is not a negative result. It is just as informative as a positive correlation.
For example, in my Introduction to Experimental Psychology courses I have found a fairly high positive correlation between class attendance and final grades. The more classes a student attends, the higher their final grade tends to be. If instead of measuring the number of classes students attend, I had measured the number of classes they missed, I would find a negative correlation between classes missed and final grades.
(2) The second piece of information a correlation statistic provides is the strength of the relationship -- how accurately we can predict one variable based on the other. The value of a correlation ranges from -1 to +1. The closer the absolute value of the correlation is to 1, the stronger the relationship. Similarly, the closer the absolute value of the correlation is to zero, the weaker the relationship.
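To make the sign and the strength concrete, here is a minimal sketch in Python with scipy (this is purely illustrative, not the SPSS procedure we use in class, and the attendance/grade numbers are invented):

    # Sketch only: Pearson r on invented attendance/grade data.
    from scipy.stats import pearsonr

    attended = [40, 35, 45, 20, 30, 42, 25, 38]   # classes attended (out of 45)
    grade    = [75, 70, 88, 52, 61, 80, 55, 72]   # final grade (%)

    r, p = pearsonr(attended, grade)
    print(f"attended vs. grade: r = {r:.2f}")     # positive correlation

    missed = [45 - a for a in attended]           # recode attendance as classes missed
    r2, p2 = pearsonr(missed, grade)
    print(f"missed vs. grade:   r = {r2:.2f}")    # same strength, negative sign

Recoding the variable flips the sign of r but leaves its absolute value, and therefore the strength of the relationship, unchanged.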
One way to represent the correlation between two variables is to graph the relationship between them. On the Y axis (vertical axis) we put the scale for one variable, and on the X axis (horizontal axis) we put the second variable. For now, it really does not matter which you put on which axis. Each subject’s scores on the two variables are thus represented as a single data point on this graph, which is called a scatter diagram. On the diagram below, a subject’s scores on the two variables are represented by one data point. Subject 1 attended 40 out of 45 classes and obtained a final grade of 75%.
[Figure: scatter diagram of classes attended vs. final grade; each data point represents one subject]
When we plot data from a sample of scores, the pattern of the scores can tell us something about the relationship between the two variables. Two related measures, when plotted on a graph, will produce a scatterplot that approximates a line. The less the scores deviate from the line, the stronger the correlation. If the line is sloping upward, the direction of the correlation is positive (the higher the scores on one axis, the higher the scores on the other axis). If the slope is downward, the direction of the correlation is negative (higher scores on one variable are associated with lower scores on the second variable). This line (called the line of best fit) is the line that fits the data in a manner that minimizes the overall distance between itself and all the data points. If you drew a series of lines on your scatterplot and then measured the total amount the data points deviated from each line, you would find that there is one line that produces the lowest total deviation. In other words, there is one line that best summarizes the relationship between the two variables, or that best fits the data. For example, in the following figure the line of best fit summarizes the relationship between classes attended and final grade.
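For those curious how this line is actually found: the standard criterion minimizes the total squared vertical deviation of the points from the line (least squares). A minimal sketch, again in Python purely for illustration and reusing the invented data from above:

    # Sketch only: least-squares line of best fit with numpy.
    import numpy as np

    attended = np.array([40, 35, 45, 20, 30, 42, 25, 38])
    grade    = np.array([75, 70, 88, 52, 61, 80, 55, 72])

    # polyfit with degree 1 finds the slope and intercept that minimize
    # the sum of squared vertical deviations from the line.
    slope, intercept = np.polyfit(attended, grade, deg=1)
    print(f"best-fit line: grade = {slope:.2f} * attended + {intercept:.2f}")
    # An upward (positive) slope corresponds to a positive correlation.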
Line of Best Fit
The line of best fit for a perfect correlation will have two characteristics.
1) All the data points will line up perfectly on the line of best fit.
2) If the scores are transformed to z-scores, the slope of the line of best fit will be 45 degrees (positive or negative).
For example, this scatterplot depicts a perfect positive correlation.
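The 45-degree property is easy to check numerically: once both variables are converted to z-scores, the slope of the least-squares line equals r itself, so a perfect correlation gives a slope of exactly +1 or -1. A quick sketch (Python, invented data):

    # Sketch only: with z-scored variables, the best-fit slope equals r.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 3.0 * x + 2.0                      # a perfectly linear relationship (r = +1)

    zx = (x - x.mean()) / x.std()          # convert both variables to z-scores
    zy = (y - y.mean()) / y.std()

    slope, _ = np.polyfit(zx, zy, deg=1)
    print(f"slope in z-score units: {slope:.2f}")   # 1.00, i.e., a 45-degree line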
When two variables are not related, the scatterplot may appear random, or circular. For example, the scatterplot below shows a correlation of +.06 (very low). The line of best fit is either (and equally) a straight vertical or straight horizontal line extending from the mean of each distribution. No matter what score I am given on variable X, my best prediction of the value of variable Y is the mean of the distribution of Y. Knowing the value of X does not provide any usable information.
While there are some great uses for scatterplots, which we will discuss, they are not great ways of presenting your results. Instead, we report the correlation statistic.
Significance of a correlation
SPSS will not only tell you the correlation between variables, it will also tell you the p value of the correlation. Like all other tests of significance, this tells us how likely we would be to find this degree of relationship if the two variables are, in fact, not related. For example, suppose I put random numbers between 1 and 100 in one bucket, and another set of numbers between 1 and 100 in a second bucket. I have each of you pull a number from each bucket. If we entered the numbers from bucket one as variable 1 and the numbers from bucket two as variable 2 into SPSS, we would expect to find a zero correlation between the two. There should be no relationship between the two variables. They should both be random, unrelated events. Even so, it is unlikely we would actually get a correlation of exactly zero. Based on chance alone we could even get a high correlation. When we say a correlation is significant, we are saying that the probability of obtaining a correlation this high or higher, when in fact there is no relationship between the two variables, is less than 5%. SPSS reports 2-tailed probability levels. We will talk more about one- and two-tailed tests when we talk about t-tests, but I will point out here that you should be using a 2-tailed test for your studies. One-tailed tests are only used when we have very strong reasons for predicting the direction of the correlation. None of your studies meets this criterion.
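You can simulate the two-bucket demonstration yourself. The sketch below uses Python and scipy rather than SPSS (purely for illustration); pearsonr returns both r and the 2-tailed p value:

    # Sketch only: two buckets of random numbers should be unrelated,
    # yet the sample correlation is almost never exactly zero.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(seed=1)
    bucket1 = rng.integers(1, 101, size=30)    # each "student" draws a number...
    bucket2 = rng.integers(1, 101, size=30)    # ...from each of the two buckets

    r, p = pearsonr(bucket1, bucket2)
    print(f"r = {r:.2f}, two-tailed p = {p:.2f}")
    # The correlation is "significant" only when p < .05, which should
    # happen for about 5% of purely random runs by chance alone.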
The magnitude of correlation you can expect by chance is strongly affected by the size of the sample. The more participants you have in your sample, the less likely it is that you will obtain a high correlation by chance alone. If you think about it, if we draw just a few numbers from each of the buckets, it is more likely they will form a pattern just due to chance than if we drew many pairs of numbers. The same magnitude of correlation may be significant in a large-N study but non-significant in a small-N study. For example, in a study with 10 participants a correlation of about .63 is needed in order for the result to be significant at the p = .05 level. In a study of 100 participants, the correlation need only be about .20 to be significant at the same level. This leads to an interesting, but fairly common, phenomenon in correlational studies. Often, researchers are very excited about their data and will run preliminary analyses after getting data from only a few respondents. They often find fairly high correlations. As they run more subjects, they find the correlations tend to drop. On the other hand, even with high correlations on the small data sets, researchers often find the results are not significant, whereas the lower correlations they find with the larger sample are significant. Correlations are not meant to be used on small sample sizes, but they do tend to be fairly stable when sample sizes are larger. This is why I have suggested that those of you who are running correlational studies aim for sample sizes of at least 60.
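The cutoffs quoted above fall out of the t distribution: for a 2-tailed test the critical r is sqrt(t^2 / (t^2 + df)) with df = N - 2, where t is the critical t value. A sketch (Python/scipy, for illustration only):

    # Sketch only: how large r must be to reach p = .05 (2-tailed) at various N.
    from scipy.stats import t as t_dist

    def critical_r(n, alpha=0.05):
        df = n - 2
        t_crit = t_dist.ppf(1 - alpha / 2, df)      # critical t value
        return (t_crit**2 / (t_crit**2 + df)) ** 0.5

    for n in (10, 30, 60, 100):
        print(f"N = {n:3d}: r must exceed {critical_r(n):.2f}")
    # N = 10 needs r of about .63; N = 100 needs only about .20.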
Statistical significance just means that the correlation is unlikely to be due to chance; it does not mean it is high enough to be important or meaningful or anything of that sort. So how do we judge the size of a correlation? Different researchers have suggested different values; however, Cohen (1988) has suggested the following guidelines, which seem to be fairly widely used.
Correlations between .10 and .29 are considered small.
Correlations between .30 and .49 are considered medium.
Correlations of .50 and above are considered large.
The Coefficient of Determination is a measure that can be easily obtained from the correlation statistic that SPSS produces for you. It is simply r squared. It is often more useful to talk about the relationship between your variables in terms of the Coefficient of Determination. For example, if the correlation between intelligence test scores and GPA were r = .60, squaring it would give you .36 (I know that it seems wrong when you see that the square of a number is smaller than the number being squared, but for decimals this is true). The coefficient of determination (in our example) indicates that 36% of the variability in Y is “accounted for” by the variability in X. Another way of saying this is that if the correlation between two variables is .60, then they have 36% of their variability “in common”. You may want to use this statistic when discussing the results of your study.
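The arithmetic is as simple as it sounds; the only step is squaring r (a trivial sketch in Python):

    # Sketch only: the coefficient of determination is r squared.
    r = 0.60
    print(f"r = {r}, r squared = {r**2:.2f}")   # 0.36
    # So 36% of the variability in one variable is "accounted for" by the
    # other; the remaining 64% is variability the correlation cannot explain.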
Linear and Nonlinear Relationships
One of the assumptions that the correlation statistic depends on is that the relationship between the two variables is linear. This means that there is a straight line that best describes the relationship between the variables. Not all relationships are linear. For example, arousal level and performance are not linearly related. Let’s say you have a very low arousal level (you can barely stay awake) and you write an exam. How well do you expect to do? On the other hand, let’s say you are really highly aroused (15 cups of coffee); again, how well do you expect to do? The relationship between arousal and performance is well known. At very high and very low levels of arousal performance is poor, but at middle levels performance is good. The line that best describes this relationship is curved. Looking at the scatterplot of your data should tell you if the relationship is curvilinear or has some other strange shape. If it is, do not despair. There are other ways of analyzing the data, or alternatively there are ways of transforming the data to make the relationship linear. If you find you have a non-linear relationship, let me know and we will discuss the best way to handle it. If you have a non-linear relationship between your variables, it is unlikely you will find a significant correlation. Looking at the distribution below, think about where the straight line of best fit would be. That line is not a good fit for the data. In fact, looking at the scatterplot we can see that one variable is highly predictable if we know the value of the other variable; the relationship is just not linear, and therefore a Pearson correlation analysis is not appropriate for these data.
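The arousal example is easy to simulate, and it shows why looking at the scatterplot matters: a perfect inverted-U relationship can produce a Pearson r of essentially zero. A sketch (Python, invented data):

    # Sketch only: a strong but curvilinear (inverted-U) relationship
    # yields a Pearson r near zero.
    import numpy as np
    from scipy.stats import pearsonr

    arousal = np.linspace(0, 10, 50)             # low to high arousal
    performance = -(arousal - 5)**2 + 25         # best performance at mid-arousal

    r, p = pearsonr(arousal, performance)
    print(f"Pearson r = {r:.2f}")                # approximately 0.00
    # r near zero here does NOT mean "unrelated": performance is almost
    # perfectly predictable from arousal; the relationship just is not linear.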
Skewed Distributions
Skewed distributions can also produce invalid correlation statistics. For example, assume I measure the time people take to finish a set of math problems and correlate this with the number of questions they answer correctly. I plot the relationship between time and number of problems correctly answered. I find that students who take very little time produce fewer correct answers. Students who took longer to complete the problems tend to score higher -- up until about 5 minutes. After 5 minutes, subjects all obtain perfect scores. Even if they take more time, after 5 minutes they can do no better. This is a ceiling effect, and it is due to the limitations of the test. The relationship I obtained between time and score is not linear.
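A quick simulation makes the ceiling effect visible: once scores pile up at the test’s maximum, the upper half of the scatterplot flattens out and the Pearson r is attenuated. A sketch (Python, invented data):

    # Sketch only: a ceiling effect attenuates the correlation.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(seed=2)
    time = rng.uniform(1, 10, size=100)             # minutes spent on the problems
    raw = 4 * time + rng.normal(0, 2, size=100)     # scores if the test had no maximum
    score = np.minimum(raw, 20)                     # but the test tops out at 20 points

    print(f"no ceiling:   r = {pearsonr(time, raw)[0]:.2f}")
    print(f"with ceiling: r = {pearsonr(time, score)[0]:.2f}")   # noticeably lower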
Outliers can also strongly affect a correlation, either increasing it or decreasing it. When conducting a correlational analysis you should always recheck the scores of obvious outliers to make sure they are not errors. Outliers can be identified by looking at the scatterplot.
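A single extreme case can dominate the statistic, which is why rechecking outliers matters. In the sketch below (Python, invented data), five essentially unrelated scores become a near-perfect correlation once one erroneous data point is added:

    # Sketch only: one outlier can manufacture a large correlation.
    from scipy.stats import pearsonr

    x = [1, 2, 3, 4, 5]
    y = [2, 5, 1, 4, 3]
    print(f"without outlier: r = {pearsonr(x, y)[0]:.2f}")           # near zero

    x_out, y_out = x + [50], y + [50]    # e.g., a single data-entry error
    print(f"with outlier:    r = {pearsonr(x_out, y_out)[0]:.2f}")   # near 1.0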
Restricted Range and Correlations. All other things being equal, the greater the variability among the observations in your distribution, the greater the value of the correlation. For example, if we look at the relationship between height and basketball playing ability in the general population, the correlation would be fairly high. If, however, we only sampled people who were over 6 feet tall, the correlation between these two variables would be much lower. Looking at the diagram below, we can see the correlation in the subset is lower than in the total set.
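This, too, is easy to see in a simulation. The sketch below (Python, invented data) generates correlated “height” and “ability” scores and then recomputes r using only the cases over 6 feet:

    # Sketch only: restricting the range of one variable lowers r.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(seed=3)
    height = rng.normal(68, 4, size=500)                 # height in inches
    ability = 0.8 * height + rng.normal(0, 3, size=500)  # ability tracks height, plus noise

    r_all = pearsonr(height, ability)[0]
    tall = height > 72                                   # keep only people over 6 feet
    r_tall = pearsonr(height[tall], ability[tall])[0]

    print(f"full range: r = {r_all:.2f}")
    print(f"over 6 ft:  r = {r_tall:.2f}")               # lower, same underlying process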