Topic 6. Bivariate correlation and regression
Introduction
Relationships
- Is there a relationship between two variables in the population?
Nominal with nominal – chi-square
Nominal with ordinal – chi-square
Ordinal with ordinal – chi-square
Interval/ratio DV with a 2 category IV – independent samples t test
Interval/ratio DV with a >2 category IV (nominal or ordinal) – ANOVA (F test)
Interval/ratio with interval/ratio – t test (in bivariate correlation and regression)
- How strong is the relationship? (Measures of association)
Nominal with nominal – Lambda, etc.
Nominal with ordinal – Lambda, etc.
Ordinal with ordinal – Gamma, etc.
Interval/ratio DV with a >2 category IV (nominal or ordinal) – Eta-squared
Interval/ratio with interval/ratio – Pearson’s r (a.k.a. bivariate correlation)
- How can we characterize the relationship, and what is its direction?
Nominal with nominal – cross-tabulation or clustered bar chart
Nominal with ordinal – cross-tabulation or clustered bar chart
Ordinal with ordinal – cross-tabulation or clustered bar chart; Gamma, etc.
Interval/ratio DV with a 2 category IV – comparison of means table
Interval/ratio DV with a >2 category IV (nominal or ordinal) – comparison of means table or means plot
Interval/ratio with interval/ratio – scatterplot; bivariate correlation and regression
Scatterplots
Figure 1. Infant Mortality and Literacy (N=107).
Source: World95.sav.
Strength is determined by the spread of the cases. Direction is determined by the pattern of joint scores. You can even use scatterplots to identify nonlinear relationships. Scatterplots are very useful descriptive tools, but they don’t, by themselves, tell us whether the relationship holds in the population.
Covariation
- The most elementary measure for identifying a bivariate relationship between two interval-ratio variables
- The covariance is a building-block for other statistics including r (Pearson’s correlation) and b (the regression slope)
- The formula is:
\[ \text{cov}(x,y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} \]
The numerator is referred to as the sum of the cross-products.
How does it work? Here is an example:
1. Negative covariance – scores below the mean on one variable are above the mean on the other
x / y / (x − x̄) / (y − ȳ) / (x − x̄)(y − ȳ)
1 / 10 / -2 / 2 / -4
2 / 9 / -1 / 1 / -1
3 / 8 / 0 / 0 / 0
4 / 7 / 1 / -1 / -1
5 / 6 / 2 / -2 / -4
Sum / 0 / 0 / -10
Mean / 3 / 8
Covariance / -2.0
2. Positive covariance – scores below the mean on one variable are below the mean on the other; scores above the mean on one variable are above the mean on the other
x / y / (x − x̄) / (y − ȳ) / (x − x̄)(y − ȳ)
1 / 6 / -2 / -2 / 4
2 / 7 / -1 / -1 / 1
3 / 8 / 0 / 0 / 0
4 / 9 / 1 / 1 / 1
5 / 10 / 2 / 2 / 4
Sum / 0 / 0 / 10
Mean / 3 / 8
Covariance / 2.0
3. No covariance
x / y / (x − x̄) / (y − ȳ) / (x − x̄)(y − ȳ)
1 / 6 / -2 / -2 / 4
1 / 10 / -2 / 2 / -4
2 / 7 / -1 / -1 / 1
2 / 9 / -1 / 1 / -1
3 / 8 / 0 / 0 / 0
3 / 8 / 0 / 0 / 0
4 / 9 / 1 / 1 / 1
4 / 7 / 1 / -1 / -1
5 / 10 / 2 / 2 / 4
5 / 6 / 2 / -2 / -4
Sum / 0 / 0 / 0
Mean / 3 / 8
Covariance / 0
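As a check, you can reproduce these covariances in SPSS. A minimal sketch, assuming two numeric variables named x and y in the active dataset; note that SPSS divides the sum of cross-products by n − 1 rather than n, so its covariances will differ somewhat from the hand calculations above:
SPSS Syntax:
* Sums of squares, cross-products, and covariances.
CORRELATIONS
  /VARIABLES=x y
  /STATISTICS=XPROD.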
The downside to the covariance is that the number doesn’t have any meaning in and of itself (it depends on the ‘metric’ of the variables). If only there were some way to convert it into something else...
The bivariate correlation coefficient (Pearson’s r)
- Indicates the strength and direction of a straight line (linear) relationship
- Is a symmetrical measure of association (i.e., it doesn’t matter which is the DV and which is the IV)
- Is a single number ranging from -1 to 1 with 0 indicating no relationship
- A formula (different than the one in our book, but using stuff we know):
\[ r = \frac{\text{cov}(x,y)}{s_x s_y} \]
Let’s work through an example (I took a random sample of 20 countries from the World95.sav data file):
Country / Infant mortality (y) / Literacy (x) / (y − ȳ) / (y − ȳ)² / (x − x̄) / (x − x̄)² / (x − x̄)(y − ȳ)
1 / Bangladesh / 106 / 35 / 51.7 / 2668.0 / -30.9 / 957.7 / -1598.5
2 / Burkina Faso / 118 / 18 / 63.7 / 4051.7 / -47.9 / 2299.0 / -3052.0
3 / Cent. Afri.R / 137 / 27 / 82.7 / 6831.5 / -38.9 / 1516.9 / -3219.1
4 / Czech Rep. a
5 / Ethiopia / 110 / 24 / 55.7 / 3097.2 / -41.9 / 1759.6 / -2334.5
6 / Finland / 5.3 / 100 / -49.0 / 2405.6 / 34.1 / 1159.6 / -1670.2
7 / Georgia / 23 / 99 / -31.3 / 982.7 / 33.1 / 1092.5 / -1036.1
8 / Hong Kong / 5.8 / 77 / -48.5 / 2356.8 / 11.1 / 122.2 / -536.6
9 / India / 79 / 52 / 24.7 / 607.8 / -13.9 / 194.5 / -343.8
10 / Iran / 60 / 54 / 5.7 / 32.0 / -11.9 / 142.7 / -67.5
11 / Lebanon / 39.5 / 80 / -14.8 / 220.4 / 14.1 / 197.5 / -208.6
12 / Libya / 63 / 64 / 8.7 / 74.9 / -1.9 / 3.8 / -16.8
13 / Lithuania / 17 / 99 / -37.3 / 1394.8 / 33.1 / 1092.5 / -1234.4
14 / Malaysia / 25.6 / 78 / -28.7 / 826.4 / 12.1 / 145.3 / -346.5
15 / N. Korea / 27.7 / 99 / -26.6 / 710.1 / 33.1 / 1092.5 / -880.8
16 / Netherlands / 6.3 / 99 / -48.0 / 2308.5 / 33.1 / 1092.5 / -1588.1
17 / New Zealand / 8.9 / 99 / -45.4 / 2065.5 / 33.1 / 1092.5 / -1502.2
18 / Nicaragua / 52.5 / 57 / -1.8 / 3.4 / -8.9 / 80.1 / 16.5
19 / Somalia / 126 / 24 / 71.7 / 5134.1 / -41.9 / 1759.6 / -3005.6
20 / U.Arab Em. / 22 / 68 / -32.3 / 1046.4 / 2.1 / 4.2 / -66.4
Sum / 1032.6 / 1253.0 / 0.0 / 36817.7 / 0.0 / 15804.9 / -22691.3
Mean / 54.3 / 65.9
a. The Czech Republic has valid data for infant mortality, but not for literacy. I have used listwise deletion – that is, the results are based on the 19 countries with valid data on both variables.
\[ \text{cov}(x,y) = \frac{-22691.3}{19} = -1194.3; \quad s_x = \sqrt{\frac{15804.9}{19}} = 28.8; \quad s_y = \sqrt{\frac{36817.7}{19}} = 44.0 \]
\[ r = \frac{-1194.3}{(28.8)(44.0)} = -.941 \]
a strong negative relationship
But is there a relationship in the population? In other words, is r sufficiently different from 0 in my sample for me to assume it is different from 0 in the population? What is our population? Let’s perform a hypothesis test:
Requirements (assumptions) for Pearson’s correlation coefficient:
1. A straight line relationship
2. Interval-ratio level data
3. Random sampling
4. Normally distributed characteristics – “Testing the significance of Pearson’s r requires both X and Y variables to be normally distributed in the population. In small samples, failure to meet the requirement of normally distributed characteristics may seriously impair the validity of the test. However, this requirement is of minor importance when the sample size equals or exceeds 30 cases” (p. 329).
Two-tailed hypotheses and alpha (.05) [you can also test one-tailed hypotheses]:
H0: ρ = 0
H1: ρ ≠ 0
The test statistic and sampling distribution: t
The formula for converting our correlation into a t value:
\[ t = r\sqrt{\frac{n-2}{1-r^2}} \]
df=n-2=17
Critical value=2.110 (df=17, 2 tailed, alpha=0.05)
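Plugging in the values from our sample (a worked step using r = -.941 and n = 19 from the table above):
\[ t = -.941\sqrt{\frac{19-2}{1-(-.941)^2}} = -.941\sqrt{\frac{17}{.115}} \approx -11.5 \]
Since |-11.5| is far beyond the critical value of ±2.110: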
We reject H0.
But be careful!
1. The correlation coefficient is sensitive to unusual combinations of mean deviations (outliers on both X and Y) as well as outliers on X and Y separately (which influence the standard deviations).
2. It can only detect linear relationships.
You should always check univariate descriptive statistics as well as scatterplots before relying on the correlation.
Results from SPSS (using the complete data):
We can reject the null hypothesis (H0: ρ = 0) because p (‘Sig.’) is less than alpha (.05).
Correlation matrices
Later on, when we are working with more than two variables, you will see correlation matrices. These display all possible bivariate correlations. Here is an example of unedited output from SPSS:
You should improve the matrix for presentation:
Table 1. Bivariate correlations (N=74).
 / (1) / (2) / (3) / (4) / (5) / (6)
Infant mortality (1) / 1.000
Literacy (2) / -.920* / 1.000
GDP (3) / -.692* / .627* / 1.000
Calories (4) / -.774* / .682* / .760* / 1.000
Birth rate (5) / .865* / -.871* / -.741* / -.757* / 1.000
Fertility (6) / .843* / -.863* / -.652* / -.691* / .974* / 1.000
*p<.01 (2-tailed)
Source: World95.sav.
Scatterplots and bivariate correlation in SPSS
Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define
Analyze → Correlate → Bivariate
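The equivalent syntax is sketched below. The variable names babymort (infant mortality) and literacy are my assumption about World95.sav – check the file’s variable list:
SPSS Syntax:
* Scatterplot of infant mortality against literacy.
GRAPH
  /SCATTERPLOT(BIVAR)=literacy WITH babymort.
* Pearson's r with a two-tailed significance test.
CORRELATIONS
  /VARIABLES=babymort literacy
  /PRINT=TWOTAIL NOSIG.
Listing more than two variables on /VARIABLES produces a correlation matrix like the one shown above.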
Bivariate regression
- Bivariate correlation is not terribly helpful for identifying how meaningful a relationship is (i.e., what does -.942 really mean?).
- Bivariate regression is another way to describe the linear relationship between two interval-ratio variables
- With regression, we calculate the best-fitting line – a slope (b) and a y-intercept (a) – to summarize the relationship; the slope conveys more meaning because of its interpretation – each one-unit increase in X leads to a b-unit increase/decrease in Y
- \( \hat{y}_i = a + bx_i \); the predicted score for the ith case is equal to the y-intercept (a) plus the product of the slope (b) and the ith case’s score on x.
The following formula gives us the best-fitting line (when certain assumptions hold):
\( b = \frac{\text{cov}(y,x)}{s_x^2} \); the slope is equal to the covariance between y and x divided by the variance of x
Covariance: \( \text{cov}(y,x) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} \); Variance of x: \( s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n} \)
\( a = \bar{y} - b\bar{x} \); a is the y-intercept
Note that the slope is an asymmetrical measure of association because the denominator includes only the variance for x (i.e., the independent variable); whereas Pearson’s correlation is symmetrical because it includes the standard deviations for both x and y.
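Applying these formulas to the 19-country sample from the correlation example (my arithmetic, using the sums from that table; the n’s cancel in the ratio):
\[ b = \frac{-22691.3}{15804.9} = -1.436 \qquad a = 54.3 - (-1.436)(65.9) = 148.9 \]
(These estimates differ from the SPSS results discussed below – b = -1.507, a = 160.732 – because SPSS used the complete data rather than my 19-country sample.)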
What numbers do you need to draw a line? You need a point and a slope
- The point that we are interested in is called the y-intercept
- The y-intercept is the value of y (the dependent variable) when the independent variable equals 0
- This is the point at which the line crosses the y axis
- SPSS refers to the y-intercept as the “(Constant)”; it is circled below
- Interpretation: the predicted number of deaths per 1,000 live births is 160.732 for a country with a literacy rate of 0 (i.e., a country in which nobody can read or write – this country doesn’t exist; the minimum literacy rate is 18% for Burkina Faso; we’ll deal with this in a bit)
- The second thing you need to draw a line is the slope
- The slope (or “b”) is the change in the dependent variable (y) for each one unit change in the independent variable (x).
- SPSS includes the slope in the ‘B’ column (it is in a rectangular box above)
- Interpretation: Every 1% increase in literacy reduces the number of deaths by 1.507 per 1,000 live births.
- What would ‘no relationship’ look like? (A slope of 0 – a flat, horizontal line.)
If we have these two things (the y-intercept and the slope) we can draw a line:
\[ \hat{y}_i = a + bx_i \]
- \( \hat{y}_i \) is the predicted value of the dependent variable for case i
- a is the y-intercept (the value of the dependent variable when X= 0)
- b is the slope
- \( x_i \) is the value of the independent variable for case i
- We could write the equation for our example as: \( \hat{y}_i = 160.732 - 1.507x_i \)
You can use this equation to predict scores of the dependent variable for individual cases:
Country / Infant mortality (yi) / Literacy (xi) / Calculation / Predicted infant mortality (ŷi) / Prediction error (yi − ŷi)
Afghanistan / 168 / 29 / 160.732 - 1.507*29 = / 117.029 / 51.0
Argentina / 25.6 / 95 / 160.732 - 1.507*95 = / 17.567 / 8.0
Armenia / 27 / 98 / 160.732 - 1.507*98 = / 13.046 / 14.0
Australia / 7.3 / 100 / 160.732 - 1.507*100 = / 10.032 / -2.7
Austria / 6.7 / 99 / 160.732 - 1.507*99 = / 11.539 / -4.8
Azerbaijan / 35 / 98 / 160.732 - 1.507*98 = / 13.046 / 22.0
Bahrain / 25 / 77 / 160.732 - 1.507*77 = / 44.693 / -19.7
Bangladesh / 106 / 35 / 160.732 - 1.507*35 = / 107.987 / -2.0
Barbados / 20.3 / 99 / 160.732 - 1.507*99 = / 11.539 / 8.8
Belarus / 19 / 99 / 160.732 - 1.507*99 = / 11.539 / 7.5
…
Notice that the predicted value does not equal each case’s observed value. These differences are called prediction errors. We will always have prediction errors – this is expected. Remember that the regression line is meant to summarize the relationship between two variables. Any time you summarize, you simplify, and simplification leads to the loss of information. So our picture is not perfect, but it gives us a general sense of the relationship.
Other ways to write the regression equation:
- \( y_i = a + bx_i + e_i \); where \( y_i \) is the actual score and \( e_i \) is the prediction error; or
- \( y_i = \hat{y}_i + e_i \); where the actual score is equal to the predicted score plus the prediction error.
The method that we are using to estimate the slope is referred to as ordinary least squares (OLS). When certain assumptions hold (we’ll discuss these in a few weeks), the OLS slope is the ‘best’ (i.e., it is better than the slopes calculated by other means). It is best because it minimizes the residual sum of squares (RSS):
\[ RSS = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 \]
In other words, the best-fitting line is the one that minimizes the squared prediction errors or the squared difference between each case’s actual score and their score predicted by the regression model.
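Here is a sketch of the regression run in SPSS, again assuming the World95.sav variable names babymort and literacy. The /SAVE subcommand adds each case’s predicted score and residual (prediction error) to the data file, which reproduces the prediction table above:
SPSS Syntax:
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT babymort
  /METHOD=ENTER literacy
  /SAVE PRED RESID.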
How meaningful is the slope?
- Every 1% increase in literacy reduces the number of deaths by 1.507 per 1,000 live births
- The minimum value of literacy is 18 and the maximum is 100
- The predicted infant mortality rate for a literacy rate of 18% is 160.732 − 1.507(18) = 133.606
- The predicted infant mortality rate for a literacy rate of 100% is 160.732 − 1.507(100) = 10.032
- This seems to be a pretty meaningful difference
- You could (and should) compute predicted values for other, less extreme values (perhaps the lower and upper quartiles, or ±1 standard deviation)
Hypothesis testing
Meaningful, however, is not the same thing as statistically significant. Our slope coefficient is pretty close to 0. Is it different enough from 0 that we can assume that the slope is not equal to 0 in the population? We need a statistical hypothesis test to answer this question.
Some requirements (assumptions) for regression:
1. Both variables are measured at the interval-ratio level
2. There is a straight line relationship (There are ways to get around this that we’ll discuss later). Regression is sensitive to outliers (we’ll also deal with this later).
3. Random sample – do we have a random sample in this example? We’ll pretend just for now…
4. To test significance, you must assume the variables have normal distributions in the population or you must have a large sample.
Two-tailed hypotheses and alpha (.05) [you can also test one-tailed hypotheses]:
H0: β = 0
H1: β ≠ 0
Test statistic and distribution: t
\[ t = \frac{b}{s_b}; \quad df = n - 2 \]
(where \( s_b \) is the standard error of the slope)
Like all standard errors, the standard error of the slope describes variability across all possible samples of the same size drawn from the population – specifically, how much the estimated slope would vary from sample to sample. By using this formula for t, we convert the slope coefficient into a t score so that we can see how unusual our sample slope would be if the null hypothesis were true.
Critical value ≈ ±1.980 (α = .05; 2-tailed; df = 105)
Observed t=-21.219
We reject the null hypothesis and conclude that there probably is a relationship between infant mortality and literacy in the population. We would not obtain a slope as far away from 0 as -1.507 very often if the population slope is equal to 0. The approximate probability of our data if the null hypothesis is true is:
p = 9.176113601021e-040 = .0000000000000000000000000000000000000009176113601021
So p < α (.05), which would also allow us to reject the null hypothesis.
Note that the sample slope is a point estimate of the population slope. We can also calculate a confidence interval: CI = b ± (t × sb), where t is the two-tailed critical value (1.980 here). Ninety-five out of 100 such confidence intervals should contain the true population slope. Thus, we can be pretty confident that the population slope is between -1.648 and -1.366. Notice that this interval does not contain 0 – this is a third way to test the null hypothesis.
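As a worked step (my arithmetic from the results reported above), the standard error of the slope can be recovered from b and its t value, and the interval follows:
\[ s_b = \frac{|b|}{|t|} = \frac{1.507}{21.219} = .071 \qquad CI = -1.507 \pm 1.980(.071) = (-1.648, -1.366) \]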
A potential problem…and a solution
The y-intercept is not terribly meaningful because no countries have a score of 0% on literacy. There is a trick that we can use to get around this – it is known as grand mean centering. You simply subtract the mean value of the independent variable from every score and use this centered variable in the regression. SPSS Syntax:
* Grand mean centering.
freq vars=literacy /stats=mean.
compute lit_c=literacy-78.33644859813.
freq vars=literacy lit_c /stats=all /histogram.
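An aside before looking at the output: if you would rather not type the mean by hand, recent versions of SPSS can attach the grand mean to every case with AGGREGATE (a sketch – check the syntax reference for your version):
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /lit_mean=MEAN(literacy).
COMPUTE lit_c=literacy-lit_mean.
EXECUTE.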
LITERACY / Frequency / LIT_C / Frequency
Valid / 18 / 1 / Valid / -60.34 / 1
24 / 2 / -54.34 / 2
… / …
99 / 22 / 20.66 / 22
100 / 3 / 21.66 / 3
Total / 107 / Total / 107
Missing / System / 2 / Missing / System / 2
Total / 109 / Total / 109
[Before-and-after output comparing literacy and lit_c is not reproduced here.]
Little has changed…
Our intercept, however, is now 42.674:
- 42.674 is the predicted number of infant deaths for a country with ‘0’ literacy
- We have, however, changed what 0 means by centering the variable
- On the original literacy variable, 0 meant 0% literate
- On the centered variable (lit_c), 0 is equal to the mean literacy rate (78.34%)
- It is accurate to say that the predicted number of infant deaths for a country with mean literacy is 42.674 per 1,000 live births
- You may have noticed that the new intercept is the average infant mortality rate. This will be true in bivariate regression. The real benefit will come when we do multivariate regression with more than one independent variable. Centering will also come in very handy later on when we discuss interaction effects.
Measures of association
Notice that I haven’t said anything about determining the strength of the relationship from the y-intercept and slope. These tell us nothing about the strength of the relationship – they only tell us if there is a relationship and they help us to describe and understand the relationship.
There are three measures of association for interval-ratio variables that tell us about strength (in addition to other things). These are Pearson’s product moment correlation coefficient (r), the coefficient of determination (r2), and the standardized slope coefficient (Beta).
The coefficient of determination
- Another measure of association for interval-ratio variables (PRE)
- It is equal to the correlation squared
- Because it is squared, it cannot tell us the direction of the relationship
- However, it has a more meaningful interpretation than r
- It tells us how much our errors in predicting the dependent variable are reduced by taking into account the independent variable
- It reflects the proportion of the total variation in the dependent variable explained by the independent variable
The logic of r2:
Imagine that I am trying to predict the infant mortality rate for South Africa. What is my best guess without knowing anything about South Africa? My best guess would be the mean infant mortality rate across all countries: 42.674 deaths per 1,000 live births
Now, imagine that I am allowed to use one variable to improve my prediction. I decide to use the literacy rate (uncentered to simplify the example). 76% of the population of South Africa is literate. I can plug 76% into the regression equation to generate a new estimate:
\[ \hat{y} = 160.732 - 1.507(76) = 46.2 \]
South Africa has an actual infant mortality rate of 47.1.
My prediction errors:
- Using only the grand mean: 4.426 = 47.1 – 42.674
- Using the literacy rate: 0.9 = 47.1 – 46.2
- I have reduced my prediction error substantially – by a proportion of (4.426 − 0.9)/4.426 = .797, or 79.7%
The coefficient of determination (r2) is a measure of association that summarizes just how wrong we are in our predictions across all cases. It tells us how much better our prediction is when we use the independent variable to predict the dependent variable rather than guessing the mean.
The good news is that once you have calculated the correlation coefficient, it is really easy to get the coefficient of determination. All you have to do is square the correlation. It is also included in SPSS output:
In this example, we reduce our prediction errors by a proportion of .811 or by 81.1% when we use literacy to predict infant mortality. We can also say that literacy explains 81.1% of the variation in infant mortality. This is clearly a strong relationship.
Other measures of strength
We’ve already discussed the correlation coefficient. In a bivariate regression, the standardized slope coefficient is equal to the bivariate correlation. This will not be the case in multivariate regression. There is some controversy surrounding standardized slope coefficients. I am not a big fan of them, but we will talk about them more when we discuss multivariate regression.
Bivariate correlation and regression in SPSS – Example 2
Our research questions: Are there relationships between anti-Black stereotyping (the DV) and age and education (the IVs) among Whites in the US?
Our hypotheses (alpha = .01):
For age – H0: βage ≤ 0; H1: βage > 0
For education – H0: βeducation ≥ 0; H1: βeducation < 0
The variables:
Stereotyping is an index that I created from four variables:
1. “Now I have some questions about different groups in our society. I’m going to show you a seven-point scale on which the characteristics of people in a group can be rated. In the first statement a score of 1 means that you think almost all of the people in that group are “rich.” A score of 7 means that you think almost everyone in the group is “poor.” A score of 4 means you think that the group is not towards one end or another, and of course you may choose any number in between that comes closest to where you think people in the group stand. Jews?”
2. Hard-working to Lazy
3. Not prone to violence to prone to violence
4. Intelligent to unintelligent
I have separate indexes for four target groups: Jews, Blacks, Hispanics, and Asians. We’ll focus on anti-Black stereotyping.
Each index can range from 4 to 28 (since the original items range from 1 to 7). Scores above 16 indicate negative stereotypes. Scores of 16 indicate neutrality (since 4 is neutral and there are 4 questions). Scores below 16 indicate positive stereotypes.
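For reference, here is a sketch of how such an index might be computed in syntax. The four item names are hypothetical placeholders – substitute the actual variable names from your GSS file:
SPSS Syntax:
* Sum the four 1-7 rating items (hypothetical names; higher = more negative rating)
* into an index ranging from 4 to 28.
COMPUTE stereo_blk=blkrich+blklazy+blkviol+blkintel.
EXECUTE.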
I have used a filter to select only the White respondents for the analysis: Data → Select Cases
SPSS Syntax:
freq vars=race.
USE ALL.
COMPUTE filter_$=(race=1).
VARIABLE LABEL filter_$ 'race=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .
freq vars=race.
Univariate statistics for anti-Black stereotyping:
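The univariate output would come from a frequencies command along these lines (assuming the index is named stereo_blk, as in the hypothetical sketch above):
SPSS Syntax:
freq vars=stereo_blk /stats=all /histogram.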