For the bivariate linear regression problem, data are collected on an independent or predictor variable (X) and a dependent or criterion variable (Y) for each individual. Bivariate linear regression computes an equation that relates predicted Y scores (Ŷ) to X scores. The regression equation includes a slope weight for the independent variable, Bslope (b), and an additive constant, Bconstant (a):

Ŷ = Bslope X + Bconstant

(or)

Ŷ = bX + a

Indices are computed to assess how accurately the Y scores are predicted by the linear equation.

We will focus on applications in which both the predictor and the criterion are quantitative (continuous, interval- or ratio-scaled) variables. However, bivariate regression analysis may be used in other applications. For example, a predictor could have two levels, such as gender, scored 0 for females and 1 for males. A criterion could also have two levels, such as pass-fail performance, scored 0 for fail and 1 for pass.

Linear regression can be used to analyze data from experimental or non-experimental designs. If the data are collected using experimental methods (e.g., a tightly controlled study in which participants have been randomly assigned to different treatment groups), the X and Y variables may be referred to appropriately as the independent and the dependent variables, respectively. SPSS uses these terms. However, if the data are collected using non-experimental methods (e.g., a study in which subjects are measured on a variety of variables), the X and Y variables are more appropriately referred to as the predictor and the criterion, respectively.

Understanding Bivariate Linear Regression

A significance test can be conducted to evaluate whether X is useful in predicting Y. This test can be conceptualized as evaluating either of the following null hypotheses: the population slope weight is equal to zero or the population correlation coefficient is equal to zero.
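In symbols, letting β denote the population slope and ρ the population correlation, the two formulations of the null hypothesis are

H₀: β = 0 (the population slope weight is zero)

H₀: ρ = 0 (the population correlation coefficient is zero)

Because the sample slope equals r multiplied by the ratio of the standard deviations of Y and X, the slope is zero exactly when the correlation is zero, so the two formulations lead to the same test.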

The significance test can be derived under two alternative sets of assumptions: those for a fixed-effects model and those for a random-effects model. The fixed-effects model is probably more appropriate for experimental studies, while the random-effects model seems more appropriate for non-experimental studies. If the fixed-effects assumptions hold, linear or non-linear relationships can exist between the predictor and the criterion. On the other hand, if the random-effects assumptions hold, the only type of statistical relationship that can exist between the two variables is a linear one.


Regardless of the choice of assumptions, it is important to examine a bivariate scatterplot of the predictor and the criterion prior to conducting a regression analysis, both to assess whether a non-linear relationship exists between X and Y and to detect outliers. If the relationship appears to be non-linear based on the scatterplot, you should not conduct a simple bivariate regression analysis but should instead evaluate the inclusion of higher-order terms (variables that are squared, cubed, and so on) in your regression equation. Outliers should be checked to ensure that they were not entered incorrectly into the data set and, if correctly entered, to determine their effect on the results of the regression analysis.
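For example, a quadratic trend could be evaluated in SPSS by computing a squared predictor and entering it along with the original predictor. A minimal syntax sketch, assuming the predictor and criterion are named VISUAL and MATHACH as in the example below:

COMPUTE visual2 = visual**2.
EXECUTE.
REGRESSION
  /DEPENDENT mathach
  /METHOD=ENTER visual visual2 .

If the squared term carries a significant weight, a straight line is not an adequate summary of the relationship.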

Fixed-Effects Model Assumptions for Bivariate Linear Regression

Assumption 1: The Dependent Variable Is Normally Distributed in the Population for Each Level of the Independent Variable

In many applications with a moderate or larger sample size, the test of the slope may yield reasonably accurate p values even when the normality assumption is violated. To the extent that population distributions are not normal and sample sizes are small, the p values may be invalid. In addition, the power of this test may be reduced if the population distributions are non-normal.
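Although the Linear Regression procedure does not check this assumption for you, the Explore procedure offers a rough screen of the distribution of the dependent variable. A sketch, assuming the dependent variable is named MATHACH; strictly speaking, the assumption concerns the distribution of Y at each level of X, so this is only an overall check:

EXAMINE VARIABLES=mathach
  /PLOT NPPLOT.

A roughly straight normal Q-Q plot is consistent with the normality assumption.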

Assumption 2: The Population Variances of the Dependent Variable Are the Same for All Levels of the Independent Variable

To the extent that this assumption is violated and the sample sizes differ among the levels of the independent variable, the resulting p value for the overall F test is not trustworthy.

Assumption 3: The Cases Represent a Random Sample from the Population, and the Scores Are Independent of Each Other from One Individual to the Next

The significance test for regression analysis will yield inaccurate p values if the independence assumption is violated.

Random-Effects Model Assumptions for Bivariate Linear Regression

Assumption 1: The X and Y Variables Are Bivariately Normally Distributed in the Population

If the variables are bivariately normally distributed, each variable is normally distributed ignoring the other variable and each variable is normally distributed at every level of the other variable. The significance test for bivariate regression yields, in most cases, relatively valid results in terms of Type I errors when the sample is moderate to large in size. If X and Y are bivariately normally distributed, the only type of relationship that exists between these variables is linear.

Assumption 2: The Cases Represent a Random Sample from the Population, and the Scores on Each Variable Are Independent of Other Scores on the Same Variable

The significance test for regression analysis will yield inaccurate p values if the independence assumption is violated.

Effect Size Statistics for Bivariate Linear Regression

Linear regression is a general procedure that assesses how well one or more independent variables predict a dependent variable. Consequently, SPSS reports strength-of-relationship statistics that are useful for regression analyses with multiple predictors. Four correlational indices are presented in the output for the Linear Regression procedure: the Pearson product-moment correlation coefficient (r), the multiple correlation coefficient (R), its squared value (R²), and the adjusted R². However, there is considerable redundancy among these statistics for the single-predictor case: R = |r|, R² = r², and the adjusted R² is approximately equal to R². Accordingly, the only correlational indices we need to report in our manuscript for a bivariate regression are r and r².

The Pearson product-moment correlation coefficient ranges in value from −1.00 to +1.00. A positive value indicates that as the independent variable X increases, the dependent variable Y tends to increase. A zero value indicates that as X increases, Y neither increases nor decreases. A negative value indicates that as X increases, Y tends to decrease. Values close to −1.00 or +1.00 indicate stronger linear relationships. The interpretation of the strength of a relationship should depend on the research context.

By squaring r, we obtain an index that directly tells us how well we can predict Y from X: r² indicates the proportion of Y variance that is accounted for by its linear relationship with X. Alternatively, r² (the coefficient of determination) can be conceptualized as the proportional reduction in error achieved by including X in the regression equation in comparison with not including X.
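In formula form, letting SS_Total denote the variability of the Y scores around their mean and SS_Residual the variability of the Y scores around the regression line,

r² = (SS_Total − SS_Residual) / SS_Total

Both sums of squares appear in the ANOVA table of the SPSS output.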

Other strength-of-relationship indices may be reported for bivariate regression problems. For example, SPSS gives the Standard Error of the Estimate on the output. The standard error of estimate indicates how large the typical error is in predicting Y from X. It is useful over and above correlational indices because it expresses how poorly we predict the dependent variable scores in the metric of those scores. In comparison, correlational statistics are unit-free indices and, therefore, are more abstract and harder to interpret.
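For bivariate regression, the standard error of the estimate is computed from the residual sum of squares as

Standard error of estimate = √(SS_Residual / (N − 2))

where N is the number of cases; the divisor N − 2 reflects the two estimated weights, a and b.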

Conducting a Bivariate Linear Regression Analysis

1. Open the data file

2. Click Analyze

then click Regression

then click Linear

You will see the Linear Regression dialog box.

3. Select your dependent variable

then click ► to move it to the Dependent box.

(for this example – MATHACH was chosen)

4. Select your independent variable

then click ► to move it to the Independent box.

(for this example – VISUAL was chosen)

5. Click Statistics

You will see the Linear Regression: Statistics dialog box.

6. Click Confidence intervals and Descriptives

Make sure that Estimates and Model fit are also selected.

7. Click Continue

8a. (For total sample information) Click OK

For this example, we will look at the total sample information.

8b. (For group information) Click Paste

Make the necessary adjustments to your pasted syntax (e.g., add TEMPORARY and SELECT IF commands to restrict the analysis to a group, as sketched below), then run the analysis.
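For example, to restrict the regression to one group, TEMPORARY and SELECT IF commands can be placed immediately before the pasted REGRESSION command. A sketch, assuming a grouping variable named GENDER scored 0 for females and 1 for males (a hypothetical variable, not part of the pasted syntax):

TEMPORARY.
SELECT IF (gender = 0).
REGRESSION
  /DEPENDENT mathach
  /METHOD=ENTER visual .

Because TEMPORARY is in effect, the selection applies only to the procedure that immediately follows it, and the full data set remains available afterward.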

Selected SPSS Output for Bivariate Linear Regression

The results of the bivariate linear regression analysis example are shown below. The B’s, as labeled in the Unstandardized Coefficients box of the output, are the additive constant, a (8.853), and the slope weight, b (.745), of the regression equation used to predict the dependent variable from the independent variable.

The regression or prediction equation is as follows:

Ŷ = Bslope X + Bconstant

(or)

Ŷ = bX + a

Predicted Mathematics Test Score = .745 Visualization Test Score + 8.853
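For example, a hypothetical individual with a visualization test score of 10 would have a predicted mathematics test score of .745(10) + 8.853 = 16.303.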

Syntax:

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT mathach
  /METHOD=ENTER visual .

Based on the magnitude of the correlation coefficient, we can conclude that the visualization test is moderately related to the mathematics test (r = .438). Approximately 19% (r² = .192) of the variance of the mathematics test is associated with the visualization test.

The hypothesis test of interest evaluates whether the independent variable predicts the dependent variable in the population. More specifically, it assesses whether the population correlation coefficient is equal to zero or, equivalently, whether the population slope is equal to zero. This significance test appears in two places for a bivariate regression analysis: the F test reported in the ANOVA table and the t test associated with the independent variable in the Coefficients table. Because there is a single predictor, the two tests are identical (F = t²) and yield the same p value: F(1, 498) = 118.519, p < .001, and t(498) = 10.887, p < .001. In addition, the fact that the 95% confidence interval for the slope does not contain zero indicates that the null hypothesis should be rejected at the .05 level.

Using SPSS Graphs to Display the Results

A variety of graphs have been suggested for interpreting linear regression results. The results of a bivariate regression analysis can be summarized using a bivariate scatterplot. Conduct the following steps to create a simple bivariate scatterplot for our example (equivalent syntax is given after the steps):

1. Click Graphs (on the menu bar)

then click Scatter

2. Click Simple

then click Define

3. Click the Dependent (Criterion) Variable and click ► to move it to the Y axis box.

4. Click the Independent (Predictor) Variable and click ► to move it to the X axis box.

5. Click OK
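Equivalently, the scatterplot can be requested with syntax. A sketch using the variables from our example, with VISUAL on the X axis and MATHACH on the Y axis:

GRAPH
  /SCATTERPLOT(BIVAR)=visual WITH mathach.

This is essentially the command SPSS pastes when the dialog box selections above are made.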

Once you have created a scatterplot showing the relationship between the two variables, you can add a regression line by following these steps:

1. Double-click on the chart to select it for editing, and maximize the chart editor.

2. Click Chart from the menu at the top of the window in the chart editor

then click Options.

3. Click Total in the Fit Line box.

4. Click OK, then close the Chart1 – SPSS Chart Editor window

For example, your scatterplot would look like the one below:

An examination of the plot allows us to assess how accurately the regression equation predicts the dependent variable scores. In this case, the equation offers some predictability, but many points fall far off the line, indicating poor prediction for those points.
