Chapter 5-29. Homework Problem Solutions
Chapter 5-1. What regression is and curvilinear correlation
Problem 1) Deterministic straight line function
Linear regression extends the idea of a linear equation that you encountered in basic algebra. The following Stata commands, executed in the do-file editor, produce a graph
similar to what is shown in basic algebra textbooks.
* -- linear equation
* pcbarrow is a paired coordinate plot with barbs at both ends
clear
input y1 x1 y2 x2
-3 -2 5 2
end
#delimit ;
twoway (pcbarrow y1 x1 y2 x2, mlwidth(*2)
mlcolor(blue) lcolor(blue) lwidth(*2))
, ylabels(-5(1)5, grid glcolor(green) gmin gmax
angle(horizontal))
xlabels(-5(1)5, grid glcolor(green) gmin gmax)
ytitle(y) xtitle(x , height(5)) aspectratio(1)
yline(0, lcolor(black) lwidth(*2))
xline(0, lcolor(black) lwidth(*2))
scheme(s1color) plotregion(style(none))
title("y = a + bx = 1 + 2x")
;
#delimit cr
______
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
This is called a “deterministic” function because all of the data points are on the line. In other words, if we know the value of x, then we know y by simply solving the equation.
The idea of the “slope” of the line, which is b = 2, is straightforward. For each 1 unit increase in x, there is a 2 unit increase in y. The idea of the y-intercept is likewise straightforward. When x=0, y = 1 + 2(0) = 1, which is where the line crosses the y-axis. Knowing that the function is a straight line with y-intercept 1 and slope 2 completely determines the equation.
Plugging in some data, by choosing some values for x and then evaluating the equation to get the y, we have,
     x     y
    -2    -3
    -1    -1
     0     1
     1     3
     2     5
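If you would rather let Stata do this arithmetic, a short forvalues loop (a sketch, not part of the original assignment) evaluates the equation at each value of x:

* sketch: evaluate y = 1 + 2x for x = -2, -1, 0, 1, 2
forvalues x = -2/2 {
    display "x = `x'   y = " 1 + 2*`x'
}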
Let’s add these data points to the graph,
* -- linear equation
clear
input x y y1 x1 y2 x2
-2 -3 -3 -2 5 2
-1 -1 . . . .
0 1 . . . .
1 3 . . . .
2 5 . . . .
end
#delimit ;
twoway (pcbarrow y1 x1 y2 x2, mlwidth(*2)
mlcolor(blue) lcolor(blue) lwidth(*2))
(scatter y x , msize(*2) mlcolor(blue) mfcolor(blue))
, ylabels(-5(1)5, grid glcolor(green) gmin gmax
angle(horizontal))
xlabels(-5(1)5, grid glcolor(green) gmin gmax)
ytitle(y) xtitle(x , height(5)) aspectratio(1)
yline(0, lcolor(black) lwidth(*2))
xline(0, lcolor(black) lwidth(*2))
scheme(s1color) plotregion(style(none))
title("y = a + bx = 1 + 2x") legend(off)
;
#delimit cr
Now, let’s see if linear regression is smart enough to fit the same line through the data
points.
The assignment is:
Part 1) Cut-and-paste the following into the do-file editor and execute it. This will load the data into Stata and create a scatterplot.
clear
input x y
-2 -3
-1 -1
0 1
1 3
2 5
end
twoway (scatter y x)
Part 2) Fit a linear regression to these data, using the “regress” command.
(Warning: the output will look strange, with a lot of dots denoting what cannot be computed. All of the dots, or missing values, are due to the fact that the equation is fitted exactly, with no variability. With no variability, the statistics shown as dots cannot be computed. The “Coef” column, however, can still be interpreted as valid estimates.)
Solution:
regress y x

      Source |       SS       df       MS              Number of obs =       5
-------------+------------------------------           F(  1,     3) =       .
       Model |          40     1          40           Prob > F      =       .
    Residual |           0     3           0           R-squared     =  1.0000
-------------+------------------------------           Adj R-squared =  1.0000
       Total |          40     4          10           Root MSE      =       0

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          2          .        .       .            .           .
       _cons |          1          .        .       .            .           .
------------------------------------------------------------------------------
Part 3) Did the linear regression fit the expected equation to these data?
Solution:
      Source |       SS       df       MS              Number of obs =       5
-------------+------------------------------           F(  1,     3) =       .
       Model |          40     1          40           Prob > F      =       .
    Residual |           0     3           0           R-squared     =  1.0000
-------------+------------------------------           Adj R-squared =  1.0000
       Total |          40     4          10           Root MSE      =       0

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          2          .        .       .            .           .
       _cons |          1          .        .       .            .           .
------------------------------------------------------------------------------
Yes, in the “Coef”, or coefficient column of the output, we see 2 on the x row of the regression table, and 1 on the _cons, or constant, row of the regression table. The 2 is the slope, and 1 is the y-intercept. So, the regression equation is:
y = a + bx = 1 + 2x
as it should be. So, linear regression is smart enough to fit a straight line deterministic equation when the scatterplot reveals a perfect straight line.
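As a quick check (a sketch using Stata's stored estimation results), the fitted slope and intercept can also be displayed directly after regress:

* sketch: coefficients are stored in _b[] after -regress-
display "slope     b = " _b[x]
display "intercept a = " _b[_cons]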
Part 4) The Multiple R statistic is the correlation between the outcome variable, y, and the predictor variables taken as a set, which is just x for this problem. When there is only one predictor variable, Multiple R is identical to the Pearson correlation coefficient.
What is the Multiple R from Part 2?
Solution:
From the top part of the regression output, we see “R-squared = 1.0000”. This is the Multiple R2.
Taking the square root will give Multiple-R. We can compute this in our head, or use:
display sqrt(1.0000)
1
So, Multiple-R = 1, denoting a perfect linear association, or linear correlation.
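Equivalently, assuming the regression from Part 2 was just run so its results are still in memory, Stata stores R-squared in e(r2), and the square root can be taken directly:

* sketch: saved R-squared from the most recent -regress-
display sqrt(e(r2))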
Part 5) The Multiple R2 statistic is the coefficient of determination of the outcome variable, y, and the predictor variables taken as a set, which is just x for this problem. It represents the proportion of variation in the outcome that is explained by the predictors. Interpreting the Multiple R2 statistic in this model, is it consistent with a deterministic (perfectly determined) equation?
Solution:
From the top part of the regression output, we see “R-squared = 1.0000”, which is the Multiple R2. Since it has a value of 1, this means that 100% of the variability in y was explained by x. That is, y was completely explained by x, consistent with a deterministic equation.
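As a check on this “proportion of variation explained” interpretation, here is a sketch using the sums of squares stored by regress (Model SS = 40 and Total SS = 40 in the output above):

* sketch: R-squared = Model SS / Total SS
display e(mss) / (e(mss) + e(rss))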
Part 6) Compute the Pearson correlation coefficient for these data, using the pwcorr command. First, do not add the “obs”, or sample size option, and do not add the “sig” or significance (the p value) option. Second, add on just the “obs” option. Third, add on both the “obs” option and the “sig” option.
r = ___
n = ___
p = ___
Solution:
pwcorr y x
pwcorr y x , obs
pwcorr y x , obs sig
. pwcorr y x

             |        y        x
-------------+------------------
           y |   1.0000
           x |   1.0000   1.0000

. pwcorr y x , obs

             |        y        x
-------------+------------------
           y |   1.0000
             |        5
             |
           x |   1.0000   1.0000
             |        5        5
             |

. pwcorr y x , obs sig

             |        y        x
-------------+------------------
           y |   1.0000
             |
             |        5
             |
           x |   1.0000   1.0000
             |   0.0000
             |        5        5
r = 1.00
n = 5
p = 0.0000, which would be reported as p < 0.001
The output in the statistical package SPSS is a bit more friendly, identifying for you what the three numbers represent. It would produce the following correlation matrix:
Correlations

                                   x          y
 x    Pearson Correlation          1      1.000**
      Sig. (2-tailed)                        .000
      N                            5            5
 y    Pearson Correlation    1.000**            1
      Sig. (2-tailed)           .000
      N                            5            5

**. Correlation is significant at the 0.01 level (2-tailed).
Problem 2) Stochastic straight line function
In real life, we do not observe perfect relationships. Instead, the scatterplot might have an overall linear look to it, but the values on the graph (the x-y pairs, or y-x pairs in Stata) do not lie perfectly on a straight line.
Let’s add some random noise to the data. We’ll use the normal random number generator to add some random variation to the data, where this noise has a mean of 0 and a standard deviation of 0.25.
Part 1) Cut-and-paste the following into the do-file editor and execute it. This will load the data into Stata and create a scatterplot.
clear
input x y
-2 -3
-1 -1
0 1
1 3
2 5
end
set seed 999 // this will ensure we all get the same result
replace x = x + rnormal(0, .25)
replace y = y + rnormal(0, .25)
twoway (scatter y x)
Now the data are a bit fuzzy. Listing the data,
list x y

     +-----------------------+
     |         x           y |
     |-----------------------|
  1. | -1.897546   -3.018522 |
  2. | -.8612195   -1.568057 |
  3. | -.1446672    .9810534 |
  4. |  1.271514    3.161788 |
  5. |  2.102981    5.101994 |
     +-----------------------+
Part 2) Fit a linear regression to these data, using the “regress” command. Notice that the Multiple R2 is no longer 1.
Solution:
      Source |       SS       df       MS              Number of obs =       5
-------------+------------------------------           F(  1,     3) =  205.13
       Model |  43.5827172     1  43.5827172           Prob > F      =  0.0007
    Residual |  .637403121     3  .212467707           R-squared     =  0.9856
-------------+------------------------------           Adj R-squared =  0.9808
       Total |  44.2201203     4  11.0550301           Root MSE      =  .46094

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   2.051232   .1432201    14.32   0.001     1.595442    2.507023
       _cons |   .7383993   .2065807     3.57   0.037     .0809672    1.395831
------------------------------------------------------------------------------
We see that this time Multiple R2 = 0.9856, rather than 1, which is still a surprisingly good fit.
Part 3) Evaluate the prediction equation, or regression equation, for the first subject to see if it agrees with the actual y value, using the display command. In Stata, multiplication is denoted by the asterisk, “*”. So, your command will look something like display 1+2*2, but the numbers will be different: the x value comes from the first line of the data listing above, and the y-intercept and slope come from the regression model output in Part 2.
Solution:
display 0.7383993 + 2.051232*-1.897546
-3.1539078
We see it is a little off from the true y value of -3.018522. That is because the predicted value is on the regression line, while the actual values are the points on the scatterplot.
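Rather than typing the coefficients by hand, the predict command computes this fitted value for every subject from the stored coefficients (a sketch; the variable name yhat is arbitrary, and the course verifies these predicted values in a later problem):

* sketch: fitted values from the regression just run
predict yhat, xb
list x y yhat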
Part 4) Overlay a scatterplot (twoway scatter) and linear fit (twoway lfit) similar to what was done in Chapter 2-1. The twoway lfit line is a line graph through the predicted values of y, which come from solving the prediction equation for each subject. We will verify that in a later problem.
Solution:
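(The solution is a graph, not reproduced here.) A command along these lines, a sketch mirroring the Chapter 2-1 overlay, produces it:

* sketch: scatterplot with an overlaid least-squares line
twoway (scatter y x)(lfit y x)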
Problem 3) No Association
Two variables are said to have no association if the predictor variable (x) provides no information that helps predict the outcome variable (y). In this situation, the regression line has zero slope. Illustrating this,
clear
input x y
-2 2
-1 -2
0 2
1 -2
2 2
end
twoway (scatter y x)(lfit y x), ylabels(-5(1)5)
pwcorr y x , obs sig
regress y x
. pwcorr y x , obs sig

             |        y        x
-------------+------------------
           y |   1.0000
             |
             |        5
             |
           x |   0.0000   1.0000
             |   1.0000
             |        5        5
             |
. regress y x

      Source |       SS       df       MS              Number of obs =       5
-------------+------------------------------           F(  1,     3) =    0.00
       Model |           0     1           0           Prob > F      =  1.0000
    Residual |        19.2     3         6.4           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared = -0.3333
       Total |        19.2     4         4.8           Root MSE      =  2.5298

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          0         .8     0.00   1.000    -2.545957    2.545957
       _cons |         .4   1.131371     0.35   0.747    -3.200527    4.000527
------------------------------------------------------------------------------
We see r = 0 and R, or Multiple R, is 0, indicating no association. Also, the slope (the regression coefficient for x) is 0. A slope of 0 is consistent with the idea that the x variable provides no information about the y variable, since y does not change (is not affected) as x changes.
In this situation, the best guess of what the y value will be for any given x is just the mean of y, since the mean is the most likely value if you have to make a guess, at least under the assumption that the y variable has a normal distribution. The regression equation was y = a + bx = 0.4 + 0x = 0.4. Let’s see if the mean of y is 0.4,
sum y

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           y |         5          .4     2.19089         -2          2
As expected, the mean is 0.4, consistent with the regression model with zero slope.
In real datasets, we don’t see the data line up with zero slope this nicely. Due to natural variation in study subjects, such as biologic variation, the “no association” situation looks more like a blob with no discernible pattern. To illustrate this, we will create two columns, or variables, of normally distributed random numbers and then correlate them. Since they are simply random numbers, the correlation should come out to be approximately r = 0.
clear
set obs 100
set seed 333 // set random number seed
gen y = rnormal(5,2) // normal distribution mean=5, SD=2
gen x = rnormal(5,2)
twoway (scatter y x)(lfit y x), aspectratio(1)
pwcorr y x, obs sig
regress y x
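As a quick check, a sketch using correlate (which stores the correlation in r(rho)) shows that the estimate lands near, but not exactly at, zero:

* sketch: display the saved correlation coefficient
correlate y x
display r(rho)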