LINEAR REGRESSION

Below are analyzed the birth weights (grams, “weight”) and weeks of gestation (“age”) of 24 infants.

Males are denoted by sex=1, and females by sex=0. The data can be copied from the table shown at the end of this handout.

Of interest is the relationship between age (independent, predictor, or “x” variable) and weight (dependent, response, or “y” variable). Suppose a scatter plot of the data is created with the weight variable on the y-axis and age on the x-axis. Furthermore, suppose you were to draw a straight line through the points to describe the relationship between the two variables.

Stat Regression  Fitted Line Plot

Response (Y): weight

Predictor (X): age

What is the slope of this line? (Answer: 115.528.) What is the y-intercept of this line? (Answer: -1484.98.) The slope tells us that, on average, a baby weighs about 115.5 grams more than a baby with one week less of gestation. The y-intercept is not of concern here; it is used just to draw a sensible straight line. A 0-week term baby obviously would not weigh -1485 grams (a midsize baby that floats to the ceiling?).
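
To check these numbers outside Minitab, here is a minimal sketch of the same least-squares fit in Python. It assumes numpy is installed; the array names are mine, and the data are the 24 infants listed at the end of this handout.

import numpy as np

# age (weeks of gestation) and weight (grams) for the 24 infants
# (copied from the data table at the end of this handout)
age = np.array([40, 38, 40, 35, 36, 37, 41, 40, 37, 38, 40, 38,
                40, 36, 40, 38, 42, 39, 40, 37, 36, 38, 39, 40])
weight = np.array([2968, 2795, 3163, 2925, 2625, 2847, 3292, 3473, 2628, 3176, 3421, 2975,
                   3317, 2729, 2935, 2754, 3210, 2817, 3126, 2539, 2412, 2991, 2875, 3231])

# degree-1 polynomial fit = least-squares straight line; returns [slope, intercept]
slope, intercept = np.polyfit(age, weight, 1)
print(slope, intercept)  # approximately 115.528 and -1484.98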

How is the line decided? By “least squares”. Imagine a rubber band connecting each point to a stick (the line), where the pull of each rubber band is the square of the distance from the point to the stick; e.g., a 3-inch distance exerts 9 units of pull while a 2-inch distance exerts only 4. The line is where the stick would reach equilibrium. The vertical distances between the fitted line and the points are called “residuals”.

Is there a way to tell how good our estimate of the slope is? And are there hypothesis tests? Yes and yes. Ho: the slope of the line fit to the population is 0. Ha: the slope of the line fit to the population is not 0. That is, Ho: there is no relationship (“association”) between the X and Y variables; Ha: there is a relationship. The standard error gives us an idea of how good our estimate of the slope is.

Stat Regression  Regression

Response: Weight

Predictors: Age

Fitted Line Plot: weight versus age

Regression Analysis: weight versus age

The regression equation is

weight = - 1485 + 116 age

Predictor      Coef   SE Coef      T      P
Constant    -1485.0     852.6  -1.74  0.096
age          115.53     22.10   5.23  0.000

Ignore the ANOVA table and focus on the above.

Of importance are the coefficient estimates (Constant = y-intercept = -1485, age = slope = 115.5), the standard error estimate for the slope (22.10), and the p-value (P < 0.001). Because the coefficient estimate is more than 2 standard errors from 0 (here it is more than 5 standard errors: 115.53/22.10 ≈ 5.2), we have good evidence that the slope is not 0. This is summarized by a very, very small p-value.
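
As a quick check on the output, the T column is just Coef divided by SE Coef. A two-line sketch, with the values copied from the table above:

# t statistic for the slope: coefficient estimate divided by its standard error
print(115.53 / 22.10)  # about 5.23, matching the T column; P is the corresponding two-sided tail probability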

What good is all this? (1) Good for describing the relationship between the predictor and response variables. (2) Good for predicting the y-variable from an x-value, as long as the x-value is not extrapolated outside the range of the original x values from which the line was fit; e.g., don’t use this to predict a 55-week term baby’s weight.

(1) We expect about a 115.5-gram increase in weight for each additional week of gestation; e.g., a 40-week term baby would weigh, on average, about 115.5 grams more than a 39-week term baby.

(2) How much do you expect a 40-week term baby, ignoring gender, to weigh?

Answer: -1485.0 + 115.53*40 = 3136.2, or about 3136 grams.
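
The same prediction in code, with the coefficients copied from the regression output:

# expected weight of a 40-week term baby, ignoring gender
print(-1485.0 + 115.53 * 40)  # 3136.2 grams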

Data:

age weight sex
40 2968 1
38 2795 1
40 3163 1
35 2925 1
36 2625 1
37 2847 1
41 3292 1
40 3473 1
37 2628 1
38 3176 1
40 3421 1
38 2975 1
40 3317 0
36 2729 0
40 2935 0
38 2754 0
42 3210 0
39 2817 0
40 3126 0
37 2539 0
36 2412 0
38 2991 0
39 2875 0
40 3231 0

Simple Linear Regression

Suppose a straight line is to be fit to the points $(x_i, y_i)$, where $i = 1, \ldots, n$; $y$ is called the dependent variable and $x$ is called the independent variable, and we wish to predict $y$ from $x$. Sometimes $y$ is called the response or outcome variable and $x$ the predictor or explanatory variable. A straight line takes the form $y = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept and $\beta_1$ is the slope. The betas are estimated from the data by a method called least squares.

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the predicted value of $y$ using the beta estimates. Least squares finds the values of the betas which minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, thus the name least squares. The model’s residual for each observation is $e_i = y_i - \hat{y}_i$, and thus least squares minimizes the sum of the squared residuals (Residual Sum of Squares = RSS). The mathematics behind finding the optimal betas is a hassle to work through by hand, but it is calculated instantly by all statistical packages.
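
The solution can in fact be written in closed form: $\hat{\beta}_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. A minimal Python sketch of these formulas, using the age/weight data from this handout (array names are mine):

import numpy as np

age = np.array([40, 38, 40, 35, 36, 37, 41, 40, 37, 38, 40, 38,
                40, 36, 40, 38, 42, 39, 40, 37, 36, 38, 39, 40])
weight = np.array([2968, 2795, 3163, 2925, 2625, 2847, 3292, 3473, 2628, 3176, 3421, 2975,
                   3317, 2729, 2935, 2754, 3210, 2817, 3126, 2539, 2412, 2991, 2875, 3231])

# closed-form least-squares estimates for weight = b0 + b1*age
b1 = np.sum((age - age.mean()) * (weight - weight.mean())) / np.sum((age - age.mean()) ** 2)
b0 = weight.mean() - b1 * age.mean()

residuals = weight - (b0 + b1 * age)  # e_i = y_i - yhat_i
rss = np.sum(residuals ** 2)          # Residual Sum of Squares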

The statistics are based upon the idea that each $y$ is created via the model plus some noise term: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$. The statistician does not know the true values of the betas nor the noise; only the $y$’s and $x$’s are observed. Distributional assumptions about the noise term determine what statistics are valid regarding the beta estimates.

Assuming the above model and that $E[\epsilon_i] = 0$, the least-squares estimates for the betas are unbiased. That is, if the mean of the noise distribution is 0, then the estimates for the betas will be, on average, correct. The estimates for the betas may be unbiased, but because of the noise term it is unlikely that your estimates actually equal the true beta values.

Assuming the above and that the noise has constant variance, $\mathrm{Var}(\epsilon_i) = \sigma^2$, and that the noise values are independent of each other, unbiased estimates of the beta variances can be made. These estimates are called the standard error estimates of the betas. Although the true variance $\sigma^2$ is not known, yet is needed for the standard error estimates, we can use an unbiased estimate of the noise variance: $\hat{\sigma}^2 = \mathrm{RSS}/(n-2)$, where $n-2$ is the degrees of freedom.
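
Continuing the sketch above, the slope’s standard error follows directly from these quantities via the standard formula $\mathrm{SE}(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / \sum_i (x_i - \bar{x})^2}$ (the formula itself is not derived in this handout):

n = len(age)
sigma2_hat = rss / (n - 2)  # unbiased estimate of the noise variance, n-2 degrees of freedom
se_b1 = np.sqrt(sigma2_hat / np.sum((age - age.mean()) ** 2))
print(se_b1)  # about 22.10, matching SE Coef in the Minitab output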

If the noise values are also distributed as independent normal random variables, then the estimates for the betas are also normally distributed. Consequently, the Student’s t distribution can be used to construct confidence intervals and hypothesis tests for the betas.
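
A sketch of that confidence interval and test, continuing from the code above and assuming scipy is available:

from scipy import stats

t_crit = stats.t.ppf(0.975, df=n - 2)             # two-sided 95% critical value on n-2 df
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # 95% confidence interval for the slope
t_stat = b1 / se_b1                               # test statistic for Ho: slope = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value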

Scatter plots of the original data, of the residuals against the $\hat{y}_i$’s, and of the residuals against the independent variable are helpful for assessing the model’s assumptions and fit. Plots of the residuals against the $x$ values help show any systematic misfit or ways in which the data do not satisfy the model assumptions. Ideally, there should be no pattern between the residuals and the $x$ values.
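
Such a residual plot takes a few lines with matplotlib (assumed installed), continuing from the code above:

import matplotlib.pyplot as plt

plt.scatter(age, residuals)     # residuals against the x variable (age)
plt.axhline(0, linestyle="--")  # reference line; residuals should scatter evenly around it
plt.xlabel("age (weeks)")
plt.ylabel("residual (grams)")
plt.show()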