QU1 Lecture Notes - Module 7

Module 7: Regression & Correlation

Simple Linear Regression:

  • Once you know there is an association between two or more variables, and the association is strong enough, you can make a statistically significant prediction of the DV from the IV(s).
  • We can plot the paired observations of the dependent & IVs in a scatter diagram (a chart used to represent the relationship between two variables). The relationship between the variables can take many forms; straight-line (linear) or curvilinear AND positive (direct) or negative (inverse) depending on the slope.
  • Simple Linear Regression is when a single numerical independent variable is used to predict the dependent variable (e.g. years of experience & salary).
  • A regression equation is a mathematical equation, representing a unique line on a graph that expresses the relationship between two variables (“predicting equation” or “line of best fit”).

a) The Dependent Variable ("DV") or response variable is the variable being predicted. Scaled on the Y-axis.

b) The Independent Variable ("IV") is the variable used as the predictor. Scaled on the X-axis.

c) Use X to predict Y based on past experience (paired data). The regression equation is unique!

  • The linear relationship between two variables is given by the general straight-line equation: Ŷ = a + bX

Where X is the IV, Y is the DV, Ŷ is the predicted value of Y for a given X

a is the Y-intercept, the value of Y when X = 0 (where the line crosses the Y-axis).

b is the slope of the line; it measures the rate of change in Y per unit change in X (rise over run).
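For example (hypothetical numbers, not from the text): if a = 20 and b = 3, then for X = 10 the prediction is Ŷ = 20 + 3(10) = 50. Each additional unit of X adds 3 to the predicted Y, and the line crosses the Y-axis at 20.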

  • The simple linear regression model (aka first-order model) is: Yi = β0 + β1Xi + εi

Where 0 is the Y-intercept 1 is the slope for the population. i is the random error in the regression analysis.  is the random term that changes this from a deterministic model to a probabilistic model.

  • Least Squares Principle computes the regression equation that minimizes the sum of the squared vertical deviations from the "line of best fit" (minimizes the distance from the actual Y values to the predicted Y values).
  • The regression equation is: Ŷ = b0 + b1Xi

a) Xi is the value of the IV for observation i; Ŷi is the predicted value of Y for observation i.

b) b0 is the Y-intercept (i.e. "a"): b0 = (ΣY)/n - b1(ΣX)/n = Ȳ - b1X̄

b1 is the slope of the line (i.e. "b"):

b1 = [Σ(XY) - (ΣX)(ΣY)/n] / [Σ(X²) - (ΣX)²/n] = SCxy / SSx

OR b1 = {[Σ(XY) - (ΣX)(ΣY)/n] / (n - 1)} / {[Σ(X²) - (ΣX)²/n] / (n - 1)} = cov(x,y) / s²x

c) b0 & b1 are called regression coefficients. Know how to use the calculator (see the example below).

d) Note: You can also have interval estimates for Ŷ and b1 (use the computer – see p. 670).

e) The text sometimes uses MS Excel spreadsheets for calculations, so know how to interpret the output.
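A minimal Python sketch (the data values are hypothetical, not from the text) showing how b0 and b1 fall straight out of the formulas in b); the same numbers come from a calculator's regression mode or Excel's SLOPE and INTERCEPT functions:

    # Least-squares slope and intercept from the formulas above (made-up data).
    x = [1, 2, 3, 4, 5, 6]              # IV, e.g. years of experience
    y = [30, 34, 41, 43, 48, 52]        # DV, e.g. salary in $000s

    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)   # slope = SCxy / SSx
    b0 = sum_y / n - b1 * sum_x / n                                 # intercept = mean(Y) - b1 * mean(X)

    y_hat = [b0 + b1 * xi for xi in x]   # predicted values on the line of best fit
    print(round(b0, 3), round(b1, 3))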

  • Linear regression is based on the following assumptions (see graph on board for illustration): LEVEL 3 only
  1. Normality of Error - For each value of X, there is a group of Y values, and these Y values are normally distributed. The means of these normal distributions of Y values all fall on the straight regression line. Ex. Will all 35 year olds have exactly the same income? No, some will have more and some will have less, but on average they will earn an amount that falls on the line of best fit (Expected Value of εi = 0).
  2. Homoscedasticity - The standard deviations of these normal distributions are equal (vs. Heteroscedasticity).
  3. Independence of Errors - The Y-values are independent. Ex. if one person makes $35K/yr., it will not prevent someone else from making $35K. This also implies that the errors between actual and predicted values of Y for each X are independent (Autocorrelation when error terms are correlated).
  4. Relevant Range. We may predict within the range of observed X values, but should not extrapolate beyond that range unless you assume that the fitted relationship holds outside the range. Ex. strength vs. age.
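A rough diagnostic sketch (same made-up data as above, using numpy's polyfit for the least-squares fit) for eyeballing assumptions 1-3 through the residuals:

    import numpy as np

    # Hypothetical paired observations (not from the text).
    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([30, 34, 41, 43, 48, 52], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)        # least-squares slope and intercept
    residuals = y - (b0 + b1 * x)       # e_i = Y_i - Y-hat_i

    print(residuals.mean())             # ~0: expected value of the error term is zero
    # Plotting residuals against x (e.g. with matplotlib) would expose a funnel shape
    # (heteroscedasticity) or a systematic pattern (non-linearity / autocorrelation).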

Assessing the model:

  • To examine how well the IV predicts the DV, we need to first develop several measures of variation:

SSy = SSR + SSE

a) The total sum of squares (SSy) is a measure of the variation of the observed Y values around their mean (total variation). SSy = SST = Σ(Yi - Ȳ)² = Σ(Y²) - (1/n)(ΣY)² = SSR + SSE

b) The regression sum of squares (SSR) is the explained variation attributed to the relationship between X & Y. It represents the sum of the squared differences between each predicted value of Y and the mean of Y.

SSR = Σ(Ŷi - Ȳ)² = b1[Σ(XY) - (1/n)(ΣX)(ΣY)] = b1(n - 1)cov(x,y) = SSy - SSE

c) The error sum of squares (SSE) is the unexplained variation that is attributed to factors other than the relationship between X & Y. Equal to the sum of the squared residuals.

Residual or Random Error is the difference between each predicted value and observed value of Y (Yi - Ŷi).

SSE = Σ(Yi - Ŷi)² = Σ(Y²) - b0(ΣY) - b1(ΣXY) = SSy - SSR
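A short sketch (same hypothetical data) confirming the decomposition SSy = SSR + SSE:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([30, 34, 41, 43, 48, 52], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x

    ss_y = np.sum((y - y.mean()) ** 2)      # total variation (SST)
    ss_r = np.sum((y_hat - y.mean()) ** 2)  # explained variation (SSR)
    ss_e = np.sum((y - y_hat) ** 2)         # unexplained variation (SSE)

    print(ss_y, ss_r + ss_e)                # the two totals agree (up to rounding)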

  • The standard error of the estimate is a measure of the variation around the regression line.

a)It is similar to the sample standard deviation from single variable statistics, except it is based on squared deviations from the regression line instead of from the mean.

b)It is reported in the same units as the DV, thus it is an absolute measure (depends on scale of measurement).

c)A small standard error of the estimate indicates that the regression line fits the data points closely.

d) Se = Syx = √{[Σ(Y²) - b0(ΣY) - b1(ΣXY)] / (n - 2)} = √[SSE / (n - 2)] (use this formula when possible) = √MSE

  • The Coefficient of Determination (R²) is a measure of the usefulness of the regression equation (↑R² means better predictions). It is the ratio of the explained variation (SSR) to the total variation (SSy), i.e. the percentage of total variation in the DV that is explained by the variability in the IV; it is therefore a relative measure (so you can compare two regression models even if their scales of measurement are different). Ranges from 0 to 1. Easier to interpret than the coefficient of correlation (R) because the scale is a percentage.

R² = Explained variation / Total variation in Y = [cov(x,y)]² / (s²x · s²y) = SSR / SSy
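A small sketch (hypothetical data) computing the standard error of the estimate and R², and checking that for simple regression R² equals the squared correlation between X and Y:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([30, 34, 41, 43, 48, 52], dtype=float)
    n = len(x)

    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    ss_e = np.sum((y - y_hat) ** 2)
    ss_y = np.sum((y - y.mean()) ** 2)

    s_e = np.sqrt(ss_e / (n - 2))            # standard error of the estimate (same units as Y)
    r2 = 1 - ss_e / ss_y                     # coefficient of determination = SSR / SSy
    print(round(s_e, 3), round(r2, 4))
    print(np.corrcoef(x, y)[0, 1] ** 2)      # squared correlation: matches R2 here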

  • Standard Deviation of the Slope (Sb1): Sb1 = Se / √[(n - 1)s²x] = Se / √SSx (Note: (n - 1)s²x = SSx = Σ(X²) - (ΣX)²/n) (* used for the t-test below)

  • Analysis of Variance (ANOVA): The comparison of several sample variances to test between-group differences.

ANOVA table:

Source of Variation     d.f.        Sum of Squares    Mean Square             F
Regression              k           SSR               MSR = SSR/k             MSR/MSE
Error                   n - k - 1   SSE               MSE = SSE/(n-k-1)
Total                   n - 1       SST

Where k = number of independent (explanatory) variables in the regression model & degrees of freedom (d.f.) adjust for the sample size (n). The ANOVA table is used for both simple & multiple regression.
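A sketch (same hypothetical data) that fills in the ANOVA table entries for a simple regression, where k = 1:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([30, 34, 41, 43, 48, 52], dtype=float)
    n, k = len(x), 1                        # one IV in simple regression

    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)

    msr, mse = ssr / k, sse / (n - k - 1)   # mean squares
    print("Regression", k,         round(ssr, 2), round(msr, 2), round(msr / mse, 2))
    print("Error     ", n - k - 1, round(sse, 2), round(mse, 2))
    print("Total     ", n - 1,     round(ssr + sse, 2))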

t-test for significance of the slope:

  • We can determine the existence of a significant relationship between the X & Y variables (allowing us to make inferences about the population) by testing whether β1 (the true slope) is equal to zero.

H0: β1 = 0 (There is no linear relationship)

H1: β1 ≠ 0 (There is a linear relationship)

* Note: a different β1 can be hypothesized to test whether the slope is a value other than 0.

2-tailed t-test with n-2 degrees of freedom

Reject H0 if t > Critical Value (CV) or t < -CV

t = (b1 - β1) / sb1 = b1 / sb1 when β1 = 0 (β1 usually equals 0 for this test)
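A sketch of the slope t-test on the same hypothetical data, with scipy's linregress used as a cross-check (it reports the same slope and its two-tailed p-value):

    import numpy as np
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([30, 34, 41, 43, 48, 52], dtype=float)
    n = len(x)

    b1, b0 = np.polyfit(x, y, 1)
    sse = np.sum((y - (b0 + b1 * x)) ** 2)
    s_e = np.sqrt(sse / (n - 2))                # standard error of the estimate
    ss_x = np.sum((x - x.mean()) ** 2)
    s_b1 = s_e / np.sqrt(ss_x)                  # standard deviation of the slope

    t_stat = (b1 - 0) / s_b1                    # hypothesized beta1 = 0
    t_crit = stats.t.ppf(0.975, n - 2)          # two-tailed test at the 5% level
    print(round(t_stat, 2), round(t_crit, 2))   # reject H0 if |t_stat| > t_crit

    print(stats.linregress(x, y).pvalue)        # cross-check via scipy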

Correlation Analysis (ex. can be used to measure relationships between investments for diversification):

  • Correlation Analysis is concerned with measuring the degree (strength) of association between two variables. For a population, it is usually measured by the population coefficient of correlation (ρ). ρ ("rho") ranges from -1 for a perfect negative correlation (negative slope) to +1 for a perfect positive correlation (perfectly predictable, positive slope). A ρ very close to +1 or -1 does not imply that one variable causes the other, just that you can predict the DV with confidence based on the IV. ρ = 0 means there is no association between the two variables (no predictive value).
  • There is no concept of IV & DV in correlation analysis (changes in the two variables may be closely related, but neither variable is designated as the predictor of the other).
  • The “Pearson” coefficient of correlation (R) is the square root of the coefficient of determination, where r takes the same sign as b1. You may also use the Excel printout to determine r based on:

R = cov(x,y) / (sx sy) = √(SSR / SST), where sx = √{[Σ(X²) - (ΣX)²/n] / (n - 1)} and sy = √{[Σ(Y²) - (ΣY)²/n] / (n - 1)} (sx & sy are on the calculator)

  • Testing for the Existence of Correlation (significance of the coefficient of correlation):

H0: ρ = 0 (there is no correlation)      H0: ρ ≥ 0      H0: ρ ≤ 0

H1: ρ ≠ 0 (there is a correlation)       H1: ρ < 0      H1: ρ > 0

Use the Student's t-distribution with n - k - 1 = n - 2 d.f. for simple regression (n = # of paired observations).

t = (r - ρ) / √[(1 - r²) / (n - 2)] = r√(n - 2) / √(1 - r²) when ρ = 0   * will always give the same decision as the t-test for β1
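A sketch of the correlation test on the same hypothetical data; scipy's pearsonr returns r and its two-tailed p-value, and the manual t-statistic uses the formula above:

    import numpy as np
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([30, 34, 41, 43, 48, 52], dtype=float)
    n = len(x)

    r, p_value = stats.pearsonr(x, y)                   # sample correlation and its p-value
    t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # the same test done "by hand"
    print(round(r, 4), round(t_stat, 2), round(p_value, 4))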

Multiple Regression & Correlation Analysis:

  • Multiple regression is an extension of simple linear regression where you find the influence of 2 or more Independent Variables on the Dependent Variable (graphically more than 2 dimensions, so use Excel output).
  • A stepwise regression is a sequence of equations where IVs are added to the regression equation in the order that will ↑ R2 the fastest (most significant IVs first - partial F-test criterion). Only IVs with significant coefficients are included. Parsimony means that we wish to have a useful regression model with the least possible IVs.
  • Multiple regression and multiple correlation analysis are based on the following assumptions:

a) There is a linear relationship between the DV and each of the IVs and also a straight-line relationship between the DV and all IVs combined (i.e. in 3 or more dimensions).

b)The DV is continuous and of interval or ratio scale (high level data, usually resulting from measurements).

c)The residual variation is the same for all fitted values of Y and these residuals are normally distributed with a mean of zero (homoscedasticity). When this assumption is violated, it is called heteroscedasticity.

d)Successive observations of the DVs are uncorrelated & independent. When violated, called autocorrelation - which means that the Y's are related (usually occurs if relationship observed over a long time).

e)The IVs are not highly correlated with each other. When this assumption is violated, it is called multicollinearity - which means that the X's are related.

f)Note: There are statistical tests to verify that these assumptions hold true.

  • The general form of the multiple regression equation ("predicting equation") is:

Population → Yi = β0 + β1X1 + β2X2 + … + βnXn + ε

Sample → Ŷ = b0 + b1X1 + b2X2 + … + bnXn

a)This multiple regression equation has been extended to "n" IVs. There can be an unlimited number of IVs.

b)(epsilon) is the random error in Y for observation i.

c)Yi: dependent variable (or response variable) is the variable being predicted.

d)Xi's: independent or explanatory variables. Use all Xi's combined to predict Y.

e)b0: y-intercept.

f)bn: net change in Y for each unit change in Xn, holding all other Xi's constant.

g)b1, b2, …, bn are called net regression coefficients.

h)The least squares criterion is used to develop the regression equation. The regression equation is unique!

i) Microsoft Excel is needed to determine b0 and the net regression coefficients (understand computer output!) - see also the sketch below.
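The notes rely on Excel output, but as an illustration here is a minimal least-squares sketch in Python (the two-IV data set is made up for this example):

    import numpy as np

    # Made-up observations: two IVs (X1, X2) and one DV (Y).
    X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    X2 = np.array([5, 3, 6, 4, 8, 7, 9, 6], dtype=float)
    Y  = np.array([12, 11, 17, 15, 23, 22, 27, 22], dtype=float)

    # Design matrix with a leading column of 1s for the intercept b0.
    A = np.column_stack([np.ones_like(X1), X1, X2])
    coeffs, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
    b0, b1, b2 = coeffs                    # net regression coefficients
    print(b0, b1, b2)                      # Y-hat = b0 + b1*X1 + b2*X2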

3 measures of the effectiveness of the multiple regression equation:

  1. (Multiple) Standard Error of the Estimate:

Se = √[Σ(Y - Ŷ)² / (n - k - 1)] = √[SSE / (n - k - 1)] = √MSE (see CGA notes for an output example)

a)k = the number of IVs. n = the number of observations in the sample.

b)This uses residual values (differences between actual and predicted values of Y).

c)This is similar to standard deviation for 1 variable and standard error of the estimate for 2 variables

d)Good: It is measured in the same units (absolute) as the DV.

e)Bad: It is difficult to determine what is a large value of the multiple standard error of the estimate.

  2. (Multiple) Coefficient of Correlation: R = √R² = √(SSR / SSy)

a)R is similar to the "simple" coefficient of correlation (R)

b)R ranges from 0 to 1. There is more than 1 IV affecting Y. One may increase and another decrease while Y increases, so we cannot express this as a positive or negative number (different from r in this way)

c)Strong associations are close to 1. Weak relationships are close to 0.

  3. (Multiple) Coefficient of Determination: R² = SSR / SSy

R2 may range from 0 to 1. Computed value is a ratio, so relative measure.

It shows the fraction of the variation in Y that is explained by variation in the set of IVs (proportion of regression error).

The coefficient of multiple non-determination (1 - R2 = SSE / SSy) is the proportion of variation in Y not accounted for by the IVs (unexplained error).

The adjusted R2 is adjusted to reflect the number of IVs & the sample size. It is needed when comparing 2 or more regression models that predict the same DV, but have different numbers of IVs.

R²adj = 1 - [(1 - R²)(n - 1) / (n - k - 1)]
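A one-line sketch of the adjustment (the R², n, and k values are hypothetical):

    # Adjusted R-squared from R-squared, sample size n, and number of IVs k (made-up numbers).
    r2, n, k = 0.81, 30, 3
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(round(r2_adj, 4))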

  • Recall that an ANOVA ("analysis of variance") table shows the variation in Y that is explained by the regression equation as well as that which cannot be explained.

ANOVA table:

Source of Variation     d.f.        Sum of Squares    Mean Square             F
Regression              k           SSR               MSR = SSR/k             MSR/MSE
Error                   n - k - 1   SSE               MSE = SSE/(n-k-1)
Total                   n - 1       SSy

Where SSy = SSR + SSE, MSR = mean square regression error & MSE = mean square error

  • A Correlation Matrix shows all possible simple correlation coefficients. It is used to locate correlated IVs (multicollinearity). It is a table that shows how strongly each IV is correlated to the DV and the other IVs.

Global Significance Test (F-test): *note: also used to test β1 in simple linear regression (same result as t-test)

  • Characteristics of the F-distribution:
  1. Continuous
  2. Positively skewed
  3. Ranges from 0 to plus infinity (cannot be negative)
  4. Based on 2 parameters: d.f. in the numerator & d.f. in the denominator (family of F-distributions)
  • A Global Significance Test (F-test for the entire regression model) is used to investigate whether ANY of the IVs have significant net regression coefficients (i.e. H0 says the IVs are of no use in predicting Y). See pages 660-667.

H0:1 = 2 = … = n = 0 (All the net regression coefficients are zero no linear relationship)

H1: Not all i's are zero (at least one of the regression coefficients is not zero  there is a linear relationship between the DV and at least one of the IVs)

1-tailed F-test for the entire multiple regression model. The test statistic is the F distribution, which is positively skewed with the critical value and rejection region for a set level of significance in the right tail.

Decision Rule: Reject H0 if the computed F is greater than the critical value FU with "k" degrees of freedom in the numerator and "n-k-1" d.f in the denominator (k = # of IVs).

F = (SSR / k) / [SSE / (n - k - 1)] = MSR / MSE   * found on the ANOVA table
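A sketch of the decision rule (hypothetical ANOVA-table numbers; scipy supplies the F critical value):

    from scipy import stats

    # Made-up sums of squares for a model with k = 3 IVs and n = 30 observations.
    ssr, sse, k, n = 840.0, 360.0, 3, 30
    msr = ssr / k                               # mean square regression
    mse = sse / (n - k - 1)                     # mean square error
    f_stat = msr / mse

    f_crit = stats.f.ppf(0.95, k, n - k - 1)    # critical value at the 5% level
    print(round(f_stat, 2), round(f_crit, 2))   # reject H0 if f_stat > f_crit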

Individual t-test for each Independent Variable:

  • Once the F-test is "passed", a test for the slope of each individual IV is used to determine which IVs have non-zero net regression coefficients and are therefore significant. Variables whose net regression coefficients are not significantly different from zero are dropped from further analysis.

H0: βk = 0 (the regression coefficient for IV "k" is zero) * note: can be 1-tailed.

H1: βk ≠ 0

2-tailed t-test.

Decision Rule: Reject H0 if the computed t is greater than the positive critical value or less than the negative critical value with "n-k-1" degrees of freedom (k = # of IVs).

t = (bk - βk) / sbk = bk / sbk (the t-value for each bk can usually be found on the computer output; sbk is the standard error of the regression coefficient bk)

Market Model:

  • One of the most important applications of regression. The rate of return on a stock is linearly related to the rate of return on the overall market (index), with β1 measuring how sensitive the stock's return is to changes in the market. If β1 > 1, then the stock is more volatile than the market. Use regression to estimate β1 with b1 (a measure of systematic risk).

R = β0 + β1Rm + ε (where R is the stock's rate of return and Rm is the market's rate of return)
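A sketch of estimating β1 with b1 by regressing a stock's returns on the market's returns (the monthly return figures are made up):

    import numpy as np

    # Hypothetical monthly rates of return for one stock and the market index.
    market = np.array([0.02, -0.01, 0.03, 0.015, -0.02, 0.04, 0.01, -0.005])
    stock  = np.array([0.03, -0.02, 0.05, 0.020, -0.03, 0.06, 0.01, -0.010])

    b1, b0 = np.polyfit(market, stock, 1)   # b1 estimates beta (systematic risk)
    print(round(b1, 2), round(b0, 4))       # b1 > 1 suggests the stock is more volatile than the market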

10/16/18