REGRESSION AND CORRELATION

INTRODUCTION

In simple correlation and regression studies, the researcher collects data on two numerical (quantitative) variables to see whether a relationship exists between them. The independent variable (X) is the variable in the regression that can be controlled or manipulated. The dependent variable (Y) is the variable in the regression that cannot be controlled or manipulated.

REGRESSION

The term regression analysis refers to studies of the relationships between variables. Regression analysis is a technique for quantifying the relationship between a criterion variable (also called a dependent variable) and one or more predictor variables (independent variables). Put another way, regression analysis is a technique whereby a mathematical equation is fitted to a set of data; it describes the relationship between two variables. It is a statistical method often used to investigate possible cause-and-effect relations among variables. A line of best fit that is independent of individual judgement and is drawn mathematically is called a regression line. Once computed, a regression line equation can be graphed and used to estimate previously unknown values.

The reasons for computing a regression line are:

(i)  To obtain a line of best fit that is free of subjective judgement; the regression line improves our estimates.

(ii)  The regression equation can be used to make predictions within the given range of the data, that is, to make interpolations.

(iii)  The reliability of estimates made from such a line can be measured mathematically.

The purpose of the regression line is to enable the researcher to see the trend and make predictions on the basis of the available data. Values of Y will be predicted from the values of X; hence the closer the points are to the line, the better the fit and the better the predictions will be.

Each observation of bivariate data can be viewed as a point (x, y), where x is the explanatory or independent variable and y is the dependent or response variable. It is important to determine which of the variables in question is the dependent variable and which is the independent variable. The starting point in regression analysis is the construction of a scatter diagram (scatter graph).
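As an illustration, the sketch below draws a scatter diagram with matplotlib. The data values are made up for the example and are not taken from the tables that appear later in this section.

```python
# A minimal scatter-diagram sketch; the data are hypothetical.
import matplotlib.pyplot as plt

x = [2, 3, 4, 5, 6]        # independent (explanatory) variable
y = [60, 70, 65, 90, 85]   # dependent (response) variable

plt.scatter(x, y)
plt.xlabel("X (independent variable)")
plt.ylabel("Y (dependent variable)")
plt.title("Scatter diagram")
plt.show()
```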

There are normally two regression line equations for any set of bivariate data:

(1)  Regression of Y on X. The line equation is given by Y = a + bX, where the line is used to predict/estimate the value of Y that follows from any value of X.

(2)  Regression of X on Y. The equation is given by X = a + bY. The line equation is used to predict/estimate the value of X that follows from any value of Y.

The guide on the choice of regression line equation is that one should always use the line that has the variable to be estimated on the left-hand side of the equation. In other words, if you want to estimate X use X = a + bY, and if you want to estimate Y use Y = a + bX. In our analysis we are going to use the Y on X regression equation, that is, Y = a + bX. In this equation a is the Y-intercept term, while b is called the regression coefficient or the slope of the graph.
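To make the distinction concrete, the sketch below fits both lines to a small set of hypothetical values using numpy. In general the two fitted lines differ, which is why the choice matters.

```python
# Sketch: fitting both regression lines to hypothetical data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

b_yx, a_yx = np.polyfit(x, y, 1)  # Y on X: slope b, intercept a in Y = a + bX
b_xy, a_xy = np.polyfit(y, x, 1)  # X on Y: slope b, intercept a in X = a + bY

print(f"Y on X: Y = {a_yx:.3f} + {b_yx:.3f}X")  # use this to estimate Y
print(f"X on Y: X = {a_xy:.3f} + {b_xy:.3f}Y")  # use this to estimate X
```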

In single-equation regression models, one variable, called the dependent variable or regressand, is expressed as a linear function of one or more other variables, called the explanatory, independent or regressor variables. It is implicitly assumed that causal relationships, if any, between the independent and dependent variables flow in one direction only, namely from the explanatory variables to the regressand.

Although regression analysis deals with the dependence of one variable on other variables, it does not necessarily imply causation. A statistical relationship may be very strong, but it can never by itself establish a causal connection; our ideas of causation must come from outside statistics, ultimately from some theory or other. For example, there is no statistical reason to assume that rainfall depends on crop yield. In regression analysis we try to estimate or predict the average value of one variable on the basis of fixed values of other variables. The regression equation is given by Y = a + bX, or in general functional form Y = f(X), or

Yi = β1 + β2Xi + μi

where μi is the stochastic (random) error term; it can also be looked at as a surrogate for all omitted variables. β1 and β2 are the regression coefficients: β1 is the intercept and β2 is the slope coefficient. The population regression line is

E(Y|Xi) = β1 + β2Xi

and the disturbance is μi = Yi - E(Y|Xi), where Yi is the observed value and E(Y|Xi) is the predicted value.
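This decomposition can be illustrated numerically. The sketch below simulates data from Yi = β1 + β2Xi + μi with assumed parameter values, then recovers the disturbances as μi = Yi - E(Y|Xi).

```python
# Sketch: simulating Yi = beta1 + beta2*Xi + mui (parameter values are assumed).
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 2.0, 0.5                  # assumed population intercept and slope
X = np.arange(1.0, 11.0)                 # fixed values of the regressor
mu = rng.normal(0.0, 1.0, size=X.size)   # stochastic error term, mean 0
Y = beta1 + beta2 * X + mu               # observed values

E_Y = beta1 + beta2 * X                  # E(Y|Xi), the population regression line
print(np.allclose(Y - E_Y, mu))          # True: mui = Yi - E(Y|Xi)
```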

Regression analysis techniques assume that

(1) Each item of data is independent of the others.

(2) Data measurements are unbiased.

(3) The error variance is constant over the entire range of data, rather than larger in some parts of the data range and smaller in others.

(4) There is no autocorrelation between the disturbances (or errors).

The probability distribution of the random error term e (or μ) is assumed to have the following properties (rough numerical checks are sketched after the list):

1.  The mean of the probability distribution of e (or μ) is 0; that is, the average of the values of e over an infinitely long series of experiments is 0 for each setting of the independent variable X: E(μi|Xi) = 0, the zero mean value of the disturbance μi. This assumption implies that the mean value of Y for a given value of X is E(Y) = β1 + β2X.

2.  The variance of the probability distribution of e (or μ) is constant for all settings of the independent variable X: Var(μi|Xi) = σ², i.e. equal variance of the error term for all observations (homoscedasticity). For a straight-line model, this assumption means that the variance of e is equal to a constant, say σ², for all values of X.

3.  The probability distribution of e (or μ) is normal.

4.  The values of e (or μ) associated with any two observed values of Y are independent; that is, the value of e associated with one value of Y has no effect on the values of e associated with other Y values. In other words, there is no autocorrelation between the disturbances (errors).
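These assumptions can be probed informally once a line has been fitted. The sketch below, using numpy and scipy on hypothetical data, runs rough checks of each assumption on the fitted residuals; it is indicative only, not a formal diagnostic procedure.

```python
# Sketch: rough numerical checks of the error-term assumptions on residuals.
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

b, a = np.polyfit(x, y, 1)
e = y - (a + b * x)                      # residuals from the fitted line

print("mean of residuals:", e.mean())    # assumption 1: approximately 0
# Assumption 2 (homoscedasticity): compare spread in the two halves of the data.
print("variance, first vs second half:", e[:4].var(), e[4:].var())
# Assumption 3 (normality): Shapiro-Wilk test (small sample, indicative only).
print("Shapiro-Wilk p-value:", stats.shapiro(e).pvalue)
# Assumption 4 (no autocorrelation): lag-1 correlation of residuals.
print("lag-1 autocorrelation:", np.corrcoef(e[:-1], e[1:])[0, 1])
```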

Problems and difficulties can be encountered when using regression analysis and these can cause results to be inaccurate and misleading. The four problems are listed below:

(a)  An inadequate sample size may have been used to collect the data. The sample size should be at least two to three times the number of variables used in the regression equation, and preferably much larger.

(b)  The independent variables may have been poorly measured, may be in the wrong form, or may not be the right ones, i.e. those that have a direct effect on the dependent variable.

(c)  The independent variables may be highly correlated with each other. If two independent variables are perfectly correlated with each other (that is, r = +1.00), their effect will be the same as that of a single independent variable used twice in the same regression analysis.

(d)  The true relationship between the dependent variable and the independent variable(s) may not be linear, or it may have an unusual shape which cannot be analysed using regression techniques.

NB. The regression equation applies only to the range of the observed data, so an equation derived from those data should not be used for values smaller than the lowest observation or larger than the largest observation. For example, if the observed data range from 50 million to 100 million, the regression equation cannot be used for values in excess of 100 million or below 50 million. In other words, when predictions are made they are based on present conditions, or on the premise that present trends will continue. This assumption may or may not prove true in the future. For example, in 1979 some experts predicted that the US would run out of oil by the year 2003!
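One way to enforce this caution in practice is sketched below: a small helper (the name and interface are illustrative, not from the text) that refuses to apply the regression equation outside the observed range of X.

```python
# Sketch: refuse to use the regression equation outside the observed X range.
def predict_within_range(a, b, x, x_min, x_max):
    """Return a + b*x, but only for x inside the observed data range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} lies outside the observed range [{x_min}, {x_max}]; "
            "the regression equation should not be used for extrapolation."
        )
    return a + b * x

# With the text's example range of 50 to 100 million, a request for 120
# million would raise an error rather than return a misleading estimate.
```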

Standard methods of obtaining regression line equations

(1)  Inspection method - this involves drawing a scatter diagram and then fitting the line of best fit where one feels it ought to be.

(2)  Semi-averages method - this involves splitting the data into two equal groups, plotting the mean point for each group and joining the two points by means of a straight line.

(3)  Least squares method - this minimises the total of the squared deviations.

Least squares method

The fitted regression line can be viewed as a "predictor line": for each Xi, the regression line "predicts" a value Ŷi. The difference between the observed Yi and its predicted equivalent is Yi - Ŷi, and this difference is called a "residual" or error ei. OLS (ordinary least squares) minimizes the sum of the squares of all the residuals, that is, it minimizes Σei² = Σ(Yi - Ŷi)².

The expression Ŷ = a + bX is the estimated regression line.

For each observation the model is Yi = a + bXi + μi. The estimated value of Yi is the fitted value Ŷi = a + bXi, where a and b are the estimates of the population parameters β1 and β2. The residual (error term) is ei = Yi - Ŷi = Yi - (a + bXi).

The sum of the squares of the deviations (errors) of the Y values about their predicted values for all the n data points is SSE = Σ(Yi - Ŷi)² = Σei². The quantities a and b that make the SSE a minimum are called the least squares estimates of the population parameters, and the prediction equation Ŷ = a + bX is called the least squares line. The least squares line is one that has the following two properties (verified numerically in the sketch after the list):

1.  The sum of the errors (SE) equals 0.

2.  Sum of squared errors (SSE) is smaller than that for any other straight line model.
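Both properties can be verified numerically. The sketch below fits a line with numpy.polyfit (which computes the least squares fit) and checks that the residuals sum to essentially zero and that deliberately perturbed lines all give a larger SSE; the data are the coded-year sales figures used later in Table 37.

```python
# Sketch: checking the two properties of the least squares line.
import numpy as np

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([60.0, 100, 70, 90, 80])

b, a = np.polyfit(x, y, 1)            # least squares slope and intercept
e = y - (a + b * x)                   # residuals

print("sum of errors:", round(e.sum(), 10))   # property 1: essentially 0
sse = (e ** 2).sum()
for da, db in [(1, 0), (-1, 0), (0, 0.5), (0.5, -0.5)]:
    sse_other = ((y - ((a + da) + (b + db) * x)) ** 2).sum()
    print(sse_other > sse)            # property 2: True for every other line
```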

Formulas for the least squares estimates of a and b are:

b = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²)

a = (ΣY - bΣX) / n = Ȳ - bX̄

or, equivalently,

b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²

where n represents the number of pairs of X, Y values, and X̄ and Ȳ are the means of X and Y. NB: the calculated values of a and b should be given to 3 decimal places.
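The formulas above translate directly into code. The sketch below (the function name is ours) computes a and b from paired observations using the summation formulas.

```python
# Sketch: least squares estimates of a and b from the summation formulas.
def least_squares(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n       # equivalently a = Ybar - b*Xbar
    return round(a, 3), round(b, 3)   # the text asks for 3 decimal places
```

Applied to the Table 37 data further below, least_squares([1, 2, 3, 4, 5], [60, 100, 70, 90, 80]) returns (71.0, 3.0), matching the worked example.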

Table 36. Advertising and Expenditure Statistics

From the data above one can obtain a and b in the regression equation Y = a + bX; the calculations give b = 8 and a = 48.

The regression equation is therefore Y = 48 + 8X (answer in $000). If we are required to estimate the sales for 2005, when advertising expenditure is expected to be $5000, then substituting X = 5 into the regression equation gives Y = 48 + 8(5) = 88, that is, $88 000.

N.B. It is important that you be able to interpret the intercept and slope in terms of the data being utilized to fit the model. The model parameters should be interpreted only within the sampled range of the independent variable (in this case between $2000 and $6000).

Sometimes only one set of data is available and one is expected to make projections from it. From our initial table, if the data on advertising expenditure were missing and we were required to estimate sales in a given period, we would take the years as our x while the sales remain as y. The years are given coded values, starting with 2000 coded as x = 1, 2001 as x = 2, etc. Alternatively one could start with 2000 coded as x = 0, 2001 as x = 1, etc. The estimates are the same in each case although the regression equation is slightly different.

From the table below, b = 3 and a = 71, hence the regression equation is Y = 71 + 3x. From this result we can forecast the sales for 2005 and 2007. We first need to establish the value of x in 2005 and 2007: x = 6 and x = 8 respectively. Substituting into the equation gives y(2005) = 71 + 3(6) = 89 (i.e. $89 000) and y(2007) = 71 + 3(8) = 95 (i.e. $95 000).

Table 37. Regression workings

Year  | Coded (x) | Sales (y, $000) | xy   | x²
2000  | 1         | 60              | 60   | 1
2001  | 2         | 100             | 200  | 4
2002  | 3         | 70              | 210  | 9
2003  | 4         | 90              | 360  | 16
2004  | 5         | 80              | 400  | 25
Total | 15        | 400             | 1230 | 55

An alternative formula for a, which has the advantage of not using the previously calculated value of b, can also be used:

a = (ΣY ΣX² - ΣX ΣXY) / (nΣX² - (ΣX)²)  and  b = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²)

This will give the same results as the first formulas used above.
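As a check, the sketch below applies both sets of formulas to the Table 37 figures and reproduces the forecasts from the worked example.

```python
# Sketch: both formula sets applied to the Table 37 data (coded years).
x = [1, 2, 3, 4, 5]
y = [60, 100, 70, 90, 80]
n = len(x)
sx, sy = sum(x), sum(y)                      # 15, 400
sxy = sum(xi * yi for xi, yi in zip(x, y))   # 1230
sx2 = sum(xi * xi for xi in x)               # 55

b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)        # 3.0
a_first = (sy - b * sx) / n                          # 71.0 (uses b)
a_alt = (sy * sx2 - sx * sxy) / (n * sx2 - sx ** 2)  # 71.0 (independent of b)
print(a_first == a_alt)                              # True

print("2005 forecast:", a_alt + b * 6)   # 89, i.e. $89 000
print("2007 forecast:", a_alt + b * 8)   # 95, i.e. $95 000
```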

NB. When drawing the scatter plot and the regression line, it is sometimes desirable to truncate the graph. The reason is to show the line only over the observed range of the independent and dependent variables.

For a set of X and Y values and the estimated regression equation Ŷ = a + bX, each X has an observed value Y and a predicted value Ŷ. The total variation, Σ(Y - Ȳ)², is the sum of the squares of the vertical distances of each point from the mean Ȳ. The total variation can be divided into two parts: that which is attributed to the relationship of X and Y, and that which is due to chance. The variation obtained from the relationship (i.e. from the predicted values) is Σ(Ŷ - Ȳ)² and is called the explained variation; the variation due to chance, Σ(Y - Ŷ)², is called the unexplained variation. Most of the variation can be explained by the relationship when the fit is good: the closer the value of r is to +1 or -1, the better the points fit the line and the closer Σ(Ŷ - Ȳ)² is to Σ(Y - Ȳ)². In fact, if all points fall on the regression line, Σ(Ŷ - Ȳ)² = Σ(Y - Ȳ)², since Ŷ would be equal to Y in each case.
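Using the Table 37 fit (Y = 71 + 3x) as an example, the sketch below computes the total, explained and unexplained variation and confirms that the first equals the sum of the other two.

```python
# Sketch: total, explained and unexplained variation for the Table 37 fit.
x = [1, 2, 3, 4, 5]
y = [60, 100, 70, 90, 80]
a, b = 71.0, 3.0                      # fitted line Y = 71 + 3x
y_hat = [a + b * xi for xi in x]      # predicted values
y_bar = sum(y) / len(y)               # mean of Y, here 80

total       = sum((yi - y_bar) ** 2 for yi in y)                # Σ(Y - Ȳ)²
explained   = sum((yh - y_bar) ** 2 for yh in y_hat)            # Σ(Ŷ - Ȳ)²
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # Σ(Y - Ŷ)²

print(total, explained, unexplained)   # total = explained + unexplained
print("r squared:", explained / total) # coefficient of determination
```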