Notes for Regression Models with Excel
Population – described by a parameter (a characteristic of the population)
Sample – described by a statistic (an estimate of the parameter computed from the sample)
Statistics:
Parametric: assume a known distribution – mostly Normal
Non-Parametric: no assumption as to underlying distribution
Descriptive: no assumption or detailed analysis – mean, mode, median, frequency, etc.; use “explore”, “pivot tables”, “Chart Wizard”, “Crosstabs”, or scatter diagrams to see how the data relate to each other
r = correlation coefficient, ranges from -1 to +1
r2 = coefficient of determination, ranges from 0 to 1
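As a quick illustration (not from the notes), here is a minimal Python sketch computing r and r2 for a small hypothetical data set; in Excel the same values come from CORREL() and RSQ().

import numpy as np

# hypothetical data: advertising spend (x) and sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient, between -1 and +1
r2 = r ** 2                   # coefficient of determination, between 0 and 1
print(r, r2)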
Variables:
Lots of different kinds – a symbol that can have one of many values
Independent
Dependent
Lurking
Moderating
Also Factors – in effect a variable made up of other variables (a “variable of variables”)
Scales for Variables:
Four major types:
Nominal – no scale or direction – male/female
Ordinal – no scale but does have a rank or direction – Likert scale
Least preferred 1 2 3 4 5 most preferred
Scale (interval) – direction and magnitude – e.g., number of square feet in a shopping mall
Ratio – same as interval but has a true zero point
Parametric statistics: use scale variables – t, z and F statistics
Non-parametric statistics: use ordinal, nominal variables – binomial distributions, χ2 (chi-squared) tests
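To make the parametric/non-parametric split concrete, a small sketch using made-up data and scipy (Excel offers T.TEST and CHISQ.TEST for the same jobs):

import numpy as np
from scipy import stats

# parametric: t-test on a scale variable (assumes roughly normal data)
group_a = np.array([12.1, 11.8, 13.0, 12.5, 12.9])   # hypothetical measurements
group_b = np.array([11.2, 11.5, 10.9, 11.8, 11.1])
t_stat, t_p = stats.ttest_ind(group_a, group_b)
print("t =", t_stat, "p =", t_p)

# non-parametric: chi-squared test on counts of a nominal variable
observed = np.array([[30, 20],          # hypothetical 2x2 frequency table
                     [25, 25]])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)
print("chi2 =", chi2, "p =", chi_p)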
Applications:
Hypothesis testing
Experimental Design
Statistical Process Control
Factor Analysis
Cluster Analysis
Multidimensional Scaling
Parametric Assumptions
1. observations must be independent
2. observations are from a normally distributed population
3. populations should have equal variances (same σ2)
4. measurements are at least at the interval (scale) level
Multiple Linear Regression
Single (Simple) Linear Regression
Y = mX + b + ε   (ε is the random error term)
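A minimal sketch of fitting Y = mX + b + ε outside Excel (in Excel, SLOPE(), INTERCEPT(), or the Regression tool in the Analysis ToolPak give the same estimates); the data below are hypothetical:

import numpy as np
from scipy import stats

# hypothetical data: x = independent variable, y = dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

# least-squares estimates of m and b; the residuals stand in for the error term ε
fit = stats.linregress(x, y)
print("m =", fit.slope, "b =", fit.intercept, "r2 =", fit.rvalue ** 2)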
Time Series
X axis is time in equal intervals
Multiple Linear Regression
Assumptions of MLR
1. observations are independent
2. relationship between any two variables is linear
3. for each value of the independent variables there is a normal distribution for the dependent variable
4. these distributions of the dependent variable all have the same variance (σ2)
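As a rough diagnostic sketch (my own illustration, not part of the notes), one way to eyeball assumptions 2–4 is to fit the model and inspect the residuals; statsmodels is assumed here and the data are simulated:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # two hypothetical independent variables
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid

# plot residuals vs fitted values: look for constant spread (equal variance,
# assumption 4) and no curvature (linearity, assumption 2); a normal
# probability plot of the residuals checks assumption 3
print(model.summary())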
Very Very General Procedure to Build a MLR Model
1. check assumptions, choose variables, assess type of variable
2. perform exploratory statistics
- mean, mode, median
- scatter diagrams with Chart Wizard (histograms, stem-and-leaf plots, etc.)
3. run a correlation table
- determine order of adding variables
- eliminate highly correlated variables
4. Run a step-wise regression. Assess fit via r2, F, and t statistics (see the sketch below).
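The sketch below walks through steps 3–4 outside Excel (pandas .corr() for the correlation table, statsmodels for the regression); it is a simplified, manual version of step-wise selection with made-up variable names:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
# hypothetical data set with three candidate independent variables
df = pd.DataFrame({
    "x1": rng.normal(size=40),
    "x2": rng.normal(size=40),
    "x3": rng.normal(size=40),
})
df["y"] = 3.0 + 1.5 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.8, size=40)

# step 3: correlation table – keep predictors correlated with y,
# drop predictors that are highly correlated with each other
print(df.corr().round(2))

# step 4 (manual step-wise idea): add variables one at a time and compare
# r2, adjusted r2, F, and the t/p-values of the coefficients at each stage
for cols in (["x1"], ["x1", "x2"], ["x1", "x2", "x3"]):
    fit = sm.OLS(df["y"], sm.add_constant(df[cols])).fit()
    print(cols, "r2 =", round(fit.rsquared, 3), "adj r2 =", round(fit.rsquared_adj, 3))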
Excel’s Regression tool provides an extensive analysis of the regression equation. Its first table lists a number of statistics, including three measures of fit. As in simple regression, we can use r2 as a basic measure of fit. In multiple regression, r2 is called the coefficient of multiple determination, and its square root, r, is called the multiple r or the multiple correlation coefficient; Excel reports both values. The table also reports an adjusted r2. The reason for introducing another measure is that adding independent variables to the regression will never decrease r2 and will usually increase it, even if there is no reason why the added variable should help predict the dependent variable. The adjustment accounts for the number of variables in the equation and will cause the adjusted r2 to decrease unless there is a compensating reduction in the differences between observed and predicted values. One informal guideline for how many independent variables to include is to choose the model with the largest adjusted r2. However, we strongly recommend including variables that can be justified on practical or theoretical grounds, even if the adjusted r2 is then not at its largest possible value.
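For reference, the adjustment follows the usual formula based on the number of observations n and the number of independent variables k; a small sketch (the function and variable names are mine):

def adjusted_r2(r2, n, k):
    # standard adjustment: penalizes r2 for the k variables in the equation
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# example: r2 = 0.80 with n = 30 observations and k = 4 predictors
print(adjusted_r2(0.80, 30, 4))   # lower than 0.80, reflecting the penalty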
The “Principle of Parsimony” holds that the best model is the one with the fewest independent variables that still explains the data well – after all, this is a business course and it costs money to collect data. You want the maximum r2 for the least cost; adding more variables for small improvements is often neither practical nor sensible.
The regression report also lists the F statistic and its p-value (labeled Significance F), which has the same interpretation as in simple regression: it measures the probability of getting the observed r2 or higher if, in fact, all the true regression coefficients were zero. This is a measure of the overall fit of the regression model to the data, but it is not a particularly stringent test; we would expect this probability to be low in any well-considered model for which relevant data were being used.
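A sketch of where these numbers come from, using statsmodels with simulated data (fvalue is the overall F statistic; f_pvalue plays the role of Excel’s Significance F):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))                    # hypothetical predictors
y = 2.0 + X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=40)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print("F =", fit.fvalue)                 # overall F statistic
print("Significance F =", fit.f_pvalue)  # probability of an r2 this large if all true coefficients were zero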
The estimated coefficients of the regression equation can be found under the Coefficients heading in the Excel regression output. Corresponding to each coefficient, the table provides a p-value, which measures the probability that we would get an estimate at least this large if the true regression coefficient were zero. While the p-value provides a quick check on each coefficient, the respective confidence intervals generally provide more useful information.
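Continuing in the same hedged spirit, the statsmodels attributes that correspond to Excel’s Coefficients, P-value, and Lower/Upper 95% columns are shown below (data again simulated):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))               # hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=40)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                 # estimated coefficients (intercept first)
print(fit.pvalues)                # p-value for each coefficient
print(fit.conf_int(alpha=0.05))   # 95% confidence interval for each coefficient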
5. Write out the model. Make appropriate plots and interpretations.