Notes for Regression Models with Excel
Population – described by a parameter (a characteristic of the population)
Sample – described by a statistic (an estimate of the parameter computed from the sample)
Statistics:
Parametric: assume a known distribution – mostly Normal
Non-Parametric: no assumption as to underlying distribution
Descriptive: no assumption or detailed analysis – mean, mode, median, frequency, etc.; use “explore”, “pivot tables”, “Chart Wizard”, “Crosstabs”, or scatter diagrams to see how the data relate to each other
r = correlation coefficient, ranges from -1 to +1
r2 = coefficient of determination, ranges from 0 to 1
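As a quick illustration (not from the notes), here is a minimal Python sketch computing r and r2 for a small hypothetical data set; in Excel the same values come from CORREL() and RSQ().

import numpy as np

# hypothetical data: advertising spend (x) and sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient, between -1 and +1
r2 = r ** 2                   # coefficient of determination, between 0 and 1
print(r, r2)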
Variables:
Lots of different kinds – a symbol that can have one of many values
Independent
Dependent
Lurking
Moderating
Also Factors – in effect a variable made up of other variables (a “variable of variables”)
Scales for Variables:
Four major types:
Nominal – no scale or direction – male/female
Ordinal – no scale but does have a rank or direction – Likert scale
Least preferred 1 2 3 4 5 most preferred
Scale (interval) – direction and magnitude – e.g., number of square feet in a shopping mall
Ratio – same as interval but has a true zero point
Parametric statistics: use scale variables – t, z and F statistics
Non-parametric statistics: use ordinal, nominal variables – binomial distributions, χ2 (chi-squared) tests
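To make the parametric/non-parametric split concrete, a small sketch using made-up data and scipy (Excel offers T.TEST and CHISQ.TEST for the same jobs):

import numpy as np
from scipy import stats

# parametric: t-test on a scale variable (assumes roughly normal data)
group_a = np.array([12.1, 11.8, 13.0, 12.5, 12.9])   # hypothetical measurements
group_b = np.array([11.2, 11.5, 10.9, 11.8, 11.1])
t_stat, t_p = stats.ttest_ind(group_a, group_b)
print("t =", t_stat, "p =", t_p)

# non-parametric: chi-squared test on counts of a nominal variable
observed = np.array([[30, 20],          # hypothetical 2x2 frequency table
                     [25, 25]])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)
print("chi2 =", chi2, "p =", chi_p)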
Applications:
Hypothesis testing
Experimental Design
Statistical Process Control
Factor Analysis
Cluster Analysis
Multidimensional Scaling
Parametric Assumptions
1. observations must be independent
2. observations are from a normally distributed population
3. populations should have equal variances (same σ2)
4. measurements are at least at the interval (scale) level
Multiple Linear Regression
Single (Simple) Linear Regression
Y = mX + b + ε   (ε is the random error term)
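A minimal sketch of fitting Y = mX + b + ε outside Excel (in Excel, SLOPE(), INTERCEPT(), or the Regression tool in the Analysis ToolPak give the same estimates); the data below are hypothetical:

import numpy as np
from scipy import stats

# hypothetical data: x = independent variable, y = dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

# least-squares estimates of m and b; the residuals stand in for the error term ε
fit = stats.linregress(x, y)
print("m =", fit.slope, "b =", fit.intercept, "r2 =", fit.rvalue ** 2)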
Time Series
X axis is time in equal intervals
Multiple Linear Regression
Assumptions of MLR
1. observations are independent
2. relationship between any two variables is linear
3. for each value of the independent variables there is a normal distribution for the dependent variable
4. these distributions of the dependent variable all have the same variance (σ2)
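As a rough diagnostic sketch (my own illustration, not part of the notes), one way to eyeball assumptions 2–4 is to fit the model and inspect the residuals; statsmodels is assumed here and the data are simulated:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # two hypothetical independent variables
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid

# plot residuals vs fitted values: look for constant spread (equal variance,
# assumption 4) and no curvature (linearity, assumption 2); a normal
# probability plot of the residuals checks assumption 3
print(model.summary())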
Very Very General Procedure to Build a MLR Model
1. check assumptions, choose variables, assess type of variable
2. perform exploratory statistics
- mean, mode, median
- scatter diagrams with Chart Wizard (histograms, stem-and-leaf plots, etc.)
3. run a correlation table
- determine order of adding variables
- eliminate highly correlated variables
4. Run a step-wise regression. Assess fit via r2, F, and t statistics (see the sketch below).
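The sketch below walks through steps 3–4 outside Excel (pandas .corr() for the correlation table, statsmodels for the regression); it is a simplified, manual version of step-wise selection with made-up variable names:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
# hypothetical data set with three candidate independent variables
df = pd.DataFrame({
    "x1": rng.normal(size=40),
    "x2": rng.normal(size=40),
    "x3": rng.normal(size=40),
})
df["y"] = 3.0 + 1.5 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.8, size=40)

# step 3: correlation table – keep predictors correlated with y,
# drop predictors that are highly correlated with each other
print(df.corr().round(2))

# step 4 (manual step-wise idea): add variables one at a time and compare
# r2, adjusted r2, F, and the t/p-values of the coefficients at each stage
for cols in (["x1"], ["x1", "x2"], ["x1", "x2", "x3"]):
    fit = sm.OLS(df["y"], sm.add_constant(df[cols])).fit()
    print(cols, "r2 =", round(fit.rsquared, 3), "adj r2 =", round(fit.rsquared_adj, 3))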
Excel’s Regression tool provides an extensive analysis of the regression equation. Its first table lists a number of statistics, including three measures of fit. As in simple regression, we can use r2 as a basic measure of fit. In multiple regression, r2 is called the coefficient of multiple determination, and its square root, r, is called the multiple r or the multiple correlation coefficient; Excel reports both values. The table also reports an adjusted r2. The reason for introducing another measure is that adding independent variables to the regression will never decrease r2 and will usually increase it, even if there is no reason why the added variable should help predict the dependent variable. The adjustment accounts for the number of variables in the equation and will cause the adjusted r2 to decrease unless there is a compensating reduction in the differences between observed and predicted values. One informal guideline for how many independent variables to include is to choose the model with the largest adjusted r2. However, we strongly recommend including variables that can be justified on practical or theoretical grounds, even if the adjusted r2 is then not at its largest possible value.
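For reference, the adjustment follows the usual formula based on the number of observations n and the number of independent variables k; a small sketch (the function and variable names are mine):

def adjusted_r2(r2, n, k):
    # standard adjustment: penalizes r2 for the k variables in the equation
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# example: r2 = 0.80 with n = 30 observations and k = 4 predictors
print(adjusted_r2(0.80, 30, 4))   # lower than 0.80, reflecting the penalty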
The “Principle of Parsimony” holds that the best model is the one with the fewest independent variables that still explains the data well – after all, this is a business course and it costs money to collect data. You want the maximum r2 for the least cost; adding more variables for small improvements is often neither practical nor sensible.
The regression report also lists the F statistic and its p-value (labeled Significance F), which has the same interpretation as in simple regression: it measures the probability of getting the observed r2 or higher if, in fact, all the true regression coefficients were zero. This is a measure of the overall fit of the regression model to the data, but it is not a particularly stringent test; we would expect this probability to be low in any well-considered model for which relevant data were being used.
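A sketch of where these numbers come from, using statsmodels with simulated data (fvalue is the overall F statistic; f_pvalue plays the role of Excel’s Significance F):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))                    # hypothetical predictors
y = 2.0 + X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=40)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print("F =", fit.fvalue)                 # overall F statistic
print("Significance F =", fit.f_pvalue)  # probability of an r2 this large if all true coefficients were zero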
The estimated coefficients of the regression equation can be found under the Coefficients heading in the Excel regression output. Corresponding to each coefficient, the table provides a p-value, which measures the probability that we would get an estimate at least this large if the true regression coefficient were zero. While the p-value provides a quick check on each coefficient, the respective confidence intervals generally provide more useful information.
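Continuing in the same hedged spirit, the statsmodels attributes that correspond to Excel’s Coefficients, P-value, and Lower/Upper 95% columns are shown below (data again simulated):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))               # hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=40)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                 # estimated coefficients (intercept first)
print(fit.pvalues)                # p-value for each coefficient
print(fit.conf_int(alpha=0.05))   # 95% confidence interval for each coefficient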
5. Write out the model. Make appropriate plots and interpretations.