*Lab 1 - 3: Stata Introduction, Tests, Violations of GM, Specification Issues

*Intro

*Open Stata, change the directory to your drive

cd c:\

*open the datafile lab1.dta

use lab1.dta

*open a log-file to save the results:

log using name.log, [replace append]

*Note: at the end of the session you have to close the log file with: log close

*use commands that describe the characteristics of the variables:

descr x1 x2

list x10

sum x1-x10

tab x9 x10

tab1 x8 x9 x10

tabstat x1 x2 x3 x4 x5 x6 y, statistics(min mean median p25 p75 max sd var range)

*generate new variables

gen x_inter=x4*x9

egen x1_mean=mean(x1)

*look at some correlations and covariances

corr x1 x2 x3 x4 x5

corr x1 x2 x3 x4 x5, cov

*use the drop down menu to look at some histograms and density plots
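*the command equivalents for histograms can also be typed directly; a minimal sketch using x1:

hist x1

hist x1, normal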

twoway (scatter x1 x3)

kdensity x1, norm

* T-Tests and Correlation

/*download the file garmit_esspanel1.dta into your drive

this file contains a panel of 18 OECD countries from 1961 to 1994. Background reading: the Garrett-Mitchell paper on government spending*/

*open Stata

*change the working directory to your drive:

cd "c:\...\..."

*open the data file in stata:

use garmit_esspanel1.dta

*summarize the main characteristics of the interesting variables:

sum spend unem growthpc depratio left cdem trade lowwage fdi skand

*run some simple t-tests:

ttest spend == 30

ttest spend == 41

/*now test whether government spending is different for different regions/political systems */

ttest spend, by(skand)

ttest spend, by(skand) unequal

ttest spend, by(angel)

ttest spend, by(angel) unequal

/*or test whether the dispersion of government spending is different in different regions:*/

sdtest spend, by(skand)

sdtest spend, by(angel)

/*Is the dependent variable "spend" normally distributed (enough)? */

hist spend

kdensity spend

swilk spend

sktest spend

/* calculate some simple bivariate covariances and correlations and interpret the results, what is the problem with bivariate correlation? */
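*one illustration of the problem (a sketch): a bivariate correlation ignores all other variables; a partial correlation conditions on them:

pcorr spend unem left growthpc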

corr spend unem

corr spend left

corr spend unem, cov

corr spend left, cov

/*estimate several binary OLS models for government spending and interpret the regression results:*/

reg spend unem

reg spend left

reg spend growthpc

reg spend trade

reg spend depratio

* OLS and Violations of GM

*summarize the main characteristics of the interesting variables:

sum spend unem growthpc depratio left cdem trade lowwage fdi skand

*estimate a basic model for government spending and interpret the regression table: coefficients, standard errors, confidence intervals, R², F-test

reg spend unem growthpc depratio left cdem trade lowwage fdi

*now change the confidence level (note: the default setting is 95%)

set level 99

set level 90

*estimate the same regression again, what has changed, how and why?
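*for example (a sketch), re-estimate the model after changing the level, or set the level for a single command via the level() option:

reg spend unem growthpc depratio left cdem trade lowwage fdi

reg spend unem growthpc depratio left cdem trade lowwage fdi, level(99)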

*Set the level back to default:

set level 95

*estimate the model again:

reg spend unem growthpc depratio left cdem trade lowwage fdi

*Now run the same regression but estimate standardized coefficients – how to interpret standardized coefficients? Why do we need standardized coefficients?

reg spend unem growthpc depratio left cdem trade lowwage fdi, beta

*AV-plots (added-variable plots) are for multivariate models, to show the ceteris paribus effect of a single independent variable:

avplots

avplot unem

*for linear OLS models the marginal effect of a single independent variable is equal for all values of the independent variable:

mfx

mfx compute, at(mean)

mfx compute, at(median)

*______

*Multicollinearity:

*calculate correlation coefficients for all explanatory variables, do you find problems of multicollinearity?

corr unem growthpc depratio left cdem trade lowwage fdi skand

*What is the trade-off between multicollinearity and omitted variable bias?

*If there is perfect multicollinearity we don't have to worry, Stata does the job:

*Linear transformations of a variable are perfectly multicollinear with the original variable:

gen unem2=2*unem+3

corr unem unem2

reg spend unem unem2 if year==1986

*What does Stata do in case of perfect multicollinearity?

*Now look at two variables that are highly correlated but not perfectly multicollinear, what is the problem?

gen unem3=2*unem+3*invnorm(uniform())

corr unem unem3

reg spend unem unem3 if year==1986

reg spend unem skand if year==1986

reg spend unem unem3 skand if year==1986

*run the original regression again:

reg spend unem growthpc depratio left cdem trade lowwage fdi

*Run a variance inflation factor test for higher order multicollinearity:

estat vif

estat vif, uncentered

*How to interpret the vif, what does the vif measure?
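*a sketch of what the vif measures: vif_k = 1/(1-R²_k), where R²_k comes from regressing the k-th regressor on all the others, e.g. for unem:

quietly reg unem growthpc depratio left cdem trade lowwage fdi

display "VIF(unem) = " 1/(1-e(r2))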

* Principal component analysis:

corr trade lowwage fdi

factor trade lowwage fdi, pcf

predict score, r

corr trade lowwage fdi

*______

*Outliers:

*test for outliers:

reg spend unem growthpc depratio left cdem trade lowwage fdi

dotplot spend

symplot spend

rvfplot

lvr2plot

lvr2plot, ml(country)

dfbeta

*Solution: jackknife and bootstrapping:

jackknife _b _se, eclass saving(name.dta): reg spend unem growthpc depratio left cdem trade lowwage fdi skand

bootstrap _b _se, reps(1000) saving(name.dta): reg spend unem growthpc depratio left cdem trade lowwage fdi skand

*______

*Heteroscedasticity and Omitted Variable Bias

*calculate predicted values and residuals:

reg spend unem growthpc depratio left cdem trade lowwage fdi

predict spend_hat

predict spend_resid, resid

*Now create scatterplots for the residuals against some of the explanatory variables and the country codes: what can you see? Omitted variable bias, heteroskedasticity?

twoway (scatter spend_resid unem)

twoway (scatter spend_resid cc)

twoway (scatter spend_resid growthpc)

*Stata provides built-in tests for omitted variable bias and heteroskedasticity: How to interpret the test results?

estat hettest

estat ovtest

estat szroeter unem growthpc depratio left cdem trade lowwage fdi

*What can we do about Heteroskedasticity and omitted variable bias?

reg spend unem growthpc depratio left cdem trade lowwage fdi skand

*What can be observed, how to interpret the coefficient for "skand"?

*Now do the same tests again for the new model and interpret the results:

predict spend_hat2

predict spend_resid2, resid

estat hettest

estat ovtest

twoway (scatter spend spend_hat2 if year==1984, mlabel(country))

twoway (scatter spend_resid2 cc)

twoway (scatter spend_resid2 unem)

estat szroeter unem growthpc depratio left cdem trade lowwage fdi skand

*Just treating the standard errors: robust White SEs:

reg spend unem growthpc depratio left cdem trade lowwage fdi, robust

reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(robust)

reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(cluster cc)

reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(bootstrap)

reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(jackknife)

reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(hc2)

reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(hc3)

*GLS: feasible GLS allowing for panel-level heteroskedasticity

xtgls spend unem growthpc depratio left cdem trade lowwage fdi, p(h)

*Panel Heteroskedasticity

cd m:\

use "...path...\greene5.dta", clear

log using "...path...\lab1.log", replace

tsset firm year

*Estimate groupwise heteroskedastic model and test for heteroskedasticity

reg invest fval sval

display "White test"

predict res, resid

gen res2 = res^2

gen fval2 = fval^2

gen sval2 = sval^2

gen fs = fval * sval

reg res2 fval sval fval2 sval2 fs

ereturn list

scalar white = e(N)*e(r2)

display white

scalar whitesig = chi2tail(e(df_m),white)

display whitesig

display "Lagrange Multiplier test for groupwise heteroskedasticity in panel data"

*note: e() now refers to the auxiliary regression, so compute sigma²-hat from the pooled OLS residuals directly

sum res2

scalar s2 = r(mean)

xtgls invest fval sval, i(firm) p(h) c(i)

matrix list e(Sigma)

matrix sis = (1/s2)*e(Sigma)

matrix list sis

matrix sis1 = sis - I(5)

matrix list sis1

matrix sis2 = vecdiag(sis1)'

matrix list sis2

matrix sis3 = sis2'*sis2

matrix list sis3

matrix lmh = ((e(N)/e(N_g))/2)*sis3

scalar lm = lmh[1,1]

display "Lagrange Multiplier statistic = " lm

scalar lmsig = chi2tail((e(N_g)-1),lm)

display "Significance level = " lmsig

/*Estimate groupwise heteroskedastic and cross-sectionally correlated model and test for cross-sectional correlation*/

xtgls invest fval sval, i(firm) p(c) c(i)

xtgls invest fval sval, i(firm) p(h) c(i) igls

display "Lagrange multiplier test for cross-sectional correlation in panel data (MLE)"

predict yfit , xb

gen resgls = invest - yfit

set matsize 100

mkmat resgls, matrix(rgls)

matrix egls1 = rgls[1..20,1]

matrix egls2 = rgls[21..40,1]

matrix egls3 = rgls[41..60,1]

matrix egls4 = rgls[61..80,1]

matrix egls5 = rgls[81..100,1]

matrix egls = egls1, egls2, egls3, egls4, egls5

matrix list egls

matrix vcv = egls' * egls

matrix list vcv

matrix csc = corr(vcv)

matrix list csc

matrix lmm = csc[2...,1]\csc[3...,2]\csc[4...,3]\csc[5,4]

matrix list lmm

matrix lmlm = 20 * (lmm' * lmm) // LM = T * sum of squared pairwise correlations, T = 20 years

scalar lml = lmlm[1,1]

display "Lagrange Multiplier statistic = " lml

scalar lmlsig = chi2tail(10,lml)

display "Significance level (alpha) = " lmlsig

display "Likelihood ratio test for MLE for cross-sectional correlation in panel data"

matrix si1 = (egls1' * egls1)/20

matrix si2 = (egls2' * egls2)/20

matrix si3 = (egls3' * egls3)/20

matrix si4 = (egls4' * egls4)/20

matrix si5 = (egls5' * egls5)/20

scalar lsi = ln(si1[1,1]) + ln(si2[1,1]) + ln(si3[1,1]) + ln(si4[1,1]) + ln(si5[1,1])

display lsi

quietly xtgls invest fval sval, i(firm) p(c) c(i) igls

scalar ldets = ln(det(e(Sigma)))

display ldets

scalar llr = 20 * (lsi - ldets)

display "Likelihood ratio statistic = " llr

scalar llrsig = chi2tail(10,llr)

display "Significance level (alpha) = " llrsig

*Now reestimate the model using the Beck & Katz (PCSE) approach

xtpcse invest fval sval, c(i)

log close

*______

use "...path...\garmit_esspanel1.dta", clear

*open log-file:

log using "...path...\lab1.log", append

*run basic model: OLS linear regression

reg spend unem growthpc depratio left cdem trade lowwage fdi

*test for heteroskedasticity

estat hettest

estat szroeter unem growthpc depratio left cdem trade lowwage fdi

*check for omitted variable bias

estat ovtest

* run models dealing with panel correlation and heteroskedasticity

reg spend unem growthpc depratio left cdem trade lowwage fdi, robust

reg spend unem growthpc depratio left cdem trade lowwage fdi, cluster(cc)

* compare different models

* iid error structure

xtgls spend unem growthpc depratio left cdem trade lowwage fdi, panels(iid)

* heteroscedastic but uncorrelated error structure

xtgls spend unem growthpc depratio left cdem trade lowwage fdi, panels(h)

* heteroscedastic and correlated error structure

xtgls govcons growthpc depratio left cdem trade lowwage, panels(c)

* independent autocorrelation structure

xtgls spend unem growthpc depratio left cdem trade lowwage fdi, corr(i)

* AR1 autocorrelation structure

xtgls spend unem growthpc depratio left cdem trade lowwage fdi, corr(ar1)

* panel specific AR1 autocorrelation structure

xtgls spend unem growthpc depratio left cdem trade lowwage fdi, corr(psar1)

*Parks Kmenta

xtgls govcons growthpc depratio left cdem trade lowwage, panels(c) corr(psar1)

*xtpcse

xtpcse spend unem growthpc depratio left cdem trade lowwage fdi

xtpcse spend unem growthpc depratio left cdem trade lowwage fdi, corr(ar1)

xtpcse spend unem growthpc depratio left cdem trade lowwage fdi, corr(psar1)

*______

*Autocorrelation:

*now let's look at a single time series, e.g. Germany (or the UK) – what has changed, what is the difference from the models above?

reg spend unem growthpc depratio left cdem trade lowwage fdi if country=="Germany"

*Do the same tests again: what can you observe?

estat hettest

estat ovtest

*Now let's turn to another violation of the Gauss-Markov assumptions: autocorrelation (serial correlation), and test for it with the Durbin-Watson statistic and the Breusch-Godfrey test, how to interpret the results?

estat dwatson

estat bgodfrey

*Or a simpler test: what can we observe?

predict spend_residger, resid

gen lagspend_residger=l.spend_residger

reg spend_residger lagspend_residger unem growthpc depratio left cdem trade lowwage fdi if country == "Germany"

*Now do the same for the UK!

*The simplest way to deal with serial correlation is to include the lagged value of the dependent variable on the right-hand side of the model – the LDV (lagged dependent variable). BUT this opens another can of worms (we will talk about this in more detail when we discuss time-series models): How to interpret the coefficient of the LDV? If spendl is not already in the dataset, generate it with: gen spendl = l.spend

reg spend spendl unem growthpc depratio left cdem trade lowwage fdi if country=="Germany"

*And do the tests again:

estat dwatson

estat bgodfrey

*Run a Prais-Winsten model:

prais spend unem growthpc depratio left cdem trade lowwage fdi if country=="Germany"

*Play around a little

*______

*test for functional form:

reg spend unem growthpc depratio left cdem trade lowwage fdi

acprplot unem, mspline

acprplot left, mspline

*etc…

gen ln_unem = ln(unem)

reg spend ln_unem growthpc depratio left cdem trade lowwage fdi

gen sqrt_unem=unem^2 // note: despite the name, this is the squared term

reg spend unem sqrt_unem growthpc depratio left cdem trade lowwage fdi

*what effect can we observe? Calculate the “turning point”!
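*a sketch of the turning-point calculation: with spend = b0 + b1*unem + b2*unem^2 + ..., the marginal effect of unem is b1 + 2*b2*unem, which equals zero at unem = -b1/(2*b2):

display "turning point: " -_b[unem]/(2*_b[sqrt_unem])

nlcom -_b[unem]/(2*_b[sqrt_unem])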

*______

*Interpretation of a dummy variable and interaction effects:

reg spend skand if year==1986

reg spend unem skand

gen skand_unem=unem*skand

reg spend unem skand skand_unem

reg spend unem skand skand_unem growthpc depratio left cdem trade lowwage fdi

*Interaction effects:

gen unem_trade=unem*trade

reg spend unem trade growthpc depratio left cdem lowwage fdi

reg spend unem trade unem_trade growthpc depratio left cdem lowwage fdi

ssc install sslope

sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(unem trade unem_trade)

sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(trade unem unem_trade)

*graphical display of IA effects:

sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(unem trade unem_trade) graph

sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(trade unem unem_trade) graph

sum trade unem

*Program marginal effects of IA effects:

***********************************************************************************

capture drop MV-lower

reg spend unem trade unem_trade growthpc depratio left cdem lowwage fdi

generate MV=((_n-1)*10)

replace MV=. if _n>17

matrix b=e(b)

matrix V=e(V)

scalar b1=b[1,1]

scalar b2=b[1,2]

scalar b3=b[1,3]

scalar varb1=V[1,1]

scalar varb2=V[2,2]

scalar varb3=V[3,3]

scalar covb1b3=V[1,3]

scalar covb2b3=V[2,3]

scalar list b1 b2 b3 varb1 varb2 varb3 covb1b3 covb2b3

gen conb=b1+b3*MV if _n<=17

gen conse=sqrt(varb1+varb3*(MV^2)+2*covb1b3*MV) if _n<=17

gen a=1.96*conse

gen upper=conb+a

gen lower=conb-a

graph twoway (line conb MV, clwidth(medium) clcolor(black))/*

*/(line upper MV, clpattern(dash) clwidth(thin) clcolor(black))/*

*/(line lower MV, clpattern(dash) clwidth(thin) clcolor(black)) , /*

*/ xlabel(0 20 40 60 80 100 120 140 160, labsize(2.5)) /*

*/ ylabel(-0.5 0 0.5 1 1.5 2, labsize(2.5)) yscale(noline) xscale(noline) legend(col(1) order(1 2) label(1 "Marginal Effect of Unemployment on Spending") label(2 "95% Confidence Interval") /*

*/ label(3 " ")) yline(0, lcolor(black)) title("Marginal Effect of Unemployment on Spending as Trade Openness changes", size(4))/*

*/ subtitle(" " "Dependent Variable: Government Spending" " ", size(3)) xtitle( Trade Openness, size(3) ) xsca(titlegap(2)) ysca(titlegap(2)) /*

*/ ytitle("Marginal Effect of Unemployment", size(3)) scheme(s2mono) graphregion(fcolor(white))

graph export m:\...\unem_trade1.eps, replace

translate @Graph m:\...\unem_trade1.wmf, replace

capture drop MV-lower

***********************************************************************************

*or much easier:

*download grinter.ado from Fred Boehmke

net from

reg spend unem trade unem_trade growthpc depratio left cdem lowwage fdi

grinter unem, inter(unem_trade) const02(trade) depvar(spend) kdensity yline(0)
