*Lab 1 - 3: Stata Introduction, Tests, Violations of GM, Specification Issues
*Intro
*Open stata, change the directory to your drive
cd c:\
*open the datafile lab1.dta
use lab1.dta, clear
*open a log-file to save the results:
log using name.log, [replace append]
*Note: at the end of the session you have to close the log file with: log close
*use commands that describe the characteristics of the variables:
descr x1 x2
list x10
sum x1-x10
tab x9 x10
tab1 x8 x9 x10
tabstat x1 x2 x3 x4 x5 x6 y, statistics(min mean median p25 p75 max sd var range)
*generate new variables
gen x_inter=x4*x9
egen x1_mean=mean(x1)
*look at some correlations and covariances
corr x1 x2 x3 x4 x5
corr x1 x2 x3 x4 x5, cov
*use the drop down menu to look at some histograms and density plots
twoway (scatter x1 x3)
kdensity x1, norm
* T-Tests and Correlation
/*download the file garmit_esspanel1.dta to your drive
this file contains a panel of 18 OECD countries from 1961 to 1994. Background reading: the Garrett-Mitchell paper on government spending*/
*open stata
*change the working directory to your drive:
cd "c:\...\..."
*open the data file in stata:
use garmit_esspanel1.dta
*summarize the main characteristics of the interesting variables:
sum spend unem growthpc depratio left cdem trade lowwage fdi skand
*run some simple t-tests:
ttest spend=30
ttest spend=41
/*now test whether government spending is different across regions/political systems */
ttest spend, by(skand)
ttest spend, by(skand) unequal
ttest spend, by(angel)
ttest spend, by(angel) unequal
/*or test whether the dispersion of government spending is different in different regions:*/
sdtest spend, by(skand)
sdtest spend, by(angel)
/*Is the dependent variable "spend" normally distributed (enough)? */
hist spend
kdensity spend
swilk spend
sktest spend
/* calculate some simple bivariate covariances and correlations and interpret the results, what is the problem with bivariate correlation? */
corr spend unem
corr spend left
corr spend unem, cov
corr spend left, cov
/*estimate several binary OLS models for government spending and interpret the regression results:*/
reg spend unem
reg spend left
reg spend growthpc
reg spend trade
reg spend depratio
* OLS and Violations of GM
*summarize the main characteristics of the interesting variables:
sum spend unem growthpc depratio left cdem trade lowwage fdi skand
*estimate a basic model for government spending and interpret the regression table: coefficients, standard errors, confidence intervals, R², F-test
reg spend unem growthpc depratio left cdem trade lowwage fdi
*now change the confidence levels (note the default setting is 95% levels)
set level 99
set level 90
*estimate the same regression again, what has changed, how and why?
*Set the level back to default:
set level 95
*estimate the model again:
reg spend unem growthpc depratio left cdem trade lowwage fdi
*Now run the same regression but estimate standardized coefficients – how to interpret standardized coefficients? Why do we need standardized coefficients?
reg spend unem growthpc depratio left cdem trade lowwage fdi, beta
*Added-variable plots (avplots) show, in multivariate models, the ceteris paribus effect of a single independent variable:
avplots
avplot unem
*for linear OLS models the marginal effect of a single independent variable is the same for all values of that variable:
mfx
mfx compute, at(mean)
mfx compute, at(median)
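*Note: in Stata 11 and later, mfx has been superseded by the margins command;
*a sketch of roughly equivalent calls (run after the regression above):
reg spend unem growthpc depratio left cdem trade lowwage fdi
margins, dydx(*) atmeans
margins, dydx(unem)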
*______
*Multicollinearity:
*calculate correlation coefficients for all explanatory variables, do you find problems of multicollinearity?
corr unem growthpc depratio left cdem trade lowwage fdi skand
*What is the trade-off between multicollinearity and omitted variable bias?
*If there is perfect multicollinearity we don't have to worry, Stata does the job:
*Linear transformations of a variable are perfectly collinear with the original variable:
gen unem2=2*unem+3
corr unem unem2
reg spend unem unem2 if year==1986
*What does Stata do in case of perfect multicollinearity?
*Now look at two variables that are highly correlated but not perfectly collinear, what is the problem?
gen unem3=2*unem+3*invnorm(uniform())
corr unem unem3
reg spend unem unem3 if year==1986
reg spend unem skand if year==1986
reg spend unem unem3 skand if year==1986
*run the original regression again:
reg spend unem growthpc depratio left cdem trade lowwage fdi
*Run a variance inflation factor test for higher order multicollinearity:
estat vif
estat vif, uncentered
*How to interpret the VIF? What does it measure?
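*The VIF of a regressor x_j is 1/(1-R²_j), where R²_j comes from regressing x_j
*on all other regressors; a sketch computing it by hand for unem (assuming the
*model above was just run on the same sample):
quietly reg unem growthpc depratio left cdem trade lowwage fdi
display "VIF(unem) = " 1/(1 - e(r2))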
* Principal component analysis:
corr trade lowwage fdi
factor trade lowwage fdi, pcf
predict score
corr trade lowwage fdi score
*______
*Outliers:
*test for outliers:
reg spend unem growthpc depratio left cdem trade lowwage fdi
dotplot spend
symplot spend
rvfplot
lvr2plot
lvr2plot, ml(country)
dfbeta
*Solutions: jackknife and bootstrapping:
jackknife _b _se, eclass saving(name.dta): reg spend unem growthpc depratio left cdem trade lowwage fdi skand
bootstrap _b _se, reps(1000) saving(name.dta): reg spend unem growthpc depratio left cdem trade lowwage fdi skand
*______
*Heteroscedasticity and Omitted Variable Bias
*calculate predicted values and residuals:
reg spend unem growthpc depratio left cdem trade lowwage fdi
predict spend_hat
predict spend_resid, resid
*Now create scatterplots of the residuals against some of the explanatory variables and the country codes: what can you see? Omitted variable bias, heteroskedasticity?
twoway (scatter spend_resid unem)
twoway (scatter spend_resid cc)
twoway (scatter spend_resid growthpc)
*Stata provides built-in tests for omitted variable bias and heteroskedasticity: how to interpret the test results?
estat hettest
estat ovtest
estat szroeter unem growthpc depratio left cdem trade lowwage fdi
*What can we do about Heteroskedasticity and omitted variable bias?
reg spend unem growthpc depratio left cdem trade lowwage fdi skand
*What can be observed, how to interpret the coefficient for "skand"?
*Now do the same tests again for the new model and interpret the results:
predict spend_hat2
predict spend_resid2, resid
estat hettest
estat ovtest
twoway (scatter spend spend_hat2 if year==1984, mlabel(country))
twoway (scatter spend_resid2 cc)
twoway (scatter spend_resid2 unem)
estat szroeter unem growthpc depratio left cdem trade lowwage fdi skand
*Just treating the standard errors: robust White SEs:
reg spend unem growthpc depratio left cdem trade lowwage fdi, robust
reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(robust)
reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(cluster cc)
reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(bootstrap)
reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(jackknife)
reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(hc2)
reg spend unem growthpc depratio left cdem trade lowwage fdi, vce(hc3)
*FGLS allowing for panel-level heteroskedasticity:
xtgls spend unem growthpc depratio left cdem trade lowwage fdi, p(h)
*Panel Heteroskedasticity
cd m:\
use "...path...\greene5.dta", clear
log using "...path...\lab1.log", replace
tsset firm year
*Estimate groupwise heteroskedastic model and test for heteroskedasticity
reg invest fval sval
display "White test"
predict res, resid
gen res2 = res^2
gen fval2 = fval^2
gen sval2 = sval^2
gen fs = fval * sval
reg res2 fval sval fval2 sval2 fs
ereturn list
scalar white = e(N)*e(r2)
display white
scalar whitesig = chi2tail(e(df_m),white)   // df = number of regressors in the auxiliary regression
display whitesig
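*Stata also offers a built-in version of White's test via the information-matrix
*test; it should match the hand-computed statistic (run after the original regression):
quietly reg invest fval sval
estat imtest, white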
display "Lagrange Multiplier test for groupwise heteroskedasticity in panel data"
quietly reg invest fval sval   // s2 must be based on the original pooled regression, not the auxiliary one
scalar s2 = e(rss)/e(N)
xtgls invest fval sval, i(firm) p(h) c(i)
matrix list e(Sigma)
matrix sis = (1/s2)*e(Sigma)
matrix list sis
matrix sis1 = sis - I(5)
matrix list sis1
matrix sis2 = vecdiag(sis1)'
matrix list sis2
matrix sis3 = sis2'*sis2
matrix list sis3
matrix lmh = ((e(N)/e(N_g))/2)*sis3
scalar lm = lmh[1,1]
display "Lagrange Multiplier statistic = " lm
scalar lmsig = chi2tail((e(N_g)-1),lm)
display "Significance level = " lmsig
/*Estimate groupwise heteroskedastic and cross-sectionally correlated model and test for cross-sectional correlation*/
xtgls invest fval sval, i(firm) p(c) c(i)
xtgls invest fval sval, i(firm) p(h) c(i) igls
display "Lagrange multiplier test for cross-sectional correlation in panel data (MLE)"
predict yfit , xb
gen resgls = invest - yfit
set matsize 100
mkmat resgls, matrix(rgls)
matrix egls1 = rgls[1..20,1]
matrix egls2 = rgls[21..40,1]
matrix egls3 = rgls[41..60,1]
matrix egls4 = rgls[61..80,1]
matrix egls5 = rgls[81..100,1]
matrix egls = egls1, egls2, egls3, egls4, egls5
matrix list egls
matrix vcv = egls' * egls
matrix list vcv
matrix csc = corr(vcv)
matrix list csc
matrix lmm = csc[2...,1]\csc[3...,2]\csc[4...,3]\csc[5,4]
matrix list lmm
matrix lmlm = 20 * (lmm' * lmm)   // LM = T * sum of squared pairwise correlations, with T = 20
scalar lml = lmlm[1,1]
display "Lagrange Multiplier statistic = " lml
scalar lmlsig = chi2tail(10,lml)
display "Significance level (alpha) = " lmlsig
display "Likelihood ratio test for MLE for cross-sectional correlation in panel data"
matrix si1 = (egls1' * egls1)/20
matrix si2 = (egls2' * egls2)/20
matrix si3 = (egls3' * egls3)/20
matrix si4 = (egls4' * egls4)/20
matrix si5 = (egls5' * egls5)/20
scalar lsi = ln(si1[1,1]) + ln(si2[1,1]) + ln(si3[1,1]) + ln(si4[1,1]) + ln(si5[1,1])
display lsi
quietly xtgls invest fval sval, i(firm) p(c) c(i) igls
scalar ldets = ln(det(e(Sigma)))
display ldets
scalar llr = 20 * (lsi - ldets)
display "Likelihood ratio statistic = " llr
scalar llrsig = chi2tail(10,llr)
display "Significance level (alpha) = " llrsig
*Now reestimate the model using the Beck & Katz approach (panel-corrected standard errors)
xtpcse invest fval sval, c(i)
log close
*______
use "...path...\garmit_esspanel1.dta", clear
*open log-file:
log using "...path...\lab1.log", append
*run basic model: OLS linear regression
reg spend unem growthpc depratio left cdem trade lowwage fdi
*test for heteroskedasticity
estat hettest
estat szroeter unem growthpc depratio left cdem trade lowwage fdi
*check for omitted variable bias
estat ovtest
* run models dealing with panel correlation and heteroskedasticity
reg spend unem growthpc depratio left cdem trade lowwage fdi, robust
reg spend unem growthpc depratio left cdem trade lowwage fdi, cluster(cc)
* compare different models
* iid error structure
xtgls spend unem growthpc depratio left cdem trade lowwage fdi, panels(iid)
* heteroscedastic but uncorrelated error structure
xtgls spend unem growthpc depratio left cdem trade lowwage fdi, panels(h)
* heteroscedastic and correlated error structure
xtgls govcons growthpc depratio left cdem trade lowwage, panels(c)
* independent autocorrelation structure
xtgls spend unem growthpc depratio left cdem trade lowwage fdi, corr(i)
* AR1 autocorrelation structure
xtgls spend unem growthpc depratio left cdem trade lowwage fdi, corr(ar1)
* panel specific AR1 autocorrelation structure
xtgls spend unem growthpc depratio left cdem trade lowwage fdi, corr(psar1)
*Parks Kmenta
xtgls govcons growthpc depratio left cdem trade lowwage, panels(c) corr(psar1)
*xtpcse
xtpcse spend unem growthpc depratio left cdem trade lowwage fdi
xtpcse spend unem growthpc depratio left cdem trade lowwage fdi, corr(ar1)
xtpcse spend unem growthpc depratio left cdem trade lowwage fdi, corr(psar1)
*______
*Autocorrelation:
*now let's look at a single time-series, e.g. Germany (or the UK) – what has changed, what is the difference to the above models?
reg spend unem growthpc depratio left cdem trade lowwage fdi if country=="Germany"
*Do the same tests again: what can you observe?
estat hettest
estat ovtest
*Now let's turn to another violation of the Gauss-Markov assumptions: autocorrelation (serial correlation), and test for it with the Durbin-Watson statistic and the Breusch-Godfrey test. How to interpret the results?
estat dwatson
estat bgodfrey
*Or a simpler test: what can we observe?
predict spend_residger, resid
tsset cc year   // the lag operator below needs the panel structure declared (cc as country id is an assumption)
gen lagspend_residger=l.spend_residger
reg spend_residger lagspend_residger unem growthpc depratio left cdem trade lowwage fdi if country == "Germany"
*the simplest way to deal with serial correlation is to include the lagged value of the dependent variable on the right-hand side of the model – the LDV (lagged dependent variable). BUT this opens another can of worms… we will talk about this in more detail when we cover time-series models. How to interpret the coefficient of the LDV?
*Also experiment with including and excluding variables and re-run the omitted variable bias and heteroskedasticity tests.
*Then do the same for the UK!
reg spend spendl unem growthpc depratio left cdem trade lowwage fdi if country=="Germany"
*And do the tests again:
estat dwatson
estat bgodfrey
*Run a Prais-Winsten model:
prais spend unem growthpc depratio left cdem trade lowwage fdi if country=="Germany"
*Play a little around
*______
*test for functional form:
reg spend unem growthpc depratio left cdem trade lowwage fdi
acprplot unem, mspline
acprplot left, mspline
*etc…
gen ln_unem = ln(unem)
reg spend ln_unem growthpc depratio left cdem trade lowwage fdi
gen sqrt_unem=unem^2   // note: despite its name, this is the squared term, not the square root
reg spend unem sqrt_unem growthpc depratio left cdem trade lowwage fdi
*what effect can we observe? Calculate the “turning point”!
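*With spend = b0 + b1*unem + b2*unem^2 + ..., the turning point lies at
*unem = -b1/(2*b2); nlcom computes it with a standard error (assuming the
*quadratic model above was just estimated):
nlcom -_b[unem]/(2*_b[sqrt_unem])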
*______
*Interpretation of a dummy variable and interaction effects:
reg spend skand if year==1986
reg spend unem skand
gen skand_unem=unem*skand
reg spend unem skand skand_unem
reg spend unem skand skand_unem growthpc depratio left cdem trade lowwage fdi
*Interaction effects:
gen unem_trade=unem*trade
reg spend unem trade growthpc depratio left cdem lowwage fdi
reg spend unem trade unem_trade growthpc depratio left cdem lowwage fdi
ssc install sslope
sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(unem trade unem_trade)
sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(trade unem unem_trade)
*graphical display of IA effects:
sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(unem trade unem_trade) graph
sslope spend unem trade unem_trade growthpc depratio left cdem lowwage fdi, i(trade unem unem_trade) graph
sum trade unem
*Program marginal effects of IA effects:
***********************************************************************************
capture drop MV-lower
reg spend unem trade unem_trade growthpc depratio left cdem lowwage fdi
generate MV=((_n-1)*10)
replace MV=. if _n>17
matrix b=e(b)
matrix V=e(V)
scalar b1=b[1,1]
scalar b2=b[1,2]
scalar b3=b[1,3]
scalar varb1=V[1,1]
scalar varb2=V[2,2]
scalar varb3=V[3,3]
scalar covb1b3=V[1,3]
scalar covb2b3=V[2,3]
scalar list b1 b2 b3 varb1 varb2 varb3 covb1b3 covb2b3
gen conb=b1+b3*MV if _n<=17
gen conse=sqrt(varb1+varb3*(MV^2)+2*covb1b3*MV) if _n<=17
gen a=1.96*conse
gen upper=conb+a
gen lower=conb-a
graph twoway (line conb MV, clwidth(medium) clcolor(black))/*
*/(line upper MV, clpattern(dash) clwidth(thin) clcolor(black))/*
*/(line lower MV, clpattern(dash) clwidth(thin) clcolor(black)) , /*
*/ xlabel(0 20 40 60 80 100 120 140 160, labsize(2.5)) /*
*/ ylabel(-0.5 0 0.5 1 1.5 2, labsize(2.5)) yscale(noline) xscale(noline) legend(col(1) order(1 2) label(1 "Marginal Effect of Unemployment on Spending") label(2 "95% Confidence Interval") /*
*/ label(3 " ")) yline(0, lcolor(black)) title("Marginal Effect of Unemployment on Spending as Trade Openness changes", size(4))/*
*/ subtitle(" " "Dependent Variable: Government Spending" " ", size(3)) xtitle( Trade Openness, size(3) ) xsca(titlegap(2)) ysca(titlegap(2)) /*
*/ ytitle("Marginal Effect of Unemployment", size(3)) scheme(s2mono) graphregion(fcolor(white))
graph export m:\...\unem_trade1.eps, replace
translate @Graph m:\...\unem_trade1.wmf, replace
capture drop MV-lower
***********************************************************************************
*or much easier:
*download grinter.ado from Fred Boehmke
findit grinter
reg spend unem trade unem_trade growthpc depratio left cdem lowwage fdi
grinter unem, inter(unem_trade) const02(trade) depvar(spend) kdensity yline(0)
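*In Stata 11 and later, the same marginal-effect plot can be produced with
*factor-variable notation and marginsplot, without generating the interaction
*term by hand; a sketch (the at() grid for trade is an assumption):
reg spend c.unem##c.trade growthpc depratio left cdem lowwage fdi
margins, dydx(unem) at(trade=(0(20)160))
marginsplot, yline(0)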