SOCY7709: Quantitative Data Management
Instructor: Natasha Sarkisian
Managing your results and writing programs
Managing data analysis results
Once you are done with data management and turn to data analysis, there is a number of tools in Stata that make your life easier in terms of presenting your results and automating the analysis process. We will first focus on the presentation of results. For this demonstration, let’s use the imputed marriage file we created, marriage_imputed.dta, and run three models that add variables in step by step:
. mi import ice
.reshape long interv mar educ emp enrol, i(id) j(year)
. mi estimate, post: xtreg educ i.race, i(id)
Multiple-imputation estimates Imputations = 5
Random-effects GLS regression Number of obs = 72972
Group variable: id Number of groups = 6081
Obs per group: min = 12
avg = 12.0
max = 12
Average RVI = 0.0149
Largest FMI = 0.0178
DF adjustment: Large sample DF: min = 12803.78
avg = 80932.39
max = 216778.05
Model F test: Equal FMI F( 2, 4948.5) = 77.86
Within VCE type: Conventional Prob > F = 0.0000
------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------+------
race |
2 | -.2220287 .0671802 -3.30 0.001 -.3537115 -.090346
3 | -.9855834 .078833 -12.50 0.000 -1.140108 -.8310589
|
_cons | 12.95061 .0363786 356.00 0.000 12.87931 13.02191
------+------
sigma_u | 2.166422
sigma_e | .55457878
rho | .93850006 (fraction of variance due to u_i)
------
Note: sigma_u and sigma_e are combined in the original metric.
. est store model1
. mi estimate, post: xtreg educ i.race mar emp enrol birthdate , i(id)
Multiple-imputation estimates Imputations = 5
Random-effects GLS regression Number of obs = 72972
Group variable: id Number of groups = 6081
Obs per group: min = 12
avg = 12.0
max = 12
Average RVI = 0.0412
Largest FMI = 0.0986
DF adjustment: Large sample DF: min = 447.38
avg = 3.77e+06
max = 2.62e+07
Model F test: Equal FMI F( 6, 8119.9) = 1050.55
Within VCE type: Conventional Prob > F = 0.0000
------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------+------
race |
2 | -.1411899 .058076 -2.43 0.015 -.2550301 -.0273497
3 | -.9416435 .0682705 -13.79 0.000 -1.075475 -.8078124
|
mar | .2112441 .0062024 34.06 0.000 .1990546 .2234336
emp | .063669 .0062508 10.19 0.000 .0513976 .0759404
enrol | -.6026058 .0093276 -64.60 0.000 -.62089 -.5843216
birthdate | -.0001792 .0000294 -6.10 0.000 -.0002368 -.0001216
_cons | 12.85989 .0328686 391.25 0.000 12.79547 12.92431
------+------
sigma_u | 1.8326545
sigma_e | .52811421
rho | .92332576 (fraction of variance due to u_i)
------
Note: sigma_u and sigma_e are combined in the original metric.
. est store model2
. mi estimate, post: xtreg educ i.race mar emp enrol birthdate parpres pared , i(id)
Multiple-imputation estimates Imputations = 5
Random-effects GLS regression Number of obs = 72972
Group variable: id Number of groups = 6081
Obs per group: min = 12
avg = 12.0
max = 12
Average RVI = 0.0417
Largest FMI = 0.0992
DF adjustment: Large sample DF: min = 442.39
avg = 23190.77
max = 147906.27
Model F test: Equal FMI F( 8,12658.6) = 1039.62
Within VCE type: Conventional Prob > F = 0.0000
------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------+------
race |
2 | .3671885 .0552626 6.64 0.000 .2588588 .4755181
3 | .2627608 .0693041 3.79 0.000 .1268739 .3986477
|
mar | .2113036 .0061743 34.22 0.000 .199169 .2234382
emp | .0622549 .0062293 9.99 0.000 .0500243 .0744855
enrol | -.6044645 .0092907 -65.06 0.000 -.6226765 -.5862524
birthdate | -.0001578 .0000273 -5.79 0.000 -.0002113 -.0001043
parpres | .0267513 .0022266 12.01 0.000 .0223847 .0311178
pared | .2760357 .0086083 32.07 0.000 .2591509 .2929204
_cons | 8.572751 .1012551 84.66 0.000 8.374293 8.771209
------+------
sigma_u | 1.6964547
sigma_e | .52811421
rho | .91165134 (fraction of variance due to u_i)
------
Note: sigma_u and sigma_e are combined in the original metric.
. est store model3
. est table model1 model2 model3, b(%5.3f) star varlabel
------
Variable | model1 model2 model3
------+------
race |
Black | -0.222*** -0.141* 0.367***
Hispanic | -0.986*** -0.942*** 0.263***
Married | 0.211*** 0.211***
Employed | 0.064*** 0.062***
Enrolled in school | -0.603*** -0.604***
Date of birth | -0.000*** -0.000***
Parents' occupational ~s | 0.027***
Parents' education | 0.276***
Constant | 12.951*** 12.860*** 8.573***
------
legend: * p<0.05; ** p<0.01; *** p<0.001
We can then transfer this table into Excel via Textpad (block select mode in Textpad; then in Excel, use Data Text to columns fixed width finish).
Another command, estout, provides more flexibility in formatting, e.g.:
. set linesize 200
. estout model1 model2 model3, label cells("b(fmt(%5.3f)) se(par) _star")
------
model1 model2 model3
b se _star b se _star b se _star
------
1b.Race/ethnicity 0.000 (.) 0.000 (.) 0.000 (.)
2.Race/ethnicity -0.222 (0.067) *** -0.141 (0.058) * 0.367 (0.055) ***
3.Race/ethnicity -0.986 (0.079) *** -0.942 (0.068) *** 0.263 (0.069) ***
Married 0.211 (0.006) *** 0.211 (0.006) ***
Employed 0.064 (0.006) *** 0.062 (0.006) ***
Enrolled in school -0.603 (0.009) *** -0.604 (0.009) ***
Date of birth -0.000 (0.000) *** -0.000 (0.000) ***
Parents' occupatio~s 0.027 (0.002) ***
Parents' education 0.276 (0.009) ***
_cons 12.951 (0.036) *** 12.860 (0.033) *** 8.573 (0.101) ***
------
Estimates store makes a temporary copy of estimates; if you want to store your results permanently, you can store them in a file:
. estimates save filename [, append replace]
and then load them in:
. estimates use filename
In the latter command, you can use “number” option if your file contains more than one set of estimates. Once you loaded the estimates from file, you can use est store and the follow-up est table or estout.
Introduction to writing Stata programs
The key difference between a regular do file and a Stata program is that you are creating your own command that can be later called out by typing its name. Programs are typically saved as .ado files, but they can also be embedded in .do files.
When do we opt for writing a program? The two most common situations are:
--when we plan to do the same procedure over and over again but with different inputs
--when we want the results to be presented in a certain way
Here is a very simple program that can be loaded from .do file:
program define sc709
di as text "We love data management!"
end
Once we run this, the program is loaded into Stata memory, and we can type the command and get the result:
. sc709
We love data management!
Suppose we want to make this program more complicated; we can display some more stuff:
program define socy7709
di as result "`c(current_date)'"
di as text "We love data management!"
di as error "This is a super informative error message"
exit=100
end
But if we try to run this from do file, we get an error – the program is already defined. Once you ran the program once, you have to drop it if you want to run a modified version of this do file:
capture program drop socy7709
program define socy7709
di as result "`c(current_date)'"
di as text "We love data management!"
di as error "This is a super informative error message"
exit=100
end
socy7709
. socy7709
2 Dec 2015
We love data management!
This is a super informative error message
r(100);
end of do-file
r(100);
So your program can be part of your do file, and once you load it, you can keep reusing it later in the do file as many times as needed. In fact, while you are writing and troubleshooting a new program, that might be the easiest way to work on it because you can right away drop the program, load the program back in, and attempt to run it, all from within a do-file. Once you are done with the program, however, the most common way to use it is to save it as an .ado file; these get saved in the ado directory. To find where it is, we can use
. sysdir
STATA: C:\Program Files (x86)\Stata14\
BASE: C:\Program Files (x86)\Stata14\ado\base\
SITE: C:\Program Files (x86)\Stata14\ado\site\
PLUS: c:\ado\plus\
PERSONAL: c:\ado\personal\
OLDPLACE: c:\ado\
Your own programs should be saved in ado\personal and the name of the file should match the name of program – so we would save this program as socy7709.ado. In that file, we would have our program plus some standard information – Stata version for which it is written, date last modified, what the program does, and program syntax (since our program only requires the command and no input, the syntax is empty):
program define socy7709
version 14.0
//last update December 2, 2016
//this program displays some predefined output, current date, and gives an error message
syntax
di as result "`c(current_date)'"
di as text "We love data management!"
di as error "This is a super informative error message"
exit=100
end
This program is very simple as it does not require any input, but most programs do. Recall when we looked at loops, we did multiple embedded loops, and specifically, a double loop for regressing different outcomes on marital status with different omitted categories:
foreach var_y in hrs1 prestg80 educ {
forvalues cat=1/5 {
reg `var_y' ib`cat'.marital
}
}
We can generalize this program so that it runs with any outcome variables and with any number and values of categories:
capture program drop regress_omit
program define regress_omit
version 14.0
//last update December 2, 2016
//this program regresses all variables y on a categorical variable x while omitting one category of x at a time
syntax, x(varname) y(varlist)
//identify all existing values of x
qui levelsof `x', local(categ)
//loop over all y variables
foreach var_y in `y' {
//loop from first to last category of x
foreach level in `categ' {
reg `var_y' ib`level'.`x'
}
}
end
regress_omit, x(race) y(emp mar enrol)
When writing complex programs, we cannot avoid the need to troubleshoot our programs. That can be done by setting trace on in order to see where the program breaks:
.set trace on
You can switch it back off by using
.set trace off
In this process, it is often useful to add extra display (di) lines to the program that show what specific elements are equal to at specific point in the program – i.e., does the local contain what you want it to contain?
We can further modify this program above so that it runs with any command, not just reg:
capture program drop regress_omit
program define regress_omit
version 14.0
//last update December 2, 2016
//this program regresses all continuous variables y on a categorical variable x while omitting one category of x at a time
syntax, x(varname) y(varlist) cmd(string)
//identify all existing values of x
qui levelsof `x', local(categ)
//loop over all y variables
foreach var_y in `y' {
//loop from first to last category of x
foreach level in `categ' {
`cmd' `var_y' ib`level'.`x'
}
}
end
regress_omit, x(race) y(emp mar enrol) cmd(mi estimate: xtreg)
Now for a more complex program example – suppose, we want to present means for three groups, with significance tests indicating if groups 2 and 3 differ from group 1. Our data are multiply imputed, so we would use mi estimate: mean followed up by test in order to do this. For example:
. mi estimate, post: mean mar, over(race)
Multiple-imputation estimates Imputations = 5
Mean estimation Number of obs = 72972
Average RVI = 0.0710
Largest FMI = 0.1046
Complete DF = 72971
DF adjustment: Small sample DF: min = 396.43
avg = 1275.42
Within VCE type: Analytic max = 2055.68
White: race = White
Black: race = Black
Hispanic: race = Hispanic
------
Over | Mean Std. Err. [95% Conf. Interval]
------+------
mar |
White | .6414751 .0023778 .6368107 .6461396
Black | .3597424 .0036407 .3526027 .3668822
Hispanic | .5527132 .0048194 .5432385 .562188
------
Note: numbers of observations in e(_N) vary among imputations.
. test [mar]White=[mar]Black
( 1) [mar]White - [mar]Black = 0
F( 1, 72971) = 4134.91
Prob > F = 0.0000
If we have a bunch of variables and we want to have a nice table with three columns and significance test results indicated as stars, we have a lot of work to do. Instead, we will write a program to do that.
capture program drop mimeans3gr
program define mimeans3gr
version 14.0
//last update December 2, 2016
syntax varlist [if] [pweight], OVer(varname)
//create a temporary variable for "over" variable without category labels
tempvar cat3var
qui gen `cat3var'=`over'
//display a header based on labels specified in "over"
di as text "Variable" _col(20) "| `: label (`over') 1'" _col(`=20+20') ///
"| `: label (`over') 2'" _col(`=20+20+20') "| `: label (`over') 3'" ///
_col(`=20+20+20+20') "|"
//create a macro called `1' that will contain variables from varlist, one by one
tokenize `varlist'
//loop through variables
while `"`1'"' ~= "" {
qui mi estimate, post esampvaryok: mean `1' `if' [`weight' `exp'], over(`cat3var')
//save matrices with results and create locals with desired quantities
mat means=e(b)
mat var=e(V)
mat N=e(_N_vm_mi)
local varname = e(varlist)
local mean1=means[1,1]
local mean2=means[1,2]
local mean3=means[1,3]
local N1=N[1,1]
local N2=N[1,2]
local N3=N[1,3]
//do significance tests and store results
local stars1=""
local stars2=""
qui test [`1']1=[`1']2
local pval=r(p)
if `pval'<.05 {
local stars1 `stars1'*
}
if `pval'<.01 {
local stars1 `stars1'*
}
if `pval'<.001 {
local stars1 `stars1'*
}
qui test [`1']1=[`1']3
local pval=r(p)
if `pval'<.05 {
local stars2 `stars2'*
}
if `pval'<.01 {
local stars2 `stars2'*
}
if `pval'<.001 {
local stars2 `stars2'*
}
//display the results for current variable
di as text "`varname'" _col(20) "|" %5.2f `mean1' _col(`=20+13') ///
"(" %5.0f `N1' ")" "|" _col(`=20+13+6') %5.2f `mean2' _col(`=20+13+6+7') ///
"`stars1'" _col(`=20+13+6+7+6') "(" %5.0f `N2' ") " "|" _col(`=20+13+6+7+6+7') ///
%5.2f `mean3' _col(`=20+13+6+7+6+7+6') "`stars2'" _col(`=20+13+6+7+6+7+6+7') ///
"(" %5.0f `N3' ") " "|"
//advance the macro `1' to the next variable
mac shift
}
end
. mimeans3gr mar emp enrol if educ>12, ov(race)
Variable | White | Black | Hispanic |
mar | 0.61 (18815)| 0.38*** ( 8004) | 0.53*** ( 4228) |
emp | 0.74 (18815)| 0.77*** ( 8004) | 0.78*** ( 4228) |
enrol | 0.12 (18815)| 0.12 ( 8004) | 0.15*** ( 4228) |
1