SOCY7709: Quantitative Data Management

Instructor: Natasha Sarkisian

Managing your results and writing programs

Managing data analysis results

Once you are done with data management and turn to data analysis, there is a number of tools in Stata that make your life easier in terms of presenting your results and automating the analysis process. We will first focus on the presentation of results. For this demonstration, let’s use the imputed marriage file we created, marriage_imputed.dta, and run three models that add variables in step by step:

. mi import ice

.reshape long interv mar educ emp enrol, i(id) j(year)

. mi estimate, post: xtreg educ i.race, i(id)

Multiple-imputation estimates Imputations = 5

Random-effects GLS regression Number of obs = 72972

Group variable: id Number of groups = 6081

Obs per group: min = 12

avg = 12.0

max = 12

Average RVI = 0.0149

Largest FMI = 0.0178

DF adjustment: Large sample DF: min = 12803.78

avg = 80932.39

max = 216778.05

Model F test: Equal FMI F( 2, 4948.5) = 77.86

Within VCE type: Conventional Prob > F = 0.0000

------

educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]

------+------

race |

2 | -.2220287 .0671802 -3.30 0.001 -.3537115 -.090346

3 | -.9855834 .078833 -12.50 0.000 -1.140108 -.8310589

|

_cons | 12.95061 .0363786 356.00 0.000 12.87931 13.02191

------+------

sigma_u | 2.166422

sigma_e | .55457878

rho | .93850006 (fraction of variance due to u_i)

------

Note: sigma_u and sigma_e are combined in the original metric.

. est store model1

. mi estimate, post: xtreg educ i.race mar emp enrol birthdate , i(id)

Multiple-imputation estimates Imputations = 5

Random-effects GLS regression Number of obs = 72972

Group variable: id Number of groups = 6081

Obs per group: min = 12

avg = 12.0

max = 12

Average RVI = 0.0412

Largest FMI = 0.0986

DF adjustment: Large sample DF: min = 447.38

avg = 3.77e+06

max = 2.62e+07

Model F test: Equal FMI F( 6, 8119.9) = 1050.55

Within VCE type: Conventional Prob > F = 0.0000

------

educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]

------+------

race |

2 | -.1411899 .058076 -2.43 0.015 -.2550301 -.0273497

3 | -.9416435 .0682705 -13.79 0.000 -1.075475 -.8078124

|

mar | .2112441 .0062024 34.06 0.000 .1990546 .2234336

emp | .063669 .0062508 10.19 0.000 .0513976 .0759404

enrol | -.6026058 .0093276 -64.60 0.000 -.62089 -.5843216

birthdate | -.0001792 .0000294 -6.10 0.000 -.0002368 -.0001216

_cons | 12.85989 .0328686 391.25 0.000 12.79547 12.92431

------+------

sigma_u | 1.8326545

sigma_e | .52811421

rho | .92332576 (fraction of variance due to u_i)

------

Note: sigma_u and sigma_e are combined in the original metric.

. est store model2

. mi estimate, post: xtreg educ i.race mar emp enrol birthdate parpres pared , i(id)

Multiple-imputation estimates Imputations = 5

Random-effects GLS regression Number of obs = 72972

Group variable: id Number of groups = 6081

Obs per group: min = 12

avg = 12.0

max = 12

Average RVI = 0.0417

Largest FMI = 0.0992

DF adjustment: Large sample DF: min = 442.39

avg = 23190.77

max = 147906.27

Model F test: Equal FMI F( 8,12658.6) = 1039.62

Within VCE type: Conventional Prob > F = 0.0000

------

educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]

------+------

race |

2 | .3671885 .0552626 6.64 0.000 .2588588 .4755181

3 | .2627608 .0693041 3.79 0.000 .1268739 .3986477

|

mar | .2113036 .0061743 34.22 0.000 .199169 .2234382

emp | .0622549 .0062293 9.99 0.000 .0500243 .0744855

enrol | -.6044645 .0092907 -65.06 0.000 -.6226765 -.5862524

birthdate | -.0001578 .0000273 -5.79 0.000 -.0002113 -.0001043

parpres | .0267513 .0022266 12.01 0.000 .0223847 .0311178

pared | .2760357 .0086083 32.07 0.000 .2591509 .2929204

_cons | 8.572751 .1012551 84.66 0.000 8.374293 8.771209

------+------

sigma_u | 1.6964547

sigma_e | .52811421

rho | .91165134 (fraction of variance due to u_i)

------

Note: sigma_u and sigma_e are combined in the original metric.

. est store model3

. est table model1 model2 model3, b(%5.3f) star varlabel

------

Variable | model1 model2 model3

------+------

race |

Black | -0.222*** -0.141* 0.367***

Hispanic | -0.986*** -0.942*** 0.263***

Married | 0.211*** 0.211***

Employed | 0.064*** 0.062***

Enrolled in school | -0.603*** -0.604***

Date of birth | -0.000*** -0.000***

Parents' occupational ~s | 0.027***

Parents' education | 0.276***

Constant | 12.951*** 12.860*** 8.573***

------

legend: * p<0.05; ** p<0.01; *** p<0.001

We can then transfer this table into Excel via Textpad (block select mode in Textpad; then in Excel, use Data Text to columns  fixed width  finish).

Another command, estout, provides more flexibility in formatting, e.g.:

. set linesize 200

. estout model1 model2 model3, label cells("b(fmt(%5.3f)) se(par) _star")

------

model1 model2 model3

b se _star b se _star b se _star

------

1b.Race/ethnicity 0.000 (.) 0.000 (.) 0.000 (.)

2.Race/ethnicity -0.222 (0.067) *** -0.141 (0.058) * 0.367 (0.055) ***

3.Race/ethnicity -0.986 (0.079) *** -0.942 (0.068) *** 0.263 (0.069) ***

Married 0.211 (0.006) *** 0.211 (0.006) ***

Employed 0.064 (0.006) *** 0.062 (0.006) ***

Enrolled in school -0.603 (0.009) *** -0.604 (0.009) ***

Date of birth -0.000 (0.000) *** -0.000 (0.000) ***

Parents' occupatio~s 0.027 (0.002) ***

Parents' education 0.276 (0.009) ***

_cons 12.951 (0.036) *** 12.860 (0.033) *** 8.573 (0.101) ***

------

Estimates store makes a temporary copy of estimates; if you want to store your results permanently, you can store them in a file:

. estimates save filename [, append replace]

and then load them in:

. estimates use filename

In the latter command, you can use “number” option if your file contains more than one set of estimates. Once you loaded the estimates from file, you can use est store and the follow-up est table or estout.

Introduction to writing Stata programs

The key difference between a regular do file and a Stata program is that you are creating your own command that can be later called out by typing its name. Programs are typically saved as .ado files, but they can also be embedded in .do files.

When do we opt for writing a program? The two most common situations are:

--when we plan to do the same procedure over and over again but with different inputs

--when we want the results to be presented in a certain way

Here is a very simple program that can be loaded from .do file:

program define sc709

di as text "We love data management!"

end

Once we run this, the program is loaded into Stata memory, and we can type the command and get the result:

. sc709

We love data management!

Suppose we want to make this program more complicated; we can display some more stuff:

program define socy7709

di as result "`c(current_date)'"

di as text "We love data management!"

di as error "This is a super informative error message"

exit=100

end

But if we try to run this from do file, we get an error – the program is already defined. Once you ran the program once, you have to drop it if you want to run a modified version of this do file:

capture program drop socy7709

program define socy7709

di as result "`c(current_date)'"

di as text "We love data management!"

di as error "This is a super informative error message"

exit=100

end

socy7709

. socy7709

2 Dec 2015

We love data management!

This is a super informative error message

r(100);

end of do-file

r(100);

So your program can be part of your do file, and once you load it, you can keep reusing it later in the do file as many times as needed. In fact, while you are writing and troubleshooting a new program, that might be the easiest way to work on it because you can right away drop the program, load the program back in, and attempt to run it, all from within a do-file. Once you are done with the program, however, the most common way to use it is to save it as an .ado file; these get saved in the ado directory. To find where it is, we can use

. sysdir

STATA: C:\Program Files (x86)\Stata14\

BASE: C:\Program Files (x86)\Stata14\ado\base\

SITE: C:\Program Files (x86)\Stata14\ado\site\

PLUS: c:\ado\plus\

PERSONAL: c:\ado\personal\

OLDPLACE: c:\ado\

Your own programs should be saved in ado\personal and the name of the file should match the name of program – so we would save this program as socy7709.ado. In that file, we would have our program plus some standard information – Stata version for which it is written, date last modified, what the program does, and program syntax (since our program only requires the command and no input, the syntax is empty):

program define socy7709

version 14.0

//last update December 2, 2016

//this program displays some predefined output, current date, and gives an error message

syntax

di as result "`c(current_date)'"

di as text "We love data management!"

di as error "This is a super informative error message"

exit=100

end

This program is very simple as it does not require any input, but most programs do. Recall when we looked at loops, we did multiple embedded loops, and specifically, a double loop for regressing different outcomes on marital status with different omitted categories:

foreach var_y in hrs1 prestg80 educ {

forvalues cat=1/5 {

reg `var_y' ib`cat'.marital

}

}

We can generalize this program so that it runs with any outcome variables and with any number and values of categories:

capture program drop regress_omit

program define regress_omit

version 14.0

//last update December 2, 2016

//this program regresses all variables y on a categorical variable x while omitting one category of x at a time

syntax, x(varname) y(varlist)

//identify all existing values of x

qui levelsof `x', local(categ)

//loop over all y variables

foreach var_y in `y' {

//loop from first to last category of x

foreach level in `categ' {

reg `var_y' ib`level'.`x'

}

}

end

regress_omit, x(race) y(emp mar enrol)

When writing complex programs, we cannot avoid the need to troubleshoot our programs. That can be done by setting trace on in order to see where the program breaks:

.set trace on

You can switch it back off by using

.set trace off

In this process, it is often useful to add extra display (di) lines to the program that show what specific elements are equal to at specific point in the program – i.e., does the local contain what you want it to contain?

We can further modify this program above so that it runs with any command, not just reg:

capture program drop regress_omit

program define regress_omit

version 14.0

//last update December 2, 2016

//this program regresses all continuous variables y on a categorical variable x while omitting one category of x at a time

syntax, x(varname) y(varlist) cmd(string)

//identify all existing values of x

qui levelsof `x', local(categ)

//loop over all y variables

foreach var_y in `y' {

//loop from first to last category of x

foreach level in `categ' {

`cmd' `var_y' ib`level'.`x'

}

}

end

regress_omit, x(race) y(emp mar enrol) cmd(mi estimate: xtreg)

Now for a more complex program example – suppose, we want to present means for three groups, with significance tests indicating if groups 2 and 3 differ from group 1. Our data are multiply imputed, so we would use mi estimate: mean followed up by test in order to do this. For example:

. mi estimate, post: mean mar, over(race)

Multiple-imputation estimates Imputations = 5

Mean estimation Number of obs = 72972

Average RVI = 0.0710

Largest FMI = 0.1046

Complete DF = 72971

DF adjustment: Small sample DF: min = 396.43

avg = 1275.42

Within VCE type: Analytic max = 2055.68

White: race = White

Black: race = Black

Hispanic: race = Hispanic

------

Over | Mean Std. Err. [95% Conf. Interval]

------+------

mar |

White | .6414751 .0023778 .6368107 .6461396

Black | .3597424 .0036407 .3526027 .3668822

Hispanic | .5527132 .0048194 .5432385 .562188

------

Note: numbers of observations in e(_N) vary among imputations.

. test [mar]White=[mar]Black

( 1) [mar]White - [mar]Black = 0

F( 1, 72971) = 4134.91

Prob > F = 0.0000

If we have a bunch of variables and we want to have a nice table with three columns and significance test results indicated as stars, we have a lot of work to do. Instead, we will write a program to do that.

capture program drop mimeans3gr

program define mimeans3gr

version 14.0

//last update December 2, 2016

syntax varlist [if] [pweight], OVer(varname)

//create a temporary variable for "over" variable without category labels

tempvar cat3var

qui gen `cat3var'=`over'

//display a header based on labels specified in "over"

di as text "Variable" _col(20) "| `: label (`over') 1'" _col(`=20+20') ///

"| `: label (`over') 2'" _col(`=20+20+20') "| `: label (`over') 3'" ///

_col(`=20+20+20+20') "|"

//create a macro called `1' that will contain variables from varlist, one by one

tokenize `varlist'

//loop through variables

while `"`1'"' ~= "" {

qui mi estimate, post esampvaryok: mean `1' `if' [`weight' `exp'], over(`cat3var')

//save matrices with results and create locals with desired quantities

mat means=e(b)

mat var=e(V)

mat N=e(_N_vm_mi)

local varname = e(varlist)

local mean1=means[1,1]

local mean2=means[1,2]

local mean3=means[1,3]

local N1=N[1,1]

local N2=N[1,2]

local N3=N[1,3]

//do significance tests and store results

local stars1=""

local stars2=""

qui test [`1']1=[`1']2

local pval=r(p)

if `pval'<.05 {

local stars1 `stars1'*

}

if `pval'<.01 {

local stars1 `stars1'*

}

if `pval'<.001 {

local stars1 `stars1'*

}

qui test [`1']1=[`1']3

local pval=r(p)

if `pval'<.05 {

local stars2 `stars2'*

}

if `pval'<.01 {

local stars2 `stars2'*

}

if `pval'<.001 {

local stars2 `stars2'*

}

//display the results for current variable

di as text "`varname'" _col(20) "|" %5.2f `mean1' _col(`=20+13') ///

"(" %5.0f `N1' ")" "|" _col(`=20+13+6') %5.2f `mean2' _col(`=20+13+6+7') ///

"`stars1'" _col(`=20+13+6+7+6') "(" %5.0f `N2' ") " "|" _col(`=20+13+6+7+6+7') ///

%5.2f `mean3' _col(`=20+13+6+7+6+7+6') "`stars2'" _col(`=20+13+6+7+6+7+6+7') ///

"(" %5.0f `N3' ") " "|"

//advance the macro `1' to the next variable

mac shift

}

end

. mimeans3gr mar emp enrol if educ>12, ov(race)

Variable | White | Black | Hispanic |

mar | 0.61 (18815)| 0.38*** ( 8004) | 0.53*** ( 4228) |

emp | 0.74 (18815)| 0.77*** ( 8004) | 0.78*** ( 4228) |

enrol | 0.12 (18815)| 0.12 ( 8004) | 0.15*** ( 4228) |

1