Getting Started with STATA

Getting STATA:

STATA is a widely used piece of statistical software. In economics, it is the most commonly used software in labour and other fields that rely on large microdata files. You can obtain a 6-month license to STATA/IC version 15 from the STATA website for $45 US:

You do not need the more powerful versions STATA/SE or STATA/MP. If you can find an earlier release at a ‘better’ price it should be adequate if it isn’t too old (don’t go earlier than version 12). When you have installed a version of STATA please let me know which version number you have.

Learning STATA:

STATA is widely used and as a result there are many help sites, discussion boards and online videos. Here is a thorough tutorial from Princeton University:

Here is a link to some videos from STATA itself:

Once you open STATA you will see a “Help” menu at the top of the window. This gives you access to the software’s help features, including help by ‘STATA command’, and to the internal STATA manual. The Princeton tutorial mentioned above has a useful discussion of “Data Management” (how to read in and save data). See also its discussions of “log” files (which record the commands and results from your session) and “do” files (program files in which you write your commands). Also familiarize yourself with the command for creating new variables (“generate”), and notice especially how STATA creates dummy variables. Also useful are the commands that define which observations are in the sample (“drop” and “keep”) and the basic statistical commands: “summarize” (for descriptive statistics) and “tabulate” (for frequencies). ‘table’ and ‘means’ can also be useful for generating basic descriptive statistics, and the “bys” prefix is useful for generating statistics for sub-groups of the sample.

I have provided you with a basic example of a STATA program file with notes on the basic commands and instructions on how to run it (see the file "example_2018.do" and the datafile that it reads, example_data_2018.csv; save a copy of each). Some explanation of basic commands is also given below.

Do-files and the STATA editor:

A STATA do-file is a script or program file. It contains STATA commands that will be executed when the do-file is run (executed). When doing an assignment or project you will want to create do-files so that you have a record of what you have done. This record allows you to avoid having to remember and re-enter sequences of commands if you make errors in your program. They also save you a lot of time if you need to run a similar program at a later date as you can just edit your old program file.

To create a program (do) file in STATA, first open STATA (double-click the STATA icon). Once STATA has opened, go to the icons in the upper-left part of the screen and hover over each to see what it does. Click on the icon called the “New Do-file Editor”. This is STATA’s built-in program file editor. You now have a blank window in which you can type the individual STATA commands that you wish to include in your program. Once you have typed in all your commands you can save the result (either click the ‘save icon’ or go under ‘File’ and click the save option). This will create a permanent record of your program (it will be saved as a STATA “do” file with extension .do). Do-files can be run later by double-clicking on them. You can also run or execute a do-file from the editor if you wish (look under ‘Tools’ or click the Execute icon to run the program). A do-file can also be run by typing the command ‘do’ followed by the location and name of the do-file on STATA’s command line, e.g. if the file is called example_2018.do and is in a folder C:\Mike\, typing: do C:\Mike\example_2018.do will run it.

An example data file and program

Have a look at the example do-file from the course website (example_2018.do) and also look at the data-file that it reads (example_data_2018.csv; a printout of this file is at the end of this document). The data-file example_data_2018.csv is in comma-delimited (spreadsheet) format just like the file you will use on Assignment 1. Look first at the data file by opening it (your computer will try to open it with Excel if you double-click on the file). Once it is open in Excel, you will see the names of seven variables in the first row of the file: RECORD, SEX, AGE, WAGE, EDUCATION, UNION and WGHT. Below this are 20 made-up observations on these variables (each row contains the data on a particular person or observation). Codebooks for data files tell you what different values of a variable mean. A codebook for this file might look something like the following:

Name / Description / Code (value reported)
RECORD / Person ID number in file /
SEX / Sex of individual / 1 = Man, 2 = Woman
AGE / Age in years (15-75) /
WAGE / Hourly wage in $ (rounded) /
EDUCATION / Education level / 1 = Low, 2 = Middle, 3 = High
UNION / Unionized job / 0 = No, 1 = Yes
WGHT / Weighting variable /

The codebook tells you how to interpret the entries in the data-file. Using the codebook you can see for example that the first person in the dataset is a man, age 22, paid $16/hour, with middle level education in a non-union job. The RECORD value is just an identifier for the observation. WGHT is discussed below. We now want to look at a STATA do-file that will read this file and do some basic calculations using the data from the file.

Example do-file: example_2018.do (a printout of the commands in the file appears near the end of this document)

To look at the do-file right-click on it and choose “edit”; this will open the do-file in STATA’s do-file editor. The editor assigns different colors to different types of code. Lines colored green are explanatory comments rather than commands (in the editor you can tell STATA that you are writing a comment by starting the line with an *). The lines with text in blue are STATA commands, while text shown in black is other text such as variable names or options associated with some STATA command. Let’s look at the major commands in the file.

The first STATA command you see in the file is:

set more off

This command controls on-screen output. When the program is running it will show the resulting output on screen. Sometimes STATA is set up so that it pauses each time the screen is full of output. "set more off" has STATA advance through the output on screen without you having to manually press <enter> at the end of each full screen of output.

Output files:

The next command is:

log using c:\Users\Mike\3111\example_2018.smcl, replace

This tells STATA to create an output file called “example_2018.smcl” located in a folder on my computer called “c:\Users\Mike\3111\”. All commands and resulting output from running the program will be written to this file. This is useful as it gives you a record of your session. If you do not include a ‘log’ command STATA will only display the output on screen. The ‘replace’ option tells STATA to overwrite any existing file of the same name. If the extension .log is used the output file will be in plain text format. If, as in the example, it has the extension .smcl it will be in a more formatted output (I find the .smcl format useful if I want to cut and paste output into Excel).
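
As a minimal sketch of how a log is opened and closed within a do-file (the file name my_session.log is hypothetical, and the log is written to STATA's current working directory rather than to the folder used above):

```stata
* Close any log left open by an earlier run; "capture" suppresses the
* error that "log close" would otherwise give when no log is open.
capture log close

* Open a plain-text log; "replace" overwrites an existing file of this name.
* my_session.log is a hypothetical name used only for illustration.
log using my_session.log, text replace

display "this line and its output are recorded in the log"

* Stop recording and close the file.
log close
```

Putting "capture log close" first is a common habit: it lets you re-run a do-file that stopped part-way through without STATA complaining that a log is already open.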

Reading in data:

The next command you will see reads the data file ‘example_data_2018.csv’. It is:

insheet using c:\Users\Mike\3111\example_data_2018.csv

‘insheet’ tells STATA that it is reading a worksheet format file. "using" indicates the file you intend to read and the text following ‘using’ tells STATA the location and name of the file in which the data can be found. STATA can read datafiles in several common data formats including free-format (where variables are text separated by a space), fixed format (where each variable is confined to specific columns of a data matrix) and STATA’s own dataset format. Further detail on how to read data in these formats is provided in the example program comments.
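
Once a file has been read it is worth checking that the variables arrived as expected. The sketch below is an illustration rather than part of example_2018.do: so that it runs without the CSV file, it types in the first four rows of the example data by hand using ‘input’, and then inspects them with ‘describe’ and ‘list’ (two commands not otherwise covered in this handout).

```stata
* Start with an empty dataset.
clear

* Type in the first four observations of example_data_2018.csv by hand.
input record sex age wage education union wght
1 1 22 16 2 0 10
2 1 32 19 2 0 5
3 2 54 30 2 0 10
4 2 44 24 1 1 15
end

* Show variable names and storage types, then print the observations.
describe
list
```

The same two commands can be run straight after the ‘insheet’ line to confirm that all seven variables and 20 observations were read.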

Creating new variables or changing old variables:

The next commands in the example do-file are “generate” statements (‘gen’ for short; many STATA commands can be abbreviated, e.g. ‘sum’ for ‘summarize’ and ‘tab’ for ‘tabulate’). These allow you to create new variables from existing variables. Put the name of the new variable on the left-hand side of the equation and the expression defining the new variable on the right-hand side:

gen wage_cents=wage*100

creates a variable called wage_cents which is ‘wage’ times 100. Here are a few other examples:

gen agesq=age*age

gen age_cubed=age^3

gen log_age=log(age)

In STATA * is multiplication, / is division, ^ is for exponents, and log or ln is for natural logs, etc. You can find more information on STATA functions and operators by looking in the Help menu and giving “functions” as the STATA command you desire help for.
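
A quick way to check how an operator or function behaves is to evaluate an expression with ‘display’, which is not used in example_2018.do but works like a calculator. A small sketch:

```stata
* Each line prints the value of the expression that follows "display".

* Division uses the forward slash: prints 3.5
display 7/2

* Multiplication: prints 12
display 3*4

* Exponent: prints 8
display 2^3

* ln() and log() are both the natural log: each prints 0
display ln(1)
display log(1)
```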

Dummy variables equal 1 if an observation has some characteristic and 0 if it does not. Dummies can also be created using the ‘gen’ command. Notice that the variable UNION in the example data is already a dummy. To create a dummy from existing variables write ‘gen’ followed by the name you want to give the dummy variable and on the other side of the equation state the logical condition that the observation must satisfy if the dummy is to equal 1. For such logical statements operators are: == for equals, <= less than or equal, >= greater than or equal, != not equal. If there are multiple conditions | means “or” and & means “and”. Here are some examples from the do-file:

gen male=(sex==1)

creates a dummy “male” which equals 1 for observations with the variable ‘sex’ equal to 1 and equals 0 for other observations.

gen union_man=(sex==1) & (union==1)

creates a dummy “union_man” which equals 1 for men who are unionized.

gen hed_un=(union==1) | (education==3)

creates a dummy “hed_un” which equals 1 for those who either are unionized or have high level education.

Replace is another command that is sometimes useful in defining or redefining variables. For example:

gen old=0

replace old=1 if (age>40)

These two commands first generate a variable "old" that is always 0 and then replace the 0 with 1 wherever the stated logical condition (age>40) is satisfied.
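
The two-step ‘gen’/‘replace’ approach and the one-line logical expression give the same dummy. The sketch below checks this on the first four rows of the example data, typed in by hand so that it runs on its own; the variable names old_a and old_b are made up for the illustration. One caution: STATA treats a missing numeric value as larger than any number, so a condition like age>40 is also true when age is missing. The example data have no missing values, but with real data it is safer to add a condition such as ‘if !missing(age)’.

```stata
clear
input record sex age wage education union wght
1 1 22 16 2 0 10
2 1 32 19 2 0 5
3 2 54 30 2 0 10
4 2 44 24 1 1 15
end

* Two-step version, as in the do-file.
gen old_a=0
replace old_a=1 if (age>40)

* One-line version; the added "if" leaves the dummy missing
* (rather than 1) for any observation with missing age.
gen old_b=(age>40) if !missing(age)

* Stops with an error if the two versions ever differ.
assert old_a==old_b

list age old_a old_b
```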

Defining the sample:

A couple of other useful commands, ‘drop if’ and ‘keep if’, allow you to create subsets from the dataset in memory (this is useful if you only want to perform calculations on some subset of observations):

drop if(sex==1) deletes any observations that satisfy sex==1.

keep if(sex==2) keeps only observations that satisfy sex==2.

‘drop’ and ‘keep’ are not reversible, i.e. you cannot recall the observations eliminated later in the program though they will still be in the original datafile. ‘drop’ and ‘keep’ can also be used to eliminate variables from the dataset: ‘drop sex’ would eliminate the variable sex from the dataset. ‘keep sex wage union’ keeps only the variables indicated and drops the rest.
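
If you need a subsample only temporarily, one option is to wrap the ‘keep’ or ‘drop’ in ‘preserve’ and ‘restore’, which are not used in example_2018.do. A sketch using the first four rows of the example data (typed in by hand so that the block runs on its own):

```stata
clear
input record sex age wage education union wght
1 1 22 16 2 0 10
2 1 32 19 2 0 5
3 2 54 30 2 0 10
4 2 44 24 1 1 15
end

* Save a temporary copy of the data in memory.
preserve

* Work with women only; "count" reports the number of observations left (2).
keep if (sex==2)
count

* Bring back the full dataset saved by "preserve"; "count" now reports 4.
restore
count
```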

Descriptive statistics:

Descriptive statistics are created by the next few commands: ‘summarize’, ‘tabulate’ and ‘table’.

‘summarize’ (or ‘sum’ for short) asks STATA to calculate basic descriptive statistics for all variables in the dataset, i.e. means, standard deviations, minimum and maximum values, e.g.

summarize

If "summarize" is followed by a list of variable names then STATA will calculate the descriptive statistics only for the variables in the list e.g. the following gives summary stats only for age and sex:

summarize sex age

You can also add an option to summarize asking for additional detail (medians, percentiles, measures of skewness), e.g. summarize sex age, detail

Adding [fweight=???], where ??? is the name of the weight variable (‘wght’ in example_data_2018.csv), will give statistical results where STATA weights each observation with variable ‘wght’. For example:

summarize [fweight=wght]

STATA has a number of weighting options. Use the ‘fweight’ option (frequency weight) in combination with the descriptive statistics commands.

The next command adds an ‘if’ condition to the summarize command. This will restrict the calculation of summary stats to observations that satisfy the if condition (i.e. have sex==1):

summarize if(sex==1)

Many of STATA’s statistical commands can be used in combination with an if-statement.

Adding "bys sex:" before summarize sorts ("s") the data "by" values of the variable ‘sex’ and then reports a separate table of summary statistics for all observations with each value of ‘sex’. The second command below gives weighted summary statistics by sex only for the variable ‘age’.

bys sex: summarize

bys sex: summarize age [fweight=wght]
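
To see what a frequency weight does, the sketch below compares the weighted mean reported by ‘summarize’ with the same figure worked out by hand, using the first four rows of the example data (typed in so that the block runs on its own). ‘summarize’ leaves its results behind in r(); r(mean) is the mean it has just calculated.

```stata
clear
input record sex age wage education union wght
1 1 22 16 2 0 10
2 1 32 19 2 0 5
3 2 54 30 2 0 10
4 2 44 24 1 1 15
end

* Weighted mean of age; "quietly" suppresses the usual output table.
quietly summarize age [fweight=wght]
display r(mean)

* The same calculation by hand: sum of (age x weight) / sum of weights.
display (22*10 + 32*5 + 54*10 + 44*15)/(10 + 5 + 10 + 15)
```

Both ‘display’ lines give 39.5, whereas the unweighted mean of the four ages is 38.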

‘tabulate’ allows you to calculate the frequencies with which a variable takes specific values. You can weight the calculation if appropriate by adding the option [fweight=wght]. Like ‘summarize’, ‘tabulate’ can be used in combination with weighting and ‘if’ options.

tabulate education

gives frequencies of observations for each value of ‘education’. A weighted version:

tabulate education [fweight=wght]

Including two variables will give a cross-tabulation reporting the frequency of observations in each possible combination of the two variables. To get shares rather than just observation counts you need to add ", cell column row" to the usual command:

tabulate sex education, cell column row

The next command repeats the previous command but only for observations that have union==1.

tabulate sex education if(union==1), cell column row
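
‘tabulate’ can also create a full set of dummy variables in one step through its ‘generate()’ option, which is not used in example_2018.do. In the sketch below (the first five rows of the example data, typed in by hand so that it runs on its own) the stub name ed_ and the variable name high are arbitrary choices; STATA appends 1, 2, 3, ... to the stub, one for each distinct value of the variable in increasing order.

```stata
clear
input record sex age wage education union wght
1 1 22 16 2 0 10
2 1 32 19 2 0 5
3 2 54 30 2 0 10
4 2 44 24 1 1 15
5 2 68 31 3 1 10
end

* Frequency table plus one dummy per education level: ed_1, ed_2, ed_3.
tabulate education, generate(ed_)

* Check against a dummy built by hand with "gen".
gen high=(education==3)
assert high==ed_3

list education ed_1 ed_2 ed_3
```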

The next command, ‘table’, creates a table reporting the mean value of ‘wage’ for subsamples where each observation has a specific value of ‘education’.

table education, c(mean wage)

‘if’ or weighting options can be added to the ‘table’ commands. ‘table’ can also be used to generate statistics other than the ‘mean’ e.g. median, minimum (min), maximum (max), sum, frequency (freq), etc.
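
The ‘c()’ syntax above is the one used by the STATA releases this handout was written for. To my understanding the ‘table’ command was redesigned in release 17, so if that line gives an error in a newer release, ‘tabstat’ is a different command that produces a similar table of group statistics. A sketch on the first five rows of the example data (typed in by hand so that it runs on its own):

```stata
clear
input record sex age wage education union wght
1 1 22 16 2 0 10
2 1 32 19 2 0 5
3 2 54 30 2 0 10
4 2 44 24 1 1 15
5 2 68 31 3 1 10
end

* Mean, minimum and maximum of wage within each education level.
tabstat wage, by(education) statistics(mean min max)
```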

Estimation Commands:

Estimation commands have the same basic format. The do-file gives ‘regress’ (Ordinary Least Squares regression) as an example. The basic format is to specify the type of estimate to be made ("regress" for OLS regression), then give the name of the dependent variable, followed by a list of the explanatory variable names. Like the ‘summarize’, ‘tabulate’ and ‘table’ commands, it is possible to weight the results and impose "if" conditions so that the estimates are only generated for certain subsamples. You can also use the by-sort ("bys") prefix to have the specification estimated on a series of subsamples. Unless it is told not to, STATA assumes that the estimated equation has a constant (intercept).

For example:

regress wage age union [pweight=wght]

estimates a simple OLS regression with dependent variable wage and explanatory variables age and union. Note that if you want to weight the observations you use the ‘pweight’ option not ‘fweight’.

This does the same thing but only for observations that have sex==1.

regress wage age union if(sex==1) [pweight=wght]

This creates separate regression estimates (wage the dependent variable, age and union the explanatory variables) for sets of observations which have a common value of sex.

bys sex: regress wage age union
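
As a sketch of the points above, the block below runs the basic regression on the first eight rows of the example data (typed in by hand so that it runs on its own; with so few observations the estimates are for illustration only). It then uses ‘predict’ to save fitted values and the ‘noconstant’ option to drop the intercept. Neither appears in example_2018.do, and the variable name wage_hat is made up.

```stata
clear
input record sex age wage education union wght
1 1 22 16 2 0 10
2 1 32 19 2 0 5
3 2 54 30 2 0 10
4 2 44 24 1 1 15
5 2 68 31 3 1 10
6 2 21 25 2 0 10
7 1 36 30 3 0 5
8 1 57 31 1 0 5
end

* OLS with an intercept (the default).
regress wage age union

* Coefficient on age from the regression just run.
display _b[age]

* Fitted values from that regression, stored in a new variable.
predict wage_hat

* Same specification without the intercept.
regress wage age union, noconstant
```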

Ending your session:

Before shutting down, STATA wants you to clear the data file from memory (type "clear" to do so). You can then exit the program by typing ‘exit’.

You can find the output for the program in the log file opened earlier in the program: example_2018.smcl.

Try running example_2018.do on your own computer. Note that you will have to change the location (c:\Users\Mike\3111\) of the source datafile and log-file to the relevant folder locations on your own computer. Have a look at the onscreen output and the record of that output in the log-file to see what each command does for you.

* This is the same as example_2018.do except that it omits the explanatory comments

* on each command.

set more off

log using c:\Users\Mike\3111\example_2018.smcl, replace

insheet using c:\Users\Mike\3111\example_data_2018.csv

gen wage_cents=wage*100

gen agesq=age*age

gen age_cubed=age^3

gen log_age=log(age)

gen male=(sex==1)

gen union_man=(sex==1) & (union==1)

gen hed_un=(union==1) | (education==3)

gen old=0

replace old=1 if (age>40)

summarize

summarize [fweight=wght]

summarize age sex

bys sex: summarize

bys sex: summarize age [fweight=wght]

tabulate education

tabulate education [fweight=wght]

tabulate sex education, cell column row

tabulate sex education if(union==1), cell column row

table sex, c(mean wage)

regress wage age union [pweight=wght]

regress wage age union if(sex==1) [pweight=wght]

bys sex: regress wage age union

drop if(sex==1)

drop if(sex==1 & age<30)

keep if(sex==2)

clear

exit

Here is what the data-file example_data_2018.csv looks like:

RECORD / SEX / AGE / WAGE / EDUCATION / UNION / WGHT
1 / 1 / 22 / 16 / 2 / 0 / 10
2 / 1 / 32 / 19 / 2 / 0 / 5
3 / 2 / 54 / 30 / 2 / 0 / 10
4 / 2 / 44 / 24 / 1 / 1 / 15
5 / 2 / 68 / 31 / 3 / 1 / 10
6 / 2 / 21 / 25 / 2 / 0 / 10
7 / 1 / 36 / 30 / 3 / 0 / 5
8 / 1 / 57 / 31 / 1 / 0 / 5
9 / 1 / 18 / 13 / 1 / 0 / 5
10 / 2 / 28 / 24 / 2 / 1 / 15
11 / 2 / 35 / 29 / 1 / 1 / 15
12 / 1 / 31 / 28 / 3 / 0 / 10
13 / 1 / 64 / 32 / 2 / 1 / 10
14 / 2 / 56 / 29 / 3 / 0 / 5
15 / 2 / 16 / 11 / 1 / 0 / 5
16 / 1 / 27 / 20 / 1 / 0 / 15
17 / 1 / 31 / 22 / 3 / 1 / 10
18 / 2 / 41 / 23 / 2 / 1 / 5
19 / 1 / 45 / 26 / 3 / 0 / 15
20 / 1 / 52 / 25 / 2 / 0 / 10
