Stata Guide

Professor Thornton

Economics 415/514

Econometrics


INTRODUCTION

This guide provides an explanation of Stata commands required to do the in-class assignments and homework for Econ 415 and 514. It does not explain all Stata capabilities or commands. Stata is a very powerful statistical package, and if you desire to learn its full capabilities you should consult the Stata User’s Manuals.

DATA SETS

In this guide, Stata commands are explained in the context of examples. The examples are based on the following two data sets. It is assumed that each data set is contained in a file on a memory stick in drive E. If your data files are on drive C, or on a memory stick located in a different drive such as drive F, modify the examples below accordingly (e.g., replace the letter E with the letter C or F). Using a memory stick is recommended.

WAGEDATA

The data file WAGEDATA consists of a cross-section of 49 workers. The variables are WAGE = monthly wage, EDUC = years of education beyond the eighth grade, EXPER = years of experience, AGE = age of worker, GENDER = indicator variable for gender (1 if male, 0 if female), RACE = indicator variable for race (1 if white, 0 if nonwhite), CLERICAL = indicator variable for clerical worker (1 if clerical worker, 0 otherwise), MAINT = indicator variable for maintenance worker (1 if maintenance worker, 0 otherwise), CRAFTS = indicator variable for crafts worker (1 if crafts worker, 0 otherwise).

CPS85

The data file CPS85 consists of 526 randomly selected employed workers from the May 1985 Current Population Survey conducted by the Department of Commerce. This is a survey of over 50,000 households conducted monthly, and it serves as the basis for the national employment and unemployment statistics. The variables are: ED = years of education, SOUTH = dummy variable (1 if worker lives in south, 0 otherwise), NONWH = dummy variable (1 if worker is nonwhite, 0 otherwise), HISP = dummy variable (1 if worker is Hispanic, 0 otherwise), FE = dummy variable (1 if worker is female, 0 otherwise), MARR = dummy variable (1 if worker is married with spouse present in household, 0 otherwise), MARRFE = dummy variable (1 if worker is married female with spouse present in household, 0 otherwise), EX = years of labor market experience, UNION = dummy variable (1 if worker has union job, 0 otherwise), WAGE = average hourly earnings in constant 2003 dollars, AGE = age in years, MANUF = dummy variable (1 if worker works in manufacturing industry, 0 otherwise), CONSTR = dummy variable (1 if worker works in construction industry, 0 otherwise), MANAG = dummy variable (1 if worker is managerial or administrative, 0 otherwise), SALES = dummy variable (1 if worker is in sales, 0 otherwise), CLER = dummy variable (1 if worker is clerical worker, 0 otherwise), SERV = dummy variable (1 if worker is a service worker, 0 otherwise), PROF = dummy variable (1 if worker is professional or technical, 0 otherwise).

STARTING STATA

To start Stata, click on the Stata icon on the computer desktop. When Stata starts you will see five separate windows. 1) Command window. 2) Results window. 3) Review window. 4) Variables window. 5) Properties window. The command window is where you type commands. After typing a command, press ENTER to execute it. The results window shows the output produced by the commands. The review window lists the commands that have been entered in the command window. If you click on a command, it is moved back into the command window, where you can edit and execute it. The variables window lists all variables that are currently in memory. If you click on a variable, its name is placed in the command window. The properties window provides information on the characteristics of the variables and data.

EXECUTING STATA COMMANDS AND PROGRAMS

Stata commands and programs can be executed in three ways. 1) Command window. 2) Do file. 3) Dialogue box. That is, commands can be typed and executed one at a time in the command window, collected in a do file and run together, or generated by using the pull-down menus and dialogue boxes. The basic Stata language syntax is

command [varlist] [=exp] [if exp] [in range] [weight] [, options]

Square brackets denote optional qualifiers. Italics denote information that you provide depending on what you want Stata to do. Command denotes a command name. Varlist denotes a list of individual variables. Exp denotes an expression: an algebraic expression after the equal sign, and a logical expression after if. Range denotes an observation range. Weight denotes a weighting expression. Options (preceded by a comma) denote a list of options. The shortest commands have a command name only. Stata is case sensitive, and all command names must be written in lowercase letters. Each command must be written on its own line; a carriage return designates the end of a command. To allow a command in a do file to be written on two or more consecutive lines, type #delimit ; on a line before the command. This tells Stata that each command ends with a semicolon rather than a carriage return. To turn off the delimiter, type #delimit cr. Many Stata commands can be abbreviated, but the shortest allowed abbreviation differs from command to command. For example, the summarize command can be shortened to sum, and the regress command can be shortened to reg.
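As an illustration of the delimiter, the lines below sketch a fragment of a do file. The variable names come from the WAGEDATA file described above, and the particular commands are chosen only for illustration. The #delimit command is intended for do files rather than for commands typed one at a time in the command window.

```stata
* Do-file fragment: one long command split across two lines
#delimit ;
summarize wage educ exper
    if gender == 1 ;
#delimit cr
* After #delimit cr, each command again ends at the end of its line
summarize wage
```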

CREATING A STATA DATA FILE

Most data are contained in either an external Excel file or an ASCII (plain text) file and must be imported into Stata and saved as a Stata data file.

Example #1

The Excel file wagedata has 49 observations on 9 variables. The names of the variables are wage, educ, exper, age, gender, race, clerical, maint, crafts. You want to create a Stata data file named wagedata. You then want to save this data file on drive E.

Launch Stata. Use the menu bar and select File → Import → Excel Spreadsheet. Click Browse… Go to drive E:. Click on the file named wagedata. Check the box labeled Import First Row As Variable Names. Click OK. To save the Stata file named wagedata.dta, select File → Save as and type wagedata.dta in the File Name box. Click Save. To view the contents of the data file, click on the Data Editor (Browse) icon directly under the menu bar (it looks like a spreadsheet with a magnifying glass in the upper right-hand corner).
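The same import can also be done with typed commands. The sketch below assumes the Excel file is saved as wagedata.xlsx on drive E. The .xlsx extension is an assumption, so adjust the file name to match your own file.

```stata
* Read the Excel file; firstrow treats the first row as variable names
import excel "E:\wagedata.xlsx", firstrow clear
* Save the data in memory as a Stata data file (replace overwrites an existing copy)
save "E:\wagedata.dta", replace
* List the variable names and storage types to confirm the import
describe
```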

OPENING A STATA DATA FILE

The following example explains how to open a Stata data file that you saved on drive E.

Example #2

Launch Stata and open the Stata data file named wagedata.dta.

Steps

Use the menu bar to select File → Open. Go to drive E: and click on wagedata. Click Open.
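As a sketch of the typed alternative, the use command opens a Stata data file directly. The path assumes the file was saved on drive E as described above.

```stata
* Load the Stata data file, clearing any data currently in memory
use "E:\wagedata.dta", clear
```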

CREATING NEW VARIABLES

The generate command is used to create new variables from existing variables. It uses the following arithmetic operators to create new variables. If parentheses are not used, they are carried out in the following order: ^ (exponentiation) first, then * (multiplication) and / (division), then + (addition) and - (subtraction). The function for the natural logarithm is ln( ), written in lowercase. The generate command also uses logical expressions to create new variables. Logical expressions are also used when you add the if qualifier to a command line. Stata uses the following nine relational and logical operators.

== equal to

!= not equal to

> greater than

< less than

>= greater than or equal to

<= less than or equal to

& and

| or

! not

Note that the relational operator for equal to is a double equal sign.
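The operators can be combined in an if qualifier. The lines below are a minimal sketch that uses variables from the WAGEDATA file; the particular conditions are chosen only for illustration.

```stata
* Summary statistics of wage for men with at least 4 years of educ
summarize wage if gender == 1 & educ >= 4
* Number of workers who are nonwhite or female
count if race == 0 | gender == 0
* Workers who are not clerical workers
list wage educ if !(clerical == 1)
```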

Example #3

You want to open the data file wagedata on drive E and create several new variables.

Open the data file. In the Command Window, type the following commands.

generate logwage = ln(wage)

generate yearwage = wage*12

generate daywage = wage / 30

generate agesq = age^2

generate agecub = age^3

generate toteduc = educ + 8

Stata will create the variables logwage, yearwage, daywage, agesq, agecub, and toteduc, and place them in the open data file wagedata along with all existing variables in this data file. The new variables exist only in memory until you save the data file.

Example #4

You want to use the data file wagedata to create new dummy variables for education. One dummy variable distinguishes two educational categories. The other four dummy variables together represent four educational categories.

generate college = (educ > 4)

This creates a dummy variable named college that can take a value of 1 or 0. The generate command assigns a value of 1 to college for every observation in which educ is greater than 4, and a value of 0 for all other observations. Note that Stata treats a missing value as larger than any number, so an observation with a missing value for educ would also be assigned a value of 1.

generate lhs = (educ < 4)

generate hs = (educ == 4)

generate scol = (educ > 4 & educ < 8)

generate col = (educ >= 8)

These four dummy variables quantify a qualitative variable for education with four categories. 1) Less than high school (lhs). 2) High school (hs). 3) Some college, but no degree (scol). 4) College degree (col). There is one dummy variable for each educational category.
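One way to check the four dummy variables is sketched below. It assumes lhs, hs, scol, and col have just been created as shown above.

```stata
* Each worker should fall in exactly one of the four categories;
* assert stops with an error if this is false for any observation
assert lhs + hs + scol + col == 1
* One-way frequency table for each dummy variable
tab1 lhs hs scol col
```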

DELETING VARIABLES FROM A STATA DATA FILE

Example #5

You want to delete the variables logwage and college that you created in examples 3 and 4.

drop logwage college

DISPLAYING OBSERVATIONS ON VARIABLES IN A STATA DATA FILE

Example #6

You want to display the observations on the variables wage, educ, and exper in the results window.

list wage educ exper
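The list command also accepts the in and if qualifiers from the basic syntax. The lines below are a sketch; the observation range and the condition are chosen only for illustration.

```stata
* Display only the first 10 observations
list wage educ exper in 1/10
* Display only the observations for female workers
list wage educ exper if gender == 0
```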

DESCRIBING AND ANALYZING DATA

Examples #7 through #17 below involve describing and analyzing data. The data are contained in the Excel file CPS85. Use this Excel file to create a Stata data file named CPS85.

FREQUENCY DISTRIBUTIONS AND SCATTER DIAGRAMS

Example #7

You want to display a relative frequency distribution, absolute frequency distribution, and scatter diagram for the variables wage and ed.

histogram wage

histogram ed

histogram wage, frequency

histogram ed, frequency

graph twoway scatter wage ed

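The histogram command without the option frequency tells Stata to scale the bars as a density, so that the bar areas sum to 1. This histogram has the same shape as a relative frequency distribution. To put relative frequencies on the vertical axis, add the option fraction. Adding the option frequency tells Stata to display the histogram of an absolute frequency distribution.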
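As a sketch of two variations on these graphs, the first command below labels the vertical axis with relative frequencies, and the second overlays a fitted regression line on the scatter diagram. Neither variation is required for the example; they are shown only for illustration.

```stata
* Histogram with relative frequencies (fractions) on the vertical axis
histogram wage, fraction
* Scatter diagram of wage against ed with a fitted line overlaid
twoway (scatter wage ed) (lfit wage ed)
```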

DESCRIPTIVE STATISTICS

Example #8

You want to calculate the mean, variance, standard deviation, coefficient of variation, maximum value, minimum value, and the number of observations for the variables wage, ed, ex, fe, age, union. You also want to calculate the covariances and correlation coefficients for these variables.

tabstat wage ed ex fe age union, stats(mean variance sd cv min max n)

correlate wage ed ex fe age union

correlate wage ed ex fe age union, covariance

If you want Stata to report the number of observations, mean, standard deviation, minimum, and maximum values for these variables, use the summarize command.

summarize wage ed ex fe age union

If you use the summarize command with no variable names, then Stata reports these measures for all variables in the data file.
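Two common variations are sketched below. The choice of wage and of fe as the grouping variable is only for illustration.

```stata
* Additional statistics for wage, including percentiles, skewness, and kurtosis
summarize wage, detail
* Mean, standard deviation, and number of observations of wage, separately by fe
tabstat wage, by(fe) stats(mean sd n)
```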

LINEAR REGRESSION

Example #9

You want to run a linear regression of wage on ed. You also want to print the variance-covariance matrix for the parameter estimates.

regress wage ed

estat vce

Example #10

You want to run a linear regression of wage on ed, ex, and fe. You also want to print the variance-covariance matrix for the parameter estimates.

regress wage ed ex fe

estat vce
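After a regression, the predict command can save the fitted values and residuals as new variables. The sketch below uses the hypothetical variable names wagehat and uhat, which do not appear elsewhere in this guide.

```stata
regress wage ed ex fe
* Fitted values from the most recent regression
predict wagehat, xb
* Residuals from the most recent regression
predict uhat, residuals
summarize wage wagehat uhat
```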

Example #11

You want to run a linear regression of wage on ed, ex, and fe. You want to test the following hypotheses. 1) Education and experience have no joint effect on wage; that is, the coefficient of ed and the coefficient of ex are jointly equal to zero. 2) The marginal effects of ed and ex are equal; that is, the coefficients of ed and ex are equal. 3) The sum of the marginal effects of ed and ex is equal to 2; that is, the sum of the coefficients of ed and ex is 2.

regress wage ed ex fe

test (ed=0)(ex=0)

test ed=ex

test ed+ex=2

Note that if you are testing more than one hypothesis (restriction), you must enclose each one in parentheses.
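Two related commands are sketched below for illustration. Listing variable names after test is a shorthand for the joint hypothesis that their coefficients are all zero, and lincom reports the estimate, standard error, and confidence interval for a linear combination of coefficients.

```stata
regress wage ed ex fe
* Shorthand for the joint test that the coefficients of ed and ex are both zero
test ed ex
* Estimated difference between the coefficients of ed and ex
lincom ed - ex
```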

Example #12

You want to run a linear regression of wage on ed, ex, and fe, and impose the restriction that the coefficients of ed and ex are equal. Thus, your objective is to estimate a restricted model that imposes a restriction on the model parameters.

constraint define 1 ed=ex

cnsreg wage ed ex fe, constraints(1)

The constraint define 1 command creates a single restriction. The cnsreg command tells Stata to run a regression with a restriction imposed on the coefficients. The constraints(1) option tells Stata to impose the single restriction created by the constraint define 1 command.

Example #13

You want to run a linear regression of wage on ed, ex and fe. You want to check for multicollinearity among the explanatory variables. To do this you want to run a regression of each explanatory variable on all remaining explanatory variables so you can calculate variance inflation factors. You also want to calculate the correlation coefficients for the explanatory variables.

regress wage ed ex fe

regress ed ex fe

regress ex ed fe

regress fe ed ex

correlate ed ex fe
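A minimal sketch of the variance inflation factor calculation follows. The variance inflation factor for an explanatory variable equals 1/(1 - R-squared), where R-squared comes from the regression of that variable on all remaining explanatory variables. The estat vif command, which is not otherwise used in this guide, reports these factors directly after the wage regression. The by-hand calculation for ed is shown for comparison.

```stata
regress wage ed ex fe
* Variance inflation factors for ed, ex, and fe
estat vif
* By hand for ed: e(r2) holds the R-squared of the most recent regression
quietly regress ed ex fe
display "VIF for ed = " 1/(1 - e(r2))
```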

Example #14

You want to estimate a varying slope parameter model where wage depends upon ed, ex, fe, and the interaction variable edfe, which is the product of ed and fe. This interaction variable allows the coefficient of ed to depend upon fe.

generate edfe = ed*fe

regress wage ed ex fe edfe

Note that to estimate this model, you must first create an interaction term for ed and fe.
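As an alternative sketch, recent versions of Stata can build the interaction term inside the regress command by using factor-variable notation, so that no new variable has to be created. This assumes a version of Stata that supports factor variables. The coefficients are the same as in the regression above, although they are labeled differently in the output.

```stata
* c.ed##i.fe includes ed, fe, and the interaction of ed and fe
regress wage c.ed##i.fe ex
```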

Example #15

You want to estimate a log-linear functional form, where the logarithm of wage depends upon ed, ex, fe.

generate logwage = ln(wage)

regress logwage ed ex fe

Note that to estimate this model, you must first create a new variable named logwage, which is the natural logarithm of the variable wage.
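The coefficients of a log-linear model can be converted to approximate percentage effects. The sketch below uses the hypothetical variable name lnwage so that it does not conflict with any variable already in memory.

```stata
generate lnwage = ln(wage)
regress lnwage ed ex fe
* Approximate percentage change in wage for one more year of education
display 100*_b[ed]
* Exact percentage difference in wage for female workers, other things equal
display 100*(exp(_b[fe]) - 1)
```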