Stata Guide and Assignments
Professor Thornton
Economics 515
Econometrics
INTRODUCTION
This guide provides an explanation of Stata commands required to do the in-class assignments and homework for Econ 515. It does not explain all Stata capabilities or commands. Stata is a very powerful statistical package, and if you desire to learn its full capabilities you should consult the Stata User’s Manuals.
DATA SETS
In this guide, Stata is explained in the context of examples. The examples use the following data sets. Each data set is contained in an Excel file. It is assumed that the Excel file is located on a stick disk in drive E:.
WAGE
The data file WAGE contains a cross-section of 935 males. The variables are as follows. WAGE = monthly earnings in dollars (year 2007 dollars). HOURS = average hours worked per week. IQ = IQ score. KWW = knowledge of world work score. EDU = years of education. EXPER = years of work experience. TENURE = years with current employer. AGE = age in years. MARRIED = dummy variable for marital status (1 if married, 0 otherwise). BLACK = dummy variable for race (1 if black, 0 otherwise). SOUTH = dummy variable for region of country where worker lives (1 if individual lives in south, 0 otherwise). URBAN = dummy variable for urban area (1 if individual lives in Standard Metropolitan Statistical Area, 0 otherwise). SIBS = number of siblings. BRTHORD = birth order. MEDUC = mother’s years of education. FEDUC = father’s years of education. Note that missing observations are denoted by the number -999.
SMOKE
The data file SMOKE contains a cross-section of 807 consumers. The variables are as follows. EDUC = years of schooling. CIGPRIC = price of cigarettes in cents per pack. WHITE = dummy variable for race (1 if white, 0 otherwise). AGE = age in years. INCOME = annual income in dollars. CIGS = number of cigarettes smoked per day. RESTAURN = dummy variable for state smoking restrictions (1 if state has restaurant smoking restrictions, 0 otherwise).
AUTO
The data file AUTO contains annual data for General Motors and Chrysler Corporation for the period 1935 to 1954. The variables are as follows. Year = year. IGM = investment spending for General Motors. PGM = expected profit for General Motors measured by market capitalization. CGM = desired capital stock for General Motors measured by actual capital stock. IC = investment spending for Chrysler. PC = expected profit for Chrysler measured by market capitalization. CC = desired capital stock for Chrysler measured by actual capital stock. All variables except year are in millions of dollars.
WINE
The data file WINE consists of annual time-series data for the wine industry in Australia for the period 1956 to 1975. The variables are as follows. Q = real per capita amount of wine bought and sold. S = an index of wine storage costs for producers. PW = real price of wine measured by the price of wine relative to the consumer price index. PB = real price of beer measured by the price of beer relative to the consumer price index. A = real per capita advertising expenditure on wine. Y = real per capita disposable income.
HEALTHPANEL
The data file HEALTHPANEL consists of panel data for 50 states for the years 1991 to 2000. The variables are as follows. STATE is an id number for each state. YEAR is year. SPEND is healthcare spending per capita in dollars. INC is income per capita in dollars. AGE65 is the percent of the population 65 years of age or older. INS is the percent of the population with health insurance coverage.
INMATE
The data file INMATE contains data on 1445 inmates in a North Carolina prison who served time, were released, and were followed for a period of time to determine whether they would be arrested again and, if so, how much time elapsed until that arrest. The variables are as follows. BLACK = 1 if black, 0 otherwise. ALCOHOL = 1 if a history of alcohol problems, 0 otherwise. DRUGS = 1 if a history of drug problems, 0 otherwise. SUPER = 1 if release from prison was supervised, 0 otherwise. MARRIED = 1 if married when sent to prison, 0 otherwise. FELON = 1 if a felony sentence, 0 otherwise. WORKPRG = 1 if a member of a prison work program, 0 otherwise. PROPERTY = 1 if a property crime, 0 otherwise. PERSON = 1 if a crime against a person, 0 otherwise. PRIORS = the number of prior convictions. EDUC = years of schooling. RULES = the number of rules violations in prison. AGE = age in months. TSERVED = time served in prison in months. FOLLOW = length of time followed after release from prison in months. DURAT = amount of time until arrested after release from prison, or until the inmate was no longer followed, in months. CENS = 1 if DURAT is right censored, 0 otherwise.
STARTING STATA
To start Stata click on the Stata icon on the computer desktop. When Stata starts you will see five separate windows. 1) Command window. 2) Results window. 3) Review window. 4) Variables window. 5) Properties window. The command window is where you type commands. After typing a command, to execute it press ENTER. The results window shows the output produced by the commands. The review window lists the commands that have been entered in the command window. If you click on a command you will move it back into the command window where you can edit and execute it. The variables window lists all variables that are currently in memory. If you click on a variable its name is placed in the command window. The properties window provides information on the characteristics of the variables and data.
EXECUTING STATA COMMANDS AND PROGRAMS
Stata commands and programs can be executed in three ways. 1) Command window. 2) Do file. 3) Dialogue box. That is, commands can be typed and executed in the command window, written in a do file and run from there, or issued through the pull-down menus and dialogue boxes. The basic Stata language syntax is
command [varlist] [=exp] [if exp] [in range] [weight] [, options]
Square brackets denote optional qualifiers. Italics denote information that you provide depending on what you want Stata to do. Command denotes a command name. Varlist denotes a list of individual variables. Exp denotes an algebraic or logical expression. Range denotes an observation range. Weight denotes a weighting expression. Options (preceded by a comma) denote a list of options. The shortest commands have a command name only. Stata is case sensitive, and command names must be written in lowercase letters. Each command must be written on its own line; a carriage return designates the end of a command. To allow a command to be written on two or more consecutive lines in a do file, type #delimit ; at the top of the file. This tells Stata that you will end each command with a semicolon. To turn off the delimiter, type #delimit cr. Many Stata commands can be abbreviated, but the shortest allowed abbreviation differs from command to command. For example, the summarize command can be shortened to sum.
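As a minimal sketch of the two delimiter settings, the do file fragment below splits one summarize command across two lines and then returns to one command per line. It assumes that wage.dta has already been created and saved on drive E:, as described later in this guide. Note that #delimit is meant for do files; it is not available when typing in the command window.

```stata
* Load the data (assumes wage.dta was saved on drive E:)
use "e:wage.dta", clear

* Switch to semicolon-terminated commands
#delimit ;
summarize wage edu exper
    married iq tenure ;
#delimit cr

* Back to one command per line, ended by a carriage return
summarize wage
```

Comments are kept outside the semicolon section on purpose: while #delimit ; is in effect, a comment line that begins with * must also end with a semicolon.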
STATA ASSIGNMENTS
The first Stata assignment involves learning how to create and open a Stata data file, create variables, construct histograms and scatter diagrams, and calculate descriptive statistics. The data file wage will be used for this assignment.
CREATING A STATA DATA FILE
Most data are contained in an external Excel file or ASCII (text) file and must be imported into Stata and saved as a Stata data file.
Example
Create and save a Stata data file from the Excel data file named wage located on your stick disk in drive e:. This file contains 935 observations on 16 variables and the names of the variables. The first row of the Excel file contains the variable names. In the column under each variable name is the data for that variable.
Steps
Launch Stata. Use the menu bar and select File → Import → Excel Spreadsheet. Click Browse… Go to drive E:. Click on the file named wage. Click Import First Row As Variable Names. Click OK. To save the Stata file named wage.dta, select File → Save as and type wage.dta in the File Name box. Click Save.
Comments
- Versions that preceded Stata 12 do not allow you to import an Excel spreadsheet directly. In these versions, launch Excel and save the Excel file wage.xls as a tab delimited text file wage.txt. Select File → Import → ASCII data created by a spreadsheet and fill in the dialogue box. Alternatively, type the following command in the command window: insheet using e:wage.txt.
- If the data file is a tab delimited text file that contains data but no variable names, then you would use the following command: infile wage hours iq kww edu exper tenure age married black south urban sibs brthord meduc feduc using e:wage.txt. Note that the infile command is followed by the variable names. Alternatively, you can use the menu bar and select File → Import → Unformatted ASCII data, and fill in the dialogue box.
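In Stata 12 and later, the menu steps above can also be carried out with the import excel command. The following is a sketch, assuming the spreadsheet is saved as wage.xls on drive E:; adjust the file name and extension to match your file. The case(lower) option and the mvdecode line are additions that are not part of the original steps: the first reads the variable names in lowercase so that they match the commands in this guide, and the second converts the missing-value code -999 to Stata's missing value so that it is not treated as data.

```stata
* Import the Excel file; firstrow reads variable names from the first row,
* case(lower) stores those names in lowercase
import excel using "e:wage.xls", firstrow case(lower) clear

* Optional: recode the missing-value code -999 to Stata's missing value (.)
mvdecode _all, mv(-999)

* Save as a Stata data file, overwriting any existing copy
save "e:wage.dta", replace
```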
OPENING A STATA DATA FILE
Example
Launch Stata and open the Stata data file named wage.dta.
Steps
Use the menu bar to select File → Open. Go to drive E: and click on wage. Click Open.
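The same file can be opened from the command window. This short sketch assumes wage.dta is on drive E:; the clear option discards any data currently in memory, and describe lists the variables that were loaded.

```stata
* Open the Stata data file, replacing any data in memory
use "e:wage.dta", clear

* List the variables and their storage types
describe
```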
CREATING NEW VARIABLES
Existing variables in a data file can be used to create new variables by using the generate command. Some often used operators and functions are given below.
+  addition
-  subtraction
/  division
*  multiplication
^  power
ln  natural logarithm
exp  exponential
sqrt  square root
Example
Create two new variables. 1) Logarithm of wage named logwage. 2) Age squared named agesq.
Commands
generate logwage=ln(wage)
generate agesq=age*age
CREATING DUMMY VARIABLES
To create dummy variables, use the generate command and logical expressions. Logical expressions are also used when you add the in or if qualifier to a command line. Stata uses the following nine relational and logical operators.
==  equal to
!=  not equal to
>  greater than
<  less than
>=  greater than or equal to
<=  less than or equal to
&  and
|  or
!  not
Note that the relational operator for equal to is a double equal sign.
Example
Create a dummy variable named college that takes a value of 1 if a worker has education beyond a high school degree and 0 otherwise.
Command
generate college = (edu>12)
Comment
When Stata is given a logical expression, such as edu>12, it evaluates the expression and assigns a value of 1 if true and 0 if false.
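One caution, sketched below: Stata treats a numeric missing value (.) as larger than any number, so the expression edu>12 evaluates to 1 when edu is missing. If edu could contain missing values, adding an if qualifier leaves those observations missing in the new variable. The variable name college2 is hypothetical, chosen only to avoid a clash with college.

```stata
* Load the data (assumes wage.dta was saved on drive E:)
use "e:wage.dta", clear

* 1 if edu>12, 0 if edu<=12, and missing if edu is missing
generate college2 = (edu>12) if !missing(edu)

* Frequency table that also shows any missing values
tabulate college2, missing
```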
Example
Use the quantitative variable edu to create a qualitative variable with four educational categories. 1) Less than high school (lhs). 2) High school (hs). 3) Some college, but no degree (scol). 4) College degree (col). To do this you must create 4 dummy variables, one for each educational category. The commands are
generate lhs = (edu<12)
generate hs = (edu==12)
generate scol = (edu>12 & edu<16)
generate col = (edu>=16)
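A quick check, offered as a sketch rather than a required step, is to confirm that the four dummy variables are mutually exclusive and exhaustive. The assert command stops with an error if the condition fails for any observation, and the means reported by summarize are the sample shares of each category.

```stata
* Load the data and recreate the four dummy variables
use "e:wage.dta", clear
generate lhs = (edu<12)
generate hs = (edu==12)
generate scol = (edu>12 & edu<16)
generate col = (edu>=16)

* Each worker should fall into exactly one category
assert lhs + hs + scol + col == 1

* The mean of each dummy is the share of the sample in that category
summarize lhs hs scol col
```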
HISTOGRAMS AND SCATTER DIAGRAMS
The commands used to construct histograms and scatter diagrams are histogram and graph.
Example
Use the data file wage.dta and construct the histogram of the absolute frequency distribution for the variable wage.
Command
histogram wage, frequency
Example
Construct a scatter diagram for the variables wage and edu.
Command
graph twoway scatter wage edu
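Both graph commands accept options and overlays. The sketch below sets the number of histogram bins (20 is an arbitrary illustrative choice) and overlays a fitted least-squares line on the scatter diagram using the lfit plot type.

```stata
* Load the data (assumes wage.dta was saved on drive E:)
use "e:wage.dta", clear

* Histogram of absolute frequencies with 20 bins
histogram wage, frequency bin(20)

* Scatter diagram of wage against edu with a fitted regression line
twoway (scatter wage edu) (lfit wage edu)
```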
DESCRIPTIVE STATISTICS
The commands used to calculate descriptive statistics are summarize and tabstat.
Example
Calculate the standard set of descriptive statistics (mean, standard deviation, minimum and maximum values, and number of observations) for the variables wage edu exper married iq tenure age black south urban.
Command
summarize wage edu exper married iq tenure age black south urban
Comment
If you include the option detail after a comma (for example, summarize wage, detail), Stata will calculate additional descriptive statistics such as the median, variance, and percentiles.
Example
Calculate specific descriptive statistics. These include the mean, variance, standard deviation, coefficient of variation, maximum and minimum values, and number of observations for the variables wage edu exper married iq tenure.
Command
tabstat wage edu exper married iq tenure, stats(mean variance sd cv min max n)
Example
Decompose the sample into the following three subsamples and calculate descriptive statistics for each subsample. 1) Less than a high school education. 2) High school education. 3) Post high school education.
Commands
summarize wage edu exper married iq tenure if edu<12
summarize wage edu exper married iq tenure if edu==12
summarize wage edu exper married iq tenure if edu>12
Comments
- The if qualifier specifies which observations to use based on the values of the variable edu.
- The if edu==12 qualifier uses a double equals sign since a qualifier is a logical expression.
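The three subsample summaries can also be produced in one table. This sketch builds a grouping variable, named edugrp here purely for illustration, and passes it to the by() option of tabstat.

```stata
* Load the data (assumes wage.dta was saved on drive E:)
use "e:wage.dta", clear

* 1 = less than high school, 2 = high school, 3 = post high school
generate edugrp = 1 if edu<12
replace edugrp = 2 if edu==12
replace edugrp = 3 if edu>12 & !missing(edu)

* Mean, standard deviation, and count for each education group
tabstat wage exper iq tenure, by(edugrp) statistics(mean sd n)
```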
Example
Calculate the sample correlation coefficients for the variables wage edu exper married iq tenure.
Command
correlate wage edu exper married iq tenure
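The correlate command uses only observations with no missing values in any of the listed variables. As a sketch of an alternative, pwcorr computes each correlation from all observations available for that pair; the obs and sig options add the number of observations and the significance level for each pair.

```stata
* Load the data (assumes wage.dta was saved on drive E:)
use "e:wage.dta", clear

* Pairwise correlations with observation counts and p-values
pwcorr wage edu exper married iq tenure, obs sig
```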
CLASSICAL LINEAR REGRESSION MODEL
This Stata assignment involves learning how to estimate a classical linear regression model using the ordinary least squares (OLS) estimator. The data file wage will be used for this assignment. The command used to estimate a classical linear regression model using OLS is regress. Other useful commands that can be used along with regress are predict, correlate, and mfx.
Example
Use the OLS estimator to run a linear regression of wage on edu, exper, tenure, iq and married.
Command
regress wage edu exper married iq tenure
Example
For the previous regression, save the predicted (fitted) values of wage and name this variable fitwage, save the residuals and name this variable residuals, and display the variance covariance matrix of estimates.
Commands
predict fitwage, xb
predict residuals, resid
estat vce
Comments
- The predict and estat vce commands use information from the most recently estimated model.
- The predict options xb and resid tell Stata to save the predicted values and residuals, respectively.
- If you click on the data browser icon at the top of Stata, this will open a spreadsheet that contains your data. You will see two new variables named fitwage and residuals.
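The saved variables can also be inspected from the command window. The sketch below repeats the regression and predict commands so that it runs on its own, then checks two properties of OLS residuals in a model with an intercept: their mean is zero (up to rounding) and they are uncorrelated with the fitted values.

```stata
* Load the data and estimate the model
use "e:wage.dta", clear
regress wage edu exper married iq tenure

* Save fitted values and residuals
predict fitwage, xb
predict residuals, resid

* The mean of the residuals should be zero up to rounding error
summarize residuals

* The residuals should be uncorrelated with the fitted values
correlate residuals fitwage

* Show the first five observations
list wage fitwage residuals in 1/5
```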
Example
For the previous regression, estimate the elasticities of all variables, calculate standard errors for the elasticity estimates, and t-statistics for the zero null hypothesis.
Command
mfx, eyex
Comments
- The mfx command with option eyex uses information from the most recently estimated model.
- The t-statistics are asymptotic t-statistics and labeled Z in the table reported by Stata.
- If you use the mfx command without the option eyex, Stata will report estimates of the marginal effects for the most recently estimated model.
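In Stata 11 and later, the margins command supersedes mfx. The following sketch shows what should be the corresponding margins calls; the atmeans option is included because mfx evaluates its results at the sample means of the explanatory variables.

```stata
* Load the data and estimate the model
use "e:wage.dta", clear
regress wage edu exper married iq tenure

* Elasticities evaluated at the sample means of the regressors
margins, eyex(edu exper married iq tenure) atmeans

* Marginal effects (for a linear model these equal the coefficients)
margins, dydx(edu exper married iq tenure) atmeans
```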
Example
Use the OLS estimator to run a linear regression of the logarithm of wage on edu, exper, tenure, iq and married. Calculate estimates of the elasticities of all variables.
Commands
generate logwage=ln(wage)
regress logwage edu exper married iq tenure
mfx, dyex
Comment
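When estimating a log-linear functional form, use the option dyex with the mfx command to calculate estimates of elasticities. Because the dependent variable is already in logarithms, dyex, which reports the derivative of the dependent variable with respect to the logarithm of each explanatory variable, gives the elasticity of wage.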
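To see where the dyex figure comes from, the elasticity for a single variable can be computed by hand. In the log-linear model the elasticity of wage with respect to edu is the coefficient of edu times the value of edu, so at the sample mean it is the coefficient times the mean. This is a sketch for edu only, and the result should agree with the value mfx reports for that variable.

```stata
* Load the data and estimate the log-linear model
use "e:wage.dta", clear
generate logwage = ln(wage)
regress logwage edu exper married iq tenure

* Mean of edu over the estimation sample (stored in r(mean))
summarize edu if e(sample), meanonly

* Elasticity of wage with respect to edu at the mean of edu
display _b[edu]*r(mean)
```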
HYPOTHESIS TESTING
This Stata assignment involves learning how to test hypotheses using the t-test, F-test, asymptotic t-test, Wald test, Likelihood ratio test, and Lagrange multiplier test. The data file wage will be used for this assignment. The commands used to calculate test statistics are test, testnl, lrtest, lincom, and nlcom. Other useful commands for testing hypotheses and imposing restrictions are estimates store, constraint, and cnsreg.
Example
Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypotheses using a t-test. 1) Job tenure has no effect on the wage. 2) The effect of one additional year of education on the wage is equal to the effect of one additional year of work experience on the wage.
Commands
regress wage edu exper tenure iq married
lincom edu - exper
Comments
- The regress command reports the t-statistic for the zero null hypothesis for each variable.
- The lincom command estimates the value of a new coefficient that is equal to the difference between the coefficients of edu and exper, calculates an estimate of the standard error of this new coefficient estimate, and reports the t-statistic for the zero null hypothesis (the difference between the coefficients of edu and exper is zero).
- The lincom command uses information from the most recently estimated model.
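The same two hypotheses can be checked with the test command, shown here as a sketch. After regress, test reports an F-statistic; with a single restriction this F-statistic equals the square of the corresponding t-statistic, so the p-values agree with those from the regress output and from lincom.

```stata
* Load the data and estimate the model
use "e:wage.dta", clear
regress wage edu exper tenure iq married

* Hypothesis 1: the coefficient of tenure is zero
test tenure

* Hypothesis 2: the coefficients of edu and exper are equal
test edu = exper
```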
Example
Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypothesis using an asymptotic t-test. 1) The effect of one additional year of education on the wage is equal to the square of the effect of one additional year of job tenure on the wage.
Commands
regress wage edu exper married tenure iq
nlcom _b[edu] - _b[tenure]^2
Comments
- The nlcom command estimates the value of a new coefficient that is equal to the difference between the coefficient of edu and the square of the coefficient of tenure, calculates an estimate of the asymptotic standard error of this new coefficient estimate, and reports the asymptotic t-statistic for the zero null hypothesis (the difference between the coefficient of edu and the square of the coefficient of tenure is zero).
- The notations _b[edu] and _b[tenure] designate the coefficients of the variables edu and tenure.
- The nlcom command uses information from the most recently estimated model.
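The same nonlinear restriction can also be tested with the testnl command, which reports a Wald-type test statistic rather than an estimate and standard error. A minimal sketch:

```stata
* Load the data and estimate the model
use "e:wage.dta", clear
regress wage edu exper married tenure iq

* Wald-type test that the coefficient of edu equals the squared coefficient of tenure
testnl _b[edu] = _b[tenure]^2
```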
Example