*Use the CPS sample we modified in the previous workshop

use "C:\Stata Workshop\cps_sample_modified.dta", clear

*Recode the family income categories (codes 1 through 16) into rough dollar amounts for later use

replace faminc = 0 if faminc == 1

replace faminc = 5000 if faminc == 2

replace faminc = 7500 if faminc == 3

replace faminc = 10000 if faminc == 4

replace faminc = 12500 if faminc == 5

replace faminc = 15000 if faminc == 6

replace faminc = 20000 if faminc == 7

replace faminc = 25000 if faminc == 8

replace faminc = 30000 if faminc == 9

replace faminc = 35000 if faminc == 10

replace faminc = 40000 if faminc == 11

replace faminc = 50000 if faminc == 12

replace faminc = 60000 if faminc == 13

replace faminc = 75000 if faminc == 14

replace faminc = 100000 if faminc == 15

replace faminc = 150000 if faminc == 16
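
*A minimal sketch of a sanity check on the recoding, assuming the only codes originally in faminc were 1 through 16: tabulate the new values (including missing) and confirm that none of the old codes remain

tabulate faminc, missing

assert !inrange(faminc, 1, 16)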

*Turn sex into a 0/1 binary variable by recoding the value 2 to 0

replace sex = 0 if sex == 2

*Keep only individuals age 25 and older to get a more accurate picture of family income

keep if age >= 25

*The egen command stands for "Extensions to generate" and provides additional ways to generate variables. This command creates a new variable that takes on 2 values, either the mean income when collegegrad == 0 or the mean income when collegegrad == 1.

egen mean_faminc = mean(faminc), by(collegegrad)
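
*A minimal sketch for checking the result: tabulate with the summarize() option reports the mean of faminc within each value of collegegrad, which should match the two values stored in mean_faminc

tabulate collegegrad, summarize(faminc)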

*Then, we can create a new variable that is the ratio of the individual's family income to the mean level, depending on if they are a college graduate or not.

gen ratio_inc_faminc = faminc/mean_faminc

*Alternate method of obtaining the income/mean income ratio

*sum (short for summarize) displays summary statistics, including the mean. Like other r-class commands, sum leaves its results behind in r(); type "return list" right after running it to see them all.

sum faminc if collegegrad == 0

*The stored result r(mean) holds the mean from the most recent sum. Writing `r(mean)' substitutes its value into the command, just like a local macro, so we can generate the ratio separately for each value of collegegrad

gen alt_ratio_inc_faminc = faminc/`r(mean)' if collegegrad == 0

sum faminc if collegegrad == 1

replace alt_ratio_inc_faminc = faminc/`r(mean)' if collegegrad == 1
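
*A minimal sketch comparing the two methods, assuming both ratios were built from the same income variable. ratio_diff is a hypothetical variable name used only for this check; its values should all be zero up to rounding error

gen double ratio_diff = ratio_inc_faminc - alt_ratio_inc_faminc

summarize ratio_diff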

*The following example illustrates the forvalues loop. This loop steps through a range of numeric values: 20(5)45 means start at 20 and go up to 45 in increments of 5, the number in parentheses. Inside the loop, the iterator is referenced with the same syntax as a local macro. This example creates dummy variables for different age categories (age20to24 will be all zeros here because we kept only ages 25 and older).

forvalues i = 20(5)45 {

local top = `i' + 4

gen age`i'to`top' = age >= `i' & age <= `top'

}
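
*To see how the iterator and the local macro top expand on each pass, the same loop can be run with display in place of gen. This is only an illustrative sketch and creates no variables

forvalues i = 20(5)45 {
    local top = `i' + 4
    display "age`i'to`top' covers ages `i' through `top'"
}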

*The following command lists all observations that are duplicates on the given variable(s)

duplicates list age
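
*When the list is long, duplicates report gives a compact summary instead: how many observations appear once, twice, and so on for the given variable(s). A short sketch:

duplicates report age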

*We can then drop the duplicates on the given variable(s), keeping only the first observation in each group. We must specify the force option because this throws away observations that may differ on other variables, and Stata wants to make sure we really want to do this. This can be useful when merging many data sets with individual IDs, for example.

duplicates drop age, force

*Next, we are going to look at a data set of crime rates aggregated by the number of days to 21.

use "C:\ECON 104\MLDA Crime CA.dta", clear

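*We define a local macro holding a list of variables so that we can save lines of code and perform actions on all of these variables in a loop. Remember that a local macro only exists in the current context: the do-file (or the selection of lines) being run, or the program in which it is defined.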

local vars ill_drugs_r dui_r liquor_laws_r violent_r murder_r

*Keep just the variables we are going to use to make the program run faster and to make the data set more manageable to look at.

keep `vars' days_to_21

*The foreach loop iterates over the elements of a list, which can come from a local macro, a global macro, or a variable list, among others. In this case, we are iterating over the local macro which has the list of variables in it. In this loop, we are generating a scaled version of each of the variables in the local macro.

foreach var of local vars {

gen `var'_rate_scaled = `var'*100

}
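
*The same loop structure works for any per-variable task. As a sketch, this labels each new variable; the label text is just an illustration

foreach var of local vars {
    label variable `var'_rate_scaled "`var' multiplied by 100"
}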

*The next several lines build the same kind of scaled variables, but we use global macros instead. The difference between local and global macros is that global macros last outside of the current context, so that we can define the global macro once and then run later code separately and still use it, which is not possible with local macros.

*They are used less often, though, because of potential name conflicts (imagine you want to use the name vars again for a different list).

use "C:\ECON 104\MLDA Crime CA.dta", clear

global vars ill_drugs_r dui_r liquor_laws_r violent_r murder_r

keep $vars days_to_21

foreach var of varlist $vars {

gen `var'_actual_rate = `var'*100

}
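
*A short sketch for inspecting a global macro: display shows its contents after expansion, and macro list shows the macro by name. The global stays defined until it is dropped or Stata is closed

display "$vars"

macro list vars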

*This next section illustrates the use of the "if programming command" (as distinct from the "if qualifier"). The "if qualifier" is used only to restrict a command to a subset of the data, while the "if programming command" is evaluated once and tells Stata to run a block of commands only if a certain condition is met. In this case, we use the command inside of a loop so that some variables are scaled by 10 and some variables are scaled by 100.

use "C:\ECON 104\MLDA Crime CA.dta", clear

local vars ill_drugs_r dui_r liquor_laws_r violent_r murder_r

keep `vars' days_to_21

foreach var of local vars {

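*Compare the variable's name as a string (note the double quotes); without the quotes, Stata would compare the values of the variables in the first observation

if "`var'" == "ill_drugs_r" | "`var'" == "violent_r" {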

gen `var'_actual_rate = `var'*10

}

else {

gen `var'_actual_rate = `var'*100

}

}
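
*A minimal sketch contrasting the two uses of if. The first line uses the if qualifier to count a subset of observations; the block after it uses the if programming command, which is evaluated once on the stored result r(N)

count if !missing(dui_r)

if r(N) > 0 {
    display "dui_r has " r(N) " nonmissing observations"
}
else {
    display "dui_r is missing in every observation"
}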

use "C:\ECON 104\MLDA Crime CA.dta", clear

*Another useful egen function is rowmean, which takes the mean of a given set of variables within each observation

egen mean_rate = rowmean($vars)

*The following creates the row mean over every variable whose name ends in _r. The * is a wildcard that matches any sequence of characters, so *_r is all variables ending in _r.

egen alt_mean_rate = rowmean(*_r)
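
*rowmean ignores missing values, so the means for different observations can be based on different numbers of variables. As a sketch, rownonmiss counts how many of the variables are nonmissing in each observation (n_rates is a hypothetical variable name)

egen n_rates = rownonmiss(*_r)

tabulate n_rates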

*This next section produces fitted values and plots the points for a regression discontinuity design (which this data set was designed for). Our goal is to plot age on the x-axis and the dui rate on the y-axis, with quadratic fits on both sides of the 21 year old threshold.

use "C:\ECON 104\MLDA Crime CA.dta", clear

*First, our graph would look too cluttered if we plotted every age in days, so we average the data within 26-day bins. The collapse command replaces the data in memory with one observation per bin.

collapse (mean) age dui_r, by(age_26day)

*We want quadratic fits, so create the square of the binned age variable

gen age_26day_sq = age_26day^2

*We want a quadratic fit only for ages between 19 and 21 (the left side of the graph)

reg dui_r age_26day age_26day_sq if age >= 19 & age < 21

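*The predict command produces fitted values from the most recent regression (in this case a quadratic fit). The if qualifier limits the fitted values to the same age range used in the regression. When we plot these fitted values, they will give us the quadratic curve that most closely fits the data.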

predict fitted_left_quad if age >= 19 & age < 21

reg dui_r age_26day age_26day_sq if age >= 21 & age < 23

predict fitted_right_quad if age >= 21 & age < 23
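
*An alternative sketch of the same right-side regression using factor-variable notation (Stata 11 or later): c.age_26day##c.age_26day includes both age_26day and its square, so the separate squared variable is not needed. It is shown only for comparison; the fitted values above are unaffected

reg dui_r c.age_26day##c.age_26day if age >= 21 & age < 23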

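*Create the graph with all of the appropriate options to make it look good. #delimit ; tells Stata that commands now end at a semicolon instead of at the end of the line, so the long graph command can span many lines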

#delimit ;

graph twoway (scatter dui_r age_26day)

(line fitted_left_quad age_26day, lcolor(black))

(line fitted_right_quad age_26day, lcolor(black))

if age >= 19 & age <= 23,

xlabel(#6)

ylabel(#6, nogrid angle(horizontal))

title(DUI Arrests Age Profile)

xtitle(Years Until 21)

ytitle(DUI Arrests per 10000 People)

legend(off)

graphregion(style(none) color(gs16))

;

#delimit cr

*The graph command is finished, so commands end at the end of the line again

*Create your own program

capture program drop sign

program define sign

version 13.1

display as text "Brandon Heck"

display "{txt}{hline 62}"

end

sign
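
*Programs can also take arguments. A minimal sketch, using a hypothetical program name sign_name: args stores the first argument in the local macro name, which is then displayed above the same horizontal line

capture program drop sign_name

program define sign_name
    version 13.1
    args name
    display as text "`name'"
    display "{txt}{hline 62}"
end

sign_name "Brandon Heck"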