The workflow for creating a model from SCCS data falls into roughly five sections:

List and modify variables

The first step is to find variables in the SCCS to use in the model. One should begin by searching the codebook and listing the names of variables that might be of use. Any transformations of variables should be done at this stage, and all newly created variables must be added to the SCCS. To create a description of these new variables for display in model outputs, use the addesc function.
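For instance, the valchild scale built in the script below is computed from four SCCS variables and then registered with addesc (assuming, as in the script, that the SCCS data are held in the data frame sccsA):

sccsA$valchild <- sccsA$v473 + sccsA$v474 + sccsA$v475 + sccsA$v476
addesc("valchild","Degree to which society values children")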

All variables used in a regression model must be ordinal. About 37 percent of SCCS variables are categorical, such that the numbers represent categories, and not an ordinal ranking. Thus, for example, v233 is “Major Crop Type”, and the numbers indicate which category of crops a society produces: roots, cereals, tree fruits, etc. A categorical variable like this can be used in a regression only by being converted into dummy variables. For example, a dummy variable could be created for the presence of tree crops: it would equal one for those societies that produce tree fruits, and would equal zero for all other societies.

The function mkdummy(sn,v) creates dummy variables. It takes two arguments: sn is the SCCS variable name, and v is the category number. One finds the variable names and category numbers in the codebook. To make a dummy indicating the presence of tree crops, the call would be

mkdummy("v233",4)

since tree crops are represented by number 4 in v233. The function will create a dummy variable called v233d4, and will also create a description (in this case “Major Crop Type == Tree fruits”) to be used in outputs.
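In effect, the call is equivalent to something like the following base-R assignment (a sketch only, assuming the SCCS data are held in the data frame sccsA used in the script below, and ignoring how mkdummy handles missing values):

sccsA$v233d4 <- as.integer(sccsA$v233 == 4) #-1 if Major Crop Type is tree fruits, 0 otherwise

The advantage of mkdummy is that it also registers the variable description automatically.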

Make imputed data

The data collected for model building will in almost all cases contain missing values. This step in the workflow imputes several new datasets, using covariates for each variable to create a conditional distribution of estimates for each missing value, and then replacing the missing value with a draw from that distribution; as a result, each of the imputed datasets may have slightly different values for the estimated cells. The key to successful imputation is to have good covariates for each variable. The function doMI begins the search for good covariates by grouping the variables into clusters of collinear variables. For each cluster, the best covariates are selected from a set of non-missing variables, including both network lag variables (based on geographic distance, language, and religion) and non-missing SCCS variables.

The function doMI(evm,nimp,maxit) takes three arguments. The first is a list of variable names, all of which must be found in the SCCS (which is why it was necessary to add the transformed variables to the SCCS); these will be the data used in model building. One should include all data one thinks might be useful, including all transformed data, but no additional data. The second argument is the number of imputed datasets to create: between 5 and 10 are considered adequate, and there is no harm in choosing more; the default is 10. The third argument is the number of iterations to perform in creating each imputed dataset; the default is 7.

smi<-doMI(evm,6,5)

The above command creates a new dataframe called smi, which contains six imputed datasets stacked on top of each other, indexed by a variable called “.imp”. It is not usually necessary to examine this new dataframe—it is used in estimating the model, but is not in itself that interesting. Nevertheless, some output is automatically written to the console as it executes, in order to provide some information about the clusters to which the variables have been assigned, and the covariates selected for each cluster. For each cluster, the names of the members are printed, followed by a double dash and then the names of non-missing covariates selected for that cluster. Prefixes “L”, “E”, “R”, and “D” indicate spatial lags for, respectively, linguistic, ecological, religious, and geographic proximity. Finally, there is a table, showing—for each imputed variable—summary statistics for the absolute value of within-cluster Pearson correlation coefficients. This table can be safely ignored by most users.
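A quick way to confirm the structure is to tabulate the .imp index and glance at one imputed dataset; this is plain base R, nothing specific to doMI:

table(smi$.imp) #-rows contributed by each of the six imputed datasets
head(smi[smi$.imp == 1, ]) #-first rows of the first imputed dataset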

Identify role of variables in model

The next step in the workflow is to identify the dependent and independent variables for the model. There are two lists of independent variables: one for the initial unrestricted model, containing all variables which theory suggests should explain variation in the dependent variable, and one for the final restricted model, containing only those independent variables that prove significant. There is also a vector of exogenous variables, to be used in some of the estimation steps.

The most time-consuming part of building a model is the selection of independent variables. Each variable should have a sound theoretical justification that fits well with the story one is trying to tell about the dependent variable. In most cases, researchers will spend weeks thinking about results, trying new independent variables or new specifications (such as quadratic terms for some independent variables), and revising their story to fit estimation results. Researchers jokingly speak of “running a thousand regressions”, and while this can be taken too far, it is nevertheless advisable to set up the workflow so that independent variables are easily changed and model results readily examined.

Estimate regression model

The function doOLS estimates OLS models and provides common diagnostics for the sets of independent variables. The command

h<-doOLS(smi,depvar,indpv,othexog,rindpv)

will create a list h containing ten objects: (1) identification of the dependent variable; (2) coefficient estimates from the unrestricted model; (3) coefficient estimates from the restricted model; (4) coefficient estimates from the restricted model with robust SEs; (5) regression diagnostics; (6) composite weight matrix weights; (7) R2 for all models; (8) descriptive statistics for the MI data; (9) influential observations from dfbetas; (10) potential new variables from add1. The list can be displayed in the console, or it can be written to a *.csv file using the function CSVwrite.
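Individual objects can also be pulled from the list by position and written out singly; a small sketch, assuming (as the end of the script suggests) that the third argument of CSVwrite indicates whether to append:

h[[3]] #-coefficient estimates from the restricted model
CSVwrite(h[[3]],"z",FALSE) #-write just this object to z.csv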

Some output is printed to the console: the unrestricted model p-values and VIFs are displayed in ascending order of p-value. This allows one to see, at a glance, the most and least significant independent variables, to better decide which should be retained in the restricted model. A similar display is shown for the restricted model. Next, there is a display of variables that, when added singly to the restricted model equation, proved significant in at least half the estimations; most of these will be squared terms or interaction terms of variables in the unrestricted model. Finally, there is a display of p-values for the regression diagnostics.

The suggested way of running this script is to run the doOLS command, look at the displays, decide which variables to drop from the restricted model, and run the command again. Any variable to be added must be present in the imputed dataset; squared terms and interaction terms are not, so they must be created first and the doMI command run again.
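As a sketch of that last point, using a hypothetical squared term v203sq (the variable name and description text are assumed, not part of the script): the new variable must be attached to the SCCS, described, appended to the variable list, and the data re-imputed before doOLS can use it.

sccsA$v203sq <- sccsA$v203^2 #-hypothetical squared term
addesc("v203sq","v203 squared") #-assumed description text
evm <- c(evm,"v203sq") #-extend the model-building list
smi <- doMI(evm,nimp=2,maxit=3) #-re-impute so the new variable is available to doOLS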

R file: S:\TEFF\662\R\r06a.R

#--Dow & Eff simple functions--

setwd("e:/Dropbox/Abhradeep") # change working directory to your own

rm(list=ls(all=TRUE))

options(echo=TRUE)

library(mice)

library(foreign)

library(stringr)

library(psych)

library(AER)

sessionInfo()

# ======

# --bring in functions and data--

# ======

load(url("

ls() #-can see the objects contained in DE2.Rdata

# ======

#--list and modify variables for use in model--

# ======

# --make new variables--

sccsA$valchild<-(sccsA$v473+sccsA$v474+sccsA$v475+sccsA$v476) #-degree to which society values children

sccsA$labor<-3-(sccsA$v1009<=3)+(sccsA$v1009>=6) #-collapse v1009 into three ordinal levels (2, 3, 4)

table(sccsA$v1009,sccsA$labor,useNA="ifany") #-check the recode against the original variable

# --create descriptions for new variables--

nvbs<-c("labor","valchild")

nvbsdes<-c("Development of labor markets","Degree to which society values children")

addesc(nvbs,nvbsdes)

# --create new dummy variables--

mkdummy("v233",4)

mkdummy("v233",5)

mkdummy("v244",2)

mkdummy("v244",7)

mkdummy("v245",2)

mkdummy("v899",1)

# --identify variables to keep for model building--

evm<-c("socname","v203","v204","v1685","v156","v72","v234","v236","v238","v1648",

"v155","v233d4","v233d5","v244d2","v244d7","v245d2","v899d1","labor",

"valchild","v1260","mht.name")

# ======

# --make imputed data--

# ======

smi<-doMI(evm,nimp=2,maxit=3);dim(smi) #-only 2 imputations here for speed; 5 to 10 are recommended

smi[1:2,] #-first two rows of the stacked imputed data

# ======

# --identify role of variables in model--

# ======

#-independent variables in UNrestricted model--

iv<-c("v1260","v203","v204","v1685","v156","v72","v234","v236","v238","v1648",

"v155","v233d4","v233d5","v244d2","v244d7","v245d2","v899d1","labor")

#--independent variables in restricted model--

riv<-c("v1260","v155","v233d4","v233d5","v244d2","v244d7","v245d2","v899d1","labor")

#--can add additional exogenous variables (use in Hausman tests)--

# oex<-c("v1260")

# ======

# --estimate regression model-----

# ======

h<-doOLS(smi,depvar="valchild",indpv=iv,othexog=NULL,rindpv=riv) #-riv defines the restricted model

print(h)

# ======

# --print output to csv file--

# ======

CSVwrite(names(h),"z",FALSE) #-start z.csv with the names of the ten objects

sapply(1:length(h),function(x) CSVwrite(h[[x]],"z",TRUE)) #-append each object to z.csv