------

help for ice, uvis Patrick Royston

------

Multiple imputation by the MICE system of chained equations

ice mainvarlist using filename[.dta] [if exp] [in range] [weight] , [ boot[(varlist)]

cc(varlist) cmd(cmdlist) cycles(#) dropmissing dryrun eq(eqlist)

genmiss(string) id(string) interval(intlist) m(#) match[(varlist)]

noconstant noshoweq on(varlist) orderasis passive(passivelist)

substitute(sublist) replace seed(#) trace(filename) ]

uvis regression_cmd {yvar|llvar ulvar} xvarlist [if exp] [in range] [weight] ,

gen(newvarname) [ boot match replace seed(#) ]

where

regression_cmd may be intreg, logistic, logit, mlogit, ologit, or regress. llvar

and ulvar are required with intreg.

All weight types supported by regression_cmd are allowed; see weights.

Description

ice imputes missing values in mainvarlist by using switching regression, an iterative

multivariable regression technique. The abbreviation MICE means multiple imputation by

chained equations, and was apparently coined by Stef van Buuren. ice implements MICE for

Stata. Sets of imputed and non-imputed variables are stored to a new file called

filename. Any number of complete imputations may be created. The original data are stored

in filename as "imputation number 0" and the new variable _mj is set to 0 for these

observations.

uvis (univariate imputation sampling) imputes missing values in the single variable yvar

based on multiple regression on xvarlist. uvis is called repeatedly by ice in a

regression switching mode to perform multivariate imputation.

The missing observations are assumed to be "missing at random" (MAR) or "missing

completely at random" (MCAR), according to the jargon. See for example van Buuren et al

(1999) for an explanation of these concepts.

Please note that ice and uvis require Stata 8.0 or higher. There have been

incompatibility issues with Stata 7 or lower.

Options for ice

boot[(varlist)] instructs that each member of varlist, a subset of mainvarlist, be

imputed with the boot option of uvis activated. If (varlist) is omitted then all

members of mainvarlist with missing observations are imputed using the boot option of

uvis.

cc(varlist) prevents imputation of missing data in mainvarlist for cases in which any

member of varlist has a missing value. "cc" signifies "complete case". Note that

members of varlist are used for imputation if they appear in mainvarlist, but not

otherwise. Use of this option is equivalent to entering if ~missing(var1) &

~missing(var2) ..., where var1, var2, ... denote the members of varlist.

cmd(cmdlist) defines the regression commands to be used for each variable in mainvarlist,

when it becomes the dependent variable in the switching regression procedure used by

uvis (see Remarks). The first item in cmdlist may be a command such as regress or

may have the syntax varlist:cmd, specifying that command cmd applies to all the

variables in varlist. Subsequent items in cmdlist must follow the latter syntax, and

successive items should be separated by commas.

The default cmd for a variable is logit when there are two distinct values, mlogit

when there are 3-5 and regress otherwise.

Example: cmd(regress) specifies that all variables are to be imputed by regress,

over-riding the defaults

Example: cmd(x1 x2:logit, x3:regress) specifies that x1 and x2 are to be imputed by

logit, x3 by regress and all others by their default choices
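
The default rule above can be sketched in a few lines. This is an illustration in Python, not part of ice; the function name default_cmd and the use of None for a missing value are assumptions made for the example.

```python
def default_cmd(values):
    """Return the default command name for a variable, from its distinct values.

    Missing values are represented by None (an assumption of this sketch).
    """
    distinct = {v for v in values if v is not None}
    n = len(distinct)
    if n == 2:
        return "logit"      # two distinct values
    if 3 <= n <= 5:
        return "mlogit"     # 3-5 distinct values
    return "regress"        # everything else


print(default_cmd([0, 1, None, 1]))       # binary variable
print(default_cmd([1, 2, 3, None, 2]))    # three categories
print(default_cmd(list(range(10))))       # many distinct values
```

An entry in cmd() for a variable overrides whatever this rule would choose for it.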

cycles(#) determines the number of cycles of regression switching to be carried out.

Default # is 10.

dropmissing is a feature designed to save memory when using the file of imputed data

created by ice. It omits from filename all observations which are not in the

estimation sample, that is, observations which either (i) are filtered out by if, in or

a non-positive weight, or (ii) have the values of all variables in mainvarlist

missing. This option provides a "clean" analysis file of imputations, with no

missing values. Note that the observations not in the estimation sample are omitted

also from the original data, stored as imputation #0 in filename.

dryrun does a "dry run", that is, ice reports the prediction equations it has constructed

from the various inputs. No imputation is done and no files are created. It is not

mandatory to specify an output file with using for a dry run. Sometimes the

prediction equation set-up needs to be carefully checked before running what may be a

lengthy imputation process.

eq(eqlist) allows one to define customised prediction equations for any subset of

variables in mainvarlist. The option, particularly when used with passive(), allows

great flexibility in the possible imputation schemes. The syntax of eqlist is

varname1:varlist1 [,varname2:varlist2 ...] where each varname# (or varlist#) is a

member (or subset) of mainvarlist. It is your responsibility to ensure that each

equation is sensible. ice places no restrictions except to check that all variables

mentioned are indeed in mainvarlist, and that an equation is not defined for a

variable specified to be passively imputed (see the passive() option). Note that eq()

takes precedence over all default definitions and assumptions about the way a given

variable in mainvarlist will be imputed. The default, if the passive() and

substitute() options are not invoked, is that each variable in mainvarlist with any

missing data is imputed from all the other variables in mainvarlist.

genmiss(string) creates an indicator variable for the missingness of data in any variable

in mainvarlist for which at least one value has been imputed. The indicator variable

is set to missing for observations excluded by if, in, etc. The indicator variable

for xvar is named stringxvar. This option is left for backwards compatibility, but

now that the original data are stored in the output file, it is no longer really

needed. The information on missingness is implicit in the original data stored as

"imputation 0".

id(string) creates a variable called string containing the original sort order of the

data. Default string: _mi.

interval(intlist) imputes interval-censored variables. An interval-censored value is one

which is known to lie in an interval [a, b], where a may be finite or minus infinity,

b may be finite or plus infinity, and a <= b. When either a or b is infinite we have

left or right censoring, respectively. intlist has the syntax varname:llvar ulvar [,

varname:llvar ulvar ...], where each varname is an interval-censored variable,

each llvar contains the lower bound (a) for varname and each ulvar contains the upper

bound (b) for varname (or a missing value to represent plus or minus infinity). The

supplied values of varname are irrelevant since they will be replaced anyway; it is

only required that varname exist. Observations with llvar missing and ulvar present

are left-censored for varname. Observations with llvar present and ulvar missing are

right-censored for varname. Observations with llvar = ulvar are complete, and no

imputation is done for them. Observations with both llvar and ulvar missing are

imputed assuming an uncensored normal distribution. See Remarks for further

information.
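
The way llvar and ulvar determine how an observation is treated can be summarised as a small classifier. The sketch below is in Python for illustration only; the function name censoring_type, the labels it returns and the use of None for a missing bound are assumptions, and the check that llvar does not exceed ulvar is added here rather than taken from ice.

```python
def censoring_type(ll, ul):
    """Classify one observation from its lower (ll) and upper (ul) bounds.

    None stands for a missing bound, i.e. minus or plus infinity.
    """
    if ll is None and ul is None:
        # imputed assuming an uncensored normal distribution
        return "missing"
    if ll is None:
        return "left-censored"
    if ul is None:
        return "right-censored"
    if ll > ul:
        raise ValueError("lower bound exceeds upper bound")
    if ll == ul:
        return "complete"          # no imputation is done
    return "interval-censored"


for bounds in [(None, 5.0), (2.0, None), (3.0, 3.0), (1.0, 4.0), (None, None)]:
    print(bounds, censoring_type(*bounds))
```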

m(#) defines # as the number of imputations required (minimum 1, no upper limit). The

default # is 1.

match[(varlist)] instructs that each member of varlist be imputed with the match option

of uvis. This provides prediction matching for each member of varlist. If (varlist)

is omitted then all relevant variables are imputed with the match option of uvis. The

default, if match() is not specified, is to draw from the posterior predictive

distribution of each variable requiring imputation.

noshoweq suppresses the presentation of the prediction equations.

noconstant suppresses the regression constant in all regressions.

on(varlist) changes the operation of ice in a major way. With this option, uvis imputes

each member of mainvarlist univariately on varlist. This provides a convenient way of

producing multiple imputations when imputation for each variable in mainvarlist is to

be done univariately on a set of complete predictors.

orderasis enters the variables in mainvarlist into the MICE algorithm in the order given.

The default is to order them according to the number of missing values: the variable

with least missingness gets imputed first, and so on.
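
As an illustration of the default ordering, the following Python sketch sorts variables by their number of missing values. The function name imputation_order and the data layout (a dict of lists with None for missing) are assumptions; how ice breaks ties is not stated here, so the sketch simply keeps tied variables in the order given.

```python
def imputation_order(data, orderasis=False):
    """Return variable names in the order they would enter the algorithm.

    data maps each variable name to a list of values; None means missing.
    """
    names = list(data)
    if orderasis:
        return names
    # sorted() is stable, so variables with equal missingness keep their order
    return sorted(names, key=lambda v: sum(x is None for x in data[v]))


data = {
    "x1": [None, None, 1.0],
    "x2": [1.0, 2.0, 3.0],
    "x3": [None, 1.0, 2.0],
}
print(imputation_order(data))
print(imputation_order(data, orderasis=True))
```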

passive(passivelist) allows the use of "passive" imputation of variables that depend on

other variables, some of which are imputed. The syntax of passivelist is varname:exp

[\varname:exp ...]. Notice the requirement to use "\" as a separator between items in

passivelist, rather than the usual comma; the reason is that a comma may be a valid

part of an expression. The option is most easily explained by example. Suppose x1 is

a categorical variable with 3 levels, and that two dummy variables x1a, x1b have been

created by the commands

. generate byte x1a=(x1==2)

. generate byte x1b=(x1==3)

Now suppose that x1 is to be imputed by the mlogit command, and is to be treated as

the two dummy variables x1a and x1b when predicting other variables. Use of mlogit

is achieved by the option cmd(x1:mlogit). When x1 is imputed, we want x1a and x1b to

be updated with new values which depend on the imputed values of x1. This may be

achieved by specifying passive(x1a:x1==2 \ x1b:x1==3). It is necessary also to remove

x1 from the list of predictors when variables other than x1 are being imputed, and

this is done by using the substitute() option; in the present example, you would

specify substitute(x1:x1a x1b).

Note that although in this example x1a will take the (possibly unintended) value of 0

when x1 is missing, ice is careful to ensure that x1a (and x1b) inherit the

missingness of x1, and are passively imputed following active imputation of missing

values of x1. If this were not done, incorrect results could occur. The

responsibility of the user is to create x1a and x1b before running ice such that

their missing values are identical to those of x1.

A second example is multiplicative interactions between variables, for example,

between x1 and x2 (e.g. x12=x1*x2); this could be entered as passive(x12:x1*x2). It

would cause the interaction term x12 to be omitted when either x1 or x2 was being

imputed, since it would make no sense to impute x1 from its interaction with x2.

substitute() is not needed here.

It should be stressed that variables to be imputed passively must already exist and

must be included in mainvarlist, otherwise they will not be recognised.

replace permits filename to be overwritten with new data. replace may not be

abbreviated.

seed(#) sets the random number seed to #. To reproduce a set of imputations, the same

random number seed should be used. Default #: 0, meaning no seed is set by the

program.

substitute(sublist) is typically used with the passive() option to represent multilevel

categorical variables as dummy variables in models for predicting other variables.

See passive() for more details. The syntax of sublist is varname:dummyvarlist

[,varname:dummyvarlist ...] where varname is the name of a variable to be substituted

and dummyvarlist is the list of dummy variables representing it.

Note, however, the following important convenience feature: substitute() may be used

without corresponding expressions in passive() to recreate dummy variables

automatically. If the values of variables in dummyvarlist are NOT defined through

expressions involving varname in the passive() option, then the variables in

dummyvarlist are calculated according to the actual range of values of varname. For

example, suppose the options passive(x1a:x1==2 \ x1b:x1==3) and

substitute(x1:x1a x1b) were specified. Provided that all the non-missing values

of x1 were 2 when x1a==1 and all the non-missing values of x1 were 3 when x1b==1,

then passive(x1a:x1==2 \ x1b:x1==3) is implied by substitute(x1:x1a x1b) and can be

omitted. The rule applied by substitute(x:dummy1 [dummy2...]) for defining dummy

variables dummy1, dummy2, ... is as follows:

1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0.

2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise.

2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise.

3. Repeat steps 1 and 2a,b for dummy2, dummy3, ... as necessary.

With many such categorical variables this feature can save a lot of typing.
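
Steps 1 to 3 of the rule can be sketched as follows. The code is Python and purely illustrative; the function name rebuild_dummy and the use of None for missing values are assumptions. Steps 2a and 2b collapse into a single test, because xmin <= x <= xmax reduces to x = xmin when xmin = xmax.

```python
def rebuild_dummy(x_orig, dummy_orig, x_new):
    """Recreate a dummy variable for x_new from the original x and dummy.

    Step 1: find [xmin, xmax], the range of x over which the dummy is > 0.
    Step 2: set the dummy to 1 where xmin <= x <= xmax and 0 otherwise.
    """
    support = [
        xi for xi, di in zip(x_orig, dummy_orig)
        if xi is not None and di is not None and di > 0
    ]
    xmin, xmax = min(support), max(support)
    return [None if xi is None else int(xmin <= xi <= xmax) for xi in x_new]


# Original data: the third observation of x1 (and hence of x1a, x1b) is missing.
x1 = [1, 2, None, 3, 2]
x1a = [0, 1, None, 0, 1]
x1b = [0, 0, None, 1, 0]

# After x1 has been imputed (here the missing value became 3),
# step 3 repeats the rule for each dummy in turn.
x1_imputed = [1, 2, 3, 3, 2]
print(rebuild_dummy(x1, x1a, x1_imputed))
print(rebuild_dummy(x1, x1b, x1_imputed))
```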

trace(filename) monitors the convergence of the imputation algorithm. For each original

variable with missing values, the mean of the imputed values is stored as a variable

in filename, together with the cycle number at which that mean was calculated. The

results are stored only for the final imputation. For diagnostic purposes, it is

sensible to run trace() with m(1) and a large number of cycles, such as cycles(100).

When the run is complete, it is helpful to load filename into memory and plot the

mean for each imputed variable against the cycle number. If necessary, smoothing may

be applied to clarify any apparent pattern. Convergence is judged to have occurred

when the pattern of the imputed means is random. It is usually obvious from the

appearance of the plot how many cycles are needed for convergence.

Options for uvis

boot invokes a bootstrap method for creating imputed values (see Remarks).

gen(newvar) is not optional. newvar contains original (non-missing) and imputed

(originally missing) values of yvar.

match creates imputations by prediction matching. The default is to draw imputations at

random from the posterior distribution of the missing values of yvar, conditional on

the observed values and the members of xvarlist. See Remarks for further details.

noconstant suppresses the regression constant in all regressions.

replace permits newvar (see gen(newvar)) to be overwritten with new data. replace may not

be abbreviated.

seed(#) sets the random number seed to #. See Remarks for comments on how to ensure

reproducible imputations by using the seed() option. Default #: 0, meaning no seed

is set by the program.

Remarks

uvis imputes yvar from xvarlist according to the following algorithm (see van Buuren et

al (1999) section 3.2 for further technical details):

1. Estimate the vector of coefficients (beta) and the residual variance by regressing

the non-missing values of yvar on the current "completed" version of xvarlist.

Predict the fitted values etaobs at the non-missing observations of yvar.

2. Draw at random a value (sigma_star) from the posterior distribution of the

residual standard deviation.

3. Draw at random a value (beta_star) from the posterior distribution of beta,

allowing, through sigma_star, for uncertainty in beta.

4. Use beta_star to predict the fitted values etamis at the missing observations of

yvar.

5. The imputed values are predicted directly from beta_star, sigma_star and the

covariates. When imputation is by linear regression (regress command), this step

assumes that yvar is Normally distributed, given the covariates. For other types

of imputation, samples are drawn from the appropriate distribution.
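
For linear regression with a single covariate, steps 1 to 5 can be sketched as below. This is a Python illustration under stated assumptions, not the code used by uvis: the function names fit_line and impute_normal are hypothetical, missing values are None, and the draws in steps 2 and 3 use the usual normal-theory formulas (a scaled inverse chi-square draw for the residual variance, then a normal draw for beta around its estimate).

```python
import math
import random


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x.

    Returns the coefficients, the residual SD, the residual degrees of
    freedom and the three distinct entries of (X'X)^-1.
    """
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    v00, v01, v11 = sxx / det, -sx / det, n / det
    b0 = v00 * sy + v01 * sxy
    b1 = v01 * sy + v11 * sxy
    df = n - 2
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return (b0, b1), math.sqrt(rss / df), df, (v00, v01, v11)


def impute_normal(x, y, rng):
    """Return a copy of y with missing values (None) replaced by draws."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    # Step 1: regress the non-missing values of y on x.
    (b0, b1), sigma, df, (v00, v01, v11) = fit_line(
        [p[0] for p in obs], [p[1] for p in obs]
    )
    # Step 2: sigma_star = sigma * sqrt(df / g), with g ~ chi-square(df).
    g = rng.gammavariate(df / 2.0, 2.0)
    sigma_star = sigma * math.sqrt(df / g)
    # Step 3: beta_star = beta + sigma_star * L z, where L is the
    # Cholesky factor of (X'X)^-1 and z is standard normal.
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0_star = b0 + sigma_star * l00 * z0
    b1_star = b1 + sigma_star * (l10 * z0 + l11 * z1)
    # Steps 4 and 5: predict at the missing observations and add Normal noise.
    return [
        yi if yi is not None
        else b0_star + b1_star * xi + sigma_star * rng.gauss(0, 1)
        for xi, yi in zip(x, y)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
print(impute_normal(x, y, random.Random(1)))
```

Passing the same seeded generator state reproduces the same imputations, which mirrors the role of the seed() option.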

With the match option, step 5 is replaced by the following. For each missing observation

of yvar with prediction etamis, find the non-missing observation of yvar whose prediction

(etaobs) on observed data is closest to etamis. This closest non-missing observation is

used to impute the missing value of yvar.
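
Prediction matching as just described amounts to a nearest-neighbour lookup on the predictions. A minimal Python sketch follows; the function name match_impute is an assumption, and ties are resolved by taking the first closest observation, which is a choice made for the example rather than documented behaviour.

```python
def match_impute(eta_obs, y_obs, eta_mis):
    """For each prediction in eta_mis, borrow the observed y whose own
    prediction (eta_obs) is closest."""
    imputed = []
    for eta in eta_mis:
        nearest = min(range(len(eta_obs)), key=lambda i: abs(eta_obs[i] - eta))
        imputed.append(y_obs[nearest])
    return imputed


eta_obs = [1.0, 2.0, 3.5]     # predictions at the non-missing observations
y_obs = [10, 20, 30]          # the corresponding observed values of yvar
eta_mis = [1.2, 3.0, 9.0]     # predictions at the missing observations
print(match_impute(eta_obs, y_obs, eta_mis))
```

Because every imputed value is an observed value of yvar, imputations made this way cannot fall outside the observed range.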

The default draw method is not robust to departures from Normality and may produce

implausible imputations. For example, if the original distribution is skew and

positive-valued, the imputed distribution will not necessarily have the appropriate

amount of skewness, nor will all the imputed values necessarily be positive. Log

transformation of positive variables may greatly improve the appropriateness of the

imputations.

The alternative match method is recommended only for continuous variables when the

Normality assumption is clearly untenable, even approximately. It is not necessary, nor

is it recommended, for binary, ordered categorical or nominal variables. match may work

well when the distribution of a continuous variable is very non-Normal, but it may

sometimes result in biased imputations.

With the boot option, steps 2-4 are replaced by a bootstrap estimation of beta_star;

beta_star is estimated by regressing yvar on xvarlist after taking a bootstrap sample of

the non-missing observations. This has the advantage of robustness since the distribution

of beta is no longer assumed to be multivariate normal.
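
The bootstrap estimation of beta_star can be illustrated for a single covariate. The Python sketch below is an assumption-laden example, not the uvis code: the function name bootstrap_beta is hypothetical, None marks a missing yvar, and a degenerate resample in which every x is identical is simply redrawn.

```python
import random


def bootstrap_beta(x, y, rng):
    """Estimate (b0, b1) from a bootstrap sample of the non-missing observations."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    while True:
        # Resample the complete observations with replacement.
        sample = [rng.choice(obs) for _ in obs]
        xs = [p[0] for p in sample]
        if len(set(xs)) > 1:      # need at least two distinct x values
            break
    ys = [p[1] for p in sample]
    n = len(sample)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    b1 = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return my - b1 * mx, b1


x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
y[4] = None                       # one missing value of yvar
print(bootstrap_beta(x, y, random.Random(7)))
```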

Note that uvis will not impute observations for which a value of a variable in xvarlist

is missing. However, all original (missing or non-missing) observations of yvar will be

copied into newvarname in such cases. This is a change from the first release of uvis

(with mvis). Previously, newvarname would be set to missing whenever a value of a

variable in xvarlist was missing, irrespective of the value of yvar.

Missing data for ordered (or unordered) categorical covariates should be imputed by using

the ologit (or mlogit) command. In these cases, prediction matching is done on the scale

of the mean absolute difference in the predicted class probabilities, preceded by logit

transformation.

ice carries out multivariate imputation in mainvarlist using regression switching (van

Buuren et al 1999) as follows:

1. Ignore any observations for which mainvarlist has only missing values, or if the

cc(varlist) option has been specified, for which any member of varlist has

a missing value.

2. For each variable in mainvarlist with any missing data, randomly order that

variable and replicate the observed values across the missing cases. This step

initialises the iterative procedure by ensuring that no relevant values are

missing.

3. For each variable in mainvarlist in turn, impute missing values by applying uvis

with the remaining variables as covariates.

4. Repeat step 3 cycles() times, replacing the imputed values with updated values at

the end of each cycle.
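
Steps 2 to 4 can be sketched as a loop around a univariate imputer. The Python code below is an illustration only. The names initial_fill and ice_sketch are hypothetical, impute_one is a stand-in callback playing the role of uvis, and None marks a missing value. The reading of step 2 (shuffle the observed values, then reuse them in turn for the missing cases) is one interpretation of the description above. Variables are visited in the order given rather than by missingness.

```python
import random


def initial_fill(values, rng):
    """Step 2: fill missing cases with randomly ordered observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("variable has no observed values")
    rng.shuffle(observed)
    filled, k = [], 0
    for v in values:
        if v is None:
            filled.append(observed[k % len(observed)])
            k += 1
        else:
            filled.append(v)
    return filled


def ice_sketch(data, impute_one, cycles=10, seed=0):
    """Steps 2-4: regression switching over all variables with missing data."""
    rng = random.Random(seed)
    missing = {v: [x is None for x in col] for v, col in data.items()}
    current = {
        v: initial_fill(col, rng) if any(missing[v]) else list(col)
        for v, col in data.items()
    }
    for _ in range(cycles):                      # step 4
        for v in data:                           # step 3
            if not any(missing[v]):
                continue
            # Restore the original missingness of v before re-imputing it.
            target = [None if m else x for x, m in zip(current[v], missing[v])]
            covariates = {u: current[u] for u in data if u != v}
            current[v] = impute_one(target, covariates, rng)
    return current


def mean_fill(target, covariates, rng):
    """Toy imputer for the demonstration: replace None by the observed mean."""
    obs = [t for t in target if t is not None]
    m = sum(obs) / len(obs)
    return [m if t is None else t for t in target]


data = {
    "a": [1.0, None, 3.0, 4.0],
    "b": [2.0, 2.5, None, None],
    "c": [1.0, 1.0, 0.0, 1.0],
}
print(ice_sketch(data, mean_fill, cycles=3, seed=1))
```

The toy imputer ignores the covariates; a realistic impute_one would regress target on them, as uvis does.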

A single imputation sample is created for each variable with any relevant missing values.

Van Buuren recommends cycles(20) but goes on to say that 10 or even 5 iterations are

probably sufficient. We have chosen a compromise default of 10.

"Multiple imputation" (MI) implies the creation and analysis of several imputed datasets.

To do this, one would run ice with m set to a suitable number, for example 5. To obtain

final estimates of the parameters of interest and their standard errors, one would fit a

model in each imputation and carry out the appropriate post-MI averaging procedure on the

results from the m separate imputations. A suitable estimation tool for this purpose is

micombine.
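
The text above does not spell out the averaging procedure. The combination most commonly used for this purpose is Rubin's rules, sketched here in Python for a single parameter. The function name pool is an assumption, and whether micombine computes exactly these quantities is not stated in this help file.

```python
import math


def pool(estimates, variances):
    """Combine m point estimates and their squared standard errors.

    Returns the pooled estimate and its standard error (Rubin's rules).
    Requires m >= 2.
    """
    m = len(estimates)
    qbar = sum(estimates) / m                       # pooled estimate
    within = sum(variances) / m                     # average within-imputation variance
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return qbar, math.sqrt(total)


# Estimates of one coefficient from m = 3 imputed datasets.
est, se = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, se)
```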

Handling categorical variables

Binary variables present no difficulty: by default, in the MICE procedure, when such a

variable is the response, it is predicted from other variables by using logistic

regression; when it is a covariate, it is modelled in the only way possible, effectively