Duration (Survival) Models for Time to Event Data

INTRODUCTION

A relatively new area in econometrics is the analysis of duration data (also called time to event data). The econometrics literature on the analysis of duration data draws heavily from statistical methods that have been developed by industrial engineers and biomedical researchers, who use these methods to analyze such phenomena as the useful lives of machines and survival times of patients after a particular type of operation.

Dependent Variable

In duration analysis, the dependent variable being studied is a duration. Duration is defined as:

i) The amount of time that elapses until some event occurs, or

ii) The amount of time that elapses until a measurement is taken, when the measurement is taken before the event actually occurs.

Duration is often called time to event (e.g., time to death, time to machine failure, time to employment). If an observed duration corresponds to i, it is said to be uncensored. If the observed duration corresponds to ii, it is said to be censored. The following points should be noted about a duration variable:

1. A duration variable is always measured in units of time (e.g. minutes, days, weeks, months).

2. A duration variable must be non-negative (you can’t have a negative time to event).

Censoring

It is usually the case that some of the observations on a duration variable are censored. An observation is said to be censored when it is measured from the beginning of the period of interest until some point before the event takes place. For example, suppose that the duration variable is time to death after a heart transplant. Suppose that this variable is measured for a sample of 30 persons. Suppose that when the measurement is taken 20 of these individuals have died, but 10 are still alive. The 10 observations for the individuals still alive are censored observations.

Duration Data

No Censoring

Let the variable duration be denoted by T. Duration is a random variable that measures time to event. Because T is a random variable, its behavior can be described by a probability distribution, f(t). Let t1, t2, …, tn be a random sample of n observations on the random variable T. The sample will usually consist of a cross-section of n times to event (durations) on individuals, firms, machines, etc.

Censoring

Let T* be the value of duration in the absence of censoring. Let T be the observed value of duration. Let c be the value of duration when it is censored at time c. The observed value of duration is given by

$$
T = \begin{cases} T^{*} & \text{if } T^{*} < c \\ c & \text{if } T^{*} \ge c \end{cases}
$$

The censoring time, c, can be either a known constant or a random variable. If c is a random variable, then it must be independent of T*. To indicate whether an observation is censored, a censor status variable is usually created. This is an indicator variable that takes a value of 1 if the observation is not censored and 0 if the observation is censored.
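The censoring rule and the censor status variable can be sketched in a few lines of Python. This is only an illustration: the function name `observe` and the sample durations are hypothetical and do not come from any data set.

```python
def observe(t_star, c):
    """Return (observed duration, censor status) for a latent duration
    t_star and a censoring time c. Status is 1 if the observation is
    uncensored and 0 if it is censored."""
    if t_star < c:
        return t_star, 1
    return c, 0

# Hypothetical latent durations in weeks, censored at c = 80 weeks.
latent = [12.0, 95.5, 80.0, 47.3]
observed = [observe(t, 80.0) for t in latent]
print(observed)
```

Each pair holds the observed duration T and its censor status; the two latent durations at or beyond 80 weeks are recorded as 80 with status 0.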

Approaches to Analyzing Duration Data

Three alternative approaches can be used to analyze duration data. These are:

Parametric approach
Semiparametric approach
Nonparametric approach

The parametric approach makes assumptions about the probability distribution of T. This allows you to analyze duration data using regression models or regression-like models. The semiparametric approach makes only minimal assumptions about the probability distribution of T. The nonparametric approach makes no assumptions about the probability distribution of T.

PARAMETRIC APPROACH

There are two major types of parametric models of duration. These are:

Regression models
Regression-like models

Regression Models

A regression model is the appropriate model to use when your objective is to better understand how a set of variables, X1, X2, …, Xn, influences the expected (average) time to event, E(T).

Example

Let T be the amount of time an individual is unemployed measured in weeks. Thus, the event of interest is finding a job. Let X1 be the level of unemployment benefits in hundreds of dollars per month, X2 be years of work experience, and X3 be marital status; X3 =1 if single, X3 = 0 if married. You have a sample of 200 individuals. Some of these observations are censored at T = 80 weeks. Suppose your objective is to better understand how the level of unemployment benefits, work experience, and marital status influence the average amount of time an individual is unemployed. One way to proceed is to estimate the following classical linear regression model,

$$
T = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon
$$

The coefficient β1 measures the effect of a one-unit change ($100) in unemployment benefits on the average amount of time an individual is unemployed. The coefficients β2 and β3 have similar interpretations. However, there are three potential problems with this model.

1. The observations on T are censored. As a result, the OLS estimator will yield estimates of the coefficients that are biased and inconsistent. Thus, the appropriate model would be the censored regression model, which accounts for censored observations in the estimation procedure.
2. The classical linear regression model assumes that T has a normal distribution. There are a number of reasons to believe that duration (such as length of unemployment) does not have a normal distribution (the most obvious reason is that T is positive by construction). Jeffrey Wooldridge suggests that one way to deal with this problem is to use the logarithm of duration, ln(T), as the dependent variable. This is because ln(T) usually has a distribution that is closer to a normal distribution than T itself. In this case, each slope coefficient (multiplied by 100) measures the approximate percentage change in T for a one-unit change in the corresponding X.
3. The dependent variable, T, measures a process that takes place over the length of time (0, t). Regression analysis assumes that the value of X does not change during the period that T is being observed. For example, suppose that an individual is unemployed for 12 months. Regression analysis assumes the level of unemployment benefits he received (X1) and his marital status (X3) did not change during this period of time. If either of these variables does change during the time an individual is unemployed, this greatly complicates the analysis.
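The first problem, bias from treating censored values as if they were true durations, can be seen in a small noise-free sketch. The data below are hypothetical and chosen only so that the arithmetic is easy to follow; the helper `ols_slope` is an illustrative name, not part of any library.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    return cov / var

# Hypothetical data: true durations rise by exactly 10 weeks per unit of x.
x = list(range(10))
latent = [10.0 + 10.0 * xi for xi in x]
observed = [min(t, 80.0) for t in latent]   # durations censored at 80 weeks

slope_latent = ols_slope(x, latent)         # slope using the true durations
slope_observed = ols_slope(x, observed)     # slope using the censored durations
print(slope_latent, slope_observed)
```

In this sketch the slope estimated from the censored durations is smaller than the true slope of 10, because the longest durations have been cut off at 80 weeks.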

Regression-Like Models

A regression-like model is the appropriate model to use when analyzing duration data if your objective is to estimate any of the following.

The probability that an event will occur before time t.
The probability that an event will occur after time t.
The probability that an event will occur between time t and time t+1.
The probability that an event will occur between time t and time t+1, given that it has not occurred up to time t.

Notice that the focus here is not on expected duration but on probabilities associated with duration. However, a regression-like model can also be used to analyze average duration or median duration.

Example

Let T be the amount of time an individual is unemployed. We might be interested in the following questions.

What is the probability that an individual will be unemployed for 6 months or less?
What is the probability that an individual will be unemployed for more than 6 months?
What is the probability that an individual will be unemployed between 6 and 7 months?
Given that an individual has been unemployed for 6 months, what is the probability that he will find a job within the next month?
Will the probability that an individual finds a job increase or decrease the longer he is unemployed?
What is the average or median amount of time an individual is unemployed?

Probability Distributions for a Duration Variable

A continuous duration random variable, T, can be described by four alternative functions. These are:

Probability density function
Cumulative distribution function
Survival function
Hazard function

Once you choose a particular type of probability density function (e.g., normal, exponential, Weibull, etc.) you can derive the other three functions. Thus, all four functions have the same parameters and are simply different ways of describing the same system of probabilities.

Probability Density Function

Let T be a continuous duration random variable, and let t be a specific value of T. Let T have a probability density function given by f(t). The probability density function f(t) allows you to calculate the probability that T will fall in the interval between t1 and t2; that is,

$$
\Pr(t_1 \le T \le t_2) = \int_{t_1}^{t_2} f(t)\,dt
$$

Thus, the probability that T will fall in the interval between t1 and t2 is equal to the area under the curve of f(t) between the values t1 and t2. For example, if T is length of unemployment in weeks and t1 = 40 and t2 = 42, then you can find the probability that an individual will be unemployed between 40 and 42 weeks.

Cumulative Distribution Function

Given the probability density function f(t), the cumulative distribution function F(t) can be derived as follows,

$$
F(t) = \Pr(T \le t) = \int_{0}^{t} f(s)\,ds
$$

Thus, the probability that T will take a value that is less than or equal to t is equal to the area under the curve of f(t) between 0 and t. For example, if T is the length of unemployment in weeks and t = 52, then you can find the probability that an individual will be unemployed for 52 weeks or less.

Survival Function

Given the cumulative distribution function F(t), the survival function S(t) can be derived as follows,

$$
S(t) = \Pr(T \ge t) = 1 - F(t)
$$

Thus, the probability that T will take a value that is greater than or equal to t is equal to one minus the area under the curve of f(t) between 0 and t. This is equal to the area under the curve of f(t) between t and the maximum value of T. For example, if T is the length of unemployment in weeks and t = 52, then you can find the probability that an individual will be unemployed for 52 weeks or more; that is, you can find the probability that an individual will be unemployed for at least 52 weeks.

Hazard Function

Given the probability density function f(t) and the survival function S(t), the hazard function h(t) can be derived as follows,

$$
h(t) = \frac{f(t)}{S(t)}
$$

The hazard function is a particular type of conditional probability function. It tells you the probability that an event will occur in the next short interval of time, given that it has not occurred up to time t. Roughly speaking, it tells you the rate at which the event will occur at time t. For example, if T is length of unemployment in weeks and t = 52, then you can find the probability that an individual will find employment during the next week, given that he has been unemployed for 52 weeks. That is, the hazard function tells you the rate at which individuals who have been unemployed for 52 weeks are finding jobs. For example, a hazard rate of 0.05 at t = 52 implies that about 5 of every 100 individuals who have been unemployed for 52 weeks are expected to find a job during the following week.
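The way the cumulative distribution, survival, and hazard functions follow from a density can be sketched numerically. The density below (an exponential with a rate of 0.02 per week) is an assumption made purely for illustration, and the trapezoid-rule helper `integrate` is a hypothetical name.

```python
import math

def pdf(t, rate=0.02):
    """Illustrative density: exponential with an assumed rate of 0.02 per week."""
    return rate * math.exp(-rate * t)

def integrate(func, a, b, steps=10_000):
    """Trapezoid-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for i in range(1, steps):
        total += func(a + i * width)
    return total * width

def cdf(t):
    return integrate(pdf, 0.0, t)        # F(t) = Pr(T <= t)

def survival(t):
    return 1.0 - cdf(t)                  # S(t) = 1 - F(t)

def hazard(t):
    return pdf(t) / survival(t)          # h(t) = f(t) / S(t)

prob_40_to_42 = integrate(pdf, 40.0, 42.0)   # Pr(40 <= T <= 42)
print(prob_40_to_42, cdf(52.0), survival(52.0), hazard(52.0))
```

For this assumed density the hazard at 52 weeks comes out at approximately 0.02, the same as at any other duration.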

The Hazard Function and Duration Dependence

Often we are interested in questions like the following.

Is it more likely, less likely, or equally likely that an individual will find a job the longer he is unemployed?
Is it more likely, less likely, or equally likely that a strike will end the longer it lasts?
Is it more likely, less likely, or equally likely that a patient will die the longer he has survived after open heart surgery?

The answer to these questions depends upon the slope of the hazard function. We have the following definitions.

If the hazard function has a positive slope, then the distribution of the duration variable has positive duration dependence. In this case, the longer the duration (e.g., unemployment) the greater the probability the event will occur in the next short period (e.g., the greater the probability the individual will find employment).
If the hazard function has a negative slope, then the distribution of the duration variable has negative duration dependence. In this case, the longer the duration the smaller the probability the event will occur in the next short period.
If the hazard function is constant (has a slope of zero), then the distribution of the duration variable has no duration dependence: the probability that an event will occur in the next short period does not depend upon duration. In this case, the event is said to have no memory.
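These definitions can be turned into a simple numerical check. The sketch below classifies a hazard function by the sign of its changes over a grid of durations; the function name `duration_dependence`, the grid, and the three example hazards are all hypothetical choices made for illustration.

```python
def duration_dependence(hazard, times, tol=1e-12):
    """Classify duration dependence from the sign of the hazard's slope,
    checked between consecutive points of an increasing time grid."""
    values = [hazard(t) for t in times]
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d > tol for d in diffs):
        return "positive"
    if all(d < -tol for d in diffs):
        return "negative"
    if all(abs(d) <= tol for d in diffs):
        return "none"
    return "mixed"   # the hazard is not monotonic on this grid

grid = [1.0, 2.0, 4.0, 8.0, 16.0]
rising = duration_dependence(lambda t: 0.1 * t, grid)
falling = duration_dependence(lambda t: 0.5 * t ** -0.5, grid)
flat = duration_dependence(lambda t: 0.05, grid)
print(rising, falling, flat)
```

A check on a finite grid can only describe the hazard over the durations examined; it says nothing about its shape outside that range.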

Duration Dependence and Functional Form

If economic theory unequivocally indicates that a duration variable has, for example, negative duration dependence, then you should choose a functional form for the probability distribution that imposes this structure on the data. However, if economic theory allows a duration variable to have, for example, positive or negative duration dependence, then you should not choose a functional form for the probability distribution that imposes either positive or negative duration dependence on the data. If you do, you create a fait accompli. In this case, you must choose a functional form for the probability distribution that is flexible enough to allow for both positive and negative duration dependence, and allow the data to determine the outcome.

Choosing a Functional Form for the Probability Distribution of a Duration Variable

To specify a parametric duration model, and estimate the parameters of this model, you must choose a particular functional form for the distribution of the duration variable. The 2 distributions chosen most often are the following:

1. Exponential distribution

2. Weibull distribution

Exponential Distribution

The exponential distribution is used frequently to specify a parametric duration model. This is because it is easy to work with, easy to interpret, and can often be justified as a reasonable approximation of the data generation process. The probability density function, cumulative distribution function, survival function, and hazard function for the exponential distribution are as follows.

$$
f(t) = \lambda \exp(-\lambda t)
$$

$$
F(t) = 1 - \exp(-\lambda t)
$$

$$
S(t) = \exp(-\lambda t)
$$

$$
h(t) = \lambda
$$

where the parameter λ > 0. The probability density function, cumulative distribution function, survival function and hazard function for the parameter value λ = 1 are illustrated in Figure 3. Note the following.

The exponential distribution has only one parameter, λ.
The mean of the distribution is 1/λ and the variance is 1/λ²; that is, E(T) = 1/λ and Var(T) = 1/λ². The median of the distribution is given by (0.69314718)(1/λ), where 0.69314718 = ln 2.
The hazard function is constant and equal to λ. Thus, this distribution imposes the restriction of no duration dependence. Because of this characteristic, the exponential distribution is sometimes called memoryless.
The major shortcoming of the exponential distribution is that it depends on only one parameter: λ. The family of distributions obtained by varying the value of λ is not very flexible. Because both the mean and the variance are determined by λ alone, they cannot be adjusted separately. Thus, the exponential distribution will not be a good approximation of the data generation process if the sample contains both very long and very short durations.
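The mean, the median, and the memoryless property of the exponential distribution can be checked with a short sketch. The rate parameter (written `lam` in the code) is set to an assumed value of 0.05 per week, and the durations `s` and `t` are arbitrary illustrative choices.

```python
import math

lam = 0.05   # assumed rate parameter, per week

def survival(t):
    return math.exp(-lam * t)        # S(t) = exp(-lam * t)

mean = 1.0 / lam                     # E(T) = 1 / lam
median = math.log(2.0) / lam         # solves S(median) = 0.5

# Memoryless property: Pr(T >= s + t | T >= s) equals Pr(T >= t).
s, t = 52.0, 4.0
conditional = survival(s + t) / survival(s)

print(mean, median)
print(math.isclose(conditional, survival(t)))
```

The median is smaller than the mean, as expected for a distribution that is skewed to the right, and the conditional survival probability does not depend on the 52 weeks already elapsed.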

Weibull Distribution

The distribution that is probably used most often to specify a parametric duration model is the Weibull distribution. The Weibull distribution is a generalization of the exponential distribution and has the latter as a special case. The probability density function, cumulative distribution function, survival function, and hazard function for the Weibull distribution are as follows

$$
f(t) = \lambda\alpha(\lambda t)^{\alpha-1}\exp[-(\lambda t)^{\alpha}]
$$

$$
F(t) = 1 - \exp[-(\lambda t)^{\alpha}]
$$

$$
S(t) = \exp[-(\lambda t)^{\alpha}]
$$

$$
h(t) = \lambda\alpha(\lambda t)^{\alpha-1} = \alpha\lambda^{\alpha}t^{\alpha-1}
$$

where λ > 0 and α > 0. The probability density function, cumulative distribution function, survival function and hazard function for the parameter values λ = 1 and α = 0.5 are illustrated in Figure 1 and for parameter values λ = 1 and α = 3 in Figure 2. Note the following.

The Weibull distribution has two parameters, λ and α.
The Weibull distribution collapses to the exponential distribution when α = 1. Thus, the exponential distribution is a special case of the more general Weibull distribution.
The hazard function can be either monotonically increasing, monotonically decreasing, or constant, depending on whether the parameter α is greater than one, less than one, or equal to one. Thus, the Weibull distribution has positive duration dependence if α > 1, negative duration dependence if α < 1, and no duration dependence if α = 1.
The parameter α represents the shape of the distribution and the parameter λ represents the location of the distribution.
The median of the distribution is given by: Median = (0.69314718)^(1/α) (1/λ). Because the Weibull and exponential distributions are skewed to the right, the median may be a better measure of central tendency than the mean.
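The properties listed above can be verified with a short sketch. The code writes the two Weibull parameters as `lam` (the parameter that rescales the time axis) and `alpha` (the shape parameter); the function names and the parameter values are illustrative assumptions.

```python
import math

def weibull_survival(t, lam, alpha):
    return math.exp(-((lam * t) ** alpha))             # S(t)

def weibull_hazard(t, lam, alpha):
    return lam * alpha * (lam * t) ** (alpha - 1.0)    # h(t)

def weibull_median(lam, alpha):
    return math.log(2.0) ** (1.0 / alpha) / lam        # solves S(t) = 0.5

# Compare the hazard at t = 1 and t = 2 for three shape values.
for alpha in (0.5, 1.0, 3.0):
    early = weibull_hazard(1.0, 1.0, alpha)
    late = weibull_hazard(2.0, 1.0, alpha)
    print(alpha, early, late, weibull_median(1.0, alpha))
```

With `alpha` below one the hazard falls between t = 1 and t = 2, with `alpha` above one it rises, and with `alpha` equal to one it stays constant, which is the exponential case.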

Adding Explanatory Variables to a Parametric Duration Model

The duration model developed above is a univariate duration model; it includes one variable: the dependent variable. However, one or more explanatory variables can also be included in the duration model. It is possible to allow changes in the explanatory variables to influence the probability distribution of the dependent variable in various ways. Many parametric duration models allow changes in the explanatory variables to change the probability density, cumulative distribution, survival, and hazard functions by rescaling the horizontal axis. This is called an accelerated failure time model. The coefficients of the explanatory variables are relatively easy to interpret for most distributions and sometimes have a regression-like interpretation.

Adding Explanatory Variables to the Weibull Duration Model

The Weibull duration model is an accelerated failure time model. To include explanatory variables in the Weibull model, which has the exponential model as a special case, proceed as follows. Let T be a duration random variable that has a Weibull distribution. This distribution is described by two parameters: α and λ. The parameter α represents the shape of the distribution and the parameter λ represents the location of the distribution. If λ increases (decreases), the distribution shifts to the left (right). Assume α is a constant. Let λ be a function of the explanatory variables. For the unemployment example,

$$
\lambda = g(X_1, X_2, X_3)
$$

where the X’s are the explanatory variables. Because the parameter λ is a function of the explanatory variables, whenever an explanatory variable changes this will rescale the T-axis, thereby changing the location of the distribution. To simplify the estimation procedure, let g be an exponential function,

$$
\lambda = \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)]
$$
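The rescaling of the T-axis can be illustrated with a short sketch. All coefficient values and covariate values below are hypothetical and are not estimates from any data set; the function names are likewise illustrative. The code writes the Weibull parameters as `lam` and `alpha`.

```python
import math

def scale_parameter(beta, x):
    """lam = exp[-(b0 + b1*X1 + b2*X2 + b3*X3)]; beta[0] is the intercept."""
    index = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(-index)

def median_duration(beta, x, alpha):
    """Weibull median: (ln 2)^(1/alpha) / lam."""
    return math.log(2.0) ** (1.0 / alpha) / scale_parameter(beta, x)

# Hypothetical values, NOT estimates from any data set.
beta = [2.5, 0.10, -0.03, 0.20]    # intercept, benefits, experience, single
alpha = 1.2
low_benefits = [3.0, 10.0, 1.0]    # X1 = 3 ($300 per month), 10 years, single
high_benefits = [4.0, 10.0, 1.0]   # the same individual with X1 = 4

ratio = (median_duration(beta, high_benefits, alpha)
         / median_duration(beta, low_benefits, alpha))
print(ratio)   # equals exp(0.10), the factor by which the T-axis is rescaled
```

Under these assumed values, raising X1 by one unit multiplies median duration by exp(0.10), whatever the value of `alpha`.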

Estimation

The Weibull model can be specified in survival function form as follows,

$$
S(t) = \exp[-(\lambda t)^{\alpha}]
$$

where

$$
\lambda = \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)]
$$

To obtain estimates of the parameters of the Weibull model, β0, β1, β2, β3 and α, the maximum likelihood estimation procedure is used. The estimates of β0, β1, β2, β3 and α are the values that maximize the likelihood function for the sample of observations. The likelihood function accounts for both uncensored and censored observations.
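One common way to let a likelihood function account for censoring is to have each uncensored observation contribute the log density and each censored observation contribute the log survival probability. The sketch below assumes that form and, to stay short, treats the time-rescaling parameter `lam` as a single number rather than a function of explanatory variables. The sample durations are hypothetical.

```python
import math

def weibull_loglik(lam, alpha, durations, status):
    """Log-likelihood for a Weibull sample with right-censoring.
    status[i] is 1 for an uncensored duration and 0 for a censored one."""
    total = 0.0
    for t, d in zip(durations, status):
        log_survival = -((lam * t) ** alpha)                 # ln S(t)
        log_hazard = (math.log(lam * alpha)
                      + (alpha - 1.0) * math.log(lam * t))   # ln h(t)
        # Uncensored: ln f(t) = ln h(t) + ln S(t). Censored: ln S(t) only.
        total += d * log_hazard + log_survival
    return total

# Hypothetical durations in weeks; the last two are censored at 80.
durations = [12.0, 30.0, 45.0, 7.0, 80.0, 80.0]
status = [1, 1, 1, 1, 0, 0]

# With alpha = 1 (exponential) and no covariates the maximizer has a closed
# form: number of uncensored observations divided by total observed time.
lam_hat = sum(status) / sum(durations)
best = weibull_loglik(lam_hat, 1.0, durations, status)
lower = weibull_loglik(0.9 * lam_hat, 1.0, durations, status)
higher = weibull_loglik(1.1 * lam_hat, 1.0, durations, status)
print(best > lower, best > higher)
```

In practice the full model would be maximized numerically over the coefficients and `alpha` together; the closed-form check here only confirms that the function behaves as expected in the exponential special case.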

Interpretation of Parameter Estimates

The parameter estimates of the Weibull model can be interpreted in two ways: 1) in terms of the effects of the explanatory variables on median duration, and 2) in terms of the effects of the explanatory variables on the hazard rate.
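Both interpretations can be derived from the formulas given earlier. The following is a sketch, assuming the Weibull parameterization used in this section, with shape parameter α and with λ = exp[-(β0 + β1X1 + β2X2 + β3X3)]:

```latex
\begin{align*}
\text{Median} &= (\ln 2)^{1/\alpha}\,\frac{1}{\lambda}
  \;\Longrightarrow\;
  \ln(\text{Median}) = \tfrac{1}{\alpha}\ln(\ln 2)
    + \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3, \\
h(t) &= \alpha\lambda^{\alpha}t^{\alpha-1}
  \;\Longrightarrow\;
  \ln h(t) = \ln\alpha + (\alpha-1)\ln t
    - \alpha\,(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3).
\end{align*}
```

Under these assumptions, a one-unit increase in Xj changes ln(Median) by βj, so median duration changes by approximately 100βj percent, and it changes ln h(t) by -αβj at every duration t. A positive coefficient therefore lengthens median duration and lowers the hazard rate.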