Survival Analysis(Chapter 25)
Survival analysis is interested in either modeling the time to failure, death or some other event. It is typically assumed that all subjects will eventually “fail” whereas logistic regression is use more for data where we want to know the probability of failure at a set time point. Usually survival analysis will try to evaluate the effects of different “treatments” on the time to failure. A common complication of survival analysis is censoring; i.e., there is missing information as to either when the exact time of the event occurred or when the “clock” started.
Examples:
- Operating times until a type of machine breaks.
- Times until patients die due to a particular cancer.
- Times until children learn to read.
- Months of work until prostitutes acquire a particular disease.
- Times until after exposure to the HIV virus that people express AIDS symptoms.
Key concepts:
- Let T be a random variable which represents time to failure (T≥0). Example: A person dies on their 70th birthday, then T=70.
- Density function of t is f(t) which is the probability of failure at time period t. This can follow many probability distributions.
- F(t), the cumulative distribution function of T, gives P(T≤t); i.e., failure occurs at or before time t.
- Survival function, S(t)=1-F(t)=P(T>t); i.e., failure occurs after time t.
- Hazard function: Probability of failing during the immediate moment of time after time t given that the failure had not occurred before time t. =P( t < T < t+dt | T>t )=. For example, suppose the probability of a 70 year old dying before his 71st birthday is 0.07 and if the precision is recorded by year, then
Kaplan-Meier Estimator (aka product limit estimator).
The Kaplan-Meier estimator is a non-parametric estimator of the survival function.
Data: 1, 2, 2+, 3+, 6 where the "+" signs mean that the patient was still alive at the end of his or her follow-up but was no further information collected; that is, the patient was censored at that time.
General idea: Calculate a running proportion of the number alive at a given time while adjusting for censored data. We assume non-informative (random) right censoring.
Interval (Start-End) / # At Risk at Start of Interval / # Censored During Interval / # At Risk at End of Interval / # Who Died at End of Interval / Proportion Surviving This Interval / Cumulative Survival at End of Interval(0-1] / 5 / 0 / 4 / 1 / 4/5 = 0.8 / 0.80
(1-2] / 4 / 1 / 2 / 1 / 3/4 = 0.75 / 0.80 * 0.75 = 0.60
(2-3] / 2 / 1 / 1 / 0 / 2/2 = 1.00 / 0.60*1.00= 0.60
(3-6] / 1 / 0 / 0 / 1 / 0/1 = 0 / 0.6*0 = 0
You need to be very careful about whether an interval includes or only goes up to a time t. This may vary from stat package to stat package. If we had created the table with the intervals: [0-1), [1,2), etc. we would have gotten a slightly different table.
Data: 1, 2, 2+, 3+, 6
> library( survival )
> # time of death or censoring
> time.end <- c(1,2,2,3,6)
> # 1= death, 0=censored
> dead <- c(1,1,0,0,1)
> # create survival object
> surv.object <- Surv( time.end, dead)
> surv.object
[1] 1 2 2+ 3+ 6
> # kaplan-meier estimates
> kaplan1 <- survfit( surv.object ~ 1 ) # ~1 is because there are no covariates
> names( kaplan1 )
[1] "n" "time" "n.risk" "n.event" "n.censor" "surv"
[7] "type" "std.err" "upper" "lower" "conf.type" "conf.int"
[13] "call"
> cbind( kaplan1$time,kaplan1$n.event,kaplan1$surv)
[,1] [,2] [,3]
[1,] 1 1 0.8
[2,] 2 1 0.6
[3,] 3 0 0.6
[4,] 6 1 0.0
> plot(kaplan1, ylab="S(t)",xlab="t")
> # Note: Kaplan-Meier method is also called
> # the product-limit estimator
> 1*.8
[1] 0.8
> 1*.8*.75
[1] 0.6
> 1*.8*.75*0
[1] 0
A more involved example
> # Get Afifi's text book lung cancer data
> Lung <- read.table("
+ header=F, na.strings=".")
> names(Lung) <-
+ c("ID","STAGET","Stagen","Hist","Treat","Perfbl","Poinf","Smokfu","Smokbl","Days","Death")
> # Death: 0 = Alive(censored), 1=Dead
> # STAGET: Tumor size: 0=small, 1=large
> # Hist: Histology: 1=squamous cells, 2=other types of cells
> # Treat: 0=control (saline), 1=BCG
> attach( Lung )
> # First 8 rows of data
> Lung[1:8,]
ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
1 1 1 0 1 0 0 0 2 1 2926 0
2 2 1 0 1 1 0 0 1 1 590 1
3 3 0 0 1 0 0 0 1 2 2803 0
4 4 1 0 1 1 0 0 2 1 2762 0
5 5 1 0 1 0 0 0 2 1 2616 0
6 6 0 0 1 1 0 0 NA 1 2716 0
7 7 0 0 1 1 0 0 1 1 1485 1
8 8 1 0 1 0 0 0 NA 1 2456 0
> # First 10 people who died or were censored
> Lung[order(Days),][1:10,]
ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
281 281 0 0 2 1 0 1 NA 1 4 1
228 228 1 0 1 0 0 1 NA 1 7 1
139 139 0 0 2 0 1 0 NA 1 8 0
120 120 1 0 1 0 1 1 NA 2 29 1
186 186 1 0 2 0 0 0 NA 1 29 1
149 149 0 0 2 1 0 1 NA 1 31 1
344 344 0 0 2 1 1 0 NA 2 40 1
300 300 1 0 2 0 0 0 NA 1 65 1
199 199 1 0 2 1 0 0 NA 2 69 1
393 393 0 0 1 1 1 0 NA 2 78 1
> # Demo of order() function
> order( c(9,7,6,8,8) )
[1] 3 2 4 5 1
> #see help(Surv)
> #see help(with) used instead of "attach"
> Lsurv <- with( Lung, Surv(Days,Death)) # makes survival analysis data
> # A '+' denotes censored
> Lsurv[1:7]
[1] 2926+ 590 2803+ 2762+ 2616+ 2716+ 1485
> # Compare to last values of Lung data frame
> Lung[1:7,]
ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
1 1 1 0 1 0 0 0 2 1 2926 0
2 2 1 0 1 1 0 0 1 1 590 1
3 3 0 0 1 0 0 0 1 2 2803 0
4 4 1 0 1 1 0 0 2 1 2762 0
5 5 1 0 1 0 0 0 2 1 2616 0
6 6 0 0 1 1 0 0 NA 1 2716 0
7 7 0 0 1 1 0 0 1 1 1485 1
> # Kaplan-Meir estimate of Survival function for all data
> kmfit <- survfit( Lsurv ~ 1 )
> kmfit
Call: survfit(formula = Lsurv ~ 1)
records n.max n.start events median 0.95LCL 0.95UCL
401 401 401 204 2300 2017 3116
> Lung[order(Days),][1:7,]
ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
281 281 0 0 2 1 0 1 NA 1 4 1
228 228 1 0 1 0 0 1 NA 1 7 1
139 139 0 0 2 0 1 0 NA 1 8 0
120 120 1 0 1 0 1 1 NA 2 29 1
186 186 1 0 2 0 0 0 NA 1 29 1
149 149 0 0 2 1 0 1 NA 1 31 1
344 344 0 0 2 1 1 0 NA 2 40 1
> # summary( ) produces values of survival function
> # Only part shown.
> summary( kmfit )
Call: survfit(formula = Lsurv ~ 1)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
4 401 1 0.998 0.00249 0.993 1.000
7 400 1 0.995 0.00352 0.988 1.000
29 398 2 0.990 0.00497 0.980 1.000
31 396 1 0.988 0.00555 0.977 0.998
40 395 1 0.985 0.00607 0.973 0.997
65 394 1 0.983 0.00655 0.970 0.995
:: ::: :: ::::: ::::: ::: ::::
> plot( kmfit ) # show survival curve
> title( "All data together")
> kmfit.bystaget <- survfit( Lsurv ~ STAGET )
> kmfit.bystaget
Call: survfit(formula = Lsurv ~ STAGET)
records n.max n.start events median 0.95LCL 0.95UCL
STAGET=0 213 213 213 91 NA 2516 NA
STAGET=1 188 188 188 113 1689 1294 2236
> plot( kmfit.bystaget, lty=c(1,2))
> legend( 500, .2, legend=c("Small","Large"), lty=c(1,2) )
> title("Grouped by stage of tumor")
> kmfit.byTreat <- survfit( Lsurv ~ Treat )
> kmfit.byTreat
Call: survfit(formula = Lsurv ~ Treat)
records n.max n.start events median 0.95LCL 0.95UCL
Treat=0 195 195 195 96 2347 1915 NA
Treat=1 206 206 206 108 2253 1695 NA
> plot( kmfit.byTreat, lty=c(1,2))
> legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2) )
> title("Grouped by Treatment")
> kmfit.byHist <- survfit( Lsurv ~ Hist )
> kmfit.byHist
Call: survfit(formula = Lsurv ~ Hist)
records n.max n.start events median 0.95LCL 0.95UCL
Hist=1 196 196 196 93 2580 2084 NA
Hist=2 205 205 205 111 2089 1548 3116
> plot( kmfit.byHist, lty=c(1,2) )
> legend( 500, .2, legend=c("Squamous","Other"), lty=c(1,2) )
> title("Grouped by histology")
> plot( kmfit.byTreat, conf.int=T, col=c("black","grey"),lty=1:2)
> legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2), col=c("black","grey") )
> survdiff( Lsurv ~ Treat ) # log-rank test for difference between groups
Call:
survdiff(formula = Lsurv ~ Treat)
N Observed Expected (O-E)^2/E (O-E)^2/V
Treat=0 195 96 99.8 0.148 0.289
Treat=1 206 108 104.2 0.142 0.289
Chisq= 0.3 on 1 degrees of freedom, p= 0.591
> survdiff( Lsurv ~ STAGET )
Call:
survdiff(formula = Lsurv ~ STAGET)
N Observed Expected (O-E)^2/E (O-E)^2/V
STAGET=0 213 91 116.1 5.43 12.6
STAGET=1 188 113 87.9 7.18 12.6
Chisq= 12.6 on 1 degrees of freedom, p= 0.000379
> survdiff( Lsurv ~ Hist )
Call:
survdiff(formula = Lsurv ~ Hist)
N Observed Expected (O-E)^2/E (O-E)^2/V
Hist=1 196 93 104 1.07 2.16
Hist=2 205 111 100 1.10 2.16
Chisq= 2.2 on 1 degrees of freedom, p= 0.141
> ?survdiff
1