Case Study 3: Surviving the Titanic Disaster
Description
The Titanic disaster, which occurred on 15 April 1912, still captures the interest of film producers, historians and other scientists. The carefully designed ship became the tomb of more than 1500 people. Several characteristics of the passengers are recorded in the dataset used for this analysis. The dataset contains 2201 subjects. The data are coded as follows:
SURVIVED: 0 = Not survived, 1 = Survived
AGE: 0 = Child, 1 = Adult
GENDER: 0 = Male, 1 = Female
CLASS: 1 = First class, 2 = Second class, 3 = Third class, 4 = Crew members
Questions of interest:
- Which factor(s) are most important in predicting survival?
- Which subgroup has the highest survival rate? For instance:
Does “Women and children first” hold true in emergencies?
Did the crew leave the ship last, resulting in a low survival rate in this group?
Suggested approaches:
Data restructuring
Approach: Create a new variable for women and children. Create dummy variables for the class variable.
Reason: To compare the survival rates of adult males with those of women and children combined. To use in regression.
Type of questions addressed: “Is it women and children first?”
Summary statistics
Approach: Survival rates for each group (e.g. male vs female, or first class vs second class). Odds ratio between survival status and age, and between survival status and gender. Cross-table of survival status and class.
Reason: To compare survival rates of different groups. To quantify the association between survival and other variables.
Type of questions addressed: “Which subgroup has the highest survival rate?” “Is survival independent of the characteristics of the passenger?”
Visual displays
Approach: Bar charts of survival for all variables. Mosaic plots of all categorical variables.
Reason: To compare survival rates of different subgroups. To check independence among variables, and to explore multivariate relations.
Type of questions addressed: “Which subgroup has the highest survival rate?” “Are variables independent?” “Are there any unusually small or large subgroups?”
Regression
Approach: Logistic regression of survival status on the other variables.
Reason: To determine the most significant factors in surviving the Titanic.
Type of questions addressed: “Which factor(s) are most important in predicting/estimating survival rate?”
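The data restructuring step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three records are hypothetical, the names WOMEN_CHILD and CLASS_1 to CLASS_3 are invented here, and the dummies use the common one-indicator-per-class coding with crew as the baseline, which is not the same as the c1, c2, c3 coding used in the partial solutions.

```python
# Hypothetical records using the coding from this case study:
# AGE: 0 = child, 1 = adult; GENDER: 0 = male, 1 = female;
# CLASS: 1-3 = passenger class, 4 = crew.
records = [
    {"AGE": 1, "GENDER": 0, "CLASS": 4},  # adult male crew member
    {"AGE": 1, "GENDER": 1, "CLASS": 1},  # adult female, first class
    {"AGE": 0, "GENDER": 0, "CLASS": 3},  # male child, third class
]


def restructure(rec):
    """Return a copy of rec with a women-and-children flag and class dummies."""
    out = dict(rec)
    # 1 for any female or any child, 0 for adult males.
    out["WOMEN_CHILD"] = 1 if (rec["GENDER"] == 1 or rec["AGE"] == 0) else 0
    # One 0/1 dummy per passenger class; crew (CLASS = 4) is the
    # baseline and gets zeros in all three dummies (an assumed coding).
    for k in (1, 2, 3):
        out[f"CLASS_{k}"] = 1 if rec["CLASS"] == k else 0
    return out


restructured = [restructure(r) for r in records]
for row in restructured:
    print(row)
```

Any coding with three linearly independent dummies gives the same fitted probabilities; only the interpretation of the individual coefficients changes.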
Partial solutions
Percentages (raw counts)
Survival: Not survived 68% (1490) / Survived 32% (711)
Age: Adult 95% (2092) / Child 5% (109)
Gender: Female 21% (470) / Male 79% (1731)
Class: 1st 15% (325) / 2nd 13% (285) / 3rd 32% (706) / Crew 40% (885)
Survival rate (group total) / 1st / 2nd / 3rd / Crew / Total
Total / 0.625 (325) / 0.414 (285) / 0.252 (706) / 0.24 (885) / 0.323 (2201)
Women & Children / 0.973 (150) / 0.89 (117) / 0.422 (244) / 0.87 (23) / 0.698 (534)
Adult Males / 0.326 (175) / 0.083 (168) / 0.162 (462) / 0.223 (862) / 0.203 (1667)
Adult Female / 0.972 (144) / 0.86 (93) / 0.46 (165) / 0.869 (23) / 0.744 (425)
Children / 1 (6) / 1 (24) / 0.34 (79) / n/a (0) / 0.523 (109)
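As a consistency check, the survival rates and group totals above can be turned back into survivor counts. The sketch below is a rough illustration: it assumes each rate was rounded from a whole number of survivors, so rounding rate × group total recovers that number. The recovered totals should agree with the 373 and 338 survivors used in the two-by-two table that follows.

```python
# Survival rates and group totals copied from the table above,
# listed by class in the order 1st, 2nd, 3rd, crew.
groups = {
    "Women & Children": [(0.973, 150), (0.89, 117), (0.422, 244), (0.87, 23)],
    "Adult Males": [(0.326, 175), (0.083, 168), (0.162, 462), (0.223, 862)],
}


def recovered_survivors(cells):
    """Round rate * group total to a whole person in each class, then sum."""
    return sum(round(rate * size) for rate, size in cells)


survivors = {name: recovered_survivors(cells) for name, cells in groups.items()}
sizes = {name: sum(size for _, size in cells) for name, cells in groups.items()}

for name in groups:
    rate = survivors[name] / sizes[name]
    print(f"{name}: {survivors[name]} of {sizes[name]} survived (rate {rate:.3f})")
```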
Survival status / Male Adults / Women & Children / Total
Survived / 338 / 373 / 711
Not survived / 1329 / 161 / 1490
Total / 1667 / 534 / 2201
Odds ratio (male adults vs women & children) = (338/1329) / (373/161) = (338 × 161) / (1329 × 373) = 0.11
The odds of surviving the Titanic disaster were 89% lower for male adults than for women & children.
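The odds ratio of 0.11 can be reproduced directly from the counts in the two-by-two table. The following is a minimal sketch; the variable and function names are chosen here for illustration only.

```python
# Counts from the two-by-two table above.
survived = {"adult_males": 338, "women_children": 373}
not_survived = {"adult_males": 1329, "women_children": 161}


def odds(group):
    """Odds of survival = survivors / non-survivors within a group."""
    return survived[group] / not_survived[group]


# Odds of adult males relative to odds of women & children.
odds_ratio = odds("adult_males") / odds("women_children")
percent_lower = (1 - odds_ratio) * 100

print(f"odds (adult males)      = {odds('adult_males'):.3f}")
print(f"odds (women & children) = {odds('women_children'):.3f}")
print(f"odds ratio              = {odds_ratio:.2f}")
print(f"adult-male odds are {percent_lower:.0f}% lower")
```

Note that the statement is about odds, not probabilities: the survival probabilities themselves are 0.203 and 0.698, a ratio of about 0.29.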
Logistic regression
logit(P(Yi=1)) = β0 + β1 * I(Women&Child)
where Yi = 1 if subject i survived, and I(Women&Child) = 0 for an adult male and 1 for a woman or child
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.36914 0.06092 -22.48 <2e-16 ***
I(Women&Child) 2.20931 0.11226 19.68 <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
AIC: 2338.8
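With a single 0/1 predictor, the fitted coefficients can be checked by hand against the two-by-two table: the intercept is the log-odds of survival for adult males, and the slope is the log of the odds ratio of women and children to adult males. The sketch below illustrates this check using only the counts given earlier; it does not refit the model.

```python
import math

# Counts from the two-by-two table: (survived, not survived).
adult_males = (338, 1329)
women_children = (373, 161)


def log_odds(survived, not_survived):
    """Natural log of the odds of survival."""
    return math.log(survived / not_survived)


# For a model with one binary predictor, the maximum-likelihood fit
# reproduces the observed log-odds in each of the two groups.
beta0 = log_odds(*adult_males)             # intercept: adult males
beta1 = log_odds(*women_children) - beta0  # slope: log odds ratio


def inv_logit(x):
    """Convert a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-x))


p_adult_males = inv_logit(beta0)
p_women_children = inv_logit(beta0 + beta1)

print(f"beta0 = {beta0:.5f}, beta1 = {beta1:.5f}")
print(f"P(survive | adult male)       = {p_adult_males:.3f}")
print(f"P(survive | woman or child)   = {p_women_children:.3f}")
```

The two probabilities agree with the survival rates in the Total column of the survival rate table, and exp(-beta1) gives back the odds ratio of 0.11.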
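logit(P(Yi=1)) = β0 + β1*AGE + β2*SEX + β3*I(c1) + β4*I(c2) + β5*I(c3)
where SEX is the GENDER variable coded above, and
c1 = 0 if crew and 1 o.w.;
c2 = 0 if 2nd class or crew and 1 o.w.;
c3 = 0 if 3rd class or crew and 1 o.w.
With this coding, crew is the baseline (c1 = c2 = c3 = 0), 1st class has c1 = c2 = c3 = 1, 2nd class has c1 = c3 = 1 and c2 = 0, and 3rd class has c1 = c2 = 1 and c3 = 0.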
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1724 0.2567 -0.671 0.502
titanic2[, "AGE"] -1.0615 0.2440 -4.350 1.36e-05 ***
titanic2[, "SEX"] 2.4201 0.1404 17.236 < 2e-16 ***
titanic2[, c1] -1.9382 0.2535 -7.645 2.09e-14 ***
titanic2[, c2] 1.0181 0.1960 5.194 2.05e-07 ***
titanic2[, c3] 1.7778 0.1716 10.362 < 2e-16 ***
AIC: 2222.1
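To see what these estimates imply for a particular kind of passenger, the linear predictor can be evaluated and converted to a probability. The sketch below copies the estimates from the output above and applies the dummy definitions given for c1, c2 and c3; the function name and dictionary layout are illustrative assumptions, not part of the original analysis.

```python
import math

# Coefficient estimates copied from the output above.
coef = {"intercept": -0.1724, "AGE": -1.0615, "SEX": 2.4201,
        "c1": -1.9382, "c2": 1.0181, "c3": 1.7778}

# (c1, c2, c3) for each class, following the definitions of the model:
# c1 = 0 only for crew, c2 = 0 for 2nd class or crew, c3 = 0 for 3rd class or crew.
class_dummies = {"1st": (1, 1, 1), "2nd": (1, 0, 1),
                 "3rd": (1, 1, 0), "crew": (0, 0, 0)}


def survival_probability(age, sex, cls):
    """age: 0 = child, 1 = adult; sex: 0 = male, 1 = female; cls: a key of class_dummies."""
    c1, c2, c3 = class_dummies[cls]
    eta = (coef["intercept"] + coef["AGE"] * age + coef["SEX"] * sex
           + coef["c1"] * c1 + coef["c2"] * c2 + coef["c3"] * c3)
    return 1 / (1 + math.exp(-eta))  # inverse logit


for cls in class_dummies:
    print(f"{cls:>4}: adult male {survival_probability(1, 0, cls):.3f}, "
          f"adult female {survival_probability(1, 1, cls):.3f}")
```

Because the model has no interaction terms, these fitted probabilities will not match the observed subgroup rates exactly.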
Akaike Information Criterion (AIC)
This measure indicates a better fit when it is smaller. The measure is not standardized and is not interpreted for a single model on its own. For two models estimated from the same data set, the model with the smaller AIC is to be preferred.
The second model (AIC = 2222.1) has a smaller AIC than the first (AIC = 2338.8), so it is a better fit than the previous one.
The coefficient of AGE is negative (-1.0615), so moving from child (AGE = 0) to adult (AGE = 1) decreases the odds of survival, or equivalently, the probability of survival. (odds = p/(1-p), so odds ↓ imply p ↓ and (1-p) ↑)
So, an adult has lower odds of survival than a child of the same sex and class: OR = exp(-1.0615) = 0.35.
Odds ratio of survival for females to males = exp(2.42) = 11.25
The odds of surviving the Titanic were 11.25 times higher for females than for males of the same age group and class.
Compare odds of survival for 1st class with 2nd class: OR = exp(1.018) = 2.77
Compare odds of survival for 2nd class with 3rd class: OR = exp(1.778 - 1.018) = 2.14
Compare odds of survival for 3rd class passengers with crew: OR = exp(-1.94 + 1.018) = 0.4 ! (The odds of survival for 3rd class are 60% lower than for crew!)
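These class comparisons follow from the dummy coding: the log odds ratio between two classes is the difference in their class contributions to the linear predictor, with age and sex held fixed. A minimal sketch, using the estimates from the second model and illustrative function names:

```python
import math

# Class-related coefficient estimates from the second model's output.
b_c1, b_c2, b_c3 = -1.9382, 1.0181, 1.7778

# (c1, c2, c3) values implied by the dummy definitions for each class.
class_dummies = {"1st": (1, 1, 1), "2nd": (1, 0, 1),
                 "3rd": (1, 1, 0), "crew": (0, 0, 0)}


def class_log_odds(cls):
    """Contribution of class to the log-odds, relative to crew."""
    c1, c2, c3 = class_dummies[cls]
    return b_c1 * c1 + b_c2 * c2 + b_c3 * c3


def odds_ratio(cls_a, cls_b):
    """Odds of survival in cls_a relative to cls_b, at fixed age and sex."""
    return math.exp(class_log_odds(cls_a) - class_log_odds(cls_b))


or_1_vs_2 = odds_ratio("1st", "2nd")
or_2_vs_3 = odds_ratio("2nd", "3rd")
or_3_vs_crew = odds_ratio("3rd", "crew")

print(f"1st vs 2nd:  {or_1_vs_2:.2f}")
print(f"2nd vs 3rd:  {or_2_vs_3:.2f}")
print(f"3rd vs crew: {or_3_vs_crew:.2f}")
```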
Why does a comparison of the odds of survival for crew and 3rd class not make much sense?
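Test of independence between class and sex: Chisq = 349.9, df = 3, p-value = 1.557e-75
Reject Ho: class and sex are independent. The data show a strong association between class and sex.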
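The test statistic can be reproduced with a Pearson chi-square computation on the class-by-sex table. The counts by class and sex are not listed in this case study; the sketch below assumes the counts from the widely used 2201-person Titanic table, whose margins match the class and gender totals given earlier.

```python
# Class-by-sex counts as (male, female). ASSUMPTION: these come from the
# widely used 2201-person Titanic table, not from this case study.
counts = {
    "1st": (180, 145),
    "2nd": (179, 106),
    "3rd": (510, 196),
    "crew": (862, 23),
}

n = sum(m + f for m, f in counts.values())
col_totals = (sum(m for m, _ in counts.values()),
              sum(f for _, f in counts.values()))

chi_sq = 0.0
for row in counts.values():
    row_total = sum(row)
    for observed, col_total in zip(row, col_totals):
        # Count expected in this cell if class and sex were independent.
        expected = row_total * col_total / n
        chi_sq += (observed - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(counts) - 1) * (len(col_totals) - 1)

print(f"n = {n}, Chisq = {chi_sq:.1f}, df = {df}")
```

The largest contributions come from the crew row, where only 23 of 885 are female. This is worth keeping in mind when comparing crew with 3rd class passengers.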