STA 6505, Fall 2008, Homework #2 Solutions
Ch.2: Exercises 2.6, 2.7, 2.8, 2.12, 2.15, 2.18
2.6. A newspaper article preceding the 1994 World Cup semifinal match between Italy and Bulgaria stated that “Italy is favored 10-11 to beat Bulgaria, which is rated 10-3 to reach the final.” Suppose that this means that the odds that Italy wins are 1.1, and the odds that Bulgaria wins are 0.3. Find the probability that each team wins, and comment.
The probability may be found from the odds as $\pi = \text{odds}/(1 + \text{odds})$. Hence, the probability that Italy wins is $1.1/2.1 = 0.524$, and the probability that Bulgaria wins is $0.3/1.3 = 0.231$. These two probabilities sum to 0.755, not 1. Hence, unless there is the possibility of a tie (which cannot occur in a World Cup semifinal), the two quoted odds are inconsistent with each other.
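The conversion from odds to probability can be checked numerically. The following is a minimal sketch; the helper name `odds_to_prob` is introduced here for illustration only.

```python
def odds_to_prob(odds):
    """Convert odds to a probability: pi = odds / (1 + odds)."""
    return odds / (1.0 + odds)

p_italy = odds_to_prob(1.1)     # quoted odds that Italy wins
p_bulgaria = odds_to_prob(0.3)  # quoted odds that Bulgaria wins
total = p_italy + p_bulgaria    # would equal 1 if the two odds were coherent

print(round(p_italy, 3), round(p_bulgaria, 3), round(total, 3))
```

If the two quoted odds described the same match coherently, `total` would equal 1.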
2.7. In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals 0.001304 for current smokers and 0.000121 for nonsmokers (M. Pagano and K. Gauvreau, Principles of Biostatistics, Duxbury Press, Pacific Grove, CA, 1993, p. 134).
a) Find and interpret the difference of proportions and the relative risk. Which measure is more informative for this data? Why?
The difference of proportions is 0.001304 – 0.000121 = 0.001183, or about 0.0012. That is, the annual proportion of female smokers over 35 who die of lung cancer exceeds the corresponding proportion for female nonsmokers by about 0.12 percentage points, a very small difference in absolute terms.
The relative risk is R.R. = 0.001304/0.000121 = 10.78. The estimated annual probability of a woman over 35 dying of lung cancer is 10.78 times as high for smokers as for nonsmokers. The relative risk is the more informative measure for these data, since the difference of proportions makes it appear that there is no association. The event under consideration, dying of lung cancer, is rare, so the difference of proportions is necessarily small, while the relative risk shows that the risk for smokers is large relative to that for nonsmokers.
b) Find and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values.
The odds ratio is (0.001304/0.998696)/(0.000121/0.999879) = 10.79. The odds of a woman over 35 dying of lung cancer if she is a smoker are 10.79 times as large as the odds of a woman over 35 dying of lung cancer if she is a nonsmoker. The odds ratio and the relative risk are close in value for rarely occurring events. The odds ratio equals the relative risk multiplied by $(1 - \pi_2)/(1 - \pi_1)$, where $\pi_1$ and $\pi_2$ are the probabilities for smokers and nonsmokers. When the probability of the outcome (dying of lung cancer) is close to zero in both groups, this factor is close to 1.
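The three measures in this exercise can be reproduced directly from the two quoted probabilities. This is a short sketch; the variable names are chosen here for illustration.

```python
p_smoker = 0.001304     # annual probability, current smokers
p_nonsmoker = 0.000121  # annual probability, nonsmokers

diff = p_smoker - p_nonsmoker
relative_risk = p_smoker / p_nonsmoker
odds_ratio = (p_smoker / (1 - p_smoker)) / (p_nonsmoker / (1 - p_nonsmoker))

# The odds ratio is the relative risk times this factor, which is
# close to 1 whenever both probabilities are close to 0.
factor = (1 - p_nonsmoker) / (1 - p_smoker)

print(round(diff, 6), round(relative_risk, 2), round(odds_ratio, 2), round(factor, 5))
```

Because `factor` is so close to 1 here, the odds ratio and the relative risk agree to within about 0.01.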
2.8. For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4. (For data, see R. J. M. Dawson, Journal of Statistics Education, 3, 1995).
a) What is wrong with the interpretation, “The probability of survival for females was 11.4 times that for males?”
The value 11.4 is a ratio of odds, not a ratio of probabilities. The correct interpretation would be “The odds of survival for females were 11.4 times those for males.” The odds ratio and the relative risk would not be nearly the same in this case, since the event in question (survival) was not a sufficiently rare event.
b) The odds of survival for females equaled 2.9. For each gender, find the proportion who survived.
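Since the odds ratio was $\theta = \text{odds}_F/\text{odds}_M = 11.4$, and $\text{odds}_F = 2.9$, then $\text{odds}_M = 2.9/11.4 = 0.254$. The proportion of females who survived was then $2.9/(1 + 2.9) = 0.744$. The proportion of males who survived was $0.254/(1 + 0.254) = 0.203$.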
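The calculation in part (b) can be sketched in a few lines. It also makes the point of part (a) concrete, since the ratio of the two survival proportions turns out to be far smaller than 11.4. Variable names are illustrative.

```python
odds_ratio = 11.4   # odds of survival, females relative to males
odds_female = 2.9   # odds of survival for females

odds_male = odds_female / odds_ratio
prop_female = odds_female / (1 + odds_female)
prop_male = odds_male / (1 + odds_male)

# Ratio of survival proportions (the relative risk), for comparison
# with the odds ratio.
relative_risk = prop_female / prop_male

print(round(odds_male, 3), round(prop_female, 3), round(prop_male, 3), round(relative_risk, 2))
```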
2.12. Table 2.10 refers to applicants to graduate school at the University of California at Berkeley, for fall, 1973. It presents admissions decisions by gender of applicant for the six largest graduate departments. Denote the three variables by A = whether admitted, G = gender, and D = Department. Find the sample AG conditional odds ratios and the marginal odds ratio. Interpret, and explain why they give such different indications of AG association.
Let the variable A be coded 1 = “Yes”, 2 = “No”. Let the variable G be coded 1 = “Male”, 2 = “Female”.
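With this coding, each odds ratio is the odds of admission for males divided by the odds of admission for females, computed from the cell counts of Table 2.10.
For Department A, the conditional odds ratio is
$\hat\theta_{AG(A)} = (512 \times 19)/(313 \times 89) = 0.349$.
For Department B, the conditional odds ratio is
$\hat\theta_{AG(B)} = (353 \times 8)/(207 \times 17) = 0.803$.
For Department C, the conditional odds ratio is
$\hat\theta_{AG(C)} = (120 \times 391)/(205 \times 202) = 1.133$.
For Department D, the conditional odds ratio is
$\hat\theta_{AG(D)} = (138 \times 244)/(279 \times 131) = 0.921$.
For Department E, the conditional odds ratio is
$\hat\theta_{AG(E)} = (53 \times 299)/(138 \times 94) = 1.222$.
For Department F, the conditional odds ratio is
$\hat\theta_{AG(F)} = (22 \times 317)/(351 \times 24) = 0.828$.
The marginal odds ratio is
$\hat\theta_{AG} = (1198 \times 1278)/(1493 \times 557) = 1.841$.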
The marginal odds ratio is greater than any of the conditional odds ratios. There is a relatively strong association between the variable D (department) and each of the other two variables, A = Admitted? and G = Gender, that accounts for the discrepancy between the conditional odds ratios and the marginal odds ratio. Some departments have higher proportions of females admitted; others have higher proportions of males admitted.
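The conditional and marginal odds ratios can be computed as sketched below. Table 2.10 is not reproduced in this solution, so the counts in the code are an assumption: they are the commonly published admitted/rejected counts for the six largest Berkeley departments in 1973, and should be checked against Table 2.10 before relying on the output.

```python
# ASSUMED counts (admitted, rejected) by department; verify against Table 2.10.
males = {"A": (512, 313), "B": (353, 207), "C": (120, 205),
         "D": (138, 279), "E": (53, 138), "F": (22, 351)}
females = {"A": (89, 19), "B": (17, 8), "C": (202, 391),
           "D": (131, 244), "E": (94, 299), "F": (24, 317)}

def odds_ratio(m_yes, m_no, f_yes, f_no):
    """Odds of admission for males divided by odds of admission for females."""
    return (m_yes * f_no) / (m_no * f_yes)

# One odds ratio per department (conditional on D).
conditional = {d: odds_ratio(*males[d], *females[d]) for d in males}

# Collapse over departments to get the marginal A-by-G table.
m_yes = sum(yes for yes, _ in males.values())
m_no = sum(no for _, no in males.values())
f_yes = sum(yes for yes, _ in females.values())
f_no = sum(no for _, no in females.values())
marginal = odds_ratio(m_yes, m_no, f_yes, f_no)

for d, theta in conditional.items():
    print(f"Department {d}: {theta:.3f}")
print(f"Marginal: {marginal:.3f}")
```

Under these assumed counts, the marginal value exceeds every department-level value, which is the discrepancy discussed above.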
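2.15. At each age level, the death rate is higher in South Carolina than in Maine, but overall, the death rate is higher in Maine. Explain how this could be possible. (For data, see H. Wainer, Chance, 12:44, 1999).
This is an instance of Simpson's paradox. Maine's population is older than South Carolina's: a larger share of its residents fall in the older age groups, where death rates are highest. The overall death rate is a weighted average of the age-specific rates, with weights given by the age distribution, so Maine's overall rate can be higher even though its rate is lower at every age level.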
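A small numerical illustration may make the reversal easier to see. All numbers below are hypothetical and are not the Maine or South Carolina data. They describe one state with an older population and one with a younger population, using only two age groups.

```python
# HYPOTHETICAL data: (population, deaths) by age group. Not real state data.
older_state = {"young": (20_000, 20), "old": (80_000, 1_600)}    # Maine-like
younger_state = {"young": (80_000, 160), "old": (20_000, 500)}   # South Carolina-like

def rates(state):
    """Return (age-specific death rates, overall death rate)."""
    by_age = {group: deaths / pop for group, (pop, deaths) in state.items()}
    total_pop = sum(pop for pop, _ in state.values())
    total_deaths = sum(deaths for _, deaths in state.values())
    return by_age, total_deaths / total_pop

older_by_age, older_overall = rates(older_state)
younger_by_age, younger_overall = rates(younger_state)

print(older_by_age, older_overall)
print(younger_by_age, younger_overall)
```

In this made-up example the younger state has the higher death rate in both age groups, yet the older state has the higher overall rate, because most of its population sits in the high-mortality age group.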
2.18. Table 2.11 refers to a retrospective study of lung cancer and tobacco smoking among patients in several English hospitals. The table compares male lung cancer patients with control patients having other diseases, according to the average number of cigarettes smoked daily over a 10-year period preceding the onset of the disease. The lung cancer group has n = 1357, and the control group has n = 1357.
a) Find the sample odds of lung cancer at each smoking level and the five odds ratios that pair each level of smoking with no smoking. As smoking increases, is there a trend? Interpret.
Daily Avg. No. of Cigarettes / Odds of Lung Cancer / Ln(Odds)
None / 0.114754 / -2.164964
< 5 / 0.426357 / -0.852479
5 – 14 / 0.857895 / -0.153274
15 – 24 / 1.102088 / 0.097207
25 – 49 / 1.902597 / 0.643220
50 + / 3.166667 / 1.152680
The odds ratios are:
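For < 5 v. None, the odds ratio is
0.426357/0.114754 = 3.715.
For 5 – 14 v. None, the odds ratio is
0.857895/0.114754 = 7.476.
For 15 – 24 v. None, the odds ratio is
1.102088/0.114754 = 9.604.
For 25 – 49 v. None, the odds ratio is
1.902597/0.114754 = 16.580.
For 50 + v. None, the odds ratio is
3.166667/0.114754 = 27.595.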
These odds ratios must be interpreted carefully, due to the retrospective nature of the study. Consider the random experiment of randomly selecting a patient from a subset of the study group consisting of two levels of smoking. As the level of smoking increases, the odds that the patient is in the lung cancer group, rather than the control group, increase steadily.
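The odds, log-odds, and odds ratios of part (a) can be reproduced as sketched below. Table 2.11 is not reproduced in this solution, so the cell counts are an assumption: they were back-calculated from the tabulated odds, the cumulative proportions in part (d), and the group sizes n = 1357, and should be verified against Table 2.11.

```python
import math

levels = ["None", "< 5", "5-14", "15-24", "25-49", "50+"]
# ASSUMED counts, back-calculated from the tables in this solution;
# verify against Table 2.11.
cases = [7, 55, 489, 475, 293, 38]       # lung cancer patients
controls = [61, 129, 570, 431, 154, 12]  # control patients

odds = [c / k for c, k in zip(cases, controls)]
log_odds = [math.log(o) for o in odds]
# Odds ratio of each smoking level against the "None" baseline.
odds_ratios = [o / odds[0] for o in odds[1:]]

for name, o, lo in zip(levels, odds, log_odds):
    print(f"{name:>6}  odds={o:.6f}  log-odds={lo:.6f}")
for name, r in zip(levels[1:], odds_ratios):
    print(f"{name:>6} v. None  odds ratio={r:.3f}")
```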
b) If the log odds of lung cancer is linearly related to smoking level, the log-odds in row i satisfies $\log(\text{odds}_i) = \alpha + \beta i$. Show that this implies that the local odds ratios are identical.
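If the log-odds is a linear function of smoking level, i, then the log-odds ratio between successive smoking levels is $\log(\text{odds}_{i+1}) - \log(\text{odds}_i) = [\alpha + \beta(i+1)] - [\alpha + \beta i] = \beta$, regardless of the value of i.
Hence, the odds ratio between successive levels of smoking is
$\text{odds}_{i+1}/\text{odds}_i = e^{\beta}$. Thus the local odds ratios are identical.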
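It can be seen from the graph of the sample log-odds against smoking level that the above linear relationship holds only approximately. Hence, the sample local odds ratios are roughly, but not exactly, equal. Note that equal local odds ratios do not imply independence, which would require every local odds ratio to equal 1.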
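The result of part (b), and how closely the sample follows it, can be checked numerically. In the sketch below, `alpha` and `beta` are hypothetical values chosen only to illustrate the identity. The sample odds are those tabulated in part (a).

```python
import math

# Part 1: under an exactly linear log-odds model, every local odds
# ratio equals exp(beta). alpha and beta are HYPOTHETICAL values.
alpha, beta = -2.0, 0.6
model_odds = [math.exp(alpha + beta * i) for i in range(1, 7)]
model_local = [model_odds[i + 1] / model_odds[i] for i in range(5)]

# Part 2: local odds ratios from the sample odds tabulated in part (a).
sample_odds = [0.114754, 0.426357, 0.857895, 1.102088, 1.902597, 3.166667]
sample_local = [sample_odds[i + 1] / sample_odds[i] for i in range(5)]

print([round(r, 4) for r in model_local])
print([round(r, 3) for r in sample_local])
```

Comparing the two printed lists shows how far the sample local odds ratios depart from a common value.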
c) Using these data, can you estimate the probability of lung cancer at each level of smoking? Are the estimated odds ratios in part (a) meaningful? Explain.
There is a 1-1 correspondence between the odds and the probability:
$\pi = \text{odds}/(1 + \text{odds})$. Then, from the odds in the previous table, we calculate the following probabilities:
Daily Avg. No. of Cigarettes / Odds of Lung Cancer / Probability
None / 0.114754 / 0.102941
< 5 / 0.426357 / 0.298913
5 – 14 / 0.857895 / 0.461756
15 – 24 / 1.102088 / 0.524283
25 – 49 / 1.902597 / 0.655481
50 + / 3.166667 / 0.760000
These, however, are not the probabilities of occurrence of lung cancer for given smoking levels. The data are from a retrospective study, in which 1357 lung cancer patients and 1357 control patients were sampled and the smoking history of each patient was then recorded. The proportion of lung cancer patients at a given smoking level therefore reflects the sampling design, not the prevalence of lung cancer among people who smoke that amount. Each of the probabilities represents only the probability that, when randomly selecting a study patient from the smoking level group, the patient selected will be from the lung cancer group. It is not proper to interpret the probabilities as the probabilities of having lung cancer for given smoking levels. The odds ratios from part (a) are nevertheless meaningful: the odds ratio takes the same value whether smoking level or disease status is treated as the response, so it can be estimated from retrospective data even though the probabilities themselves cannot.
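A small numerical experiment may help show why the tabulated "probabilities" depend on the study design while the odds ratios do not. The cell counts are an assumption (back-calculated from the tables in this solution; verify against Table 2.11), and the alternative design with ten times as many controls at every smoking level is purely hypothetical.

```python
# ASSUMED counts, back-calculated from the tables in this solution.
cases = [7, 55, 489, 475, 293, 38]
controls = [61, 129, 570, 431, 154, 12]

def summarize(cases, controls):
    """Return (proportion of cases per level, odds ratios v. first level)."""
    odds = [c / k for c, k in zip(cases, controls)]
    props = [c / (c + k) for c, k in zip(cases, controls)]
    ratios = [o / odds[0] for o in odds[1:]]
    return props, ratios

props_1, ratios_1 = summarize(cases, controls)
# HYPOTHETICAL redesign: same cases, ten times as many controls per level.
props_10, ratios_10 = summarize(cases, [10 * k for k in controls])

print([round(p, 3) for p in props_1])
print([round(p, 3) for p in props_10])
print([round(r, 3) for r in ratios_1])
print([round(r, 3) for r in ratios_10])
```

The proportions change with the number of controls sampled, while the odds ratios against the "None" level stay the same.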
d) Show that the disease groups are stochastically ordered with respect to their distributions on smoking of cigarettes (see Problem 2.34 and Section 7.3.4). Interpret.
We calculate the empirical distribution function for the lung cancer patients (2nd column below) and the empirical distribution function for the control patients (3rd column below).
Daily Avg. No. of Cigarettes / Lung Cancer, Cumulative Probability / Control, Cumulative Probability
None / 0.005158 / 0.044952
< 5 / 0.045689 / 0.140015
5 – 14 / 0.406043 / 0.560059
15 – 24 / 0.756080 / 0.877671
25 – 49 / 0.971997 / 0.991157
50 + / 1.000000 / 1.000000
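The two distributions are stochastically ordered, since every number in column 2 is no greater than the corresponding number in column 3. That is, at every smoking level the cumulative proportion of lung cancer patients who smoke at or below that level does not exceed the corresponding proportion of control patients, so the lung cancer patients tend to smoke more than the control patients.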
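The two empirical distribution functions and the ordering check can be sketched as follows. As before, the cell counts are an assumption, back-calculated from the tables in this solution, and should be verified against Table 2.11.

```python
from itertools import accumulate

levels = ["None", "< 5", "5-14", "15-24", "25-49", "50+"]
# ASSUMED counts, back-calculated from the tables in this solution.
cases = [7, 55, 489, 475, 293, 38]
controls = [61, 129, 570, 431, 154, 12]

# Empirical distribution functions: cumulative counts over group totals.
cdf_cases = [s / sum(cases) for s in accumulate(cases)]
cdf_controls = [s / sum(controls) for s in accumulate(controls)]

# The lung cancer group is stochastically higher on smoking if its
# cumulative proportion never exceeds that of the control group.
ordered = all(fc <= fk for fc, fk in zip(cdf_cases, cdf_controls))

for name, fc, fk in zip(levels, cdf_cases, cdf_controls):
    print(f"{name:>6}  {fc:.6f}  {fk:.6f}")
print("stochastically ordered:", ordered)
```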