Chapter 3-15. Homework Problems
Logging Results
These homework problems need to be turned in if you are taking the course for credit. Please send the log file as an e-mail attachment to the course TA:
@hsc.utah.edu (if taking class at U of Utah)
That is, log the contents of the Results window while working on this, and e-mail the log file.
To begin logging:
Click on the scroll icon (4th from left on 2nd row) on the menu bar
File name: homework1.smcl < for example >
Save
This begins logging. When you exit Stata, everything you did during that session will be saved in this file. (Note: graphs will not appear in the log file, but the commands that created them will, and that is all that is needed to show that you did the work correctly.)
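The same log can also be started from the Command window. A minimal sketch, assuming the example file name homework1.smcl and that no file of that name already exists in the working directory:

```stata
* Start logging the Results window to a SMCL file (file name is only an example)
log using homework1.smcl

* ... run the homework commands here ...

* Close the log explicitly rather than relying on exiting Stata
log close
```

If the file already exists, Stata will refuse to write to it unless the replace or append option is added to log using.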
(Homework #1)
3-1. Introduction to epidemiologic thinking
Problem 1) Read article.
Read the article, Gary Taubes, Epidemiology faces its limits, Science 1995;269(Jul 14):164-169, which is on the course CD in the articles subdirectory.
This is a fun article to get you thinking about epidemiology. One thing you will notice is the importance that epidemiologists frequently place on the size of the effect, even though Rothman has repeatedly reminded epidemiologists that “strength of a cause cannot be equated to the biology of causation.”
Problem 2) Email the course TA simply stating you read the assigned article.
3-2. Sufficient/component cause theory of disease
No problems.
3-3. Hill’s causal criteria
No problems.
______
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.
3-4. Logic and errors
No problems.
(Homework #2)
3-5. Effect measures
Problem 1) Read in the data file
Open the file evans.dta in Stata. This is the same dataset used in Chapter 3-5 (we did this in Chapter 3-5, p. 33).
Problem 2) Compute a risk ratio
Compute the risk ratio with high blood pressure as the exposure variable and CHD as the disease variable. (Hint: we did something similar in Chapter 3-5, p. 34, and the variable descriptions are on p. 32.)
Problem 3) Compute an attributable fraction
Using the display command, compute the attributable fraction exposed, to gain some practice with the formula. Your answer should agree with the “Attr. frac. ex.” line of the output for Problem 2. (hint: the formula for AF is shown in Chapter 3-5, on page 28. An example of a display command, although not the one you need for this problem, is shown in Chapter 3-5, on page 34.)
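As a reminder of how display works as a calculator, here is a minimal sketch using the standard formula for the attributable fraction among the exposed, AFe = (RR − 1)/RR, which is assumed here to match the formula on page 28 of Chapter 3-5. The risk ratio of 2.5 is a made-up value, not the answer to Problem 2.

```stata
* Hypothetical risk ratio of 2.5, for illustration only (not the Problem 2 result)
display (2.5 - 1)/2.5
```

With this made-up value, Stata displays .6, meaning that 60% of the risk among the exposed would be attributed to the exposure.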
(Homework #3)
3-6. Study designs
Problem 1) Computing a Prevalence Odds Ratio
Look at Table 3 in Lee et al. (2006) [on course CD]. Compute a prevalence odds ratio (POR) for increased LV mass (LV hypertrophy), using the line showing 177 (17.0) and 123 (26.9), with “no parental heart failure” as the nonexposed, or referent, group.
[hint: this is done the same way that an odds ratio is calculated for a case-control study, Chapter 3-6, page 19]
Just to give you immediate feedback, your answer should be POR = 1.79.
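The arithmetic can be sketched with display. The counts below are hypothetical (they are not taken from Lee et al.), arranged as a 2×2 table with a = exposed with the condition, b = exposed without, c = nonexposed with, and d = nonexposed without, so that POR = (a/b)/(c/d) = ad/(bc).

```stata
* Hypothetical 2x2 counts, for illustration only:
*   exposed:    a = 30 with the condition, b = 70 without
*   nonexposed: c = 20 with the condition, d = 80 without
display (30*80)/(70*20)
```

This gives a POR of about 1.71 for the made-up counts. When a table reports only n (%) for those with the condition, the count without the condition in each group is that group's total minus the count with the condition.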
3-7. Randomization using Excel
No problems.
(Homework #4)
3-8. Bias and confounding
Problem 1) Assessing Confounding
Using the evans.dta dataset, determine if the smoking-CHD association is confounded by cholesterol (simply using cholesterol as a continuous variable). Do this by computing an unadjusted logistic regression and then an adjusted logistic regression. Finally, use a display command to see if the effect changed by more than 10%.
[hint: something similar was done in Chapter 3-8, pages 21 and 22]
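One common way to express the 10% change-in-estimate rule is the absolute percent difference between the crude and adjusted estimates, relative to the adjusted estimate. Chapter 3-8 may use a slightly different denominator, so treat this as a sketch. The odds ratios below are hypothetical.

```stata
* Hypothetical crude OR = 2.00 and adjusted OR = 1.70, for illustration only
display 100*abs(2.00 - 1.70)/1.70
```

For these made-up values the change is about 17.6%, which exceeds 10% and would be read as evidence of confounding.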
3-9. Random error and statistics
No problems.
3-10. Crude analysis
No problems.
(Homework #5)
3-11. Stratified analysis
Evans County Dataset (evans.dta)
Source
dataset to accompany Kleinbaum and Klein (K&K chapter 2)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data
Brief Description
Data are from a cohort study in which n=609 white males were followed for 7 years, with coronary heart disease as the outcome of interest.
Codebook
outcome
  chd   coronary heart disease (1=presence, 0=absence)
predictors
  cat   catecholamine level (1=high, 0=normal)
  age   age in years (continuous)
  chl   cholesterol (continuous)
  smk   smoker (1=ever smoked, 0=never smoked)
  ecg   electrocardiogram abnormality (1=presence, 0=absence)
  dbp   diastolic blood pressure (continuous)
  sbp   systolic blood pressure (continuous)
  hpt   high blood pressure (1=presence, 0=absence)
        defined as: SBP ≥ 160 or DBP ≥ 95
data management
  id    subject identifier
Problem 1) Compute the Mantel-Haenszel summary risk ratio with multiple stratification variables
Although not done in Chapter 3-11, you can specify a set of stratification variables, not just one, when computing the summary risk ratio. This gives the summary Mantel-Haenszel risk ratio, along with all of the strata created by the possible combinations of categories of the stratification variables. Compute the summary risk ratio for the smk-CHD association, controlling for cat, hpt, and ecg. [Hint: an example is given in Chapter 3-11, page 5, where one stratification variable was specified; you will simply need to provide three stratification variables.]
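As a sketch of the syntax only, assuming the Chapter 3-11 example used the cs command with its by() option: by() accepts a list of variables, and each combination of their categories forms one stratum. The example uses a different exposure and only two stratifiers, so it is not the homework answer.

```stata
* Load the Evans County data (assumes evans.dta is in the working directory)
use evans.dta, clear

* Risk ratio for the cat-CHD association, stratified on smk and ecg together;
* the output lists each stratum and the Mantel-Haenszel combined risk ratio
cs chd cat, by(smk ecg)
```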
Problem 2) Compute a Modified Poisson Regression
Fit a modified Poisson regression to get the adjusted risk ratio for the smk-CHD association, controlling (adjusting) for cat, hpt, and ecg. [Hint: an example is given in Chapter 3-11, page 12, where one covariate, or potential confounder, variable was specified.]
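Modified Poisson regression is ordinary Poisson regression on a binary outcome with a robust (sandwich) variance estimator, reported on the exponentiated scale so that the coefficients read as risk ratios. A minimal sketch, again with a different exposure and a single covariate; the command form used in Chapter 3-11 may differ (for example, glm with a Poisson family and log link).

```stata
* Load the Evans County data (assumes evans.dta is in the working directory)
use evans.dta, clear

* Risk ratio for the cat-CHD association, adjusted for ecg;
* irr exponentiates the coefficients, vce(robust) requests the robust variance
poisson chd cat ecg, irr vce(robust)
```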
Problem 3) Assessing confounding
Fit a modified Poisson regression to get the risk ratio for the smk-CHD association, without the three covariates.
Look at the crude and adjusted RRs, using either the two Poisson models or the table from Problem 1. Using the 10% change in effect rule, decide if the smk-CHD association was confounded by cat, hpt, and ecg. [Hint: something like this was done in Chapter 3-8, page 22.]
Enter your decision into the Stata log file by adding a comment, done by running either of the following comment lines in the Command window.
* Yes, confounded
or
* No, not confounded
3-12. Standardization
No problems.
(Homework #6)
3-13. Sensitivity (bias) analysis
We will use the Evans County Dataset (evans.dta) to conduct a sensitivity analysis. The codebook is shown above for the Chapter 3-11 homework problem.
Problem 1) logistic regression model
Suppose, or pretend, we are interested in how important age and high blood pressure are as predictors of an electrocardiogram abnormality. Open the data file evans.dta in Stata. Fit the following logistic regression model.
logistic ecg age hpt
Problem 2) logistic regression model with sensitivity analysis
Suppose we are not confident that the ecg was read correctly. As one scenario of a sensitivity analysis, we are interested in the effect of misclassification error for the ecg, where we consider that 10% of the time the ecg reader did not detect an abnormality when it was actually present, and 5% of the time the ecg reader detected an abnormality when it was actually absent.
Translate these numbers into a sensitivity and specificity and re-fit the logistic regression model that adjusts for these misclassification errors. That is, fit the following model with the ?’s replaced with proportions representing sensitivity and specificity of the ecg for this scenario.
logitem ecg age hpt ,sens(?) spec(?)
Hints: 1) use the definitions for sensitivity and specificity of disease misclassification (Chapter 3-13, page 1) to help you think about it
2) review the Magder and Hughes example at the top of page 10 in Chapter 3-13.
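As a reminder of the standard definitions (assumed to match those in Chapter 3-13): sensitivity is the proportion of true abnormalities that are detected, so it equals 1 minus the proportion missed; specificity is the proportion of true normals read as normal, so it equals 1 minus the proportion falsely flagged. A sketch with made-up error rates of 20% missed and 8% falsely detected, which are not the rates in this problem:

```stata
* Hypothetical error rates, for illustration only

* Sensitivity when 20% of true abnormalities are missed
display 1 - 0.20

* Specificity when 8% of true normals are read as abnormal
display 1 - 0.08
```

The two proportions obtained this way are what go into the sens() and spec() options.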
(Homework #7)
3-14. Case-cohort study design
Case-Cohort Study with Time to Event Data (density case-control design)
The assignment is to conduct a case-cohort study of the Framingham Heart Study dataset, 2.20.Framingham.dta.
The dataset comes from a long-term follow-up study of cardiovascular risk factors on 4699 patients living in the town of Framingham, Massachusetts. The patients were free of coronary heart disease at their baseline exam (recruitment of patients started in 1948).
Data Codebook
Baseline exam:
  sbp       systolic blood pressure (SBP) in mm Hg
  dbp       diastolic blood pressure (DBP) in mm Hg
  age       age in years
  scl       serum cholesterol (SCL) in mg/100ml
  bmi       body mass index (BMI) = weight/height² in kg/m²
  sex       gender (1=male, 2=female)
  month     month of year in which baseline exam occurred
  id        patient identification variable (numbered 1 to 4699)
Follow-up information on coronary heart disease:
  followup  follow-up in days
  chdfate   CHD outcome (1=patient develops CHD at the end of follow-up, 0=otherwise)
Problem 1) read in data
Read in the dataset 2.20.Framingham.dta, which is in the datasets & do-files subdirectory of the course CD.
Problem 2) Cox regression model using total dataset (no case-cohort sampling)
Fit a Cox regression to the total sample using,
followup as the time variable
chdfate as the event variable
age and sbp as the list of predictor variables
[hint: see page 13 of Chapter 3-14]
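The general pattern is to declare the survival-time structure with stset and then fit the model with stcox. A sketch of the syntax, using scl and bmi as predictors rather than the ones requested, and assuming the dataset is in the working directory:

```stata
* Load the Framingham data (location is an assumption; adjust the path as needed)
use "2.20.Framingham.dta", clear

* Declare follow-up time in days and CHD (chdfate==1) as the failure event
stset followup, failure(chdfate==1)

* Cox proportional hazards model; hazard ratios are reported by default
stcox scl bmi
```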
Problem 3) update Stata to get needed commands for this study design
While connected to the Internet, update your Stata to include the commands needed for case-cohort analysis [see page 31 of Chapter 3-14]
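One way to check whether the user-written commands used in Problem 4 are already installed is Stata's which command, which reports the location of a command's ado-file or gives an error if it cannot be found. This is only a check; it does not replace the installation steps on page 31 of Chapter 3-14.

```stata
* Report where each user-written command is installed (error if not found)
which stcascoh
which stselpre
```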
Problem 4) case-cohort sampling and analysis
We can see from the following frequency table that there are 1,473 cases and 3,226 controls in the full cohort.
. tab chdfate

   Coronary |
      Heart |
    Disease |      Freq.     Percent        Cum.
------------+-----------------------------------
   Censored |      3,226       68.65       68.65
        CHD |      1,473       31.35      100.00
------------+-----------------------------------
      Total |      4,699      100.00
Sample 31.35% of the total sample for controls (a one-to-one sampling ratio of cases to controls), and fit a Cox regression model similar to that computed above, but with appropriate variance estimators. The required commands are:
stset followup , failure(chdfate==1) id(id)
stcascoh, alpha(.3135) seed(999)
stselpre age sbp
Notice how close the estimate is to what it should be (compare it with the full-cohort result from Problem 2).
Chapter 3-15 (revised 16 May 2010) p. 1