Biostatistics and Epidemiology Using Stata: A Course Manual
Table of Contents
Section 1. Stata: Data Management, Graphics, and Programming
1-1 Installing Stata and recovering Stata windows
1. installing Stata
1. adding an icon to the desktop (PC Windows)
2. run Stata to finish setup
3. updating Stata after setup
3. recovering Stata windows: load factory settings
1-2 Getting data into Stata and some other basics
1. opening a Stata formatted data file: 1) clicking on file icon
1. showing full directory path in Windows Explorer
1. showing file extensions in Windows Explorer
2. opening a Stata formatted data file: 2) File icon on menu bar
3. opening a Stata formatted data file: 3)
change directory (cd) command
directory list (dir) command
read in Stata data file (use) command
4. scrolling in Stata’s Results window
5. general syntax, or structure, of Stata commands
6. Stata help facility, help command
7. Stata Manuals
7. books on Stata
7. setting file attributes in Windows (turn off Read Only)
8. using do-files
9. suggested do-file structure
10. increasing memory size of Stata’s workspace: set memory (set mem) command
10. importing Excel file into Stata
11. reading in a *.csv or *.txt formatted file: insheet command
11. saving a Stata formatted data file: save command
12. saving a Stata formatted data file compatible with Stata version 8 or 9:
saveold command
1-3 Cleaning data
1. listing data: list command
2. block comment: /* … */
2. deleting variables: drop command
2. inline comment “//”
3. tabulation of values of variables using frequency table: tabulate (tab) command
3. examining 4 smallest and 4 largest values: summarize, detail (sum) command
3. replacing value of variable: replace command
assignment “=” and logical equals “==”
4. recoding values of a variable with generate (gen) and replace commands
5. keeping a command from crashing the do-file: capture command, e.g.,
capture drop
5. the “0 observations” result: attempting to do arithmetic on a string variable
6. describing variables: describe command
6. variable storage types
7. missing value for string variable: the null string ""
7. converting a string variable to numeric, destring command
8. recoding values of a variable: recode command
9. converting to all upper or lower case: upper and lower string functions
9. renaming the variable name: rename command
1-4 Merging files
1. adding a file to bottom of file in memory: append command
2. adding a file in rightmost columns of file in memory, one-to-one merge
without matching on some variable: merge command
3. merging files while matching on some variable such as a subject ID,
match merge: merge command
4. checking how well the matching worked: Stata’s _merge variable (values 1 to 3)
5. non-overwrite feature of merge command (the default)
6. non-overwrite of missing values feature of merge command (the default)
7. updating file in memory with another file by replacing missing values only:
update option
8. updating file in memory with another file by replacing both missing and
nonmissing values, update with replace: replace option
8. checking how well the update with replace worked: Stata’s _merge variable
(values 1 to 5)
1-5 Labeling variables and values
2. adding label to a variable: label variable command
3. adding labels to the values of a variable: label define and label values commands
3. listing value labels: label list command
4. suspending value labels in data browser and outputs, nolabel option
6. removing variable labels
7. removing value labels: label drop command
8. removing value labels: capture label drop command
9. displaying values and value labels
1-6 Basic graphics
1. using graphs from Stata version 7: graph7 and version 7 commands
1. redisplaying a graph: graph display command
2. scatterplot: graph twoway scatter command
3. abbreviated scatterplot commands: twoway scatter and scatter commands
3. side by side graph: by option
3. linear regression line graph: lfit command
3. overlaying graphs: “||” operator
3. overlaying linear regression line on scatterplot using || operator
4. overlaying graphs using binding notation: ( ) ( )
4. overlaying linear regression line on scatterplot using ( ) approach
5. generating a variable with rounding: round function
5. generating a mean across data rows for subgroups:
by specification with egen command with mean function
5. listing variables: list command
5. extending a command across several lines in do-file editor: #delimit command
5. line graph: line command
6. requirement to sort on x variable before plotting a line graph: sort command
7. table of descriptive statistics for a two variable crossclassification: table command
7. smooth line graph using fractional polynomial fit: fpfit command
7. fractional polynomial fit with covariates: fracpoly command
8. adding title to graph: title command
8. adding subtitle to graph: subtitle command
8. adding axis titles to graph: ytitle and xtitle commands
8. adding footnote to graph: note command
9. adding more tick marks and labels to axes: ylabel and xlabel commands
9. better labels for legend: legend command
10. list of choices for line graph line widths: graph query linewidthstyle command
10. changing connect line width of line graph: clwidth option
11. list of choices for graph scheme: graph query, schemes command
11. changing default graph scheme for current session or permanently:
set scheme command
11. changing graph scheme just for current graph: scheme option
12. basic black-and-white scheme for manuscripts: scheme(s1mono) option
13. eliminating border around graph: plotregion(style(none)) option
14. adding text to graph: text option
15. placement options for positioning text: placement option
16. adding space between x-axis title and x-axis tick labels: height(5) option in xtitle
17. changing color of connect line of line graph: clcolor option
17. turning off legend: legend(off) option
19. reading in graph data by putting data in do-file: input and end commands
19. adding error bars to graph: rcap command
20. overlaying error bars on scatterplot to get symbol with error bars:
twoway (rcap…) (scatter…) commands
21. adding white space to left and right side of graph: xlabel command
21. change tick mark labels to more descriptive labels: xlabel command
22. drop tick marks from graph while retaining labels: noticks option
23. adding horizontal or vertical reference lines: yline and xline options
24. list of choices for colors: help colorstyle command
24. list of choices for symbols: help symbolstyle command
24. changing marker symbol for scatterplot: msymbol option
24. changing color to marker symbol border line and inside fill:
mlcolor and mfcolor options
1-7 Looping, collapsing, and reshaping
1-8 Operators, ifs, dates, and times
1-9 More graphics: popular scientific graphs
1-10 Programming Stata
1-11 Compilation of frequently used variable generation and modifying
commands (a chapter for quick look up)
1-12 Homework problems
Section 2. Biostatistics
2-1 Describing variables, levels of measurement, and choice of descriptive
statistics
Describing a variable (distribution):
with tables: frequency tables
with graphs: histogram, boxplots
with descriptive statistics: mean, standard deviation, etc.
Levels of measurement (nominal, ordinal, ... categorical, continuous ...)
How to decide what descriptive statistic to use to describe a variable in the
“Table 1. Patient Characteristics” table of an article.
2-2 Logic of significance tests
What a probability distribution is
Logic of a significance test (same logic as a laboratory reference range)
Chance, randomness, sampling variability
Statistical regularity (the basis of statistical theory)
Strong Law of Large Numbers (formal statement of statistical regularity).
Deriving the form of statistical test (significance test) intuitively
Sampling distribution
p value
2-3 Choice of significance test
2-4 Comparison of two independent groups
Role of p values in a “Table 1. Patient Characteristics” table
Confounding variables
chi-square test
Fisher’s exact test
Asymptotic vs exact tests (parametric vs nonparametric tests)
Minimum expected frequency rule for choosing between chi-square test and
Fisher’s exact test
Barnard’s unconditional exact test
Fisher-Freeman-Halton test
Wilcoxon-Mann-Whitney test
Fisher-Pitman Permutation Test for Independent Samples
Central Limit Theorem
Levene’s test for equality of variances
t test (both equal and unequal variances)
Shapiro-Wilk test for normality
Reporting styles
Outliers
Prespecification of analysis
2-5 Basics of power analysis
definition of power
power increases as sample size increases
decision errors of significance tests [Type I error (alpha), Type II error (beta)]
Type II error and sample size paragraph in journal article
conclusions of equivalence
power of a significance test
effect of one- or two-sided comparison on power
effect of choice of alpha on power
effect of choice of minimum detectable effect size on power
effect of size of assumed standard deviation (SD) on power – coming up with a
SD estimate
effect of sample size on power
sample size and power calculations for an interval scaled outcome variable
what to do if you don’t know anything (no effect size or standard deviation
estimates):
the standard deviation units approach, Cohen’s d.
sample size calculation when a multiple comparison adjustment is planned
overfitting
switching the dependent and independent variables
sample size based on precision (desired width of confidence interval)
excessive power (sample size very large)
two group comparison of interval scale outcome sample size paragraph in study
protocol
2-6 More on levels of measurement
sums of ordinal scales produce interval scales
dichotomous scales are actually interval scales
can statistical tests that require interval scales be used with ordinal scales (the
ordinal-interval controversy in statistics)
2-7 Comparison of two paired groups
2-8 Multiplicity and the Comparison of 3+ Groups
multiplicity
multiple comparison problem
p value based multiple comparison procedures: family-wise error rate
(Bonferroni, Holm, Sidak, Holm-Sidak, Hochberg, Finner, Hommel,
Tukey-Ciminera-Heyse)
p value based multiple comparison procedures: false discovery rate
(Benjamini-Hochberg procedure)
how to get away without using multiple-comparison procedures
simultaneous comparison of 3+ groups (includes one-way analysis of variance)
sample size when multiple comparisons are planned
2-9 Correlation
2-10 Linear regression
how linear regression controls for covariates
2-11 Logistic regression and dummy variables
linear regression estimates risk difference (difference between proportions), but is
criticized because it can estimate predicted probabilities outside of the 0-1 range
logistic regression is designed to constrain the predicted probability between 0
and 1
definition of an odds ratio
assessing linearity of effect
dummy variables (indicator variables)
2-12 Survival analysis: Kaplan-Meier graphs, Log-rank Test, and Cox regression
life tables
Kaplan-Meier survival probabilities & Kaplan-Meier curves
log-rank test
Cox regression
assessing goodness of fit with c-statistic (ROC area)
interpreting the c-statistic
testing proportional hazards assumption of Cox regression
2-13 Confidence intervals versus p values and trends toward significance
2-14 Pearson correlation coefficient with clustered data
2-15 Equivalence and noninferiority tests
2-16 Validity and reliability
2-17 Methods comparison studies
2-18 One sample tests
2-19 Homework problems
Section 3. Epidemiology
3-1 Introduction to epidemiologic thinking
3-2 Sufficient/component cause theory of disease
3-3 Hill’s causal criteria
3-4 Logic and errors
3-5 Effect measures
3-6 Study designs
3-7 Randomization using Excel
3-8 Bias and confounding
3-9 Random error and statistics
3-10 Crude analysis
3-11 Stratified analysis
3-12 Standardization
3-13 Sensitivity (bias) analysis
3-14 Case-cohort study design
3-15 Homework problems
Section 4. Power Analysis
Chapter 4-1. Sample Size Determination and Power Analysis for Specific Applications
two independent group comparison of means (independent groups t test)
linear regression: comparing two groups adjusted for covariates
two independent groups comparison of dichotomous outcome variable (chi-square test,
Fisher’s exact test)
two independent groups comparison of a nominal outcome variable (chi-square test and
Fisher-Freeman-Halton test)
two independent groups comparison of ordinal outcome variable (Wilcoxon-Mann-
Whitney test)
paired ordinal outcome variable (Wilcoxon signed-rank test)
repeated measurements or clustered studies (GEE, mixed, multilevel, hierarchical models)
power analysis using Monte Carlo simulation (independent samples t test)
power analysis using Monte Carlo simulation (2 × 2 table chi-square test)
power analysis using Monte Carlo simulation (Poisson regression with person-time)
power analysis using Monte Carlo simulation (2-way ANOVA, both factors with 2 levels,
neither of which is a repeated measurement)
logrank test
Section 5. Regression Models
5-1 What regression is and curvilinear correlation
5-2 Holding constant
5-3 Dichotomous predictor variables
5-4 Adjusted means, Analysis of Variance (ANOVA), and interaction
5-5 Deriving logistic regression
5-6 Exact logistic regression
5-7 Introducing Cox regression and Kaplan-Meier plots
5-8 Interaction
5-9 Missing data imputation
5-10 Linear regression robust to assumptions
5-11 Linear regression diagnostics and transformations
5-12 Variable selection and collinearity
5-13 Monte Carlo Simulation and Bootstrapping
5-14 Model Validation
5-15 Response feature (summary measure) analysis
5-16 Analysis of covariance (ANCOVA) versus change analysis
5-17 Conditional logistic regression
5-18 Repeated measures analysis of variance
5-19 Generalized estimating equations (GEE)
5-20 Multilevel (mixed effects) models
5-21 Regression post tests
5-22 Modeling cost
5-23 Cox regression proportional hazards assumption
5-24 Cluster analysis
5-25 Multilevel (mixed effects) logistic regression
5-26 Trend tests
5-27 Homework problems
Appendix 1. Dataset Descriptions
births.dta Concerns 500 mothers who had singleton births in a large London
hospital.
evans.dta From a cohort study in which n=609 white males were followed for 7
years, with coronary heart disease as the outcome of interest.
2.20.Framingham.dta The dataset comes from a long-term follow-up study of cardiovascular risk
factors on 4699 patients living in the town of Framingham, Massachusetts.
LeeLife.dta Concerns male patients with localized cancer of the rectum diagnosed in
Connecticut from 1935 to 1954. The research question is whether survival
improved for the 1945-1954 cohort of patients (cohort = 1) relative to the
earlier 1935-1944 cohort (cohort = 0).
mi.dta From a 1:2 matched case-control study in which n=117 subjects are
formed into 39 matched strata.
rmr.dta Data published by Nawata et al (2004) (on course CD). The data were