Biostatistics and Epidemiology Using Stata: A Course Manual

Table of Contents

Section 1. Stata: Data Management, Graphics, and Programming

1-1 Installing Stata and recovering Stata windows

1. installing Stata

1. adding an icon to the desktop (PC Windows)

2. run Stata to finish setup

3. updating Stata after setup

3. recovering windows: load factory settings

1-2 Getting data into Stata and some other basics

1. opening a Stata formatted data file: 1) clicking on file icon

1. showing full directory path in Windows Explorer

1. showing file extensions in Windows Explorer

2. opening a Stata formatted data file: 2) File icon on menu bar

3. opening a Stata formatted data file: 3)

change directory (cd) command

directory list (dir) command

read in Stata data file (use) command

4. scrolling in Stata’s Results window

5. general syntax, or structure, of Stata commands

6. Stata help facility, help command

7. Stata Manuals

7. books on Stata

7. setting file attributes in Windows (turn off Read Only)

8. using do-files

9. suggested do-file structure

10. increasing memory size of Stata’s workspace: set memory (set mem) command

10. importing Excel file into Stata

11. reading in a *.csv or *.txt formatted file: insheet command

11. saving a Stata formatted data file: save command

12. saving a Stata formatted data file compatible with Stata version 8 or 9:

saveold command

1-3 Cleaning data

1. listing data: list command

2. block comment: /* … */

2. deleting variables: drop command

2. inline comment “//”

3. tabulation of values of variables using frequency table: tabulate (tab) command

3. examining 4 smallest and 4 largest values: summarize, detail (sum) command

3. replacing value of variable: replace command

assignment “=” and logical equals “==”

4. recoding values of a variable with generate (gen) and replace commands

5. keeping a command from crashing the do-file: capture command, e.g.,

capture drop

5. the “0 observations” result: attempting to do arithmetic on a string variable

6. describing variables: describe command

6. variable storage types

7. missing value for string variable: the null string ""

7. converting a string variable to numeric, destring command

8. recoding values of a variable: recode command

9. converting to all upper or lower case: upper and lower string functions

9. renaming the variable name: rename command

1-4 Merging files

1. adding a file to bottom of file in memory: append command

2. adding a file in rightmost columns of file in memory, one-to-one merge

without matching on some variable: merge command

3. merging files while matching on some variable such as a subject ID,

match merge: merge command

4. checking how well the matching worked: Stata’s _merge variable (values 1 to 3)

5. non-overwrite feature of merge command (the default)

6. non-overwrite of missing values feature of merge command (the default)

7. updating file in memory with another file by replacing missing values only:

update option

8. updating file in memory with another file by replacing both missing and

nonmissing values, update with replace: replace option

8. checking how well the update with replace worked: Stata’s _merge variable

(values 1 to 5)

1-5 Labeling variables and values

2. adding label to a variable: label variable command

3. adding labels to the values of a variable: label define and label values commands

3. listing value labels: label list command

4. suspending value labels in data browser and outputs, nolabel option

6. removing variable labels

7. removing value labels: label drop command

8. removing value labels: capture label drop command

9. displaying values and value labels

1-6 Basic graphics

1. using graphs from Stata version 7: graph7 and version 7 commands

1. redisplaying a graph: graph display command

2. scatterplot: graph twoway scatter command

3. abbreviated scatterplot commands: twoway scatter and scatter commands

3. side by side graph: by option

3. linear regression line graph: lfit command

3. overlaying graphs: “||” operator

3. overlaying linear regression line on scatterplot using || operator

4. overlaying graphs using binding notation: ( ) ( )

4. overlaying linear regression line on scatterplot using ( ) approach

5. generating a variable with rounding: round function

5. generating a mean across data rows for subgroups:

by specification with egen command with mean function

5. listing variables: list command

5. extending a command across several lines in do-file editor: #delimit command

5. line graph: line command

6. requirement to sort on x variable before plotting a line graph: sort command

7. table of descriptive statistics for a two-variable cross-classification: table command

7. smooth line graph using fractional polynomial fit: fpfit command

7. fractional polynomial fit with covariates: fracpoly command

8. adding title to graph: title command

8. adding subtitle to graph: subtitle command

8. adding axis titles to graph: ytitle and xtitle commands

8. adding footnote to graph: note command

9. adding more tick marks and labels to axes: ylabel and xlabel commands

9. better labels for legend: legend command

10. list of choices for line graph line widths: graph query linewidthstyle command

10. changing connect line width of line graph: clwidth option

11. list of choices for graph scheme: graph query, schemes command

11. changing default graph scheme for current session or permanently:

set scheme command

11. changing graph scheme just for current graph: scheme option

12. basic black-and-white scheme for manuscripts: scheme(s1mono) option

13. eliminating border around graph: plotregion(style(none)) option

14. adding text to graph: text option

15. placement options for positioning text: placement option

16. adding space between x-axis title and x-axis tick labels: height(5) option in xtitle

17. changing color of connect line of line graph: clcolor option

17. turning off legend: legend(off) option

19. reading in graph data by putting data in do-file: input and end commands

19. adding error bars to graph: rcap command

20. overlaying error bars on scatterplot to get symbol with error bars:

twoway (rcap…) (scatter…) commands

21. adding white space to left and right side of graph: xlabel command

21. change tick mark labels to more descriptive labels: xlabel command

22. drop tick marks from graph while retaining labels: noticks option

23. adding horizontal or vertical reference lines: yline and xline options

24. list of choices for colors: help colorstyle command

24. list of choices for symbols: help symbolstyle command

24. changing marker symbol for scatterplot: msymbol option

24. changing color of marker symbol border line and inside fill:

mlcolor and mfcolor options

1-7 Looping, collapsing, and reshaping

1-8 Operators, ifs, dates, and times

1-9 More graphics: popular scientific graphs

1-10 Programming Stata

1-11 Compilation of frequently used variable generation and modifying

commands (a chapter for quick look up)

1-12 Homework problems

Section 2. Biostatistics

2-1 Describing variables, levels of measurement, and choice of descriptive

statistics

Describing a variable (distribution):

with tables: frequency tables

with graphs: histogram, boxplots

with descriptive statistics: mean, standard deviation, etc.

Levels of measurement (nominal, ordinal, ... categorical, continuous ...)

How to decide what descriptive statistic to use to describe a variable in the

“Table 1. Patient Characteristics” table of an article.

2-2 Logic of significance tests

What a probability distribution is

Logic of a significance test (same logic as a laboratory reference range)

Chance, randomness, sampling variability

Statistical regularity (the basis of statistical theory)

Strong Law of Large Numbers (formal statement of statistical regularity).

Deriving the form of a statistical test (significance test) intuitively

Sampling distribution

p value

2-3 Choice of significance test

2-4 Comparison of two independent groups

Role of p values in a Table 1 Patient Characteristics table

Confounding variables

chi-square test

Fisher’s exact test

Asymptotic vs exact tests (parametric vs nonparametric tests)

Minimum expected frequency rule for choosing between chi-square test and

Fisher’s exact test

Barnard’s unconditional exact test

Fisher-Freeman-Halton test

Wilcoxon-Mann-Whitney test

Fisher-Pitman Permutation Test for Independent Samples

Central Limit Theorem

Levene’s test for equality of variances

t test (both equal and unequal variances)

Shapiro-Wilk test for normality

Reporting styles

Outliers

Prespecification of analysis

2-5 Basics of power analysis

definition of power

power increases as sample size increases

decision errors of significance tests [ Type I error (alpha), Type II error (beta) ]

Type II error and sample size paragraph in journal article

conclusions of equivalence

power of a significance test

effect of one- or two-sided comparison on power

effect of choice of alpha on power

effect of choice of minimum detectable effect size on power

effect of size of assumed standard deviation (SD) on power – coming up with a

SD estimate

effect of sample size on power

sample size and power calculations for an interval scaled outcome variable

what to do if you don’t know anything (no effect size or standard deviation

estimates):

the standard deviation units approach, Cohen’s d.

sample size calculation when a multiple comparison adjustment is planned

overfitting

switching the dependent and independent variables

sample size based on precision (desired width of confidence interval)

excessive power (sample size very large)

two group comparison of interval scale outcome sample size paragraph in study

protocol

2-6 More on levels of measurement

sums of ordinal scales produce interval scales

dichotomous scales are actually interval scales

can statistical tests that require interval scales be used with ordinal scales (the

ordinal-interval controversy in statistics)

2-7 Comparison of two paired groups

2-8 Multiplicity and the Comparison of 3+ Groups

multiplicity

multiple comparison problem

p value based multiple comparison procedures: family-wise error rate

(Bonferroni, Holm, Sidak, Holm-Sidak, Hochberg, Finner, Hommel,

Tukey-Ciminera-Heyse)

p value based multiple comparison procedures: false discovery rate

(Benjamini-Hochberg procedure)

how to get away without using multiple-comparison procedures

simultaneous comparison of 3+ groups (includes one-way analysis of variance)

sample size when multiple comparisons are planned

2-9 Correlation

2-10 Linear regression

how linear regression controls for covariates

2-11 Logistic regression and dummy variables

linear regression estimates risk difference (difference between proportions), but is

criticized because it can estimate predicted probabilities outside of the 0-1 range

logistic regression is designed to constrain the predicted probability between 0

and 1

definition of an odds ratio

assessing linearity of effect

dummy variables (indicator variables)

2-12 Survival analysis: Kaplan-Meier graphs, Log-rank Test, and Cox regression

life tables

Kaplan-Meier survival probabilities & Kaplan-Meier curves

log-rank test

Cox regression

assessing goodness of fit with c-statistic (ROC area)

interpreting the c-statistic

testing proportional hazards assumption of Cox regression

2-13 Confidence intervals versus p values and trends toward significance

2-14 Pearson correlation coefficient with clustered data

2-15 Equivalence and noninferiority tests

2-16 Validity and reliability

2-17 Methods comparison studies

2-18 One sample tests

2-19 Homework problems

Section 3. Epidemiology

3-1 Introduction to epidemiologic thinking

3-2 Sufficient/component cause theory of disease

3-3 Hill’s causal criteria

3-4 Logic and errors

3-5 Effect measures

3-6 Study designs

3-7 Randomization using Excel

3-8 Bias and confounding

3-9 Random error and statistics

3-10 Crude analysis

3-11 Stratified analysis

3-12 Standardization

3-13 Sensitivity (bias) analysis

3-14 Case-cohort study design

3-15 Homework problems

Section 4. Power Analysis

Chapter 4-1. Sample Size Determination and Power Analysis for Specific Applications

two independent group comparison of means (independent groups t test)

linear regression: comparing two groups adjusted for covariates

two independent groups comparison of dichotomous outcome variable (chi-square test,

Fisher’s exact test)

two independent groups comparison of a nominal outcome variable (chi-square test and

Fisher-Freeman-Halton test)

two independent groups comparison of ordinal outcome variable (Wilcoxon-Mann-

Whitney test)

paired ordinal outcome variable (Wilcoxon signed-rank test)

repeated measurements or clustered studies (GEE, mixed, multilevel, hierarchical models)

power analysis using Monte Carlo simulation (independent samples t test)

power analysis using Monte Carlo simulation (2 × 2 table chi-square test)

power analysis using Monte Carlo simulation (Poisson regression with person-time)

power analysis using Monte Carlo simulation (2-way ANOVA, both factors with 2 levels,

neither of which is a repeated measurement)

logrank test

Section 5. Regression Models

5-1 What regression is and curvilinear correlation

5-2 Holding constant

5-3 Dichotomous predictor variables

5-4 Adjusted means, Analysis of Variance (ANOVA), and interaction

5-5 Deriving logistic regression

5-6 Exact logistic regression

5-7 Introducing Cox regression and Kaplan-Meier plots

5-8 Interaction

5-9 Missing data imputation

5-10 Linear regression robust to assumptions

5-11 Linear regression diagnostics and transformations

5-12 Variable selection and collinearity

5-13 Monte Carlo Simulation and Bootstrapping

5-14 Model Validation

5-15 Response feature (summary measure) analysis

5-16 Analysis of covariance (ANCOVA) versus change analysis

5-17 Conditional logistic regression

5-18 Repeated measures analysis of variance

5-19 Generalized estimating equations (GEE)

5-20 Multilevel (mixed effects) models

5-21 Regression post tests

5-22 Modeling cost

5-23 Cox regression proportional hazards assumption

5-24 Cluster analysis

5-25 Multilevel (mixed effects) logistic regression

5-26 Trend tests

5-27 Homework problems

Appendix 1. Dataset Descriptions

births.dta Concerns 500 mothers who had singleton births in a large London

hospital.

evans.dta From a cohort study in which n=609 white males were followed for 7

years, with coronary heart disease as the outcome of interest.

2.20.Framingham.dta The dataset comes from a long-term follow-up study of cardiovascular risk

factors on 4699 patients living in the town of Framingham, Massachusetts.

LeeLife.dta Concerns male patients with localized cancer of the rectum diagnosed in

Connecticut from 1935 to 1954. The research question is whether survival

improved for the 1945-1954 cohort of patients (cohort = 1) relative to the

earlier 1935-1944 cohort (cohort = 0).

mi.dta From a 1:2 matched case-control study in which n=117 subjects are

formed into 39 matched strata.

rmr.dta Data published by Nawata et al (2004) (on course CD). The data were