BIOSTATISTICS 201B FINAL STUDY GUIDE:

Here are some notes and study suggestions for the final exam:

(1) What To Bring: You should bring a writing instrument, a calculator (one capable of taking logs and exponentials), your cheat sheets, and of course a well-rested brain! I will provide anything else you need, including the pages on which to write the problem solutions, scratch paper, and any relevant printouts or statistical tables.

(2) Cheat Sheet Guidelines: The exam is closed book, closed notes. However, you may bring four 8.5 by 11 pieces of paper with anything you want on the front and back (8 total sides of paper). My intent here is for you to use your cheat sheets from the midterm plus two additional sheets for the final, but you are free to redo them if you want to. I don’t care what format the sheets are in, how small you write, etc. The purpose of making the exam closed book is not to deprive you of formulas you may need but rather to (a) force you to synthesize the material, (b) minimize the amount of rustling and page shuffling during the exam, and (c) allow me to make the exam easier.

(3) Time and Location: The exam is Wednesday, March 16th, from 3:00-6:00 p.m. in our usual classroom, CHS 43-105A. I will make the final about the same length as the midterm, but you will have an extra hour to do it, which should take off some of the time pressure.

(4) Exam Review: There will be a review session Monday, March 14th, from 1:00-3:00 p.m. in CHS 51-279. I will also be available much of the day on Tuesday, and of course Nadia and I will be happy to answer questions via e-mail.

(5) What is Covered: The final exam is what I call “technically cumulative,” by which I mean that the emphasis will be heavily on the material that was not covered on the midterm (i.e. lectures 16-26 and HWs 4-5). This includes case-control and conditional logistic regression, multinomial and ordinal logistic regression, probit, Poisson, and negative binomial regression, repeated measures/mixed models, and a brief introduction to survival analysis and weighting methods. However, I reserve the right to ask questions about the early material since much of the recent material builds on the earlier ideas. Those questions are likely to be about key concepts rather than picky details. In particular I might ask again about things people struggled with on the midterm (e.g. confounding; the distinction between permutation testing and the bootstrap; hypotheses for non-parametric tests), so you might want to review the midterm solutions! Below in my summary of the course topics I have tried to indicate what I consider the most important concepts from the first half of the course. The emphasis will be much more on interpretation of printouts and graphs and answering conceptual questions than on hand computations. However, I would expect you to be able to do things like calculate a predicted value (e.g. probit probability, mean or rate ratio, etc.). Finally, it will be very important to know when to use and how to evaluate each of the different types of models we have learned.

(6) Practice Problems: I have posted the 2013 and 2014 final exams with solutions for your reference. The middle problem of the 2012 midterm (on multinomial and ordinal logistic regression) is also good practice. The best source of additional practice is the warm-up problems from the homework. I have also posted some additional problems on survival analysis since we will not have had a turn-in assignment on that topic. Note that the last of these problems (and indeed the problems on the old exams) has more computational detail than I would expect of you on this topic, given that we will only get to spend an hour or so on it in lecture, but I have left them that way so you see the full conceptual picture. You should focus more on the main ideas behind censoring, its implications, and the different kinds of models one could use rather than actually doing the computations or performing the tests. If there is something you are feeling weak on, let me know and I’ll try to provide extra help. Another good technique is to get the output for any of the models/data sets from the homework and make sure you can give the appropriate real-world interpretation for EVERY number on the printout.

(7) Detailed Topic List: Here is a more detailed list of the major course topics with indications of key points to study. I do not guarantee I haven’t missed anything, so let me know if there’s something you think I should add.

  • Classical Non-Parametric Statistics: The major methods we learned were Spearman’s rank correlation, the Wilcoxon rank-sum test (for comparing two independent groups), the sign test and Wilcoxon signed-rank test (for comparing two matched groups), and the Kruskal-Wallis test (a rank test for comparing 3 or more groups). You should be aware of the basic rationale for developing non-parametric tests, the parametric tests to which the above tests correspond, and when you would want to use each of those tests. Note that people had trouble with the precise hypotheses for these tests on the midterm so that would be good to review.
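    For orientation, here is roughly what these tests look like in Stata (the variable names are hypothetical):

        spearman age score                 // Spearman's rank correlation
        ranksum score, by(group)           // Wilcoxon rank-sum, two independent groups
        signtest score_post = score_pre    // sign test, two matched measurements
        signrank score_post = score_pre    // Wilcoxon signed-rank, two matched measurements
        kwallis score, by(group)           // Kruskal-Wallis, 3 or more groups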
  • Permutation Tests and the Bootstrap: People sometimes get confused about the distinction between these two types of methods. The basic idea behind a permutation test is to generate an appropriate null distribution for your test statistic when you can’t assume you know it. The bootstrap, in contrast, tells you about the uncertainty in your actual estimate by attempting to simulate the “true” distribution (which would lie in the alternative distribution range if the null is not correct). You should be able to describe the conceptual algorithms you would use for each of these procedures: permuting group or ID labels for permutation tests, and resampling to get a bootstrap estimate, standard error or confidence interval for a statistic of interest. Calculations are generally impractical without a computer except in cases with very small sample sizes.
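    As a concrete sketch, Stata’s permute and bootstrap commands implement exactly these two algorithms (hypothetical variables again):

        * permutation test: shuffle the group labels to build the null distribution
        permute group _b[group], reps(1000): regress score group
        * bootstrap: resample subjects to estimate the uncertainty in the estimate itself
        bootstrap _b[group], reps(1000): regress score group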
  • Maximum Likelihood Principle, Transformation Principle and Generalized Linear Models: Understand conceptually what a likelihood function is and why maximizing it should lead to good parameter estimates. (I would not expect you to derive or maximize a likelihood function during the exam except maybe as a bonus problem.) Understand the three major components of the generalized linear model (distribution of Y, link function, systematic component involving the predictors) and how we vary these to get different models. Understand how transformations are used to get between different components of GLMs and how in general they can be used to convert estimates or confidence intervals on one scale to another scale.
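    To make the three components concrete, here is how a couple of familiar models can be specified with Stata’s generic glm command (hypothetical variables); note that only the family (distribution of Y) and link change:

        glm y x1 x2, family(binomial) link(logit)   // logistic regression
        glm y x1 x2, family(poisson) link(log)      // Poisson regression
        display exp(_b[x1])   // transformation principle: back-transform a log-scale coefficient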
  • Logistic Regression: This is the quintessential example of a generalized linear model. I have listed the key issues below. For the final it is particularly important that you know how the features of the logistic model are the same as or differ from those for other GLMs (e.g. as we have seen on the project you can use sets of logistic models to approximate ordinal or multinomial models). Here are the key features: (1) Estimation: Know what the basic model is and how to use it to obtain predicted probabilities; know how to interpret the regression coefficients and the corresponding odds ratios and confidence intervals, especially in special situations such as for interactions or changes of more than one unit (these are two things that caused trouble on the midterm). (2) Tests: Know how to test the overall model (likelihood ratio chi-squared test), the individual variables (LR chi-squared test or Wald test) and how to compare two nested models (LR chi-squared test or Wald test); know what is meant by the terms null and saturated models and deviance and why these concepts are important; (3) Goodness of fit: know how to assess goodness of fit and predictive accuracy using techniques such as the Hosmer-Lemeshow test and ROC curves; know what is meant by calibration, sensitivity and specificity and how these relate to goodness of fit.
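    A minimal Stata sketch covering these pieces (variable names hypothetical):

        logistic died age i.sex    // fits the model and reports odds ratios
        estat gof, group(10)       // Hosmer-Lemeshow goodness-of-fit test
        lroc                       // ROC curve and area under the curve
        * comparing nested models with a likelihood ratio test:
        logit died age i.sex c.age#i.sex
        estimates store full
        logit died age i.sex
        estimates store reduced
        lrtest full reduced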
  • Multinomial Logistic Regression: Understand when to use a multinomial model and how the definition of odds and hence the corresponding coefficients and odds ratios differ from those in logistic regression. Know how to check whether a variable is significant, both at an individual level of the model and across the levels; know how to test whether the effect of a variable is the same across multiple levels; know how to make predictions using the multinomial logistic model.
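    For example, in Stata (hypothetical variables, with the outcome coded 1/2/3 and 1 as the base category):

        mlogit choice age i.sex, rrr baseoutcome(1)   // one equation per non-base level
        test [2]age = [3]age   // is the age effect the same across levels 2 and 3?
        predict p1 p2 p3       // predicted probability of each outcome category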
  • Ordinal Logistic Regression: Understand how the ordinal logistic model is a constrained version of the multinomial logistic model and what the proportional odds assumption means; understand what categories are being compared at each level of the model and how to interpret and perform tests about the various coefficients and odds ratios (including knowing how the model is parameterized by computer packages so you can read the output correctly!); know what the pros and cons of the ordinal logistic model are and how to compare it to the multinomial logistic model; be aware that there are other formulations of the ordinal logistic model.
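    A quick Stata sketch (hypothetical variables; note that Stata parameterizes the model with cutpoints /cut1, /cut2, ... rather than separate intercepts, which is exactly the output-reading issue mentioned above):

        ologit rating age i.sex, or   // proportional-odds model, reported as odds ratios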
  • Probit Regression: Understand how the model arose conceptually (in terms of dose response studies), how it is similar to the logistic model (same assumed distribution and predictive structure) and how it is different (inverse normal rather than logit link function). Know how to use it to make predictions and to what extent you can interpret the coefficients. Know how to evaluate model performance for the probit model (basically everything from the logistic model applies directly!) and how the answers it gives you tend to compare to those from a logistic model.
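    For instance, a predicted probit probability is just the standard normal CDF evaluated at the linear predictor; in Stata (hypothetical variables):

        probit died dose
        predict phat                              // predicted probability for each observation
        display normal(_b[_cons] + _b[dose]*10)   // by-hand prediction at dose = 10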
  • Poisson and Negative Binomial Regression: Understand the conceptual framework underlying each of these models for count data (i.e. events happening independently at a steady rate versus number of trials until a particular outcome) and when to use them. Know how these models fit in the GLM framework (distribution (Poisson or NegBin), link (log), systematic component (Xb)). Understand how to make predictions (exponentiating to convert from the log scale to the mean or rate scale) and how to interpret and use an offset or exposure term. Know how to interpret the regression coefficients both on the log scale (our usual change in Y associated with change in X interpretation for the betas) and on the original scale (in terms of mean or rate ratios; these are multiplicative models on the original scale.) Know how to perform tests and obtain confidence intervals for the individual coefficients and model as a whole (i.e. chi-squared tests and Wald tests as with all GLMs). Be aware of the special issues of over-dispersion and zero-inflation: when they are likely to occur; how you can check for them (e.g. compare mean to variance within cells or use the Pearson goodness of fit statistic to calculate an over-dispersion factor for a Poisson model); how you can use a negative binomial model to correct for over-dispersion in a Poisson model (test of over-dispersion factor alpha); how you can fit and interpret zero-inflated versions of both models (i.e. a logistic component for whether or not a person will ever have the event followed by a Poisson or negative binomial model for people who might have the event; Vuong test for presence of zero-inflation).
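    A minimal Stata sketch of the main model-fitting steps (variable names hypothetical):

        poisson events age i.sex, exposure(pyears) irr   // rate ratios with a person-years offset
        estat gof                                        // Pearson/deviance goodness of fit (check over-dispersion)
        nbreg events age i.sex, exposure(pyears) irr     // negative binomial; reports a test of alpha = 0
        zip events age, inflate(age) vuong               // zero-inflated Poisson with Vuong test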
  • Repeated Measures/Mixed Models: Understand that these models are designed to deal with the situation when you have observations that are not independent. This can occur because (a) you have multiple measurements (either over time or under different experimental conditions) on each person or (b) because the subjects in your study are themselves related (e.g. members of the same family; students in the same classroom; etc.) Understand the various ways of conceptualizing how to handle these correlations: repeated measures (directly model the covariance/correlation between the measurements if every subject is measured at the same points); random or mixed effects (interpret the correlations as induced by subjects having individual systematic effects, either a consistent shift (random intercept) or a difference in rate of change (random slope)); hierarchical (the correlations are induced by a nested sequence of memberships, e.g. kids in classrooms; classrooms in schools; schools in districts; etc.). Understand what is meant by the terms fixed and random effects. Be aware of the various correlation structures you might use for a repeated measures model (e.g. independence, compound symmetry, ar(1), unstructured) and how these are paralleled in a mixed effects formulation (e.g. random intercept only is equivalent to compound symmetry). Understand graphically what these different scenarios look like (e.g. parallel lines for random intercept, fanning lines for random slope). Know how to interpret the regression printouts, including how to tell whether you needed the random effects or special covariance structures. (The interpretation of the fixed effect coefficients is the same as in OLS regression; the real difference is simply in how you account for the errors.)
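    In Stata the two formulations look roughly like the following (hypothetical longitudinal data with subject identifier id and time variable week):

        mixed score week || id:                                   // random intercept (compound symmetry)
        mixed score week || id: week, covariance(unstructured)    // random intercept and slope
        * repeated-measures alternative: model the working correlation directly
        xtset id week
        xtgee score week, corr(ar 1)                               // AR(1) correlation structure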

  • Survival analysis: Understand what time to event data are, what is meant by censoring, and the various ways in which data can be censored. Know what the key quantities are in a survival analysis (the survival curve, S(t), the density function, f(t), the hazard rate, f(t)/S(t), the cumulative hazard, H(t), the mean residual survival time, etc.) Be aware of the approaches for estimating these quantities (e.g. parametric, assuming a survival distribution and estimating the parameters via maximum likelihood; non-parametric, using the empirical distribution of the event and censoring times). Understand the basic Kaplan-Meier procedure for estimating the survival curve. I will also show you how to calculate confidence intervals for S(t) at a particular time and how to compare two survival curves (e.g. CIs for the difference at time t or the log-rank test) but I wouldn’t expect you to be able to do that on the test other than by looking at the survival curves. Be aware of the approaches for including covariates in a survival model (parametric accelerated failure time model using MLEs; semi-parametric Cox proportional hazards model) though again I wouldn’t expect you to actually be able to read a printout for these. I have posted some basic survival practice questions for you since we didn’t have an assignment on this topic.
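    For reference, a typical Stata survival workflow looks roughly like this (hypothetical variables):

        stset time, failure(died)    // declare the time-to-event and censoring structure
        sts graph, by(group)         // Kaplan-Meier survival curves
        sts test group               // log-rank test comparing the curves
        stcox age i.group            // semi-parametric Cox model (hazard ratios)
        streg age i.group, distribution(weibull)   // one parametric alternative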
  • Weighting Methods (Optional Bonus Material Only!): Be aware of the sorts of situations in which you might need to compute weighted estimates of parameters of interest (e.g. regression with non-constant variance or points that are not measured with equal accuracy or of equal importance; observational studies where points are not sampled proportionally with their distribution in the population; etc.) In particular understand what a propensity score is and how you could use it to help correctly test for treatment effects in an observational study or to help account for missing data. Again, this topic is more conceptual than calculational since we have not had any homework problems about it and only discussed it briefly in class, so I would only ask a bonus question on it.
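    Conceptually, the propensity score idea boils down to something like the following sketch (hypothetical variables; inverse-probability weighting is just one of several ways to use the score):

        logit treated age i.sex comorbid            // model the probability of treatment
        predict ps                                  // estimated propensity score
        gen ipw = cond(treated, 1/ps, 1/(1 - ps))   // inverse-probability weights
        regress y treated [pweight=ipw]             // weighted estimate of the treatment effect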
  • STATA/SAS commands: I will not ask you to give me the exact command to obtain a particular kind of output. However if I give you a printout with the command lines I would expect you to be able to follow it and correctly interpret what the command did and I would expect you to be able to describe in a general way how you would use one of the packages to obtain the necessary information.