WTCHG course in statistical modelling and data analysis

Model Choice

Chris Holmes

1. Selecting multiple markers in an Association study

The file association_data_genotypes.txt contains the genotypes at 986 SNPs for 5003 individuals (1999 cases and 3004 controls of two types) from the WTCCC study. The disease and sample IDs are anonymised.

Previously you used logistic regression to fit a model with additive effects on the log-odds sequentially for all 986 SNPs and you stored the estimated coefficients and p-values.

  • Repeat this exercise, plot the distribution of the p-values, and then rank (sort) them.
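
As a reminder of the single-marker scan, here is a minimal sketch. It does not read association_data_genotypes.txt, because the layout of that file is not described here; instead it simulates a small genotype matrix coded 0/1/2 (an assumption) with made-up effects at two SNPs. The names geno, status, single_marker and ranked are purely illustrative.

```r
# Simulated stand-in for the real data: genotypes coded 0/1/2 (an assumption).
set.seed(1)
n <- 500
p <- 20
geno <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n,
               dimnames = list(NULL, paste0("snp", seq_len(p))))
# Hypothetical truth: snp1 and snp2 act additively on the log-odds.
status <- rbinom(n, 1, plogis(-0.5 + 0.8 * geno[, "snp1"] + 0.5 * geno[, "snp2"]))

# Fit one additive logistic regression per SNP; keep the slope and its p-value.
single_marker <- t(apply(geno, 2, function(g) {
  fit <- glm(status ~ g, family = binomial)
  summary(fit)$coefficients["g", c("Estimate", "Pr(>|z|)")]
}))
colnames(single_marker) <- c("beta", "p.value")

# Rank SNPs from most to least significant and look at the p-value distribution.
ranked <- single_marker[order(single_marker[, "p.value"]), ]
print(head(ranked))
hist(single_marker[, "p.value"], breaks = 20,
     main = "Single-marker p-values", xlab = "p-value")
```

With the real data, geno and status would be replaced by the genotype columns and the case/control indicator from the file.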

Now write a function to perform a forward (or stepwise) selection model search using logistic regression with SNPs as covariates; use the glm() function in R.

  • Run the process forwards to add in all 986 SNPs. Look at the order in which the SNPs enter and compare it to their rank under the single-marker p-value test above. Is there much difference? Under what circumstances would you expect to see the greatest difference?
  • Store the AIC value at each iteration (each variable added) and plot AIC against the number of variables in the model. Comment on the shape.
  • Now write a function to perform 10-fold cross-validation. Store the predictive log-likelihood (on the prediction set). Plot a graph of out-of-sample predictive log-likelihood versus Number-of-Variables. How does it compare to the AIC curve?
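
One possible shape for the two functions asked for above is sketched below, again on simulated 0/1/2 genotypes rather than the real file. The function names forward_select() and cv_loglik(), the column name status and the choice of AIC as the entry criterion are all assumptions made for illustration. As a simplification, the cross-validation reuses the entry order found on the full data; re-running the selection inside each fold would be the stricter approach.

```r
# Simulated stand-in data: 8 SNPs coded 0/1/2 plus a binary status (assumptions).
set.seed(2)
n <- 400
p <- 8
geno <- as.data.frame(matrix(rbinom(n * p, 2, 0.3), nrow = n,
                             dimnames = list(NULL, paste0("snp", seq_len(p)))))
geno$status <- rbinom(n, 1, plogis(-0.5 + 0.9 * geno$snp1 + 0.6 * geno$snp2))

# Greedy forward selection: at each step add the SNP that gives the lowest AIC.
forward_select <- function(data, response = "status") {
  remaining <- setdiff(names(data), response)
  chosen <- character(0)
  aic <- numeric(0)
  while (length(remaining) > 0) {
    cand_aic <- sapply(remaining, function(v) {
      AIC(glm(reformulate(c(chosen, v), response), family = binomial, data = data))
    })
    best <- names(which.min(cand_aic))
    chosen <- c(chosen, best)
    aic <- c(aic, min(cand_aic))
    remaining <- setdiff(remaining, best)
  }
  list(order = chosen, aic = aic)
}

# K-fold cross-validation of the nested models along a given entry order.
# Returns the summed held-out binomial log-likelihood for each model size.
cv_loglik <- function(data, order, response = "status", K = 10) {
  fold <- sample(rep(seq_len(K), length.out = nrow(data)))
  sapply(seq_along(order), function(m) {
    f <- reformulate(order[seq_len(m)], response)
    sum(sapply(seq_len(K), function(k) {
      fit <- glm(f, family = binomial, data = data[fold != k, ])
      prob <- predict(fit, newdata = data[fold == k, ], type = "response")
      sum(dbinom(data[fold == k, response], 1, prob, log = TRUE))
    }))
  })
}

path <- forward_select(geno)
ll <- cv_loglik(geno, path$order)

plot(seq_along(path$aic), path$aic, type = "b",
     xlab = "Number of variables", ylab = "AIC")
plot(seq_along(ll), ll, type = "b",
     xlab = "Number of variables", ylab = "Out-of-sample log-likelihood")
```

Note that a full forward pass over all 986 SNPs needs roughly half a million glm() fits, so it is worth testing the function on a subset of SNPs first.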

We will now explore the use of the Lasso to perform variable selection.

Install and load the “lars” package in R. Use the help (?lars) to learn about the function lars().

  • Run lars() on the association data above
  • Plot out the coefficient paths [see plot.lars() ] and explore those variables entering first into the model
  • Compare the coefficient paths with the ranking under p-values and Forward Selection above
  • Use cv.lars() to explore when the out-of-sample prediction error starts to increase
  • lars() also has a forward selection mode (see its type argument). Re-run the experiments using forward selection and compare to the lasso. ** Note: lars() fits a least-squares model to the data, which is not strictly what we want, so your results will differ from those of the Forward Selection above **
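
The lars steps above might be sketched as follows, assuming the lars package is installed and using simulated 0/1/2 genotypes in place of the real file. The objects x, y, entry and best_fraction are illustrative names. Here type = "stepwise" is taken to be the forward selection mode; type = "forward.stagewise" is a related alternative worth comparing.

```r
library(lars)

# Simulated stand-in data: 10 SNPs coded 0/1/2 and a binary status (assumptions).
set.seed(3)
n <- 300
p <- 10
x <- matrix(rbinom(n * p, 2, 0.3), nrow = n,
            dimnames = list(NULL, paste0("snp", seq_len(p))))
y <- rbinom(n, 1, plogis(-0.5 + 0.9 * x[, 1] + 0.6 * x[, 2]))

# Lasso path. lars() treats the 0/1 status as a numeric least-squares response.
fit_lasso <- lars(x, y, type = "lasso")
plot(fit_lasso)                      # coefficient paths
entry <- unlist(fit_lasso$actions)   # signed indices: positive enters, negative leaves
print(entry)

# 10-fold cross-validated squared prediction error along the path.
cv <- cv.lars(x, y, K = 10, type = "lasso", plot.it = FALSE)
best_fraction <- cv$index[which.min(cv$cv)]
print(best_fraction)

# The same data under forward stepwise selection, for comparison with the lasso.
fit_step <- lars(x, y, type = "stepwise")
print(unlist(fit_step$actions))
```

The order in entry can be set against the p-value ranking and the glm-based forward selection order from earlier in the exercise.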

Try also exploring the lasso2 package, which does fit glm lasso models. In particular, see the function gl1ce(), to which you can supply a binomial likelihood. Compare your answers to those from lars().

Gil McVean. Last modified 17/11/2008