Practical 17
This practical consists of two parts. In Part 1 you are asked to work through an analysis of a medical dataset on the use of contraceptives in Bangladesh; this part is intended to familiarise you with the use of MCMC in MLwiN for Binomial models. In Part 2 we ask you to answer a few questions on another dataset, taken from veterinary epidemiology.
Part 1
In the Bayesian practicals so far we have mainly considered fitting models where our response is a continuous variable. Of course there are many other possible response types; for example, interest often lies in binary, proportion or count data. In this practical we consider binary data. There are many possible scenarios where binary or 0/1 data occur: in education, exam responses may often be in the form of a pass or fail; in health care, responses are often whether a treatment is successful or not; in political data, whether people vote for a particular party or not.
In this practical we will consider an example dataset from the 1988 Bangladesh Fertility Survey. This dataset consists of 1934 women who are grouped in 60 districts and the response of interest is whether these women were using contraceptives at the time of the survey. As predictor variables we will consider effects for the age of the women, the number of children they have and the district they come from. We will also be interested in whether the between-district variation differs for urban and rural areas.
The dataset can be found in the file ‘bang1.ws’.
After opening this worksheet, the Names window will look as follows:
Here ‘woman’ identifies the individual women, ‘district’ the district they live in and ‘use’ whether they use contraceptives (1) or not (0). The predictors we will consider are ‘age’ which is the woman’s age in years centred around the average age, ‘lc’ which is the number of living children and ‘urban’ which categorises the area (which is a sub-area of the districts) in which they live as either urban (1) or rural (0). The other possible predictors that can be investigated by the reader are ‘educ’ which categorises education level of the women from none (1) to at least secondary education (4), ‘hindu’ which categorises religion into Hindu or Muslim (other religions were excluded), and two district level predictors ‘d_illit’ and ‘d_pray’ which give the proportion of illiterate women and the proportion of women who pray every day, for each district.
Simple Logistic Regression Model
We will start by considering a simple logistic regression model accounting for the age of the women only. Logistic regression models are more complicated to fit than Normal models using the IGLS estimation method in MLwiN, as the methods use Taylor series expansions to approximate the model at each iteration. This means that the estimates they produce are quasi-likelihood estimates rather than maximum likelihood estimates. Binomial response models also need a special (denominator) column that contains the number of trials on which each Binomial response is based. For 0/1 data this will be another column of ones, but for proportion data it will contain the number of units on which each proportion is based.
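For reference (the Equations window screenshots are not reproduced in this text), the simple logistic regression model we are about to set up can be written, in standard notation, roughly as

\text{use}_i \sim \text{Binomial}(n_i,\, p_i), \qquad n_i = 1 \text{ for these 0/1 data}
\text{logit}(p_i) = \beta_0 + \beta_1\,\text{age}_i

where n_i is the value held in the denominator column.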
To set up the basic logistic regression model we need to perform the following commands in the Equations window:
This will set up the response variable and its type, and the hierarchical structure of the dataset. The window will look as follows:
We now need to set up the denominator column and our predictor variables:
The window now looks as follows:
There are several possible quasi-likelihood methods available in MLwiN but, as we are only using them for starting values and to set up the model, we will use the default methods:
The model should now run in 3 iterations. We now want to run the same model using MCMC.
The window will then appear as shown below. Here we see that for non-Normal responses MLwiN will not allow Gibbs sampling; we discussed why this is the case in the last lecture.
If we now run the model using MCMC by
we will get the following results (after clicking twice on the Estimates button):
Here we see that the intercept term is –0.437, which, as ‘age’ is centred, corresponds to the average woman. In order to convert this into a probability we need to transform it onto the probability scale. This can be achieved by the anti-logit operation in the Command Interface window. Typing the command CALC B1=ALOG (-0.437) produces a probability of using contraception of 0.392. The age coefficient is small and of the same magnitude as its standard error, suggesting that there is no real effect of age. It should be noted that unlike the quasi-likelihood methods, we can now find a deviance for our models and use the DIC diagnostic.
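For readers who prefer to check such transformations outside MLwiN, a minimal Python equivalent of the anti-logit calculation (the value -0.437 is the intercept quoted above):

import math

def antilogit(x):
    """Inverse of the logit link: maps a linear-predictor value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Intercept for the 'average' woman (age is centred), from the MCMC output above
print(round(antilogit(-0.437), 3))   # approximately 0.392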
The deviance formula for a Binomial (0/1) model is

D = -2 \sum_{i} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

where p_i is the predicted probability for observation i. To calculate p_i we need to use the inverse distribution function that corresponds to the link function, so for the logit link we need to calculate the anti-logit of each fitted value, as described above for the average woman. This is all performed in the background when the DIC diagnostic is calculated.
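As an illustration of what is being computed, here is a short Python sketch of the Bernoulli deviance and of how Dbar and D(thetabar) combine into pD and DIC. The y and p arrays are placeholders, not the Bangladesh data; the DIC figures are taken from the 'model with age' output shown below:

import numpy as np

def binomial_deviance(y, p):
    """Deviance (-2 * log-likelihood) for 0/1 responses y with fitted probabilities p."""
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Placeholder data: in the real calculation y is the 0/1 'use' column and
# p is the anti-logit of each woman's fitted value
y = np.array([1, 0, 0, 1, 1])
p = np.array([0.40, 0.35, 0.40, 0.45, 0.40])
print(binomial_deviance(y, p))

# The DIC quantities as reported in the output below for the model with age
dbar, d_thetabar = 2591.28, 2589.29
pD = dbar - d_thetabar        # effective number of parameters
print(pD, dbar + pD)          # approximately 1.99 and 2593.27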
The output can be seen below along with that for the simpler model (which we will not fit) with just a constant probability of usage.
->BDIC
Bayesian Deviance Information Criterion (DIC)
Dbar D(thetabar) pD DIC
2591.28 2589.29 1.99 2593.27 (Model with age)
2591.90 2590.91 0.99 2592.88 (Model without age)
We can see here that, as with the Normal case, the DIC diagnostic picks up almost exactly the correct degrees of freedom. The DIC diagnostic also confirms that the model without age is only marginally better, as we suspected from the coefficients and standard errors. Note that for Binomial models there are two possible parameterisations for D(thetabar): the mean parameterisation, which uses the average value of p_i from the chain, and the canonical parameterisation, which uses the average values of the individual β coefficients that form p_i. In MLwiN we always use the canonical parameterisation, but it should be noted that the two parameterisations will often give different DIC values.
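To make the distinction concrete, here is a toy Python sketch (with a made-up chain of (β0, β1) draws and two illustrative observations, not the Bangladesh data) contrasting the two ways of forming the fitted probabilities that go into D(thetabar); MLwiN uses the canonical version:

import numpy as np

def antilogit(x):
    return 1.0 / (1.0 + np.exp(-x))

def deviance(y, p):
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy chain of (beta0, beta1) draws and two illustrative observations
rng = np.random.default_rng(1)
chain = rng.normal(loc=[-0.44, 0.01], scale=0.08, size=(1000, 2))
X = np.array([[1.0, -5.0],     # intercept, centred age for woman 1
              [1.0, 10.0]])    # intercept, centred age for woman 2
y = np.array([0, 1])

# Canonical parameterisation (used by MLwiN): average the betas, then form p_i
p_canonical = antilogit(X @ chain.mean(axis=0))

# Mean parameterisation: form p_i at every iteration, then average the p_i
p_mean = antilogit(chain @ X.T).mean(axis=0)

print(deviance(y, p_canonical), deviance(y, p_mean))  # generally differ slightly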
In all the models fitted thus far we have been using the default monitoring run length of 5,000 iterations. Relying on this default is not to be encouraged, and we should always check that we have run the MCMC algorithm for long enough, particularly for non-Normal models where Metropolis sampling is the default for some parameters.
The diagnostics for the ‘age’ effect will then appear as follows:
Here we can see that the kernel plot shows a large probability of a value less than zero. The Raftery-Lewis diagnostic suggests that we should run for roughly three times our current run length. Note that, as we are more interested in whether this effect is zero or not than in quoting this parameter to two significant figures, we will ignore the Brooks-Draper diagnostic.
We can now run this model for 15,000 iterations to see if this changes our estimates.
If you now look at the DIC diagnostic and the MCMC diagnostics window we see that neither of them have changed much at all. This confirms the results from the shorter run of 5,000 iterations.
Although we have found that ‘age’ is not significant we will, for now, leave it in the model while adding our next predictor, number of living children (‘lc’). This is a categorical variable and so we will add this into the model via the Add Term button
This can be done (after changing estimation method to IGLS) as follows:
Now we need to run the model using MCMC after using the IGLS method.
After estimation has finished the Equations window will (after pressing the + button) look as follows:
Interestingly, when the number of children a woman has is added to the model, the age coefficient becomes significant and of opposite sign. This is because the number of children is correlated with age.
The DIC diagnostic as shown below is greatly reduced in this model with 5 parameters.
->BDIC
Bayesian Deviance Information Criterion (DIC)
Dbar D(thetabar) pD DIC
2519.98 2515.10 4.88 2524.85
2591.28 2589.29 1.99 2593.28 (without number of kids)
Random effects logistic regression model
So far we have concentrated on fixed effects simple logistic regression models, which are analogous to linear regression models for Normal responses. As with Normal responses, we can also add random effects to our model to account for the different probabilities of contraceptive use in the different districts in which the women live.
To add random effects we use the same procedure as with the normal models.
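Schematically (with i indexing women, j indexing districts, and the dummy variables for 'lc' written generically), the random effects model now being fitted is

\text{logit}(p_{ij}) = \beta_0 + \beta_1\,\text{age}_{ij} + \sum_k \beta_k\,\text{lc}_{k,ij} + u_j, \qquad u_j \sim \text{N}(0,\, \sigma^2_u)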
Here we have run this model using the IGLS method and the results should look as follows:
To now run this model using MCMC
After 5,000 iterations we get the following estimates
We can see here that both the fixed effects and the level 2 variance estimates from MCMC are bigger (in magnitude) than the quasi-likelihood estimates. The 1st order MQL method, which is the default method that we have used, is known to give estimates that are biased downwards. We can investigate the level 2 variance in more detail by viewing its MCMC diagnostics.
This will bring up the MCMC diagnostics for the level 2 variance parameter, as shown below. Here we see that the kernel plot has a long right-hand tail, as expected, and consequently the mode, which is equivalent to the estimate obtained using quasi-likelihood methods, is smaller than the mean. It has an estimated value of 0.279, which compares with the value of 0.246 from the 1st order MQL method.
We can also see that the Raftery-Lewis diagnostic suggests that we haven't run the sampler for long enough. Note, however, that running the sampler for an extra 15,000 iterations (which takes a few minutes) produces a modal estimate of 0.281, which is very similar to the answer after 5,000 iterations.
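As an aside, the posterior mode quoted above is read off the kernel density plot of the stored chain. A minimal Python sketch of the same idea, using a simulated right-skewed chain as a stand-in for the real level 2 variance chain (this is not MLwiN's exact kernel estimator):

import numpy as np
from scipy.stats import gaussian_kde

# Stand-in chain for the level 2 variance: right-skewed, like the real posterior
rng = np.random.default_rng(42)
chain = rng.gamma(shape=8.0, scale=0.035, size=5000)

kde = gaussian_kde(chain)                 # smooth the sampled values
grid = np.linspace(chain.min(), chain.max(), 500)
mode = grid[np.argmax(kde(grid))]         # highest point of the kernel plot

print(f"mean {chain.mean():.3f}, mode {mode:.3f}")  # mode < mean for a right-skewed posterior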
If we look at the DIC diagnostic for the random effects model we get the following:
->BDIC
Bayesian Deviance Information Criterion (DIC)
Dbar D(thetabar) pD DIC
2395.92 2354.50 41.41 2437.33 (with random effects)
2519.98 2515.10 4.88 2524.85 (without random effects)
So even though we have five fixed effects and sixty random effects due to districts, these sixty district effects are represented by effectively only 36.4 (41.4 - 5) parameters. The DIC diagnostic is reduced by 87.5, so this random effects model is a great improvement, suggesting that contraceptive usage differs significantly between the sixty districts.
Random coefficients for area type
In the dataset, although we have already split up the country of Bangladesh into 60 districts, within these districts there are both urban and rural areas. We have an indicator, 'urban', that records whether an individual woman lives in a rural (0) or urban (1) area. We will first fit this term as a fixed effect in our model and then run the model using MCMC.
Now if we run the model by
We can now see that ‘urban’ has a fixed coefficient of 0.736 (0.123), which suggests that this is a strong predictor and that women in urban areas are more likely to use contraceptives than women in rural areas. This is backed up by the reduction in DIC diagnostic to 2408.15 (a drop of 29), which suggests that this is a better model.
Just like fitting random coefficients in a normal response model, we can now see if the effect of being in an urban area is different for the various districts.
The model we have now fitted will upon convergence be as follows:
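(The Equations window output itself is not reproduced in this text.) Schematically, with X_{ij}\beta denoting the fixed part, the random coefficient model is

\text{logit}(p_{ij}) = X_{ij}\beta + u_{0j} + u_{1j}\,\text{urban}_{ij},
\qquad
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \text{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \begin{pmatrix} \sigma^2_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^2_{u1} \end{pmatrix} \right)

so the implied between-district variance is

\text{Var}(u_{0j} + u_{1j}\,\text{urban}_{ij}) = \sigma^2_{u0} + 2\,\sigma_{u01}\,\text{urban}_{ij} + \sigma^2_{u1}\,\text{urban}_{ij}^2,

which is the expression evaluated at urban = 0 (rural) and urban = 1 (urban) in the next paragraph.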
Here we see that the variance between districts is different for urban and rural areas, with the rural areas having a variance of 0.418 and the urban areas a reduced variance of 0.418 - 2×0.432 + 0.738 = 0.292 (the estimated covariance between the district intercepts and urban slopes being -0.432). In terms of model fit we are now fitting 6 fixed effects and 2×60 random effects, and so if we look at the DIC diagnostic for this model we get:
->BDIC
Bayesian Deviance Information Criterion (DIC)
Dbar D(thetabar) pD DIC
2328.93 2272.31 56.62 2385.56 (with urban random effects)
2369.36 2330.58 38.79 2408.15 (without urban random effects)
Here once again we see that adding the extra terms increases the effective number of parameters but reduces the DIC diagnostic, suggesting that this is a better model.
We will return to this example in the next practical when we consider different link functions and running the dataset using WinBUGS.
What you should have learnt from this practical
- How to fit models to binary responses.
- How to fit multilevel random effects logistic regression models.
- How to fit multiple random effects in a logistic model.
Part 2 – The pig adg dataset
Data on both atrophic rhinitis and enzootic pneumonia were recorded on 341 pigs at slaughter. Causally it was assumed that atrophic rhinitis might increase the risk of pneumonia, through destruction of the pig’s air-filtering mechanism (nasal turbinates).
The dataset is stored in the worksheet ‘pigadg.ws’. The response (pneumonia) is stored in the column labelled pn. The data have two levels (pig nested within farm). The atrophic rhinitis score predictor (ar) ranges from 0 to 5. This has also been dichotomised in the variable ar2 which takes value 1 when ar>4.
1. Set up a null logistic regression model in MLwiN, with no predictors and no random effects, and find its DIC value using MCMC. Note you should ensure that the MCMC has been run for long enough!
2. Now try adding in the ar predictor and see if it is significant, i.e. whether it improves (reduces) the DIC statistic.
3. Try the variable ar2 instead of ar and see if it is a better predictor.
4. Dohoo, Martin & Stryhn use an alternative dichotomy, ar1 = 1 when ar>1. Construct this variable (the dichotomisation logic is sketched after this list) and see if it is a better predictor than ar2.
5. Add in the farm level random effects and see what happens to the significance of each of the predictors, i.e. try the models in 1-4 with random farm effects.
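Purely to illustrate the dichotomisation logic referred to in task 4 (in MLwiN itself you would create the column using, for example, the Command Interface), a minimal Python sketch with hypothetical ar scores:

# Hypothetical atrophic rhinitis scores; in the worksheet this is the column 'ar'
ar = [0, 1, 2, 5, 3, 0, 4]

# Dichotomies used in the exercises above
ar2 = [1 if score > 4 else 0 for score in ar]   # ar2 = 1 when ar > 4
ar1 = [1 if score > 1 else 0 for score in ar]   # ar1 = 1 when ar > 1 (Dohoo, Martin & Stryhn)

print(ar2)  # [0, 0, 0, 1, 0, 0, 0]
print(ar1)  # [0, 0, 1, 1, 1, 0, 1]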