Analysis of sex sequences by means of generalized linear mixed models
Tutorial for the analyses
Behavioral Ecology and Sociobiology
R. AMBROSINI, D. RUBOLINI, N. SAINO
Roberto Ambrosini
Dipartimento di Biotecnologie e Bioscienze, Università degli Studi di Milano-Bicocca, P.zza della Scienza 2, 20126 Milano, Italy;
email:
Introduction
We provide a tutorial for the analysis of sex sequences using Generalized Linear Mixed Models (GLMMs), with step-by-step commented analyses and the associated R code. The code has been produced using R version 3.0.2 (R Core Team 2013), RStudio version 0.97.551 (Racine 2012) and associated libraries. We start by generating an exemplificative data file that will help in testing different hypotheses of variation in sex sequences. Readers should carefully refer to the main text of the article for details of each hypothesis being investigated.
Data preparation
The file we will use for the analyses consists of 50 hypothetical clutches (identified by variable ID) of a hypothetical bird species laying clutches of 3 eggs. The rather small sample size is justified by the exemplificative purpose of this Tutorial. Laying sequence is coded from 1 to 3 (variable order), with egg 1 being the first and egg 3 being the last laid one. Data were generated according to a pattern of sex allocation whereby the sex of an egg influences that of the subsequent one in the laying sequence, with the following parameters: P1 = 0.6, Pi|(i-1)m = 0.8, Pi|(i-1)f = 0.4. There is therefore a rather large effect of the sex of the previous egg on that of the following ones, butthese values are not biologically unreasonable (see the example on eclectus parrots Eclectus roratus below). Sex sequence data were generated by drawing random numbers from a uniform distribution in [0:1] and then assigning the sex of the eggs according to random numbers being smaller or larger than the probability of being male calculated for that egg according to the parameters reported above. We also included 3% of random missing values in the binary dependent variable (sex), representing the sex of the embryo (0 = female, 1 = male).
# generate data file
data <- data.frame(cbind(
ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8,
9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15,
15, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 20, 21, 21, 21,
22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 26, 27, 27, 27, 28,
28, 28, 29, 29, 29, 30, 30, 30, 31, 31, 31, 32, 32, 32, 33, 33, 33, 34, 34,
34, 35, 35, 35, 36, 36, 36, 37, 37, 37, 38, 38, 38, 39, 39, 39, 40, 40, 40,
41, 41, 41, 42, 42, 42, 43, 43, 43, 44, 44, 44, 45, 45, 45, 46, 46, 46, 47,
47, 47, 48, 48, 48, 49, 49, 49, 50, 50, 50),
order = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
sex = c(0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, NA, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, NA, 1, 0, 0, 1, 0, 0, 1, 1, 1,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, NA, 0, 1, 1)
))
#show dataset
edit(data)
We then generate the AF variable, as well as the variable prevsex, representing the sex of the previous egg (+0.5 = male, -0.5 = female) and assuming the value of 0 for the first laid egg in each clutch, and the variable order0, which is equal to (order - 1).
# ensure that the dataset is sorted by increasing ID and laying sequence
ord <- order(data$ID, data$order)
data <- data[ord, ]
# generate the AF variable
AF <- as.numeric(I(data$order > 1) + 0)
# generate variable “prevsex”
prevsex <- rep(0, nrow(data))
for (j in unique(data$ID)) {
if (max(data$order[data$ID == j]) > 1) {
for (i in 2:(max(data$order[data$ID == j]))) {
prevsex[data$ID == j & data$order == i] <- data$sex[data$ID == j &
data$order == (i - 1)] - 0.5
}
}
}
# generate variable “order0”
order0 <- data$order - 1
data <- cbind(data, AF, prevsex, order0)
# final dataset
edit(data)
We have now generated all the variables needed to fit the binomial GLMMs testing for different patterns of variation in sex sequences (see main text). Hereafter, as in the main text, Pi denotes the probability that a given egg of order i is a male egg. We will start the analyses by exploring the data in order to assess whether patterns of variation in egg sex do appear.
Data exploration
We will produce a set of graphs that should help highlighting patterns of sex allocation. We start by producing two tables that summarize the data and will be useful to build graphs. TAB1 includes the proportion of male eggs along the laying sequence,TAB2 the number of male eggs in first laid eggs and in any following egg preceded by a male or a female egg. These two tables and the graphs showing their results are generated with the aim of visually inspecting the most relevant patterns of variation in egg sex that will be tested in the models.
# tabulate data
TAB1 <- aggregate(data$sex, by = list(data$order), FUN = mean, na.rm = TRUE)
colnames(TAB1) <- c("order", "prop.males")
TAB2 <- xtabs(~ sex + prevsex, data = data)[,c(2,3,1)]
colnames(TAB2) <- c("first", "M", "F")
TAB2 <- TAB2[2,]/apply(TAB2, 2, sum)
The following code will produce a panel of four graphs based on the tables generated so far. The first two graphs represent variation in probability of being male according, respectively, to laying order and to the sex of the preceding egg. The models we will fit are based on the logit transformation. Plotting the logit(Pi) instead of Pi may improve clarity of graphs. The third and the fourth graphs therefore represent the logit-transformed probability of being male according to laying order and sex of the preceding egg.
# graphs
def.par <- par(no.readonly = TRUE) # save default, for resetting
layout(matrix(1:4, 2, 2, byrow = TRUE))
barplot(TAB1$prop.males, names.arg = TAB1$egg,
xlab = "laying order", ylab = "proportion of male eggs")
barplot(TAB2, names.arg = c("first", "M", "F"),
xlab = "sex of the preceding egg", ylab = "proportion of male eggs")
require(boot) # for logit transformation
barplot(logit(TAB1$prop.males), names.arg = TAB1$egg,
xlab = "laying order",
ylab = "logit proportion of male eggs")
barplot(logit(TAB2), names.arg = c("first", "M", "F"),
xlab = "sex of the preceding egg",
ylab = "logit proportion of male eggs")
par(def.par) # reset default values
The upper left graph shows an increase in Pi along the laying sequence. The upper right one indicates that Pi is larger for eggs preceded by a male than a female egg. The lower graphs show the same pattern of variation more clearly because they are based on logit-Pi.
Model interpolation
Information obtained in the explorative analyses suggests that both a sequence effect and a previous-egg effect are plausible. We will therefore fit different models in order to investigate different putative patterns of sex allocation. Models will be fitted using the glmer function in package lme4 version 1.0-5 (Bates et al. 2013) and other functions in the package lmerTest (Kuznetsova et al. 2013) will be used to explore the results. The results obtained by previous version of the same package may slightly differ due to changes in the underlying computation algorithm.
Model testing H0: We start by fitting an intercept only model (null model), testing H0, i.e. that Piis constant along the laying sequence.
require(lme4)
# model testing H0
H0 <- glmer(sex ~ 1 + (1 | ID), family = binomial, na.action = na.omit,
data = data)
summary(H0)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: binomial ( logit )
Formula: sex ~ 1 + (1 | ID)
Data: data
AIC BIC logLik deviance
188.2088 194.1623 -92.1044 184.2088
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 1.154 1.074
Number of obs: 145, groups: ID, 50
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7403 0.2393 3.093 0.00198 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’0.1 ‘ ’ 1
We first observe that the random intercept variance is rather large (1.154) when compared to the residual variance, which is equal to π2/3 = 3.290 (the glmer algorithm does not fit overdispersed models), which suggests that unmodelled effects affect egg sex. This observation is straightforward if the random intercept variance of this model is compared to that of the model fitting H1 (see below) which is equal to zero. The fact that random intercept variance is lower than the residual variance indicates that most of the variation occurs within clutches, and provides support for the investigation of within-clutches processes of sex allocation. Indeed, in our dataset, random intercept variation in model testing H0 occurs because clutches that happened to have a female first egg likely have more female eggs. However, a large random variance may also be the consequence of processes acting between clutches as, for example, if females in poor conditions lay more female eggs (see also the main text), so that, in a real dataset, it may be worth to investigate also between-clutches processes.
Alternatively, it is possible to test for the significance of the random effect by a likelihood-ratio test which indicate that it is highly significant, thus confirming that there is large variation in the average probability of being male between clutches.
require (lmerTest)
rand(H0)
Analysis of Random effects Table:
Chi.sq Chi.DF p.value
(1 | ID) 18.8 1 1e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning messages: (omitted)
This procedures produced warning messages, that were omitted here. Fixed effects (above) indicate that the intercept (0.74 ± 0.24 SE) significantly differs from 0. There is therefore a larger proportion of male than female eggs. Estimated Pi is 0.68 ± 0.05 SE and the 95% CI for Pi is 0.57-0.77. This calculation was made as follows:
inv.logit(0.7403)
[1] 0.6770615
The logit-transformed SE cannot be directly converted into probabilities by means of an inverse logit transformation. Approximate SE of Pi can be calculated as follows, while 95% CI can be precisely obtained by profiling the likelihood function (Vanzon & Moolgavkar 1988):
# SE
(inv.logit(0.7403 + 0.2393) - inv.logit(0.7403 - 0.2393)) / 2
[1] 0.05216727
CI.H0 <- confint(H0)
CI.H0
2.5 % 97.5 %
.sig01 0.3043934 1.931325
(Intercept) 0.2564502 1.346675
inv.logit(CI.H0[2,])
2.5 % 97.5 %
0.5637635 0.7935856
The confint function provides CIs for all the parameters estimated by the model i.e. the standard deviation of the random effect (1.074) and of the intercept (0.740). Alternatively, approximated CI for the fixed effects can be obtained by the Wald method (shown here only for comparison):
# 95% CI, lower limit
inv.logit(0.7403 + qt(0.025, 50) * 0.2393)
[1] 0.52645507
# 95% CI, upper limit
inv.logit(0.7403 + qt(0.975, 50) * 0.2393)
[1] 0.7722304
where function qt is used to calculate 2.5% and 97.5% quantiles of a t distribution with degree of freedom conservatively set to the number of clutches (50 in our example).
Model testing H1: We then fit a model testing H1, i.e. that the sex of an egg depends on that of the previously laid egg:
# model testing H1
H1 <- glmer(sex ~ AF + prevsex + (1 | ID),
family = binomial, na.action = na.omit, data = data)
summary(H1)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: binomial ( logit )
Formula: sex ~ AF + prevsex + (1 | ID)
Data: data
AIC BIC logLik deviance
178.5898 190.4412 -85.2949 170.5898
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 0 0
Number of obs: 143, groups: ID, 50
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.1603 0.2838 0.565 0.572019
AF 0.6027 0.3743 1.610 0.107382
prevsex 1.7368 0.4883 3.557 0.000375 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) F
AF -0.758
prevsex 0.000 0.075
rand(H1)
Random term (1 | ID) was eliminated because of standard deviation being equal to 0
Analysis of Random effects Table:
Warning messages: (omitted)
First of all, we note that there is no random between-clutches variation in the mean (logit-transformed) probability of being males, as the variance of the random intercept is equal to 0. This occurs because when we generated the dataset we introduced no between-clutches variation in the effect of the sex of the preceding egg on that of the following one (indeed, we always use Pi|(i-1)m = 0.8 and Pi|(i-1)f = 0.4 for all clutches). Significance test for the random effect cannot be computed, because there is no variance of the random term (and the procedure generates warnings).
The fact that random variation greatly decreased in model H1 with respect to model H0 suggests that model H1includes variables accounting for the mechanism generating the amount of variation that was unmodelled in H0 and accounted for by the random intercept. We can expand a little bit on this point here, because we know the actual pattern used to generate egg sexes, i.e. that the sex of an egg is influenced by that of the previous one. This pattern however generates random variation in the average sexratio between clutches: those clutches where the first egg is female also happen to have a larger proportion of females because of the mechanisms that determine egg sex within clutches.
A model not including variables accounting for this mechanisms, like H0, cannot account for this variation in any other way than in the variance in the random effect. Model H0 is therefore suggesting that “sex ratios vary between clutches, but we don’t know why”. Conversely, models H1 precisely accounts for the mechanisms generating egg sex, and there is no other mechanisms acting between clutches because, having accounted for this within-clutch variation, there is no residual between-clutches variation. In other words, model H1suggests that “we know what is going on: the sex of an egg varies according to that of the preceding one. This mechanism acts within clutches, and there is nothing going on between clutches”.
All these pieces of evidence suggest that also the random intercept is redundant in this particular case. However, we refrain from further simplifying the random structure of the model, since this would imply removing the grouping factor ID, which would be questionable in analyses of real datasets.
The results from H1 indicate a statistically significant effect of prevsex, i.e. the sex of an egg varies according to that of the previously laid egg. Specifically, model estimates show that the probability that the first egg is male is 0.54, and that, for eggs of any laying order, the probability of being male for an egg preceded by a female egg is 0.47, while the probability of an egg being male if preceded by a male egg is equal to 0.84. In addition, the probability of being male of a second or third egg, independently from the sex of the preceding egg, is 0.68. These probabilities were calculated as follows (output after each code line):
inv.logit(0.1603)
[1] 0.5399894
inv.logit(0.1603 + 0.6027 + 1.7368 * -0.5)
[1] 0.4736744
inv.logit(0.1603 + 0.6027 + 1.7368 * 0.5)
[1] 0.8363613
inv.logit(0.1603 + 0.6027)
[1] 0.6820047
There is also another way to obtain such estimates. We start by producing a dataset (newdata hereafter), with the same variables entered in the model (ID, AF, and prevsex). We will assign arbitrary values to ID (e.g. 1, 2, 3, 4; this will not affect the results) while the other ones will take the specific values corresponding to the estimate we are interested in. For example, if we are interested in estimating the probability of being male for a first egg, we must set AF = 0 and prevsex = 0 in the newdata dataset because our coding implies that first eggs have 0 in both these variables. If, conversely, we are interested in estimating the same probability for a second or third egg preceded by a female egg we must set AF = 1 and prevsex = -0.5 for these eggs, and AF = 1 and prevsex = +0.5 if we are interested in estimating the probability of being male of an egg preceded by a male one. Finally, we may be interested in estimating the probability of being male for a second or third egg, independently of the sex of the preceding one. This can be done by settingAF = 1 and prevsex = 0. We stress that these values must be entered exactly in the same order as the independent variables appear in the model output. This dataset is produced as follows:
# generate the "newdata" dataset
newdata <- rbind(first = c(1, 0, 0),
later = c(2, 1, 0),
after_f = c(3, 1, -0.5),
after_m = c(4, 1, +0.5))
newdata <- data.frame(newdata)
names(newdata) <- c("ID", "AF", "prevsex")
newdata
ID AF prevsex
first 1 0 0.0
later 2 1 0.0
after_f 3 1 -0.5
after_m 4 1 0.5
We then use the ezPredict function in the ez library (Lawrence 2013) to obtain predicted values from the fixed effects of the GLMMs. This function also uses a bootstrap to generate variances.
# calculate predicted values
require(ez)
P1 <- ezPredict(H1, to_predict = newdata, boot = TRUE, iterations = 1000)
P1$cells
ID AF prevsex value var
first 1 0 0.0 0.1603427 0.08051526
later 2 1 0.0 0.7630281 0.05960125
after_f 3 1 -0.5 -0.1053605 0.10555559
after_m 4 1 0.5 1.6314168 0.13284940
These values are in logit. We can calculate the corresponding probability and SEs and obtain (empirical) 95% CIs for probabilities from the bootstrapped values:
# calculate predicted probabilities and CIs
RES <- P1$cells[,-1]
RES$value <- inv.logit(P1$cells$value)
RES$var <- (inv.logit(P1$cells$value + sqrt(P1$cells$var)) -
inv.logit(P1$cells$value - sqrt(P1$cells$var))) / 2
LL <- UL <- NULL
for(i in P1$cells$ID){
LL <- c(LL, inv.logit(quantile(P1$boot$value[P1$boot$ID == i], 0.025)))
UL <- c(UL, inv.logit(quantile(P1$boot$value[P1$boot$ID == i], 0.975)))
}
RES$LL <- LL
RES$UL <- UL
names(RES)[names(RES) == "value"] <- "prob"
names(RES)[names(RES) == "var"] <- "approx.SE"
RES
AF prevsex prob approx.SE LL UL
first 0 0.0 0.5400000 0.07002378 0.3946451 0.6773813
later 1 0.0 0.6820108 0.05278756 0.5687269 0.7757951
after_f 1 -0.5 0.4736842 0.08029903 0.3228145 0.6239669
after_m 1 0.5 0.8363636 0.05007454 0.7034711 0.9087091
Hence, according to model H1, the probability of being male for the first egg of a clutch is 0.54 (± 0.07 SE) while for the following eggs it is 0.47 (± 0.08 SE) if they are preceded by a female egg, and 0.84 (± 0.05 SE) if preceded by a male egg. 95% CIs for these estimates are, respectively: 0.39-0.68, 0.32-0.62, and 0.70-0.91. The probability that a second or third egg is male, independently from the sex of the preceding egg, is 0.68 (± 0.05 SE), and 95% CIs are 0.57-0.78.
Model testing H2: Let’s now turn to the models testing H2, namely that Pi changes linearly along the laying sequence:
# model testing H2
H2 <- glmer(sex ~ order0 + (1 | ID),
family = binomial, na.action = na.omit, data = data)
summary(H2)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: binomial ( logit )
Formula: sex ~ order0 + (1 | ID)
Data: data
AIC BIC logLik deviance
185.3910 194.3213 -89.6955 179.3910
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 1.444 1.202
Number of obs: 145, groups: ID, 50
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2578 0.3285 0.785 0.4327
order0 0.5445 0.2355 2.312 0.0208 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
order0 -0.629
rand(H2)
Analysis of Random effects Table:
Chi.sq Chi.DF p.value
(1 | ID) 24.2 1 9e-07 ***
Warning messages: (omitted)
We observe a rather large and statistically significant variation of random effects that indicates that patterns of sex allocation differ among clutches, and a statistically significant increase in the probability of being male along the laying sequence. We also calculated the probability of being male of each egg in the laying sequence and its associated SE and 95% CI based on bootstrap.
# generate the "new" dataset
newdata <- rbind(first = c(1, 0),
second = c(2, 1),
third = c(3, 2))
newdata <- data.frame(newdata)
names(newdata) <- c("ID", "order0")
newdata
ID order0
first 1 0
second 2 1
third 3 2
# calculate estimated values
P2 <- ezPredict(H2, to_predict = newdata, boot = TRUE, iterations = 1000)
P2$cells
ID order0 value var
first 1 0 0.2577754 0.10792503
second 2 1 0.8022534 0.06609414
third 3 2 1.3467313 0.13519103
RES <- P2$cells[,-1]
RES$value <- inv.logit(P2$cells$value)
RES$var <- (inv.logit(P2$cells$value + sqrt(P2$cells$var)) -
inv.logit(P2$cells$value - sqrt(P2$cells$var))) / 2
LowL <- UpL <- NULL
for(i in P2$cells$ID){
LowL <- c(LowL, inv.logit(quantile(P2$boot$value[P2$boot$ID == i], 0.025)))
UpL <- c(UpL, inv.logit(quantile(P2$boot$value[P2$boot$ID == i], 0.975)))
}
RES$LowL <- LowL
RES$UpL <- UpL
names(RES)[names(RES) == "value"] <- "prob"
names(RES)[names(RES) == "var"] <- "approx.SE"
RES
order0 prob approx.SE LowL UpL
first 0 0.5640894 0.08009660 0.4148640 0.6970521
second 1 0.6904563 0.05477569 0.5695701 0.7801849
third 2 0.7935947 0.06024437 0.6495188 0.8894666
Hence, estimated probability of being male is 0.56 (± 0.08 SE, 95% CI: 0.41-0.70) for first eggs, 0.69 (± 0.05 SE, 95% CI: 0.57-0.78) for second eggs, and 0.79 (± 0.06 SE, 95% CI: 0.65-0.89) for third eggs. The model also shows a regular increase in (logit-transformed) Pi along the laying sequence, with a change in Pi of 0.13 between first and second eggs and of 0.10 between second and third eggs.