Analysis of sex sequences by means of generalized linear mixed models

Tutorial for the analyses

Behavioral Ecology and Sociobiology

R. AMBROSINI, D. RUBOLINI, N. SAINO

 Roberto Ambrosini

Dipartimento di Biotecnologie e Bioscienze, Università degli Studi di Milano-Bicocca, P.zza della Scienza 2, 20126 Milano, Italy;

email:

Introduction

We provide a tutorial for the analysis of sex sequences using Generalized Linear Mixed Models (GLMMs), with step-by-step commented analyses and the associated R code. The code has been produced using R version 3.0.2 (R Core Team 2013), RStudio version 0.97.551 (Racine 2012) and associated libraries. We start by generating an exemplificative data file that will help in testing different hypotheses of variation in sex sequences. Readers should carefully refer to the main text of the article for details of each hypothesis being investigated.

Data preparation

The file we will use for the analyses consists of 50 hypothetical clutches (identified by variable ID) of a hypothetical bird species laying clutches of 3 eggs. The rather small sample size is justified by the exemplificative purpose of this Tutorial. Laying sequence is coded from 1 to 3 (variable order), with egg 1 being the first and egg 3 being the last laid one. Data were generated according to a pattern of sex allocation whereby the sex of an egg influences that of the subsequent one in the laying sequence, with the following parameters: P1 = 0.6, Pi|(i-1)m = 0.8, Pi|(i-1)f = 0.4. There is therefore a rather large effect of the sex of the previous egg on that of the following ones, butthese values are not biologically unreasonable (see the example on eclectus parrots Eclectus roratus below). Sex sequence data were generated by drawing random numbers from a uniform distribution in [0:1] and then assigning the sex of the eggs according to random numbers being smaller or larger than the probability of being male calculated for that egg according to the parameters reported above. We also included 3% of random missing values in the binary dependent variable (sex), representing the sex of the embryo (0 = female, 1 = male).

# generate data file

data <- data.frame(cbind(

ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8,

9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15,

15, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 20, 21, 21, 21,

22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 26, 27, 27, 27, 28,

28, 28, 29, 29, 29, 30, 30, 30, 31, 31, 31, 32, 32, 32, 33, 33, 33, 34, 34,

34, 35, 35, 35, 36, 36, 36, 37, 37, 37, 38, 38, 38, 39, 39, 39, 40, 40, 40,

41, 41, 41, 42, 42, 42, 43, 43, 43, 44, 44, 44, 45, 45, 45, 46, 46, 46, 47,

47, 47, 48, 48, 48, 49, 49, 49, 50, 50, 50),

order = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,

1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,

3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,

2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,

1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,

3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),

sex = c(0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, NA, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,

1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,

1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,

0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, NA, 1, 0, 0, 1, 0, 0, 1, 1, 1,

0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, NA, 0, 1, 1)

))

#show dataset

edit(data)

We then generate the AF variable, as well as the variable prevsex, representing the sex of the previous egg (+0.5 = male, -0.5 = female) and assuming the value of 0 for the first laid egg in each clutch, and the variable order0, which is equal to (order - 1).

# ensure that the dataset is sorted by increasing ID and laying sequence

ord <- order(data$ID, data$order)

data <- data[ord, ]

# generate the AF variable

AF <- as.numeric(I(data$order > 1) + 0)

# generate variable “prevsex”

prevsex <- rep(0, nrow(data))

for (j in unique(data$ID)) {

if (max(data$order[data$ID == j]) > 1) {

for (i in 2:(max(data$order[data$ID == j]))) {

prevsex[data$ID == j & data$order == i] <- data$sex[data$ID == j &

data$order == (i - 1)] - 0.5

}

}

}

# generate variable “order0”

order0 <- data$order - 1

data <- cbind(data, AF, prevsex, order0)

# final dataset

edit(data)

We have now generated all the variables needed to fit the binomial GLMMs testing for different patterns of variation in sex sequences (see main text). Hereafter, as in the main text, Pi denotes the probability that a given egg of order i is a male egg. We will start the analyses by exploring the data in order to assess whether patterns of variation in egg sex do appear.

Data exploration

We will produce a set of graphs that should help highlighting patterns of sex allocation. We start by producing two tables that summarize the data and will be useful to build graphs. TAB1 includes the proportion of male eggs along the laying sequence,TAB2 the number of male eggs in first laid eggs and in any following egg preceded by a male or a female egg. These two tables and the graphs showing their results are generated with the aim of visually inspecting the most relevant patterns of variation in egg sex that will be tested in the models.

# tabulate data

TAB1 <- aggregate(data$sex, by = list(data$order), FUN = mean, na.rm = TRUE)

colnames(TAB1) <- c("order", "prop.males")

TAB2 <- xtabs(~ sex + prevsex, data = data)[,c(2,3,1)]

colnames(TAB2) <- c("first", "M", "F")

TAB2 <- TAB2[2,]/apply(TAB2, 2, sum)

The following code will produce a panel of four graphs based on the tables generated so far. The first two graphs represent variation in probability of being male according, respectively, to laying order and to the sex of the preceding egg. The models we will fit are based on the logit transformation. Plotting the logit(Pi) instead of Pi may improve clarity of graphs. The third and the fourth graphs therefore represent the logit-transformed probability of being male according to laying order and sex of the preceding egg.

# graphs

def.par <- par(no.readonly = TRUE) # save default, for resetting

layout(matrix(1:4, 2, 2, byrow = TRUE))

barplot(TAB1$prop.males, names.arg = TAB1$egg,

xlab = "laying order", ylab = "proportion of male eggs")

barplot(TAB2, names.arg = c("first", "M", "F"),

xlab = "sex of the preceding egg", ylab = "proportion of male eggs")

require(boot) # for logit transformation

barplot(logit(TAB1$prop.males), names.arg = TAB1$egg,

xlab = "laying order",

ylab = "logit proportion of male eggs")

barplot(logit(TAB2), names.arg = c("first", "M", "F"),

xlab = "sex of the preceding egg",

ylab = "logit proportion of male eggs")

par(def.par) # reset default values

The upper left graph shows an increase in Pi along the laying sequence. The upper right one indicates that Pi is larger for eggs preceded by a male than a female egg. The lower graphs show the same pattern of variation more clearly because they are based on logit-Pi.

Model interpolation

Information obtained in the explorative analyses suggests that both a sequence effect and a previous-egg effect are plausible. We will therefore fit different models in order to investigate different putative patterns of sex allocation. Models will be fitted using the glmer function in package lme4 version 1.0-5 (Bates et al. 2013) and other functions in the package lmerTest (Kuznetsova et al. 2013) will be used to explore the results. The results obtained by previous version of the same package may slightly differ due to changes in the underlying computation algorithm.

Model testing H0: We start by fitting an intercept only model (null model), testing H0, i.e. that Piis constant along the laying sequence.

require(lme4)

# model testing H0

H0 <- glmer(sex ~ 1 + (1 | ID), family = binomial, na.action = na.omit,

data = data)

summary(H0)

Generalized linear mixed model fit by maximum likelihood ['glmerMod']

Family: binomial ( logit )

Formula: sex ~ 1 + (1 | ID)

Data: data

AIC BIC logLik deviance

188.2088 194.1623 -92.1044 184.2088

Random effects:

Groups Name Variance Std.Dev.

ID (Intercept) 1.154 1.074

Number of obs: 145, groups: ID, 50

Fixed effects:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.7403 0.2393 3.093 0.00198 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’0.1 ‘ ’ 1

We first observe that the random intercept variance is rather large (1.154) when compared to the residual variance, which is equal to π2/3 = 3.290 (the glmer algorithm does not fit overdispersed models), which suggests that unmodelled effects affect egg sex. This observation is straightforward if the random intercept variance of this model is compared to that of the model fitting H1 (see below) which is equal to zero. The fact that random intercept variance is lower than the residual variance indicates that most of the variation occurs within clutches, and provides support for the investigation of within-clutches processes of sex allocation. Indeed, in our dataset, random intercept variation in model testing H0 occurs because clutches that happened to have a female first egg likely have more female eggs. However, a large random variance may also be the consequence of processes acting between clutches as, for example, if females in poor conditions lay more female eggs (see also the main text), so that, in a real dataset, it may be worth to investigate also between-clutches processes.

Alternatively, it is possible to test for the significance of the random effect by a likelihood-ratio test which indicate that it is highly significant, thus confirming that there is large variation in the average probability of being male between clutches.

require (lmerTest)

rand(H0)

Analysis of Random effects Table:

Chi.sq Chi.DF p.value

(1 | ID) 18.8 1 1e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Warning messages: (omitted)

This procedures produced warning messages, that were omitted here. Fixed effects (above) indicate that the intercept (0.74 ± 0.24 SE) significantly differs from 0. There is therefore a larger proportion of male than female eggs. Estimated Pi is 0.68 ± 0.05 SE and the 95% CI for Pi is 0.57-0.77. This calculation was made as follows:

inv.logit(0.7403)

[1] 0.6770615

The logit-transformed SE cannot be directly converted into probabilities by means of an inverse logit transformation. Approximate SE of Pi can be calculated as follows, while 95% CI can be precisely obtained by profiling the likelihood function (Vanzon & Moolgavkar 1988):

# SE

(inv.logit(0.7403 + 0.2393) - inv.logit(0.7403 - 0.2393)) / 2

[1] 0.05216727

CI.H0 <- confint(H0)

CI.H0

2.5 % 97.5 %

.sig01 0.3043934 1.931325

(Intercept) 0.2564502 1.346675

inv.logit(CI.H0[2,])

2.5 % 97.5 %

0.5637635 0.7935856

The confint function provides CIs for all the parameters estimated by the model i.e. the standard deviation of the random effect (1.074) and of the intercept (0.740). Alternatively, approximated CI for the fixed effects can be obtained by the Wald method (shown here only for comparison):

# 95% CI, lower limit

inv.logit(0.7403 + qt(0.025, 50) * 0.2393)

[1] 0.52645507

# 95% CI, upper limit

inv.logit(0.7403 + qt(0.975, 50) * 0.2393)

[1] 0.7722304

where function qt is used to calculate 2.5% and 97.5% quantiles of a t distribution with degree of freedom conservatively set to the number of clutches (50 in our example).

Model testing H1: We then fit a model testing H1, i.e. that the sex of an egg depends on that of the previously laid egg:

# model testing H1

H1 <- glmer(sex ~ AF + prevsex + (1 | ID),

family = binomial, na.action = na.omit, data = data)

summary(H1)

Generalized linear mixed model fit by maximum likelihood ['glmerMod']

Family: binomial ( logit )

Formula: sex ~ AF + prevsex + (1 | ID)

Data: data

AIC BIC logLik deviance

178.5898 190.4412 -85.2949 170.5898

Random effects:

Groups Name Variance Std.Dev.

ID (Intercept) 0 0

Number of obs: 143, groups: ID, 50

Fixed effects:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.1603 0.2838 0.565 0.572019

AF 0.6027 0.3743 1.610 0.107382

prevsex 1.7368 0.4883 3.557 0.000375 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’0.1 ‘ ’ 1

Correlation of Fixed Effects:

(Intr) F

AF -0.758

prevsex 0.000 0.075

rand(H1)

Random term (1 | ID) was eliminated because of standard deviation being equal to 0

Analysis of Random effects Table:

Warning messages: (omitted)

First of all, we note that there is no random between-clutches variation in the mean (logit-transformed) probability of being males, as the variance of the random intercept is equal to 0. This occurs because when we generated the dataset we introduced no between-clutches variation in the effect of the sex of the preceding egg on that of the following one (indeed, we always use Pi|(i-1)m = 0.8 and Pi|(i-1)f = 0.4 for all clutches). Significance test for the random effect cannot be computed, because there is no variance of the random term (and the procedure generates warnings).

The fact that random variation greatly decreased in model H1 with respect to model H0 suggests that model H1includes variables accounting for the mechanism generating the amount of variation that was unmodelled in H0 and accounted for by the random intercept. We can expand a little bit on this point here, because we know the actual pattern used to generate egg sexes, i.e. that the sex of an egg is influenced by that of the previous one. This pattern however generates random variation in the average sexratio between clutches: those clutches where the first egg is female also happen to have a larger proportion of females because of the mechanisms that determine egg sex within clutches.

A model not including variables accounting for this mechanisms, like H0, cannot account for this variation in any other way than in the variance in the random effect. Model H0 is therefore suggesting that “sex ratios vary between clutches, but we don’t know why”. Conversely, models H1 precisely accounts for the mechanisms generating egg sex, and there is no other mechanisms acting between clutches because, having accounted for this within-clutch variation, there is no residual between-clutches variation. In other words, model H1suggests that “we know what is going on: the sex of an egg varies according to that of the preceding one. This mechanism acts within clutches, and there is nothing going on between clutches”.

All these pieces of evidence suggest that also the random intercept is redundant in this particular case. However, we refrain from further simplifying the random structure of the model, since this would imply removing the grouping factor ID, which would be questionable in analyses of real datasets.

The results from H1 indicate a statistically significant effect of prevsex, i.e. the sex of an egg varies according to that of the previously laid egg. Specifically, model estimates show that the probability that the first egg is male is 0.54, and that, for eggs of any laying order, the probability of being male for an egg preceded by a female egg is 0.47, while the probability of an egg being male if preceded by a male egg is equal to 0.84. In addition, the probability of being male of a second or third egg, independently from the sex of the preceding egg, is 0.68. These probabilities were calculated as follows (output after each code line):

inv.logit(0.1603)

[1] 0.5399894

inv.logit(0.1603 + 0.6027 + 1.7368 * -0.5)

[1] 0.4736744

inv.logit(0.1603 + 0.6027 + 1.7368 * 0.5)

[1] 0.8363613

inv.logit(0.1603 + 0.6027)

[1] 0.6820047

There is also another way to obtain such estimates. We start by producing a dataset (newdata hereafter), with the same variables entered in the model (ID, AF, and prevsex). We will assign arbitrary values to ID (e.g. 1, 2, 3, 4; this will not affect the results) while the other ones will take the specific values corresponding to the estimate we are interested in. For example, if we are interested in estimating the probability of being male for a first egg, we must set AF = 0 and prevsex = 0 in the newdata dataset because our coding implies that first eggs have 0 in both these variables. If, conversely, we are interested in estimating the same probability for a second or third egg preceded by a female egg we must set AF = 1 and prevsex = -0.5 for these eggs, and AF = 1 and prevsex = +0.5 if we are interested in estimating the probability of being male of an egg preceded by a male one. Finally, we may be interested in estimating the probability of being male for a second or third egg, independently of the sex of the preceding one. This can be done by settingAF = 1 and prevsex = 0. We stress that these values must be entered exactly in the same order as the independent variables appear in the model output. This dataset is produced as follows:

# generate the "newdata" dataset

newdata <- rbind(first = c(1, 0, 0),

later = c(2, 1, 0),

after_f = c(3, 1, -0.5),

after_m = c(4, 1, +0.5))

newdata <- data.frame(newdata)

names(newdata) <- c("ID", "AF", "prevsex")

newdata

ID AF prevsex

first 1 0 0.0

later 2 1 0.0

after_f 3 1 -0.5

after_m 4 1 0.5

We then use the ezPredict function in the ez library (Lawrence 2013) to obtain predicted values from the fixed effects of the GLMMs. This function also uses a bootstrap to generate variances.

# calculate predicted values

require(ez)

P1 <- ezPredict(H1, to_predict = newdata, boot = TRUE, iterations = 1000)

P1$cells

ID AF prevsex value var

first 1 0 0.0 0.1603427 0.08051526

later 2 1 0.0 0.7630281 0.05960125

after_f 3 1 -0.5 -0.1053605 0.10555559

after_m 4 1 0.5 1.6314168 0.13284940

These values are in logit. We can calculate the corresponding probability and SEs and obtain (empirical) 95% CIs for probabilities from the bootstrapped values:

# calculate predicted probabilities and CIs

RES <- P1$cells[,-1]

RES$value <- inv.logit(P1$cells$value)

RES$var <- (inv.logit(P1$cells$value + sqrt(P1$cells$var)) -

inv.logit(P1$cells$value - sqrt(P1$cells$var))) / 2

LL <- UL <- NULL

for(i in P1$cells$ID){

LL <- c(LL, inv.logit(quantile(P1$boot$value[P1$boot$ID == i], 0.025)))

UL <- c(UL, inv.logit(quantile(P1$boot$value[P1$boot$ID == i], 0.975)))

}

RES$LL <- LL

RES$UL <- UL

names(RES)[names(RES) == "value"] <- "prob"

names(RES)[names(RES) == "var"] <- "approx.SE"

RES

AF prevsex prob approx.SE LL UL

first 0 0.0 0.5400000 0.07002378 0.3946451 0.6773813

later 1 0.0 0.6820108 0.05278756 0.5687269 0.7757951

after_f 1 -0.5 0.4736842 0.08029903 0.3228145 0.6239669

after_m 1 0.5 0.8363636 0.05007454 0.7034711 0.9087091

Hence, according to model H1, the probability of being male for the first egg of a clutch is 0.54 (± 0.07 SE) while for the following eggs it is 0.47 (± 0.08 SE) if they are preceded by a female egg, and 0.84 (± 0.05 SE) if preceded by a male egg. 95% CIs for these estimates are, respectively: 0.39-0.68, 0.32-0.62, and 0.70-0.91. The probability that a second or third egg is male, independently from the sex of the preceding egg, is 0.68 (± 0.05 SE), and 95% CIs are 0.57-0.78.

Model testing H2: Let’s now turn to the models testing H2, namely that Pi changes linearly along the laying sequence:

# model testing H2

H2 <- glmer(sex ~ order0 + (1 | ID),

family = binomial, na.action = na.omit, data = data)

summary(H2)

Generalized linear mixed model fit by maximum likelihood ['glmerMod']

Family: binomial ( logit )

Formula: sex ~ order0 + (1 | ID)

Data: data

AIC BIC logLik deviance

185.3910 194.3213 -89.6955 179.3910

Random effects:

Groups Name Variance Std.Dev.

ID (Intercept) 1.444 1.202

Number of obs: 145, groups: ID, 50

Fixed effects:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.2578 0.3285 0.785 0.4327

order0 0.5445 0.2355 2.312 0.0208 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:

(Intr)

order0 -0.629

rand(H2)

Analysis of Random effects Table:

Chi.sq Chi.DF p.value

(1 | ID) 24.2 1 9e-07 ***

Warning messages: (omitted)

We observe a rather large and statistically significant variation of random effects that indicates that patterns of sex allocation differ among clutches, and a statistically significant increase in the probability of being male along the laying sequence. We also calculated the probability of being male of each egg in the laying sequence and its associated SE and 95% CI based on bootstrap.

# generate the "new" dataset

newdata <- rbind(first = c(1, 0),

second = c(2, 1),

third = c(3, 2))

newdata <- data.frame(newdata)

names(newdata) <- c("ID", "order0")

newdata

ID order0

first 1 0

second 2 1

third 3 2

# calculate estimated values

P2 <- ezPredict(H2, to_predict = newdata, boot = TRUE, iterations = 1000)

P2$cells

ID order0 value var

first 1 0 0.2577754 0.10792503

second 2 1 0.8022534 0.06609414

third 3 2 1.3467313 0.13519103

RES <- P2$cells[,-1]

RES$value <- inv.logit(P2$cells$value)

RES$var <- (inv.logit(P2$cells$value + sqrt(P2$cells$var)) -

inv.logit(P2$cells$value - sqrt(P2$cells$var))) / 2

LowL <- UpL <- NULL

for(i in P2$cells$ID){

LowL <- c(LowL, inv.logit(quantile(P2$boot$value[P2$boot$ID == i], 0.025)))

UpL <- c(UpL, inv.logit(quantile(P2$boot$value[P2$boot$ID == i], 0.975)))

}

RES$LowL <- LowL

RES$UpL <- UpL

names(RES)[names(RES) == "value"] <- "prob"

names(RES)[names(RES) == "var"] <- "approx.SE"

RES

order0 prob approx.SE LowL UpL

first 0 0.5640894 0.08009660 0.4148640 0.6970521

second 1 0.6904563 0.05477569 0.5695701 0.7801849

third 2 0.7935947 0.06024437 0.6495188 0.8894666

Hence, estimated probability of being male is 0.56 (± 0.08 SE, 95% CI: 0.41-0.70) for first eggs, 0.69 (± 0.05 SE, 95% CI: 0.57-0.78) for second eggs, and 0.79 (± 0.06 SE, 95% CI: 0.65-0.89) for third eggs. The model also shows a regular increase in (logit-transformed) Pi along the laying sequence, with a change in Pi of 0.13 between first and second eggs and of 0.10 between second and third eggs.