ELECTRONIC SUPPLEMENTARY MATERIAL TO

A meta-analysis of correlated behaviours with implications for behavioural syndromes: mean effect size, publication bias, phylogenetic effects and the role of mediator variables

László Zsolt Garamszegi1*, Gábor Markó2,3 and Gábor Herczeg2,4

1Department of Evolutionary Ecology, Estación Biológica de Doñana–CSIC, c/Americo Vespucio, s/n, 41092 Seville, Spain

2Behavioural Ecology Group, Department of Systematic Zoology and Ecology, Eötvös Loránd University, Pázmány Péter sétány 1/c, H-1117, Budapest, Hungary

3Department of Plant Pathology, Corvinus University of Budapest, Ménesi út 44, H-1118 Budapest, Hungary

4Ecological Genetics Research Unit, Department of Biosciences, University of Helsinki, P.O. Box 65, FI-00014, Helsinki, Finland

Electronic Supplementary Material

Materials and Methods

Database: literature search

We obtained data on the major vertebrate lineages of non-domestic animals (fishes, amphibians, reptiles, birds and mammals). Using scientific search engines, including Web of Science and Google Scholar, we attempted to locate as many papers as possible on the relationship between behaviours in vertebrates up to 31/08/2011.

Certain behavioural traits, often called “personality” or “temperament” traits, are of particular importance in behavioural syndrome research (Réale et al 2007). Nevertheless, in a broad-sense application of the theory, the correlation and non-independence of any two functionally different behaviours are potentially relevant for behavioural syndromes. Unfortunately, a literature search based on particular keywords would be a very inefficient way to derive a representative database of correlations between all types of behaviours. Therefore, we were constrained to follow a narrow-sense criterion, focusing on those major behavioural axes (each of which can be approached by measuring several behavioural variables) that are commonly used in the personality literature. These constraints are reflected in the choice of keywords used in our literature survey. Accordingly, we performed searches based on different combinations of the terms “behaviour”, “correlation”, “personality”, “syndrome”, “consistency”, “activity”, “exploration”, “aggression”, “risk-taking” and “boldness”, separately for each taxonomic class. We also conducted cross-reference searches by consulting previous reviews on animal personality and behavioural syndromes (Sih et al 2004a; Sih et al 2004b; Groothuis and Carere 2005; Réale et al 2007; Sih and Bell 2008; Smith and Blumstein 2008; Bell et al 2009). Although we made an effort to find most of the relevant studies, we cannot rule out the possibility that we missed some of them. However, given that we relied on a large sample of effect sizes and that all studies included are cited in the most important reviews on the topic, our sample can be considered representative under the narrow-sense constraints.

Since different studies used different terminologies, we were further forced to develop universal definitions for data extraction so that we could make comparisons across studies. For these general definitions we followed the suggestions of Réale et al. (2007), and classified behavioural traits according to the following criteria. I) Activity: any variable that described the intensity of movements in a familiar environment without any social or environmental challenge. II) Aggression: any variable that depicted the intensity of antagonistic behaviour in a social challenge (i.e. against a conspecific individual of the same sex or against a mirror image) or the approach time/distance towards the social challenge. As an estimate of aggression, we also included measures of social rank between individuals that were the outcome of pair-wise aggressive interactions (i.e. competitive ranks, hierarchy), which often correlate with aggression (Watts et al 2010; Bergvall et al 2011). III) Exploration: any variable that estimated the intensity of movements in an unfamiliar environment or in the presence of a novel object, or the approach time/distance towards the novelty. IV) Risk-taking: any variable that measured the intensity of movements or the latency of resuming normal activity (e.g. resuming feeding, leaving refuge) in or after a situation that could be perceived as life-threatening for the individual (such as predator –including humans– presence, startle or other major stress factor), or that described actual contact (e.g. inspection) with or distance from a predator or predator dummy. V) Combined personality: in some cases, authors defined combined behavioural traits that involved multiple components from the above traits. We included these as combined personality if the trait was a combination of any of the above variables and contained components that reflect aggression and/or risk-taking (including complex behavioural axes that were defined based on the observation of individuals with or without a social or environmental challenge). If a particular paper used a different term for the trait studied, we redefined it according to our definitions. Réale et al. (2007) also defined an axis for sociability (i.e. “an individual’s reaction to the presence or absence of conspecifics excluding aggressive behaviour”), but we excluded this trait from our analysis, since only relatively few studies estimated sociability, and quantitative summaries from such a small sample would be meaningless.

We included studies if they contained statistical tests on the phenotypic correlation between two of the above behavioural traits across individuals, along with explicit details on the direction of the effect and sample size. When necessary statistical details were missing, authors were contacted for additional information (if correspondence was unsuccessful within a reasonable time, we were constrained to remove the study from our analysis). When it was evident that two different studies relied on the measurement of the same individuals, we achieved independence by including only the one that relied on a larger sample size or was published earlier. In very few instances, we did include estimates from studies that presented effects for the same population from different years, assuming that these data are largely independent. In some cases, studies included raw data, or we received raw data from authors of articles where formal tests of the correlations were not done; from these we calculated the correlation using a Spearman’s or a Pearson’s correlation depending on the distribution of data. When a single study provided more than one correlation, we included all of them separately in a complete raw dataset and used the procedures described below to deal with pseudoreplication. We also entered a few studies that used individuals subjected to a manipulation with potential influence on their behaviour (e.g. marking or hormonal treatment). The removal of these studies from our database does not qualitatively affect the results that we present below.

Combining the results of our searches, a total of 92 papers including 105 independent studies (see below) were identified, with publication dates ranging from 1976 to 2011. These papers yielded 250 particular effect size estimates from 65 species with sample sizes ranging from 4 to 234 individuals. The dataset included information on 4907 individuals altogether.

Note that in this paper, we aimed to determine the general relationship between any of the investigated behavioural traits and to assess factors that can affect the strength of the correlation between them regardless of the situation in which the behaviour was measured. Detailed results that are tabulated separately for each possible comparison of specific personality traits will be reported elsewhere (L.Z. Garamszegi, G. Markó, G. Herczeg unpublished manuscript).

Database

We entered all statistical tests of the relationship between the above-defined behavioural traits from each study into a database compiled in the program Comprehensive Meta-analysis (hereafter CMA, Borenstein 2010). Statistics describing the association between behaviours were converted to effect sizes in the form of a correlation coefficient (r) using conventional formulas for effect size transformation (Cohen 1992; Walker 2003; Nakagawa and Cuthill 2007). Since F-statistics were not accepted directly by CMA, we used the Meta-Analysis Calculator of Alexis Morgan Lyons-Morris, which allows conversions from one-way F-statistics to effect size. If we were confronted with non-conventional statistics for which no effect size calculations are known, we used the P and N values and entered them into CMA using “p-value and sample size for”. P values in the form of “P < x” (e.g. “P < 0.05”) were entered as “P = x”, which risks understating but not exaggerating effect sizes. Non-parametric statistics were entered into CMA and treated in the same manner as parametric statistics. In all cases, we required an effect direction to include a study in the meta-analysis. The direction was determined based on expectations from the intensity levels of behaviours, which predict positive relationships between aggression, exploration, risk-taking and activity. This does not always mean that the measured traits show a positive correlation, because latencies and approach distances, for example, are inverse measures of intensity. Hence, in such cases, the sign of the relationship was converted appropriately.
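For illustration, conversions of this kind can be sketched in Python. This is a minimal sketch under stated assumptions, not the routine used by CMA or by the online calculator: all function names are hypothetical, the F conversion assumes one numerator degree of freedom, and the P-to-r conversion uses the common approximation r = z/√N, where z is the standard normal deviate of a two-tailed P.

```python
import math
from statistics import NormalDist


def r_from_t(t, df):
    """Correlation from a t statistic with df degrees of freedom."""
    return t / math.sqrt(t * t + df)


def r_from_f(f, df_error, sign=1):
    """Correlation from a one-way F with 1 numerator df.
    F carries no direction, so the sign must be supplied by the reader."""
    return sign * math.sqrt(f / (f + df_error))


def r_from_p(p, n, sign=1):
    """Approximate correlation from a two-tailed P value and sample size
    (assumption: r = z / sqrt(n), z being the standard normal deviate)."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return sign * z / math.sqrt(n)


def orient_sign(r, n_inverse_measures):
    """Flip the sign once for each trait scored on an inverse scale
    (e.g. latency or approach distance), so that positive r always means
    'more intense behaviour goes with more intense behaviour'."""
    return -r if n_inverse_measures % 2 == 1 else r


if __name__ == "__main__":
    print(r_from_t(2.0, 12))
    print(r_from_f(4.0, 12))
    # Entering "P < 0.05" as "P = 0.05" gives the smallest |r| compatible
    # with the reported inequality, i.e. a conservative estimate.
    print(r_from_p(0.05, 100), r_from_p(0.01, 100))
    print(orient_sign(0.3, 1))
```

The comparison of `r_from_p(0.05, 100)` with `r_from_p(0.01, 100)` shows why replacing “P < x” by “P = x” can understate but not exaggerate the effect: a smaller true P always maps to a larger absolute r at a given sample size.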

For each effect size, we included the reference of the study it came from; information regarding the independence of the data depending on the degree of overlap between samples (“study”, see below); the tested pair-wise relationship; species identity; taxonomic class; whether the behavioural traits in the tested relationship were assayed in captivity (possible categories: full captivity, wild caught captivity tested, fully wild, and the combination of these if the two traits were assessed in different conditions); two estimates of the contextual overlap between the conditions of measurement of the two traits under comparison, i.e. spatial overlap (possible categories: yes, if the two assays were performed in the same physical environment, i.e. in the same experimental room, test chamber, tank or territory; no, if the two behaviours were characterised in different environments) and temporal overlap (possible categories: 0, tests were made immediately after each other; 1, tests were made with some pause but on the same day; 2, tests were made on different days but in the same season/life cycle; 3, tests were made in different seasons/life cycles but in the same year; 4, tests were made in different years); sex (male, female or both if the sample was mixed); age (immature, if individuals were assayed before their first reproductive phase; mature, if individuals were at least in their first reproductive phase); and season (breeding or not breeding). If information regarding a given category was not available, we labelled it as ‘not specified’.

We assigned each effect size entry into a “study” group based on the dependence of samples: if two effect sizes corresponded to an overlapping sample of individuals they were categorised into the same “study” id. This categorisation is not necessarily identical with the sorting based on the reference of the articles, because it often happens that a paper reports effect sizes for different samples (e.g. for different populations, species or sexes) or different papers by the same authors rely on the same sample of individuals. In such cases, we treated effect sizes from the same paper as separate studies, or used results from different sources as outcomes from a single study, respectively. Hereafter, we refer to “study” in this sense, which thus sorts data entries based on the independence of the underlying sample of individuals.

For our analyses, we used different levels of combinations of effect size to deal with different issues. First, we used the raw data of effect size entries for the hierarchical modelling of different random effects (“study” and “species”). Second, we averaged effect sizes within studies by calculating the mean effect size weighted by the sample size. These effect sizes that combined different pair-wise correlations between behaviours at the study level were utilized to assess publication bias and to partition heterogeneities across moderator variables. Third, we also calculated effect sizes at the species level that were used in the phylogenetic meta-analysis. For this purpose, we entered species as a grouping variable into a CMA analysis (see below), and then derived species-specific effect size and variance estimates from the output tables.
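The second step, averaging effect sizes within studies with sample size as weight, can be illustrated as follows. This is a minimal sketch: the function name and the tuple layout are hypothetical, and the averaging is done directly on r, which is an assumption, since the scale on which the averaging was performed (r or Fisher’s Z) is not stated above.

```python
from collections import defaultdict


def study_level_means(entries):
    """entries: iterable of (study_id, r, n) tuples, one per effect size.
    Returns {study_id: mean r weighted by sample size}."""
    weighted_sum = defaultdict(float)
    total_n = defaultdict(float)
    for study_id, r, n in entries:
        weighted_sum[study_id] += r * n
        total_n[study_id] += n
    return {s: weighted_sum[s] / total_n[s] for s in weighted_sum}


if __name__ == "__main__":
    raw = [
        ("study_A", 0.2, 10),  # two correlations from the same individuals
        ("study_A", 0.4, 30),
        ("study_B", 0.1, 5),   # a study contributing a single correlation
    ]
    print(study_level_means(raw))
```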

Repeatability describes the proportion of the total variance that is due to between-individual variance, and reflects the importance of inconsistencies in individual behaviour as well as measurement error when estimating individual-specific trait values (Nakagawa and Schielzeth 2010; Wolak et al 2012). Low repeatability implies imprecise measurement and/or high intra-individual flexibility (or indistinguishable variation between individuals), which can have different theoretical implications for consistent behaviours than high repeatability (Dingemanse et al 2010b). Therefore, we also entered information on the repeatability of the assayed behaviours, if the corresponding study provided such an estimate based on the repeated measurement of the same traits (i.e. if individuals were scored for the same behaviour more than once). Repeatability was derived in the form of an ANOVA-based estimate (Lessells and Boag 1987) or as an intra-class correlation coefficient (Sokal and Rohlf 1995). If repeatability was provided for both traits used in the correlation, we calculated their geometric mean to obtain a single estimate at the unit level (using arithmetic means gives similar results).
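As an illustration, the ANOVA-based repeatability and the geometric mean of two repeatabilities can be computed as below. This is a minimal sketch with hypothetical function names. It assumes the standard form r = s²A / (s² + s²A), with s²A = (MS_among − MS_within) / n0, where n0 is the (effective) number of measurements per individual; the mean squares are taken as given rather than computed from raw data.

```python
import math


def anova_repeatability(ms_among, ms_within, n0):
    """Repeatability from one-way ANOVA mean squares.
    ms_among:  mean square among individuals
    ms_within: mean square within individuals
    n0:        (effective) number of measurements per individual"""
    var_among = (ms_among - ms_within) / n0
    return var_among / (ms_within + var_among)


def mean_repeatability(rep_trait_1, rep_trait_2):
    """Geometric mean of the repeatabilities of the two correlated traits."""
    return math.sqrt(rep_trait_1 * rep_trait_2)


if __name__ == "__main__":
    print(anova_repeatability(ms_among=10.0, ms_within=2.0, n0=2))
    print(mean_repeatability(0.4, 0.9))
```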

To describe the non-independence of data due to the evolutionary history of species and to perform phylogenetic meta-analyses, we assembled a phylogenetic tree of the species in our database (Figure 1). The topology of clade-specific phylogenies originated from different sources (mammals: Bininda-Emonds et al 2007; birds: Davis 2008; fishes: Li et al 2008). Different clades were connected based on the phylogenetic tree of the Tree of Life Project (Maddison et al 2007). Since branch lengths cannot be combined across different sources, we determined branch lengths by considering ages of taxa as being proportional to the number of species they contain (similar to a gradual model of evolution, Pagel 1997).

Analyses

We performed all traditional and phylogenetic meta-analytic tests using the normalized score of r, Fisher’s Z (Borenstein 2010). We generally used random-effects models, because we expected variability in the effects being measured among different species or conditions. Random-effects models are appropriate when heterogeneity is relatively high in the data (DerSimonian and Laird 1986), as was the case in our sample (see Results).
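The logic of a random-effects model on Fisher’s Z can be sketched as follows. This is an illustrative sketch only, not the CMA implementation: the function name is hypothetical, the sampling variance of Z is taken as 1/(N − 3), the between-study variance is estimated with the DerSimonian and Laird moment estimator, and the 95% interval uses a normal approximation.

```python
import math


def random_effects_mean(rs, ns):
    """Random-effects mean correlation from correlations rs and sample
    sizes ns (each n must exceed 3). Returns a dict with the
    back-transformed mean r, its 95% interval, tau^2 and Cochran's Q."""
    zs = [math.atanh(r) for r in rs]            # Fisher's Z
    vs = [1.0 / (n - 3) for n in ns]            # sampling variance of Z
    ws = [1.0 / v for v in vs]                  # fixed-effect weights
    k = len(zs)
    fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    q = sum(w * (z - fixed) ** 2 for w, z in zip(ws, zs))
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    # Moment estimator of between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    ws_re = [1.0 / (v + tau2) for v in vs]      # random-effects weights
    mean_z = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1.0 / sum(ws_re))
    lower, upper = mean_z - 1.96 * se, mean_z + 1.96 * se
    return {
        "r": math.tanh(mean_z),
        "ci": (math.tanh(lower), math.tanh(upper)),
        "tau2": tau2,
        "Q": q,
    }


if __name__ == "__main__":
    print(random_effects_mean([-0.5, 0.0, 0.8], [50, 50, 50]))
```

When Q does not exceed its degrees of freedom, the estimated between-study variance is zero and the result coincides with the fixed-effect estimate.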

Before running meta-analytic tests, we dealt with issues about pseudoreplication. This was inevitable because the same studies often provided multiple correlations (when more than two behavioural traits were assessed) based on the same individuals, thus some effect sizes were not independent from others. Moreover, the data were also structured non-randomly due to phylogenetic associations. As a consequence, closely related species or different populations of the same species might show more similar phenotypes (including correlations between behaviours) than distantly-related entities (Felsenstein 1985). Therefore we modelled the non-independence of data due to consistent study-specific as well as phylogenetic effects by using different approaches.

First, relying on the raw effect size data including multiple entries per study, we used mixed-effect meta-analytical modelling with “study” and “species” as random factors while simultaneously incorporating the phylogenetic covariance structure. For this analysis, we used the Bayesian quantitative genetic framework that applies Markov chain Monte Carlo algorithms for phylogenetic mixed models to correct for non-random sampling due to different sources in a meta-analysis (Hadfield and Nakagawa 2010). We created five models to address how much variation in the data can be attributed to the different hierarchical levels of interest. The null model was a simple random-effect meta-analytic model based on the raw effect size data that assumes full independence. The second model was an extended version of the null model that also included a random term for “study” (i.e. assuming non-independence within studies). The third model was an analogous model that included “species” as a random effect without considering phylogenetic associations (i.e. assuming non-independence within species). The fourth model had the same structure, but also considered phylogenetic variance components. The final model was a complex phylogenetic random-effect meta-analytic model that simultaneously included the two random factors. We calculated the mean effect size and the associated 95% confidence interval from each model. The fit of the models was compared by calculating the Deviance Information Criterion (DIC), with the lowest value indicating the relatively best fit to the data.

Second, we tested if the general correlation between behaviours is study-specific, and the mean values at the study level hold biological information. We examined if the within-study component of variance of effect sizes that describe the relationship between different behaviours is smaller than the variance of effect sizes between studies. These variance components were compared in an ANOVA model that relied on the raw dataset and included ‘study’ as the main factor and sample sizes as statistical weight.
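The comparison of variance components can be illustrated with a weighted one-way ANOVA. This is a minimal sketch under assumptions: the function name and data layout are hypothetical, and weights enter the sums of squares directly as multipliers, which is one common way to implement a weighted ANOVA and may differ in detail from the software used for the reported analysis. The F ratio would then be compared with an F distribution with the returned degrees of freedom.

```python
from collections import defaultdict


def weighted_one_way_anova(entries):
    """entries: iterable of (study_id, effect_size, weight) tuples.
    Returns weighted between- and within-study mean squares and their
    ratio. Requires at least two studies, at least one study with more
    than one entry, and non-zero within-study variation."""
    groups = defaultdict(list)
    for study_id, es, w in entries:
        groups[study_id].append((es, w))
    n_entries = sum(len(g) for g in groups.values())
    total_w = sum(w for g in groups.values() for _, w in g)
    grand = sum(es * w for g in groups.values() for es, w in g) / total_w
    ss_between = 0.0
    ss_within = 0.0
    for g in groups.values():
        g_w = sum(w for _, w in g)
        g_mean = sum(es * w for es, w in g) / g_w
        ss_between += g_w * (g_mean - grand) ** 2
        ss_within += sum(w * (es - g_mean) ** 2 for es, w in g)
    df_between = len(groups) - 1
    df_within = n_entries - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
        "df": (df_between, df_within),
    }


if __name__ == "__main__":
    data = [("A", 0.1, 10), ("A", 0.3, 10), ("B", 0.5, 10), ("B", 0.7, 10)]
    print(weighted_one_way_anova(data))
```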

To assess the effect of different moderator variables and publication bias, we used the dataset that combined effect sizes within studies and thus had independent entries with no overlap in the sample of individuals. To examine what factors were responsible for the heterogeneity among studies, we conducted heterogeneity tests with the inclusion of moderator variables as grouping variables, and tested whether effect sizes were heterogeneous among and within particular groups, in a similar way to one-way ANOVA. The following moderators were considered as grouping variables: species, taxonomic class, captivity, spatial overlap, sex, age, and season. As temporal overlap can be interpreted along a continuous scale, we also conducted a meta-regression analysis to test for the effect of this variable on the predicted relationship. To investigate whether the repeatability of traits has an influence on the strength of the detected correlation between them, we performed a meta-regression, in which the geometric mean of the traits’ repeatability was the predictor and absolute effect size (i.e. regardless of the direction of the effect) was the response variable. For this analysis, we applied Fisher’s Z transformation to the repeatability estimates. Note that the absolute values of effect sizes follow a folded normal distribution, thus such non-normally distributed data should be used in a meta-analysis with caution (see Kingsolver et al 2012).
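The ANOVA-like partitioning of heterogeneity can be sketched as follows. This is an illustrative sketch with hypothetical names, using fixed-effect (inverse-variance) weights for simplicity; the actual CMA analysis may weight studies differently. Total heterogeneity Q is split into a within-group and a between-group component, each of which would be compared with a chi-square distribution with the returned degrees of freedom.

```python
from collections import defaultdict


def q_statistic(zs, vs):
    """Cochran's Q: weighted sum of squared deviations from the
    inverse-variance weighted mean."""
    ws = [1.0 / v for v in vs]
    mean = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return sum(w * (z - mean) ** 2 for w, z in zip(ws, zs))


def partition_heterogeneity(zs, vs, labels):
    """Split total Q into within- and between-group parts for one
    categorical moderator (labels gives the group of each study)."""
    groups = defaultdict(lambda: ([], []))
    for z, v, label in zip(zs, vs, labels):
        groups[label][0].append(z)
        groups[label][1].append(v)
    q_total = q_statistic(zs, vs)
    q_within = sum(q_statistic(gz, gv) for gz, gv in groups.values())
    return {
        "Q_total": q_total,
        "Q_within": q_within,
        "Q_between": q_total - q_within,
        "df_between": len(groups) - 1,
        "df_within": len(zs) - len(groups),
    }


if __name__ == "__main__":
    zs = [0.1, 0.3, 0.5, 0.7]              # study-level Fisher's Z
    vs = [0.1, 0.1, 0.1, 0.1]              # their sampling variances
    labels = ["wild", "wild", "captive", "captive"]
    print(partition_heterogeneity(zs, vs, labels))
```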

Publication bias is an important concern in meta-analyses, because the published literature may have an over-abundance of significant results (Palmer 1999; Møller and Jennions 2001; Jennions and Møller 2002). To assess the prevalence of publication bias, we drew funnel plots, which plot the effect size against the precision of each study to examine asymmetry in their distribution around the mean (Light and Pillemer 1984; Light et al 1994). The rank correlation between effect size and sample size (or variance) is a mathematical formulation of the funnel plot, and thus can be used to test for funnel-plot asymmetry (Begg and Mazumdar 1994). This method relies on the underlying assumption that studies with small sample sizes are more prone to publication bias, while large studies are likely to be published regardless of the significance of the results. Accordingly, a negative correlation between the effect size estimates and sample size (or a positive relationship between effect size and standard errors) reflects a trend towards larger effect sizes in studies with smaller samples, and is regarded as indicative of publication bias. Therefore, we applied Begg’s method to identify publication bias (Begg and Mazumdar 1994). To calculate mean effect sizes while controlling for publication bias, we applied a trim-and-fill algorithm, in which symmetry in the funnel plot is restored by filling in theoretical missing data points arising from publication bias (Duval and Tweedie 2000). The above analyses of publication bias were done in CMA based on the study-averaged effect size data.
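The rank correlation test can be illustrated as follows. This is a minimal sketch, not the CMA routine: the function name is hypothetical, effect sizes are standardized around the inverse-variance weighted mean following Begg and Mazumdar (1994), Kendall’s tau is computed without a correction for ties, and the P value uses a normal approximation without continuity correction. The trim-and-fill adjustment is not covered here.

```python
import math
from statistics import NormalDist


def begg_rank_test(zs, vs):
    """Rank correlation between standardized effect sizes and their
    sampling variances. zs: effect sizes (e.g. Fisher's Z), vs: their
    variances; at least two studies are required.
    Returns (Kendall's tau, two-tailed P)."""
    ws = [1.0 / v for v in vs]
    pooled = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    # Variance of (z_i - pooled): own variance minus variance of the mean.
    v_star = [v - 1.0 / sum(ws) for v in vs]
    std = [(z - pooled) / math.sqrt(s) for z, s in zip(zs, v_star)]
    k = len(zs)
    score = 0  # concordant minus discordant pairs
    for i in range(k):
        for j in range(i + 1, k):
            prod = (std[i] - std[j]) * (vs[i] - vs[j])
            score += (prod > 0) - (prod < 0)
    tau = score / (k * (k - 1) / 2)
    z_stat = score / math.sqrt(k * (k - 1) * (2 * k + 5) / 18)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return tau, p


if __name__ == "__main__":
    # Hypothetical data in which smaller studies report larger effects.
    effects = [0.8, 0.6, 0.4, 0.2, 0.1]
    variances = [1 / 5, 1 / 10, 1 / 20, 1 / 50, 1 / 100]
    print(begg_rank_test(effects, variances))
```

A positive tau between standardized effect size and variance corresponds to the pattern described above, in which less precise studies report larger effects.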