Investigating Mitochondrial DNA Relationships in Neolithic Western Europe Through Serial

Title:

Investigating mitochondrial DNA relationships in Neolithic Western Europe through serial coalescent simulations

Authors:

Maïté Rivollat, Stéphane Rottier, Christine Couture, Marie-Hélène Pemonge, Fanny Mendisco, Mark G. Thomas, Marie-France Deguilloux, Pascale Gerbault

SUPPLEMENTARY MATERIAL

* Descriptive statistics

Under a Wright-Fisher model, at similar effective population size with no migration, pairwise FST measures the genetic differentiation caused by drift between two populations. We consequently used FST as a summary statistic to assess whether levels of population differentiation greater than observed could be obtained under a simple model of population panmixia (i.e. when genetic drift is the only main process at play). We thus calculated 6 pairwise FST as a measure of genetic differences between the four groups. These statistics were determined at the ‘GROUP_LEVEL’ with ARLSUMSTAT version 3.5.1.2(ref. 1), a modified version of Arlequin for computing summary statistics on linux operating system. Observed pairwise FST values are given in Figure 3.

* Serial Coalescent Simulations

Computer simulations are an essential analytical tool that helps to test alternative hypotheses and explore empirical data2. More specifically, the stochasticity of the genealogical process needs to be accounted for when interpreting genetic data3. The coalescent is an extremely efficient mathematical construct that deals precisely with gene genealogy stochasticity. We consequently used serial coalescent simulations that allows sampling of ancient DNA at different point backward in time to test whether genetic drift alone within a single panmictic population could explain observed patterns of mtDNA differentiation between ancient hunter-gatherers (HG), central (Central-F) and southern Neolithic (South-F) farmer samples and Gurgy.

In the serial coalescent simulations modeltwo demographic events are assumed, as in Bramanti et al. 2009. The first one follows an initial colonization of Europe 45000 years ago of female effective population size NUP, sampled from NA, an ancestral non-African effective population size of 5000 females. NA is derived from the commonly used long-term effective human population size of 10000 individuals outside Africa5-8 and assuming a 1:1 female to male ratio. The second demographic event follows the Neolithic transition in Western Europe 5900 years cal. BP of female effective population size NN.

The latter date (5900 years cal. BP) determines the “t0” in our coalescent simulations, since it is the median calibrated C14 date of the youngest ancient mtDNA group we used, see Table S1. We explored 50 values for NUP ranging from 1 to 5000 and 50 values for NN ranging from 10 to 100000 (see Table S2).

Our model differs in two major aspects from the one of Bramanti et al. 2009. First, our demographic model does not integrate a modern population sample and we do not compare any of the ancient groups to modern ones. Second, we extended the ranges for NUP and NN used by Bramanti et al. 2009 ([10 – 5000] and [1000 – 100 000], respectively) towards the lower bound of these ranges in order to account for a wider range of demographic scenarios of both population growth and decline (ranges provided in Table S2).

We used the serial coalescent algorithm fastsimcoal version 2.5.1 (ref. 9) to simulate mitochondrial genealogies of ancient hunter-gatherer and farmer sequences. In each genealogy, simulated sequences were sampled according to the numbers and the median of the radiocarbon date ranges of the sequence groups presented in Table S1 and Figure 2. We used a fixed mutation rate of 5.10-6/bp/generation (ref. 10).

50000 gene genealogies were generated under each of the 2500 combinations of NUP and NN values (Table S2). We computed the six pairwise FST between the simulated population samples(Figure 2) and recorded the proportion of simulated FST values that were greater than those observed per pairwise FST per parameter combination (Figure 3). Results were analysed and plotted using the statistical analysis programming language R, version 2.15.1 (ref. 11).

The top right area delimited by the black horizontal and vertical lines on Figure 3 shows the ranges for NUP and NN previously used in Bramanti et al. 2009 and Rasteiro and Chikhi 2010. These prior ranges do not permit to recover higher mtDNA differentiation observed between the contemporaneous South-F and Central-F groups under the simple assumptions of a panmictic population. However, we show here that smaller values of NN (between 10 and 200 females) would be enough to obtain larger levels of genetic differentiation between these two farmer groups, without invoking population structure12 and/or gene-flow.

* ABC-related approach

Although we quantified the proportion of pairwise FSTthat was greater than observed,it does not show that observed levels of within and between group diversitywere actually recovered in the datasetssimulatedunder a simple panmictic model.In order to assess how close the simulated datasets were to the corresponding observed values, we carried out a rejection algorithm widely used in approximate Bayesian computation (ABC) procedure13,14. Briefly, the ABC rejection algorithm involves calculatingthe difference between observed summary statistics and corresponding simulated datasets (e.g. ref. 15). In this analysis, as well as the between population diversity statistics, i.e. the 6 pairwise FST,we used two within population diversity statistics, i.e. the number of segregating sites (S) and the number of pairwise differences (Pi) within group. Comparison of these observed values to those simulated provides a measure of how close a simulated dataset is to the observed dataset and thereby informs on whether observed data could be recovered under the simple panmictic population model considered.

We used the rejection algorithm of the ‘abc’ package available in R16 to retain the 1 000 simulations that generated simulated datasetsthe closest to the 14observed summary statisticsfor each parameter combination (tolerance 2% of 50 000 simulations per parameter combination). This left us with 2 500 000 retained simulated datasets (Figure S1, solid lines) containing an equal number of each parameter combination. We subsequently repeated the ABC-rejection selection process over these (tolerance 0.5% of 2 500 000) to retain the 12 500closest simulated dataset (Figure S1, dashed lines) to the observed (Figure S1, solid vertical line). Figure S1 shows that the simple panmictic population model does recover both within and between population group diversity levels.

In particular, most of the observed statistics values lie within the 95% confidence interval (dashed vertical lines) of the stringiest ABC-rejection threshold (tolerance 0.5% of 2 500 000). Even though one observed statistics value, i.e. the number of pairwise differences within the Central-F farmer group, lies slightly right of the upper limit of the 95% confidence interval, it is not completely out of range.

For the FST between Central-F and South-F and Hunter-gatherers and South-F, the difference between the mode of the distribution and the observed value is negligible, while in the former the observed value still falls within the retained simulated distribution. This corroborates that this latter pairwise comparison (Central-F and South-F) is the one for which the parameter space explored for NN and NUP is the narrowest, and suggests that alternative models may explain the differentiation between these two contemporaneous groups better, such as for example population structure with gene-flow.

We finally estimated NUP and NN effective population sizes from these 12 500 retained simulations using both the simple rejection and the local linear regression algorithms provided in the ‘abc’ R package (Figure 3). The 95% credible intervals obtained from the rejection algorithm are [5 – 3 500] NUP females (mode=59.8) and [200 – 7 750] NN females (mode=974.2). The estimates using the local linear regression algorithm are in similar ranges, i.e. [5 – 624] females (mode=33.3) and [1 446 – 10 055] females (mode=3 804) for NUP and NN, respectively. We caution against over-interpretation of the NUP and NN estimates we provide here. They are only given as a means of comparison to the ranges shown on Figure 3 and should not be used as global European female effective population size estimates, but rather for investigating further hypotheses of temporary population decline in Europe.Our approach is not really suited to precise estimation of effective population sizes; there is likely insufficient information in the data to make such estimates.

FIGURE S1: Simulated within (S and Pi) and between (pairwise FST) population group diversity levels retained after approximate Bayesian computation procedure. S and Pi stand for the number of segregating sites and the number of pairwise differences within population group, respectively. Solid lines show 2 500 000 retained simulated datasets each containing an equal number of each parameter combination. Dashed lines show the 12 500 closest simulated set of summary statistics retained from those 2 500 000 datasets. Dashed vertical lines show the 95% confidence interval of the dashed distributions (from 12 500 closest simulated datasets). Solid vertical lines show the observed summary statistics values. The fewer the number of simulations retained, i.e. the higher the threshold value (from solid to dashed lines), the closer the simulated datasets are to the observed values. This Figure shows that the simple panmictic population model does recover the observed values.

References:

1Excoffier, L. & Lischer, H. E. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular ecology resources 2010; 10, 564-567.

2Currat, M. & Silva, N. M. Investigating European genetic history through computer simulations. Human heredity 2013; 76, 142-153.

3Barbujani, G. Genetic evidence for prehistoric demographic changes in Europe. Human heredity 2013; 76, 133-141.

4Bramanti, B., Thomas, M. G., Haak, W. et al. Genetic discontinuity between local hunter-gatherers and central Europe’s first farmers. Science 2009; 326, 137-140, doi:10.1126/science.1176869.

5Takahata, N. Allelic genealogy and human evolution. Molecular biology and evolution 1993; 10, 2-22.

6DeGiorgio, M., Degnan, J. H. & Rosenberg, N. A. Coalescence-time distributions in a serial founder model of human evolutionary history. Genetics 2011; 189, 579-593.

7Consortium, T. G. P. A map of human genome variation from population-scale sequencing. Nature 2010; 467, 1061-1073.

8Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 2011; 475, 493-496, doi:

9Excoffier, L. & Foll, M. Fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 2011; 27, 1332-1334.

10Soares, P., Ermini, L., Thomson, N. et al. Correcting for purifying selection: an improved human mitochondrial molecular clock. The American Journal of Human Genetics 2009; 84, 740-759.

11Team, R. D. C. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2008;

12Rasteiro, R. & Chikhi, L. Female and male perspectives on the neolithic transition in Europe: clues from ancient and modern genetic data. PLoS One 2013; 8, e60944, doi:10.1371/journal.pone.0060944.

13Beaumont, M. A. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution and Systematics 2010; 41, 1.

14Beaumont, M. A., Zhang, W. & Balding, D. J. Approximate Bayesian computation in population genetics. Genetics 2002; 162, 2025-2035.

15Marjoram, P. & Tavaré, S. Modern computational approaches for analysing molecular genetic variation data. Nature Reviews Genetics 2006; 7, 759-770.

16Csilléry, K., Blum, M. G., Gaggiotti, O. E. & François, O. Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution 2010; 25, 410-418.