Convergence Diagnostics forPhaseMCMC Chains

We tested the convergence of the MCMC chains with the Coda library (Best et al., 1995), as implemented for the R v.2.11.1 environment (Plummer et al., 2010; R Development Core Team, 2010).Coda convergence diagnostics taken into account in our analyses included: heterogeneity among chains (Gelman and Rubin, 1992), convergence test for individual chains (Geweke, 1992), stationarity and half-width interval tests (Heidelberger and Welch, 1983), as well as the inspection of the density distributions and the auto- and cross-correlations plots. Point estimates of the parameters in the different applications were calculated as the medians of the second half from all stationary chains.

Haploype Inferences with Phase

Haplotypes were obtained with the coalescent-based approach from Stephens & Donnelly (2003). The use of different priors (c, , f) for the haplotype reconstructions had little effect over the final set of inferred haplotypes. The differences were restricted to a few genotypes that produced alternative pairs of inferred haplotypes under the different models.

The four LGs investigated had different homozygosity and number of markers, which led us to run chains of different lengths and thinning intervals. Burn-in iterations, thinning intervals and final chain lengths of the Bayesian chains that produced the reported results were as follows: LG2 (-X10, 1000 25 250); LG9 (-X10, 1000 25 500); LG10 (-X10, 1000 15 250); LG12 (-X10, 500 10 100). We measured the goodness of fit of the estimated haplotypes to an approximate coalescent with recombination, using the posterior pseudo-likelihood of the data under the model (Stephens and Donnelly, 2003), with the convergence tools available within Coda (not shown).

Recombination Estimates with Phase

Recombination was estimated from population data (inferred haplotypes) using the general recombination model from Li and Stephens (2003) and Crawford et al. (2004), which allows hot and cold-spots of recombination to independently occur in different segments. We used four different sets of priors by combining two recombination probabilities per base-pair (c: the default value and the value obtained from the rate of the oaks genome content to their linkage length) with two priors for the population genetics recombination parameter (µ) and for the difference allowed (f) between the estimated population genetics recombination parameter () and its prior (µ). The default values for the hotspots priors were used in all cases. See Phase documentation for a briefing on the recombination models and the priors.

Our first attempts to calculate the population genetics recombination parameters were made with thinning intervals, final-chain lengths and burn-in iterations shown in Table S5-1. Background recombination point-estimates show that informative priors might be essential to reach convergence, at least with this type of data. Only LG9 and LG10 point estimates obtained with the “oak priors” attained convergence. The Gelman-Rubins shrink factor test and Geweke`s Z scores indicated convergence failure in LG2 and LG12 (Figure S5-1 and Table S5-2). Heidelberger-Welch’s stationarity test failed only for the recombination estimate in the second segment from LG.9 (Table S5-3) The half-width tests failed for LG2 and LG12 (Table S5-3).

Afterwards, we run a longer final set of iterations trying to confirm convergence in LG9 and LG10 and to attain it in LG2 and LG12. We used a thinning interval of 500 and a final chain length of 109after 5x108 burn-in iterations, for all LGs. The results obtained are shown in the main text (Table 2). Convergence was confirmed for LG9 and LG10 (Figure S5-2, Tables S5-4, S5-5) and the recombination parameters point estimates varied only slightly. Furthermore, LG12 also passed the Coda convergence tests, although longer simulations would be needed to obtain accurate point estimates.

Our attempts to obtain recombination estimates with the modifiedrecombination probability per base pair (c) failed for three out the four LGs, even though we used much longer simulations (Table S5-6). Only LG9 simulations obtained with the oak priors seemed to reach convergence, with a point estimate rather close to the one obtained with the default recombination probability value. We did not further pursue convergence in other segments because one single simulation would last far over one month in our computers.

REFERENCES S2

Best MG, Cowles MK, Vines SK (1995) Coda Manual version 0.30. MRC Biostatistics Unit, Cambridge, UK.

Crawford D, Bhangale T, Li N, Hellenthal G, Rieder M, Nickerson D, Stephens M (2004) Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genetics, 36, 700-706.

Gelman A, Rubin DB (1992) Inference from iterative simulations using multiple sequences. Statistical Science, 7, 457-472

Geweke J (1992) Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bayesina Statistics 4, (Eds. Bernardo JM, Berger JO, Dawid AP, Smith AFM). Clarendon Press, Oxford, UK

Heidelberger P, Welch P (1983) Simulating run length control in the presence of an initial transient. Operations Research, 31, 1109-1144

Li N, Stephens M (2003) Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165, 2213-2233.

Plummer M, Best N, Cowles K,Vines K (2006) Coda: Output analysis and diagnostics for MCMC. R package version 0.13-5. URL

R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL

Stephens M, Donnelly P (2003) A comparison of Bayesian methods for haplotype reconstruction. American Journal of Human Genetics, 73, 1162-1169.

Table S5-1 Point estimates of the recombination parameters, together with burn-in periods and final chain lengths, for the first recombination estimates obtained with the default recombination probability per base pair (c=1E-08).

Figure S5-1 Gelman-Rubin-Brooks plots of Gelman and Rubin (1992) shrink factor for the Bayesian chains used in the recombination estimates shown in Table S5-1 (only for the analyses with the “oak priors”). The pseudo-likelihood tests the goodness of fit of the inferred haplotypes to an approximate coalescent with recombination (Stephens and Donnelly, 2003).


Table S5-2Geweke Z-scores for the Bayesian chains used to estimate recombination parameter shown in Table S5-1 (only for the analyses with the ”oak priors”). Tests failures are indicated in bold characters.

Table S5-3 Heidelberger and Welch stationarity and half-width tests for the Bayesian chains used in the recombination estimates shown in Table S5-1 (only for the analyses with the ”oak priors”). Failures to pass the tests are indicated by bold characters.


Figure S5-2 Gelman-Rubin-Brooks plots of Gelman and Rubin (1992) shrink factor for the Bayesian chains used in the reported recombination estimates(Table 2, main text; only for the analyses with the “oak priors”).

Table S5-4 Geweke Z-scores for the Bayesian chains used to estimate recombination parameter shown in Table 2 of the main text (only for the analyses with the “”oak priors”). Tests failures are indicated in bold characters.

Table S5-5 Heidelberger and Welch stationarity and half-width tests for the Bayesian chains used in the recombination estimates shown in Table 2 of the main text (only for the analyses with the ”oak priors”). Failures to pass the tests are indicated by bold characters.

Table S5-6 Recombination estimates, together with burn-in periods and final chain lengths,obtained usingan estimate of the oaks recombination probability per base pair based in published data (ratio of the DNA content to the linkage length). Only the longest simulations for each segment are shown.