Supplemental Material to Hobolth Et Al

Supplemental material to Hobolth et al.

Exclusion of the X chromosome from the analysis
Count pattern analysis
Indel analysis
Candidate regions for incomplete lineage sorting inferred from coalescent HMM

1. Exclusion of the X chromosome from the analysis

Under the molecular clock assumption, the human and chimpanzee branches are expected to have the same average length, even in the presence of incomplete lineage sorting. In order to measure the level of departure from the molecular clock and its subsequent impact on the inference of ILS, we estimated the branch lengths of a ((H,C),O,M) tree using the standard maximum likelihood method from phylogenetics. The four branch lengths of the unrooted tree were estimated using a GTR+Gamma(4) model for each alignment used in the CoalHMM analysis, using the bppML program (Dutheil and Boussau 2008). We measured the departure from the molecular clock by computing the ratio of the chimpanzee branch length over the human branch length and subtracting one

We find a departure of the chimpanzee and human branch length of 0.1 on average (see Fig. S1), showing a general departure from the molecular clock, due to an acceleration on the chimpanzee lineage, and/or a higher sequencing error in the chimpanzee sequence. Not surprisingly we find a strong correlation between the amount of ILS inferred and the departure from molecular clock (Kendal’s tau=0.24, p-value 2.2e-16, Fig.S1 right).

The alignments from the X chromosome display a much higher average departure of 0.5 from the molecular clock (see Fig. S1). This result can be explained by the lower coverage of the X chromosome, due to the sequencing of a male individual. By fitting a standard regression line, we find that the amount of ILS when the molecular clock assumption is valid is significantly different from 0 (Intercept = 0.99%, t-test p-value 2.2e-16). The high values of ILS observed on the X chromosome however suggest that much of the ILS is artificial and due to the higher sequencing error of the chimpanzee. We hence decided to remove the X chromosome in the analysis of the patterns of ILS.

By distinguishing between ILS due to the HO genealogy and CO genealogy, we find that HO correlates better with the departure from molecular clock (Kendal’s tau = 0.33 for HO versus 0.06 for CO), further supporting the effect of a longer chimpanzee branch length, as a higher number of mutations or sequencing error on the chimpanzee lineage should increase the support for HO, and not CO. These results also explain the observed higher frequency of HO compared to CO. We indeed report a highly significant positive correlation between the difference in frequencies between HO and CO and the departure from molecular clock (Kendal’s tau = 0.52, p-value 2.2e-16, Fig. S1 right).

FigureS1. Left: Comparison of the X chromosome and the autosomes. The X chromosome shows a larger departure from the molecular clock (0.5 on average) than the autosomos (0.1 on average). The molecular clock is a key assumption in the hidden Markov model analysis. Right: The departure from molecular clock is proportional to the difference in the number of HO and CO sites. The X chromosome has a higher difference, presumable due to chimpanzee sequencing errors.

2. Count pattern analysis

For the count pattern analysis we used the filtering criteria from (Patterson et al. 2006); only sites where the two flanking columns are non-polymorphic are considered, and all sites with one or more CpG pairs are discarded.

In TableS2, we show a summary of the human-chimpanzee-orangutan-macaque-marmoset (HCOMJ) alignments. We only consider segregating sites with exactly two distinct nucleotides; all other sites are discarded from the summary. The species are closely related and therefore we only remove a small proportion of the segretating sites when we assume binary data.

TableS1. The first columns describe the type of site pattern and corresponding count for the binary site pattern. Two models are considered: In the species tree model, we assume a fixed tree and a molecular clock for the HCO subtree. In the mixture model, we assume a mixture of the species tree and ((H,O),C) and ((C,O),H) trees. The two ’expected’ columns show the expected counts under each model. The two ’rel.err’ columns show the relative errors, defined as . We observe more HO and CO counts than expected under a fixed tree. The lack of fit in these counts is expected under incomplete lineage sorting; we observe that the mixture model provides a better fit for the HO and CO counts.

FigureS2. A. The data in TableS1 is modeled using a two-state continuous time Markov process on the species tree. The estimated branch lengths are indicated on the figure. Branch lengths are expected number of substitutions for 100 sites. A molecular clock is assumed in the HCO subtree. B. Branch lengths for the alternative trees. Each of the three alternative trees are given by equal to (HC,O), (HO,C) or (CO,H). The only new parameters of the mixture model compared to the species tree model are the inner branch length and the probability of a site belonging to an alternative tree. We find the maximum likelihood estimates and .

The human nucleotide is assigned state 0. The value of C,O,M and J can each be of type 0 or 1, giving different site patterns because the constant site pattern is excluded. We assume sites are independent and identically distributed such that the data can be summarized in terms of counts of the various site patterns. The first six columns in TableS1 show the type of site and corresponding observed site pattern count.

We fitted a two-state continuous time Markov model to the binary data. We assume a fixed tree such that the only parameters are the branch lengths and we assume a molecular clock in the HCO part of the tree. The maximum likelihood estimates of the five free parameters are summarized in FigureS2A. Note that the branch length to macaque (2.29) is much longer than the corresponding branch length to either human, chimpanzee or orangutan (1.70), which is why we do not assume a molecular clock for this branch.

From TableS1 it is evident that the HO and CO counts are the most poorly recovered under the species tree model; there are too many HO and CO observed counts than expected under the fixed species tree. ILS between human, chimpanzee and orangutan can explain that the observed counts are higher than the expected counts for the HO and CO sites; the alternative trees ((H,O),C) and (H,(C,O)) predict a higher count of HO and CO counts, respectively.

We therefore fitted a mixture model to the count data that allows for alternative trees. We fixed the branch lengths from the species tree and assumed that a particular site is a realization of this basic tree with probability and of an alternative tree with probability . We constraint the branch lengths of the alternative trees by assuming they are all equal to the species tree except for the inner branches, and we still assume a molecular clock. The alternative trees thereby introduce one new branch length parameter and in total the mixture model has two extra parameters () compared to the species tree model. See FigureS2B for an illustration. We note that the relationship between the parameter and the amount of ILS is not simple. The idea behind the mixture model is simply to investigate if the observed count pattern can be better explained by allowing alternative trees.

It is evident from TableS1 that the mixture model is a better fit to the binary count data. In particular the number of observed and expected HO and CO counts are much more similar in the mixture model than under the species tree model. The maximum log-likelihood values for the species tree and the mixture model are -205,409,832 and -205,407,381, giving a two times log-likelihood ratio of 4902. With only two added parameters, this test statistic is highly significant. The count pattern analysis thus indicates a small but significant amount of ILS between human, chimpanzee and orangutan.

3. Indel analysis

An informative indel is a region in the alignment that groups either human and chimpanzee (HC indel), human and orangutan (HO indel) or chimpanzee and orangutan (CO indel). Generally, an indel is called if it passes the following criteria: i) The indel is at least 5 bp long, ii) The 50 columns to the left and right of the indel are without gaps, iii) In the 200 flanking columns (100 to the left and 100 to the right), there are at least 80% identical columns and at most 10% gap columns (see FigureS3).

Duplications in the ancestral human-chimpanzee-orangutan species can potentially lead to alignments of e.g. a human paralog with a chimpanzee and a orangutan homolog. We removed indels in regions with duplications in more than one of the genomes based on a BLAT analysis (Kent 2002).

In FigureS3 we illustrate the criteria for calling an informative indel and an example of a called HO indel is shown in FigureS4.

Figure S3: The criteria used for calling an informative indel

Figure S4. An example of an informative indel supporting human and orangutan as sister species. There are several HO informative sites in the same region, and more C singletons than either human or orangutan singletons

We conducted multiple analyses of the informative indels, but here we mainly focus on two aspects, namely (i) the number of informative sites, and (ii) the distance to informative sites. In FigureS5, we show the mean number of informative sites in the 100 alignment columns of the flanking indel. We group the informative sites according to each of the three types of informative indels. We observe that HC sites are more likely in flanking regions of HC indels than in flanking regions of HO or CO indels. Similarly, the HO sites are most likely in flanking regions of HO indels, and CO sites are most likely in flanking regions of CO indels. These observations are consistent with ILS. We tested for significance by randomly permuting indel classifications and found the observed pattern significant (P-value=0.030).

FigureS5. Mean number of informative sites for each of the three types of informative indels. HC sites are more likely in flanking regions of HC indels than in flanking regions of HO or CO indels. Similarly, the HO sites are most likely in flanking regions of HO indels, and CO sites are most likely in flanking regions of CO indels.

In FigureS6 we show the mean distance to each of the three informative sites from each of the three informative indels. We note that the shortest mean distance to a HC, HO and CO site is in a HC, HO and CO indel. These observations indicate that the informative indels are true evolutionary insertions or deletions and thus are consistent with the presence of ILS. This pattern was also tested by permutation and again found to be significant (P-value=0.028).

FigureS6. Mean distance to informative site for each of the three types of informative indels. The mean distance to e.g. a HO site is smallest for HO indels. Similarly, the mean distance to a HC site is smallest for HC indels, while the mean distance to a CO site is smallest for the CO indels.

Long indels ( bps) are expected to be unique events and the fact that 2-3% of parsimonously informative indels support alternative genealogies is support for incomplete lineage. Our analysis of the flanking regions around informative indels further shows precisely those signals one would expect in the presence of ILS. We see (i) an excess of HO and CO sites in flanking regions of HO and CO indels, and (ii) a shorter distance to the first informative HO (CO) site from an HO (CO) indel than from a HC indel.

4. Candidate regions for incomplete lineage sorting inferred from coalescent HMM

Scanning for large HO and CO regions

We developed a scanning procedure for identifying large candidate regions supporting either the HO or the CO genealogy. This was done by sliding a 2000 bp window on the genome alignment with a step size of 500 bp. For the HO candidates, the window was considered a candidate for ILS if the following criteria were fulfilled:

The proportion of HO genealogies is greater than 0.5, as estimated from the posterior decoding.
The window has at least one parsimony site supporting HO.
The number of parsimony sites supporting HO is larger than the number of parsimony sites supporting CO.
The number of H singletons is smaller than the number of C singletons.
The number of H singletons is larger than half the number of O singletons.
The human-chimpanzee divergence is smaller than 4%.

The criteria are similar for the CO candidates. Criteria 1 to 3 are simple checks to retrieve regions with a significant proportion of sites of interest (HO in this case). Criterion 4 results from the fact that we expect the chimpanzee branch to be longer than the human branch under the HO genealogy. Similarly, we expect the human branch to be of the same magnitude as the orangutan branch, whereas in the rest of the genome the orangutan branch is longer than both the human and chimpanzee branch (criterion5). Finally, we want to exclude regions with too large human-chimpanzee divergence (criterion 6), as they could be the signature of a duplication, potentially biasing the inference. An example from the scanning is shown in Figure 3B of the main paper.

This complete scanning procedure results in 56 regions with the HO genealogy, and 74 regions with the CO genealogy (see TablesS2 and S3). These regions were essentially non-coding, and do not display any particular chromosomal pattern, apart from an excess on chromosome 4 (see FigureS7). There is also a tendency for the HO and CO regions to be spatially correlated, which is expected as a local increase in the ancestral population size should not favour one of the two alternative topologies.

In FigureS7 we illustrate the position of regions inferred to strongly support the alternative HO or CO topologies across the chromosomes. We observe that ILS is widely spread throughout the genome.

Figure S7. The distribution over the genome of regions with strong support for alternative genealogies due to ILS

In Tables S2 and S3 we provide the coordinates of the regions showing strong support of the HO or CO genealogy.

TableS2. Regions supporting the HO genealogy.

TableS3. Regions supporting the CO genealogy.

Dutheil J, Boussau B. 2008. Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs. BMC evolutionary biology8: 255.

Kent WJ. 2002. BLAT--the BLAST-like alignment tool. Genome Res12(4): 656-664.

Patterson N, Richter DJ, Gnerre S, Lander ES, Reich D. 2006. Genetic evidence for complex speciation of humans and chimpanzees. Nature441(7097): 1103-1108.