2/18/20092/11/2009 8:38:30 AM11:33:26 AM

Supplementary Material:

CYP2C8 evolution

It is not surprising that several historical recombination events must have occurred in the ancestry of the current haplotypes. However, the haplotype frequency distributions argue that none of the genomic regions has a particularly high rate of recombination and that almost all crossover products now seen represent “ancient” crossovers that occurred before modern humans expanded out of Africa.

We can identify 17 haplotypes that occur commonly enough in multiple populations that we can be confident of their existence. Undoubtedly many of the individually very rare haplotypes, lumped into the residual class, are also real. However, the occasional occurrence of missing data allows for the possibility of incorrect inference by the statistical programs used, as witnessed by the fact that different programs (HAPLO, PHASE, fastPHASE) commonly differ somewhat on the occurrences and frequencies of very rare haplotypes but virtually never differ on those found at 5% or greater frequency in at least one population. Clearly, almost all the evolutionary information is present in the more common of the 17 haplotypes, and we consider the amino acid variants in that context. Five of the known uncommon amino acid variants were studied here, and they occur frequently enough that we are confident of their haplotypes.

To better understand the evolutionary relationships of these haplotypes, we have identified those groups of adjacent SNPs within the haplotype among which we see no definite evidence of recombination. For each such group (or molecular subregion), accumulation of mutations from the ancestral sequence is a sufficient (parsimonious) explanation of all existing haplotypes within the group. (This follows the process we have described for ADH7 (Han et al., 2007) and CYP2E1 (Lee et al., 2008).) Three such groups exist for the 10 SNPs we have studied (Figure S1). The first group comprises the two SNPs at CYP2C9; that pair is separated by the longest intermarker distance from the 8 SNPs at CYP2C8. Those CYP2C8 SNPs divide into two groups: one consists of three SNPs (numbers 3, 4, and 5 in Table 1) and the other consists of five SNPs (numbers 6-10 in Table 1). The diagrams on the left of Figure S1 show how mutations accrued within each group, with the ancestral sequence numbered “1” within each group. The trees on the right of Figure S1 show how the 17 haplotypes in Figure 2 of the paper are composed of the different subhaplotypes of the three groups. Unfortunately, there is no unambiguous way to temporally order all of the mutation and historical crossover events, though many reasonable inferences can be made.
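The grouping of adjacent SNPs described above can be illustrated with a small sketch. The text does not state which criterion was used to decide that a group shows "no definite evidence of recombination"; the sketch assumes the four-gamete test, a common choice, under which two biallelic sites that show all four allele combinations require either a crossover or a recurrent mutation. The function names and the toy haplotypes ("0" for ancestral, "1" for derived) are hypothetical and are not the study data.

```python
def four_gametes(haplotypes, i, j):
    """Return True if all four allele combinations occur at sites i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def compatible_blocks(haplotypes):
    """Greedily split adjacent sites into runs with no four-gamete conflict.

    A new block starts whenever the next site shows all four gametes
    with any site already in the current block.
    """
    n_sites = len(haplotypes[0])
    blocks, current = [], [0]
    for site in range(1, n_sites):
        if any(four_gametes(haplotypes, prev, site) for prev in current):
            blocks.append(current)
            current = [site]
        else:
            current.append(site)
    blocks.append(current)
    return blocks


# Toy haplotypes over five sites (illustrative only, not data from the study).
toy = ["00000", "10000", "11000", "11100", "00100", "00110", "11111"]

print(compatible_blocks(toy))  # site indices grouped into blocks
```

With these toy haplotypes the first two sites form one block and the last three form another, because the first and third sites together show all four allele combinations. A greedy left-to-right scan is only one way to partition the sites; it is used here because it keeps each block contiguous, as in the groups described in the text.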

Most of the presumably recombinant haplotypes appear to derive from ancient crossovers that became common, rather than being common because of frequent ongoing recombination. The implication is that, since humans expanded out of Africa, each extant copy of each of the 17 haplotypes has a history of evolving by descent independently from every other distinct haplotype that preexisted in Africa. Understanding this recent independent evolution provides a potential guide to the search for additional biomedically relevant variants: such a variant may not have been identified because the haplotype on which it occurs may not have been sufficiently resequenced. What occurs on the background of one haplotype is very unlikely to occur on any other. We note especially haplotypes D (the ancestral haplotype) and M (a triply derived haplotype), which are common in African populations and rare to virtually absent in non-African populations. Those haplotypes have evolved independently from each other and from any common non-African haplotype for at least 100,000 years. Because they represent some of the oldest haplotypes, they have had the most time to accumulate additional functional mutations.

Dataset comparison

Figure S2 gives the CYP2C8 haplotype frequencies around the world for the four SNPs in common between our study and the study of Rodriguez-Antona et al. (2008). As can be seen, most of the data collapse into a single haplotype that is common around the world. Haplotype B in their study is a subset of that common haplotype.
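The collapse of longer haplotypes onto a shared subset of SNPs can be sketched as follows. This is a minimal illustration, not the procedure actually used to produce Figure S2: the function name, the site positions in `shared`, and the toy frequencies are all hypothetical. Each full haplotype is reduced to its alleles at the shared sites, and the frequencies of haplotypes that become identical are summed.

```python
from collections import defaultdict


def collapse(freqs, keep):
    """Sum haplotype frequencies after restricting each haplotype to the sites in `keep`."""
    collapsed = defaultdict(float)
    for hap, freq in freqs.items():
        reduced = "".join(hap[i] for i in keep)
        collapsed[reduced] += freq
    return dict(collapsed)


# Toy 10-site haplotype frequencies (illustrative only, not data from the study).
toy_freqs = {
    "0000000000": 0.5,
    "0000000001": 0.25,
    "0010000000": 0.125,
    "0001000000": 0.125,
}

# Hypothetical 0-based positions of the four shared SNPs.
shared = [3, 5, 7, 8]

print(collapse(toy_freqs, shared))
```

In this toy case three of the four full haplotypes differ only at sites outside the shared set, so they merge into one reduced haplotype. That is the same kind of effect the paragraph describes, where most of the data fall into a single common haplotype.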

Lee MY, Mukherjee N, Pakstis AJ, Khaliq S, Mohyuddin A, Mehdi SQ, et al. Global patterns of variation in allele and haplotype frequencies and linkage disequilibrium across the CYP2E1 gene. Pharmacogenomics J 2008; 8:349-56.

Han Y, Gu S, Oota H, Osier M, Pakstis AJ, Speed WC, et al. Evidence of positive selection on a Class I ADH locus. Am J Hum Genet 2007; 80:441-56.

Supplementary Figure S1. The most parsimonious mutational histories are presented on the left with #1 corresponding to the ancestral sequence of each group of SNPs. On the right the combinations of specific haplotypes in each group, ordered left to right, are equated to the full 10-SNP lettered haplotypes in Table 2 and Figure 2 of the main text.


Supplementary Figure S2.
