Defining the Donor Haplotypes

SUPPLEMENTARY TEXT

Defining the donor haplotypes

The initial method described in Melé et al. (2010) was modified to extract the extant sequences carrying the donor haplotypes of the detected recombinations. The donor haplotypes are defined as the neighboring patterns present on each side of a recombination breakpoint (Figure 2). The method aggregates several runs of a basic algorithm that tries to identify the recombinant sequences in a set of extant sequences. By using multiple sliding windows of different sizes and adding up the information from the successive runs, we obtain, for each recombination event, a distribution of detections in specific sequences along the SNPs. A threshold based on extensive simulations was established to choose which recombinations were taken as high-confidence events.

An equivalent approach was taken to define the descendant sequences carrying the donor haplotypes. For each recombination event detected, the sequences carrying the neighboring patterns present on each side of the recombination breakpoint were extracted. Then, by using multiple sliding windows of different sizes and adding up the information from the successive runs, we obtained, for each recombination event, a set of distributions of putative descendant sequences of the donor haplotypes. A threshold was established to determine which sequences were taken as high-confidence descendants of the donor haplotypes: IRiS considered a sequence a descendant of the donor haplotypes if, across the different runs of the algorithm, it had been detected in at least half of the runs in which that recombination event had been detected. Finally, since the largest sliding window size used by IRiS was 20 SNPs, the length of the donor haplotypes was set to 20 SNPs.
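The half-of-detections rule above can be sketched as follows. This is a minimal illustration, not the IRiS implementation: the function name, the input layout (one set of flagged sequence IDs per run in which the event was detected) and the `min_fraction` parameter are assumptions made for the example.

```python
from collections import Counter

def high_confidence_descendants(runs, min_fraction=0.5):
    """Return the sequences flagged in at least `min_fraction` of the runs.

    `runs` is a hypothetical input: one set of sequence IDs per run in
    which the recombination event itself was detected.
    """
    n_detections = len(runs)
    if n_detections == 0:
        return set()
    # Count in how many runs each sequence was flagged as a descendant.
    counts = Counter(seq for flagged in runs for seq in flagged)
    return {seq for seq, c in counts.items() if c >= min_fraction * n_detections}

# Toy example: the event was detected in four runs.
runs = [{"s1", "s2"}, {"s1"}, {"s1", "s3"}, {"s1", "s2"}]
print(sorted(high_confidence_descendants(runs)))  # ['s1', 's2']
```

Here s1 is flagged in all four runs and s2 in two of four, so both pass; s3 is flagged only once and is dropped.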

The validation of this method was performed using the cosi simulator. From the whole genealogy created in each simulation, the information on every intermediate sequence at each of the internal nodes was extracted. The variants present in the ancestral sequences just before a recombination event took place (the donor haplotype sequence) were then extracted. Finally, any extant simulated sequence carrying those variants was considered to be a descendant of the donor haplotypes of that specific recombination event. After running over 100 cosi simulations, the average false discovery rate for the donor haplotypes was 8.07%, the sensitivity was 51.14%, and the false negative rate was 44.6%.
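For one recombination event, a comparison between predicted and simulated descendants can be sketched as below. The sketch uses the conventional set-based definitions of false discovery rate and sensitivity; the text does not state how the reported rates were computed or averaged across events and simulations, so this may differ from the exact calculation used. The function and variable names are hypothetical.

```python
def detection_metrics(predicted, truth):
    """Compare predicted descendants with the true descendants of one event.

    Both arguments are sets of sequence IDs. Conventional definitions
    (an assumption, not taken from the source):
      FDR         = FP / (TP + FP)
      sensitivity = TP / (TP + FN)
    """
    tp = len(predicted & truth)   # correctly predicted descendants
    fp = len(predicted - truth)   # predicted but not true descendants
    fn = len(truth - predicted)   # true descendants that were missed
    fdr = fp / (tp + fp) if predicted else 0.0
    sensitivity = tp / (tp + fn) if truth else 0.0
    return {"fdr": fdr, "sensitivity": sensitivity}

# Toy example with made-up sequence IDs.
metrics = detection_metrics({"a", "b", "c", "d"}, {"a", "b", "e"})
print(metrics["fdr"])  # 0.5
```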

Constructing the subARG

Every detected recombination defines three nodes, the subsequences borne by each of them, and their sets of extant descendants. This relationship can be represented in a local topology as shown in Suppl. Figure 4. Recombinations whose breakpoints lie close to each other along the chromosome create nodes that bear neighboring subsequences and share descendants. The information shared among genomically neighboring nodes reveals a more recent shared ancestry and is exploited to improve the resolution of the subARG. It is integrated into the ARG by adding a new vertex for each such overlap (see Suppl. Figure 5 and Figure 2). The subsequence at the new vertex is the union of the subsequences borne by the vertices creating it. Similarly, the set of descendants of the new vertex is the intersection of the two descendant sets. When creating these additional nodes, higher precedence is given to the closest neighbor, because nodes located far apart are less likely to have been co-inherited from the same ancestor. Accordingly, a conservative threshold of d = 10 SNPs was established, and nodes with segments further apart than this threshold were not co-analyzed.
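The creation of a new vertex from two neighboring nodes can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Node` class, the use of SNP indices for segment boundaries, and the way the distance between segments is measured (number of SNPs separating them) are hypothetical choices, not details given in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    start: int              # index of the first SNP of the segment (inclusive)
    end: int                # index of the last SNP of the segment (inclusive)
    descendants: frozenset  # extant sequences descending from this node

def gap(a, b):
    """SNPs separating two segments (0 if they touch or overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)

def merge_if_neighbors(a, b, d=10):
    """Return the new vertex for nodes a and b, or None if they do not qualify."""
    if gap(a, b) > d:
        return None  # segments further apart than d SNPs are not co-analyzed
    shared = a.descendants & b.descendants
    if not shared:
        return None  # no common descendants, so no new vertex
    # Segment is the union of both segments; descendants are the intersection.
    return Node(min(a.start, b.start), max(a.end, b.end), shared)

A = Node(0, 19, frozenset("abc"))
B = Node(22, 41, frozenset("bcd"))
C = merge_if_neighbors(A, B)
print(C.start, C.end, sorted(C.descendants))  # 0 41 ['b', 'c']
```

The precedence given to the closest neighbor is not modeled here; one way to approximate it would be to process candidate pairs in order of increasing gap.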

Null distribution of node mapping in validation

To understand the significance of the node mapping during validation, one cosi simulation was randomly chosen for further evaluation. Suppl. Figure 6 shows a scatter plot of the best-match Jaccard index for every node in the subARG generated for this simulation. Since very small sets are more likely to map well just by chance, the cardinality of the set is taken into account in the scatter plot. To evaluate the probability of a set mapping well just by chance, 10,000 random descendant sets were generated for each cardinality (varying from 1 to 60), and the Jaccard index of the best match was computed. The line represents the average value of the best match. The plot indicates that the mapped values are significantly higher and unlikely to occur just by chance. Ninety-five percent of the random sets had a best match with a Jaccard index lower than 0.333. Across all simulations, fewer than 0.6% of the nodes mapped below this significance criterion.
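The null distribution described above can be approximated with a short simulation. This is a minimal sketch: the function names are hypothetical, `reference_sets` stands for the descendant sets of the nodes in the simulated genealogy (an assumption about what the subARG nodes were matched against), and the toy data are invented.

```python
import random

def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(query, reference_sets):
    """Highest Jaccard index between `query` and any reference set."""
    return max(jaccard(query, ref) for ref in reference_sets)

def null_best_matches(samples, reference_sets, size, n_random=10_000, seed=0):
    """Best-match Jaccard indices of random descendant sets of a given size."""
    rng = random.Random(seed)
    return [
        best_match(set(rng.sample(samples, size)), reference_sets)
        for _ in range(n_random)
    ]

# Toy data: 60 extant samples and three reference descendant sets.
samples = list(range(60))
reference_sets = [set(range(0, 10)), set(range(5, 30)), set(range(40, 60))]
null = sorted(null_best_matches(samples, reference_sets, size=8, n_random=1000))
threshold_95 = null[int(0.95 * (len(null) - 1))]  # empirical 95th percentile
observed = best_match(set(range(0, 9)), reference_sets)
print(observed > threshold_95)
```

Repeating this for each cardinality from 1 to 60 gives the size-dependent baseline against which the observed best matches are compared.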

Reconciling Structure runs

Structure (Pritchard et al. 2000) was used to perform unsupervised Bayesian clustering of samples using four different measures: the direct SNP data, fixed-length haplotypes, the subARG, and probabilistic haplotypes. For the first three methodologies the software was run 5 times, and for probabilistic haplotypes 10 times, for each value of the number of clusters (k). Each run had a burn-in period of 50,000 iterations followed by 50,000 iterations of sampling using the admixture model. The cluster assignments across the runs, for each method and number of clusters, were aligned using CLUMPP (Jakobsson and Rosenberg 2007). Replicates were combined into modes such that every pair of replicates sharing a mode had a symmetric similarity coefficient (SSC) greater than or equal to 0.9. Average ancestry coefficients for the most frequent mode are displayed in Figure 6. For k=2, two modes appeared across the methods. In each run, two of the three main groups (sub-Saharan African, European, or East Asian) were merged. The most frequent mode appeared in 60-80% of the runs. The three major clusters appeared consistently for k=3 across all the runs, with strong correlation in ancestry coefficients within each method (average SSC 0.94-1.00). For k=4, the added cluster is consistently restricted to populations outside sub-Saharan Africa. There is no further discernible pattern in this new component for SNPs, probabilistic haplotypes, and the subARG. However, for fixed-length haplotypes the new component agrees with South East Asian ancestry, with an average SSC of 0.91 across the runs.
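The grouping of replicates into modes can be sketched as follows. The text does not describe the grouping procedure itself, so the greedy strategy below is only one way to satisfy the stated condition that every pair of replicates in a mode has SSC >= 0.9. The pairwise SSC matrix is assumed to be available already (for example, computed from the aligned runs); the values shown are invented.

```python
def group_into_modes(ssc, threshold=0.9):
    """Greedily group replicates so that all pairs within a mode meet the threshold.

    `ssc` is a hypothetical symmetric matrix (list of lists) holding the
    pairwise symmetric similarity coefficients between replicates.
    """
    modes = []
    for i in range(len(ssc)):
        for mode in modes:
            # Join the first mode whose members are all similar enough to i.
            if all(ssc[i][j] >= threshold for j in mode):
                mode.append(i)
                break
        else:
            modes.append([i])  # no compatible mode found: start a new one
    return modes

# Invented SSC values for four replicates.
ssc = [
    [1.00, 0.95, 0.40, 0.93],
    [0.95, 1.00, 0.42, 0.91],
    [0.40, 0.42, 1.00, 0.38],
    [0.93, 0.91, 0.38, 1.00],
]
modes = group_into_modes(ssc)
most_frequent = max(modes, key=len)
print(modes)          # [[0, 1, 3], [2]]
print(most_frequent)  # [0, 1, 3]
```

A greedy pass depends on the order of the replicates; with borderline SSC values a different order could yield different modes.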

References

Jakobsson M, Rosenberg NA (2007) CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23 (14): 1801-1806

Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155 (2): 945-959

SUPPLEMENTARY FIGURES

Figure S1 – 10,000 coalescent simulations were conducted using cosi to empirically estimate the probability distribution of allele age based on its frequency in extant samples. The heat map depicts the probability distribution of allele age at a particular frequency. Red and blue signify the high and low end of the probability spectrum. The colors are spread on a log scale for better visibility.

Figure S2 – The empirical distribution of the true age of nodes recovered by the subARG in validation simulations is plotted.

Figure S3 – Informativeness versus haplotype length (L). Informativeness is defined as the product of the average number of haplotypes across all windows and the average number of sequences per haplotype across all windows. Standard deviations are calculated by changing the starting position from which haplotypes of length L are defined.

Figure S4 – A recombination is detected with a breakpoint ‘brkpt’ and one descendant {a}. The donor haplotypes are shared with extant samples {b, c} left of the breakpoint and samples {d, e} right of the breakpoint respectively. The structure depicts segment demarcation, and the set of descendants of each of the three nodes that this recombination contributes to in the subARG.

Figure S5 – A and B bear neighboring segments and the intersection of their descendant sets is non-empty. A new node C is created in the subARG with descendants equal to the intersection of the two descendant sets, and segment demarcation equal to the union of the two segments.

Figure S6 – Each point in the scatter plot depicts the cardinality of the descendant set of a subARG node and its best-match Jaccard coefficient. To evaluate its significance, 10,000 random sets of each set size were also generated and the best match computed for this particular ARG. The average for the random sets is indicated by the line. The subARG data points are significantly higher than the random sets, indicating that a good mapping is unlikely to have occurred just by chance.

SUPPLEMENTARY TABLES

Table S1. Information on the samples genotyped

Population Acronym / Population Name / Nº initial samples / Nº samples failed / Nº females / Nº final samples / Nº final sequences
YRI / Yoruba / 53 / 0 / 0 / 53 / 53
MKK / Maasai / 46 / 0 / 0 / 46 / 46
LWK / Luhya / 46 / 0 / 0 / 46 / 46
CHA / Chadian / 46 / 3 / 0 / 43 / 43
ASW / African American / 45 / 0 / 0 / 45 / 45
LEB / Lebanese / 46 / 4 / 0 / 42 / 42
KUW / Kuwaitis / 46 / 3 / 0 / 43 / 43
IRA / Iranian / 46 / 14 / 0 / 32 / 32
EGY / Egyptian / 46 / 0 / 0 / 46 / 46
MOR / Moroccan / 46 / 26 / 0 / 20 / 20
CEU / European ancestry / 45 / 0 / 1 / 45 / 46
BRI / British / 46 / 1 / 13 / 45 / 58
DUT / Dutch / 43 / 14 / 0 / 29 / 29
BAS / Basque / 46 / 1 / 0 / 45 / 45
GYP / Gypsies / 36 / 1 / 11 / 35 / 46
TSI / Tuscans in Italy / 46 / 0 / 0 / 46 / 46
ROM / Romanian / 38 / 5 / 0 / 33 / 33
CHE / Chechenian / 46 / 8 / 0 / 38 / 38
RUS / Russian / 46 / 4 / 0 / 42 / 42
TAT / Tatar / 46 / 0 / 0 / 46 / 46
ALT / Altaian / 46 / 16 / 0 / 30 / 30
UIG / Uigur / 46 / 1 / 0 / 45 / 45
GIH / Gujarati / 46 / 0 / 0 / 46 / 46
CAN / Cape Nadar / 50 / 3 / 0 / 47 / 47
NTN / Northern Tamil Nadu / 32 / 0 / 0 / 32 / 32
KAL / Kalita / 41 / 0 / 0 / 41 / 41
ADI / Adi / 33 / 1 / 0 / 32 / 32
TIB / Tibetan / 50 / 3 / 0 / 47 / 47
LAO / Lao / 46 / 2 / 0 / 44 / 44
ATI / Ati / 46 / 27 / 0 / 19 / 19
CHB / Han Chinese / 34 / 0 / 12 / 34 / 46
JPT / Japanese / 35 / 0 / 12 / 35 / 47
MEX / Mexican / 46 / 0 / 0 / 46 / 46
Total / / 1455 / 137 / 49 / 1318 / 1367

Table S2. Continental and ethnic origin of the sampled populations

Population Acronym / Continental group / Ancestry region / Ethnicity / Sampling place / Latitude / Longitude
YRI / Africa / Ibadan, Nigeria / Yoruba / Ibadan, Nigeria / 7.3964 / 3.9168
MKK / Africa / Kinyawa, Kenya / Maasai / Kinyawa, Kenya / -2.9103 / 37.5248
LWK / Africa / Webuye, Kenya / Luhya / Webuye, Kenya / 0.6167 / 34.7668
CHA / Africa / Southern Chad, Chad / Laal & Sara / Southern Chad / 12.112 / 15.035
ASW / Africa / unknown, unknown / Unknown / Southwest USA / NA / NA
LEB / Middle East and North Africa / general population, Lebanon / Lebanese / Lebanon / 33.8886 / 35.4955
KUW / Middle East and North Africa / general population, Kuwait / Kuwaitis / Kuwait / 29.3676 / 47.9764
IRA / Middle East and North Africa / mostly Kordestan, Iran / Iranian / Iran / 35.6965 / 51.4231
EGY / Middle East and North Africa / Egypt / Egyptian / Egypt / 30.0647 / 31.2497
MOR / Middle East and North Africa / Assa-zag, Morocco / Moroccan / Morocco / 34.015 / -6.8325
CEU / Europe / North and West Europe, unknown / European ancestry / Utah, USA / 48.8564 / 2.3516
BRI / Europe / Great Britain, UK / British / UK / 51.5002 / -0.1264
DUT / Europe / Netherlands / Dutch / Netherlands / 52.3741 / 4.891
BAS / Europe / Guipuzcoa, Spain / Basque / Guipuzcoa, Spain / 43.1368 / -2.0737
GYP / Europe / unknown, Spain / Gypsy / La Mina (Sant Adrià del Besòs), Spain / 41.4189 / 2.2208
TSI / Europe / Tuscany, Italy / Italian / Tuscany, Italy / 43.7677 / 11.2571
ROM / Europe / Romania / Romanian / Romania / 44.4322 / 26.1047
CHE / Europe / Dagestan, Russia / Chechenian / Dagestan, Russia / 43.3102 / 45.6704
RUS / Europe / Arkhangel, Kostroma and Pskov regions, Russia / Russian / Arkhangel, Kostroma and Pskov regions in Russia / 55.7558 / 37.6185
TAT / Central Eurasia / around Kazan city, Russia / Tatar / around Kazan city, Russia / 55.6977 / 49.1082
ALT / Central Eurasia / Onguday, Ulagan, Turochak, Choisky districts, Russia / Altaian / Onguday, Ulagan, Turochak, Choisky districts in Russia / 51.9582 / 85.9715
UIG / Central Eurasia / Xinjiang, China / Uigur / Xinjiang, China / 43.8254 / 87.6173
GIH / Southern Asia / Gujarat, India / Gujarati / Houston, Texas, USA / 23.0389 / 72.566
CAN / Southern Asia / Tamil Nadu, India / High caste in Tamil Nadu / Tamil Nadu, India / 8.18 / 77.43
NTN / Southern Asia / Tamil Nadu, India / Low caste in Tamil Nadu / Tamil Nadu, India / 11.94 / 79.5
KAL / Southern Asia / Assam, India / High caste in Assam / Assam, India / 26.76 / 94.21
ADI / Southern Asia / North Eastern Arunachal Pradesh, Siang region, India / alpine populations / Siang region, North Eastern Arunachal Pradesh, India / 28.24 / 94.07
TIB / East Asia / Tibet, China / Tibetan / Tibet, China / 29.6455 / 91.1413
LAO / East Asia / Laos / Lao / Laos / 17.9629 / 102.614
ATI / East Asia / Philippines / Ati / Philippines / 11.5584 / 122.794
CHB / East Asia / Beijing, China / Han Chinese / Beijing, China / 39.9047 / 116.409
JPT / East Asia / Tokyo, Japan / Japanese / Tokyo, Japan / 35.6895 / 139.693
MEX / America / unknown, Mexico / Mexican / Los Angeles, California, USA / NA / NA

Supplementary Table 3. X-chromosome regions genotyped. Start, end: coordinates of the extreme points of each region (genome build 36). Length in bp.