Supplementary material 2: Assigning to admixed original populations with DAPC and Bayesian clustering analysis

Classical genetic methods of assignment are based on the multilocus genotype of an individual and the expected probabilities of that genotype occurring in each of the potential sources. These “model-based” methods mostly rely on restrictive explicit assumptions (populations at Hardy-Weinberg equilibrium and linkage equilibrium) and were essentially developed for diploid (and haploid) genotype data (Manel et al., 2005). Polyploid microsatellite datasets may be scored as binary data (presence-absence), but it remains unclear whether these model-based methods are appropriate for such data.

The software Structure (Pritchard et al., 2000), a Bayesian model-based clustering algorithm, has been modified to handle polyploid data and allele copy number ambiguity, under the assumption of full autopolyploid inheritance (Falush et al., 2007). Commonly used to identify genetic clusters in a dataset, Structure may also be used to assign additional individuals of unknown origin to their source populations. When source populations are pre-specified, the algorithm estimates ancestry for the additional individuals while updating allele frequencies using only individuals from the source populations.

The Discriminant Analysis of Principal Components (DAPC), a non-model-based method recently developed and implemented in the adegenet R package (Jombart, 2008), provides an efficient description of genetic clusters using a few synthetic variables (called the discriminant functions). Contrary to traditional methods such as PCA or PCoA, which focus on the entire genetic variation, DAPC seeks linear combinations of the original variables (alleles) that maximize differences between groups while minimizing variation within clusters. Based on the retained discriminant functions, the analysis derives, for each individual, probabilities of membership in each of the different groups. These probabilities can be interpreted as the “genetic proximity” of individuals to the different clusters. It is possible to construct the linear model and obtain synthetic variables on a given dataset (the source populations), then add supplementary individuals (that were not used in constructing the model) and derive for each one a membership probability for each original source population. These coefficients might provide an “assignment measure” of individuals to predefined groups, comparable with the ancestry values derived by the Structure analysis.
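As a minimal sketch of how membership probabilities might be derived for supplementary individuals, the Python example below considers the simplest case of a single retained discriminant function (two groups). It assumes Gaussian groups sharing a pooled variance and priors proportional to group sizes; the function names and toy scores are hypothetical and do not reproduce adegenet's implementation.

```python
import math
from collections import defaultdict

def fit_groups(scores, labels):
    """Estimate group means, pooled variance and priors from source individuals."""
    by_group = defaultdict(list)
    for s, g in zip(scores, labels):
        by_group[g].append(s)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    n = len(scores)
    # Pooled within-group variance (denominator: n minus number of groups).
    ss_within = sum((s - means[g]) ** 2 for s, g in zip(scores, labels))
    pooled_var = ss_within / (n - len(by_group))
    priors = {g: len(v) / n for g, v in by_group.items()}
    return means, pooled_var, priors

def membership(score, means, pooled_var, priors):
    """Posterior probability of each group for one discriminant score."""
    log_w = {g: math.log(priors[g]) - (score - m) ** 2 / (2 * pooled_var)
             for g, m in means.items()}
    top = max(log_w.values())  # subtract the maximum for numerical stability
    w = {g: math.exp(v - top) for g, v in log_w.items()}
    total = sum(w.values())
    return {g: v / total for g, v in w.items()}

# Toy discriminant scores for source individuals (hypothetical values).
source_scores = [-2.1, -1.8, -2.4, -1.9, 2.0, 2.3, 1.7, 2.2]
source_labels = ["K1", "K1", "K1", "K1", "K2", "K2", "K2", "K2"]
model = fit_groups(source_scores, source_labels)

# Supplementary individuals are only projected; they do not alter the model.
for s in (-2.0, 0.1, 1.9):
    print(s, membership(s, *model))
```

The key design point is that the group means and variance are estimated from the source individuals alone, so the supplementary individuals cannot influence the groups they are compared against.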

In this study, both methods were used to investigate the potential source(s) in tropical America of New Guinean landraces. We therefore first ran these methods on the tropical America dataset, excluding New Guinean samples, which were then added as supplementary individuals. However, both methods require groups to be defined beforehand.

K-means clustering and the DAPC method: The adegenet package allows running the sequential K-means clustering algorithm and comparing the different clustering solutions using the Bayesian Information Criterion (BIC) to identify an optimal number of genetic clusters to describe the data. The data are first transformed using a PCA, notably to reduce the number of variables and speed up the clustering algorithm. We ran the K-means clustering algorithm for K=2 to K=10 on the tropical America dataset. Based on this analysis, three genetic clusters were considered optimal to describe the data.
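To illustrate how clustering solutions can be compared, the Python sketch below scores candidate partitions with a BIC-type statistic computed from the within-cluster sum of squares (WSS). The form used, n·log(WSS/n) + K·log(n), is an assumption made for illustration, as are the toy coordinates and the hand-specified partitions; the adegenet documentation should be consulted for the exact statistic it reports.

```python
import math

def within_ss(points, labels):
    """Within-cluster sum of squares and number of clusters for a partition."""
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    wss = 0.0
    for members in groups.values():
        centre = [sum(c) / len(members) for c in zip(*members)]
        wss += sum((x - m) ** 2 for p in members for x, m in zip(p, centre))
    return wss, len(groups)

def bic(points, labels):
    # Assumed form: n * log(WSS / n) + K * log(n); lower values are preferred.
    n = len(points)
    wss, k = within_ss(points, labels)
    return n * math.log(wss / n) + k * math.log(n)

# Toy coordinates on two principal components (hypothetical values).
points = [(0.0, 0.1), (0.2, -0.1), (0.1, 0.0),
          (5.0, 5.1), (5.2, 4.9), (5.1, 5.0),
          (10.0, 0.1), (10.2, -0.1), (10.1, 0.0)]

# Hand-specified candidate partitions standing in for K-means solutions.
partitions = {
    1: [0] * 9,
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
}
scores = {k: bic(points, labels) for k, labels in partitions.items()}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

In practice the partitions would come from the K-means runs themselves, and the BIC curve is usually inspected visually rather than reduced to its minimum alone.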

Figure 1: Inference of the number of clusters in the DAPC performed on the tropical America dataset. A K value of 3 (the lowest BIC value) represents the best summary of the data.

The three genetic clusters are geographically restricted: two clusters (K2 and K3) group mostly individuals from the Northern region (90.6%) while cluster K1 groups mostly those from the Southern region (81.4%). The grouping obtained for K=2 also provides an accurate description of the tropical America dataset: cluster K1 mainly contained samples from the Southern genepool (79%), and cluster K2 mostly those from the Northern genepool (91%).

DAPC relies on data transformation using principal component analysis (PCA) as a prior step. Retaining too many PCs can lead to overfitting of the discriminant functions, which could then model any structure and discriminate virtually any set of clusters. The adegenet package proposes an optimization procedure to evaluate the optimal number of PCs to retain. The procedure is based on the calculation of the α-score, which measures the difference between the proportion of successful reassignment of the analysis (observed discrimination) and values obtained using random groups (random discrimination). The number of retained PCs can be chosen so as to optimize the α-score.
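The logic of this score can be sketched in Python as follows. The example follows the definition given above (observed reassignment success minus the mean success obtained with randomly permuted groups), but it replaces DAPC with a simple nearest-group-mean classifier on one-dimensional toy scores. The classifier, function names, number of permutations and data are all assumptions made for illustration.

```python
import random

def nearest_mean_reassign(scores, labels):
    """Stand-in classifier: reassign each individual to the closest group mean."""
    groups = {}
    for s, g in zip(scores, labels):
        groups.setdefault(g, []).append(s)
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    return [min(means, key=lambda g: (s - means[g]) ** 2) for s in scores]

def success_rate(scores, labels):
    """Proportion of individuals reassigned to the group they were given."""
    predicted = nearest_mean_reassign(scores, labels)
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)

def alpha_score(scores, labels, n_perm=200, seed=1):
    """Observed discrimination minus mean discrimination with random groups."""
    rng = random.Random(seed)
    observed = success_rate(scores, labels)
    random_rates = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)          # random groups of the same sizes
        random_rates.append(success_rate(scores, shuffled))
    return observed - sum(random_rates) / n_perm

# Toy scores for two well-separated groups (hypothetical values).
toy_scores = [-2.1, -1.8, -2.4, -1.9, -2.2, 2.0, 2.3, 1.7, 2.2, 1.9]
toy_labels = ["K1"] * 5 + ["K2"] * 5
print(alpha_score(toy_scores, toy_labels))
```

In an actual optimization, such a score would be recomputed for each candidate number of retained PCs, and the number giving the highest value would be kept.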

The optimization α-score graph (Figure 2) shows that only a few PCs need to be retained for the assignment analysis. We tested DAPC for both groupings (K=2 and K=3), retaining between 4 and 10 PCs for the prior data transformation. Assignment results were globally congruent, and in the article we present the results obtained with 5 PCs retained (29.3% of the total variance).

Figure 2: Optimization α-score graph.

Bayesian clustering method: We also used a Bayesian clustering method on the tropical America dataset to predefine groups in tropical America that may be used for the assignment of New Guinea accessions. We ran Structure for K=2 to K=6, using the admixture model, correlated allele frequencies, 50,000 burn-in iterations and 150,000 Markov chain Monte Carlo (MCMC) steps, with the data coding for handling genotype ambiguity for co-dominant markers in polyploids (Falush et al., 2007). We then plotted the ΔK criterion (Evanno et al., 2005) to identify the optimal number of clusters to describe the data.
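For readers unfamiliar with the ΔK criterion, the Python sketch below shows one common way of computing it: the absolute second-order difference of the mean log-probability across K, divided by the standard deviation of the log-probability at K. It assumes that several replicate runs are available for each K, and the log-probability values are hypothetical toy numbers, not results from this study.

```python
from statistics import mean, stdev

def delta_k(lnp):
    """Compute Delta K from a dict mapping K to ln P(D) of replicate runs.

    Delta K is only defined where runs exist for K-1, K and K+1.
    """
    out = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp:
            continue
        second_diff = abs(mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1]))
        out[k] = second_diff / stdev(lnp[k])
    return out

# Hypothetical ln P(D) values for three replicate runs per K.
toy_lnp = {
    2: [-5200.0, -5210.0, -5190.0],
    3: [-5000.0, -5004.0, -4996.0],
    4: [-4980.0, -4990.0, -4970.0],
    5: [-4970.0, -4990.0, -4950.0],
    6: [-4965.0, -4995.0, -4935.0],
}
dk = delta_k(toy_lnp)
best = max(dk, key=dk.get)
print(dk, best)
```

Because the criterion needs both neighbours of K, runs from K=2 to K=6 yield ΔK values only for K=3 to K=5; it therefore cannot by itself favour the smallest or largest K tested.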

Figure 3: a) Variation of the posterior log-probability of the data as a function of the number of clusters K. b) Variation of ΔK values.

Following this method, the optimal number of clusters to describe the data was unclear. We therefore retained the groupings obtained for K=2 and K=3 (as determined by the BIC criterion) and compared their composition with those obtained with the DAPC method.

Results:

Figure 4: Membership probabilities (DAPC analysis) or ancestry values (Bayesian analysis) of tropical America landraces for the K1 and K2 clusters (DAPC K2 or Bayesian K2) or the K1, K2 and K3 clusters (DAPC K3 or Bayesian K3). Each individual is represented as a vertical bar, with colours corresponding to probabilities of membership in K1 (black), K2 (dark gray), and K3 (light gray).

Both methods gave globally congruent results: Neotropical sweet potatoes are characterized by two distinct, geographically circumscribed genetic groups. One genetic group corresponds to most of the accessions from the Southern region and the other to those from the Northern region. For K=3, a sub-structure is revealed in which Northern region accessions are split into two genetic groups.

However, the phylogeographic pattern is not so clear-cut: in the Southern region, we detected some accessions that were clearly attributed to the nuclear cluster(s) characteristic of the Northern region (K2, or K2 and K3) or that had a mixed genetic constitution. Likewise, in the Northern region, we identified several accessions attributed to cluster K1 or with a mixed composition. As discussed in a previous paper (Roullier et al., 2011), this situation suggests that Neotropical sweet potatoes are characterized by two original differentiated genepools (probably related to independent domestications in each region) and that clones were secondarily exchanged between the two regions and then recombined with local material. This scenario is also supported by chloroplast data (Roullier et al., 2011). This admixture is clearly reflected in the Bayesian clustering results, in which most of the Neotropical accessions exhibited a mixed genetic constitution. With the DAPC method, individual assignment was more “contrasted”, with only a few individuals exhibiting a mixed constitution.

Thus, it was difficult to use Bayesian clustering results on the tropical America dataset to predefine groups for assignment of the New Guinean accessions. We preferred to use the genetic grouping (and not the a priori regional grouping) inferred by the K-means clustering for K=2, which provides an accurate and simple summary of the neotropical dataset, to perform assignment of New Guinean accessions.

Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software Structure: a simulation study. Mol Ecol 14:2611-2620.

Falush D, Stephens M, Pritchard JK (2007) Inference of population structure using multilocus genotype data: dominant markers and null alleles. Mol Ecol Notes 7:574-578.

Jombart T (2008) Adegenet: R package for the multivariate analysis of genetic markers. Bioinf 24:1403-1405.

Manel S, Gaggiotti OE, Waples RS (2005) Assignment methods: matching biological questions with appropriate techniques. Trends Ecol Evol 20:136-142.

Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945-959.

Roullier C, Rossel G, Tay D, McKey D, Lebot V (2011) Combining chloroplast and nuclear microsatellites to investigate origin and dispersal of New World sweet potato landraces. Mol Ecol 20:3963-3977.
