Supplementary Text S1

i. Discriminant analysis of principal components (DAPC)

To assess how the study samples clustered into classes and to gauge underlying structure in the genotype data, we applied DAPC [1] as a means of visualizing the level of similarity within differently assembled versions of the dataset. DAPC transforms the data using principal component analysis (PCA) as a prior step to discriminant analysis (DA). The method groups observations into homogeneous classes derived from their distances and generates a graphical representation of the relatedness between the inferred clusters. DAPC can use training set data to establish classification rules and testing set(s) to gauge their efficiency.

Initial DAPC runs on the training set samples used all 63 pigmentation-related SNPs, in order to assess the maximum separation of clusters possible from all available genetic data. Analyses were then made of subsets of the most closely associated markers identified by INB assessments in this study, applying four and eight different hair colour categories. In order to strengthen clustering, we retained the same number of PCs as SNPs analysed and two discriminant functions in each simulation. DAPC calculations were performed using R statistical software (R v.3.0.1 [2]) together with the adegenet package (adegenet v.1.4-2 [3,4]).
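The two-step logic described above (PCA, then DA on the retained PC scores) can be sketched in a few lines. This is only an illustrative approximation in Python with NumPy, not the adegenet implementation used in the study; the function name `dapc` and its parameters are hypothetical, and genotypes are assumed to be coded as numeric allele counts.

```python
import numpy as np

def dapc(genotypes, groups, n_pcs, n_da=2):
    """Illustrative DAPC: PCA on centred genotypes, then discriminant analysis on the PC scores.

    genotypes: (samples x SNPs) numeric array (assumed allele-count coding)
    groups:    one class label per sample (e.g. hair colour category)
    """
    X = np.asarray(genotypes, dtype=float)
    labels = np.asarray(groups)
    Xc = X - X.mean(axis=0)                          # centre each SNP column

    # Step 1: PCA via singular value decomposition; keep the first n_pcs components
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_pcs] * s[:n_pcs]                # uncorrelated PC scores

    # Step 2: within-group and between-group scatter of the PC scores
    k = scores.shape[1]
    grand_mean = scores.mean(axis=0)
    Sw = np.zeros((k, k))
    Sb = np.zeros((k, k))
    for g in np.unique(labels):
        block = scores[labels == g]
        mean_g = block.mean(axis=0)
        dev = block - mean_g
        Sw += dev.T @ dev
        d = (mean_g - grand_mean)[:, None]
        Sb += len(block) * (d @ d.T)

    # Discriminant functions: leading eigenvectors of Sw^-1 Sb
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1][:n_da]
    return scores @ evecs.real[:, order]             # sample coordinates on the DA axes
```

Calling `dapc(X, labels, n_pcs=X.shape[1])` mirrors the setting described above, in which the number of retained PCs equals the number of SNPs and two discriminant functions are kept; the returned coordinates are what would be drawn in a scatter plot of the clusters.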

ii. Linkage disequilibrium and haplotype block analysis

The 63 pigmentation-related SNPs were analysed for Hardy-Weinberg equilibrium (HWE), and plots of inter-SNP linkage disequilibrium (LD) were prepared using Haploview [5]. The default maximum distance of 500 kilobases between markers was used when computing LD statistics for each chromosome.
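For readers unfamiliar with the quantities involved, the following Python sketch shows how an HWE test statistic and the pairwise LD measures D' and r² can be computed for biallelic SNPs. It is a simplified stand-in, not Haploview's procedure: it uses a chi-square statistic rather than an exact test, and it assumes phased haplotype counts are already known rather than estimated from genotypes. All function names are hypothetical.

```python
MAX_DISTANCE_BP = 500_000  # 500 kb comparison limit, as stated in the text

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 d.f.) of observed genotype counts against HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)              # frequency of the first allele
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNP)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (|D'|, r^2) from the four haplotype counts of two biallelic SNPs."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_a = (n_AB + n_Ab) / n                      # allele A frequency at SNP 1
    p_b = (n_AB + n_aB) / n                      # allele B frequency at SNP 2
    d = n_AB / n - p_a * p_b                     # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2

def pairs_within_distance(positions, max_distance=MAX_DISTANCE_BP):
    """Index pairs of markers on one chromosome no further apart than max_distance (bp)."""
    return [(i, j)
            for i in range(len(positions))
            for j in range(i + 1, len(positions))
            if abs(positions[j] - positions[i]) <= max_distance]
```

Restricting comparisons with `pairs_within_distance` before calling `ld_stats` reflects the 500 kb limit mentioned above; markers further apart are simply not compared.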

iii. Analysis of epistasis with multifactor dimensionality reduction (MDR)

Multifactor dimensionality reduction (MDR) was performed to detect and characterise epistasis between SNPs in the final recommended hair colour predictive set of 12 markers. MDR was run for each pairwise phenotype differentiation in the four-category system (e.g. blond vs. non-blond). MDR is a non-parametric approach that permits interactions to be detected, particularly in relatively small sample sizes [6]. We applied the MDR analysis module in v.2.0.
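The core idea of MDR for a pair of SNPs can be illustrated as follows: each two-locus genotype combination is labelled high-risk or low-risk according to its ratio of cases to controls, which collapses two dimensions into one attribute that is then scored by balanced accuracy. The Python sketch below is a minimal illustration under stated assumptions, not the MDR software used here: it omits the cross-validation and permutation testing normally applied, and it assumes the phenotype is coded 1 for the target category (e.g. blond) and 0 otherwise. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mdr_pair(genotypes, status, i, j):
    """Collapse SNPs i and j into a high/low-risk attribute.

    genotypes: list of per-sample genotype rows (e.g. 0/1/2 allele counts)
    status:    1 for the target phenotype, 0 otherwise (assumed coding)
    Returns (balanced accuracy, set of high-risk genotype cells).
    """
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, status):
        cell = (row[i], row[j])
        if y == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    n_cases = sum(cases.values())
    n_controls = sum(controls.values())
    threshold = n_cases / n_controls             # overall case:control ratio
    cells = set(cases) | set(controls)
    # A cell is high-risk when its case:control ratio exceeds the overall ratio
    high_risk = {c for c in cells if cases[c] > threshold * controls[c]}
    sensitivity = sum(cases[c] for c in high_risk) / n_cases
    specificity = sum(controls[c] for c in cells - high_risk) / n_controls
    return 0.5 * (sensitivity + specificity), high_risk

def best_pair(genotypes, status):
    """Return the SNP index pair with the highest balanced accuracy."""
    n_snps = len(genotypes[0])
    return max(combinations(range(n_snps), 2),
               key=lambda pair: mdr_pair(genotypes, status, *pair)[0])
```

Running `best_pair` once per binary phenotype contrast corresponds to the pairwise differentiations described above; in practice the accuracy would be estimated on held-out data rather than on the same samples used to label the cells.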

REFERENCES

1. Jombart T, Devillard S, Balloux F (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet 11:94. doi:10.1186/1471-2156-11-94

2. R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria

3. Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24 (11):1403-1405. doi:10.1093/bioinformatics/btn129

4. Jombart T, Ahmed I (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics 27 (21):3070-3071. doi:10.1093/bioinformatics/btr521

5. Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21 (2):263-265. doi:10.1093/bioinformatics/bth457

6. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC (2006) A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol 241 (2):252-261. doi:10.1016/j.jtbi.2005.11.036