Theoretical and Applied Genetics
Supplementary Material
An innovative procedure of genome-wide association analysis fits studies on germplasm population and plant breeding
Jianbo He1 • Shan Meng1 • Tuanjie Zhao2,4 • Guangnan Xing2,4 • Shouping Yang3,4 • Yan Li3,4 • Rongzhan Guan4,5 • Jiangjie Lu1 • Yufeng Wang1 • Qiuju Xia6 • Bing Yang6 • Junyi Gai1-5
Jianbo He and Shan Meng contributed equally to this work.
* Junyi Gai
1. Soybean Research Institute, Nanjing Agricultural University, Nanjing 210095, China
2. National Center for Soybean Improvement, Ministry of Agriculture, Nanjing 210095, China
3. Key Laboratory of Biology and Genetic Improvement of Soybean (General), Ministry of Agriculture, Nanjing 210095, China
4. State Key Laboratory for Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
5. Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing Agricultural University, Nanjing 210095, China
6. State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen 518083, China
Fig. S1 Heat map representation of the SNPLDB GSC in CSGP with hierarchical clustering tree on the left and the top. The color bar on the left and the top represents the source of each accession, green for wild soybean, blue for soybean landrace and red for released cultivar.
Fig. S2 Simulated decay of linkage disequilibrium and theoretical statistical power of association test. (A) Fitted decay of linkage disequilibrium against physical distance for different inbreeding coefficients in simulated population which evolved from a hybrid F1 for 100 generations. (B) Theoretical statistical power of association test for schemes with different linkage disequilibrium and heritability. The two dashed lines indicate r2 value at 0.3 and 0.5, respectively.
IBC, inbreeding coefficients; h2, heritability; r2, linkage disequilibrium;
Fig. S3 Performance of different GWAS methods with different significance levels in 100-locus model simulations under a heritability of 0.9. (A) The simulation was based on Chinese soybean germplasm population (CSGP), comprising 1,024 individuals, genotyped at 36,952 SNPLDBs; (B) the simulation was based on an ideal population (Ideal) simulated from real SNPLDB genotype data of CSGP, in which the genotype data was randomly shuffled across whole genome. Power, overall detection power of the 100 simulated causal loci; FDR, false discovery rate; Naïve, association test based on simple linear model without control for population structure; PCA, association test based on linear model, and the top 10 eigenvectors with largest eigenvalues of SNPLDB genetic similarity coefficient (GSC) matrix were incorporated as covariates to correct for population structure; LMM, association test based on linear mixed model, and SNPLDB GSC matrix was incorporated as covariance structure of random effect to correct for population structure; HAT, the haplotype-based association test implemented in PLINK software (--hap-assoc); RTM, association test using the proposed restricted two-stage multi-locus multi-allele GWAS method (RTM-GWAS). Nine fixed significance levels without Bonferroni correction were used in comparisons among the four GWAS methods, including the significance levels of 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3 and 1e-2, respectively.
Fig. S4 Q-Q plot of the Naïve, LMM and RTM-GWAS method in GWAS of the 100-seed weight in Chinese soybean germplasm population. Naïve, association test based on simple linear model without control for population structure; LMM, association test based on linear mixed model, and SNPLDB GSC matrix was incorporated as covariance structure of random effect to correct for population structure; RTM, association test using the proposed restricted two-stage multi-locus multi-allele GWAS method (RTM-GWAS). The p-values for all markers in the RTM-GWAS method under the significance level of 0.01 were obtained in the last step of forward selection and backward elimination.
Fig. S5 Candidate gene analysis of 100-seed weight. (A) Gene Ontology distribution based on biological process of the 100-seed weight candidate genes in CSGP. I − V on the top bar represent five groups of biological processes for candidate genes of 100-seed weight, with Group I including primary metabolism and secondary metabolism, Group II including seed development and cell growth, Group III as process of photosynthesis, Group IV including flower development, response to stress and signal transduction and Group V as the other unknown processes. The relatedness between the candidate gene groups and 100-seed weight are arranged in a decreasing order from the close to the distant. (B) Allele distribution among accessions of gene Glyma14g10915 in CSGP. The six black triangles on the gene represent six SNPs inside the gene; the combinations of their bases in CSGP represent the gene’s haplotypes. The a1−a5 are the five alleles for the SNPLDB locus, and the following data in parentheses is the allele effect (unit: g). The number of gene haplotype, i.e. 9, is greater than that of QTL allele, i.e. 5, because the expanding region of ±30 kb was used to search genes and the first two SNPs of Glyma14g10915 was not in the SNPLDB interval.
Fig. S6 Distribution of positive and negative effect alleles at subpopulation and whole population level. WA, wild soybean accessions; LR, soybean landraces; RC, released soybean cultivars.
Fig. S7 Optimal recombination design based on independence assortment model for soybean seed weight. (A) The distribution of predicted seed weight of the simulated progenies of all possible single crosses, with the maximum and minimum (top and bottom horizontal line) values of parental lines. (B) The top 20 superior single crosses. The optimal crosses were identified according to the 99th percentile of the sample as its breeding potential. Each cross was presented with QTL genotype of the two parental lines in one column, while different colors represent the size of QTL allele effect.
Table S1 Estimated decay distance of simulated populations with different inbreeding coefficients corresponding to different linkage disequilibrium thresholds
IBC / N / Distance (kb)r2 = 0.3 / r2 = 0.4 / r2 = 0.5
0 / 98.2 / 238 / 153 / 102
0.1 / 93.8 / 249 / 160 / 107
0.2 / 85.7 / 272 / 175 / 117
0.3 / 79.8 / 293 / 188 / 125
0.4 / 69.8 / 334 / 215 / 143
0.5 / 62.9 / 371 / 238 / 159
0.6 / 53.4 / 437 / 281 / 187
0.7 / 44.2 / 528 / 339 / 226
0.8 / 32.7 / 714 / 459 / 306
0.9 / 18.5 / 1,260 / 810 / 540
0.95 / 10.3 / 2,255 / 1,449 / 966
0.99 / 4.0 / 5,891 / 3,787 / 2,525
1 / 2.6 / 8,919 / 5,733 / 3,822
IBC, the inbreeding coefficient; N, the estimated effective population size; r2, the squared correlation coefficient between two loci; Distance, the physical distance between two loci.
Table S2 Distribution of markers on chromosomes in Chinese soybean germplasm population
Chr. / SNP / SNPLDBNo. / Min.
(bp) / Max.
(kb) / Mean
(kb) / No. / Min.
(bp) / Max.
(kb) / Mean
(kb)
Gm01 / 8,941 / 5 / 156 / 6 / 1,877 / 5 / 248 / 30
Gm02 / 7,936 / 5 / 287 / 7 / 2,121 / 5 / 290 / 24
Gm03 / 7,007 / 5 / 414 / 7 / 1,668 / 5 / 414 / 29
Gm04 / 9,145 / 5 / 163 / 5 / 2,166 / 5 / 259 / 23
Gm05 / 5,146 / 5 / 191 / 8 / 2,029 / 5 / 311 / 21
Gm06 / 7,199 / 5 / 227 / 7 / 1,845 / 5 / 284 / 27
Gm07 / 5,939 / 5 / 332 / 8 / 2,016 / 5 / 332 / 22
Gm08 / 5,900 / 5 / 178 / 8 / 1,916 / 5 / 253 / 24
Gm09 / 6,896 / 5 / 233 / 7 / 1,534 / 5 / 241 / 30
Gm10 / 7,947 / 5 / 199 / 6 / 1,927 / 5 / 287 / 26
Gm11 / 5,702 / 5 / 175 / 7 / 1,564 / 5 / 245 / 25
Gm12 / 5,841 / 5 / 155 / 7 / 1,852 / 5 / 221 / 22
Gm13 / 5,420 / 5 / 767 / 8 / 1,955 / 5 / 767 / 23
Gm14 / 8,765 / 5 / 171 / 6 / 1,910 / 5 / 273 / 26
Gm15 / 8,204 / 5 / 437 / 6 / 1,670 / 5 / 437 / 30
Gm16 / 6,147 / 5 / 819 / 6 / 1,327 / 5 / 819 / 28
Gm17 / 6,433 / 5 / 300 / 7 / 1,400 / 5 / 300 / 30
Gm18 / 10,345 / 5 / 294 / 6 / 2,248 / 5 / 294 / 28
Gm19 / 8,773 / 5 / 390 / 6 / 1,971 / 5 / 402 / 26
Gm20 / 7,872 / 5 / 276 / 6 / 1,956 / 5 / 421 / 24
Total / 145,558 / 5 / 308 / 7 / 36,952 / 5 / 355 / 26
Chr., chromosome; SNPLDB, SNP linkage disequilibrium block; No., the number of markers; Min., Max. and Mean, minimum, maximum and average distance between adjacent markers, respectively.
Table S3 Linkage disequilibrium calculated for SNPs and SNPLDBs
(kb) / SNP / SNPLDB / Mean / Q1 / Q2 / Q3 / Mean / Q1 / Q2 / Q3
0−10 / 476,904 / 39,059 / 0.91 / 0.97 / 1 / 1 / 0.83 / 0.82 / 1 / 1
10−50 / 1,078,857 / 75,751 / 0.86 / 0.89 / 0.99 / 1 / 0.74 / 0.52 / 0.90 / 1
50−100 / 1,298,162 / 81,770 / 0.82 / 0.80 / 0.98 / 1 / 0.69 / 0.41 / 0.82 / 1
100−200 / 2,559,390 / 155,561 / 0.79 / 0.68 / 0.97 / 1 / 0.65 / 0.35 / 0.75 / 0.99
200−500 / 7,500,201 / 478,278 / 0.75 / 0.54 / 0.94 / 1 / 0.62 / 0.30 / 0.69 / 0.98
500−1,000 / 12,418,526 / 764,683 / 0.70 / 0.42 / 0.88 / 1 / 0.58 / 0.26 / 0.61 / 0.95
1,000–2,000 / 24,477,296 / 1,476,893 / 0.66 / 0.33 / 0.78 / 1 / 0.56 / 0.23 / 0.55 / 0.92
2,000−3,000 / 23,940,238 / 1,412,785 / 0.62 / 0.28 / 0.70 / 0.99 / 0.54 / 0.21 / 0.52 / 0.89
3,000–5,000 / 46,679,980 / 2,618,958 / 0.59 / 0.25 / 0.64 / 0.97 / 0.52 / 0.20 / 0.49 / 0.86
SNPLDB, SNP linkage disequilibrium block; Distance, physical distance between two markers according to the Williams 82 Assembly 1 Genomic Sequence (Wm82.a1); No. pairs, the number of marker pairs; Mean, the arithmetic mean of all D′ values within each interval; Q1, Q2 and Q3, 1st, 2nd and 3rd quantile of D′ distribution, respectively.
Table S4 Performance comparison of the association analysis methods in 100-locus model simulations based on real soybean genotype data
Heritability / Method / Power / PVE / FDR / RPO / RPV0.2 / Naive / 0.217 / 0.083 / 0.900 / 0.022 / 0.008
PCA / 0.010 / 0.026 / 0.452 / 0.006 / 0.014
LMM / 0.005 / 0.017 / 0.136 / 0.004 / 0.015
HAT / 0.003 / 0.011 / 0.151 / 0.002 / 0.009
RTM / 0.011 / 0.027 / 0.665 / 0.004 / 0.009
0.5 / Naive / 0.488 / 0.360 / 0.973 / 0.013 / 0.010
PCA / 0.066 / 0.213 / 0.815 / 0.012 / 0.039
LMM / 0.029 / 0.150 / 0.391 / 0.018 / 0.091
HAT / 0.035 / 0.149 / 0.679 / 0.011 / 0.048
RTM / 0.051 / 0.189 / 0.431 / 0.029 / 0.107
0.9 / Naive / 0.685 / 0.779 / 0.979 / 0.014 / 0.016
PCA / 0.164 / 0.587 / 0.915 / 0.014 / 0.050
LMM / 0.080 / 0.456 / 0.524 / 0.038 / 0.217
HAT / 0.095 / 0.455 / 0.878 / 0.012 / 0.056
RTM / 0.175 / 0.615 / 0.357 / 0.112 / 0.395
Power, overall detection power of the 100 simulated causal loci; PVE, phenotypic variation explained by detected causal loci; FDR, false discovery rate; RPO, relative detection power [Power × (1 - FDR)]; RPV, relative PVE [PVE × (1 - FDR)]. A uniform Bonferroni-adjusted significance level of 0.05 was used for Naïve, PCA, LMM, HAT, and the second stage of RTM-GWAS, while a threshold of 0.05 was used for the first stage of RTM-GWAS.
Table S5 Performance comparison of the association analysis methods in 10-locus model simulations based on real soybean genotype data
Heritability / Method / Power / PVE / FDR / RPO / RPV0.2 / Naive / 0.436 / 0.168 / 0.925 / 0.033 / 0.013
PCA / 0.216 / 0.145 / 0.680 / 0.069 / 0.046
LMM / 0.186 / 0.135 / 0.521 / 0.089 / 0.065
HAT / 0.154 / 0.111 / 0.530 / 0.072 / 0.052
RTM / 0.225 / 0.140 / 0.286 / 0.161 / 0.100
0.5 / Naive / 0.696 / 0.470 / 0.990 / 0.007 / 0.005
PCA / 0.404 / 0.446 / 0.901 / 0.040 / 0.044
LMM / 0.326 / 0.420 / 0.744 / 0.083 / 0.107
HAT / 0.314 / 0.401 / 0.835 / 0.052 / 0.066
RTM / 0.437 / 0.440 / 0.138 / 0.377 / 0.379
0.9 / Naive / 0.818 / 0.862 / 0.996 / 0.003 / 0.004
PCA / 0.553 / 0.845 / 0.962 / 0.021 / 0.033
LMM / 0.466 / 0.822 / 0.814 / 0.087 / 0.153
HAT / 0.466 / 0.809 / 0.938 / 0.029 / 0.051
RTM / 0.687 / 0.849 / 0.082 / 0.631 / 0.779
Power, overall detection power of the 10 simulated causal loci; PVE, phenotypic variation explained by detected causal loci; FDR, false discovery rate; RPO, relative detection power [Power × (1 - FDR)]; RPV, relative PVE [PVE × (1 - FDR)]. A uniform Bonferroni-adjusted significance level of 0.05 was used for Naïve, PCA, LMM, HAT, and the second stage of RTM-GWAS, while a threshold of 0.05 was used for the first stage of RTM-GWAS.
Table S6 Performance of the RTM-GWAS method under different sample sizes in 100-locus model simulations based on ideal genotype data
N / Power / PVE / FDR / RPO / RPV200 / 0.022 / 0.213 / 0.326 / 0.015 / 0.144
400 / 0.101 / 0.506 / 0.098 / 0.091 / 0.456
600 / 0.190 / 0.691 / 0.044 / 0.182 / 0.661
800 / 0.256 / 0.762 / 0.023 / 0.250 / 0.745
Power, overall detection power of the 100 simulated causal loci; PVE, phenotypic variation explained by detected causal loci; FDR, false discovery rate; RPO, relative detection power [Power × (1 - FDR)]; RPV, relative PVE [PVE × (1 - FDR)]. A Bonferroni-adjusted significance level of 0.05 was used for the second stage of RTM-GWAS, while a threshold of 0.05 was used for the first stage of RTM-GWAS.