Identification of Transcriptional Regulatory Sites Using Chip Sequencing Data and A

Supplementary Methods

Workflow and software implementations

The genotype information of SNVs and indels in the VCF files were processed with VCFtools(Danecek et al., 2011). The functional impact of SNVs and indels was predicted byANNOVAR (Wang et al., 2010), SIFT (Sim et al., 2012), and SeattleSeq Annotation 138 (Ng et al., 2009). MATLAB was used as our programming environment due to its capability to carry out large-scale matrix computation. We used the MATLAB-based package Logit-Lapnet(Zhang et al., 2013) with Excel for data importing and exporting and CVX, a MATLAB-based convex optimization software, for parameter estimations. We also used loosely coupled parallelism (Schadt et al., 2010) in our implementations to expedite the calculation. After we ranked all candidate genes based on their scores, top-selected genes were treated as seed genes to generate the GsN using SubNet(Lemetre et al., 2013). Cytoscape was used to visualize the SCZ GsN.

Simulation studies

Though Logit-Lapnet method has been verified to perform better than Enet and Lasso in analyzing gene expression data (Zhang et al., 2013), no previous studies have shown that its performance is better than the other two in analyzing the WES data operating on the ratio data matrix that we proposed. To evaluate the performance of Logit-Lapnet in WES data process, we conducted a simulation. We used simulated WES data – i.e., y, X, , and Lin Equation 3 – to compare the performance of Logit-Lapnet, Enet, Lasso, and Logit methods. We simulated four different data sets under various conditions for a comprehensive assessment of method performance. In all simulations, regression coefficients were differentially preset for all the genes. Constant non-zero coefficients were assigned to genes selected as putative 'causal' genes, and zero as coefficients to the rest. As Equation 3 shows, λ and α are two model parameters. Both of them are present in – and thus tuned for – all models except the simple logistic regression model. Comparing the estimated coefficients with their preset values so as to assess the performance of the methods. To generate such synthetic data, we adopted an approach used by Wan et al in ascertaining their Cox patterns(Wan et al., 2013). We simulated WES data for a set of genes with the characteristics observed in real WES data. We used the following parameters and procedures in our four groups of simulation operationalization:

The sample sizes in our simulations vary from 10 to 200. At a given sample size, we construct the ROC and calculate AUC to assess the method performance.
For L, we simulated the sequencing data containing 1,000 genes in total: 100 transcription factors (TFs) with 9 regulated genes for each. The gene network used in our simulation is derived from those employed in former studies (Li and Li, 2008; Wan et al., 2013; Zhang et al., 2013).
For , the first 30 genes – three TFs and the 27 genes that they regulated – were designed as the putative 'causal' genes and thus all assigned 0.32 as their regression contribution coefficients. The coefficients for the rest 970 genes were all set to zeros. Consequently, the first 30 genes determine the genotypic information.
The data matrix X contains the ratios of damaging allele counts to neutral allele counts. If the SNVs in a gene are all neutral, the corresponding row in X will contain only zeros. The percentage of damaging SNVs for the simulated 'causal' genes should not be miniscule; otherwise they will not have effects at all. Using Polyphen2 and SIFT, we estimated ~45% of nonsynonymous SNVs identified in the real SCZ WES data to be damaging. We set the probability of a nonsynonymous SNV being damaging accordingly in all of the four simulations (Supplementary Table 1). For example, in simulation 1 we set 75% of the SNVs in the first 500 genes (including the 30 'causal' genes) to be damaging and a corresponding 15% for the remaining 500 genes. By doing so, we guarantee that the rows in the simulated data matrix associated with the 'causal' genes are dense enough to have effective prediction contributions. Note that the divisions of the two group-settings could be arbitrary and here we just set the number of genes in each group equally without loss of generality. The percentages of damaging SNVs are also set representatively.
The clinical phenotype vector y for the samples is given by:

/ (6)

in which N(0,1) denotes the standard normal distribution.

Using the aforementioned procedure, our simulated data have following properties:

The maximal number of exonic SNVs per gene is ~50. This is a preliminary estimate based on the facts that the number of exons per gene is seldom greater than 50 (Larson and Schaid, 2014) and that SNVs occur nearly 1 in every one thousand base pairs, which is the maximum length of one exon. Indeed, 99.65% of the genes in the real WES data set have less than 50 exonic SNVs and only 0.073% of them contain more than 100 exonic SNVs, which is consistent with our preliminary estimations that are set in prior.
The number of exonic SNVs in each gene in our simulation matches that detected in real WES data, as we randomly choose 1,000 genes from the real WES dataset, counting the number of exonic SNVs in each of these genes, and then set the corresponding genotypes in the simulated gene sets.
The percentages of homozygous-reference, heterozygous and homozygous-alternate are 75%, 10% and 15% respectively, which is consistent with genotype distributions in real WES data.

We compared the performance of Logit-Lapnet, Enet, Lasso, and Logit methods – i.e., the LR with different amount of constraints on the regression coefficients – applied to WES data. For each method, first, from the true coefficient values in and the estimated ones in for all 1,000 genes used in simulation, we derived by applying the logic not to and by dichotomizing with a small ε. Next, we calculated true positive (TP), false positive (FP), true negative (TN), and false negative (FN) as follows (Kim et al.):

TP = + , TN = + , FP = + , FN = + / (7)

In Equation 7, the operation | |+ counts the number of non-zero elements in the vector, and the sign ' · ', between two vectors denotes the element-wise ‘or’ operation of two vectors. The true positive rate (TPR, also known as sensitivity), the false positive rate (FPR = 1 – specificity), and the F1 score are calculated by:

, , /
(8)

Lastly, we plotted the ROCs and calculate the AUCs by varying ε to obtain pairs of corresponding TPRs and FPRs in Supplementary Figure 2 and Supplementary Figure 3.

Supplementary figures

A / B
C / D

Supplementary Figure 1.Comparisons of average ROCs of Logit-Lapnet, Enet and Lasso methods on the regularization path. Here, ROCs are different from those presented in the regular context. We compare three methods, i.e., Logit-Lapnet, Enet and Lasso methods. Logit method does not have parameters that identify regularization path. Therefore the ROCs of that algorithm are missing. Error bars denote standard deviations of the values for all replicates.

A / / B /
C / / D /

Supplementary Figure 2. Average AUCs with simulated samples. (A) – (D). Simulation under scenarios 1 – 4 with different sample sizes. We simulated each of the observations for 40 replicates. The standard deviation is indicated by the error bar.

A / B
C / D

Supplementary Figure 3. Comparisons of the average AUCs of Logit-Lapnet, Enet and Lasso methods on regularization paths. The AUCs of Logit are missing since this method does not have regularization path parameters. Error bar of each sample denotes standard deviation of different replicates calculated.

Supplementary Figure 4. Method comparison using F1 score. All simulations were done with 100 samples. For each sample, the error bar indicated the standard deviation of all the replicates obtained.

Supplementary Figure 5. Relationship between running time andsample size. Different sizes of samples are chosen randomly from the data set, and the running times are calculated to get the trend of running time versus sample sizes.

Supplementary Figure 6. Flowchart of input data processing. After annotating SNPs and indels, we use a damaging-dominant policy to count neutral and damaging alleles. Because longer genes tend to contain more variants, we normalize the damaging allele counts by dividing them by the corresponding neutral allele counts.

Supplementary tables

Supplementary Table 1. Simulation parameter descriptions.
Simulation / Sample size / Percentage of damaging SNVs / Damaging SNV distribution among genes / Percentages of genotypes
1 / 100 / 45% / 1-500: 75%;
501-1000: 15% / - Homozygous reference: 75%
-Heterozygous: 10%
- Homozygous alternate: 15%
2 / 45% / 1-500: 80%;
501-1000: 10%
3 / 46.2% / 1-30: 85%;
31-1000: 45%
4 / 45% / 1-500: 85%;
501-1000: 5%
The percentage of damaging SNVs and the percent of different genotypes are both consistent with the real WES data set that we analyzed.
Supplementary Table 2. Top 20 risk genes for SCZ in the WES study cohort.
Gene / Entrez ID / Randomization probability 1 / In Purcell et al.? / Literature support?
(NCBI PMID) / Network
Hub? 2
RABIF / 5877 / 0 / Yes
CARD10 / 29775 / 0.017 / Yes / 24463508
GPM6A / 2823 / 0 / 18163405
TIMP2 / 7077 / 0.017 / Yes / 24463508
GABRA1 / 2554 / 0.033 / 16172613
PPP2CA / 5515 / 0.033 / Yes / 24463508
GAD1 / 2571 / 0.017 / 17767149
BDNF / 627 / 0.017 / 16581172
CSF2RB / 1439 / 0.017 / 17667962 / Yes
PRKCSH / 5589 / 0.017 / Yes / 24463508 / Yes
CHRFAM7A / 89832 / 0 / 21718690
SMG7 / 9887 / 0.017 / Yes / 24463508
FMR1 / 2332 / 0 / 23838275 / Yes
FABP7 / 2173 / 0.017 / 18001149
TNFSF9 / 8744 / 0.017 / Yes / 24463508
TRIM27 / 5987 / 0 / Yes / 24463508
PTPRB / 5787 / 0.033 / Yes / 24463508
PRKCG / 5582 / 0.033 / Yes / 24463508
UBE2N / 7334 / 0.017 / Yes / 24463508 / Yes
PHC2 / 1912 / 0.017 / 24463507
Notes:
1. After scoring, for each gene we used data randomization to calculate the probability that the rank of this gene based on a random data set is the same as or even better than its rank based on the real data set. These are the top 20 genes with the probability ≤ 0.05.
2. A network hub is a gene in the SCZ GsN with a degree ≥ 6, the top 25th percentile of the degree distribution of the SCZ GsN.
Supplementary Table 3. Literature-supported SCZ genes.
Entrez ID / Gene Name / NCBI PMID
4988 / OPRM1 / 23560613
1742 / DLG4 / 21151988
885 / CCK / 8737954
1191 / CLU / 20738160
2260 / FGFR1 / 23231877
7018 / TF / 18045615
3458 / IFNG / 22623148
1958 / EGR1 / 22691714
2885 / GRB2 / 21195589
2912 / GRM2 / 11317221
3479 / IGF1 / 17898347
7052 / TGM2 / 18561261
324 / APC / 15768050
5743 / PTGS2 / 15041036
1138 / CHRNA5 / 21418140
11113 / CIT / 20084519
9177 / HTR3B / 18807291
1565 / CYP2D6 / 18034624
266727 / MDGA1 / 21146959
2534 / FYN / 23250004
84628 / NTNG2 / 15705354
5327 / PLAT / 21898905
26047 / CNTNAP2 / 23123147
22999 / RIMS1 / 22682706
1739 / DLG1 / 22225629
4099 / MAG / 16039057
7057 / THBS1 / 22311024
134 / ADORA1 / 19820430
775 / CACNA1C / 24262814
9455 / HOMER2 / 19914345
27020 / NPTN / 17123723
4208 / MEF2C / 23380319
3976 / LIF / 19879916
56288 / PARD3 / 22969987
9228 / DLGAP2 / 24416398
1576 / CYP3A4 / 19591893
6925 / TCF4 / 21932083
1012 / CDH13 / 23358160
10718 / NRG3 / 20713722
6720 / SREBF1 / 18936756
23411 / SIRT1 / 20977650
4543 / MTNR1A / 21526376
9369 / NRXN3 / 23306218
783 / CACNB2 / 24901509
288 / ANK3 / 23109352
81565 / NDEL1 / 20084519
2823 / GPM6A / 18163405
688 / KLF5 / 18226501
7531 / YWHAE / 18658164
5141 / PDE4A / 21898905
85439 / STON2 / 21407139
8841 / HDAC3 / 20471694
7434 / VIPR2 / 21346763
2332 / FMR1 / 23838275
6721 / SREBF2 / 18936756
9759 / HDAC4 / 20471694
2173 / FABP7 / 18001149
1577 / CYP3A5 / 19591893
3303 / HSPA1A / 23893339
10280 / SIGMAR1 / 21549171
3563 / IL3RA / 18547720
4340 / MOG / 15653272
9732 / DOCK4 / 21682944
1463 / NCAN / 23795679
2952 / GSTT1 / 23107768
55784 / MCTP2 / 19223264
7444 / VRK2 / 23102693
3927 / LASP1 / 23040864
427 / ASAH1 / 21375364
57188 / ADAMTSL3 / 21239144
3839 / KPNA3 / 22960338
4482 / MSRA / 18506707
9444 / QKI / 16342280
8220 / DGCR14 / 16432632
223117 / SEMA3D / 20684831
55733 / HHAT / 23142968
202333 / CMYA5 / 23778016
9348 / NDST3 / 24253340
203062 / TSNARE1 / 24166486
27087 / B3GAT1 / 12874601
23774 / BRD1 / 19693800
84547 / PGBD1 / 22037552
10400 / PEMT / 17720317
154664 / ABCA13 / 19944402
116285 / ACSM1 / 20185149
23600 / AMACR / 20875727
54805 / CNNM2 / 24160291
63876 / PKNOX2 / 22648509
27257 / LSM1 / 24035562
90273 / CEACAM21 / 21682944
9739 / SETD1A / 24853937