(Supplementary Material)

Discovery of Potent NEK2 inhibitors as Potential Anticancer agents Using Structure-Based Exploration of NEK2 Pharmacophoric Space Coupled with QSAR Analyses

Mohammad A. Khanfarab*, Fahmy Banata, Shada Alabeda, Saja Alqtaishata

aDepartment of Pharmaceutical sciences, Faculty of Pharmacy, The University of Jordan, Amman 11942, Jordan.

bInstitut fuer Pharmazeutische and Medizinische Chemie, Heinrich-Heine-Universitaet Duesseldorf, 40225 Duesseldorf, GERMANY.

*Corresponding Author:

Faculty of Pharmacy, University of Jordan, P.O Box 13140, Amman, 11942, Jordan
Phone: +96265355000, ext. 23339, Fax: 0096265300250, Email:

SM-1 Receiver operating characteristic (ROC) curve analysis

ROC analysis evaluates the ability of a particular pharmacophore model to classify the compounds in a testing list as actives or inactives. The performance of the considered model is judged based on the area under the curve (AUC) of the corresponding ROC as well as the overall specificity, overall accuracy, overall true positive rate, and overall false positive rate.

It was necessarily to perform valid evaluation of structural database (testing set) that consists of an adequate list of decoy compounds in combination with diverse list of known active compounds. The decoy list was selected as depicted by Verdonk and co-workers.

Briefly, the decoy compounds were prepared relying on three basic one-dimensional (1D) characteristics that permit the estimation of distance (D) between two molecules (e.g., i and j), namely: (1) the number of hydrogen-bond donors (NumHBD); (2) number of hydrogen-bond acceptors (NumHBA) and (3) the number of nonpolar atoms (NP, known as the summation of Cl, F, Br, I, S and C atoms in a particular molecule).

For each active compound in the testing set, the distance to the nearest other active compound is detected using their Euclidean Distance (equation A):

…. (A)

The minimum distances are averaged by calculating the mean over all active compounds (Dmin). After that, for each active molecule in the testing set an average of 34 decoys were chosen in random way from the ZINC database. The decoys were selected in such a pattern that they did not exceed Dmin distance from their corresponding active members.

Additionally, to further diversify the actives compounds, i.e., to prevent close likeness among active compounds in the testing set, any active compound having zero distance D(i,j) from other active compound(s) in the testing set was eliminated. Active testing molecules were defined as those possessing NEK2 inhibitors affinities (IC50 values) extent from 0.38 μM to 3.4 μM. The testing set included 8 active compounds and 273 ZINC compounds.

The decoy set (281 compounds) was screened by each pharmacophore for ROC curve analysis utilizing the "Best flexible search" option carried out in HYPOGEN, while the conformational spaces of the compounds were formed using the "Fast conformation generation option" applied in HYPOGEN.

Compounds that are lacking one or more features were excluded from hit lists. Moreover, decoy set compounds were fitted against the selected pharmacophore and their fit values (best fit values) was determined by equation (2).

ROC curves for corresponding pharmacophores were constructed using the hit lists. ROC curve analysis prescribes the sensitivity (Se or true positive rate, Equation B) for any probable change in the number of selected compounds (n) as a function of (1-Sp). Sp is known as specificity or true negative rate (Equation C).

………………….. (B)

………………….. (C)

Where, TP is the number of active compounds that are captured by the virtual screening method (true positives), FN is the number of active compounds ejected by the default screening, TN is the number of excluded decoys (considered inactives), while FP is the number of captured decoys (considered inactives).

A ROC curve is plotted by adjusting the score of the active molecule as the first threshold. After that, the number of decoy compounds within this cut-off is enumerated, and the corresponding Se and Sp pair is detected. This calculation is repeated for the active compound with the second highest score and so on, until the scores of all actives are represented as selection thresholds.

The ROC curve expressing ideal distributions, where overlapping between the scores of active molecules and decoys does not present, starts from the origin to the upper-left corner until all the actives are recovered and Se comes the value of 1. Then, the ideal ROC curve keeps as a horizontal straight line to the upper-right corner where all active compounds and all decoy compounds are retrieved, which indicates by Se = 1 and Sp = 0.

The success of a particular virtual screening workflow can be governed from the following criteria:

1) Area under the ROC curve (AUC): in a successful ROC curve, an AUC value of 1 is perfect. However, AUC value of 0.5 indicates random distributions. Virtual screening that carries out better than a random recognition of active compounds and decoys retrieve an AUC value between 0.5 and 1.

2) Overall accuracy (ACC): indicates the percentage of molecules that were classified correctly by the screening protocol (Equation D). Testing compounds are appointed a binary score value of zero (compound captured) or one (compound not captured).

……………………. (D)

Where, N is the total number of compounds in the testing database, A is the number of true active compounds in the testing database.

3) Overall specificity (SPC): indicates the percentage of excluded inactive compounds by the particular virtual screening workflow. Inactive test compounds are appointed a binary score value of zero (compound captured) or one (compound not captured).

4) Overall true positive rate (TPR or overall sensitivity): describes the fraction percentage of captured active compounds from the total number of active compounds. Active test compounds are assigned a binary score value of zero (compound captured) or one (compound not captured).

5) Overall false negative rate (FNR or overall percentage of excluded actives): describes the fraction percentage of active compounds that were excluded by the virtual screening method. Discarded active test compounds are assigned a binary score value of zero (compound captured) or one (compound not captured).

SM-2 Multiple Linear Regression-Based QSAR Modeling

We implemented a gene-based encoding system. In this scheme, the possible models (chromosomes) differ from one another by the set of independent descriptors variables (genes). If the general number of genes is equal to P (in this particular case, P = 351 variable corresponding to 22 pharmacophore fit values and 329 calculated descriptors), the chromosome corresponding to any model consists of a string of P binary digits (bits) or genes. Each value in the string represents an independent variable (0 = absent, 1 = present). Each chromosome is associated with a fitness value that reflects how good it is compared to other solutions. The followings are parameters used in the GFA-based selection of descriptors:

·  Creating an initial population: Number of starting random chromosomes.

·  Mating operation: two parent chromosomes are combined to generate new solutions (offspring).

·  Mutation operator: A mutation operator changes one or more bits in the chromosome to its complement.

·  Maximum number of generations to exit and complete the algorithm.

The generation of the independent descriptors was carried out as follows: At first, the chemical structures of the inhibitors were brought into Discovery Studio (version 2.5.5) as standard single 3D conformer illustrations in SD format. After that, various descriptor groups were determined for each compound using the C2. DESCRIPTOR module within Discovery Studio. The calculated descriptors involved multiple simple and valence connectivity indices, electrotopological state indices and other molecular descriptors (e.g., logarithm of partition coefficient, polarizability, dipole moment, molecular volume, molecular weight, molecular surface area, etc.). Furthermore, by applying the Best-fit option in HYPOGEN, the training compounds were superimposed toward the best pharmacophores (22 models, Table 1), then, their fit values were added as additional descriptors. The fit value for any compound is determined automatically through equation (2).

GFA was utilized to obtain on the multiple linear regression modeling (MLR), i.e., it was used to detect the optimal possible QSAR regression equation that is able to relating the variations in biological activities of the training molecules with variations in the generated descriptors, i.e., MLR modeling.

GFA was employed to search for the best possible QSAR regression equation capable of correlating the variations in biological activities of the training compounds with variations in the generated descriptors implementing MLR modeling. The fitness function that was utilized herein depends on Friedman’s ʻ lack-of-fit’ (LOF).

The preliminary diagnostic trials proposed the following optimal GFA elements: explore linear, quadratic and spline equations at mating and mutation probabilities of 50%, population size = 500, number of genetic iterations = 10,000 and lack-of-fit (LOF) smoothness parameter = 0.5.

However, to detect the optimal number of explanatory terms (QSAR descriptors), we examined and estimated all possible QSAR models resulting from 4 to 12 explanatory terms.

All QSAR models were validated by applying leave one-out cross-validation (r2LOO), and predictive r2 (r2PRESS) that was calculated from the testing subsets. The predictive r2PRESS is defined as:

r2PRESS = SD-PRESS/SD ………………… (E)

Where, SD is the sum of the squared deviations between both, the biological activities of the testing compounds, and the mean activity of the training set compounds. PRESS is the squared deviations between actual and predicted activity values for each compound in the test set.

Descriptor-scanning identified several QSAR models; the optimal model was selected to predict the NEK2 inhibitory bioactivities of in silico hits.

No. / R / IC50 (µM) / Reference
1 / / 0.87 / 11
2 / / 1.9 / 11
3 / / 1.9 / 11
4 / / 4.2 / 11
5 / / 19.8 / 11
6 / / 7.8 / 11
7 / H / 48.5
No. / R / IC50 (µM) / Reference
8 / / 50 / 11
9 / / 18.3 / 11
10 / / 5.7 / 11
11 / / 2.3 / 11
12 / / 50 / 11
13 / / 2.6 / 11
14 / / 2.4 / 11
15 / / 26.2 / 11
16 / / 16.9 / 11
17 / / 10.8 / 11
18 / / 0.39 / 11
19 / / 0.23 / 11
20 / / 3.4 / 11
No. / R / IC50 (µM) / Reference
21 / / 7.2 / 11
22 / / 4.3 / 11
23 / / 2.1 / 11
24 / / 1.6 / 11
25 / / 3.4 / 11
No. / R1 / R2 / IC50 (µM) / Reference
26 / / / 5.43 / 12
27 / / / 6.57 / 12
28 / / / 1.18 / 12
29 / / OCH3 / 64.4 / 12
30 / H / OCH3 / 86.0 / 12
31 / H / / 50 / 12
32 / / 0.23 / 12
33 / / 0.23 / 12
34 / / 2.56 / 12
35 / / 0.79 / 12
36 / / 2.47 / 12
37 / / 0.12 / 13
38 / / 0.21 / 13
39 / / 2.68 / 13
40 / / 2.63 / 13
41 / / 0.047 / 13
42 / / 0.047 / 13
43 / / 0.058 / 13
44 / / 0.093 / 13
45 / / 0.318 / 13
46 / / 1.8 / 13
47 / / 0.022 / 13
48 / / 9.0 / 13
49 / / 2.26 / 13
50 / / 10.5 / 13

Figure S1. The 1H-NMR spectra of compound 64.

Figure S2. The 1H-NMR spectra of compound 64 (Expanded).

Figure S3. The 1H-NMR spectra of compound 67.

Figure S4. The 1H-NMR spectra of compound 67 (Expanded).