Effect of descriptor selection before splitting and method of splitting (rational and random) on external predictive ability and on behaviour of different statistical parameters of QSAR model

Vijay H. Masand*1, Devidas T. Mahajan1, Gulam M. Nazeruddin2, Taibi Ben Hadda3, Vesna Rastija4, Ahmed M. Alfeefy5

1 Department of Chemistry, Vidya Bharati College, Camp, Amravati, Maharashtra, India.

2 Department of Chemistry, Poona College, Pune, Maharashtra, India

3 Laboratoire Chimie des Matériaux, Université Mohammed Premier, Oujda-60000, Morocco.

4 Department of Chemistry, Faculty of Agriculture, Josip Juraj Strossmayer University of Osijek, P. Svacica 1d, Osijek, Croatia.

5 Department of Pharmaceutical Chemistry, College of Pharmacy, Salman Bin Abdulaziz University, P.O. Box 173, Alkharj 11942, Saudi Arabia

* To receive all correspondence, E-mail: , Tel: 0091-9403312628

Table S1. Different models for dataset 3

Original model / Residual based model / Random splitting / Sphere exclusion
Variable / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95%
Intercept / 3.9144 / 1.3590 / 3.8064 / 0.8128 / 5.5225 / 1.8727 / 5.0915 / 0.4595
GATS1p / -1.1338 / 0.6270 / -0.9150 / 0.3742 / -2.1180 / 0.9037 / -0.3073 / 0.3180
E3u / 1.2790 / 0.8114 / 0.8515 / 0.5179 / 1.1902 / 0.9464 / 0.4314 / 0.3482
E1m / -1.5604 / 0.7155 / -1.0666 / 0.4411 / -2.2686 / 1.1388 / -0.7766 / 0.4010
H6u / 0.4491 / 0.2597 / 0.2670 / 0.1655 / 0.3857 / 0.3198 / 0.5666 / 0.3989
R2e / 1.1763 / 0.4837 / 1.0814 / 0.2999 / 1.0823 / 0.6803 / 1.0258 / 0.4415

Table S2. Different models for dataset 2

Original model / Residual based model / Random splitting / Sphere exclusion
Variable / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95%
Intercept / 7.0072 / 0.2485 / 7.0039 / 0.2147 / 7.1710 / 0.3079 / 7.1051 / 0.3083
Mor13e / 0.2412 / 0.0669 / 0.2205 / 0.0530 / 0.2518 / 0.0798 / 0.2056 / 0.0847
RDF040v / 0.0781 / 0.0295 / 0.0633 / 0.0252 / 0.0555 / 0.0373 / 0.0653 / 0.0368
F06[N-O] / -0.9127 / 0.1067 / -1.0253 / 0.0880 / -0.7891 / 0.1440 / -0.9575 / 0.1386

Table S3. Different models for dataset 1

Original model / Residual based model / Random splitting / Sphere exclusion
Variable / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95%
Intercept / 1.956 / 0.905 / 2.2446 / 0.9751 / 2.0292 / 1.3980 / 1.6496 / 2.0068
F07[C-N] / -0.322 / 0.125 / -0.3518 / 0.1775 / -0.3609 / 0.3275 / -0.2953 / 0.4938
F05[C-C] / 0.221 / 0.082 / 0.1921 / 0.0608 / 0.1563 / 0.0981 / 0.2595 / 0.1396
Mor29e / 1.277 / 0.582 / 0.7075 / 0.4427 / 1.3869 / 0.6153 / 1.1471 / 1.1180
Mor03m / -0.167 / 0.101 / -0.0986 / 0.0927 / -0.2936 / 0.1456 / -0.1290 / 0.1826
RDF095v / -0.153 / 0.098 / -0.0943 / 0.0829 / -0.0981 / 0.1366 / -0.1785 / 0.1930
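
One way to read the "Coeff." and "Co. int. 95%" columns is sketched below for the descriptors of Table S3. The sketch assumes that each "Co. int. 95%" entry is the half-width of a symmetric interval around the coefficient (coefficient ± value); the tables do not state this, so treat it as an assumption. Under that reading, a coefficient whose interval contains zero is not distinguishable from zero at the 5% level. The names TABLE_S3, MODELS, spans_zero and flagged are illustrative only.

```python
# Sketch only: assumes "Co. int. 95%" is a half-width (coefficient +/- value).
# Values are copied from Table S3 (dataset 1); the intercept row is left out.
MODELS = ["Original", "Residual based", "Random splitting", "Sphere exclusion"]

TABLE_S3 = {
    "F07[C-N]": [(-0.322, 0.125), (-0.3518, 0.1775), (-0.3609, 0.3275), (-0.2953, 0.4938)],
    "F05[C-C]": [(0.221, 0.082), (0.1921, 0.0608), (0.1563, 0.0981), (0.2595, 0.1396)],
    "Mor29e": [(1.277, 0.582), (0.7075, 0.4427), (1.3869, 0.6153), (1.1471, 1.1180)],
    "Mor03m": [(-0.167, 0.101), (-0.0986, 0.0927), (-0.2936, 0.1456), (-0.1290, 0.1826)],
    "RDF095v": [(-0.153, 0.098), (-0.0943, 0.0829), (-0.0981, 0.1366), (-0.1785, 0.1930)],
}


def spans_zero(coeff, half_width):
    """True if the interval [coeff - half_width, coeff + half_width] contains 0."""
    return coeff - half_width <= 0.0 <= coeff + half_width


def flagged(table):
    """List (descriptor, model) pairs whose assumed interval contains zero."""
    hits = []
    for descriptor, cells in table.items():
        for model, (coeff, half_width) in zip(MODELS, cells):
            if spans_zero(coeff, half_width):
                hits.append((descriptor, model))
    return hits


if __name__ == "__main__":
    for descriptor, model in flagged(TABLE_S3):
        print(f"{descriptor}: interval includes zero in the {model} model")
```

The same check can be applied to Tables S1 and S2 by transcribing their rows into the same dictionary layout.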

Statistical symbols with names and explanations:

R2 – coefficient of determination; Q2 – leave-one-out cross-validated R2; R2adj – adjusted R2; SEE – standard error of estimate; RMSE – root mean squared error; MAE – mean absolute error; CCC – concordance correlation coefficient. RMSE, MAE and CCC are reported for the training (tr) and external test (ex) sets. MCDM all – MultiCriteria Decision Making score calculated over fitting, cross-validation and external validation; R2LMO and Q2LMO – leave-many-out correlation and cross-validation coefficients; R2Yrand and Q2Yrand – Y-scrambling correlation and cross-validation coefficients.
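
Several of the statistics listed above can be computed directly from observed and predicted activities. The following is a minimal sketch using their common textbook definitions, not necessarily the exact formulas of the software used for the models here. For brevity, the leave-one-out Q2 is shown for a one-descriptor linear model only; the function names are illustrative.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def r2(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - y_bar) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def rmse(y_obs, y_pred):
    """Root mean squared error."""
    return sqrt(mean([(o - p) ** 2 for o, p in zip(y_obs, y_pred)]))


def mae(y_obs, y_pred):
    """Mean absolute error."""
    return mean([abs(o - p) for o, p in zip(y_obs, y_pred)])


def ccc(y_obs, y_pred):
    """Lin's concordance correlation coefficient."""
    m_o, m_p = mean(y_obs), mean(y_pred)
    cov = sum((o - m_o) * (p - m_p) for o, p in zip(y_obs, y_pred))
    ss_o = sum((o - m_o) ** 2 for o in y_obs)
    ss_p = sum((p - m_p) ** 2 for p in y_pred)
    return 2.0 * cov / (ss_o + ss_p + len(y_obs) * (m_o - m_p) ** 2)


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    m_x, m_y = mean(x), mean(y)
    b1 = sum((a - m_x) * (b - m_y) for a, b in zip(x, y)) / sum((a - m_x) ** 2 for a in x)
    return m_y - b1 * m_x, b1


def q2_loo(x, y):
    """Leave-one-out Q2 = 1 - PRESS / SS_tot for a one-descriptor model."""
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    y_bar = mean(y)
    return 1.0 - press / sum((v - y_bar) ** 2 for v in y)
```

Applying r2, rmse, mae and ccc to the training-set predictions gives the tr values, and applying them to the external-set predictions gives the ex values.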

Figure S1. Distribution of training and test sets for the different models for dataset 1: (a) residual-based model, (b) random splitting model, (c) sphere exclusion model.

Figure S2. Distribution of training and test sets for the different models for dataset 2: (a) residual-based model, (b) random splitting model, (c) sphere exclusion model.

Figure S3. Distribution of training and test sets for the different models for dataset 3: (a) residual-based model, (b) random splitting model, (c) sphere exclusion model.