Effect of descriptor selection before splitting and method of splitting (rational and random) on external predictive ability and on behaviour of different statistical parameters of QSAR model

Vijay H. Masand*1, Devidas T. Mahajan1, Gulam M. Nazeruddin2, Taibi Ben Hadda3, Vesna Rastija4, Ahmed M. Alfeefy5

1 Department of Chemistry, Vidya Bharati College, Camp, Amravati, Maharashtra, India.

2 Department of Chemistry, Poona College, Pune, Maharashtra, India

3 Laboratoire Chimie des Matériaux, Université Mohammed Premier, Oujda-60000, Morocco.

4 Department of Chemistry, Faculty of Agriculture, Josip Juraj Strossmayer University of Osijek, P. Svacica 1d, Osijek, Croatia.

5 Department of Pharmaceutical Chemistry, College of Pharmacy, Salman Bin Abdulaziz University, P.O. Box 173, Alkharj 11942, Saudi Arabia

* Corresponding author. E-mail: , Tel: 0091-9403312628

Table S1. Different models for dataset 3

Original model / Residual based model / Random splitting / Sphere exclusion
Variable / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95%
Intercept / 3.9144 / 1.3590 / 3.8064 / 0.8128 / 5.5225 / 1.8727 / 5.0915 / 0.4595
GATS1p / -1.1338 / 0.6270 / -0.9150 / 0.3742 / -2.1180 / 0.9037 / -0.3073 / 0.3180
E3u / 1.2790 / 0.8114 / 0.8515 / 0.5179 / 1.1902 / 0.9464 / 0.4314 / 0.3482
E1m / -1.5604 / 0.7155 / -1.0666 / 0.4411 / -2.2686 / 1.1388 / -0.7766 / 0.4010
H6u / 0.4491 / 0.2597 / 0.2670 / 0.1655 / 0.3857 / 0.3198 / 0.5666 / 0.3989
R2e / 1.1763 / 0.4837 / 1.0814 / 0.2999 / 1.0823 / 0.6803 / 1.0258 / 0.4415

Table S2. Different models for dataset 2

Original model / Residual based model / Random splitting / Sphere exclusion
Variable / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95%
Intercept / 7.0072 / 0.2485 / 7.0039 / 0.2147 / 7.1710 / 0.3079 / 7.1051 / 0.3083
Mor13e / 0.2412 / 0.0669 / 0.2205 / 0.0530 / 0.2518 / 0.0798 / 0.2056 / 0.0847
RDF040v / 0.0781 / 0.0295 / 0.0633 / 0.0252 / 0.0555 / 0.0373 / 0.0653 / 0.0368
F06[N-O] / -0.9127 / 0.1067 / -1.0253 / 0.0880 / -0.7891 / 0.1440 / -0.9575 / 0.1386
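Each column pair in the table above defines one multiple linear regression model: the predicted activity is the intercept plus the sum of each coefficient times its descriptor value. The sketch below is only an illustration of that reading, using the Table S2 coefficients. The dictionary layout, the function name `predict` and any descriptor values passed to it are assumptions made for the example; this section does not list descriptor values or name the response variable.

```python
# Coefficients copied from Table S2 (dataset 2), one entry per splitting approach.
# The dictionary keys and the helper below are illustrative, not part of the paper.
MODELS = {
    "original":         {"Intercept": 7.0072, "Mor13e": 0.2412, "RDF040v": 0.0781, "F06[N-O]": -0.9127},
    "residual_based":   {"Intercept": 7.0039, "Mor13e": 0.2205, "RDF040v": 0.0633, "F06[N-O]": -1.0253},
    "random_splitting": {"Intercept": 7.1710, "Mor13e": 0.2518, "RDF040v": 0.0555, "F06[N-O]": -0.7891},
    "sphere_exclusion": {"Intercept": 7.1051, "Mor13e": 0.2056, "RDF040v": 0.0653, "F06[N-O]": -0.9575},
}


def predict(model, descriptors):
    """Evaluate intercept + sum(coefficient * descriptor value) for one compound."""
    return model["Intercept"] + sum(
        coef * descriptors[name]
        for name, coef in model.items()
        if name != "Intercept"
    )
```

Calling `predict` with the same descriptor dictionary for all four entries of `MODELS` shows how far the predictions drift apart when only the splitting method changes.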

Table S3. Different models for dataset 1

Original model / Residual based model / Random splitting / Sphere exclusion
Variable / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95% / Coeff. / Co. int. 95%
Intercept / 1.956 / 0.905 / 2.2446 / 0.9751 / 2.0292 / 1.3980 / 1.6496 / 2.0068
F07[C-N] / -0.322 / 0.125 / -0.3518 / 0.1775 / -0.3609 / 0.3275 / -0.2953 / 0.4938
F05[C-C] / 0.221 / 0.082 / 0.1921 / 0.0608 / 0.1563 / 0.0981 / 0.2595 / 0.1396
Mor29e / 1.277 / 0.582 / 0.7075 / 0.4427 / 1.3869 / 0.6153 / 1.1471 / 1.1180
Mor03m / -0.167 / 0.101 / -0.0986 / 0.0927 / -0.2936 / 0.1456 / -0.1290 / 0.1826
RDF095v / -0.153 / 0.098 / -0.0943 / 0.0829 / -0.0981 / 0.1366 / -0.1785 / 0.1930

Statistical symbols with names and explanations:

R2 – coefficient of determination (squared correlation coefficient); Q2 – leave-one-out cross-validated R2; R2adj – adjusted R2; SEE – standard error of estimate; RMSE – root mean squared error; MAE – mean absolute error; CCC – concordance correlation coefficient; the suffixes tr and ex denote values for the training and test sets, respectively; MCDM all – Multi-Criteria Decision Making score calculated over fitting, cross-validation and external validation; R2LMO and Q2LMO – leave-many-out correlation and cross-validation coefficients; R2Yrand and Q2Yrand – Y-scrambling correlation and cross-validation coefficients.
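Several of the parameters listed above can be computed directly from observed and predicted values. The following is a minimal sketch using their textbook definitions with NumPy, assuming an ordinary least-squares model with an intercept. The function names are hypothetical, and the software used by the authors may apply slightly different conventions (for example in degrees of freedom).

```python
import numpy as np


def r2(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


def r2_adj(r2_value, n, p):
    """Adjusted R2 for n compounds and p descriptors (intercept not counted)."""
    return 1.0 - (1.0 - r2_value) * (n - 1) / (n - p - 1)


def rmse(y, y_hat):
    """Root mean squared error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))


def ccc(y, y_hat):
    """Lin's concordance correlation coefficient (population variances)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    cov = np.mean((y - y.mean()) * (y_hat - y_hat.mean()))
    return 2.0 * cov / (y.var() + y_hat.var() + (y.mean() - y_hat.mean()) ** 2)


def q2_loo(X, y):
    """Leave-one-out Q2 = 1 - PRESS / TSS, refitting OLS without each compound."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    A = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ beta) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)
```

Applying `rmse`, `mae` and `ccc` to the training-set predictions gives the tr values; applying them to the external test-set predictions gives the ex values.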

Figure S1. Distribution of training and test sets for different models for dataset 1

For residual based model

For random splitting model

For sphere exclusion model

Figure S2. Distribution of training and test sets for different models for dataset 2

For residual based model

For random splitting model

For sphere exclusion model

Figure S3. Distribution of training and test sets for different models for dataset 3

For residual based model

For random splitting model

For sphere exclusion model