Supporting Information for manuscript:

Toward the computer-aided discovery of FabH inhibitors. Do predictive QSAR models ensure high quality virtual screening performance?

Yunierkis Pérez-Castillo* 1,2,3, Maykel Cruz-Monteagudo 2,4,5,6, Cosmin Lazar 1, Jonatan Taminau 1,7, Mathy Froeyen 3, Miguel Ángel Cabrera-Pérez 2,8,9, Ann Nowé* 1

1 Computational Modeling Lab (CoMo), Department of Computer Sciences, Faculty of Sciences, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium

2 Molecular Simulations and Drug Design Group, Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba

3 Laboratory for Medicinal Chemistry, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Minderbroedersstraat 10, B-3000 Leuven, Belgium.

4 CIQ, Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal.

5 REQUIMTE, Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal.

6 Centro de Estudios de Química Aplicada (CEQA), Faculty of Chemistry and Pharmacy, Central University of Las Villas, Santa Clara, 54830, Cuba.

7 Cartagenia N.V., Technologielaan 3, 3001 Leuven, Belgium

8 Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d'Alacant, Alicante, Spain

9 Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Burjassot 46100, Valencia, Spain

Table of Contents

Sections

S1. Detailed description of the QSAR modeling framework

S2. Calculation of the models' pairwise distances

S3. Metrics for evaluating the Virtual Screening performance of the models

S4. Dataset preparation

S5. Diversity of the optimal models in both the input and output spaces

S6. Analysis of the quality of the decoy sets obtained using our decoy selection approach

S7. Performance comparison of different ensembles for Virtual Screening

Tables

Table TS1

Table TS2

Table TS3

Table TS4

Table TS5

Table TS6

Table TS7

Table TS8

Table TS9

Figures

Figure FS1

Figure FS2

Figure FS3

Figure FS4

Figure FS5

Figure FS6

References

Sections

S1. Detailed description of the QSAR modeling framework

Feature selection and classifiers training block

Three different feature selection strategies (Genetic Algorithms, Bagged Trees and Features Ranking) were combined with three types of classifiers (Linear Discriminant Analysis, Least-Squares Support Vector Machine and Adaboost Ensemble) to obtain nine different model types. When analyzing the performance of the trained classifiers, the following definitions are used:

Classifiers

  • Linear Discriminant Analysis.

LDA is a simple and mathematically robust method that is widely used to solve classification problems.[1] To establish the LDA classifier, a linear discriminant function is calculated for each class based on a subset of features. The class whose function yields the highest score is the predicted class. The MATLAB implementation of LDA was used.
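
As a simple illustration of this scheme (not the MATLAB implementation used in this work), a minimal Python sketch with scikit-learn could look as follows; the data shown are purely hypothetical:

  # Minimal LDA sketch: one discriminant function per class, highest score wins.
  import numpy as np
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

  # Hypothetical training data: rows are compounds, columns are selected descriptors.
  X_train = np.random.rand(50, 5)
  y_train = np.random.randint(0, 2, size=50)   # 1 = active, 0 = inactive

  lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
  # The predicted class is the one whose discriminant function scores highest.
  print(lda.predict(np.random.rand(3, 5)))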

  • Least Squares Support Vector Machines.

Least Squares Support Vector Machines (LS-SVM) are a reformulation of the standard Support Vector Machines (SVM).[2] In brief, SVM maps the input data into a high-dimensional feature space through a nonlinear mapping φ(x), and a linear separating hyperplane is then determined in that feature space. The LS-SVM models were developed with the LS-SVMlab Toolbox for MATLAB.[2] Each feature was scaled to the interval [0, 1] and the Radial Basis Function (RBF) kernel, defined by the transformation K(xi, xj) = exp(−‖xi − xj‖²/σ²), was selected. The regularization parameter γ and the squared RBF kernel parameter σ² of the LS-SVM models were optimized using a combination of Coupled Simulated Annealing (CSA) and grid search, as implemented in the LS-SVMlab Toolbox for MATLAB. This parameter optimization stage was guided by the misclassification rate on the 10-fold cross-validated training data. The CSA algorithm was first used to find good starting values of the kernel parameters, and the grid search strategy was then applied for their fine-tuning.
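
For illustration only, a rough Python analogue of this tuning procedure is sketched below. It uses scikit-learn's standard SVC with an RBF kernel and a plain grid search as stand-ins for the LS-SVMlab models and the CSA plus grid search strategy, and assumes that SVC's gamma plays the role of roughly 1/σ²; the data and parameter grids are hypothetical:

  # Illustrative stand-in: standard SVM + grid search instead of LS-SVM + CSA.
  import numpy as np
  from sklearn.svm import SVC
  from sklearn.preprocessing import MinMaxScaler
  from sklearn.pipeline import Pipeline
  from sklearn.model_selection import GridSearchCV

  X = np.random.rand(60, 10)          # hypothetical descriptor matrix
  y = np.random.randint(0, 2, 60)     # hypothetical activity classes

  pipe = Pipeline([
      ("scale", MinMaxScaler()),      # scale each feature to [0, 1]
      ("svm", SVC(kernel="rbf")),
  ])

  # Regularization and RBF width grid; SVC's gamma corresponds to about 1/sigma^2.
  param_grid = {"svm__C": [0.1, 1, 10, 100],
                "svm__gamma": [1e-3, 1e-2, 1e-1, 1]}

  # 10-fold cross-validated accuracy guides the search, mirroring the
  # misclassification-rate criterion described above.
  search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
  search.fit(X, y)
  print(search.best_params_)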

In the case of the Genetic Algorithms-LS-SVM classifiers, the parameters γ and σ² of the LS-SVM models were optimized using the whole training data, following the same procedure described above. These parameters were kept fixed during the Genetic Algorithm search. For the sets of LS-SVM models trained with the features selected using either Bagged Trees or Features Ranking, the LS-SVM parameters were optimized for every tested feature subset. The detailed setup for each feature selection technique is given later in this section.

  • Adaboost Ensembles.

The other classification algorithm that we employed was the ensemble-based methodology Adaboost. Regardless of the feature selection algorithm being used, the Adaboost ensembles were trained with single-feature LDA models. To this end, one LDA model per input raw feature was trained using the training dataset, and the outputs of these LDA models were also saved for the selection and external datasets. Afterward, the random single-feature LDA models, i.e. those with either sensitivity or specificity on the training set lower than 0.5, were removed. In every case, the Adaboost ensembles were trained with these non-random single-feature LDA models.
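
The following Python snippet sketches how such a filter could be implemented (an illustrative re-implementation with scikit-learn, not the authors' MATLAB code); the helper name and the data are hypothetical, while the 0.5 cut-off follows the description above:

  # Sketch: train one LDA model per raw feature and keep only the non-random ones,
  # i.e. those with training sensitivity and specificity of at least 0.5.
  import numpy as np
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
  from sklearn.metrics import confusion_matrix

  def nonrandom_single_feature_ldas(X, y):
      kept = []
      for j in range(X.shape[1]):
          lda = LinearDiscriminantAnalysis().fit(X[:, [j]], y)
          tn, fp, fn, tp = confusion_matrix(y, lda.predict(X[:, [j]])).ravel()
          sensitivity = tp / (tp + fn)
          specificity = tn / (tn + fp)
          if sensitivity >= 0.5 and specificity >= 0.5:
              kept.append((j, lda))
      return kept        # list of (feature index, fitted single-feature LDA) pairs

  X = np.random.rand(80, 20)              # hypothetical descriptors
  y = np.random.randint(0, 2, 80)         # hypothetical classes
  weak_learners = nonrandom_single_feature_ldas(X, y)
  print(len(weak_learners), "non-random single-feature LDA models kept")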

At each iteration, the Adaboost algorithm minimizes the weighted error on the training set and returns an ensemble of single-feature LDA models. The weighted error of this ensemble is computed and used to update the weight of each training sample. The effect of these weight changes is to put more weight on the samples misclassified during the previous iteration and to reduce the weight of the samples that were correctly classified. The aim of this weight-update scheme is to select, in the next round, a weak classifier that correctly classifies the previously misclassified samples; in consequence, at each round Adaboost tries to solve a more difficult classification problem. After the weight of each classifier is determined, the final classification of each sample is calculated from the linear combination of these weights and the corresponding classifier outputs. A pre-computed classification threshold is used to assign group membership. It is worth noting that, for virtual screening experiments, these scores can be used to rank positively predicted compounds for synthesis and assay prioritization. The pseudocode of the Adaboost algorithm is shown in Chart S1.1.

Chart S1.1. Adaboost pseudocode
  1. //
  2. // Require: N x T matrix of the predictions of N samples using T classifiers.
  3. //
  4. Initialize the weight wi of each training sample
  5. For t=1:T
  6. Normalize the weights so that they sum to 1
  7. For each classifier j
  8. Evaluate the weighted error of each classifier, εj = Σi wi·|hj(xi) − yi|, where hj(xi) is the classification result of sample xi using the classifier hj and yi is the observed classification of sample xi.
  9. End
  10. Choose the classifier ht with the lowest error εt
  11. Update the weights, wi ← wi·βt^(1−ei), where ei=0 if sample xi is correctly classified, ei=1 otherwise, and βt = εt/(1 − εt)
  12. End
  13. Return the final Adaboost Ensemble: H(x) = Σt αt·ht(x), with αt = log(1/βt); samples with H(x) above the classification threshold are predicted as positive
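
For readers who prefer running code, a compact, generic discrete Adaboost over a precomputed N x T prediction matrix is sketched below in Python. It is an illustrative re-implementation of the scheme in Chart S1.1, not the MATLAB code used in this work; the function name and inputs are hypothetical, and the half-sum-of-weights threshold at the end is only one possible choice for the pre-computed classification threshold mentioned above:

  # Minimal discrete Adaboost over precomputed 0/1 weak-classifier predictions.
  import numpy as np

  def adaboost(H, y, n_rounds):
      """H: N x T matrix of 0/1 weak predictions, y: length-N vector of 0/1 labels."""
      N, T = H.shape
      w = np.ones(N) / N                       # initialize sample weights
      miss = (H != y[:, None]).astype(float)   # 1 where classifier j misclassifies sample i
      chosen, alphas = [], []
      for _ in range(n_rounds):
          w = w / w.sum()                      # normalize weights
          errors = miss.T @ w                  # weighted error of every classifier
          t = int(np.argmin(errors))           # pick the classifier with the lowest error
          eps = errors[t]
          if eps <= 0 or eps >= 0.5:           # stop if perfect or no better than chance
              break
          beta = eps / (1.0 - eps)
          w = w * beta ** (1.0 - miss[:, t])   # down-weight correctly classified samples
          chosen.append(t)
          alphas.append(np.log(1.0 / beta))
      # Final ensemble score: weighted vote of the chosen weak classifiers.
      scores = H[:, chosen] @ np.array(alphas)
      threshold = 0.5 * np.sum(alphas)         # one simple choice of classification threshold
      return chosen, alphas, (scores >= threshold).astype(int)

  H = np.random.randint(0, 2, size=(100, 30))  # hypothetical weak predictions
  y = np.random.randint(0, 2, size=100)        # hypothetical observed classes
  chosen, alphas, y_pred = adaboost(H, y, n_rounds=10)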

Feature selection strategies

  • Genetic Algorithms.

The pseudocode for the GA wrapper is presented in Chart S1.2. The GA was run 25 times (IGA=25), each time with a different initial population of 100 individuals and for 1000 generations (Ngen=1000). Since the GA is implemented as a feature selection approach, a bit-string encoding of the chromosomes was used. A 1 at the i-th position means that the i-th feature or LDA model will be considered by the fitness function, while a 0 means that this feature will be excluded from the fitness computation. The initial population is randomly created in such a way that each individual contains at most ten 1-coded genes, and the length of each individual is set equal to the number of input raw features or LDA models, depending on the classifier used inside the GA wrapper.

Chart S1.2. Genetic Algorithm Setup
  1. //
  2. // Require: Dataset and vector of observed classifications
  3. //
  4. For i=1:IGA
  5. Create initial population.
  6. Score initial population (see Chart S1.3 for the fitness function).
  7. For j=1:Ngen
  8. Select elite individuals.
  9. Tournament selection of parents for crossover.
  10. Tournament selection of individuals to mutate.
  11. Perform crossover.
  12. Perform mutation.
  13. Create new population = elite + crossover offspring + mutated offspring.
  14. Score new population (see Chart S1.3 for the fitness function).
  15. End For
  16. Return final population Pi for the i-th GA run
  17. End For

The selection function is set to a tournament of size 2, while the crossover operator randomly combines each position of two parents to produce an offspring. In brief, the tournament selection consists of randomly selecting n individuals from the population, comparing their fitness values and selecting the one with the lowest fitness (we minimize our objective function) as the winner. The mutation process randomly selects a 1-coded position on the parent and sets it to 0 and, in the same way, sets to 1 one of the excluded features or LDA models. The population size is set to 100 individuals, the crossover and mutation rates are set to 0.7 and 0.3, respectively, and two elite offspring automatically survive and pass to the next generation. The overfitting problem was controlled in two ways inside the fitness function of our modified GA: by using a 10-fold cross-validation training scheme and by minimizing the Akaike Information Criterion (AIC),[3] which aims to keep a balance between the fitting of the model and its complexity. For the GA(M)E-QSAR algorithm, the Adaboost Ensemble is trained using the non-random, weak single-feature classifiers encoded by each individual. On the other hand, the raw features encoded by each individual are used to train either the LS-SVM or LDA classifiers. The pseudocode for this fitness function is shown in Chart S1.3.
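
A minimal Python sketch of the bit-string initialization, tournament selection and mutation operators just described is given below; all helper names, sizes and the placeholder fitness values are illustrative assumptions rather than the GA(M)E-QSAR implementation:

  # Illustrative sketch of the GA encoding and operators described above.
  import numpy as np
  rng = np.random.default_rng(0)

  def tournament_select(population, fitness, size=2):
      """Pick `size` random individuals and return the one with the lowest fitness."""
      idx = rng.choice(len(population), size=size, replace=False)
      return population[idx[np.argmin(fitness[idx])]]

  def mutate(individual):
      """Set one randomly chosen 1-coded gene to 0 and one 0-coded gene to 1."""
      child = individual.copy()
      ones, zeros = np.flatnonzero(child == 1), np.flatnonzero(child == 0)
      if ones.size and zeros.size:
          child[rng.choice(ones)] = 0
          child[rng.choice(zeros)] = 1
      return child

  # Hypothetical population of 100 bit-string chromosomes over 200 features,
  # each starting with at most ten 1-coded genes.
  n_ind, n_feat = 100, 200
  population = np.zeros((n_ind, n_feat), dtype=int)
  for ind in population:
      ind[rng.choice(n_feat, size=rng.integers(1, 11), replace=False)] = 1
  fitness = rng.random(n_ind)              # placeholder for the AIC-based fitness
  offspring = mutate(tournament_select(population, fitness))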

Chart S1.3. Fitness function for modified GA
  1. //
  2. // 10-fold cross-validated fitness function
  3. //
  4. For i=1:10
  5. Train a classifier using the training samples not left out (all samples not in the i-th subset)
  6. Predict the left-out training cases (samples in the i-th subset)
  7. End For
  8. Calculate the cross-validated training error based on the predictions of the left-out samples
  9. Return AIC = log(σ²) + #used_features_or_LDA_models / #training_samples; being σ² the cross-validated training error computed in the previous step

The 10-fold cross-validation strategy is used with the objective of preventing overfitting (one of the main sources of low-performing QSAR models) and of slowing down the algorithm's convergence towards a local minimum. The AIC is used to measure the fitness of each individual in order to obtain models with a good tradeoff between the fitting and the number of features or LDA models. In this sense, among two individuals with the same cross-validated error value, the one encoding fewer features or LDA models will be better scored and hence will have better opportunities to reproduce and survive during the evolution.
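
The snippet below sketches this fitness function in Python, assuming that σ² in the AIC expression is the 10-fold cross-validated misclassification rate and using LDA as a placeholder classifier; the function name, data and selected indices are hypothetical:

  # Sketch of a 10-fold cross-validated, AIC-like fitness for one GA individual.
  import numpy as np
  from sklearn.model_selection import StratifiedKFold
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

  def cv_aic_fitness(X, y, selected, n_splits=10):
      """X: descriptors, y: 0/1 classes, selected: feature indices encoded by the individual."""
      Xs = X[:, selected]
      errors = np.zeros(len(y))
      for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(Xs, y):
          clf = LinearDiscriminantAnalysis().fit(Xs[train_idx], y[train_idx])
          errors[test_idx] = (clf.predict(Xs[test_idx]) != y[test_idx])
      sigma2 = errors.mean()                 # cross-validated misclassification rate
      # Complexity penalty: fewer selected features give a lower (better) fitness.
      return np.log(sigma2 + 1e-12) + len(selected) / len(y)

  X = np.random.rand(100, 50)                # hypothetical descriptors
  y = np.random.randint(0, 2, 100)           # hypothetical classes
  print(cv_aic_fitness(X, y, selected=[0, 3, 7]))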

  • Bagged Trees.

The MATLAB[4] implementation of the Bagged Trees was used, and the pseudocode for the Bagged Trees feature selection and modeling strategies is shown in Chart S1.4. For the Bagged Trees-based feature selection,[5] an ensemble of 100 classification trees (IBT=100) was trained, where each tree was grown using an independent bootstrap sample of the training dataset. The prediction error of the ensemble was determined by computing the predictions of each tree for its Out-Of-The-Bag observations, using the majority vote rule to assign the prediction of each observation and then comparing the predicted response with the observed class. To build the decision trees, we used a minimum of one observation per leaf, and a random subset of features equal in size to the square root of the total number of variables was considered for each split. The node split criterion was set to Gini's diversity index,[6] which measures the impurity of a node based on the fraction of samples of each class reaching that node.

Chart S1.4. Bagged Trees Setup
  1. //
  2. // Require: a dataset, a vector of observed classifications and a classifier C
  3. //
  4. // Bagged Trees-based feature selection
  5. //
  6. For i=1:IBT
  7. Create resampled training set Ti and Out-Of-The-Bag set OOBi from the training set.
  8. Train a classification tree using Ti.
  9. Predict the OOBi observations.
  10. End For
  11. Compute the ensemble error based on the OOB predictions.
  12. Consensus selection of the optimal ensemble size.
  13. Compute each feature's importance and select the top m' ones.
  14. Return the top m' features.
  15. //
  16. // Classifiers training
  17. //
  18. For i=1:m'
  19. Train classifier C using features 1 to i
  20. End For
  21. Return the set S of up to m' models of type C

To determine the optimal size of the ensemble of decision trees, a consensus ranking procedure analogous to that described in the next subsections for the selection of the optimal models derived from each classification strategy was followed. The sole difference is that, for the bagged trees, the decision makers considered were the accuracy in the prediction of the training set, the accuracy in the prediction of the selection set and the prediction error for the Out-Of-The-Bag samples. It should be noted that the i-th ensemble is the one whose members are the trees from 1 to i. Using this optimal size of the classification trees ensemble, the importance of each feature in the optimal ensemble was calculated as the increase in the classification error when the values of that variable are permuted across all Out-Of-The-Bag samples. The 25 most important features according to this metric were selected for further classifier training.
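
As an illustration of this feature selection step, the sketch below uses scikit-learn's random forest with out-of-bag scoring and its generic permutation importance computed on a held-out set; this is a stand-in for the MATLAB Bagged Trees with Out-Of-The-Bag permutation importance used in this work, and all data are hypothetical:

  # Illustrative stand-in: random forest + permutation importance on a held-out set.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.inspection import permutation_importance
  from sklearn.model_selection import train_test_split

  X = np.random.rand(200, 60)                 # hypothetical descriptors
  y = np.random.randint(0, 2, 200)            # hypothetical classes
  X_tr, X_sel, y_tr, y_sel = train_test_split(X, y, test_size=0.25, random_state=0)

  forest = RandomForestClassifier(
      n_estimators=100,        # 100 bootstrap-grown classification trees
      max_features="sqrt",     # random sqrt(m) features considered at each split
      min_samples_leaf=1,
      criterion="gini",        # Gini diversity index as the split criterion
      oob_score=True,          # out-of-the-bag estimate of the ensemble error
      random_state=0,
  ).fit(X_tr, y_tr)
  print("OOB accuracy:", forest.oob_score_)

  # Rank features by the drop in accuracy when each one is permuted; keep the top 25.
  imp = permutation_importance(forest, X_sel, y_sel, n_repeats=10, random_state=0)
  top25 = np.argsort(imp.importances_mean)[::-1][:25]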

These top 25 features were scaled to the interval [0, 1], and then Adaboost, LS-SVM and LDA classifiers were iteratively trained, changing the number of features considered to build the classifiers from 1 to 25. This process led to 25 models (fewer for the Adaboost Ensembles, since random single-feature LDA models are removed) trained with different numbers of features for each of the three classification approaches considered (Adaboost Ensemble, LS-SVM and LDA). For the LS-SVM modeling cycles, the Radial Basis Function (RBF) kernel parameters were optimized as described above, considering only the features to be used to train the model during this parameter optimization step.
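
A minimal sketch of this incremental training loop is shown below, with LDA standing in for the three classifier types and with hypothetical data and feature ranking:

  # Sketch: scale the top-ranked features to [0, 1] and train one model per subset size.
  import numpy as np
  from sklearn.preprocessing import MinMaxScaler
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

  X = np.random.rand(200, 60)                  # hypothetical descriptors
  y = np.random.randint(0, 2, 200)             # hypothetical classes
  top25 = np.arange(25)                        # placeholder for the ranked feature indices

  X_scaled = MinMaxScaler().fit_transform(X[:, top25])
  models = [LinearDiscriminantAnalysis().fit(X_scaled[:, :i], y) for i in range(1, 26)]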

  • Features Ranking.

For the Feature Ranking-based feature selection strategy, the top 25 variables according to the ranking derived from the non-parametric Wilcoxon rank-sum test were selected.[7,8] Feature Ranking was performed using the MATLAB[4] built-in functions. These 25 features were used to conduct the same classifier training process carried out with the feature subsets derived from the Bagged Trees. This process led to three classifiers (Adaboost Ensemble, LS-SVM and LDA) trained with features derived from the feature ranking process. The pseudocode for the Feature Ranking feature selection and modeling strategies is shown in Chart S1.5.

Chart S1.5. Feature Ranking Setup
  1. //
  2. // Require: a dataset, a vector of observed classifications and a classifier C
  3. //
  4. // Feature Ranking-based feature selection
  5. //
  6. Select the top m' features according to the Wilcoxon rank-sum test
  7. Return the top m' features.
  8. //
  9. // Classifiers training
  10. //
  11. For i=1:m'
  12. Train classifier C using features 1 to i
  13. End For
  14. Return the set S of up to m' models of type C
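
A short Python sketch of the ranking step in Chart S1.5 is given below, using SciPy's Wilcoxon rank-sum test as a stand-in for the MATLAB built-in functions; the data are hypothetical:

  # Sketch: rank features by the Wilcoxon rank-sum p-value between the two classes.
  import numpy as np
  from scipy.stats import ranksums

  X = np.random.rand(150, 80)                  # hypothetical descriptor matrix
  y = np.random.randint(0, 2, 150)             # hypothetical activity classes

  pvalues = np.array([ranksums(X[y == 1, j], X[y == 0, j]).pvalue
                      for j in range(X.shape[1])])
  top25 = np.argsort(pvalues)[:25]             # 25 features with the smallest p-values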

Evaluate models generalization block

Once each set of classifiers is generated, the selection dataset is predicted by each model in order to evaluate its generalization capabilities. For the models derived from features selected using Bagged Trees and Features Ranking, there is a maximum of 25 unique models to be evaluated. However, in the case of the GA wrappers, the maximum number of candidate solutions found during the 25 evolutions is 2500. To save computational time and avoid unnecessary calculations, an additional filtering step was performed for the GA-derived models.

This filtering process starts by pooling the individuals (each one encoding a classifier) from all GA runs. Afterward, the duplicated individuals are removed and the accuracy of each of the remaining unique individuals in predicting both the training and the selection subsets is evaluated. Next, the individuals that encode models with accuracy lower than the mean accuracy value on both the training and selection sets are removed. A model with low accuracy on the training set can be the consequence of a low-performance individual produced by a mutation during the last GA generation.

On the other hand, the prediction of the selection data subset gives an estimation of the generalization capabilities of the models and allows discarding those that do not generalize to new data. This filtering step, based on the accuracy in predicting both the training and selection sets, allows choosing for further cross-validation only models that combine good performance on the training data with good performance on the selection subset, which was not used during the learning process. The result of this filtering step is a number of candidate solutions from which the optimal model will be selected based on a consensus ranking approach that is described in the next section.
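
The sketch below illustrates one reading of this filtering criterion, keeping only the models whose accuracy is at least the mean accuracy on both the training and the selection sets; the accuracy vectors are hypothetical:

  # Sketch: filter the pooled, de-duplicated GA individuals by mean accuracy.
  import numpy as np

  acc_train = np.random.rand(500)        # hypothetical training-set accuracies
  acc_selection = np.random.rand(500)    # hypothetical selection-set accuracies

  keep = (acc_train >= acc_train.mean()) & (acc_selection >= acc_selection.mean())
  candidate_idx = np.flatnonzero(keep)
  print(len(candidate_idx), "candidate models passed to cross-validation")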

Cross-validation, selection of the optimal classifier and evaluate optimal model generalization blocks

Once the candidate solutions are obtained and their performance on both the training and selection sets is evaluated, they are subjected to Leave-One-Out (LOO)[9] and 1000 cycles of Bootstrap[10] cross-validation in order to evaluate their robustness. Following the concept that the ideal model should combine fitting, predictability and robustness, the accuracy in predicting the training subset was considered an estimator of the fitting of the model; the accuracy on the selection set, a descriptor of its generalization capabilities; and the LOO and Bootstrap cross-validation accuracies, estimators of its robustness. To retrieve from the whole pool one model that combines good fitting, generalization capabilities and robustness, the Borda-Kendall consensus ranking approach was used.[11]
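
The following Python sketch illustrates a simple Borda-count consensus ranking over the four criteria mentioned above, using the common sum-of-ranks formulation; it is an illustrative example with hypothetical scores, not the exact Borda-Kendall implementation used in this work:

  # Sketch: Borda-count consensus ranking of candidate models over four criteria.
  import numpy as np
  from scipy.stats import rankdata

  # Hypothetical accuracies (rows = models; columns = training, selection, LOO, bootstrap).
  scores = np.random.rand(40, 4)

  # Rank the models under each criterion (rank 1 = best) and sum the ranks;
  # the model with the lowest total rank is taken as the consensus-optimal one.
  ranks = np.column_stack([rankdata(-scores[:, k]) for k in range(scores.shape[1])])
  borda_score = ranks.sum(axis=1)
  best_model = int(np.argmin(borda_score))
  print("Consensus-optimal model index:", best_model)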