QSAR modeling of endpoints for peptides
which is based on representation of the molecular structure
by a sequence of amino acids
Supplementary materials
Important:
+ indicates the sub-training set
- indicates the calibration set
# indicates the test set
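The marker convention above can be read mechanically. The following is a minimal sketch, assuming each data row has the form "marker + ID, sequence, endpoint value" separated by whitespace, as in Tables S1 to S3. The function name parse_rows and the dictionary SPLIT_NAMES are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict

# Marker-to-set mapping taken from the legend above.
SPLIT_NAMES = {"+": "sub-training", "-": "calibration", "#": "test"}


def parse_rows(lines):
    """Group (id, sequence, endpoint) records by the set marker prefixing each row."""
    sets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line[0] not in SPLIT_NAMES:
            continue  # skip captions, blank lines, and anything without a marker
        parts = line[1:].split()
        if len(parts) != 3:
            continue  # not a data row of the assumed "ID sequence value" form
        peptide_id, sequence, value = parts
        sets[SPLIT_NAMES[line[0]]].append((peptide_id, sequence, float(value)))
    return dict(sets)


# Three rows copied from Table S1, plus the caption, which is ignored.
rows = [
    "Table S1. Data set 1 (split A)",
    "+P01 WLEPGPVTA 6.082",
    "#P02 ITSQVPFSV 6.196",
    "-P10 IIDQVPFSV 7.398",
]
for name, records in parse_rows(rows).items():
    print(name, records)
```

Counting the records returned under "test" for a whole table gives the Ntest value for that split.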
Table S1. Data set 1 (split A)
+P01 WLEPGPVTA 6.082
#P02 ITSQVPFSV 6.196
+P03 FLEPGPVTA 6.898
#P04 ITAQVPFSV 7.020
#P05 YLEPGPVTL 7.058
#P06 YTDQVPFSV 7.066
+P07 YLEPGPVTI 7.187
+P08 YLEPGPVTV 7.342
#P09 YLSPGPVTA 7.383
-P10 IIDQVPFSV 7.398
+P11 ITWQVPFSV 7.463
+P12 ITYQVPFSV 7.480
+P13 ILSQVPFSV 7.699
-P14 IMDQVPFSV 7.719
-P15 YLMPGPVTV 7.932
+P16 WLDQVPFSV 7.939
#P17 YLAPGPVTA 8.032
-P18 YLYPGPVTV 8.051
+P19 YLWPGPVTV 8.125
+P20 ILYQVPFSV 8.310
-P21 ILDQVPFSV 8.481
+P22 YLFPGPVTA 8.495
+P23 YLDQVPFSV 8.638
-P24 ILFQVPFSV 8.699
+P25 ILWQVPFSV 8.770
-P26 WTDQVPFSV 6.145
+P27 YLEPGPVTA 6.668
-P28 ITDQVPFSV 6.947
-P29 ITFQVPFSV 7.179
-P30 FTDQVPFSV 7.212
-P31 ITMQVPFSV 7.398
#P32 YLSPGPVTV 7.642
+P33 YLYPGPVTA 7.772
-P34 YLAPGPVTV 7.818
#P35 ILAQVPFSV 7.939
+P36 ILMQVPFSV 8.125
-P37 YLFPGPVTV 8.237
#P38 YLMPGPVTA 8.367
+P39 YLWPGPVTA 8.495
-P40 FLDQVPFSV 8.658
Table S2. Data set 2 (Split A)
+CAMEL118 KWKLFLGILAVLKVL 0.159
+CAMEL17 KWNLNGNINAVLKVL 0.209
-CAMEL38 KWKGELEIEAELKVL 0.376
-CAMEL107 GWKLGLKILNVLKVL 0.496
+CAMEL20 KWKLFKKNNNNNKHN 0.498
-CAMEL116 KWHLFLLILAVLKVL 0.514
#CAMEL34 KRGLFKKGGAVLKGL 0.528
-CAMEL18 KWHLRNKIGAVRNNL 0.537
#CAMEL16 KHKLFKKIGAHRKRN 0.553
+CAMEL39 HWHLHKHRGARHKVL 0.677
+CAMEL134 GWELGEEILNVLKVL 0.708
-CAMEL115 KWHLFLKILAVLKVL 0.741
+CAMEL50 KWKLFKKHGNVRKVL 0.771
-CAMEL10 KNKRNKKIGAVLKVL 0.848
+CAMEL14 KHNLFKGIGAVLLVL 0.922
+CAMEL51 KWKLFKKIGNRNKVL 0.947
+CAMEL113 LWKLFLHILAVLKVL 0.962
+CAMEL26 KNKLEKKIGAVLKVL 1.027
+CAMEL52 KWKLGKGIGAVGKVL 1.033
-CAMEL58 KWKLFNRIGHNRKVN 1.049
+CAMEL137 GWRLFRGIRAVLNVL 1.074
#CAMEL54 KWGLFKNIGAVLHVN 1.156
-CAMEL57 RWKLNNNIGARLKVL 1.206
+CAMEL33 HWKLFKKIGHVNKRL 1.34
+CAMEL60 HWKRFLRIGHNLNVN 1.495
+CAMEL11 KWKLFKKIGGVGGVL 1.593
+CAMEL120 GWKLFLKILAVLKVL 1.598
#CAMEL13 GWKLFKNRGAVLKHL 1.605
-CAMEL8 KWKLFNKRGAVLKVL 1.605
+CAMEL19 RWKNFKNIRANLRVL 1.742
+CAMEL56 KWKLFGKNGRNLLVL 1.814
#CAMEL138 GWRLFKGIRAVLNVL 1.826
+CAMEL41 KWKLFKKGAVLKVLT 1.891
+CAMEL47 KWKLFKKRNAVLKVL 1.964
+CAMEL44 KWKLFKKIGANLKVL 2.07
+CAMEL12 KWKLFKRIGAVHKRL 2.242
+CAMEL53 HWKLFKKIHAVRKHL 2.244
+CAMEL112 KWKLFLHILAVLKVL 2.245
#CAMEL32 KWKLFRKIGAVHRVL 2.306
-CAMEL29 LWKLFKKHGAVLKVL 2.422
+CAMEL36 KWHLNKRIHAVLKRL 2.587
+CAMEL1 KWKLFKKGIGAVLKV 2.649
+CAMEL35 KWKLFRRIGAVLKHR 2.706
+CAMEL30 KRKRFRKIGAVLKVL 2.747
-CAMEL15 KWKLFKLRGRVRKVL 2.879
-CAMEL31 KWKLFKKIGLGLGVL 3.148
+CAMEL2 KWKLFKKLKVLTTGL 3.246
+CAMEL126 LWRLLKHILRVLKVL 3.259
+CAMEL55 KWLLFKKIGAVLLNH 3.425
-CAMEL37 LRKLFKKIRAVLLVR 3.617
+CAMEL125 LWRLLKKILRVLKVL 3.822
-CAMEL119 GWKLFKLIGAVLKVL 3.86
+CAMEL28 KWKLGKKIGAVLGVL 3.946
+CAMEL114 KWKLFHKILAVLKVL 4.105
-CAMEL61 KWKLFKKAVLKVLTT 4.105
-CAMEL111 KWKLFHLIGAVLKVL 4.165
-CAMEL49 NWKLFHKIGAVLKVL 4.187
-CAMEL25 KWKLRKKIGAVLKVL 4.262
+CAMEL43 KWKGFKKIGAVLKVL 4.319
-CAMEL105 GWKLGKKIGRVLKVL 4.336
+CAMEL124 KWKLFKLIRAVLKVL 4.533
+CAMEL7 KWKLFKKIGAVLHNL 4.534
+CAMEL121 GWKGFKKIGRVLKVL 4.759
+CAMEL27 KWKLFKKIGAVLNRL 4.869
+CAMEL3 KWGLFKKIGAVLKVL 4.989
+CAMEL81 KWKLFKKVLKVLTTG 5.297
-CAMEL106 GWKLFKKIGRVLKVL 5.318
+CAMEL127 GWKLFKKIGRVLRVL 5.47
+CAMEL130 LWKLFKKIRRLLKVL 5.49
+CAMEL128 LWKLFKKIGRVLKVL 5.493
+CAMEL131 LWKLFRKIRRLLRVL 5.504
-CAMEL104 GWKLGKKILRVLRVL 5.562
-CAMEL108 KWKLGKKILNVLKVL 5.566
+CAMEL109 GWRLGKKILRVLKVL 5.572
+CAMEL123 LWKLFKKIRRVLRVL 5.614
#CAMEL0 KWKLFKKIGAVLKVL 5.712
+CAMEL42 HWKLFKKIGAVLKVL 5.712
+CAMEL102 GWKLGKKILRVLKVL 5.725
-CAMEL6 NWKLFKKIGAVLKVL 5.803
+CAMEL23 KWHLFKKIGAVLKVL 5.81
#CAMEL101 KWKLGKKILRVLKVL 5.845
+CAMEL103 GWKLGLKILRVLKVL 5.879
+CAMEL22 GWKLFKKIGAVLKVL 5.946
-CAMEL110 GWKLGKKILNVLKVL 6.043
#CAMEL129 LWKLFKKINRVLKVL 6.045
+CAMEL4 KWKLFHKIGAVLKVL 6.072
+CAMEL24 KWKLFKHIGAVLKVL 6.167
+CAMEL132 GWKLGKHILNVLKVL 6.182
-CAMEL48 KWKLGKKIGAVLKVL 6.323
-CAMEL136 VWRLIKKILRVFKGL 6.613
+CAMEL135 GWRLIKKILRVFKGL 6.665
+CAMEL59 KGKGGKKGGRGGKVL 1.077
+CAMEL40 GWLLHRNIGNVLHRL 1.387
+CAMEL5 KWKLFKKNGAVLKVL 1.408
+CAMEL139 GWKLFKGIRAVLNVL 1.497
#CAMEL117 LWHLFLKILAVLKVL 1.515
-CAMEL140 GWRLLKKILEVLKVL 4.136
-CAMEL45 KWKNFKKIGAVLKVL 4.249
-CAMEL122 LWKLFKKIRRVLKVL 6.142
+CAMEL9 KWRLFKNIGAVLKVL 6.292
-CAMEL46 KWKLFKGIRAVLKVL 6.45
Table S3. Data set 3 (Split A)
+1 GTLVALVGL 5.34
-2 TTAEEAAGI 5.38
-3 LTVILGVLL 5.58
+4 AMFQDPQER 5.74
#5 SLHVGTQCA 5.84
-6 GIGILTVIL 6
#7 NLQSLTNLL 6
+8 FVTWHRYHL 6.02
+9 AIAKAAAAV 6.18
+10 DPKVKQWPL 6.18
#11 ALAKAAAAI 6.21
+12 GLGQVPLIV 6.3
+13 MLDLQPETT 6.34
+14 LLSSNLSWL 6.34
+15 GLACHQLCA 6.38
-16 AAAKAAAAV 6.4
-17 LIGNESFAL 6.42
#18 ALAKAAAAV 6.42
#19 ILTVILGVL 6.42
-20 LLAVGATKV 6.48
#21 MLLAVLYCL 6.48
+22 AVAKAAAAV 6.5
-23 ALAKAAAAL 6.51
+24 WILRGTSFV 6.56
-25 AAGIGILTV 6.58
-26 ALIHHNTHL 6.62
#27 ILDEAYVMA 6.62
+28 NLSWLSLDV 6.64
+29 YMIMVKCWM 6.66
+30 VLQAGFFLL 6.68
+31 TLHEYMLDL 6.73
+32 VILGVLLLI 6.78
+33 VTWHRYHLL 6.79
#34 PLLPIFFCL 6.8
-35 CLTSTVQLV 6.83
+36 HLYQGCQVV 6.83
-37 ILLLCLIFL 6.84
+38 FLCKQYLNL 6.88
-39 FAFRDLCIV 6.89
-40 FLEPGPVTA 6.9
#41 ITDQVPFSV 6.95
-42 LMAVVLASL 6.95
-43 LLCLIFLLV 7
-44 ITAQVPFSV 7.02
-45 YLEPGPVTL 7.06
#46 YTDQVPFSV 7.07
+47 NLYVSLLLL 7.11
+48 NLGNLNVSI 7.12
#49 ILHNGAYSL 7.13
+50 HLYSHPIIL 7.13
#51 VVMGTLVAL 7.17
-52 ITFQVPFSV 7.18
#53 YLEPGPVTI 7.19
+54 FTDQVPFSV 7.21
-55 GLSRYVARL 7.25
-56 YMLDLQPET 7.31
#57 VLLDYQGML 7.33
-58 YLEPGPVTV 7.34
+59 YLSPGPVTA 7.38
+60 IIDQVPFSV 7.4
-61 ITMQVPFSV 7.4
-62 YMNGTMSQV 7.4
-63 SVYDFFVWL 7.44
-64 ITYQVPFSV 7.48
+65 GLYSSTVPV 7.48
-66 YLSPGPVTV 7.64
-67 VLIQRNPQL 7.64
-68 SLYADSPSV 7.66
+69 RLLQETELV 7.68
-70 GLYSSTVPV 7.7
+71 ILSQVPFSV 7.7
+72 IMDQVPFSV 7.72
-73 ALMDKSLHV 7.77
-74 YLYPGPVTA 7.77
#75 YAIDLPVSV 7.8
-76 FVWLHYYSV 7.82
#77 LLFGYPVYV 7.89
+78 MMWYWGPSL 7.92
+79 YLMPGPVTV 7.93
-80 ILAQVPFSV 7.94
+81 WLDQVPFSV 7.94
-82 KTWGQYWQV 7.96
+83 YLYPGPVTV 8.05
-84 LLMGTLGIV 8.1
#85 ILMQVPFSV 8.12
+86 YLWPGPVTV 8.12
-87 FLLTRILTI 8.15
#88 YLFPGPVTV 8.24
+89 ILYQVPFSV 8.31
#90 ILWQVPFSV 8.77
+91 LQTTIHDII 5.5
+92 HLLVGSSGL 5.79
+93 ALPYWNFAT 5.87
#94 TVILGVLLL 6.07
+95 QVMSLHNLV 6.17
-96 VLHSFTDAI 6.38
#97 KLPQLCTEL 6.48
#98 YLEPGPVTA 6.67
-99 LLWFHISCL 6.68
-100 GTLGIVCPI 6.71
-101 QLFHLCLII 6.89
+102 ALAKAAAAA 6.95
-103 ALCRWGLLL 7
-104 HLAVIGALL 7
+105 LLAQFTSAI 7.3
+106 ILSPFMPLL 7.35
#107 KLHLYSHPI 7.35
-108 ITWQVPFSV 7.46
-109 KIFGSLAFL 7.48
-110 ALVGLFVLL 7.58
#111 LLLCLIFLL 7.58
-112 YLAPGPVTV 7.82
-113 ILDQVPFSV 8.48
-114 YLWPGPVTA 8.5
#115 YLDQVPFSV 8.64
#116 FLDQVPFSV 8.66
#117 ILFQVPFSV 8.7
Table S4. Y-randomization includes the following steps:
Step 1: 100 random shiftings of the predicted values of an endpoint, with calculation of R2r for each;
Step 2: Calculation of the average R2r;
Step 3: Calculation of CR2p.
Data 1, split A / Ntest= 9
R2 before Y-randomization / 0.7081
1 / 0.5493
2 / 0.2697
3 / 0.1293
4 / 0.3576
5 / 0.2622
6 / 0.2293
7 / 0.4056
8 / 0.4847
9 / 0.1144
10 / 0.0021
R2r / 0.2804
CR2p / 0.5503
Data 1, split B / Ntest= 10
R2 before Y-randomization / 0.7986
1 / 0.2723
2 / 0.2149
3 / 0.3774
4 / 0.0615
5 / 0.0088
6 / 0.1522
7 / 0.5825
8 / 0.0072
9 / 0.4241
10 / 0.1275
R2r / 0.2228
CR2p / 0.6781
Data 1, split C / Ntest= 12
R2 before Y-randomization / 0.7991
1 / 0.1141
2 / 0.4849
3 / 0.1481
4 / 0.0005
5 / 0.2584
6 / 0.1473
7 / 0.1262
8 / 0.0005
9 / 0.0080
10 / 0.0001
R2r / 0.1288
CR2p / 0.7319
Data 2, split A / Ntest= 10
R2 before Y-randomization / 0.9610
1 / 0.3113
2 / 0.0285
3 / 0.4274
4 / 0.1495
5 / 0.0085
6 / 0.0044
7 / 0.1053
8 / 0.0739
9 / 0.2893
10 / 0.4602
R2r / 0.1858
CR2p / 0.8631
Data 2, split B / Ntest= 13
R2 before Y-randomization / 0.9570
1 / 0.0002
2 / 0.0310
3 / 0.1953
4 / 0.0417
5 / 0.2014
6 / 0.0000
7 / 0.0002
8 / 0.2315
9 / 0.0087
10 / 0.0349
R2r / 0.0745
CR2p / 0.9190
Data 2, split C / Ntest= 13
R2 before Y-randomization / 0.9072
1 / 0.0966
2 / 0.1175
3 / 0.0341
4 / 0.1533
5 / 0.0930
6 / 0.0004
7 / 0.1352
8 / 0.0029
9 / 0.0089
10 / 0.0350
R2r / 0.0677
CR2p / 0.8727
Data 3, split A / Ntest= 20
R2 before Y-randomization / 0.7972
1 / 0.0799
2 / 0.0652
3 / 0.0380
4 / 0.0595
5 / 0.0259
6 / 0.0798
7 / 0.1076
8 / 0.0073
9 / 0.2646
10 / 0.1263
R2r / 0.0854
CR2p / 0.7533
Data 3, split B / Ntest= 23
R2 before Y-randomization / 0.7779
1 / 0.0404
2 / 0.0043
3 / 0.0007
4 / 0.0155
5 / 0.0053
6 / 0.0976
7 / 0.2104
8 / 0.0052
9 / 0.1029
10 / 0.2317
R2r / 0.0714
CR2p / 0.7413
Data 3, split C / Ntest= 22
R2 before Y-randomization / 0.8270
1 / 0.0147
2 / 0.0002
3 / 0.0720
4 / 0.1843
5 / 0.0107
6 / 0.1101
7 / 0.0186
8 / 0.1206
9 / 0.0167
10 / 0.0328
R2r / 0.0581
CR2p / 0.7974
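The Y-randomization steps listed under Table S4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The random shifting is modeled as a random permutation of the predicted values, R2 is taken as the squared Pearson correlation, and CR2p is computed as R * sqrt(R2 - R2r). The source does not state that formula, but it reproduces the tabulated CR2p values to four decimals. The names r_squared, c_r2_p, and y_randomization are hypothetical.

```python
import math
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return sxy * sxy / (sxx * syy)


def c_r2_p(r2, r2r_mean):
    """Assumed formula: CR2p = R * sqrt(R2 - R2r), clipped at zero."""
    return math.sqrt(r2) * math.sqrt(max(r2 - r2r_mean, 0.0))


def y_randomization(observed, predicted, n_runs=100, seed=0):
    """Steps 1 to 3: permute predictions, average R2r, then compute CR2p."""
    rng = random.Random(seed)
    r2 = r_squared(observed, predicted)
    shuffled = list(predicted)
    r2_runs = []
    for _ in range(n_runs):
        rng.shuffle(shuffled)  # assumption: "random shifting" is a permutation
        r2_runs.append(r_squared(observed, shuffled))
    r2r_mean = sum(r2_runs) / n_runs
    return r2, r2r_mean, c_r2_p(r2, r2r_mean)


# Check against the tabulated block "Data 1, split A" (ten listed runs).
runs = [0.5493, 0.2697, 0.1293, 0.3576, 0.2622,
        0.2293, 0.4056, 0.4847, 0.1144, 0.0021]
r2r = sum(runs) / len(runs)
print(round(r2r, 4), round(c_r2_p(0.7081, r2r), 4))  # prints 0.2804 0.5503
```

The printed pair agrees with the R2r and CR2p rows of that block. The same check with R2 = 0.9610 and R2r = 0.1858 (Data 2, split A) gives 0.8631.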