QSAR modeling of endpoints for peptides which is based on representation of the molecular structure by a sequence of amino acids

Supplementary materials

Important:

+ indicates the sub-training set

- indicates the calibration set

# indicates the test set
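
As an illustration of the row layout used in Tables S1 to S3 (set marker, identifier, sequence, endpoint value), the following minimal Python sketch splits one row into its fields. The names `Record`, `SET_NAMES`, and `parse_row` are hypothetical helpers introduced here for illustration; they are not part of the original work.

```python
from typing import NamedTuple

# Marker-to-set mapping taken from the legend above.
SET_NAMES = {"+": "sub-training", "-": "calibration", "#": "test"}


class Record(NamedTuple):
    subset: str       # name of the set the peptide belongs to
    peptide_id: str   # identifier as printed in the table, e.g. "P01"
    sequence: str     # one-letter amino acid sequence
    endpoint: float   # numerical endpoint value


def parse_row(line: str) -> Record:
    """Parse a row such as '+P01 WLEPGPVTA 6.082' (hypothetical helper)."""
    line = line.strip()
    marker, rest = line[0], line[1:]
    peptide_id, sequence, value = rest.split()
    return Record(SET_NAMES[marker], peptide_id, sequence, float(value))


# Three rows copied from Table S1.
rows = ["+P01 WLEPGPVTA 6.082", "#P02 ITSQVPFSV 6.196", "-P10 IIDQVPFSV 7.398"]
records = [parse_row(row) for row in rows]
for record in records:
    print(record.subset, record.peptide_id, record.sequence, record.endpoint)
```

Because the marker is always the first character, purely numeric identifiers (as in Table S3, e.g. "-2 TTAEEAAGI 5.38") are handled in the same way.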

Table S1. Data set 1 (split A)

+P01 WLEPGPVTA 6.082

#P02 ITSQVPFSV 6.196

+P03 FLEPGPVTA 6.898

#P04 ITAQVPFSV 7.020

#P05 YLEPGPVTL 7.058

#P06 YTDQVPFSV 7.066

+P07 YLEPGPVTI 7.187

+P08 YLEPGPVTV 7.342

#P09 YLSPGPVTA 7.383

-P10 IIDQVPFSV 7.398

+P11 ITWQVPFSV 7.463

+P12 ITYQVPFSV 7.480

+P13 ILSQVPFSV 7.699

-P14 IMDQVPFSV 7.719

-P15 YLMPGPVTV 7.932

+P16 WLDQVPFSV 7.939

#P17 YLAPGPVTA 8.032

-P18 YLYPGPVTV 8.051

+P19 YLWPGPVTV 8.125

+P20 ILYQVPFSV 8.310

-P21 ILDQVPFSV 8.481

+P22 YLFPGPVTA 8.495

+P23 YLDQVPFSV 8.638

-P24 ILFQVPFSV 8.699

+P25 ILWQVPFSV 8.770

-P26 WTDQVPFSV 6.145

+P27 YLEPGPVTA 6.668

-P28 ITDQVPFSV 6.947

-P29 ITFQVPFSV 7.179

-P30 FTDQVPFSV 7.212

-P31 ITMQVPFSV 7.398

#P32 YLSPGPVTV 7.642

+P33 YLYPGPVTA 7.772

-P34 YLAPGPVTV 7.818

#P35 ILAQVPFSV 7.939

+P36 ILMQVPFSV 8.125

-P37 YLFPGPVTV 8.237

#P38 YLMPGPVTA 8.367

+P39 YLWPGPVTA 8.495

-P40 FLDQVPFSV 8.658

Table S2. Data set 2 (Split A)

+CAMEL118 KWKLFLGILAVLKVL 0.159

+CAMEL17 KWNLNGNINAVLKVL 0.209

-CAMEL38 KWKGELEIEAELKVL 0.376

-CAMEL107 GWKLGLKILNVLKVL 0.496

+CAMEL20 KWKLFKKNNNNNKHN 0.498

-CAMEL116 KWHLFLLILAVLKVL 0.514

#CAMEL34 KRGLFKKGGAVLKGL 0.528

-CAMEL18 KWHLRNKIGAVRNNL 0.537

#CAMEL16 KHKLFKKIGAHRKRN 0.553

+CAMEL39 HWHLHKHRGARHKVL 0.677

+CAMEL134 GWELGEEILNVLKVL 0.708

-CAMEL115 KWHLFLKILAVLKVL 0.741

+CAMEL50 KWKLFKKHGNVRKVL 0.771

-CAMEL10 KNKRNKKIGAVLKVL 0.848

+CAMEL14 KHNLFKGIGAVLLVL 0.922

+CAMEL51 KWKLFKKIGNRNKVL 0.947

+CAMEL113 LWKLFLHILAVLKVL 0.962

+CAMEL26 KNKLEKKIGAVLKVL 1.027

+CAMEL52 KWKLGKGIGAVGKVL 1.033

-CAMEL58 KWKLFNRIGHNRKVN 1.049

+CAMEL137 GWRLFRGIRAVLNVL 1.074

#CAMEL54 KWGLFKNIGAVLHVN 1.156

-CAMEL57 RWKLNNNIGARLKVL 1.206

+CAMEL33 HWKLFKKIGHVNKRL 1.34

+CAMEL60 HWKRFLRIGHNLNVN 1.495

+CAMEL11 KWKLFKKIGGVGGVL 1.593

+CAMEL120 GWKLFLKILAVLKVL 1.598

#CAMEL13 GWKLFKNRGAVLKHL 1.605

-CAMEL8 KWKLFNKRGAVLKVL 1.605

+CAMEL19 RWKNFKNIRANLRVL 1.742

+CAMEL56 KWKLFGKNGRNLLVL 1.814

#CAMEL138 GWRLFKGIRAVLNVL 1.826

+CAMEL41 KWKLFKKGAVLKVLT 1.891

+CAMEL47 KWKLFKKRNAVLKVL 1.964

+CAMEL44 KWKLFKKIGANLKVL 2.07

+CAMEL12 KWKLFKRIGAVHKRL 2.242

+CAMEL53 HWKLFKKIHAVRKHL 2.244

+CAMEL112 KWKLFLHILAVLKVL 2.245

#CAMEL32 KWKLFRKIGAVHRVL 2.306

-CAMEL29 LWKLFKKHGAVLKVL 2.422

+CAMEL36 KWHLNKRIHAVLKRL 2.587

+CAMEL1 KWKLFKKGIGAVLKV 2.649

+CAMEL35 KWKLFRRIGAVLKHR 2.706

+CAMEL30 KRKRFRKIGAVLKVL 2.747

-CAMEL15 KWKLFKLRGRVRKVL 2.879

-CAMEL31 KWKLFKKIGLGLGVL 3.148

+CAMEL2 KWKLFKKLKVLTTGL 3.246

+CAMEL126 LWRLLKHILRVLKVL 3.259

+CAMEL55 KWLLFKKIGAVLLNH 3.425

-CAMEL37 LRKLFKKIRAVLLVR 3.617

+CAMEL125 LWRLLKKILRVLKVL 3.822

-CAMEL119 GWKLFKLIGAVLKVL 3.86

+CAMEL28 KWKLGKKIGAVLGVL 3.946

+CAMEL114 KWKLFHKILAVLKVL 4.105

-CAMEL61 KWKLFKKAVLKVLTT 4.105

-CAMEL111 KWKLFHLIGAVLKVL 4.165

-CAMEL49 NWKLFHKIGAVLKVL 4.187

-CAMEL25 KWKLRKKIGAVLKVL 4.262

+CAMEL43 KWKGFKKIGAVLKVL 4.319

-CAMEL105 GWKLGKKIGRVLKVL 4.336

+CAMEL124 KWKLFKLIRAVLKVL 4.533

+CAMEL7 KWKLFKKIGAVLHNL 4.534

+CAMEL121 GWKGFKKIGRVLKVL 4.759

+CAMEL27 KWKLFKKIGAVLNRL 4.869

+CAMEL3 KWGLFKKIGAVLKVL 4.989

+CAMEL81 KWKLFKKVLKVLTTG 5.297

-CAMEL106 GWKLFKKIGRVLKVL 5.318

+CAMEL127 GWKLFKKIGRVLRVL 5.47

+CAMEL130 LWKLFKKIRRLLKVL 5.49

+CAMEL128 LWKLFKKIGRVLKVL 5.493

+CAMEL131 LWKLFRKIRRLLRVL 5.504

-CAMEL104 GWKLGKKILRVLRVL 5.562

-CAMEL108 KWKLGKKILNVLKVL 5.566

+CAMEL109 GWRLGKKILRVLKVL 5.572

+CAMEL123 LWKLFKKIRRVLRVL 5.614

#CAMEL0 KWKLFKKIGAVLKVL 5.712

+CAMEL42 HWKLFKKIGAVLKVL 5.712

+CAMEL102 GWKLGKKILRVLKVL 5.725

-CAMEL6 NWKLFKKIGAVLKVL 5.803

+CAMEL23 KWHLFKKIGAVLKVL 5.81

#CAMEL101 KWKLGKKILRVLKVL 5.845

+CAMEL103 GWKLGLKILRVLKVL 5.879

+CAMEL22 GWKLFKKIGAVLKVL 5.946

-CAMEL110 GWKLGKKILNVLKVL 6.043

#CAMEL129 LWKLFKKINRVLKVL 6.045

+CAMEL4 KWKLFHKIGAVLKVL 6.072

+CAMEL24 KWKLFKHIGAVLKVL 6.167

+CAMEL132 GWKLGKHILNVLKVL 6.182

-CAMEL48 KWKLGKKIGAVLKVL 6.323

-CAMEL136 VWRLIKKILRVFKGL 6.613

+CAMEL135 GWRLIKKILRVFKGL 6.665

+CAMEL59 KGKGGKKGGRGGKVL 1.077

+CAMEL40 GWLLHRNIGNVLHRL 1.387

+CAMEL5 KWKLFKKNGAVLKVL 1.408

+CAMEL139 GWKLFKGIRAVLNVL 1.497

#CAMEL117 LWHLFLKILAVLKVL 1.515

-CAMEL140 GWRLLKKILEVLKVL 4.136

-CAMEL45 KWKNFKKIGAVLKVL 4.249

-CAMEL122 LWKLFKKIRRVLKVL 6.142

+CAMEL9 KWRLFKNIGAVLKVL 6.292

-CAMEL46 KWKLFKGIRAVLKVL 6.45

Table S3. Data set 3 (Split A)

+1 GTLVALVGL 5.34

-2 TTAEEAAGI 5.38

-3 LTVILGVLL 5.58

+4 AMFQDPQER 5.74

#5 SLHVGTQCA 5.84

-6 GIGILTVIL 6

#7 NLQSLTNLL 6

+8 FVTWHRYHL 6.02

+9 AIAKAAAAV 6.18

+10 DPKVKQWPL 6.18

#11 ALAKAAAAI 6.21

+12 GLGQVPLIV 6.3

+13 MLDLQPETT 6.34

+14 LLSSNLSWL 6.34

+15 GLACHQLCA 6.38

-16 AAAKAAAAV 6.4

-17 LIGNESFAL 6.42

#18 ALAKAAAAV 6.42

#19 ILTVILGVL 6.42

-20 LLAVGATKV 6.48

#21 MLLAVLYCL 6.48

+22 AVAKAAAAV 6.5

-23 ALAKAAAAL 6.51

+24 WILRGTSFV 6.56

-25 AAGIGILTV 6.58

-26 ALIHHNTHL 6.62

#27 ILDEAYVMA 6.62

+28 NLSWLSLDV 6.64

+29 YMIMVKCWM 6.66

+30 VLQAGFFLL 6.68

+31 TLHEYMLDL 6.73

+32 VILGVLLLI 6.78

+33 VTWHRYHLL 6.79

#34 PLLPIFFCL 6.8

-35 CLTSTVQLV 6.83

+36 HLYQGCQVV 6.83

-37 ILLLCLIFL 6.84

+38 FLCKQYLNL 6.88

-39 FAFRDLCIV 6.89

-40 FLEPGPVTA 6.9

#41 ITDQVPFSV 6.95

-42 LMAVVLASL 6.95

-43 LLCLIFLLV 7

-44 ITAQVPFSV 7.02

-45 YLEPGPVTL 7.06

#46 YTDQVPFSV 7.07

+47 NLYVSLLLL 7.11

+48 NLGNLNVSI 7.12

#49 ILHNGAYSL 7.13

+50 HLYSHPIIL 7.13

#51 VVMGTLVAL 7.17

-52 ITFQVPFSV 7.18

#53 YLEPGPVTI 7.19

+54 FTDQVPFSV 7.21

-55 GLSRYVARL 7.25

-56 YMLDLQPET 7.31

#57 VLLDYQGML 7.33

-58 YLEPGPVTV 7.34

+59 YLSPGPVTA 7.38

+60 IIDQVPFSV 7.4

-61 ITMQVPFSV 7.4

-62 YMNGTMSQV 7.4

-63 SVYDFFVWL 7.44

-64 ITYQVPFSV 7.48

+65 GLYSSTVPV 7.48

-66 YLSPGPVTV 7.64

-67 VLIQRNPQL 7.64

-68 SLYADSPSV 7.66

+69 RLLQETELV 7.68

-70 GLYSSTVPV 7.7

+71 ILSQVPFSV 7.7

+72 IMDQVPFSV 7.72

-73 ALMDKSLHV 7.77

-74 YLYPGPVTA 7.77

#75 YAIDLPVSV 7.8

-76 FVWLHYYSV 7.82

#77 LLFGYPVYV 7.89

+78 MMWYWGPSL 7.92

+79 YLMPGPVTV 7.93

-80 ILAQVPFSV 7.94

+81 WLDQVPFSV 7.94

-82 KTWGQYWQV 7.96

+83 YLYPGPVTV 8.05

-84 LLMGTLGIV 8.1

#85 ILMQVPFSV 8.12

+86 YLWPGPVTV 8.12

-87 FLLTRILTI 8.15

#88 YLFPGPVTV 8.24

+89 ILYQVPFSV 8.31

#90 ILWQVPFSV 8.77

+91 LQTTIHDII 5.5

+92 HLLVGSSGL 5.79

+93 ALPYWNFAT 5.87

#94 TVILGVLLL 6.07

+95 QVMSLHNLV 6.17

-96 VLHSFTDAI 6.38

#97 KLPQLCTEL 6.48

#98 YLEPGPVTA 6.67

-99 LLWFHISCL 6.68

-100 GTLGIVCPI 6.71

-101 QLFHLCLII 6.89

+102 ALAKAAAAA 6.95

-103 ALCRWGLLL 7

-104 HLAVIGALL 7

+105 LLAQFTSAI 7.3

+106 ILSPFMPLL 7.35

#107 KLHLYSHPI 7.35

-108 ITWQVPFSV 7.46

-109 KIFGSLAFL 7.48

-110 ALVGLFVLL 7.58

#111 LLLCLIFLL 7.58

-112 YLAPGPVTV 7.82

-113 ILDQVPFSV 8.48

-114 YLWPGPVTA 8.5

#115 YLDQVPFSV 8.64

#116 FLDQVPFSV 8.66

#117 ILFQVPFSV 8.7

Table S4. Y-randomization includes the following steps:

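Step 1: 100 random shifts of the predicted values of an endpoint, with calculation of R2r;

Step 2: Calculation of the average R2r;

Step 3: Calculation of CR2p = R × sqrt(R2 − R2r), where R2 is the value before Y-randomization and R is its square root.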

Data 1, split A / Ntest= 9
R2 before Y-randomization / 0.7081
1 / 0.5493
2 / 0.2697
3 / 0.1293
4 / 0.3576
5 / 0.2622
6 / 0.2293
7 / 0.4056
8 / 0.4847
9 / 0.1144
10 / 0.0021
R2r / 0.2804
CR2p / 0.5503
Data 1, split B / Ntest= 10
R2 before Y-randomization / 0.7986
1 / 0.2723
2 / 0.2149
3 / 0.3774
4 / 0.0615
5 / 0.0088
6 / 0.1522
7 / 0.5825
8 / 0.0072
9 / 0.4241
10 / 0.1275
R2r / 0.2228
CR2p / 0.6781
Data 1, split C / Ntest= 12
R2 before Y-randomization / 0.7991
1 / 0.1141
2 / 0.4849
3 / 0.1481
4 / 0.0005
5 / 0.2584
6 / 0.1473
7 / 0.1262
8 / 0.0005
9 / 0.0080
10 / 0.0001
R2r / 0.1288
CR2p / 0.7319
Data 2, split A / Ntest= 10
R2 before Y-randomization / 0.9610
1 / 0.3113
2 / 0.0285
3 / 0.4274
4 / 0.1495
5 / 0.0085
6 / 0.0044
7 / 0.1053
8 / 0.0739
9 / 0.2893
10 / 0.4602
R2r / 0.1858
CR2p / 0.8631
Data 2, split B / Ntest= 13
R2 before Y-randomization / 0.9570
1 / 0.0002
2 / 0.0310
3 / 0.1953
4 / 0.0417
5 / 0.2014
6 / 0.0000
7 / 0.0002
8 / 0.2315
9 / 0.0087
10 / 0.0349
R2r / 0.0745
CR2p / 0.9190
Data 2, split C / Ntest= 13
R2 before Y-randomization / 0.9072
1 / 0.0966
2 / 0.1175
3 / 0.0341
4 / 0.1533
5 / 0.0930
6 / 0.0004
7 / 0.1352
8 / 0.0029
9 / 0.0089
10 / 0.0350
R2r / 0.0677
CR2p / 0.8727
Data 3, split A / Ntest= 20
R2 before Y-randomization / 0.7972
1 / 0.0799
2 / 0.0652
3 / 0.0380
4 / 0.0595
5 / 0.0259
6 / 0.0798
7 / 0.1076
8 / 0.0073
9 / 0.2646
10 / 0.1263
R2r / 0.0854
CR2p / 0.7533
Data 3, split B / Ntest= 23
R2 before Y-randomization / 0.7779
1 / 0.0404
2 / 0.0043
3 / 0.0007
4 / 0.0155
5 / 0.0053
6 / 0.0976
7 / 0.2104
8 / 0.0052
9 / 0.1029
10 / 0.2317
R2r / 0.0714
CR2p / 0.7413
Data 3, split C / Ntest= 22
R2 before Y-randomization / 0.8270
1 / 0.0147
2 / 0.0002
3 / 0.0720
4 / 0.1843
5 / 0.0107
6 / 0.1101
7 / 0.0186
8 / 0.1206
9 / 0.0167
10 / 0.0328
R2r / 0.0581
CR2p / 0.7974
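
The arithmetic behind Table S4 can be sketched in Python as follows. Two assumptions are made here and should be read as such: "random shifting" is interpreted as a random permutation of the predicted values, and CR2p is computed as R × sqrt(R2 − R2r), a form that reproduces the tabulated values. The function names `squared_pearson`, `randomized_r2`, and `c_r2_p` are hypothetical and are not taken from the original work.

```python
import math
import random
from statistics import mean


def squared_pearson(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def randomized_r2(observed, predicted, n_runs=10, seed=0):
    """R2 after randomly permuting the predicted values, once per run.

    Assumption: 'random shifting' is modelled as a plain shuffle.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        shuffled = list(predicted)
        rng.shuffle(shuffled)
        results.append(squared_pearson(observed, shuffled))
    return results


def c_r2_p(r2, r2r):
    """CR2p = R * sqrt(R2 - R2r), assumed form consistent with Table S4."""
    return math.sqrt(r2) * math.sqrt(r2 - r2r)


# Values copied from the block "Data 1, split A" of Table S4.
r2_before = 0.7081
runs = [0.5493, 0.2697, 0.1293, 0.3576, 0.2622,
        0.2293, 0.4056, 0.4847, 0.1144, 0.0021]

r2r = mean(runs)
print(round(r2r, 4))                     # 0.2804
print(round(c_r2_p(r2_before, r2r), 4))  # 0.5503
```

With the tabulated runs as input, the sketch returns the R2r and CR2p reported for Data 1, split A; `randomized_r2` is included only to show how such runs might be generated from observed and predicted endpoint values.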