Supplementary File S2: Metacyclogenesis Determination by K-Means

We applied k-means clustering to distinguish metacyclic promastigotes from procyclic and transitional promastigotes on the basis of morphological data. K-means is a statistical method that iteratively searches for a specified number of clusters with minimal within-cluster variation (the within-cluster sum of squares, i.e., the squared distance from each point to the centre of its cluster, summed over all points).
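To make the within-cluster sum of squares concrete, the sketch below recomputes it by hand for a small simulated dataset and compares it with the value reported by R's `kmeans()`. The data and variable names are made up for illustration and are not part of the original analysis.

```r
# Toy data: two well-separated groups in two dimensions (made-up values)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)

# Recompute the objective by hand: squared distance of every point to the
# centre of its assigned cluster, summed over all points
centres_per_point <- km$centers[km$cluster, , drop = FALSE]
wss_manual <- sum((x - centres_per_point)^2)

# Compare with the total within-cluster sum of squares reported by kmeans()
all.equal(wss_manual, km$tot.withinss)
```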

As explained in the manuscript, FL values (flagellum length/cell body length) and LW values (cell body area) are derived from the morphological measurements. These data need to be exported as a *.txt file in the format described below (Figure S1). A custom script then needs to be run in the statistical software package ‘R’ (the full script is provided at the bottom of this document).

Importantly, clinical strains are known to differ in average cell size. To allow comparison of strains with different morphometric characteristics and to equalize the relative importance of FL and LW, k-means is performed on a transformed dataset. The LW values are transformed into treatment-specific z-scores. The FL values are only divided by the overall FL standard deviation across all strains, because FL is already a ratio that takes size differences into account. This transformation also resolves the problem that LW typically yields higher values than FL, which would otherwise give LW more weight than FL in the clustering (k-means is based on the Euclidean distance).
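The transformation can be sketched in a few lines of R. The example below uses a small hypothetical data frame with the same column names as the input file; it relies on `ave()` instead of the loop in the full script, so unlike the script it keeps the rows in their original order.

```r
# Hypothetical toy data using the column names of the input file
toy <- data.frame(
  LW = c(10, 12, 14, 20, 25, 30),
  FL = c(1.5, 1.9, 1.4, 1.0, 0.6, 1.6),
  TREATMENT = factor(c("A", "A", "A", "B", "B", "B"))
)

# LW: z-score within each treatment
# (subtract the treatment mean, divide by the treatment standard deviation)
LW_z <- ave(toy$LW, toy$TREATMENT, FUN = function(v) (v - mean(v)) / sd(v))

# FL: divide by the standard deviation of FL over all treatments pooled
FL_scaled <- toy$FL / sd(toy$FL)

# After the transformation, LW has mean 0 and SD 1 within every treatment,
# and FL has SD 1 over the pooled data
tapply(LW_z, toy$TREATMENT, mean)
tapply(LW_z, toy$TREATMENT, sd)
sd(FL_scaled)
```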

K-means is then applied to cluster the data into 3 clusters, as this appeared to be the optimal number of clusters when the total within-cluster sum of squares was assessed as a function of the number of clusters. This function always decreases (the more clusters, the smaller the within-cluster variation), but once the number of clusters that fits the data best has been reached, it decreases at a slower rate (‘elbow finding’). In this case, three clusters fitted the data markedly better than the two clusters (procyclic and metacyclic) that we originally expected. Adding a fourth cluster did not appreciably decrease the within-cluster sum of squares.
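The elbow criterion can be illustrated on simulated data. The sketch below is not the original analysis: the three group centres are invented, and `nstart = 25` is an assumption chosen to make the curve less dependent on the random starting points.

```r
set.seed(42)
# Simulated data with three groups (made-up centres, for illustration only)
sim <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
             cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5)))

# Total within-cluster sum of squares for 1 to 6 clusters
wss <- sapply(1:6, function(n) kmeans(sim, centers = n, nstart = 25)$tot.withinss)

# Decrease gained by each additional cluster;
# the elbow is the point after which this gain becomes small
drops <- -diff(wss)

plot(1:6, wss, type = "b", xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

With data of this kind, the gain from a fourth cluster is expected to be much smaller than the gain from a third, which is the pattern described above.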

After clustering the transformed data, the data are transformed back into their original form (true FL and LW values) to simplify interpretation. The next step is to identify which promastigote life stage corresponds to which cluster. The cluster containing the parasite with the highest FL value is taken as the metacyclic group (which also has a low LW), and the cluster containing the parasite with the highest LW value is taken as the procyclic group (which also has a low FL). The morphological characteristics of the third group lie in between those of the procyclic and metacyclic parasites, and these parasites are therefore considered to be transitional forms. The script then generates graphs for each ‘treatment’ and calculates the percentage of parasites in each cluster.
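Because `kmeans()` numbers its clusters arbitrarily, the clusters have to be identified by their content. A minimal sketch of this labelling step on simulated data follows; the values and the names `meta_id`, `pro_id`, `trans_id` and `stage` are hypothetical, and the guard against both maxima falling in the same cluster is an addition that the full script does not contain.

```r
set.seed(7)
# Made-up transformed values for three groups:
# high FL / low LW, low FL / high LW, and an in-between group
FL <- c(rnorm(30, 6, 0.3), rnorm(30, 2, 0.3), rnorm(30, 4, 0.3))
LW <- c(rnorm(30, -1.5, 0.3), rnorm(30, 1.5, 0.3), rnorm(30, 0, 0.3))

km <- kmeans(cbind(FL, LW), centers = 3, nstart = 50)

meta_id <- km$cluster[which.max(FL)]  # cluster holding the highest FL value
pro_id  <- km$cluster[which.max(LW)]  # cluster holding the highest LW value

# The rule only works if the two maxima fall in different clusters
stopifnot(meta_id != pro_id)
trans_id <- setdiff(1:3, c(meta_id, pro_id))  # the remaining cluster

stage <- character(length(FL))
stage[km$cluster == meta_id]  <- "Metacyclic"
stage[km$cluster == pro_id]   <- "Procyclic"
stage[km$cluster == trans_id] <- "Transitional"

table(stage)
```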

The final output is a *.txt file reporting the percentage of metacyclic, procyclic and transitional parasites in each treatment, together with the number of parasites that were measured and on which these calculations are based.
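As an illustration of how such a summary can be built, the sketch below tabulates hypothetical stage assignments per treatment. It uses `table()` and `prop.table()` rather than the loop in the full script, and it writes to a temporary directory instead of the output folder used by the script; all values are invented.

```r
# Hypothetical stage assignments for two treatments
treatment <- c("A", "A", "A", "A", "B", "B", "B", "B", "B")
stage <- c("Metacyclic", "Metacyclic", "Procyclic", "Transitional",
           "Metacyclic", "Procyclic", "Procyclic", "Procyclic", "Transitional")

# Counts per treatment and stage, then row-wise percentages
counts <- table(treatment, stage)
pct <- round(100 * prop.table(counts, margin = 1), 1)

# One row per treatment: percentages per stage plus the number of parasites
summary_tab <- data.frame(Treatment = rownames(pct),
                          as.data.frame.matrix(pct),
                          N = as.vector(rowSums(counts)),
                          row.names = NULL)

# Tab-separated text file (temporary location, for illustration only)
out_file <- file.path(tempdir(), "Summary.txt")
write.table(summary_tab, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
summary_tab
```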

Figure S1: LEFT: The pooled dataset, with the clusters projected back onto the original values. Because of the treatment-specific size corrections, green, black and red dots are mixed at the cluster borders. RIGHT: Objective function: the total within-cluster sum of squares (the summed squared distances to the cluster centres) as a function of the number of clusters. We chose 3 clusters because the decrease in variation beyond this point is minimal.

Example of input file

The data file that will be imported into R must have 3 columns called LW, FL and TREATMENT. Each row represents one parasite. The LW and FL columns contain the L*W and F/L data respectively (calculated in Excel). The TREATMENT column gives the name of the strain to which the LW and FL data in that row belong. The ‘...’ entries indicate that further measurements of the same strain have been left out here to keep the example readable. They should not be included in the actual input file.
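A quick check of the file layout before running the full script might look as follows. This is a sketch only: it reads a few rows copied from the example in this section from an inline string, and it assumes that the exported file is whitespace-separated.

```r
# A few rows in the layout described above
# (whitespace-separated columns are an assumption about the exported file)
txt <- "LW FL TREATMENT
11.78588 1.480436 450RSTAT
9.439816 1.908546 450RSTAT
28.42775 1.570487 500STAT
20.53287 0.974139 500STAT"

# stringsAsFactors = TRUE makes TREATMENT a factor, so that levels() can be used on it
dat <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

# The three required columns must be present, and LW and FL must be numeric;
# a stray placeholder row in the file would turn them into text and fail this check
stopifnot(all(c("LW", "FL", "TREATMENT") %in% names(dat)))
stopifnot(is.numeric(dat$LW), is.numeric(dat$FL))

# Number of measured parasites per strain
table(dat$TREATMENT)
```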

LW / FL / TREATMENT
11.78588 / 1.480436 / 450RSTAT
9.439816 / 1.908546 / 450RSTAT
… / … / …
12.21457 / 1.464691 / 450RSTAT
28.42775 / 1.570487 / 500STAT
20.53287 / 0.974139 / 500STAT
… / … / …
19.82847 / 0.616049 / 500STAT
16.5174 / 1.491622 / 514RSTAT
9.83758 / 1.374679 / 514RSTAT
… / … / …
10.24951 / 2.19047 / 514RSTAT
13.68904 / 1.176399 / WT(602STAT)
18.51614 / 1.00781 / WT(602STAT)
… / … / …
19.11102 / 1.049548 / WT(602STAT)

Detailed script that was used

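#import dataset (make sure that the last of the alphabetically sorted strains is the WT)

#rows must be grouped by TREATMENT in alphabetical order: the loop below rebuilds the data in that order, while the plots use the original row order

#stringsAsFactors = TRUE is needed in R >= 4.0.0 so that levels(TREATMENT) works

datastat<-read.table(file.choose(), header = TRUE, stringsAsFactors = TRUE)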

attach(datastat)

unlink("C:/MetacyclogenesisR", recursive = TRUE, force = FALSE)

dir.create("C:/MetacyclogenesisR")

#perform data transformation and apply k-means method

TREATMENTORIG<-TREATMENT

name<-levels(TREATMENT)

FLnew<-NULL

LWnew<-NULL

Namenew<-NULL

for (i in 1:length(levels(TREATMENT))){

FLnew<-append(FLnew,(datastat$FL[TREATMENT==name[i]])/sd(datastat$FL))

LWnew<-append(LWnew,(datastat$LW[TREATMENT==name[i]]-mean(datastat$LW[TREATMENT==name[i]]))/sd(datastat$LW[TREATMENT==name[i]]))

Namenew<-append(Namenew, datastat$TREATMENT[TREATMENT==name[i]])

}

FL<-FLnew

LW<-LWnew

TREATMENT<-name[Namenew]

data2<-cbind(FL,LW)

k<-kmeans(data2,3,nstart=1000)

clusFL<-k$cluster[FL==max(FL)]

clusLW<-k$cluster[LW==max(LW)]

clusMID<-6-clusFL-clusLW

clusnew<-NULL

for (i in 1:length(LW)){

if (k$cluster[i]==clusFL) {clusnew[i]<-1} else if (k$cluster[i]==clusLW){clusnew[i]<-2} else {clusnew[i]<-3}

}

jpeg("C:/MetacyclogenesisR/pooled.jpg")

plot(datastat$LW,datastat$FL, xlab= "LW", ylab="FL", main = 'Metacyclogenesis pooled')

points(datastat$LW, datastat$FL, col= clusnew)

legend("topright",legend = c('Metacyclic', 'Procyclic','Transition phase'),pch=16,col= c('black', 'red','green'))

dev.off()

#control model (shows the variation depending on the number of clusters)

fit<-matrix(data=NA,nrow=10,ncol=1,dimnames=NULL)

for (i in 1:10){

#separate variable name so that clusFL (the metacyclic cluster number) is not overwritten

kfit<-kmeans(data2,i)

fit[i]<-sum(kfit$withinss)

}

jpeg("C:/MetacyclogenesisR/objectivefunction.jpg")

plot(fit, main = 'Objective Function', ylab='Total within-cluster sum of squares', xlab= 'Number of clusters')

dev.off()

#determine metacyclogenesis percentages, etcetera

meta<-NULL

pro<-NULL

between<-NULL

mains<-NULL

files<-NULL

N<-NULL

for (i in 1:length(levels(TREATMENTORIG))){

#multiply by 100 so that the values match the (%) column headers of the summary table

meta[i]<-100*length(LW[TREATMENT==name[i]&clusnew==1])/length(LW[TREATMENT==name[i]])

pro[i]<-100*length(LW[TREATMENT==name[i]&clusnew==2])/length(LW[TREATMENT==name[i]])

between[i]<-100*length(LW[TREATMENT==name[i]&clusnew==3])/length(LW[TREATMENT==name[i]])

N[i]<-length(LW[TREATMENT==name[i]])

mains[i]<- paste ("Metacyclogenesis",name[i], collapse = NULL)

files[i]<- paste ("C:/MetacyclogenesisR/",name[i],".jpg",sep="")

jpeg(files[i])

plot(datastat$LW[TREATMENT==name[i]], datastat$FL[TREATMENT==name[i]], main = mains[i], xlab= "LW", ylab="FL", pch=16)

points(datastat$LW[TREATMENT==name[i]&clusnew==1], datastat$FL[TREATMENT==name[i]&clusnew==1], col=1, pch=16)

points(datastat$LW[TREATMENT==name[i]&clusnew==2], datastat$FL[TREATMENT==name[i]&clusnew==2], col=2, pch=16)

points(datastat$LW[TREATMENT==name[i]&clusnew==3], datastat$FL[TREATMENT==name[i]&clusnew==3], col=3, pch=16)

legend("topright", legend = c('Metacyclic', 'Procyclic','Transition phase'),pch=16,col= c('black', 'red','green'))

dev.off()

}

tab1<- matrix(c(c(name),c(meta),c(pro),c(between),c(N)), ncol= 5,byrow=FALSE)

colnames(tab1) <- c("Treatment","Metacyclogenesis(%)","Procyclogenesis(%)", "Transition Phase(%)","N")

Summary<-as.table(tab1, row.names = FALSE)

write.table(Summary,"C:/MetacyclogenesisR/Summary.txt", col.names=NA)