SUPPLEMENTAL FIGURES

Supplemental Figure 1. Distribution of IGHV sequences of the present study according to mutation status.

Based on the percentage of identity to germline, this collection of sequences was divided into four major “identity groups”: “truly unmutated” (100% identity; 864 sequences), “minimally mutated” (99-99.9% identity; 243 sequences), “borderline mutated” (98-98.9% identity; 129 sequences) and “mutated” (<98% identity; 1426 sequences). The IGHV repertoires of the “mutated”, “minimally mutated”, “borderline mutated” and “truly unmutated” subgroups differed (Supplemental Table III), in keeping with previous reports. At the individual gene level, the distribution of IGHV gene rearrangements according to mutation status also varied significantly (Supplemental Table III).
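The grouping rule above can be sketched in a few lines of Python. This is only an illustration: the function name `identity_group` is hypothetical, and the handling of values that fall between the quoted ranges (for example 98.95%) is an assumption, resolved here by treating each range as half-open.

```python
def identity_group(percent_identity: float) -> str:
    """Assign an IGHV sequence to one of the four identity groups.

    Assumption: the quoted ranges are treated as half-open intervals,
    i.e. [99, 100) is "minimally mutated" and [98, 99) is "borderline mutated".
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity == 100.0:
        return "truly unmutated"
    if percent_identity >= 99.0:
        return "minimally mutated"
    if percent_identity >= 98.0:
        return "borderline mutated"
    return "mutated"


# Group sizes as reported in the text
group_counts = {
    "truly unmutated": 864,
    "minimally mutated": 243,
    "borderline mutated": 129,
    "mutated": 1426,
}
total_sequences = sum(group_counts.values())
# Share of each group in the whole collection, in percent
group_shares = {g: 100 * n / total_sequences for g, n in group_counts.items()}
```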

Supplemental Figure 2. Distribution of HCDR3 lengths of the CLL cases from our series included in the present study. The x axis depicts HCDR3 amino acid length; the y axis refers to % of sequences with a given length.

HCDR3 length ranged from 4-32 amino acids (AA) (median, 17) (Supplemental Table VI). “Truly unmutated” sequences had significantly longer HCDR3s (median 21 AA) than all other sequences. A significant difference in HCDR3 length was also observed between “minimally mutated” (median 19 AA) and “borderline mutated” or “mutated” sequences (median 15 AA for both groups) (Supplemental Table VI; Supplemental Figure 3).

Supplemental Figure 3. Distribution of HCDR3 lengths of the CLL cases from our series according to mutation status. The x axis depicts HCDR3 amino acid length; the y axis refers to % of sequences with a given length.

The striking peak at 9 AA in the “borderline mutated” group is made up predominantly of IGHV3-21 cases with distinctive, stereotyped HCDR3s (Supplemental Figure 4). A further striking peak at 13 AA in both the “minimally mutated” and “truly unmutated” groups is made up predominantly of rearrangements with stereotyped HCDR3s utilizing genes of the IGHV1/5/7 clan along with IGHD6-19 and IGHJ4. The increase in HCDR3 length observed among truly unmutated sequences is mainly accounted for by rearrangements of the IGHV1-69 gene (for example, 40.35% of all sequences with a 24 AA-long HCDR3 utilized the IGHV1-69 gene). Nevertheless, two groups of mutated IGHV4-34 sequences with stereotyped HCDR3s (corresponding to subsets 4 and 16 in Murray et al.), both utilizing the IGHJ6 gene, carried significantly longer HCDR3s (20 and 24 AA, respectively) than other mutated cases (p<0.001).

Supplemental Figure 4. Distribution of HCDR3 lengths of the CLL cases from our series utilizing IGHV3 subgroup genes. The striking peak at 9 AA in the “borderline mutated” group is made up predominantly of IGHV3-21 cases with distinctive, stereotyped HCDR3s. Green line: all cases; black line: cases utilizing the IGHV3-21 gene; red line: all other IGHV3 genes except IGHV3-21.

Clustering of CLL sequences based on HCDR3 patterns

Supplemental Figure 5. A Biolayout 2D representation of two clustering examples with real data, namely the formation of clusters 2-0020 and 1-0089 with four samples each. Black nodes are either clusters (smaller nodes) or samples (larger nodes). Blue lines connect clusters to samples and clusters to clusters, while red lines connect samples between themselves. A. Cluster 2-0020 with samples NY278, P2959, F73, and N2518. B. Cluster 1-0089 with samples NY585, FRA-340, DK76, and FRA-069.

The clustering procedure for these samples is as follows:

Sample NY278 (HCDR3 sequence CARGPDESGWCGFRYW) was connected to P2959 (HCDR3 sequence CARGPDISGWNGFEYW) by pattern ARGPD.SGW.GF.Y providing 78.57% identity, forming the level 0 cluster 0-0168. P2959 was then found to be connected to F73 (CARGPDTSGWNSLDYW) by ARGPD.SGWN..[DE]Y providing 71.43% identity and 78.57% similarity, making F73 a candidate for membership in cluster 0-0168. However, F73 was not found to be connected to the other member of the cluster, NY278, and therefore was not allowed to join the cluster. Consequently, F73 formed a new level 0 cluster 0-0247 by borrowing P2959. F73 (CARGPDTSGWNSLDYW) was finally found to be connected to N2518 (CARGPDESGWLALAYW) by ARGPD.SGW..L.Y with 71.43% identity, making N2518 a candidate for membership in cluster 0-0247. As in the previous case though, N2518 was not found to be connected to the other member of the cluster, P2959, therefore it had to borrow F73 and form yet another new level 0 cluster 0-0250. Since clusters 0-0168 and 0-0247 shared P2959 they were connected on level 1 to form cluster 1-0107; similarly, clusters 0-0247 and 0-0250 shared F73 and were thus connected on level 1 to form cluster 1-0110. Finally, since clusters 1-0107 and 1-0110 shared cluster 0-0247, they formed the level 2 cluster 2-0020 (Supplemental Figure 5A).

Samples FRA-340 (CARAGEMATVFGRGAFDIW) and NY585 (CARAGEMATLMGLGAFDIW) were connected by pattern ARAGEMAT[AVLI].G.GAFDI providing 82.35% identity and 88.24% similarity to form level 0 cluster 0-0143. With the same identity and similarity values, the pattern REGEMAT.[KRH]GFGAFDI connected samples DK76 (CGREGEMATQRGFGAFDIW) and FRA-069 (CAREGEMATMKGFGAFDIW) to form cluster 0-0144. Pattern AR.GEMAT..G.GAFDI connected FRA-069, FRA-340, and NY585 (76.47% identity), and all four samples shared pattern R.GEMAT..G.GAFDI (70.59% identity). Therefore the two level 0 clusters were connected into the 1-0089 cluster (Supplemental Figure 5B).
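The pairwise patterns and scores quoted in the two examples can be reproduced with a short Python sketch. It rests on several assumptions that are not stated in the text: the leading C and trailing W are dropped before comparison, only HCDR3s of equal length are compared, percentages are taken over the anchor-stripped length, and the similarity classes are limited to the three bracketed classes that happen to appear in the quoted patterns. The names `shared_pattern`, `strip_anchors` and `SIMILARITY_CLASSES` are hypothetical.

```python
# Only the three bracketed classes seen in the quoted patterns; the full
# similarity scheme is not given in the text (assumption).
SIMILARITY_CLASSES = ("AVLI", "KRH", "DE")


def strip_anchors(hcdr3: str) -> str:
    """Drop the leading C and trailing W, which the quoted patterns omit."""
    return hcdr3[1:-1]


def shared_pattern(seq_a: str, seq_b: str) -> tuple[str, float, float]:
    """Return (pattern, % identity, % similarity) for two equal-length HCDR3s."""
    a, b = strip_anchors(seq_a), strip_anchors(seq_b)
    if len(a) != len(b):
        raise ValueError("this sketch only compares HCDR3s of equal length")
    pattern, identical, similar = [], 0, 0
    for x, y in zip(a, b):
        if x == y:
            # Identical residue: kept literally in the pattern
            pattern.append(x)
            identical += 1
            continue
        # Different residues: look for a similarity class containing both
        cls = next((c for c in SIMILARITY_CLASSES if x in c and y in c), None)
        if cls is not None:
            pattern.append(f"[{cls}]")
            similar += 1
        else:
            pattern.append(".")
    n = len(a)
    identity = round(100 * identical / n, 2)
    similarity = round(100 * (identical + similar) / n, 2)
    return "".join(pattern), identity, similarity


first_pair = shared_pattern("CARGPDESGWCGFRYW", "CARGPDISGWNGFEYW")
second_pair = shared_pattern("CARAGEMATVFGRGAFDIW", "CARAGEMATLMGLGAFDIW")
```

Under these assumptions, `first_pair` gives the pattern ARGPD.SGW.GF.Y at 78.57% identity and `second_pair` gives ARAGEMAT[AVLI].G.GAFDI at 82.35% identity and 88.24% similarity, as quoted above. For DK76 and FRA-069 the sketch keeps a leading wildcard that the quoted pattern omits.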

The second example provides us with the opportunity to make some important remarks regarding the clustering procedure. Level 0 clusters are guaranteed to contain sequences that all share patterns between themselves, but are not guaranteed to contain all samples that display above-threshold identity and similarity between themselves. In the second example above, one could argue that DK76, FRA-069, FRA-340, and NY585 should be in the same level 0 cluster since they are all connected between themselves by the same pattern albeit with a lower score. However, this would lead to a loss of information, in this case the fact that DK76 and FRA-069 are more similar between themselves than across to FRA-340 and NY585, and vice versa. The issue is addressed by higher-level clusters, with the end result that if two sequences are above-threshold identical and similar they are guaranteed to reside within a cluster of some level, i.e. the two sequences could be in different level 0 and level 1 clusters but in the same level 2 cluster.
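How higher-level clusters arise from shared members, as in the first example, can be illustrated with a minimal Python sketch. It models only the shared-member rule (two clusters that share a sample or a sub-cluster are connected one level up); connections made through a shared lower-scoring pattern, as in the second example, are not modeled. The function `next_level` and its running-number IDs are hypothetical and do not reproduce the serial numbers used in the study.

```python
from itertools import combinations


def next_level(clusters: dict[str, frozenset[str]], level: int) -> dict[str, frozenset[str]]:
    """Connect every pair of clusters that share a member into a new cluster."""
    linked: dict[str, frozenset[str]] = {}
    for (id_a, members_a), (id_b, members_b) in combinations(clusters.items(), 2):
        if members_a & members_b:
            # Hypothetical ID: level, then a sequential four-digit number
            linked[f"{level}-{len(linked):04d}"] = frozenset({id_a, id_b})
    return linked


# Level 0 clusters of the first example, with their member samples
level0 = {
    "0-0168": frozenset({"NY278", "P2959"}),
    "0-0247": frozenset({"P2959", "F73"}),
    "0-0250": frozenset({"F73", "N2518"}),
}
level1 = next_level(level0, 1)  # clusters sharing P2959 or F73
level2 = next_level(level1, 2)  # level 1 clusters sharing 0-0247
```

In this sketch the three level 0 clusters yield two level 1 clusters and a single level 2 cluster, mirroring the formation of 1-0107, 1-0110 and 2-0020 described above.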

Supplemental Figure 6. Major high level clusters (HCDR3 archetypes) in CLL. The screenshot is taken from within Biolayout3D. The grey spherical nodes can represent either sequences or clusters; the connecting lines are coloured according to the score of the connection, ranging from blue (low score) to red (high score). Each autonomous cluster of sequences is annotated with the cluster ID, its size (in number of sequences), and its major IGHV genes and their frequencies. The ID (e.g. 3-0002) consists of the level of the hierarchy that the cluster represents (e.g. 3) and a sequential four-digit number (e.g. 0002). Note that clusters 3-0003 and 3-0004 are connected to form the only level 4 cluster in the hierarchy (not shown).

Supplemental Figure 7. Breakdown of sequences in level 3 clusters with regard to HCDR3 length.

(i) 3-0000: 20 amino acids (aa) = 1 case; 21 aa = 13 cases; 22 aa = 17 cases; 23 aa = 29 cases; 24 aa = 34 cases; 25 aa = 9 cases; 26 aa = 3 cases

(ii) 3-0001: 9 aa = 82 cases

(iii) 3-0002: 13 aa = 144 cases; 14 aa = 29 cases

(iv) 3-0003: 20 aa = 38 cases; 21 aa = 16 cases

(v) 3-0004: 20 aa = 2 cases; 21 aa = 22 cases; 22 aa = 37 cases

(vi) 3-0005: 20 aa = 88 cases

Supplemental Figure 8. Level 0 clusters in all sequences (CLL+other entities). Stacked percentage distribution of level 0 cluster sizes from the group of all sequences, divided by cluster specificity. The “specificity” of a cluster is determined by the number of its sequences that belong to the CLL group or the non-CLL group. If all sequences are non-CLL, the cluster is considered non-CLL unique; if the majority of the sequences are non-CLL, the cluster is non-CLL biased; and if the numbers of non-CLL and CLL sequences are equal, the cluster is considered neutral. If most sequences are CLL, the cluster is CLL biased; finally, the cluster is CLL unique if all the sequences are CLL. Evidently, the majority of small (i.e. two- or three-member) clusters are non-CLL unique or non-CLL biased. From a cluster size of four onwards, the majority of clusters feature CLL sequences. Of note is the large gap up to the cluster of 22 sequences, which contains 21 stereotyped IGHV3-21 sequences from patients with CLL and one non-CLL sequence.
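The specificity rule described in this legend can be written as a small decision function. The following Python sketch is illustrative only; the function name `cluster_specificity` and the label strings are hypothetical choices.

```python
def cluster_specificity(n_cll: int, n_non_cll: int) -> str:
    """Label a cluster by the balance of CLL and non-CLL sequences it contains."""
    if n_cll < 0 or n_non_cll < 0 or n_cll + n_non_cll == 0:
        raise ValueError("a cluster needs at least one sequence and non-negative counts")
    if n_cll == 0:
        return "non-CLL unique"   # all sequences are non-CLL
    if n_non_cll == 0:
        return "CLL unique"       # all sequences are CLL
    if n_cll == n_non_cll:
        return "neutral"          # equal numbers of both
    return "CLL biased" if n_cll > n_non_cll else "non-CLL biased"


# The 22-member cluster mentioned above: 21 CLL sequences and one non-CLL sequence
largest_cluster_label = cluster_specificity(21, 1)
```

Applied to the 22-member cluster, the function returns "CLL biased", since one of its sequences is non-CLL.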

It should be noted that groups of unique public-database sequences with identical or near-identical HCDR3s were often referenced in the same publication and were therefore probably clonally related. In this context, 218 level 0 clusters included at least two sequences from the same publication (480 sequences referenced in 54 publications); 193 of these 218 level 0 clusters consisted exclusively of sequences from the same publication (429 sequences in 50 publications).

Supplemental Figure 9. IGHV repertoires in clustered and non-clustered cases. The distribution of CLL sequences with selected genes along the process of clustering, from the whole cohort (all) to sequences in level 3 clusters. In each pie diagram, the five most frequent genes are highlighted in color; the same color code is used in all diagrams.

Supplemental Figure 10. Effects of clustering on the IGHJ repertoire.

The percentage of CLL sequences with IGHJ4 or IGHJ6 along the process of clustering, from the whole cohort (all) to sequences in level 3 clusters. In general, the IGHJ4 and IGHJ6 genes were the most frequent IGHJ genes (Supplemental Table VI). The contrasting fates of the IGHJ4 and IGHJ6 genes are evident in this graph. More specifically, the frequency of the IGHJ6 gene increased at each successive level of clustering, from 32.8% at the cohort level to 71.6% in level 3 (χ² test: p<0.0001). In contrast, the frequency of the IGHJ4 gene fell from 42.5% at the cohort level to 27.8% in level 3 (χ² test: p<0.0001) (see also Supplemental Table VI).
