Supplemental Online Material for Wasmuth et al.: Integrated bioinformatic and targeted deletion analyses of the SRS gene superfamily identify SRS29C as a negative regulator of Toxoplasma virulence.

The structural core of the SRS domain is stabilized by four invariant cysteines

To highlight the key differences between the SRS subfamilies, we constructed a structure based alignment for each SRS domain subfamily, which were then combined to provide a refined alignment for the SRS superfamily. Of the six characteristic cysteines (cys-1, cys-2 ... cys-6), four are invariant: cys-1, cys-2, cys-5 and cys-6 (Figure S1). In families one and two, the canonical cys-3 is absent, but studies of the SRS28 (sporoSAG) structure showed that a spatially conserved disulphide bond forms between cys-4 and a subfamily specific cysteine that is present just upstream of cys-5 (Crawford et al., 2010), indicating that the relative position of the cysteine in the primary sequence does not necessarily predict the tertiary structure’s topology. More striking are families that are missing one or both of the middle cysteines with no apparent compensatory substitution elsewhere; for example families three, four and six. Taken with studies on the three dimensional structures of SRS proteins, four invariant cysteines stabilise the core of the domain whereas some subfamilies possess two other cysteines that form a peripheral disulphide bond, whose importance likely correlates with function and depends upon several factors including the length of the beta strands and the inter strand loops and topology. Despite intra-subfamily pair wise identity as low as 23%, the majority of the conservation can be mapped to the proposed β-strands structures.

Residues implicated in ligand binding display greatest sequence diversity

To investigate whether the relative position of the SRS domains in the two-domain proteins exhibit differences in conservation, we generated separate alignments for the D1 (solvent-exposed) and D2 (membrane proximal) domains from all two domain T. gondii Me49 SRS proteins and calculated conservation scores for each residue. Of the 122 total residues in the D1 domain, 33 (27%) are categorized as significantly conserved with 31 (25%) and 59 (48%) moderately and poorly conserved, respectively. Residue conservation scores were mapped onto the surfaces of SRS29B D1 and D2 (Figure 2A-C). The conservation scores across the D2 domains show a slightly different pattern with a more even distribution of conservation. Interestingly, the membrane distal end of the D1 subunit is poorly conserved, while the membrane proximal end of the D2 subunit appears highly conserved. This sequence variation in the D1 subunit likely reflects the role these residues play in generating antigenic variability or for binding a variety of different host targets. The interface between the two domains exhibit similar patterns of conservation, reflecting the complementary surfaces required to stabilize this interface. Beyond core structural elements of the SRS fold, a notably conserved feature in D1 is a pair of lysines, Lys33 and Lys94.Since these lysines are not likely serving as structural elements, the high degree of conservation suggests a potential functional role. Structural mapping reveals that Lys33 and Lys94 form a basic groove at the interface of the SAG1 dimer, and are predicted to serve as potential binding sites for negatively charged biomolecules such as a sulfated proteoglycan on the surface of host cells (He et al., 2002). In fact, the more polar surfaces tend to overlay with the moderately to poorly conserved regions, consistent with antigenic drift operating in regions under potential immune selection pressures. No equivalent surface was identified on D2. It is likely that the D2 domain serves as the foundation to stabilize and orient the D1 domain to promote multimerization.

To assess the potential for structural consequences of key residues identified as divergent in at least one of the eight newly defined SRS subfamilies, we mapped divergent residues onto the D1 domain of SRS29B and color coded them to represent separated regions in the primary sequence(Figure 2D). It is clear that divergent individual or clusters of residues are distal to the SRS29B D1 dimer interface. A side view of the SRS29B dimer (Figure 2D) shows the subfamily defining residues clustered at the tip of the D1 domain that contacts the D2 domain and largely localizes to the inter strand loops peripherally associated with the b sandwich core. A zoomed in view reveals the degree to which the subfamily defining residues are spatially associated despite adopting distal positions in the sequence alignment. The most noteworthy divergence is the substitution of a cysteine at position 54 with a threonine or valine/phenylalanine in families one and three, respectively and substitution of Cys62 with a large polar residue (Asp, Glu, Lys) in subfamily three (cysteine pair is shown in yellow in Figure 2). This observation is consistent with our analysis above that shows lower conservation scores for these cysteines.

Expansion of SRS gene superfamily includes whole locus duplications and convergent evolution

The sequence of the SRS proteins is highly diverse, indeed outside the core of four or six cysteine residues, there is little that is conserved across all the domains. This makes phylogenetic reconstruction extremely difficult, as few branch points could be robustly supported through a bootstrap analysis. Therefore, we decided to cluster the domain sequences based on pairwise alignment scores. The resultant eight domain subfamilies present an excellent opportunity to study the SRS superfamily. An important caveat is that intra-subfamily relations must remain ambiguous. Within each multi-gene locus, we note that member genes have similar architectures of domain families, have similar gene structures (number of introns) and, with the exception of the SRS38 locus, are transcribed in the same direction (Figure 1B). These strong associations between domain families and the small number of observed domain architectures (only nine of a potential 36 non-directional two-domain combinations) implies that exon-switching is not a significant mechanism for increasing protein divergence. Taken together, these data suggest that these loci arose from tandem gene duplication. We focused upon the most abundant two-domain proteins to understand the mechanism(s) that underlie the spread of the family throughout the genome (Figure 4). The first mechanism relies solely upon single gene duplications, some resulting in tandemly arrayed genes and others colonising new parts of the genome. The second is that a single multigene locus arose and is duplicated as a whole through interchromosomal exchanges. While the second mechanism is the more parsimonious one, our phylogenetic reconstruction of two-domain proteins containing domain families seven and eight, points to a mixture of these mechanisms (Figure 4). The situation is further complicated by the phenomenon of convergent evolution (reviewed in (Liao, 1999); intralocus exchange (ectopic recombination) within a locus will homogenise the member genes. It is plausible that the monophyletic groups for SRS12,19, 36 and 38 were the result of whole locus duplications followed by ectopic recombination. However, the difference in locus size and pseudogenes prevents us from precluding locus-specific single gene duplications. Furthermore, the SRS15, 16 and 40 loci may be recently diverged from their common ancestor and their genes are concurrently being purified. The forthcoming genome sequences of the related coccidian parasites, Neospora caninum and Sarcocystis neurona, along with more strains of T. gondii, should facilitate construction of more informative phylogenies to propose accurate models for SRS gene expansion. The addition of SRS genes from strains GT1 and VEG showed that the syntenic genes were monophyletic (data not shown), suggesting that there has been insufficient time since their most recent common ancestor for strain-specific convergence.

Detailed Materials and Methods

Sequence datasets

The genome, protein and CDS sequences for the ME49, GT1 and VEG strains of Toxoplasma gondii were downloaded from ToxoDB (version 5.3 (Tg)) (Gajria et al., 2008) and the expressed sequence tag (EST) sequences and their associated library information for T. gondii, N. caninum and S. neurona were downloaded from dbEST (Boguski et al., 1993).

SRS domain family searches

The seed model for the Pfam (version 23) domain PF04092 was searched against the protein sequences for T. gondii (Finn et al., 2008). The gathering score (GA) cut off was used. The CD-Hit program was used to build a representative set of domain sequences (95% identity) (Li and Godzik, 2006). All-against-all global alignments were produced using the Needleman-Wunsch algorithm and the bit score used to hierarchically cluster the domains (metric: Kendall’s tau; method: Complete linkage) (de Hoon et al., 2004).

The eight SRS subfamilies were aligned with PROMALS which combines sequence and structural information (Pei et al., 2008). The alignment was seeded with the structures of SRS29B (1KZQ), SRS28 (2X28) and BSR4 (2JKS). From these alignments, hidden Markov models were built using HMMER (version 3.0b3 - http://hmmer.janelia.org/) (Eddy, 1998). Searches, with hmmsearch, were performed against the protein sets for Toxoplasma, as well as six-frame translations (transeq (Olson, 2002)) of the Toxoplasma genome. Manual inspection of the results, coupled with our understanding of the structural determinates of an SRS domain, helped define three rules to aid in domain identification. Namely, a sequence was accepted as an SRS domain if it aligned with one of the domain models with an Evalue < e-5, was at least 90 amino acids in length and contained at least four cysteine residues. Aligned regions with fewer than four cysteines, were designated as ‘degraded’ domains.

Pseudogenes were defined as SRS domains that satisfied our search criteria but were not part of a ToxoDB gene model. If two domains were separated by less than 1000bp we considered the domains to be part of a multidomain pseudogene separated by an intron. We noted that stop codons were present in 11% of the pseudogenes identified; the lack of a gene model for the remaining 89% is likely to be due to a stop codon elsewhere in the sequence, but outside the SRS domains.

Genome coverage

Sequences for the B1 (accession: AF179871) and rDNA (accession L25635) loci were downloaded from Genbank. Trace sequences for T. gondii Me49 were downloaded from the NCBI trace archive (ftp://ftp.ncbi.nih.gov/pub/TraceDB/). BLAT was used to align the traces to the assembled genome sequence (Kent, 2002) (Kent 2002) and genomeCoverageBed calculated the coverage for each nucleotide (Quinlan and Hall, 2010).

Phylogenetic reconstruction

The two domain proteins containing SRS domains from families 7 and 8 were aligned, with SRS37A and SRS37B providing an outgroup. The evolutionary history was inferred using the Minimum Evolution (ME) and Maximium Likelihoood (ML) methods (Felsenstein, 1981). For ME, evolutionary distances were calculated using the DayHoff (Schwarz and Dayhoff, 1979) and JTT matrices (Jones et al., 1992), and for ML the JTT matrix was used. The ME trees were generated with MEGA4 (Tamura et al. 2007) and the ML tree with Phylip (Felsenstein, 2005).

Inter-strain and inter-species comparisons

For the three Toxoplasma strains, comparisons were done at the protein level and between individual domains. Sequences were clustered to identify orthologus and paralogous relationships (Alexeyenko et al. 2006). Throughout the manuscript, the following terminology was used: “genes” refers to individual genetic loci; “orthologous group” refers to a set of syntenic genes shared between the three strains; and “loci” was used to define either one gene or a cluster of paralogous genes that group together in a distinct chromosomal region.

Sequence conservation

The individual family alignments were then aligned together. A superfamily-based HMMlogo was generated and deviations from this in each subfamily were visualised using HMMlogos (Beitz, 2006). The conservation at each position was calculated for D1 and D2 domains (Pei and Grishin, 2001).

Expression profiling

ESTs were mapped onto the SRS gene models using BLASTX (Altschul et al., 1997). The microarray data was downloaded from ToxoDB (Gajria et al., 2008). To detect stage-specific expression, we used an arbitrary cut-off of +/- two-fold divergence. For the three-way inter-Type comparisons, the same cut-off was used, relative to the median fold difference.

Polymorphism analyses

Strain type-specific single nucleotide (SNP) polymorphisms were downloaded for all SRS gene model IDs from ToxoDB (version 6.0) and the ratio of synonomous to non-synonomous changes calculated. In addition, protein-based multiple sequence alignments were generated for orthologous groups found in all three strains (Edgar, 2004). The coding sequences were then mapped onto the alignment, to maintain codon integrity. Pairwise distances were calculated using dnadist (Felsenstein, 2005) and a threshold of 0.4% used to delineate alleles (Grigg et al., 2001). To estimate w (dN/dS ratio), the program Slr was used (Massingham and Goldman, 2005); w>1 is positive selection and w <0.5 is purifying.

Plasmid constructs and parasite strains

The SRS29C gene was amplified from Type I RH and Type II 76K strain genomic DNA using the sense primer, 5’-CGGGATCCGCCGTACAGATACGAGC-3’ and the antisense primer, 5’-AAAACTGCAGCACTCCCTCCACCCCCCTG-3’. The resulting PCR fragments were cloned into pTOPO. The SRS29C-I and SRS29C-II pTOPO plasmids contain the extracellular domain of the coding sequence flanked by 1.0kb of 5’ sequence corresponding to the SRS29C promoter, un-translated regions, the coding sequence, and 1.0kb of 3’ un-translated region. These plasmids were transfected into the parental parasite RH∆ strain for this study (WT). WT is Type I RH strain whose hypoxanthine-xanthine-guanine phosphoribosyl transferase (HXGPRT) has been disrupted. The pHXGPRT plasmid contains the HXGPRT open reading frame flanked with the regulatory regions isolated from the Di-Hydro Folate Reductase (DHFR) gene. The pGFP-LUC plasmid contains separate cassettes for GFP (driven by the GRA4 promoter) and LUC (driven by the DHFR promoter)

Flow Cytometry

Parasites were fixed with 2% formaldehyde in PBS for 2 min, blocked with 1%BSA in PBS and stained with anti-SRS29C antibody (1:1000 monoclonal and polyclonal) anti-SRS29B DG52 antibody (1:2000 monoclonal), anti-SRS29A polyclonal (1:500), anti-SRS34A 5A6 monoclonal (1:600), and anti-SRS57 IF12 monoclonal (1:2000). Following three washes with 1% BSA in PBS, parasites were incubated with fluorescent (Phycoerythrin or FITC, Sigma) labeled secondary antibody. After four washes with PBS, the stained parasites were analysed for surface protein expression by Flow cytometry (FACS) using a FACSCalibur (BD Biosciences).

Mouse infection

7-10 week old female CD1 outbred mice were infected by intraperitoneal injection with a total of 50 parasites diluted in 400 µl of PBS.

Quantitative RT-PCR

Total RNA from 0.5-1x107 tachyzoites was isolated using RNeasy mini kit (Qiagen), according to the manufacturer’s instructions. 1 to 2 µg of total RNA was used in a cDNA synthesis reaction using random primers according to manufacturer’s instructions (Invitrogen). After DNase treatment using RNase-Free DNase kit (Qiagen) RNA concentration was determined by absorbance at 260 nm using NanoDrop 1000 Spectrophotometer V3.7 (Thermo Scientific). Additionally RNA integrity was electrophoretically verified by GelRed nucleic acid staining (Biotium). Total RNA (2 µg) from each strain was reverse transcribed with 200 units of reverse transcriptase using 250 ng of random primers (SuperScript II reverse transcriptase (RT) kit, Invitrogen) according to the manufacturer’s instructions.