2.1. Selection of the host genes

The host genes were selected on the basis of data mining and a literature search. The initial pool of disease-related genes was generated through extensive text mining. The most important collection of scientific publications is PubMed. Another online data-mining tool used for the literature search was HighWire Topicmap (Supplementary Figure 1).

A wide range of text-mining tools is available to explore hidden relationships among biological entities. AliBaba is a tool that extracts information from PubMed-indexed literature. AliBaba represents the query results as a network in which the key terms associated with the query term are the nodes. Edges in the network represent either a co-occurrence or a pattern match.

To validate the text-mined results, a microarray cluster analysis was performed. The microarray data were downloaded from the Gene Expression Omnibus (GEO) database; the GEO accession taken for further analysis was GSE1880. The dataset included expression profiles of B lymphoid cell lines overexpressing the latency-associated nuclear antigen (LANA) of Kaposi's sarcoma-associated herpesvirus (KSHV). B cells were made to express LANA from a tetracycline-inducible promoter, and expression was examined at various time points up to 48 hours after induction (Columns 1-3). Gene expression profiles of three primary effusion lymphoma (PEL) cell lines were compared with those of three Burkitt's lymphoma lines to identify genes whose expression changed under KSHV latent infection (Columns 4-6). Gene expression profiles at two time points in telomerase-immortalized endothelial (TIVE) cells after KSHV infection were compared with those of uninfected TIVE cells to identify genes whose expression changed under KSHV lytic infection (Columns 7-8). Gene expression profiles at four time points after induction of recombinant LANA expression were compared with those of non-induced BJAB/Tet-On/LANA cells to identify genes whose expression changed under LANA expression (Columns 9-12). Likewise, profiles at three time points after induction were compared with those of the non-induced Jurkat/Tet-On/LANA cell line (Columns 13-15), and profiles at two time points after induction were compared with those of the non-induced 293/Tet-On/LANA cell line (Columns 16-17). The mean-centered log expression data were collected from the NCBI database.
The data were reformatted as an input file for the Gene Cluster 3.0 clustering tool. First, the data were log transformed. Subsequently, hierarchical clustering was performed using complete linkage with uncentered correlation as the similarity metric. A 4×4 SOM (self-organizing map) with 100,000 iterations was used for the cluster analysis. Finally, a PCA (principal component analysis) was performed for the genes. The microarray data fell into seven functional clusters, which are marked with arrows followed by the corresponding cluster number (Figure 2).
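The clustering step can be illustrated with a minimal sketch, assuming that the distance between two profiles is 1 minus their uncentered correlation and that clusters are merged by complete linkage. The gene names and values below are hypothetical, and the code is not the Gene Cluster 3.0 implementation.

```python
import math

def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering of named profiles into n_clusters groups."""
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        # Complete linkage: cluster distance is the largest pairwise distance.
        return max(
            1.0 - uncentered_correlation(profiles[a], profiles[b])
            for a in c1 for b in c2
        )

    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        # Merge the two closest clusters.
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical log-transformed expression profiles (three time points each).
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.1, 6.0],
    "geneC": [-1.0, -2.0, -2.9],
    "geneD": [-2.0, -4.0, -6.1],
}
clusters = complete_linkage(profiles, 2)
print(clusters)
```

Because the uncentered correlation ignores the mean offset but keeps the sign, profiles that rise together and profiles that fall together end up in separate clusters here.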

The platform chosen for the dataset was GPL8300. Using the GEO2R tool, the top 250 differentially expressed genes were selected. For the selection, the p values were adjusted using the Benjamini & Hochberg procedure [1]. Post-LANA-expression data at 12, 24, 36 and 48 hours in BJAB, 293 and TIVE cells were compared with the data from the same time points without LANA expression (control). Genes were then chosen from the GEO2R-generated list to generate a heat map, which was used to validate the listed literature-mined genes.
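For reference, the Benjamini & Hochberg adjustment can be sketched as follows. This is a generic illustration of the procedure, not the GEO2R code, and the p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values
adj = benjamini_hochberg(raw)
print(adj)
```

Each raw p value is scaled by m/rank, and the running minimum ensures that a gene with a smaller raw p value never receives a larger adjusted value than a gene ranked below it.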

The functions of the host proteins were searched in several databases, including Swiss-Prot, KEGG, GO, GeneCards and GAD.

2.2. Characterization of the viral ORF

Eighteen previously uncharacterized viral ORFs were characterized using several computational tools. The Gene Ontology (GO) database was used to retrieve probable functional, structural and molecular annotations for the sequences. The sequences were then searched against the InterPro database for potential associations with known proteins on the basis of sequence homology. InterPro analyzes proteins and classifies them functionally into families, predicting domains and important sites. Proteins without homologues in the InterPro scan were subjected to motif-based characterization: sequence motifs present in the protein and nucleotide sequences can associate a protein with novel functional and structural annotations. The MotifFinder tool was used for the motif-wide screening. The 3D structures of the proteins were generated using the I-TASSER online server, which also returns possible binding sites and templates for the structure; these results revealed new information about the binding sites. The template for the 3D structure was also searched using the SWISS-MODEL online server. When the I-TASSER and SWISS-MODEL results were consistent with each other, the template structure was taken into consideration. If the structure-based analysis matched the sequence-based analysis, the characterization was considered conclusive.

However, this characterization process often produced false positives. To rule them out, the analysis was repeated with several tools. In addition, the results of the functional, sequence-based and structural analyses were required to overlap; only then was the functional annotation concluded not to be spurious.
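The overlap criterion can be expressed as a simple set intersection. The sketch below assumes that each analysis yields a set of annotation terms; the term names and function names are hypothetical and only illustrate the decision rule.

```python
def consensus_annotation(functional, sequence_based, structural):
    """Return the annotation terms supported by all three analyses."""
    return set(functional) & set(sequence_based) & set(structural)

def is_conclusive(functional, sequence_based, structural):
    """A characterization counts as conclusive only if some term overlaps."""
    return bool(consensus_annotation(functional, sequence_based, structural))

# Hypothetical annotation terms for one ORF.
functional = {"oxidoreductase activity", "DNA binding"}
sequence_based = {"oxidoreductase activity", "kinase activity"}
structural = {"oxidoreductase activity"}

shared = consensus_annotation(functional, sequence_based, structural)
print(shared, is_conclusive(functional, sequence_based, structural))
```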

SAGE data were also examined for gene selection. SAGE data for Kaposi's sarcoma-associated herpesvirus were collected from the Gene Expression Omnibus (GEO) database under accession number GSM3241. The dataset was placed in gene pool B and compared with gene pool A, which corresponds to normal cell lines from skin and white blood cells. The SAGE Genie SAGE Digital Gene Expression Displayer tool was used to perform the analysis. The SAGE Digital Gene Expression Displayer identifies genes that are expressed at significantly different levels (as defined by the user) in two pools of human libraries, based on SAGE tag analysis. The algorithm takes into account the differences in sample size between pools A and B, which can be large. The user selects a value for statistical significance (P value) and a value for the difference in the level of expression (F value) between the two pools. The results are based on the sequence odds ratio and a measure of significance.

F can be any number greater than or equal to 1; results are reported only if the odds ratio is greater than F or less than 1/F. F was set to 2, the value reported for standard calculations, without modification. Q is another parameter of the SAGE calculation and can take any value from 0 (show only the most significant results) to 1 (show all results). In our analysis Q was kept at 0.1 and was computed using the Benjamini-Hochberg algorithm.
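The F and Q filters can be sketched as follows. The odds ratio here is assumed to be the ratio of pool-size-normalized tag frequencies (pool B over pool A) with nonzero counts in both pools; the exact formula used by the Digital Gene Expression Displayer may differ, and the tag counts are hypothetical.

```python
def odds_ratio(count_a, total_a, count_b, total_b):
    """Tag frequency in pool B relative to pool A, normalized by pool size.

    Assumes nonzero counts in both pools (no pseudocount handling).
    """
    return (count_b / total_b) / (count_a / total_a)

def passes_filter(ratio, q_value, f=2.0, q_cutoff=0.1):
    """Keep a tag if the odds ratio is > F or < 1/F and Q is within the cutoff."""
    return (ratio > f or ratio < 1.0 / f) and q_value <= q_cutoff

# Hypothetical tag counts: 10 of 50,000 tags in pool A, 30 of 50,000 in pool B.
ratio = odds_ratio(10, 50000, 30, 50000)
print(ratio, passes_filter(ratio, q_value=0.05))
```

With F = 2, a tag must be at least twofold more or twofold less frequent in pool B than in pool A to be reported.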

2.4. Establishing interaction network for the viral ORFs

Because interactome data for the uncharacterized viral ORFs were not available in the databases, homologous proteins for which interactome data are available were searched first. The uncharacterized viral protein sequences were searched with PSI-BLAST against the NCBI database using the BLOSUM62 matrix. The gap costs were set to Existence: 11 and Extension: 1, and the word size was set to 3 for the alignment. The maximum number of target sequences was set to 1000. The first non-viral sequence returned in the results was taken for further analysis. The BLAST2SEQ E-value for the PSI-BLAST hits was kept below 0.05, as the sequences were expected to be only distantly related [2].
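The search parameters above map onto the command-line options of the standalone BLAST+ `psiblast` program as sketched below. This is an illustration only: it assumes a local BLAST+ installation rather than the NCBI web interface, and the query file name and database name are hypothetical. The command is only assembled, not executed.

```python
def build_psiblast_command(query_fasta, database="nr"):
    """Assemble a psiblast argument list with the parameters given in the text."""
    return [
        "psiblast",
        "-query", query_fasta,       # hypothetical FASTA file with one viral ORF
        "-db", database,             # assumed database name
        "-matrix", "BLOSUM62",
        "-gapopen", "11",            # gap existence cost
        "-gapextend", "1",           # gap extension cost
        "-word_size", "3",
        "-max_target_seqs", "1000",
    ]

cmd = build_psiblast_command("viral_orf.fasta")
print(" ".join(cmd))
```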

The sequence comparison had two objectives: first, to identify homologues of the viral ORFs, so that interactors of these homologues could be assumed to be interactors of the viral proteins; and second, to gain an idea of the motif responsible for mediating the action. When the function/ontology of an ORF predicted from the motif analysis described in the previous section matched the function/ontology of the homologue, the hit was taken into account. If they differed, the hit was not used for building the interaction network.

To quantify something as vague as function, the first non-viral sequence was further subjected to a BLAST2SEQ analysis, with the viral ORF as the query and the viral ORF homologue as the subject. Although adequate literature exists on distantly related proteins, a control experiment was also run to set the homology threshold. In both the positive and the negative controls, protein sequences were compared using BLAST2SEQ. In the positive control, known viral proteins with bacterial homologues were compared. In the negative control, bacterial proteins were also compared with viral proteins, but here they were paired only on the basis of their broad effector function. For example, if the viral protein catalyzed oxidation-reduction reactions, the randomly selected bacterial protein was also a redox-catalyzing protein.

Thus, two sets of BLAST2SEQ data were generated: one from the control experiments and the other from each viral ORF and its most likely homologue. The percent identity and the length of the matched region were compared between the two sets. When BLAST2SEQ returned a percent identity and a matched length greater than the average values of the negative controls, the hit was taken forward for the interactor search. Otherwise the hit was considered spurious.
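The acceptance rule can be sketched as below, assuming that a hit must exceed the negative-control averages for both percent identity and matched length. The dictionary keys and all numbers are hypothetical.

```python
from statistics import mean

def control_thresholds(negative_controls):
    """Average percent identity and matched length over the negative controls."""
    identities = [c["identity"] for c in negative_controls]
    lengths = [c["length"] for c in negative_controls]
    return mean(identities), mean(lengths)

def accept_hit(hit, negative_controls):
    """Accept a hit only if it beats both negative-control averages."""
    min_identity, min_length = control_thresholds(negative_controls)
    return hit["identity"] > min_identity and hit["length"] > min_length

# Hypothetical BLAST2SEQ results for two negative-control pairs.
negative_controls = [
    {"identity": 22.0, "length": 40},
    {"identity": 26.0, "length": 60},
]
good_hit = {"identity": 30.0, "length": 80}
short_hit = {"identity": 30.0, "length": 45}
print(accept_hit(good_hit, negative_controls), accept_hit(short_hit, negative_controls))
```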

The selected homologues of the viral proteins were searched for probable interaction partners using the STRING 9.0 and BIND databases [3,4]. The initial networks were extended by looking for interaction partners of the interactors of the homologues. Because the purpose of the study was to place the viral ORFs among the host proteins, host homologues of the ORF homologues were searched first. When the search returned a homologue with a bit score above 100, the host protein was taken into consideration. Note that the threshold recommended for selecting homologous proteins in the STRING 9.0 database is a bit score of 60. However, a more stringent bit score was chosen for selecting the host homologues because the query proteins were themselves homologues of the viral proteins. This was done to rule out false positives arising from the search.

STRING 9.0 pulls data on the basis of both physical and logical interactions. Physical interactions come from experimental binding data such as ChIP assays, whereas classic examples of logical interactions are co-expression and gene fusion data. STRING also retrieves interactome information on the basis of homology, gene neighborhood and text mining.

However, when the search for human homologues failed for an ORF homologue, an alternative approach was used: a bacterial homologue of the viral ORF homologue was chosen for which a host homologue exists. In this case too, the threshold was kept above a bit score of 100 (Supplementary Figure 2).
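The two-stage search for host homologues can be sketched as follows. The data structures (dictionaries mapping a protein name to a list of (name, bit score) hits) and all protein names are hypothetical; only the bit-score cutoff of 100 comes from the text.

```python
BIT_SCORE_CUTOFF = 100

def best_hit(hits, cutoff=BIT_SCORE_CUTOFF):
    """Return the highest-scoring (name, bit_score) hit above the cutoff, or None."""
    passing = [h for h in hits if h[1] > cutoff]
    return max(passing, key=lambda h: h[1]) if passing else None

def find_host_homologue(orf_homologue, host_hits, bacterial_hits):
    """Look for a direct host homologue first, then go via a bacterial homologue."""
    direct = best_hit(host_hits.get(orf_homologue, []))
    if direct:
        return direct[0], "direct"
    # Fallback: try bacterial homologues, best-scoring first.
    candidates = sorted(bacterial_hits.get(orf_homologue, []), key=lambda h: -h[1])
    for bacterial_name, score in candidates:
        if score <= BIT_SCORE_CUTOFF:
            continue
        via = best_hit(host_hits.get(bacterial_name, []))
        if via:
            return via[0], "via " + bacterial_name
    return None, "none"

# Hypothetical search results.
host_hits = {"protX": [("HOST1", 80)], "bactY": [("HOST2", 150)]}
bacterial_hits = {"protX": [("bactY", 120), ("bactZ", 90)]}
result = find_host_homologue("protX", host_hits, bacterial_hits)
print(result)
```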

The interaction networks of the homologues were then integrated using the Advanced Network Merge plug-in of the Cytoscape platform. The integration was based on textual resemblance between the nodes.
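A name-based merge of this kind can be illustrated with the sketch below. It assumes that two nodes are the same when their names match after trimming whitespace and ignoring case, which is a simplification and not the matching logic of the Cytoscape plug-in. The edges are hypothetical.

```python
def normalise(name):
    """Canonical node name: surrounding whitespace removed, upper case."""
    return name.strip().upper()

def merge_networks(*networks):
    """Union of undirected edge lists; nodes with matching names are merged."""
    merged = set()
    for edges in networks:
        for a, b in edges:
            # A frozenset makes the edge undirected: (a, b) equals (b, a).
            merged.add(frozenset((normalise(a), normalise(b))))
    return merged

# Hypothetical interaction networks for two homologues.
network_1 = [("TP53", "MDM2")]
network_2 = [("mdm2", "tp53 "), ("MDM2", "CDKN1A")]
merged = merge_networks(network_1, network_2)
print(sorted(sorted(edge) for edge in merged))
```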

2.7. Analysis of the data

To analyze the dynamic expression status of the viral ORFs, the mean-centered log expression data were collected from the UCL Herpes virus database. The data were reformatted as an input file for the Gene Cluster 3.0 clustering tool. No log transformation was required because the data had been deposited in the database already log transformed. Hierarchical clustering was performed using complete linkage with uncentered correlation as the similarity metric. A 4×4 SOM (self-organizing map) with 100,000 iterations was used for the cluster analysis. Finally, a PCA (principal component analysis) was performed. The figure shows that ORF10 and ORF11 are extremely overexpressed upon lytic induction.

To design a gene regulatory circuit linking viral ORFs and host genes, a thorough literature search was performed. The circuit was established on the basis of the functions of viral proteins from Epstein-Barr virus (EBV), Herpes virus 1 and human papillomavirus (HPV) that have sequence motifs similar to those of the uncharacterized ORFs. For text mining, as before, AliBaba and HighWire were used, and annotated databases such as Pfam and InterPro were searched extensively.

The regulatory circuit was established using the text-mining tools mentioned earlier together with motif analysis of the ORFs for viral homologues.

References for Supplementary method:

1. Ferreira, J.A., The Benjamini-Hochberg method in the case of discrete test statistics. Int J Biostat, 2007. 3(1): p. Article 11.

2. Bhagwat, M. and L. Aravind, PSI-BLAST tutorial. Methods Mol Biol, 2007. 395: p. 177-86.

3. Diehn, M., et al., Large-scale identification of secreted and membrane-associated gene products using DNA microarrays. Nat Genet, 2000. 25(1): p. 58-62.

4. Szklarczyk, D., et al., The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res, 2011. 39(Database issue): p. D561-8.