Dresden + NKI

In addition, within the PHAGOSYS project we analyzed data from a genome-wide siRNA screen (performed at MPI-CBG), a Flow Cytometry screen (performed at NKI), a small image screen (performed at NKI), and a large compound screen (performed at LUMC).

We analyzed data from a genome-wide siRNA image screen performed at MPI-CBG. The screen studies the process of endocytosis in HeLa cell lines (data provided by MPI-CBG). The phenotype images produced by silencing each gene were analyzed using MotionTracking (MPI-CBG in-house software) to produce phenotype descriptions. Next, we described each gene by its functional annotation (GO terms) and the phenotype it produces. A total of 20346 genes were analyzed, of which 3678 had no known function. The goal of the study was first to provide knowledge about endocytosis and then to annotate the hypothetical genes with possible functions using the methodology developed in (Kocev 2011).

Next, NKI provided data from two screens concerning MHC Class II AGP. The first screen comprises Flow Cytometry data for 16675 genes in MelJuSo cell lines and focuses on two antibodies, L243 and CerCLIP. Then, 276 candidate genes from this screen (selected using a z-score cut-off) were used in the second screen, in which images of the phenotypes were taken. These images were analyzed using the CellProfiler software. Using (semi)manual clustering on the CellProfiler features, a total of 19 bins were identified (REF), with quite a few genes/images left unassigned. We described each gene by its GO annotations and 13 CellProfiler features.

The goal of the analysis was to merge the three studies (the MPI-CBG genome-wide study, and the NKI Flow Cytometry and phenotype image studies) and to identify false negatives from the Flow Cytometry study. More precisely, we aimed to find genes from the MPI-CBG genome-wide study that were not selected for the NKI imaging study because of their low z-score in the Flow Cytometry study. The aim is to alleviate the problem caused by the hard definition of a hit (typically, a hit has a z-score larger than 2 or smaller than -2). Figure AA depicts the intersection of the genes analyzed in the three studies. We conducted two types of analyses. First, we performed gene set enrichment analysis using SEGS-BioMine (Podpečan et al. 2011). Second, we predicted the possible bin assignment (the bins from the NKI study) of the genes from the MPI-CBG study.
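The hit definition and the search for false-negative candidates can be illustrated with a minimal sketch. It assumes that the Flow Cytometry z-scores are available as a dictionary keyed by gene. The function names, gene identifiers and values below are hypothetical and serve only as an illustration; this is not the pipeline used in the project.

```python
def call_hits(z, cutoff=2.0):
    """Hard hit definition: z-score larger than cutoff or smaller than -cutoff."""
    return {gene for gene, score in z.items() if abs(score) > cutoff}


def false_negative_candidates(z, imaged, other_study):
    """Genes measured by Flow Cytometry and present in the other study
    that were not selected for imaging, strongest |z| first."""
    pool = (set(z) & set(other_study)) - set(imaged)
    return sorted(pool, key=lambda gene: abs(z[gene]), reverse=True)


# Toy data (hypothetical gene names and z-scores).
flow_z = {"G1": 3.1, "G2": -2.4, "G3": 1.8, "G4": -1.2, "G5": 0.3}
imaged_genes = call_hits(flow_z)       # genes passed on to the imaging screen
genome_wide_genes = {"G1", "G3", "G4"} # genes also present in the genome-wide study

candidates = false_negative_candidates(flow_z, imaged_genes, genome_wide_genes)
print(candidates)
```

Genes just below the cut-off (here G3) appear at the top of the candidate list, which is where false negatives of a hard threshold are most likely to be found.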

In the first analysis, we discovered the gene functions that were most prevalent among the highly ranked genes from the Flow Cytometry study, separately for the two antibodies. For the L243 antibody, the most prominent are the genes involved in mRNA metabolic process (GO:0016071), the genes that interact with genes active in processes in endocytic vesicle membrane (GO:0030666) and the genes that interact with genes involved in the PPAR signaling pathway (KEGG:03320). For the CerCLIP antibody, the most prominent are the genes involved in cellular macromolecule metabolic process (GO:0044260), nucleic acid metabolic process (GO:0090304), intracellular membrane-bounded organelle (GO:0043231), the spliceosome (KEGG:03040), the genes that interact with genes involved in nuclear mRNA splicing, via spliceosome (GO:0000398), and the genes that interact with genes active in processes in nuclear body (GO:0016604).

Figure AA. Distribution of genes from the three studies.

The second analysis investigated the possibility of merging these studies. The idea is that the analysis performed in the genome-wide study at MPI-CBG could help identify genes that were found to be negative in the Flow Cytometry study. We performed two rounds of experiments. In the first round, we predicted the phenotype bins from the CellProfiler features (NKI screen), the Flow Cytometry scores and the GO annotations. The results show that we can predict the bins with limited success. In the second round of experiments, we predicted the phenotype bins from the MotionTracking features (MPI-CBG screen), the Flow Cytometry scores and the GO annotations. The results, however, show that the predictive performance is quite low. This means that the information about the gene phenotypes from the MPI-CBG study cannot be connected directly and straightforwardly to the phenotype bins defined in the NKI phenotype study. Several factors contribute to this outcome: different cell lines, different mechanisms studied, and different image description techniques. We believe that detecting the false negatives will be more successful if a smaller subset of genes from the NKI study is selected first, and the closest matches from the MPI-CBG study are then calculated. This collaboration is still ongoing and we hope to obtain publishable results soon.

LUMC

We analyzed data from a compound screen performed at LUMC. More specifically, LUMC conducted two compound studies using the LOPAC library that measure the (reduction in) bacterial load with Flow Cytometry (FACS). In the first study, MelJuSo cells were infected with M. tuberculosis (Mtb) at MOI 10, while in the second study, HeLa cells were infected with S. typhimurium (Stm) at MOI 10. The main goal of the analysis is to elucidate proteins that contribute to the reduction/increase of the bacterial load, i.e., proteins that are involved in processes that kill or promote the bacteria in the host cells.

The analysis consists of two parts: data preprocessing and elucidating proteins with machine learning. In the first part, we search for descriptors of the compounds, which are then used in the second part to search for possible target proteins.

The LOPAC library consists of 1260 compounds. We first linked each of the compounds to the PubChem database at NCBI. Then, we extracted the proteins that were targeted by each compound in a human study. We considered only studies where the compound was found to be active. In total, 711 protein targets were identified. We obtained three different descriptions of the dataset. First, we described 964 compounds through their respective protein targets (no protein targets were found for the remaining 296 compounds). Second, we described the compounds with the GO terms that the respective protein targets were annotated with. Third, we described the compounds with the GO terms of the target proteins and included gene-to-gene interaction information via the functions (terms) annotated with the interacting proteins. We then performed analyses on these representations of the data.
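The three data representations can be sketched as follows. This is a minimal illustration under our own assumptions: the compound-to-target, target-to-GO and interaction mappings are taken to be plain dictionaries, terms inherited from interacting proteins are marked with a hypothetical "int:" prefix, and all identifiers and the assignment of GO terms to proteins in the toy data are invented for the example.

```python
def describe_compounds(compound_targets, target_go, interactions):
    """Build the three compound descriptions: by protein target, by GO terms
    of the targets, and by GO terms extended with interaction information."""
    by_target, by_go, by_go_net = {}, {}, {}
    for compound, targets in compound_targets.items():
        if not targets:
            continue  # compounds without known protein targets are left out
        by_target[compound] = set(targets)
        terms = set()
        for target in targets:
            terms |= target_go.get(target, set())
        by_go[compound] = terms
        extended = set(terms)
        for target in targets:
            for partner in interactions.get(target, set()):
                # Terms of interacting proteins are kept apart via a prefix.
                extended |= {"int:" + t for t in target_go.get(partner, set())}
        by_go_net[compound] = extended
    return by_target, by_go, by_go_net


# Toy data (hypothetical compounds, proteins and annotations).
compound_targets = {"cmpd_1": {"P1"}, "cmpd_2": {"P1", "P2"}, "cmpd_3": set()}
target_go = {"P1": {"GO:0005769"}, "P2": {"GO:0001891"}, "P3": {"GO:0030433"}}
interactions = {"P2": {"P3"}}

by_target, by_go, by_go_net = describe_compounds(
    compound_targets, target_go, interactions
)
print(by_go_net["cmpd_2"])
```

Each of the three dictionaries can then be turned into a binary compound-by-descriptor table for the learning algorithms.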

We performed three different analyses of the data at hand. First, we conducted a discriminative analysis of the targets based on the hit compounds. Second, we performed analysis using predictive clustering trees trying to predict the z-score for reduction of the bacterial load using the three data representations. Third, we performed feature ranking using the three data representations.

The set of 711 target proteins is large and includes proteins that are targeted by a large variety of compounds. Typically, these proteins are not specific to, or closely relevant for, the Mtb and Stm measurements. Of more interest are the protein targets that were found active in studies involving hit compounds (compounds with z-score larger than 2 or smaller than -2) and not active for the remainder of the compounds. To this end, we pruned the set of proteins using the following rule: a protein target is removed from the set of protein targets if fewer than 90% of the compounds that target it are identified as hits. With this rule, we obtained three separate lists of compounds and the respective protein targets. The first list gives 60 protein targets that are obtained when the hit compounds are defined using both the Mtb and Stm z-score values. The second list gives 58 protein targets that are relevant for the hit compounds in the Mtb study. The third list gives 15 protein targets relevant for the hit compounds in the Stm study. For these gene lists we obtained three gene networks using STRING. Figure BB depicts the gene network for the first gene list. In this network, we can note three parts. The largest part is positioned in the center of the network and includes different kinases. The second part is on the right-hand side of the network and includes a group of somatostatin receptors. The third part is positioned on the left-hand side of the network and includes cell division cycle homologs, which are connected to the largest part of the network through topoisomerases.

Figure BB. Gene network obtained using STRING on the gene list with hits for both Mtb and Stm.
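The pruning of the protein targets can be sketched in a few lines. The sketch assumes that the intent of the rule is to keep only those targets for which at least 90% of the targeting compounds are hits. The function name, compound names and z-scores are hypothetical.

```python
def prune_targets(compound_targets, z, min_hit_fraction=0.9, cutoff=2.0):
    """Keep a protein target only if at least `min_hit_fraction` of the
    compounds that target it are hits (|z-score| > cutoff)."""
    targeting = {}
    for compound, targets in compound_targets.items():
        for target in targets:
            targeting.setdefault(target, []).append(compound)

    kept = {}
    for target, compounds in targeting.items():
        n_hits = sum(1 for c in compounds if abs(z[c]) > cutoff)
        fraction = n_hits / len(compounds)
        if fraction >= min_hit_fraction:
            kept[target] = fraction
    return kept


# Toy data (hypothetical compounds, targets and z-scores).
compound_targets = {"c1": {"T1", "T2"}, "c2": {"T1"}, "c3": {"T2", "T3"}}
z_mtb = {"c1": 2.6, "c2": -3.0, "c3": 0.4}

kept_targets = prune_targets(compound_targets, z_mtb)
print(kept_targets)
```

In the toy data only T1 survives: both compounds that target it are hits, whereas T2 and T3 are also (or only) targeted by the non-hit compound c3.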

The second analysis involved the construction of predictive clustering trees on the three data representations for each of the bacteria. This analysis revealed which protein targets, which functions of the protein targets (through the respective GO terms) and which gene interactions are most responsible for the decrease or increase of the bacterial load. In Figure CC, we show the predictive clustering tree for predicting the increase/decrease of bacterial load for Mtb, in which each tree leaf gives the compounds that belong to it. This tree selects the following GO terms/functions as most important: negative regulation of type 2 immune response (GO:0002829), defense response to virus (GO:0051607), early endosome (GO:0005769), myeloid cell activation involved in immune response (GO:0002275), phagocytic cup (GO:0001891), ER-associated protein catabolic process (GO:0030433), circulatory system process (GO:0003013) and antigen processing and presentation of exogenous peptide antigen via MHC class I, TAP-dependent (GO:0002479). All these functions are closely related to the process of Mtb infection.

Figure CC. The predictive clustering tree for predicting the increase/decrease of bacterial load for Mtb. Each tree leaf gives the compounds that belong to it. This tree is obtained using the data where the compounds are described with the GO terms that the respective protein targets were annotated with.
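To illustrate how such a tree selects GO terms, the following minimal sketch shows a single split-selection step based on variance reduction of the z-score, which is the usual heuristic in top-down induction of predictive clustering trees. It is a simplified illustration and not the software used to build the tree in Figure CC. The function name and the toy data are hypothetical.

```python
from statistics import pvariance


def best_split(rows, targets):
    """Return (GO term, variance reduction) for the binary test
    'compound is annotated with term' that reduces z-score variance most."""
    n = len(targets)
    total = pvariance(targets)
    best_term, best_gain = None, 0.0
    for term in sorted(set().union(*rows)):
        yes = [y for row, y in zip(rows, targets) if term in row]
        no = [y for row, y in zip(rows, targets) if term not in row]
        if not yes or not no:
            continue  # the test does not separate the compounds
        within = (len(yes) * pvariance(yes) + len(no) * pvariance(no)) / n
        gain = total - within
        if gain > best_gain:
            best_term, best_gain = term, gain
    return best_term, best_gain


# Toy data: GO terms per compound and the corresponding z-scores (hypothetical).
rows = [{"GO:0005769", "GO:0001891"}, {"GO:0005769"}, {"GO:0001891"}, set()]
z_scores = [-2.5, -2.1, 0.2, 0.4]

term, gain = best_split(rows, z_scores)
print(term, round(gain, 4))
```

A full tree is obtained by applying this step recursively to the two resulting groups of compounds until a stopping criterion is met; the leaves then contain compounds with similar z-scores.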

For the third analysis, we performed feature ranking on the three data representations. We used RReliefF as the feature ranking algorithm (Robnik-Šikonja and Kononenko 2003). The aim is to obtain an ordered list of genes or gene functions that are most relevant for the decrease/increase of the bacterial load. Using the top 40 ranked genes, we then constructed a gene network using STRING. The two networks, for Mtb and Stm, are shown in Figures DD and EE.

Figure DD. Gene network obtained using STRING on the gene list obtained from feature ranking on the Mtb data.

Figure EE. Gene network obtained using STRING on the gene list obtained from feature ranking on the Stm data.
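The feature ranking step can be illustrated with a simplified sketch of RReliefF. The sketch makes several assumptions that differ from the full algorithm of Robnik-Šikonja and Kononenko: every instance is used once instead of a random sample, the k nearest neighbours receive equal weights instead of rank-based weights, and distances are sums of range-normalized differences. The function names and the toy data are hypothetical.

```python
def rrelieff(X, y, k=4):
    """Simplified RReliefF: one weight per feature for a numeric target.
    Higher weights indicate features that are more relevant for the target."""
    n, p = len(X), len(X[0])

    def span(values):
        r = max(values) - min(values)
        return r if r > 0 else 1.0  # avoid division by zero for constant columns

    x_span = [span([row[a] for row in X]) for a in range(p)]
    y_span = span(y)

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / x_span[a]

    n_dc = 0.0              # accumulated difference in the target
    n_da = [0.0] * p        # accumulated difference in each feature
    n_dcda = [0.0] * p      # accumulated joint difference
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sum(diff(a, i, j) for a in range(p)))
        neighbours = others[:k]
        w = 1.0 / len(neighbours)  # equal neighbour weights (simplification)
        for j in neighbours:
            d_y = abs(y[i] - y[j]) / y_span
            n_dc += d_y * w
            for a in range(p):
                d_a = diff(a, i, j)
                n_da[a] += d_a * w
                n_dcda[a] += d_y * d_a * w
    if n_dc == 0:
        raise ValueError("target is constant among neighbours; weights undefined")
    return [n_dcda[a] / n_dc - (n_da[a] - n_dcda[a]) / (n - n_dc)
            for a in range(p)]


def top_features(weights, names, n_top=40):
    """Names of the n_top features with the highest weights."""
    order = sorted(range(len(weights)), key=lambda a: weights[a], reverse=True)
    return [names[a] for a in order[:n_top]]


# Toy data: the target depends only on the first feature.
X = [(i, j) for i in range(10) for j in range(10)]
y = [i for i, _ in X]
weights = rrelieff(X, y)
print(top_features(weights, ["informative", "irrelevant"], n_top=1))
```

On the toy data the informative feature receives a positive weight and the irrelevant one a negative weight. In the setting described above, the rows of X would be the compound descriptions and y the z-scores for the reduction of the bacterial load.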

References

(Kocev 2011) Dragi Kocev. Ensembles for predicting structured outputs. PhD Thesis, International Postgraduate School Jožef Stefan, Ljubljana, Slovenia, 2011

(Vens et al. 2008) Celine Vens, Jan Struyf, Leander Schietgat, Sašo Džeroski, and Hendrik Blockeel. Decision trees for hierarchical multi-label classification. Machine Learning 73(2):185-214, 2008

(Podpečan et al. 2011) Vid Podpečan, Nada Lavrač, Igor Mozetič, Petra Kralj Novak, Igor Trajkovski, Laura Langohr, Kimmo Kulovesi, Hannu Toivonen, Marko Petek, Helena Motaln, and Kristina Gruden. SegMine workflows for semantic microarray data analysis in Orange4WS. BMC Bioinformatics 12:416, 2011

(Robnik-Šikonja and Kononenko 2003) Marko Robnik-Šikonja and Igor Kononenko. Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning 53(1-2):23-69, 2003