Data Infrastructure for Chemical Safety

Grant Agreement Number: 283775

Project full title:

Data Infrastructure for Chemical Safety

Project acronym:

diXa

INFRA-2011-1.2.2: Data infrastructures for e-Science

Combination of Collaborative Project and Coordination and Support Action:

Integrated Infrastructure Initiative (I3)

Call Identifier FP7-Infrastructures-2011-2

Proposal RI-283775 diXa

M8.6: PathVisio multi-omics statistics and visualisation plugins released

Due date of deliverable:M30

Actual submission date:

Start date of project: 1st of October, 2011Duration: 36 months

Maastricht University (UM)

Internal review procedure

Milestone produced by: / Date:
Dr. Dennie Hebels (TGX, UM) / Sept 29, 2014
Dr. Lars Eijssen (BiGCaT, UM)
Dr. Egon Willighagen (BiGCaT, UM)
Bart Smeets, MSc (BiGCaT, UM)
Milestone internally reviewed by: / Date:

Objectives

To allow for the integrative analysis of –omics data related to chemical safety, with the use of biological pathway approaches in PathVisio.

Strategy for solving the task

Since in the meantime relevant functionality related to multi-omics analysis and visualisation has already been developed for PathVisio, in this Milestone their possibilities for application are demonstrated on existing chemical safety data, some additional developments done, and needs for further developments identified.

The strategy to demonstrate the possibilities starts with collecting data for compound Troglitazone (FP010SG)from the PredTox project, in which ‘the effects of 16 test compounds were characterized using conventional toxicological parameters and –omicstechnologies’.1 Within PredTox several types of –omics data have been collected, among which transcriptomics (gene expression), metabolomics, and proteomics. For Troglitazone peak identification has been run on the serum metabolomics outcome by diXa partner Genedata AG, rendering it usable for further analytical approaches. For this reason the transcriptomics and metabolomics results of this compound were chosen as case study for this Milestone.

The next step is that the data available from PredTox is to be pre-processed in order to be used for pathway analysis. This means that normalised expression data as available from the diXa warehouse has to be analysed with statistical models tuned for this task in order to obtain estimates of differences in expression between conditions, and their significance. The same holds for the table of metabolite cluster expression level measurements obtained from Genedata AG (Dr. Hans Gmuender).

Thereafter the information has to be merged to be jointly loaded into PathVisio, the tool for pathway analysis and visualisation used in this Milestone.2Then pathway statistics will be run using pathway collections from the WikiPathways portal to find the processes that have an overrepresentation of differentially expressed genes or metabolites, and the combined data will be visualised on the biological pathways.3

As a final step the link between PathVisio and Cytoscape will be explored, to illustrate how pathway diagrams can be imported as networks in Cytoscape, providing access to the many functionalities available within this network analysis tool or its associated apps (plugins).4

The exact steps taken and relevant illustrations of the outcome are described in the results section. Focus within this Milestone is on the procedural and technical aspects, and the biological outcome is presented as an illustration.

Result

Predtox data tables used

First of all, the relevant gene expression data had to be selected from the Predtox data in the diXa warehouse. The transcriptomics data from the liver were taken, since the clearest differences were expected for this organ, and because of its function as a metabolising and detoxification organ, rendering it of interest to link its gene expression tometabolomics measurements. The metabolomics has been measured in two body fluids, serum and urine, of which the serum data are used in this report. In consultation with Genedata AG (Dr. Hans Gmuender) the MAS5 normalised and 2log transformed gene expression data were chosen, as available from (from the archive ‘Liver MAS5 cond log2 transf median norm.7z’).

Processing of the livertranscriptomics data

All pre-processing of the data was done in R, with scripts based on those from ArrayAnalysis.org.5,6The sample names were adapted for the creation of images and further analyses, to get to short and clear names reflecting experimental groups.To correctly associate measurement with experimental groups, the annotation file “Exp annotations.txt” was downloaded in addition to the data table.A group variable (indicating the treatment) and a factorial covariate (indicating the day of the measurement) were created based on these annotations to be used in statistical modelling using the Bioconductor package limma in R.7,8The group variable had levels vehicle, low, and high. The day variable had levels day2, day4, and day15.

After creating the variables needed, some diagnostic plots were made to verify the comparability of the sample signals, illustrated by theboxplot of the signals shown in Figure 1, which shows the comparability between arrays.

Figure 1. Boxplot of signal intensities, coloured by experimental group.

Figure 2. Distribution of signal intensities.

The data was filtered to only include probesets (genes) that had an average expression level of at least 2log(100), based on the distribution of expression values in the set, see Figure 2.

This is done to exclude those genes that are never expressed in any of the samples, in order to avoid false positives due to noise and reduce the number of statistical tests done.

To illustrate the similarities between the arrays, in Figures 3 and 4, a hierarchical clustering and correlation heatmap of all arrays (after excluding the non-expressed genes) are shown.

Figure 3. Hierarchical clustering dendrogram, colour labelled by experimental group. The clustering was made using Pearson correlation distance and Ward clustering.

Figure 4. Correlation heatmap of the arrays, colour labelled by experimental group. The clustering was made using Pearson correlation distance and Ward clustering. The colour gradient on the array-array correlation values ranges from 0.92 (red) to 1 (white).

Both images clearly show that the high dose samples stand apart from the vehicle and low dose samples. Also the images show that within the cluster of vehicle and low dose samples, as well as within the high dose samples, the samples for day 15 are mostly separated from those for days 2 and 4, except for a few samples. Within the high dose group also day 2 and day 4 samples cluster in separate sub-branches. To more clearly show this grouping structure, the clustering diagram of Figure 3 is repeated in Figure 5, but now coloured by measurement day.

Figure 5. Hierarchical clustering dendrogram, colour labelled by measurement day. The clustering was made using Pearson correlation distance and Ward clustering.

To get more insight, PCA plots were created as well, as shown in Figures 6 and 7 on the next page. These show that Principal Component 2 separates the data based on treatment group, and components 3,aided bycomponent 1, separate on the days of measurement.

Figure 6. PCA plot of the arrays, colour labelled by treatment group.

Figure 7. PCA plot of the arrays, colour labelled by measurement day.

After running the quality control and pre-processing script, limma regression modelling was run to compare the several groups of interest. In order to do so, a model was used that included the treatment group factor and the measurement day factor, and allowed for an interaction between the two. Contrasts were computed to compare the treatment groups at each given time point, and the time points for each treatment group. In addition, the changes over time for the high and low treatment doses, as corrected for the changes over time in the vehicle (control) group, were computed. Using scripts again based on those of ArrayAnalysis.org, p-value histograms and tables were computed.

Table 1. Number of genes meeting certain p-value cutoffs (in total, up, and down). The total number of probesets (genes) in the data set after filtering was 16,372.

The table shows that mainly all comparisons (whether between group at a time point or between time points for a certain dose) involving the high dose group show many significantly changed genes. This is consistent with the findings from the images illustrating the similarities between the arrays.

For the over-time effects for high and low dose corrected for vehicle, p-value histograms are shown in Figures 8 and 9. Histograms give a direct visual clue to the overall significance of the results of a comparison. In case of random data, each p-value would be equally likely and the histogram would have bars of equal height. In case of clear difference, low p-values would be overrepresented, in case of no difference, high p-values would be overrepresented.

Figure 8. P-value histograms of low dose comparisons on day 4 versus day 2 (l), day 15 versus day 4 (m), and day 15 versus day 2 (r), all corrected for the vehicle change over time.

Figure 9. P-value histograms of high dose comparisons on day 4 versus day 2 (l), day 15 versus day 4 (m), and day 15 versus day 2 (r), all corrected for the vehicle change over time.

Also these histograms clearly show the big differences over time for the high dose, and the quite moderate changes for the low dose (we cannot say from these figures whether specific changes are present, they only show the overall amount of significantly changed genes).

Based on this it was decided to use the day 15 versus day 2 comparison for the high Troglitazone dose as test case for the further analyses.

Processing of the serum metabolomics data

Also the serum metabolomics data as provided by Genedata AG were processed with an R script. The measurements file was extended with the HMDB identifiers from the peak annotations file for those clusters that had been identified.9 Also, based on the experimental group annotations the measurements were matched to the experimental groups. All values were set to the 2log scale (after adding 1 to all values, to map 0 to 0). To get the same results as for the gene expression data, we focused on the day 15 versus day 2 comparison for the high dose (called day 14 and day 1 in the metabolomics annotations).

The average expression of the metabolites, as well as the change in expression between the time points and the significance of this change were computed using an unpaired t-test with unequal variance.In order to explore the data, the same computations were also done for the vehicle.

Results showed that there were more significantly changed metabolites for the high dose group than for the vehicle group (data not shown). Since the metabolomics data will be more variable due to the smaller amount of data points, and the fact that peaks may not have been detected in all samples, it was decided to not correct the high dose over-time difference for the vehicle over-time difference, as this may explode variability (noise) in the data.

In order to look at consistent changes, also for the gene expression data, the high day 15 versus high day 2 comparison not corrected for vehicle was used, alongside with the same comparison of the metabolomics data.

Before continuing the analyses, all metabolites that were not at least present in threeout of the ten samples (five in both the day 15 high dose and the day 2 high dose groups) were removed, in order to exclude unexpressed metabolites. This non-strict filtering was done as peaks may not always have been detected for all samples, so a too strict cut-off would remove too many metabolites.

Figure 10 below shows the average expression, log fold change and p-value of the filtered metabolomics data set.

Figure 10. The distribution of the average expression values (l), the log fold changes (m), and the p-values (r) for the comparison high dose day 15 versus high dose day 2 for the metabolite clusters in the data set.

The histograms show a clear overrepresentation of significant metabolites, and a wide range of log fold changes. The average expression histogram shows that some lowly expressed metabolites are still included, which links to the prudent filtering related to possibly missed peaks (reducing the average expression of those metabolites).

One point to raise attention to is that a preliminary analysis of the ten most changed identified metabolites (highest absolute fold changes), six were food additives. This clearly shows the effect of food or diet on the outcome of metabolomics experiments and the relevance of carefully tracking diet composition and controlling for the diet. This has also been recognised in the nutrigenomics field.

Preparing the files for PathVisio and merging the transcriptomics and metabolomics data

Before being able to use the statistical results files for PathVisio, some steps need to be taken. First of all, in the gene expression file, some probe sets were annotated with multiple gene identifiers. Since this will not map in PathVisio, extra columns were added with unique gene identifiers, just selecting the first and discarding the others as a best-we-can-do approach without duplicating data rows. This was done for the Gene Symbol, Entrez Gene identifier, and Ensembl identifier.

Secondly, since we had to merge two data types in one file, we needed to add System Code columns for PathVisio, as the identifiers were of different origin now (gene identifiers, e.g. Ensembl, and HMDB metabolite identifiers). The System Code column indicates for each datapoint what type of identifier it has.

Furthermore, using a downloaded translation table from the BioMart instance at Ensembl with known human Ensembl identifiers of rat gene homologues, a column was added containing the human homologue of each measured gene, if known.10 This was done as the human pathway collections are more extensive than those for rat, and having homologues allows running pathway analysis on human collections. For metabolites the identifiers are the same for all species.

After getting both the gene expression and metabolomics tables in the same format, they were merged for upload into PathVisio.

Table 2. Number of measured and identified metabolites that are mapped to rat (l) and human (r) pathway collections. There are 504 identified metabolites measured in the data set, of which all map (their identifiers are found back), but of those only 18 and 97 can be found back in any pathway for rat and human respectively. In rat there are 27 pathways that have at least one identified metabolite mapped to them, and 133 for human. Matches can be direct (labeld by HMDB identifier on the pathway) or indirect, labeled by other identifier systems but mapped back by the PathVisio mapping system.

Rat collection / Human collection
HMDB numbers / 504 / 504
HMDB numbers with mappings / 504 / 504
HMDB numbers matches / 18 / 97
Pathways found / 27 / 133
Matches via HMDB / 14 / 20
Matches via mapping / 4 / 69

Pathway content for PathVisio

In order to explore the coverage of the measured metabolites in the curated pathway collections from WikiPathways, we computed the numbers covered by the rat and human collections. The results are presented in Table 2 and explained in more detail in its caption.

Pathway analysis

The combined transcriptomics and metabolomics statistics table as created, was uploaded into PathVisio. Then several pathway analyses were run.

First of all the table was analysed using the Rattus norvegicus curated analysis collection of WikiPathways pathways as downloaded from the PathVisio website. When performing pathway analysis, pathways are ranked based on the overrepresentation of changed genes and metabolites in the data. This means that a criterion for being changed has to be defined first. This was chosen to be a change in expression of at least 50% and a significant p-value at the 0.05 level.

Also an identifier mapping database needs to be loaded in order to translate identifiers from different systems (e.g. Ensembl and Entrez Gene) into one another. BridgeDB databases can be downloaded from the PathVisio website to perform this task, in this case the Rn_Derby_20130701.bridge and hmdb_metabolites_20131122.bridge files were used.11

Table 3 shows the top ranked pathways for this analysis.It is clear from the pathway ranking that the most changed pathways are related to cell division. Also several processes to metabolism of steroids appear.

Table 3. Top ranked pathways for the high dose day 15 versus high dose day 2 integrated comparison. Positive: the number of genes and metabolites in the pathway that meet the criteria; measured: the total number of genes and metabolites in the pathway that are measured in the data set; total: the total number of genes and metabolites in the pathway; Z score: the score of the overrepresentation test under the hypergeometric distribution (1.96 corresponding to a 0.05 alpha level).

Pathway / positive / Measured / total / Z Score
DNA Replication / 17 / 26 / 41 / 7.99
G1 to S cell cycle control / 18 / 38 / 67 / 6.35
Cell cycle / 19 / 45 / 89 / 5.89
Mismatch repair / 4 / 7 / 10 / 3.47
PKA-HCG-Glycogen Syntase / 8 / 22 / 44 / 3.27
Homologous recombination / 4 / 8 / 14 / 3.11
Statin Pathway / 5 / 12 / 27 / 2.96
Steroid Biosynthesis / 3 / 6 / 12 / 2.70
Nucleotide GPCRs / 3 / 7 / 14 / 2.35
Cytokines and Inflammatory Response (BioCarta) / 4 / 11 / 28 / 2.31
Cholesterol metabolism / 6 / 20 / 40 / 2.27

Table 4. Lowest ranked pathwaysfor the high dose day 15 versus high dose day 2 integrated comparison. The column headers are explained in the caption of Table 3.