A Tutorial on Tissue Heterogeneity Correction by Computational Source Separation
1.Introduction
The computational source separation (BSS) algorithm has been implemented in MATLAB for computational dissection of tissue heterogeneity (CDTH). Gene expression profiles measured by microarrays often represent a composite of more than one cell source because of tissue heterogeneity. When a sample carries signatures from other cell types, such “artifacts” can be misleading (Wang et al., 2003). BSS is a novel computational method that blindly decomposes mixed gene expression profiles into their underlying sources. Figure BSS shows a snapshot of the BSS software for CDTH.
2.A Brief Description of the BSS Method
In the algorithm, we select the final ISGs that maximize the sum of the two correlation coefficients, corr(S1source, S1recovered) + corr(S2source, S2recovered). There are several ways to estimate the model, and most estimation principles and objective functions are equivalent, at least in theory. To evaluate the performance of the algorithm, we use two criteria: (1) the correlation coefficients between the true and the recovered sources and (2) the performance index E1. The performance index is defined as (Hyvärinen, 2001):
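For reference, the index is commonly written in the following form in the independent component analysis literature. The symbol m for the number of sources is an assumption of this sketch, and the expression should be checked against the cited text.

```latex
E_1 = \sum_{i=1}^{m} \left( \sum_{j=1}^{m} \frac{|p_{ij}|}{\max_{k} |p_{ik}|} - 1 \right)
    + \sum_{j=1}^{m} \left( \sum_{i=1}^{m} \frac{|p_{ij}|}{\max_{k} |p_{kj}|} - 1 \right)
```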
where pij is the ij-th element of the matrix P=WA, the product of the estimated demixing matrix W and the actual mixing matrix A. If the source profiles have been separated perfectly, P is a permutation matrix, possibly with rescaled rows. The index E1 attains its minimum value of zero for an ideal permutation matrix.
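The behaviour of the index can be checked numerically. The following is a minimal Python sketch, not the MATLAB implementation: it assumes the commonly used form of E1 (each row and each column of |P| is divided by its largest entry, summed, and reduced by one), and the function name and example matrices are hypothetical.

```python
def performance_index_e1(p):
    """Return E1 for a square matrix p given as a list of rows.

    Assumes no row or column of p is entirely zero.
    """
    m = len(p)
    a = [[abs(v) for v in row] for row in p]
    # Row term: how far each row is from having a single non-zero entry.
    row_term = sum(sum(row) / max(row) - 1.0 for row in a)
    # Column term: the same measure applied to each column.
    cols = [[a[i][j] for i in range(m)] for j in range(m)]
    col_term = sum(sum(col) / max(col) - 1.0 for col in cols)
    return row_term + col_term


# A scaled permutation matrix: perfect separation, so E1 is zero.
perfect = [[0.0, 2.0], [-3.0, 0.0]]
# A nearly diagonal matrix with some leakage between the two sources.
leaky = [[1.0, 0.2], [0.1, 1.0]]

print(performance_index_e1(perfect))
print(performance_index_e1(leaky))
```

The index is insensitive to the order, sign and scale of the recovered sources, which is why a scaled permutation matrix still scores zero.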
3.A Procedural Description of the BSS Method
In this section, we describe a procedure for running the BSS algorithm. The following steps are described and illustrated in detail: (1) preparing data, (2) reading in data, (3) CDTH using BSS, and (4) exporting recovered data to files and plotting result figures.
3.1.Prepare Data
The input data set has two parts: the expression values of the samples and the corresponding mixture labels. Both parts are saved in one tab-delimited text file. All data file names should be lowercase. The probe ID (or image ID) file is also a text file, and its name is composed of the data file’s name and imageid.
The data variable should be arranged with samples of different mixing ratios as columns and dimensions as rows. The label variable should be a row vector with one number per sample. It is suggested to use consecutive numbers starting from 1 as labels. For example, the LCC1 vs. MCF-7 data set includes four samples, so the label vector [1 2 3 4] corresponds to the mixture ratios 1:0, 0:1, 3:1 and 1:3, respectively. Note that the samples of every biological mixing data set should be arranged in the same sequence as in this example.
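The layout described above can be illustrated with a small parser. This is a sketch in Python rather than the MATLAB reader used by the software. It assumes that the first row of the file holds the labels and that every following row holds one dimension with one value per sample; the text above does not specify where the labels sit in the file, so that placement, the function name and the toy values are all hypothetical.

```python
import io


def read_labeled_matrix(handle):
    """Parse a tab-delimited table: first row = labels, other rows = data."""
    lines = [ln.rstrip("\r\n") for ln in handle if ln.strip()]
    labels = [int(tok) for tok in lines[0].split("\t")]
    data = [[float(tok) for tok in ln.split("\t")] for ln in lines[1:]]
    for row in data:
        if len(row) != len(labels):
            raise ValueError("each row needs one value per labelled sample")
    return labels, data


# Toy file with four samples (ratios 1:0, 0:1, 3:1, 1:3) and two rows.
example = io.StringIO(
    "1\t2\t3\t4\n"
    "120.0\t40.0\t100.0\t60.0\n"
    "15.0\t90.0\t33.75\t71.25\n"
)
labels, data = read_labeled_matrix(example)
print(labels)
print(data)
```

Checking that each row has exactly one value per label catches the most common formatting mistake, a missing or extra tab, before any analysis is run.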
3.2.Read in Data
- Start MATLAB.
- Choose a preferred data set to perform the PICA by entering a number in the command window:
Run BSS using simulation data or biological data? (0: exit; 1: simulation; 2: biological):
0: Exit the program;
1: Simulation data;
2: Biological data;
Other number: wrong option.
- If simulation data is chosen, the following prompt appears in the command window:
Run BSS using simulation data or biological data? (0: exit; 1: simulation; 2: biological):1
numerical mix...
load the pure data set...
Please input the file name of the pure data set (*.txt)[Enter: default data set]:
There are two options here: enter the name of your own data file (***.txt), or simply press the RETURN key. If the RETURN key is pressed, the program automatically loads the default simulation data file, nci_63.txt (2308 genes from neuroblastoma (NB) and Ewing family of tumors (EWS) profiles).
- If biological data is chosen, the following prompt appears in the command window:
Run BSS using simulation data or biological data? (0: exit; 1: simulation; 2: biological):2
NOT numerical mix...
load the pure data set...
Please input the file name of the pure data set (*.txt)[Enter: default data set]:
There are two options here: enter the name of your own data file (***.txt), or simply press the RETURN key. If the RETURN key is pressed, the program automatically loads the default biological data file, lcc9_m1.txt.txt (LCC1 vs. MCF-7 data sets).
- For simulation data, the program generates the mixture ratios automatically. For biological data, the current mixture ratios are 1:3 and 3:1.
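The idea of a numerical mix can be sketched as follows. This Python example is illustrative only and makes several assumptions: two pure profiles are combined linearly, a ratio a:b is converted to the weights a/(a+b) and b/(a+b), and the random ratio is drawn uniformly. How the software actually generates its ratios is not described here, and the function name and toy profiles are hypothetical.

```python
import random


def mix_sources(sources, ratios):
    """Linearly mix two pure profiles.

    Each entry (a, b) in ratios means a parts of source 1 to b parts of source 2.
    """
    s1, s2 = sources
    mixtures = []
    for a, b in ratios:
        w1 = a / (a + b)
        w2 = b / (a + b)
        mixtures.append([w1 * x + w2 * y for x, y in zip(s1, s2)])
    return mixtures


# Toy pure profiles (three dimensions each).
s1 = [120.0, 15.0, 60.0]
s2 = [40.0, 90.0, 60.0]

# Fixed ratios, as quoted for the biological data: 3:1 and 1:3.
fixed = mix_sources((s1, s2), [(3, 1), (1, 3)])
print(fixed)

# Randomly drawn ratio, as one possible stand-in for the simulation case.
rng = random.Random(0)
a = rng.uniform(0.1, 0.9)
simulated = mix_sources((s1, s2), [(a, 1 - a), (1 - a, a)])
print(simulated)
```

A dimension with the same value in both sources, such as the third one here, keeps that value in every mixture and therefore carries no information about the mixing ratio.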
3.3.Run ISG-PICA algorithm
Once the data set is selected, the software automatically selects appropriate methods and runs the PICA algorithm. The command window reports progress and results, for example:
------
perform BSS...
------
Calculating covariance...
Dimension not reduced.
Selected [ 2 ] dimensions.
Smallest remaining (non-zero) eigenvalue [ 2.36061e+006 ]
Largest remaining (non-zero) eigenvalue [ 2.50717e+007 ]
Sum of removed eigenvalues [ 0 ]
[ 100 ] % of (non-zero) eigenvalues retained.
Whitening...
Check: covariance differs from identity by [ 1.11022e-014 ].
Convergence after 8 steps
normalization/registration of sources and mixing matrix for comparison...
the E1 of demixing-mixing matrix:
0.79997
the correlation coefficient of sig and PICA is:
1
0.94882
the correlation coefficient of sig and PICA is:
2
0.99209
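The two correlation coefficients reported at the end of the log compare each true source with its recovered counterpart. A minimal Python sketch of that comparison is given below. Because separated sources are in general recovered in arbitrary order and with arbitrary sign and scale, the sketch tries both pairings and takes absolute values; this matching rule is an assumption and is not necessarily the registration step used by the software. All names and toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def match_sources(true_sources, recovered):
    """Return |correlation| per true source for the better of the two pairings."""
    direct = [abs(pearson(true_sources[0], recovered[0])),
              abs(pearson(true_sources[1], recovered[1]))]
    swapped = [abs(pearson(true_sources[0], recovered[1])),
               abs(pearson(true_sources[1], recovered[0]))]
    return direct if sum(direct) >= sum(swapped) else swapped


s1 = [1.0, 2.0, 3.0, 4.0, 5.0]
s2 = [5.0, 3.0, 4.0, 1.0, 2.0]
# Recovered sources: swapped order, with a sign flip and a rescaling.
recovered = [[-3.0 * v for v in s2], [2.0 * v + 1.0 for v in s1]]

scores = match_sources((s1, s2), recovered)
print(scores)
```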
3.4.Export recovered data to files, and plot result figures
After the BSS software finishes, scatter plots of the pure cases are displayed. The figures also show the profiles of the ground truth, the mixture observations and the recovered sources for comparison, together with a collection of the recovered sources in the gene space. All figures are also saved as EMF files.
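Exporting recovered sources to a tab-delimited text file can be sketched as follows. This Python example is not the export routine of the software: the output file name, the column headers and the function name are hypothetical, and the values are toy numbers.

```python
import csv
import os
import tempfile


def export_sources(path, recovered, probe_ids):
    """Write one row per probe: its ID followed by each recovered source value."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        header = ["probe_id"] + ["source_%d" % (i + 1) for i in range(len(recovered))]
        writer.writerow(header)
        for k, pid in enumerate(probe_ids):
            writer.writerow([pid] + [source[k] for source in recovered])


# Two recovered sources over two probes, written to a temporary location.
out_path = os.path.join(tempfile.gettempdir(), "recovered_sources.txt")
export_sources(out_path, [[1.5, 2.0], [0.25, 4.0]], ["g1", "g2"])

with open(out_path) as handle:
    print(handle.read())
```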
4.Summary
In this tutorial, we have presented the BSS method and a procedural description of its use. The BSS method has been implemented in MATLAB. Section 2 gives a brief summary of the algorithm, and Section 3 details and illustrates the procedural steps for performing CDTH using BSS. The material in this document is limited to the implementation of the BSS algorithm and does not discuss its theory. The reader should refer to the cited references for more detail on the underlying algorithm, as well as to other related books and publications (such as Hyvärinen, 2001).
References
- A. Hyvärinen, (2001) Independent Component Analysis, John Wiley & Sons, Inc.
- Y. Wang, J. Zhang, J. Khan, R. Clarke, and Z. Gu, “Partially-independent component analysis for tissue heterogeneity correction in microarray gene expression analysis,” Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 24-32, Toulouse, France, September 2003.
- RunICA software package, (2005):