Section 1: Instructions

Open the Java file and change the file names as needed (Fig. 1). Input files should be comma delimited and stored in the same directory as the Java program. Descriptions of the files are given in Section 2.

Figure 1. Screen shot of the Java file where the file names are specified.

Save the Java program after the changes and compile it into a Java class with the javac command in the command prompt (Fig. 2). If the command executes successfully, the directory will have a new Java class named ReliefRanks.

Figure 2. Screen shot of compiling the ReliefRanks Java program into a Java class.

Execute the ReliefRanks Java class in the command prompt (Fig. 3). The ranks for each gene set will also be stored in the output file specified in Step 1.

Figure 3. Screen shot of the execution of ReliefRanks Java class.

Open the R script and change the names of the input files and the output file. Input files should be comma delimited and stored in the same directory. The purpose of each file is stated in the comments within the R script (Fig. 4 and Fig. 5), and brief descriptions are given in Section 2.

Figure 4. Screen shot of the beginning of the R script. The file names to be changed are initialized here.

Figure 5. Screen shot of the end of the R script. The name for the output file should be specified here.

Start the R GUI and install the following packages:

splines
survival
pamr
CPE

In the R GUI, set the working directory to the directory containing the R script and the input files, then execute the R script with the source command.

Section 2: File descriptions

Two input files are needed to run the Java program:

a) The text file with the list of significant probes to be ranked by Relief (r6_Models_sg_Lists, a tab-delimited file). Each line represents one gene set and has 5 fields: a string of gene symbols from the gene set, a string of significant probes, an integer indicating the number of genes, an integer indicating the number of significant probes, and an integer gene set identifier.

b) The training file with gene expression values and the patients’ 5-year survival status (20404Probes_NewGS_Train_smoking_no50miss_5yr). It is a comma-delimited file. The first column contains the gene symbol, the second column contains the probe set ID, and the following columns contain the gene expression values for each patient. The last row of the data is the 5-year survival status (H: died within 5 years; L: survived at least 5 years).
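As an illustration of the 5-field layout described in item a), the following is a minimal Java sketch that parses one line of the tab-delimited gene set list. It is not part of the ReliefRanks program. The class name GeneSetLine, the sample line in the demo, and the separators assumed inside the first two fields (commas, semicolons, or whitespace) are all assumptions, since the text above does not specify how the gene symbols and probes are separated within a field.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** One line of the gene set list file (hypothetical helper, not part of ReliefRanks). */
final class GeneSetLine {
    final List<String> geneSymbols; // field 1: gene symbols in the gene set
    final List<String> probes;      // field 2: significant probes
    final int numGenes;             // field 3: number of genes
    final int numProbes;            // field 4: number of significant probes
    final int geneSetId;            // field 5: gene set identifier

    private GeneSetLine(List<String> geneSymbols, List<String> probes,
                        int numGenes, int numProbes, int geneSetId) {
        this.geneSymbols = geneSymbols;
        this.probes = probes;
        this.numGenes = numGenes;
        this.numProbes = numProbes;
        this.geneSetId = geneSetId;
    }

    /** Parses one line made of exactly 5 tab-separated fields. */
    static GeneSetLine parse(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps empty trailing fields
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 tab-separated fields but found " + fields.length);
        }
        return new GeneSetLine(
            splitItems(fields[0]),
            splitItems(fields[1]),
            Integer.parseInt(fields[2].trim()),
            Integer.parseInt(fields[3].trim()),
            Integer.parseInt(fields[4].trim()));
    }

    // ASSUMPTION: items inside a field are separated by commas, semicolons,
    // or whitespace. Change the pattern if the real file uses something else.
    private static List<String> splitItems(String field) {
        String trimmed = field.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("[,;\\s]+"));
    }
}

final class GeneSetLineDemo {
    public static void main(String[] args) {
        // Made-up sample line for illustration only.
        String line = "EGFR,KRAS\t201983_s_at,204009_s_at,214352_s_at\t2\t3\t17";
        GeneSetLine g = GeneSetLine.parse(line);
        System.out.println(g.geneSetId + ": " + g.numGenes + " genes, "
            + g.probes.size() + " probes");
    }
}
```

Comparing the parsed counts (numGenes, numProbes) with the sizes of the two lists is a quick way to check that the separator assumption matches the actual file before running the ranking.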

Eight input files are required to execute the R script:

a) The summary file (Diff10Rules_r6_Summary.csv) contains the hallmarks (the first few columns) and gene symbols (the last column) of genes directly co-expressed with the hallmarks from the difference network components.

b) The file containing the list of all significant probes. This is the same file as item a) of the Java program inputs above.

c) The file containing the ranking of significant probes for each gene set. This is the output file from the Java program. It is a tab-delimited file with two fields. The first field is the list of significant probes and the second field is the ranking (with the last number indicating the number of significant probes).

Remark: the index in the ranking starts from 0.

d) The training and test data files contain the gene expression values and survival outcomes of patients. The first column contains the gene symbol, the second column contains the probe set ID, and the following columns contain the gene expression values for each patient. The last two rows of the data are the survival time and status (1: dead; 0: alive).

e) The three clinical data files contain the tumor stage status for each patient.

f) The output file contains the information for each model constructed. The first column has the gene symbols of the genes directly co-expressed with the hallmarks, the second column has the probe set IDs of the significant probes, the third column has the number of genes, the fourth column has the number of significant probes, and the fifth column contains the identifier number corresponding to the gene set in the summary file. The remaining columns hold the information for the model, such as the cutoff value, the log-rank P value from the stratification, the hazard ratio, and the CPE.
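To make the layout of the training and test files in item d) concrete, here is a minimal Java sketch that pulls the survival time and status out of the last two rows of such a comma-delimited file. It is only an illustration and is not taken from the R script or the Java program. The class name SurvivalRows is hypothetical, and the sketch assumes that the first two columns of the two survival rows hold labels or are empty, that no field contains an embedded comma, and that the status is written as the integers 1 and 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Survival time and status per patient (hypothetical helper for illustration). */
final class SurvivalRows {
    final double[] time;  // survival time, one entry per patient
    final int[] status;   // 1 = dead, 0 = alive

    private SurvivalRows(double[] time, int[] status) {
        this.time = time;
        this.status = status;
    }

    /** Reads the last two non-blank rows as survival time and status. */
    static SurvivalRows fromCsvLines(List<String> lines) {
        // Ignore blank lines so a trailing newline does not shift the rows.
        List<String> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(line);
            }
        }
        if (rows.size() < 2) {
            throw new IllegalArgumentException("Need at least two rows");
        }
        String[] timeRow = rows.get(rows.size() - 2).split(",", -1);
        String[] statusRow = rows.get(rows.size() - 1).split(",", -1);
        if (timeRow.length != statusRow.length || timeRow.length <= 2) {
            throw new IllegalArgumentException("Survival rows are malformed");
        }
        // ASSUMPTION: columns 0 and 1 (gene symbol, probe set ID) are skipped.
        int patients = timeRow.length - 2;
        double[] time = new double[patients];
        int[] status = new int[patients];
        for (int i = 0; i < patients; i++) {
            time[i] = Double.parseDouble(timeRow[i + 2].trim());
            status[i] = Integer.parseInt(statusRow[i + 2].trim());
            if (status[i] != 0 && status[i] != 1) {
                throw new IllegalArgumentException("Status must be 0 or 1");
            }
        }
        return new SurvivalRows(time, status);
    }
}

final class SurvivalRowsDemo {
    public static void main(String[] args) {
        // Made-up rows for illustration only: one probe, two patients.
        List<String> lines = Arrays.asList(
            "EGFR,201983_s_at,7.1,8.4",
            "time,,30.5,61.0",
            "status,,1,0");
        SurvivalRows s = SurvivalRows.fromCsvLines(lines);
        for (int i = 0; i < s.time.length; i++) {
            System.out.println("patient " + i + ": time=" + s.time[i]
                + " status=" + s.status[i]);
        }
    }
}
```

For a real file, the lines could be loaded with java.nio.file.Files.readAllLines before calling fromCsvLines. If the file has a header row or quoted fields, the parsing would need to be adapted accordingly.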