HPC and Bioinformatics COT 6930

Homework 2 (9 pts)

Due March 18

Part 1: Gene expression data analysis (4 pts)

The purpose of this assignment is for you to understand basic gene expression data analysis techniques. We will use the WEKA data mining toolkit to perform two types of gene expression data analysis:

  1. Molecular classification of leukemia. We will build a classifier that uses gene expression data to identify whether a diseased tissue sample belongs to acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). This article provides more information on the molecular classification of leukemia.
  2. Selecting a small number of important genes to improve the accuracy of leukemia classification.

What you should do:

First, download and install WEKA, and read the "WEKA Explorer User Guide".

This ppt file provides detailed step-by-step guidance on how to use the WEKA Explorer.

Next, download the leukemia gene expression data from here. The data is in ARFF file format (which is the required format for WEKA). Memory note: because the gene expression data is relatively large, you may need to increase WEKA’s heap size by following the instructions below:

  • Find WEKA’s home directory. It is normally under D:\program files\weka-3-4 (or on the C: drive, depending on where you installed WEKA).
  • Open RunWeka.ini.
  • Add “maxheap=256m” (without the double quotes) right above the line “mainclass=weka.gui.GUIChooser”. If your computer has more than 256 MB of memory, you may increase the heap size accordingly, for example maxheap=512m.
  • Save and close RunWeka.ini, then restart WEKA.
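The edit described in the steps above can also be sketched as a small script. This is only an illustration of where the new line goes: the function name set_max_heap is hypothetical, the script works on the file's text rather than on a particular path, and it assumes the ini file contains a line starting with "mainclass=" as the instructions describe. If no such line is found, the text is returned without a maxheap entry.

```python
def set_max_heap(ini_text: str, heap: str = "256m") -> str:
    """Return ini_text with a maxheap line placed right above the mainclass line.

    Hypothetical helper: it assumes the file has a line starting with
    "mainclass=", as the instructions above state.
    """
    out = []
    for line in ini_text.splitlines():
        stripped = line.strip().lower()
        if stripped.startswith("maxheap="):
            # Drop any existing setting so the value is not duplicated.
            continue
        if stripped.startswith("mainclass="):
            out.append(f"maxheap={heap}")
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "# sample ini text\nmainclass=weka.gui.GUIChooser\n"
    print(set_max_heap(sample, "512m"), end="")
```

Editing the file by hand in a text editor is just as good; the script only makes the placement of the line explicit.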

Then, use the WEKA Explorer to classify the data and compare the performance of different classifiers and feature selection algorithms. You should choose the J4.8 classification algorithm (it appears as J48 in WEKA's classifier list). Compare the classifier's accuracy obtained with 10-fold cross-validation in the following scenarios:

  • Apply the classifier directly without any feature (gene) selection.
  • Use a feature selection algorithm to select the top 5 ranked genes, and then apply the classifier to the filtered data. To do this properly, you should use ReliefFAttributeEval in the WEKA Explorer with its default parameter settings. You can find the details of the ReliefF algorithm in this article.
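To make the evaluation procedure concrete, here is a minimal sketch of what k-fold cross-validation computes. It is not WEKA's implementation: the function names, the round-robin fold assignment, and the train_and_predict callback are all assumptions made for illustration, and the labels in the usage lines are toy values, not the leukemia dataset.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


def cross_validation_accuracy(labels, folds, train_and_predict):
    """Hold out each fold once and return the overall fraction of correct predictions.

    train_and_predict(train_idx, test_idx) is a hypothetical callback that
    trains on the samples in train_idx and returns one predicted label per
    entry of test_idx.
    """
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = train_and_predict(train_idx, test_idx)
        correct += sum(1 for i, p in zip(test_idx, predictions) if p == labels[i])
    return correct / len(labels)


if __name__ == "__main__":
    toy_labels = ["ALL"] * 6 + ["AML"] * 4  # toy labels for illustration only
    folds = stratified_folds(toy_labels, k=5)
    # A trivial "classifier" that always predicts ALL, as a baseline.
    always_all = lambda train_idx, test_idx: ["ALL"] * len(test_idx)
    print(cross_validation_accuracy(toy_labels, folds, always_all))
```

Every sample is tested exactly once, by a model that never saw it during training, which is why the cross-validated accuracy is a fairer estimate than accuracy on the training data.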

What should be turned in?

  • Please follow the format of this article and turn in a written report with the following content.
  • The 10-fold cross-validation results of applying J4.8 to classify the Leukemia.arff dataset without any feature selection (1 pt)
  • Draw a figure with the x-axis showing the number of selected genes (1, 3, 5, 10, 20, 30) and the y-axis showing the accuracy of the J4.8 classifier built from the selected features (1 pt)
  • Provide a simple explanation of why selecting a small number of genes can help build a more accurate classification model (2 pts).

Part 2: Bioinformatics resources and search engines (5 pts)

Question 1 (4.0 pts)

Please read this article and go through the BLAST Guide. Given the DNA sequence listed after the questions below and the NCBI BLAST tools, please answer the following questions:

  1. Which version of the NCBI BLAST tool should we use to find similar sequences? (0.2 pt)
  2. Use the sequence as input and search against the database with the tool you selected. List the alignments, % identity, and E-values of the top 3 matches. (0.5 pt)
  3. Click the first matching sequence. What is the name of this protein (0.2 pt)? What is the name of the gene that encodes this protein (0.2 pt)? What is the amino acid sequence of this protein (0.2 pt)? Briefly state the function of this protein (0.2 pt).
  4. Assume that you are interested in finding more information about the first matching protein retrieved in step 3, and you decide to start with the secondary structure of its amino acid sequence. Use the PSIPRED protein structure prediction tool to find the secondary structure of the protein, then attach and report your results. (0.5 pt)
  5. Now you are interested in finding proteins structurally similar to the first matching protein from step 3, and you decide to use PSIPRED (fold recognition tool GenTHREADER). List the PDB_ID (Protein Data Bank ID) of each retrieved protein, ignoring the last two characters. (0.5 pt)
  6. Assume that you decide to use tblastn (search a translated nucleotide database using a protein query) as your search tool to find proteins similar to the first matching protein retrieved in step 3. Use the amino acid sequence of the first matching protein (from step 3) as input and search the database. Select the first matching result and list the protein IDs cross-referenced to the Protein Data Bank (0.5 pt). What are the SCOP classes of the first three proteins (0.5 pt)? List the proteins retrieved in both step 5 and step 6 (0.5 pt).

acttgtcatg gcgactgtcc agctttgtgc caggagcctc gcaggggttg atgggattgg

ggttttcccc tcccatgtgc tcaagactgg cgctaaaagt tttgagcttc tcaaaagtct

agagccaccg tccagggagc aggtagctgc tgggctccgg ggacactttg cgttcgggct

gggagcgtgc tttccacgac ggtgacacgc ttccctggat tggcagccag actgccttcc

gggtcactgc catggaggag ccgcagtcag atcctagcgt cgagccccct ctgagtcagg

aaacattttc agacctatgg aaactacttc ctgaaaacaa cgttctgtcc cccttgccgt

cccaagcaat ggatgatttg atgctgtccc cggacgatat tgaacaatgg ttcactgaag

acccaggtcc agatgaagct cccagaatgc cagaggctgc tccccgcgtg gcccctgcac

cagcagctcc tacaccggcg gcccctgcac cagccccctc ctggcccctg tcatcttctg

tcccttccca gaaaacctac cagggcagct acggtttccg tctgggcttc ttgcattctg

ggacagccaa gtctgtgact tgcacgtact cccctgccct caacaagatg ttttgccaac

tggccaagac ctgccctgtg cagctgtggg ttgattccac acccccgccc ggcacccgcg

tccgcgccat ggccatctac aagcagtcac agcacatgac ggaggttgtg aggcgctgcc
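The sequence above is printed in blocks of ten letters with blank lines in between. If a tool is strict about its input, the text can be normalized first. The following is a minimal sketch under stated assumptions: the function name to_fasta and the header "query" are hypothetical, and the 60-letter line width is only a common convention.

```python
def to_fasta(raw: str, header: str = "query", width: int = 60) -> str:
    """Strip whitespace, digits, and punctuation from a copied sequence
    and wrap the remaining letters as a single FASTA record."""
    residues = "".join(ch for ch in raw if ch.isalpha()).upper()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # First line of the sequence above, used here only as sample input.
    sample = "acttgtcatg gcgactgtcc agctttgtgc caggagcctc gcaggggttg atgggattgg"
    print(to_fasta(sample), end="")
```

The same helper works for the protein sequences in Question 2, since it keeps every letter and only removes spacing.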

Question 2 (1 pt)

Assume that you are asked to determine, from the sequences of pancreatic ribonuclease from horse (Equus caballus), minke whale (Balaenoptera acutorostrata), and red kangaroo (Macropus rufus), which two of these species are most closely related. The sequences are given below, and you decide to use ClustalW (a multiple sequence alignment tool) to find the answer. Please summarize the alignment results (0.5 pt) and conclude which two species are most closely related (0.5 pt).

>RNP_HORSE

kespamkfer qhmdsgstss snptycnqmm krrnmtqgwc kpvntfvhep ladvqaiclq

knitckngqs ncyqssssmh itdcrltsgs kypncayqts qkerhiivac egnpyvpvhf

dasvevst

>RNP_BALAC

respamkfqr qhmdsgnspg nnpnycnqmm mrrkmtqgrc kpvntfvhes ledvkavcsq

knvlckngrt ncyesnstmh itdcrqtgss kypncaykts qkekhiivac egnpyvpvhf

dnsv

>RNP_MACRU

etpaekfqrq hmdtehstas ssnycnlmmk ardmtsgrck plntfihepk svvdavchqe

nvtckngrtn cyksnsrlsi tncrqtgask ypncqyetsn lnkqiivace gqyvpvhfda

yv
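One way to summarize a multiple alignment is the percent identity of each pair of aligned sequences. The sketch below is not part of ClustalW, and its input must be rows that are already aligned (equal length, with "-" marking gaps). The function names are hypothetical, and the choice to ignore columns where either sequence has a gap is an assumption; alignment tools may use a different denominator, so their reported scores can differ.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip gap columns (an assumption, see the lead-in)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identities(aligned: dict) -> dict:
    """Map each pair of sequence names to the percent identity of their rows."""
    return {
        (m, n): percent_identity(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


if __name__ == "__main__":
    # Toy aligned rows for illustration only, not the ribonuclease alignment.
    toy = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
    for pair, score in pairwise_identities(toy).items():
        print(pair, score)
```

The pair with the highest identity is the natural candidate for the most closely related species, although the conclusion should be checked against the alignment scores the tool itself reports.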