CIS595: Homework 3

Assigned: February 25, 2003

Due: March 3, 2003 (in class)

Homework Policy

·  All assignments are INDIVIDUAL! You may discuss the problems with your colleagues, but you must solve the homework by yourself. Please acknowledge all sources you use in the homework (papers, code or ideas from someone else).

·  Assignments should be submitted in class on the day when they are due. No credit is given for assignments submitted at a later time, unless you have a medical problem.

Problems

(For all of the problems, you might find it helpful to consult your textbook.)

(50 points) 1. Read the following paper (you can download it from the course web page, http://www.ist.temple.edu/~vucetic/cis595spring2003.htm):

S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic Local Alignment Search Tool. Journal of Molecular Biology, 215:403--410, 1990.

Write a one-page summary of the paper covering its motivation, methodology, and experimental results.

(30 points) 2. Carry out the University of Washington Blast Tutorial:

http://healthlinks.washington.edu/hsl/liaisons/yarfitz/BlastTutorial/

Do not forget to spend some time understanding the BLAST Help Manual, which is accessible from the tutorial. It will also be useful to check the Blast Guide at http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/guide.html. Answer the following questions:

a.  Describe the FASTA sequence format.

b.  Comment on the proper choice of scoring matrix in nucleotide and protein BLAST.

c.  What is the purpose of the SEG program?

d.  What is the non-redundant protein database?

e.  Comment on the choice of gap penalty in BLAST. Compare the consequences of large vs. small penalties for gap opening and gap extension.

f.  What is the meaning of the E-value, and what is an appropriate choice of E-value threshold?

Summarize the steps you performed in the tutorial in up to ½ page.

(40 points) 3. Search the non-redundant protein database using the "Standard protein-protein BLAST [blastp]" option on the BLAST page with the following protein sequence. Use the default settings, except that the numbers of descriptions and alignments should both be set to 500:

MSGSRKFFVGGNWKMNGSRDDNDKLLKLLSEAHFDDNTEVLIAPPSVFLHEIRKSLKKEIHVAAQNCYKV

SKGAFTGEISPAMIRDIGCDWVILGHSERRNIFGESDELIAEKVQHALAEGLSVIACIGETLSERESNKT

EEVCVRQLKAIANKIKSADEWKRVVVAYEPVWAIGTGKVATPQQAQEVHNFLRKWFKTNAPNGVDEKIRI

IYGGSVTAANCKELAQQHDVDGFLVGGASLKPEFTEICKARQR
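
The four lines above are one continuous sequence. As a convenience, the following minimal Python sketch joins them and prints the result in FASTA form (a header line starting with ">" followed by the sequence lines), ready to paste into the search form. The header text "query", the function name to_fasta, and the 70-character line width are illustrative assumptions, not requirements of the assignment.

```python
# The four lines of the query sequence, copied verbatim from the assignment.
QUERY_LINES = [
    "MSGSRKFFVGGNWKMNGSRDDNDKLLKLLSEAHFDDNTEVLIAPPSVFLHEIRKSLKKEIHVAAQNCYKV",
    "SKGAFTGEISPAMIRDIGCDWVILGHSERRNIFGESDELIAEKVQHALAEGLSVIACIGETLSERESNKT",
    "EEVCVRQLKAIANKIKSADEWKRVVVAYEPVWAIGTGKVATPQQAQEVHNFLRKWFKTNAPNGVDEKIRI",
    "IYGGSVTAANCKELAQQHDVDGFLVGGASLKPEFTEICKARQR",
]


def to_fasta(lines, header="query", width=70):
    """Join sequence fragments and return them as FASTA-formatted text.

    `header` is a hypothetical label; any short description may be used.
    """
    # Remove stray whitespace and concatenate the fragments into one sequence.
    seq = "".join(line.strip() for line in lines)
    # Re-wrap the sequence into fixed-width lines for readability.
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(wrapped) + "\n"


fasta_text = to_fasta(QUERY_LINES)
print(fasta_text)
```

Pasting the bare sequence without a header line should work equally well in the search form; the header only labels the query.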

Answer the following using the Blast output results:

a. Based on the matches observed in the BLAST output, can you conclude the name of the protein that was used to query the database? How many proteins would you characterize as significantly similar to the query protein? Give your reasoning.

b. Describe the function and other properties of this protein by searching the Swiss-Prot database (http://ca.expasy.org/sprot/). Alternatively, use any web resource you choose to describe the function of the protein.

c. List the alignments, % identity, and E-values of the top 5 matches. List the names of the organisms that are the sources of the top 5 matches.

d. Using the Taxonomy Reports link in the BLAST results, determine how many proteins similar to the query come from each of the following groups: archaea, bacteria, fungi, plants, insects, fish, birds, mammals other than human, and human.

e. Do you think the protein family corresponding to the query protein spans a great evolutionary distance? Explain.

(30 points) 4. Amino acid motifs that are conserved in proteins are often indicative of amino acid residues that are required for a particular function of a protein. On the other hand, as the database grows, so does the number of chance occurrences of amino acid motifs. In some cases these motifs might spell out words or people's names in single-letter amino acid codes. One such name motif might be ELVIS.
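
To make the remark about chance occurrences concrete, here is a rough Python sketch that estimates how many exact matches to a short motif one would expect purely by chance. It assumes residues are independent and, by default, that all 20 amino acids are equally frequent. Both are simplifications, and the database sizes used are hypothetical values chosen only for illustration. The estimate counts exact string matches, which is not the same thing as the scored alignments that BLAST reports.

```python
def expected_chance_hits(motif, total_residues, residue_freqs=None):
    """Expected number of exact occurrences of `motif` arising by chance.

    Assumes independent residues. Uses a uniform frequency of 1/20 per
    amino acid unless `residue_freqs` (a dict mapping residue letters to
    probabilities) is supplied. Sequence-end effects are ignored.
    """
    p = 1.0
    for aa in motif:
        # Probability of seeing this residue at a given position.
        p *= residue_freqs[aa] if residue_freqs else 1.0 / 20.0
    # Each position in the database is treated as a possible motif start.
    return total_residues * p


# Hypothetical database sizes (total residues), for illustration only.
for size in (10**7, 10**8, 10**9):
    print(size, round(expected_chance_hits("ELVIS", size), 2))
```

Under the uniform assumption the probability of the 5-letter motif at any given position is (1/20)^5, or 1 in 3,200,000, so the expected count grows in direct proportion to the size of the database.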

a. Find the number of occurrences of the protein motif ELVIS in the protein nr (non-redundant) database.

In carrying out your blast analysis from the blast page, be sure to select within the Protein options (in this case we are comparing a protein against the non-redundant protein database) the option for "Search for Short Nearly Exact Matches". Leave most of the options as the default. However, in the output formatting be sure that the number of descriptions and alignments is set to 250.

b. How many hits do you get?

c. Examine the description of the proteins that ELVIS matches. Is there a common pattern in the type of protein or functional attribute of the protein matched that suggests a function for the motif ELVIS?

d. Alternatively, do the data suggest that ELVIS is just a chance occurrence of an amino acid motif (collection or group of amino acids) in the databases?

e. Return to the BLAST home page. From the options presented to you, note the default parameter settings for "Search for Short Nearly Exact Matches". In particular, note the defaults for the following: Filtering, Expect, Word Size, Matrix, Gap Costs. Now return to the BLAST home page again. This time, select "Standard protein-protein BLAST [blastp]" within the protein options, and examine the default settings for the same parameters: Filtering, Expect, Word Size, Matrix, Gap Costs. Using the links provided within the NCBI BLAST pages and the help documents, explain why the defaults differ between "Search for Short Nearly Exact Matches" and "Standard protein-protein BLAST [blastp]", that is, between searching with short motifs and searching with full-length proteins.

GOOD LUCK!!!