MB&B 208 Fall 2006

Holmes Homework 1

Due November 27

You’ve cloned a DNA sequence that you suspect contains a gene. In this exercise you will use standard bioinformatics programs and methods to identify and characterize the protein produced by the gene.

Last name starts with: / Sequence to use:
A-J / Sequence 1
K-R / Sequence 2
S-Z / Sequence 3

1. First, use ORF Finder to determine whether the sequence contains any potential protein-coding sequences:

  • Copy and paste your sequence into the window, then click on “OrfFind”
  • Potential ORFs will be highlighted in green. Click on the one you wish to investigate. Remember, you are trying to identify the most likely protein-coding gene in your DNA sequence.
  • The selected ORF will now be magenta. Below the sequence you can click on “accept”.
  • Now the ORF you selected is shown in green. From the drop down menu that says “1 GenBank” choose “3 fasta protein”. Click on “view”.
  • You will see the amino acid sequence (in one letter abbreviations) of the ORF you’ve chosen. Copy and paste this sequence onto another page as part of your assignment to hand in.
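
To make the idea behind this step concrete, here is a minimal Python sketch of a six-frame ORF scan. It is not how ORF Finder itself is implemented, and the names find_orfs and min_codons are made up for this illustration. It assumes the standard genetic code, treats an ORF as the stretch from the first ATG after a stop codon to the next in-frame stop, and ignores ORFs that run off the end of the sequence.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=30):
    """Return (strand, frame, protein) tuples, longest protein first.

    min_codons is an illustrative cutoff, not a value taken from ORF Finder.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the ATG that opened the current ORF
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and CODON_TABLE.get(codon) == "*":
                    # Translate from the ATG up to (not including) the stop.
                    protein = "".join(
                        CODON_TABLE.get(seq[i:i + 3], "X")
                        for i in range(start, pos, 3)
                    )
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame + 1, protein))
                    start = None
    return sorted(orfs, key=lambda orf: len(orf[2]), reverse=True)


# Toy example: one short ORF in frame 3 of the forward strand.
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=1))
```

On a real clone, the first entry returned (the longest ORF) is usually the candidate worth comparing with the one you picked in ORF Finder.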

2. Now use the BLAST program to identify your gene. BLAST compares a given sequence against a database of known sequences and reports the ones most similar to it. It can find the exact source of the sequence that you are testing (assuming the organism has been sequenced), as well as similar sequences from other sources (e.g., related proteins from the same or other species). For this exercise you can use the BLAST program specific to the yeast genome:

If you prefer to compare your sequence to all known sequences, use the NCBI BLAST server:

  • You have a choice of searching with either DNA sequence (using blastn) or amino acid sequence (using blastp). For this exercise, use blastp. Why is it better to look at amino acid similarity when looking for similar proteins? What types of questions are better answered using DNA similarity searches? Include answers to these questions in your assignment.
  • Paste your sequence into the box and click RUN BLAST. If you use the yeast-specific program you must make sure that you choose the correct database to search (protein sequence).
  • The output is ordered from closest match to farthest match, with a probability score reflecting the odds of such a match occurring by chance. Clicking on the probability score, or the bar reflecting the extent of similarity, shows you the match between the query (the sequence you originally entered) and the subject (the database sequence being compared to the query). Identical amino acids are shown between the query and subject; similar ones are designated with a “+”.
  • Above the comparison there are links to information about the subject sequence. The SGD locus link will tell you the identity of the subject. Record this information for at least three similar proteins you find (everybody should include the most similar protein). For these same three proteins, also click on the retrieve sequence link. Scroll down on this page and get the protein sequence in fasta format (choose “ORF translation”). When copying a fasta file, include everything from the greater-than sign (>) to the asterisk (*).
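
The copying rule in the last bullet is easy to get wrong, so here is a small Python sketch that checks a pasted record before you move on. The function name check_fasta_record and the example record are invented for this illustration; the only assumptions are the ones stated above, namely that a complete record starts with a “>” header line and that the ORF translation ends with “*”.

```python
def check_fasta_record(text):
    """Return a list of problems found in one pasted fasta protein record."""
    # Drop surrounding whitespace and any empty lines.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    problems = []

    if lines and lines[0].startswith(">"):
        sequence = "".join(lines[1:])
    else:
        problems.append("missing header line starting with '>'")
        sequence = "".join(lines)

    if not sequence.rstrip("*"):
        problems.append("no amino acid residues found")
    if not sequence.endswith("*"):
        problems.append("sequence does not end with '*'; the end may be cut off")
    if "*" in sequence.rstrip("*"):
        problems.append("'*' found inside the sequence; two records may be merged")
    return problems


# Made-up record used only to show the expected shape of the input.
record = """>hypothetical_protein
MDSEVAAL
VIDNGSGM*
"""
print(check_fasta_record(record))  # an empty list means no problems were found
```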

3. The next task is to compare multiple sequences at the same time. After obtaining at least three protein sequences in fasta format (use more if you wish), use the Clustal program to perform a multiple alignment:

To use Clustal, simply paste each of your complete fasta protein sequence files in the box, one on top of the other, with no empty lines, then click submit. You should see an alignment of the sequences you entered.

Finally, use the Boxshade program to highlight the most similar regions of your set of proteins. Scroll down the screen of your Clustal output; there will be a green box with a summary of the alignment in fasta format. Copy the contents of this box and paste them into the Boxshade program found here:

The “input sequence format” option must be set to “other”. The output, which will be downloaded to your computer, will probably be in postscript (.ps) format. On Macs you can double-click this file and it will be automatically converted to a pdf file. On PCs you may have to convert the .ps file to pdf format using the Adobe Distiller program. This program is on most of the computers in the Wesleyan computer labs. If you have any problems at this point send the .ps file to me, and I’ll send you back the file converted to pdf format. Print this file and include it with your homework.
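
To help you read the shaded printout, here is a rough Python sketch of the idea behind it: marking alignment columns in which most of the sequences share the same residue. This is only an illustration and not Boxshade's actual method. It counts identical residues only (not similar ones), and the function name conserved_columns, the threshold value, and the toy alignment are all made up for the example.

```python
from collections import Counter


def conserved_columns(aligned, threshold=0.5):
    """Return a string with '*' under each conserved alignment column.

    aligned   -- list of equal-length aligned sequences ('-' marks a gap)
    threshold -- fraction of sequences that must share the same residue
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    marks = []
    for column in zip(*aligned):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(residue for residue in column if residue != "-")
        if counts:
            top_count = counts.most_common(1)[0][1]
            conserved = top_count / len(aligned) >= threshold
        else:
            conserved = False
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Toy alignment of three made-up sequences.
alignment = ["MKV-LA", "MKI-LA", "MRVGLA"]
for seq in alignment:
    print(seq)
print(conserved_columns(alignment, threshold=1.0))
```

Runs of marked columns correspond to the kind of regions you should expect to see shaded most strongly in your printout.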

4. Where did all that work get you? A. Describe the types of information this analysis produces; B. List some specific questions that sequence comparisons can address.

5. This exercise was performed with DNA sequences from budding yeast. Most yeast genes don’t contain introns. What are the challenges of gene identification in organisms like humans, which have many introns in each gene? Briefly discuss how you would identify genes and ORFs in human cells.