Collecting and Storing Sequences

BCB 444/544 Fall 07 Aug 30 Lab 1 p. 1

Lab 2 Name ______

Objectives

Become familiar with the computer lab
Learn how to keep a log of your work
Learn how to search for information in online bioinformatics databases
Use online bioinformatics databases to solve biologically motivated questions

Introduction

It is strongly suggested that you keep a log of your work in the lab. A log is a handy way of keeping track of which web sites you visited, what your search terms were, any important results obtained, the names and locations of files that you saved, and so on. For those of you who have kept a lab notebook, the concept of a log should be very familiar. There are many ways to keep a log while you work online. One of the easiest is to open a new document in Word or another text editor. You can then jot down any notes you may have, a list of web sites visited (if you use Word or some other high-level program, any URLs you include will automatically become links to the web site), the terms you used to search a database, and the results of your query.

It is important to date each entry in your log because things on the web can change very quickly. Because any bookmarks you save on the lab computers will be deleted, the log can also serve as your list of bookmarks for the course. As with any lab notebook, you will regret leaving out the one important detail that you can’t remember far more than you will ever regret recording too many details.

Exercise

The following questions will familiarize you with several bioinformatics databases and their query syntax.

1.) Entrez. This problem practices using the Entrez search program at the National Center for Biotechnology Information (NCBI) to find the amino acid sequence of the human heat shock factor HSF1. Such searches normally return a large number of matches. We will use the Entrez Boolean search features, which restrict the reported matches to those meeting a series of required conditions, to narrow the search to the sequences that we want.

Go to the Entrez Web site ( ). Search around the site and, in several sentences, describe what Entrez is and several of its uses. **Hint: Start with the About Entrez link on the left-hand side and follow the site map.**
Return to the Entrez main page and choose Protein from the search drop-down box in the upper left. Enter the term “heat shock factor” (with the quotes) to identify all entries with this phrase somewhere in their text. How many matches are found?
Now limit the search by clicking the mouse on the Preview/Index tab, go to add terms, choose organism in the first box, type human in the second, then click AND to limit the search to just human proteins, and then click Preview. The history will now show the results of a search for database entries with the term “heat shock factor” AND originating from humans as the organism. How many hits are there now?
We can restrict the hits to RefSeq, which is in GenBank’s annotated sequence database, to give a best representative sequence entry for each protein. Click on Limits. In the Limited To section of the page, ignore the boxes on the left and choose RefSeq in the right-most box. Then click the GO button and then the History tab. Now we have all human heat shock factors in RefSeq.
The gene of interest is HSF1. Click on the Preview/Index tab and select Gene Name in the drop down menu. Type HSF1 in the textbox, then click the AND button and then the Preview Button. There should now be one entry left in History. Clicking on the number 1 provides the sequence.
There are other ways of arriving at this final sequence. As another example, pull out all human protein sequences in RefSeq and all HSF1 sequences in all organisms and then select the human one using another Boolean search feature of Entrez. Reload the Entrez page and choose Protein in the upper left box. Enter human in the text box at the top, click Limits and then in the Limit To area, choose Organism in the upper left box and RefSeq in the right box. Click GO and then History. Now we have a complete list of all human proteins in RefSeq. How many are there?
Now replace human with HSF1 in the upper text field, click Limits, and in the Limited To area, choose gene name in the upper left box and RefSeq in the right box. Click GO and then History. How many proteins result from this query?
Finally, note the numbers that follow the pound sign (#) at the beginning of the two History lines produced by the last two searches. Go to the upper text box and type <#1 AND #2> (assuming the numbers are 1 and 2), omitting the angle brackets. This creates a new search that matches only protein sequences that are from humans and that are products of the HSF1 gene, i.e., the new search is the intersection of the previous two. Again, 1 protein should be left.
Click on the 1 in the result column. Note the RefSeq accession number starting with “NP”. Click on NP, the link to display the detailed information for the query item. Use the display menu to display the sequence in FASTA format. “NP” identifies the sequence as a curated protein sequence. The sequence may then be copied and pasted into your answer document.
While on the page with the target sequence, click on Links on the right side of the page and choose the Nucleotide option. The mRNA and genome sequences corresponding to the protein should now become available. Copy the FASTA versions of these files into your answer document as well.
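
The Boolean searches in this exercise can also be written as a single query string. The following is a minimal sketch, assuming NCBI's E-utilities ESearch endpoint and the Entrez field tags [Organism], [Gene Name], and [Properties]. The helper names entrez_term and esearch_url, and the IDs in the last few lines, are made up for illustration. The code only builds the query and the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(*clauses):
    """Join search clauses with AND, as History does for '#1 AND #2'."""
    return " AND ".join(f"({clause})" for clause in clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not fetch) an ESearch URL for a database and term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# The four conditions applied one after another in the steps above.
term = entrez_term(
    '"heat shock factor"',
    "human[Organism]",
    "srcdb_refseq[Properties]",
    "HSF1[Gene Name]",
)
url = esearch_url("protein", term)
print(term)
print(url)

# '#1 AND #2' is a set intersection. These IDs are invented placeholders.
human_refseq = {"id_a", "id_b", "id_c"}
hsf1_any_organism = {"id_b", "id_d"}
print(human_refseq & hsf1_any_organism)
```

Pasting the printed term into the Entrez search box should be roughly equivalent to combining the History entries by hand, although the exact hit counts depend on the current state of the database.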

2.) OMIM (Online Mendelian Inheritance in Man). Open the browser and go to the NCBI home page ( ). Change the search pull-down from “All Databases” to “OMIM”. In the following steps we will diagnose a disease from the symptoms presented in an example case study (from last year’s textbook).

Click on the go button to bring up the OMIM homepage (or click on OMIM in the horizontal navigation bar above the query box). Read through the page and in several sentences describe what information is present in this database and its uses.
Creatine kinase was present in extremely high levels in the blood test. Attempt to identify the disease by entering elevated creatine kinase into the search box. How many items were identified by the search?
The basic search functions work similarly to those of many web-based search engines. Place quotes around elevated creatine kinase to search for the entire phrase. How many items does the search reduce to?

**Note: Other common search elements such as the Boolean operators AND, OR, NOT can be used to enhance your search**

Continue to narrow the search by adding “mental retardation” to the search. We are getting close to identifying the disease. How many items does this search return?
This disease appears to be linked to the X chromosome. Add the word x-linked to the search and record the remaining possible diseases.
Because this disease has affected multiple members of the family and symptoms always appear during childhood, add the keyword juvenile to the search. What disease does the boy have?
Click on the remaining entry to retrieve an in-depth description of the disease. List one more symptom of the disease that was also present in the patient.
Use the back arrow on the browser to return to the search results. Take note of the multiple tabs present in the results window. Click on them to see other available options and search restrictions. After looking through these tabs provide an alternate method other than typing x-linked in the search bar to restrict the search to an X-linked disease.
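
The effect of quoting a phrase and of adding keywords can be illustrated with a toy search. This is only a sketch: the three entries are invented and are not real OMIM records, the function name search is hypothetical, and treating unquoted words as an implicit AND is an assumption about how the search engine behaves.

```python
import shlex

# Invented toy entries; they are NOT real OMIM records.
ENTRIES = {
    "entry 1": "Elevated creatine kinase in childhood; X-linked inheritance.",
    "entry 2": "Creatine kinase is occasionally elevated in adults.",
    "entry 3": "Elevated creatine kinase; autosomal recessive inheritance.",
}


def search(query, entries=ENTRIES):
    """Return names of entries containing every term in the query.

    shlex.split keeps a double-quoted phrase together as one term,
    so quoted text must appear word for word.
    """
    terms = [term.lower() for term in shlex.split(query)]
    return [name for name, text in entries.items()
            if all(term in text.lower() for term in terms)]


print(search('elevated creatine kinase'))             # words in any order
print(search('"elevated creatine kinase"'))           # exact phrase only
print(search('"elevated creatine kinase" x-linked'))  # phrase plus keyword
```

Each added condition can only keep or shrink the result list, which is why the hit count drops as the query grows.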

3.) UniProt ( )

Briefly describe UniProt and its contents. (Hint: check out the “About” section.)
Using any accession number found above, search UniProt using the “Text Search” tool under “Searches/Tools” by selecting “All Unique IDs/ACs” from the pulldown menu. Click on the ID/Accession to bring up the PIR View for this protein. Scroll down to view the variety of information available, then scroll back up and click on the linked “Entry Name” to go to the Swiss-Prot view for this entry. Open the FASTA sequence, then copy and paste it into your Word document.

4.) READSEQ ( ). ReadSeq is a very useful utility for converting among sequence formats. Read through the online help file before continuing.

Retrieve the mRNA sequence of the transcription factor Syn2 gene from GenBank (the accession number is NM_003178, but try searching on other fields as well).
Now go to the Web-based READSEQ conversion page, copy and paste the whole GenBank file into the sequence input box, and choose Pearson/FASTA as the output format. Click Perform Conversion and a new box will appear with the sequence in FASTA format. Copy and paste the sequence into your Word document.
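
To see what such a conversion involves, here is a minimal sketch of a GenBank-to-FASTA converter. It is not how ReadSeq itself is implemented, and the header it writes may differ from ReadSeq's. The record XX000001 is an invented example, not a real GenBank entry, and the function name genbank_to_fasta is hypothetical. The sketch reads only the first line of DEFINITION and handles a single record.

```python
import textwrap

# Invented minimal record in GenBank flat-file layout (not a real entry).
RECORD = """\
LOCUS       XX000001                  24 bp    mRNA    linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   XX000001
ORIGIN
        1 atggccaagc tgaccgatta ggcc
//
"""


def genbank_to_fasta(text, width=60):
    """Convert one GenBank flat-file record to FASTA text."""
    accession, definition, chunks, in_origin = "", "", [], False
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True          # sequence lines follow
        elif line.startswith("//"):
            in_origin = False         # end of record
        elif in_origin:
            # Drop the position numbers and spaces, keep the letters.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(chunks).upper()
    header = f">{accession} {definition}".rstrip()
    return "\n".join([header] + textwrap.wrap(sequence, width)) + "\n"


print(genbank_to_fasta(RECORD), end="")
```

The FASTA result keeps only a one-line description and the sequence, so all other GenBank annotation is lost in the conversion.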

The following SRS exercises are taken from your Essential Bioinformatics textbook, p. 301-302.

5.) SRS

Go to the SRS server ( ) and find human genes that are larger than 200 kilobase pairs and also have poly-A signals. Click on the “Library Page” button. Select “EMBL” in the “Nucleotide sequence databases” section. Choose the “Extended” query form on the left of the page. On the following page, select human (“hum”) in the “Division” section. Enter “200000” in the “SeqLength >=” field. Enter “polya_signal” in the “AllText” field. Press the “Search” button. How many hits do you get?
Use your knowledge and creativity to do the following SRS exercises.
Find protein sequences from Rhizobium submitted by Ausubel between 1991 and 2001 in the UniProt/Swiss-Prot database (hint: the date expression can be 1-Jan-1991 and 31-Dec-2001). Study the annotations of the sequences.
Find full-length protein sequences of mammalian tyrosine phosphatase, excluding partial or fragment sequences, in the UniProt/Swiss-Prot database (hint: the taxonomic group of mammals is mammalia). Once you get the query result, do a Clustal multiple alignment on the first five sequences from the search result…

The following sequence alignment exercises are from your Essential Bioinformatics textbook, p. 304

6.) Pairwise Sequence Alignment

In the NCBI database, retrieve the protein sequences for mouse hypoxanthine phosphoribosyl transferase (HPRT) and the same enzyme from E. coli in FASTA format. (Note: do NOT place the enzyme name in quotes; for some reason this does not work for this example. Also note that limits you set during your previous Protein searches may still be active. Make sure to uncheck the box in the Limits tab before performing this search.)
Perform a dot matrix alignment for the two sequences using Dothelix ( ). Paste both sequences in the query window and click on the “Run Query” button. The results are returned on the next page. Click on the diagonals in the graphic output to see the actual alignment.
Perform a local alignment of the two sequences using the dynamic programming based LALIGN program ( ). Make sure the two sequences are pasted separately in two different windows. Save the results in a scratch file.
Perform a global alignment using the same program by selecting the dial for “global.” Save the results and compare with those from the local alignment.
Change the default gap penalty from “-14/-4” to “-4/-1”. Run the local alignment again and compare with the previous results.
Do a pairwise alignment using BLAST (on the BLAST homepage, select the bl2seq program by clicking “Align”). Compare results with the previous methods.
Do another alignment with the exhaustive alignment program SSEARCH ( ).
Run a PRSS test to check whether there is any statistically significant similarity between the two sequences. Point your browser to the PRSS web page ( ). Paste the sequences in FASTA format in the two different windows. Use 1,000 shuffles and leave everything else as default. Click on the “Compare Sequence” button. Study the output and try to find the critical statistical parameters. Describe in a few sentences what you learned about the importance of parameter settings in pairwise sequence alignments from this exercise.
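
To see why the gap penalty settings matter, here is a minimal sketch of a local alignment score with affine gap penalties, followed by a shuffle test in the spirit of PRSS. Several things are assumptions made for illustration: a simple match/mismatch score stands in for a substitution matrix, a gap of length k is charged gap_open + k * gap_extend, the p-value is a plain permutation estimate rather than the statistical fit that PRSS reports, and the two sequences are invented. The function names are hypothetical and this is not the LALIGN, SSEARCH, or PRSS code.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap_open=-14, gap_extend=-4):
    """Best local alignment score (Smith-Waterman with affine gaps)."""
    neg = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols      # best score of an alignment ending at (i-1, j)
    f_prev = [neg] * cols    # same, but ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg] * cols
        e = neg              # ending with a gap in a, along this row
        for j in range(1, cols):
            e = max(h_cur[j - 1] + gap_open + gap_extend, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open + gap_extend,
                           f_prev[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def shuffle_test(a, b, shuffles=1000, seed=0, **scoring):
    """Return the observed score and a permutation p-value from shuffling b."""
    rng = random.Random(seed)
    observed = local_score(a, b, **scoring)
    letters = list(b)
    at_least = 0
    for _ in range(shuffles):
        rng.shuffle(letters)
        if local_score(a, "".join(letters), **scoring) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (shuffles + 1)


# Invented sequences: seq_b is seq_a with "WW" inserted in the middle.
seq_a = "ACDEFGHIKLMN"
seq_b = "ACDEFGWWHIKLMN"
print(local_score(seq_a, seq_b))                               # strict gaps
print(local_score(seq_a, seq_b, gap_open=-4, gap_extend=-1))   # cheap gaps
print(shuffle_test(seq_a, seq_b, shuffles=200, gap_open=-4, gap_extend=-1))
```

With the stricter penalties the best local alignment stops at the insertion, while the cheaper penalties let it bridge the gap and cover both halves, so the same pair of sequences gives a different score and a different alignment depending on the settings.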

You are expected to turn in answers to all of the questions in the exercises. Turn in the answers to the questions by emailing them to by 5 p.m. on Monday, Sept. 3rd.

If you have some time left feel free to browse around the web sites some more to see what is there. Also check out to see what is coming up soon.

This lab may include material from the following sources:

Mount, D.W. (2004) Bioinformatics: Sequence and Genome Analysis.

Campbell, M., & Heyer, L. (2007) Discovering Genomics, Proteomics, & Bioinformatics.

Xiong, J. (2006) Essential Bioinformatics.