Bioinformatics Exercises

Over the last two decades, information has become increasingly important in both teaching and learning biochemistry. The most obvious case is the sequencing of the human genome and many other complete genomes. In 1990, the determination of the sequence of a protein was often the topic of a full publication in a peer-reviewed journal such as Science, Nature, or The Journal of Biological Chemistry. Now entire genomes are the topic of individual research papers. The term "bioinformatics" is a catch-all phrase that generally refers to the use of computers and computer science approaches in the study of biological systems. The main chapters where this information is discussed in the text are Chapters 3 (Nucleotides, Nucleic Acids and Genetic Information), 5 (Proteins: Primary Structure), 6 (Proteins: Three-Dimensional Structure), 12 (Enzyme Kinetics, Inhibition and Regulation) and 13 (Introduction to Metabolism). Here we provide exercises appropriate to these chapters aimed at introducing the techniques of bioinformatics that involve the use of computers, Internet-accessible databases and the tools that have been developed to “mine” those databases.

General principles

1. Open-ended questions. The exercises may include some questions that have definite answers, but in many cases there will also be questions that may be answered in a number of ways, depending on the approach you take or the topic you select.

2. Stable Internet Resources. As much as possible, the exercises will be based on well-established, stable web sites. If it is necessary to use less reliable sites and/or resources, attempts have been made to provide multiple sites that perform similar functions.

3. Here are the stable online resources that will be used most frequently:

a. Genbank (http://www.ncbi.nlm.nih.gov/)

b. Protein Data Bank (http://www.rcsb.org)

c. Expasy Proteomics Server (http://us.expasy.org/)

d. European Bioinformatics Institute (http://www.ebi.ac.uk/)

e. Pfam (http://www.sanger.ac.uk/Software/Pfam/)

f. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)

g. CATH (http://www.biochem.ucl.ac.uk/bsm/cath/)

h. PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi)

i. PubMed Central (http://www.pubmedcentral.nih.gov/)

4. Answer key. Where a definite answer is known, it will be provided in an answer key. For more open-ended questions, a typical correct answer will be presented.

5. Historical perspective. If historical resources are available online (including PubMed), there may be questions designed to help students identify some of the historical roots of biochemistry and molecular biology.

Project 3 Visualizing Three-Dimensional Protein Structures Using the Molecular Visualization Programs Jmol and PyMOL

There are a number of good free visualization tools available on the Internet. Each has strengths and weaknesses. You will have two options: Jmol and PyMOL. Jmol is written in Java, has a nice user interface and uses a command set that will be quite familiar to users of RasMol or Chime. PyMOL is written in Python; the standard PyMOL user interface can be quite challenging to use, but plug-ins are available to increase ease-of-use. Many users consider PyMOL’s graphics capabilities compelling and worth the challenge of using the program interface.

Project 3 Exercises with Jmol

Jmol can be used in two different formats – as an applet built into web pages, or as a standalone application. We will be using it as a standalone application. Jmol is a Java-based application and therefore requires that you have a Java Virtual Machine (JVM) installed on your computer. It is frequently installed on computers before purchase, but you can also find it at http://www.java.com/en/download/index.jsp.

Downloading and installing Jmol. The Jmol wiki (http://wiki.jmol.org/index.php/Main_Page) has a terrific instruction page about the Jmol application (http://wiki.jmol.org/index.php/Jmol_Application).

Windows. These steps should work on most computers. If you have difficulty, please go to the Jmol wiki and search for more instructions there.

1. Download the latest stable release (not a pre-release) from http://sourceforge.net/projects/jmol/files/ in a zip format. Zip is a compressed file format that can be opened by the operating system in Windows XP or Windows Vista.

2. Create a folder for Jmol. Suggestion: c:\Program Files\Jmol.

3. View the compressed zip file from Windows Explorer. Extract only the Jmol.jar file to the c:\Program Files\Jmol folder.

4. With the c:\Program Files\Jmol folder open, right-click on the icon for Jmol.jar and select “Create shortcut”. Drag the shortcut to your desktop or your taskbar. You will now have access to Jmol from your desktop.

Macintosh. These instructions are taken directly from the Jmol wiki (http://wiki.jmol.org/index.php/Jmol_Application#Installing_Jmol_Application).

1. Download the Jmol package (either .zip or .tar.gz format) and extract/uncompress only the Jmol.jar file to the folder of your choice.

2. Simply double click on the Jmol.jar file to open Jmol.

As you go through the exercises below, you are encouraged to return to the Jmol wiki (http://wiki.jmol.org/index.php/Main_Page) for instructions and links to useful information about using Jmol.

1. Obtaining Structural Information. Please review the materials in your textbook about the secondary structure of proteins. Secondary structures include alpha helices, beta sheets and beta turns. Many programs have been written that predict the secondary structures in a protein from its primary sequence alone. Let's start again with rabbit muscle triose phosphate isomerase. Here is the primary sequence:

>gi|136066|sp|P00939|TPIS_RABIT Triosephosphate isomerase (TIM) (Triose-phosphate isomerase)

APSRKFFVGGNWKMNGRKKNLGELITTLNAAKVPADTEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNGAFTGEISPGMIKDCGATWVVLGHSERRHVFGESDELIGQKVAHALSEGLGVIACIGEKLDEREAGITEKVVFEQTKVIADNVKDWSKVVLAYEPVWAIGTGKTATPQQAQEVHEKLRGWLKSNVSDAVAQSTRIIYGGSVTGATCKELASQPDVDGFLVGGASLKPEFVDIINAKQ

a. There are a number of web servers that will predict secondary structure based on the primary sequence of a protein. Here is a list (in case one or more is not working on a given day). If all fail because their web addresses have changed, a Google search for “protein secondary structure prediction” should be successful.

i. PredictProtein (http://www.predictprotein.org/). To start, you will need to create an account on this site. Through this site you can request secondary structure predictions from 7 different online servers. If this site is available, it will enable you to complete this assignment by clicking on 2 or more of the optional services. Please note that results may take one or two days.

ii. JPred (http://www.compbio.dundee.ac.uk/www-jpred/). Click on the advanced link to the right of the sequence box. If you use the JPred server, be certain to check the box labeled “Skip searching PDB before prediction”.

Submit the rabbit muscle triose phosphate isomerase sequence to these two servers. Compare the results you receive from the different servers. Can you identify segments where the predictions are not consistent between servers?
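Once both servers have returned results, the comparison can be done by eye or with a few lines of code. The following is a minimal Python sketch, assuming you have copied each prediction into a one-letter-per-residue string of equal length; the two strings shown are made-up 20-residue examples, not real server output, and the function name is only illustrative.

```python
def disagreement_segments(pred_a, pred_b):
    """Return 1-based (start, end) residue ranges where two predictions differ."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must cover the same residues")
    segments = []
    start = None
    for i, (a, b) in enumerate(zip(pred_a, pred_b), start=1):
        if a != b:
            if start is None:
                start = i          # a new stretch of disagreement begins here
        elif start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:          # disagreement runs to the last residue
        segments.append((start, len(pred_a)))
    return segments


# Hypothetical predictions (H = helix, E = strand, - = coil).
server_1 = "--EEEEE---HHHHHHHH--"
server_2 = "--EEEE----HHHHHHHHH-"
print(disagreement_segments(server_1, server_2))
```

Servers do not all use the same letters for coil (for example "-" or "C"), so you may need to convert the strings to a common alphabet before comparing them.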

b. The structure of rabbit muscle triose phosphate isomerase has been determined by X-ray crystallography. Please go to the Protein Data Bank web server (http://www.rcsb.org/pdb/home/home.do) and search for 1R2R (that is the PDB ID for this protein). To do so, go to the blue band at the top of the page and select “PDB ID or Text”, enter 1R2R in the box, and click on “Search”. The page that comes up contains several tabs: Summary, Sequence, Derived Data, Seq. Similarity, 3D Similarity, Literature, Biol. & Chem., Methods, Geometry, and Links. The page normally opens to the Summary tab. Click on the Sequence tab. The secondary structure shown here comes from an analysis of the actual 3D structure (not a prediction), calculated according to an implementation of the method of Kabsch and Sander (1983) Biopolymers 22, 2577-2637. The assignments are: H=helix; B=residue in isolated beta bridge; E=extended beta strand; G=3₁₀ helix; I=pi helix; T=hydrogen-bonded turn; S=bend. Compare your predicted results with the results presented on the PDB site.
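Prediction servers usually report only three states (helix, strand, coil), while the Kabsch and Sander assignments use the seven codes listed above. A minimal Python sketch of one way to put the two on the same footing follows. The reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is a common convention but not the only one, and the example strings are invented for illustration.

```python
# Assumed mapping from Kabsch-Sander codes to three states; other
# reductions exist, so treat this table as one reasonable choice.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10 and pi helices -> helix
    "E": "E", "B": "E",             # extended strand, isolated bridge -> strand
    "T": "C", "S": "C",             # turn, bend -> coil
}


def reduce_dssp(assignment):
    """Convert a string of Kabsch-Sander codes to H/E/C; blanks become C."""
    return "".join(DSSP_TO_THREE_STATE.get(code, "C") for code in assignment)


def percent_agreement(predicted, observed):
    """Percentage of residues with the same three-state label."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)


# Hypothetical 20-residue example, not taken from 1R2R.
observed_codes = "SEEEEETT GGGSHHHHHHB"
predicted = "CEEEEECCCCCCCHHHHHHC"
print(reduce_dssp(observed_codes))
print(percent_agreement(predicted, reduce_dssp(observed_codes)))
```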

c. As a first attempt at molecular visualization, please return to the Summary tab and follow the links on the PDB site for "Download File." You can download the file in a number of formats. It is best to download the file in “PDB file (text)” format for use with Jmol. Save the structure file as 1R2R.pdb on your computer (suggested folder: My Documents/PDB Files). Open the Jmol program. Then use the dropdown menu File..Open to open 1R2R.pdb. You will initially see a cartoon model that represents helices as magenta corkscrews, sheets as yellow arrows and waters as small red spheres. To rotate the image, hold down the (left) mouse button while dragging the mouse over the image. You can control the view in Jmol in three different ways: dropdown menus, right-click menus and scripting. Perform the following steps to clean up the image a bit using the dropdown and right-click menus (for a one-button mouse, use Ctrl-click).

i. Dropdown: Display..Select..Water

ii. Dropdown: Display..Atom..None

iii. Dropdown: Display..Select..Hetero

Now you should be able to see the alpha helix and beta sheet structures in rabbit muscle triose phosphate isomerase without the red water spheres. Take some time to experiment with the other drop-down menu options on Jmol.

In addition to dropdown and right-click menus, Jmol also has a Script Console window that enables you to select specific atoms or parts of a structure (amino acid residues for example), then change the way they appear. To open the Jmol Script Console, select File..Console.. from the Jmol Dropdown menu. Then enter these commands at the $ prompt.

iv. select hetero and not water (selects non-protein parts of the structure excluding water)

v. spacefill (a van der Waals radius representation)

vi. color CPK (standard chemistry color scheme)

vii. select protein

viii. cartoon off

ix. wireframe 30

x. spacefill 100 (These combined commands yield a ball-and-stick structure of the protein.)

xi. zoom 200 (This gives a 2X expansion of the view. You can also zoom in on the structure in the viewing window by holding down the Shift key on your keyboard while using a left-mouse click-and-drag from the top to the bottom of the window. Experiment with this.)

xii. Now convert the protein back to a cartoon with the following four commands:

1. select protein

2. wireframe off

3. spacefill off

4. cartoon
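If you expect to repeat the console commands from steps iv through xi, it can be convenient to keep them in a plain-text script file. The Python sketch below simply writes those commands, one per line, to a file; the function names and the file name ball_and_stick.spt are illustrative choices, not part of Jmol.

```python
# The console commands from steps iv-xi above, in order.
BALL_AND_STICK = [
    "select hetero and not water",
    "spacefill",
    "color CPK",
    "select protein",
    "cartoon off",
    "wireframe 30",
    "spacefill 100",
    "zoom 200",
]


def build_script(commands):
    """Join commands into script text, one command per line."""
    return "\n".join(commands) + "\n"


def save_script(path, commands):
    """Write the commands to a text file that Jmol can read as a script."""
    with open(path, "w") as handle:
        handle.write(build_script(commands))


print(build_script(BALL_AND_STICK))
```

Calling save_script("ball_and_stick.spt", BALL_AND_STICK) would create the file; with 1R2R.pdb already open, entering script ball_and_stick.spt at the console prompt should then replay the commands, provided the file is in a folder Jmol can find.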

2. Exploring the Protein Data Bank. In the first problem, we visited the Protein Data Bank (PDB). We will explore that site in more detail now. If you encounter difficulties at any point in this exercise, you may be able to find your way using the Search box on the main site page or the Help files (on the left side of the page).

The PDB is a repository of macromolecular structures. Perhaps the most important skill for a PDB site user is the ability to find the structures they are seeking. On the home page (http://www.rcsb.org/pdb/home/home.do), the Help menu on the left side of the page includes Video Tutorials. These Flash animations will instruct you on navigating the site, searching for proteins and using the tools and viewers on the site.

Structures in the PDB are assigned PDB IDs: four-character alphanumeric codes that uniquely identify each structure. For example, 4HHB is a hemoglobin structure and 8GCH is a chymotrypsin structure. If you know the PDB ID, then you can use that to search the PDB. You may ask: why would I know that code unless I was the crystallographer who determined that structure? Most scientists who determine macromolecular structures are highly motivated to publish their findings in journals such as Science, Nature, Journal of Biological Chemistry, Journal of Molecular Biology and Protein Science. These journals have an agreement with the PDB that requires authors to submit their structures to the PDB before they will publish the article in their journal. Also, the figures in the text showing structures of proteins and nucleic acids list the corresponding PDB ID. For our first PDB search, we're going to find a PDB ID in a journal article, then find that structure on the PDB site.
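When copying a PDB ID out of an article, a quick format check can catch transcription slips. The Python sketch below assumes the classic ID layout of one digit from 1 to 9 followed by three letters or digits; that layout is an assumption about the format, and passing the check does not mean the entry actually exists in the PDB.

```python
import re

# Assumed classic layout: a digit 1-9, then three letters or digits.
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text):
    """Return True if text has the four-character PDB ID layout."""
    return PDB_ID_PATTERN.fullmatch(text.strip()) is not None


for candidate in ["4HHB", "8GCH", "1R2R", "HHB4", "1R2R5"]:
    print(candidate, looks_like_pdb_id(candidate))
```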

Go to the Journal of Biological Chemistry web site (http://www.jbc.org) and search for this paper using the QUICK SEARCH menu near the top of the page:

Sampathkumar Parthasarathy, Kandiah Eaazhisai, Hemalatha Balaram, Padmanabhan Balaram, and Mathur R. N. Murthy.

Structure of Plasmodium falciparum Triose-phosphate Isomerase-2-Phosphoglycerate Complex at 1.1-Å Resolution. J. Biol. Chem. 2003 volume 278, pages 52461-52470.

Download the article (Full text or PDF – it’s free). Go to the footnotes section and find the four-character PDB ID code. Then go to the Protein Data Bank main page. Type the PDB ID in the search box and click the Search button. You should be taken to the Structure Summary page for this enzyme. The Structure Summary page contains links to many related resources. Try to do each of the following:

a. Download the PDB (structure) file for this protein to your computer. Remember where you put it (suggested folder: My Documents/PDB Files; suggested name: 1o5x.pdb). In problem 3, you're going to study this structure using Jmol.

b. Download the protein sequence in FASTA format – click on Download Files on the right hand side of the page and select FASTA sequence. Suggested file name: 1o5x_FASTA.txt.

c. Find the still images of this protein on the 1o5x Summary page in the Biological Assembly box. Click on the link to More Images…. To save an image on the page that appears, just right-click on it (Ctrl-click for a one-button mouse) and select the option that lets you save the file (in Internet Explorer, the command is "Save Picture As.."; in Firefox and Safari, the command is "Save Image As..").

d. Return to the "Summary" page for 1o5x. Click on the "Links" tab. Follow the links for 1o5x to the sites at PDBSum and the IMB Jena Image Library. Collect still images from each of these sites. Make sure you keep a record of where you found each image.
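The FASTA file saved in step b is plain text: each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. The following minimal Python sketch reads such text into (header, sequence) pairs. The record shown is a made-up toy example, and the function name is illustrative.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-line record for illustration only.
example = ">toy_protein chain A\nMKTAYIAK\nQRQISFVK\n"
for name, sequence in read_fasta(example):
    print(name, len(sequence), sequence)
```

To use it on your own download, read the file first, for example with open("1o5x_FASTA.txt").read(), and pass the resulting text to read_fasta.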

3. Examining Protein Structures. In Problem 2, you should have saved the PDB file for 1o5x, entitled "Plasmodium Falciparum TIM Complexed To 2-Phosphoglycerate." We're going to use Jmol to explore this structure. We'll be particularly interested in identifying secondary structures and looking at the active site.
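Before opening the file in Jmol, it can help to see what it contains. A PDB file is plain text in which the first six characters of each line name the record type (for example HELIX, SHEET, ATOM or HETATM), and for HETATM lines the residue name occupies columns 18 to 20. The Python sketch below counts record types and hetero groups under those format assumptions. The sample lines are invented and shortened, and LIG is a placeholder residue name rather than the ligand name used in 1o5x.

```python
from collections import Counter


def summarize_pdb(lines):
    """Count record types and HETATM residue names in PDB-format lines."""
    records = Counter()
    hetero_groups = Counter()
    for line in lines:
        record = line[:6].strip()         # columns 1-6: record name
        if not record:
            continue
        records[record] += 1
        if record == "HETATM":
            hetero_groups[line[17:20].strip()] += 1   # columns 18-20
    return records, hetero_groups


# Invented, truncated sample lines for illustration only.
sample = [
    "HELIX    1   1 (rest of record omitted)",
    "SHEET    1   A (rest of record omitted)",
    "HETATM 2001  O   HOH A 301",
    "HETATM 2002  O   HOH A 302",
    "HETATM 2100  P   LIG A 401",
]
records, hetero_groups = summarize_pdb(sample)
print(dict(records))
print(dict(hetero_groups))
```

For the real file, pass an open file object, for example summarize_pdb(open("1o5x.pdb")). The HELIX and SHEET counts give a first impression of the secondary structure, and the hetero groups other than HOH point to bound ligands worth inspecting near the active site.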

a. You're going to expand on the Question 1c exercise. Open Jmol on your computer. If you have not installed it already, please see the opening paragraph for the exercises in this chapter.

b. Open the file 1o5x.pdb. When you first open it, you will see a cartoon representation of the structure with the waters shown as small red spheres. Now it's time to explore the drop-down menus in Jmol. There are 7 drop-down menus in Jmol: File, Edit, Display, View, Tools, Macro, and Help. Spend a few minutes trying each command in each of the menus. Here are a few that are very helpful:

i. File..Export..Export Image enables you to export an image you have created as an image in jpg format.

ii. Edit..Copy Image copies the image to memory. You can then paste the same image into a word processor or presentation file.

iii. Display..Zoom allows you to enlarge or shrink your structure.

iv. Display..Axes displays the x, y and z axes in the Jmol window.