Objective: To examine mutS/hMSH2 homologs and compare their amino acid sequences via a multiple sequence alignment. After constructing the alignment we will look at regions within the gene that appear to have been strongly conserved during the evolutionary process. We will test our observations in an empirical manner. We will learn to access tools that are available for making multiple sequence alignments.

NOTE: For this lab, check that your browser is Java 1.5 (or higher) enabled. JalView will not work with versions earlier than 1.5.

Retrieving Amino Acid Sequences:

We will gather the amino acid sequences for four paralogs of the human hMSH2 gene and two orthologs from other species. In order to do this we need to acquire the SwissProt Accession Numbers for these genes. We can do this by visiting the SwissProt web site at

http://www.expasy.org/sprot/

However, in the interest of saving time, and to guarantee that we are all working with the same amino acid sequences, we have located the Accession Numbers and placed them in the following table:

Gene / SwissProt Accession #
MSH3 (Human) / P20585
MSH2 (Human) / P43246
MSH4 (Human) / O15457
MSH5 (Human) / O43196
MSH6 (Human) / P52701
MSH3 (Mus musculus) / P13705
MSH3 (Yeast) / P25336

Question 1: Which of the above sequences are paralogs and which are orthologs?

Question 2: Why might we consider doing multiple sequence alignments with paralogs and orthologs? What evolutionary information might be gained from such alignments?

For the next part of the investigation we will follow the information in Claverie and Notredame (BFD) p. 294.

Enter the URL

http://www.expasy.org/sprot/sprot-retrieve-list.html

in the address line of your browser.

a.  On the Format line click the FASTA radio button.

b.  Enter the accession numbers for the mutS paralogs in the Sequence window of the page

c.  After entering these numbers, click the Create FTP file button
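The same retrieval can also be scripted. The following is a minimal Python sketch, not part of the lab procedure itself: it assumes the UniProt REST address `https://rest.uniprot.org/uniprotkb/<accession>.fasta` rather than the ExPASy form used above, and the names `fasta_url` and `fetch_fasta` are hypothetical helpers chosen for illustration. Only the first two accession numbers from the table are used as an example.

```python
from urllib.request import urlopen

# Assumption: UniProt serves one FASTA record per accession at this address.
UNIPROT_FASTA = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession):
    """Build the download address for one accession number."""
    return UNIPROT_FASTA.format(accession=accession.strip().upper())


def fetch_fasta(accessions, timeout=30):
    """Download each record and return them as one FASTA-formatted string."""
    parts = []
    for acc in accessions:
        with urlopen(fasta_url(acc), timeout=timeout) as response:
            parts.append(response.read().decode("utf-8"))
    return "".join(parts)


# Two accession numbers taken from the table above.
accessions = ["P20585", "P43246"]
urls = [fasta_url(acc) for acc in accessions]
print(urls)
# fetch_fasta(accessions) would perform the actual download (needs network access).
```

Pasting the sequences by hand, as in the steps above, remains the procedure for this lab; the sketch only shows what the form is doing on your behalf.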

Question 3: Copy and paste all of the sequences that are generated into the space below. Make sure to include the header line that begins with the symbol ‘>’. This is an essential part of the FASTA-formatted sequence file. You may also want to paste this information into a NotePad file. You will be using this data shortly to obtain your alignment.
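If you want to check that a pasted file is still valid FASTA (every record starts with a ‘>’ header line followed by sequence lines), a short script can do it. This is a minimal Python sketch under stated assumptions: the function name `parse_fasta` is hypothetical, and the two example records are made-up fragments, not real SwissProt entries.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example records (not real protein sequences).
example = """>demo_1 made-up fragment
MSRRKPAS
GGLAASSS
>demo_2 made-up fragment
MAVQPKET
"""
records = parse_fasta(example)
for header, seq in records:
    print(header.split()[0], len(seq))
```

A file that lost its header lines during copying would raise the `ValueError` instead of silently merging sequences.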

Question 4: Repeat the above procedure for the 3 orthologs given in the table. If you are also creating NotePad files, create a separate file for this result.

The data that you have collected are now ready to be fed into the ClustalW program that will do the multiple sequence alignment of our mutS genes.

Question 5: Before beginning the multiple sequence alignment, which of the two groups (paralogs or orthologs) do you expect to be the more functionally constrained? Give the reasons for your choice.

For this part of our investigation we will be following the material in Claverie and Notredame (BFD) pp. 296–300.

Enter the URL:

http://www.ebi.ac.uk/clustalw

in the address window of your browser. You are presented with a fairly elaborate page with several options that can be set. Don’t panic (yet): we will be changing only a few of these from the default settings. In the meantime, scroll down to the Sequence window.

Question 6: Copy and paste the sequences for the mutS paralogs from Question 3 above into the Sequence window. Make sure to include the header line with each sequence. After doing this:

a.  Choose Full from the Alignment pull-down menu.

b.  Choose aln w/numbers from the Output Format menu.

c.  Choose Input from the Output Order menu

d.  Click on the Run button at the bottom of the page

e.  Review the output and make sure that the Alignment Section appears in the center of the output. This is important for the rest of our investigation.

f.  Save the web page to the Laboratory 6 section of your H drive. Do not close the page.

On page 305 of your lab manual is an explanation of the markings that appear below each line of the multiple alignment. We review them here. The markings are a star (*), a colon (:) and a period (.). Their meanings are as follows:

1.  (*) The column is conserved for all of the sequences in the multiple sequence alignment.

2.  (:) All amino acid residues in the column have roughly the same size and the same hydropathy, i.e., they appear to be functionally constrained.

3.  (.) The size or the hydropathy (but not both) was preserved in the course of evolution.
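The way a marking is assigned to one alignment column can be sketched in a few lines of Python. This is an illustration only, not ClustalW's own code. The residue groups below are the "strong" and "weak" conservation groups as we recall them from the Clustal documentation, so treat the exact lists as an assumption; the function names and the three short aligned rows are made up for the example.

```python
# Assumption: strong (:) and weak (.) residue groups as listed in the Clustal docs.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one column of aligned residues."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # columns containing a gap get no marking
    if len(residues) == 1:
        return "*"  # identical residue in every sequence
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def consensus_line(aligned_rows):
    """Build the marking line for a list of equal-length aligned rows."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned_rows))


# Made-up aligned rows, one string per sequence.
rows = ["MKSLA-",
        "MRALGW",
        "MKCIDW"]
markings = consensus_line(rows)
print(repr(markings))
```

Reading the rows column by column: M/M/M is fully conserved, K/R/K and L/L/I fall inside a strong group, S/A/C falls only inside a weak group, and the last two columns get no marking.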

Your overall goal in a multiple sequence alignment is to identify important positions. In particular, you want to find the amino acids that have not mutated or are functionally constrained. A good block for starting such an investigation has at least one to three stars, five to seven colons, and a few periods sprinkled about for every 10 – 30 amino acids. Such a block may extend over more than one line of the displayed alignment and may be over 100 amino acids long.
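The rule of thumb above can be applied mechanically by sliding a window along the line of markings and counting the symbols. The following Python sketch is one possible reading of that rule, not a standard tool: the function name `score_windows`, the default thresholds, and the example marking line are all assumptions made for illustration. Positions are alignment columns (counted from 1), not hMSH3 residue numbers.

```python
def score_windows(markings, window=30, min_stars=1, min_colons=5):
    """Return (start_column, stars, colons, periods) for every window
    of the marking line that meets the star and colon thresholds."""
    hits = []
    for start in range(0, len(markings) - window + 1):
        chunk = markings[start:start + window]
        stars = chunk.count("*")
        colons = chunk.count(":")
        periods = chunk.count(".")
        if stars >= min_stars and colons >= min_colons:
            hits.append((start + 1, stars, colons, periods))
    return hits


# Made-up marking line: 20 unmarked columns, a short marked stretch, 20 more.
example = " " * 20 + "*::.:: :*.:" + " " * 20
hits = score_windows(example, window=15)
print(hits[0], hits[-1])
```

Overlapping windows that all pass point to the same underlying block, so in practice you would report the span from the first passing window to the end of the last one.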

Question 7: Identify the conserved region(s) in your alignment. Give the approximate locations of these regions relative to the hMSH3 sequence.

Question 8: Is any one of these more promising than the others, i.e., seem to have a higher percentage of the so-called important or conserved positions?

Open the JalView portion of the ClustalW results. This is actually a Java Applet that is running on your computer. It is used for editing the alignment generated by ClustalW. We are not planning to do that now. Our purpose is just to compare its presentation to the ClustalW results.

Question 9: What is shown in the graph below the sequence alignment in JalView? How does this information compare to your answers to Questions 7 and 8 above?

Our final observation concerns the Guide Tree or Cladogram shown at the end of the ClustalW page. DO NOT CONFUSE THIS WITH A PHYLOGENETIC TREE. The tree shown here merely indicates the order in which ClustalW compared the sequences by taking the two most similar sequences first and then adding in the others.
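To get a feel for what "most similar first" means, the sketch below ranks pairs of sequences by percent identity. It is a simplified illustration in Python, not ClustalW's actual procedure: ClustalW works from its own pairwise alignment scores, whereas this sketch simply counts identical residues in rows that are already aligned. The function names and the three short rows are made up.

```python
from itertools import combinations


def percent_identity(row_a, row_b):
    """Percent of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def rank_pairs(aligned):
    """Return (identity, name_1, name_2) tuples, most similar pair first."""
    scores = [
        (percent_identity(aligned[m], aligned[n]), m, n)
        for m, n in combinations(aligned, 2)
    ]
    return sorted(scores, reverse=True)


# Made-up aligned rows keyed by a made-up sequence name.
aligned = {
    "seqA": "MKTAYIAK",
    "seqB": "MKTAYLAK",
    "seqC": "MRSA-LGK",
}
ranking = rank_pairs(aligned)
for identity, first, second in ranking:
    print(f"{first} vs {second}: {identity:.1f}%")
```

The pair at the top of the ranking is the one a "most similar first" strategy would join first; the remaining sequences are added in order of decreasing similarity.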

Question 10: In what order were the sequences added to the comparison? (Start with the sequences most closely related to hMSH3 and end with those that are least closely related.)

Save this web page to the Lab 6 folder in your Bioinformatics folder in your H drive as ClustalW1.

Now we will repeat the ClustalW process for the three orthologs of MSH3. Add the three sequences to the Sequence window and choose the same options that you chose for the alignment of paralogs.

Question 11: Using the location numbers from hMSH3, which regions in this alignment appear to be strongly conserved?

Question 12: Which of your two multiple sequence alignments seems to be more strongly aligned?

Save this web page to the Lab 6 folder in your Bioinformatics folder in your H drive as ClustalW2.

If our goal is to find the strongly conserved regions within the proteins, then it does not make sense to deal with the paralogs and orthologs separately. Return to the ClustalW home page, paste the sequences for the five paralogs into the Sequence window once again, and then add the two orthologs to these sequences. Now, using the same options as in your first two runs, press the Run button. This will generate a third multiple sequence alignment for all 7 protein sequences. Save this web page in your Lab 6 folder as ClustalW3.

Question 13: Using the numbering scheme for the hMSH3 gene, identify the strongly conserved regions of this alignment.

Question 14: What does your observation in your answer to Question 13 say about the relative rates of evolution between the orthologs vs that between the paralogs? Briefly explain your reasoning.

Finally, we can test the strength of evolutionary conservation in the region(s) you have identified. To do this, we will test our alignment against a sequence that is even more distantly related to the human hMSH3 sequence. Return to SwissProt to find such a sequence. We will follow the procedure laid out earlier in Chapter 9 of our lab manual on pp. 290–295. We begin by BLASTing the hMSH3 gene.

Enter the URL

http://www.expasy.org/cgi-bin/BLASTEMBnet-CH.pl

After the ExPASy server page appears:

a.  Enter the Accession Number P20585 (for hMSH3) in the box that is provided.

b.  If it is not highlighted (it probably is) click on the blastp radio button.

c.  Click on the check box “exclude fragment sequences”.

d.  Slide down to the Options section and set the number of best scoring sequences and best alignments to 1000.

e.  Set the E-value threshold to 0.1

f.  Click the Run BLAST button

g.  Click NiceBlastView when the next screen appears.

This will generate a very long list of hits. Scroll down the list until you reach the lower Scores, around 100 – 110. These hits should have e-values in the 10^-20 to 10^-30 range. Choose a sequence from a non-human species that is similar along the full length of hMSH3 and that has at least 800 amino acids. Check the box to the left of the score.
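The selection criteria above (score around 100 – 110, non-human, at least 800 amino acids) amount to a simple filter. The Python sketch below shows that filter on a hand-typed list; the hit records, their field names, and the identifiers `HYPO_A` to `HYPO_D` are entirely hypothetical and do not come from a real BLAST run. The sketch does not check whether a hit is similar along the full length of hMSH3; that still has to be judged from the alignment display.

```python
def pick_candidates(hits, score_range=(100, 110), min_length=800,
                    human="Homo sapiens"):
    """Keep non-human hits whose score and length meet the lab's criteria."""
    low, high = score_range
    return [
        hit for hit in hits
        if low <= hit["score"] <= high
        and hit["length"] >= min_length
        and hit["species"] != human
    ]


# Hypothetical hits typed in by hand for illustration only.
hits = [
    {"id": "HYPO_A", "species": "Homo sapiens", "score": 105, "evalue": 1e-22, "length": 934},
    {"id": "HYPO_B", "species": "Example species 1", "score": 108, "evalue": 3e-24, "length": 1021},
    {"id": "HYPO_C", "species": "Example species 2", "score": 102, "evalue": 5e-21, "length": 410},
    {"id": "HYPO_D", "species": "Example species 3", "score": 640, "evalue": 1e-180, "length": 1090},
]
candidates = pick_candidates(hits)
print([hit["id"] for hit in candidates])
```

Here only the second record passes: the first is human, the third is too short, and the fourth scores far above the target range and is therefore too closely related for this test.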

Question 15: Which sequence did you choose? What is the e-value of the sequence comparison of this sequence with hMSH3?

In the pull-down menu at the top of the page, next to the phrase “Send Selected Sequences to:”, choose Retrieve Sequences (FASTA format) and click Submit.

Question 16: Paste your result here:

Question 17: Return to ClustalW and add this sequence to the other 7 sequences and run ClustalW again.

Question 18: What can be gained from comparing this sequence, which is rather distantly related in terms of score and e-values from MSH3, with the other 7 aligned sequences?

Question 19: Are there any regions that seem to be functionally conserved? (You may have to relax your criterion on *’s a bit.) Identify the region(s) using the ID numbers from hMSH3.

For Homework

Return to your third alignment that you saved as ClustalW3. Open the JalView window and look at the color coding of the sequence alignment and also the graph below the sequences. Move towards the end of the alignment. Notice that the numbering goes beyond that of the sequence alignment numbering that appears on the main ClustalW page.

Question 20: Why is there a difference in the numbering?

Around notation 1130 on the JalView presentation of the alignment is a column that is highlighted in blue. It reads top to bottom V, C, I, L, C, M, I. We want to consider this sequence of amino acids. Please be aware that the JalView display may be temporary (ClustalW only keeps your results for 24 hours). Therefore, we should locate this column in the alignment section of the ClustalW web page that we saved as ClustalW3.

Question 21: What is the number of the ClustalW alignment column that corresponds to the JalView column containing V, C, I, L, C, M, I?

Question 22: What is the most direct and simple route taken by natural selection to install these hydrophobic, non-polar amino acids at this location in each gene? In other words, determine the most likely pathway (most parsimonious pathway) of codon substitutions (minimum number of nucleotide substitutions) that would interconvert Methionine, Leucine, Isoleucine, Valine, and Cysteine. Build a pathway, starting from one of these amino acids, that shows how each of the other four could be obtained by a minimum number of changes. From this analysis, which amino acid(s), and which codon, is more likely to be ancestral, i.e., which amino acid and codon was more likely to reside at this position in the common ancestor of each of these genes?
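Counting nucleotide substitutions between codons by hand is error-prone, so a small helper can be used to check your pathway. The Python sketch below assumes the standard genetic code (written with DNA bases T, C, A, G); the names `CODON_TABLE`, `codons_for`, and `min_changes` are hypothetical. The example calls use tryptophan, arginine, and aspartate, which are not among the amino acids in the question, so applying the helper to the column above is left to you.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def min_changes(aa_from, aa_to):
    """Return (number_of_substitutions, codon_from, codon_to) for the
    closest pair of codons encoding the two amino acids."""
    best = None
    for c1 in codons_for(aa_from):
        for c2 in codons_for(aa_to):
            distance = sum(x != y for x, y in zip(c1, c2))
            if best is None or distance < best[0]:
                best = (distance, c1, c2)
    return best


print(min_changes("W", "R"))  # Trp to Arg
print(min_changes("W", "D"))  # Trp to Asp
```

When several codon pairs are equally close, the helper reports only the first one it finds, so list all codons with `codons_for` before deciding which codon is the most likely ancestor.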