Aligning protein sequences by hand
One of the most powerful tools in the bioinformatician's toolbox is sequence alignment. Let's see why this is so with a few examples.
- Suppose we have cloned and sequenced a protein that we believe to be a protease. Which protease could it be? A PubMed search tells us more about proteases and protease families, and a database search using BLAST tells us that this protein is a remote member of the serine protease family. So we align it against the most similar serine protease. This tells us that the overall sequence identity is only about 29%. That is not very much, but the local sequence identity around the three active site residues is considerably higher, and because of that we can be confident that our new protein is a serine protease.
- Suppose we have a protein that we can easily obtain in large quantities, and we want to use it in a bioreactor. Unfortunately, the industrial process requires a temperature of 65 °C, but our protein is heat labile and denatures at temperatures higher than 52 °C. What can we do? Introducing some mutations may make the protein more stable. Simple, but which of the 298 amino acids should be mutated? The only thing we know is that we should not mutate in or near the active site, because that would alter the specificity. This is where sequence alignments come in. We align our protein against a series of family members that have been purified from thermophilic members of the domains Bacteria and Archaea. We look at the multiple sequence alignment, and if we see positions where all the thermostable proteins have one type of residue and our protein has another, we may have a site that we could mutate. If such a position is also far away from the active site, and not in an awkward position (like the first residue, because of cleavage of the pro-peptide, or in the middle of the epitope that our monoclonal antibody recognizes), we have potentially found a stabilizing mutation.
- A third example will pop up in due time.
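The column scan described in the second example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name candidate_sites, its parameters, and the toy sequences are hypothetical, and the sequences are assumed to be already aligned to the same length with '-' for gaps.

```python
def candidate_sites(target, thermostable, excluded=()):
    """Return (column, our_residue, thermostable_residue) tuples.

    target       -- our aligned sequence, gaps written as '-'
    thermostable -- list of aligned thermostable homologues (same length)
    excluded     -- 1-based alignment columns to skip, e.g. near the active site
    """
    if any(len(seq) != len(target) for seq in thermostable):
        raise ValueError("all sequences must have the same aligned length")

    sites = []
    for col, ours in enumerate(target, start=1):
        column = {seq[col - 1] for seq in thermostable}
        if len(column) != 1:
            continue  # the thermostable proteins disagree at this position
        theirs = column.pop()
        if ours == "-" or theirs == "-":
            continue  # a gap is not a point mutation
        if ours == theirs or col in excluded:
            continue
        sites.append((col, ours, theirs))
    return sites


# Toy data, invented for illustration only.
ours = "MKTAYIAKQR"
thermo = ["MKTPYIAKER", "MRTPYIAKER", "MKTPYLAKER"]
print(candidate_sites(ours, thermo))                # [(4, 'A', 'P'), (9, 'Q', 'E')]
print(candidate_sites(ours, thermo, excluded={9}))  # [(4, 'A', 'P')]
```

The excluded argument is where knowledge about the active site, the pro-peptide cleavage site, or an antibody epitope would enter.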
But let's start with some examples of alignments.
The question for the first example given below is: "Write down in your own words why the green alignment is better than the red one, and why this seems to be wrong at first."
If we have two sequences, with two different alignments:
A   TVTVTGNSITIT     A   TVTVTGNSITIT
B1  TVTVTG--ITIT     B2  TVTVT--GITIT
then the left alignment looks much better, but look at the corresponding structures that are shown below:
Structure A: TVTVTGNSITIT
the structure that would lead to alignment B1
TVTVTGNSITIT
TVTVTG--ITIT
the structure that would lead to alignment B2
TVTVTGNSITIT
TVTVT--GITIT
1 Aligning sequences by hand
The alignment given below is very straightforward to achieve and does not require software.
-ASTRGFHILTYHGVCIPPYILRTSA
AATTKGFHVISYHGICLPPYMIRT--
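To put a number on how similar these two sequences are, the percentage identity of an alignment can be computed with a short Python sketch. The function name percent_identity is hypothetical, and the choice to divide by the number of columns without a gap is an assumption; other conventions (dividing by the alignment length or by the shorter sequence) give different numbers.

```python
def percent_identity(a, b):
    """Percentage of identical residues among the columns without a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(x == y for x, y in aligned)
    return 100.0 * matches / len(aligned)


seq_a = "-ASTRGFHILTYHGVCIPPYILRTSA"
seq_b = "AATTKGFHVISYHGICLPPYMIRT--"
print(round(percent_identity(seq_a, seq_b), 1))  # 60.9 (14 of 23 aligned columns)

# Local identity: slice both rows over the same columns, here the GFH stretch.
print(percent_identity(seq_a[5:8], seq_b[5:8]))  # 100.0
```

Slicing both rows over the same columns gives a local identity, which is the kind of number used in the protease example at the start of this text.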
However, the following alignment of two sequences was not straightforward and required some thinking.
-ASTRGFHILTYHGVCIPPYILRTSA
AATTQPF--ISFHSICLGNFMIRS--
Nevertheless, I think that this alignment is the best that can be achieved for these two sequences. How can I know that? How did I make this alignment?
Let's think about an alignment. An alignment is a representation of a whole series of events that took place during evolution and that left their traces in the sequence. So, the more likely it is that something happens (or does not happen!) during evolution, the more important it is to have this "something" show up in the alignment.
What kinds of "something" are important? Let's give a few examples:
- It is much easier to mutate than to insert or delete (indel).
- Once nature has decided on an indel, its length is less important, but longer indels are still more difficult to make than shorter ones.
- Active site residues don't mutate.
- Residues tend to mutate into similar residues (e.g. V <-> I; S <-> T; etc).
- Residues mutate more easily to residues encoded by similar codons.
- Cysteines that sit in cysteine bridges don't mutate easily.
- Surface residues mutate more easily than core residues.
- Core residues mutate more easily when they make fewer contacts.
- It is hard to mutate a glycine that sits somewhere with torsion angles that other residues cannot have.
- Etc.
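The first few rules in this list can be turned into a toy scoring function: identical residues score best, similar residues score a little less, and opening a gap costs much more than extending one. This is a minimal sketch, not a real substitution matrix. The similarity groups and all score values are arbitrary assumptions chosen for illustration, and the names SIMILAR_GROUPS and score_alignment are hypothetical.

```python
# Assumed groups of "similar" residues; real matrices are far more detailed.
SIMILAR_GROUPS = [set("VILM"), set("ST"), set("FYW"), set("KR"), set("DE"), set("NQ")]


def is_similar(x, y):
    return any(x in group and y in group for group in SIMILAR_GROUPS)


def score_alignment(a, b, match=2, similar=1, mismatch=-1,
                    gap_open=-4, gap_extend=-1):
    """Score two aligned rows; all score values are illustrative assumptions."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # Opening an indel is expensive, making it longer is cheap.
            score += gap_extend if in_gap else gap_open
            in_gap = True
            continue
        in_gap = False
        if x == y:
            score += match
        elif is_similar(x, y):
            score += similar
        else:
            score += mismatch
    return score


print(score_alignment("AVT", "AIT"))      # 5: two identities and one V/I pair
print(score_alignment("AAAAA", "A--AA"))  # 1: one indel of length two
print(score_alignment("AAAAA", "A-A-A"))  # -2: two indels of length one
```

With these assumed values one indel of length two scores better than two separate indels of length one, which is the behaviour the second rule asks for. The sketch also penalizes gaps at the ends of the alignment, which is a simplification.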
We will now start working on sequence alignments. We will slowly add one rule after the other, and learn a few new physicochemical properties of amino acids while we are doing this.
2 Hydrophobicity in sequence alignment
For each of the following examples, work out which is the better alignment: the one on the left or the one on the right.
CPISRTWASIFRCW CPISRTWASIFRCW
CPISRT---LFRCW CPISRTL---FRCW
CPISRTSASIFRCW CPISRTSASIFRCW
CPISRT---TFRCW CPISRTT---FRCW
CPISRTGASIFRCW CPISRTGASIFRCW
CPISRTA---FRCW CPISRT---AFRCW
CPISRTRASEFRCW CPISRTRASEFRCW
CPISRTK---FRCW CPISRT---KFRCW
CPISRTIASNFRCW CPISRTIASNFRCW
CPISRTH---FRCW CPISRT---HFRCW
CPISRTEASDFRCW CPISRTEASDFRCW
CPISRT---NFRCW CPISRTN---FRCW
CPISRTEASNFRCW CPISRTEASNFRCW
CPISRTQ---FRCW CPISRT---QFRCW
CPISRTFASTFRCW CPISRTFASTFRCW
CPISRT---YFRCW CPISRTY---FRCW
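One way to check your reasoning for pairs like these is to compare the hydropathy of the residues that each alignment places opposite each other. The sketch below uses the commonly quoted Kyte-Doolittle hydropathy values; the function name hydropathy_mismatch is hypothetical, and hydropathy is only one of the properties that matter (size and charge, for instance, are ignored here).

```python
# Kyte-Doolittle hydropathy scale (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_mismatch(top, bottom):
    """Sum of absolute hydropathy differences over non-identical aligned pairs."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0.0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-" or x == y:
            continue  # gaps and identical pairs contribute nothing
        total += abs(KYTE_DOOLITTLE[x] - KYTE_DOOLITTLE[y])
    return total


top = "CPISRTWASIFRCW"
left = "CPISRT---LFRCW"   # pairs L with I
right = "CPISRTL---FRCW"  # pairs L with W
print(round(hydropathy_mismatch(top, left), 1))   # 0.7
print(round(hydropathy_mismatch(top, right), 1))  # 4.7
```

A lower value means that the residues placed opposite each other are closer in hydropathy.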
3 Secondary structure and sequence alignment
Sometimes the secondary structure of at least one of the sequences is known. This can either be the secondary structure as derived from a PDB file (remember, those are the files in which coordinates are stored) or it can be a predicted secondary structure.
Before we use this information, let's look at some aspects of secondary structure. By now we know that secondary structure elements fall into four categories:
- Helix
- Strand
- Turn
- The rest
And if you look at the Chou and Fasman parameters (and some other very useful data), you see that there is a relation between residue type and secondary structure.
Of course, as always in bioinformatics, the rules that are suggested by these parameters aren't very hard, and exceptions are everywhere. Nevertheless, they make some sense. So we will study them.
Using these rules, 'predict' the secondary structure of the following sequences:
- ELMKIAQLAKRGP
- VVICETTWYVEVT
- VTITVEGPKITVE
- SRGGEPTRHEAKE
- ELLALKLLTVTVT
And select from each of these pairs the better helix:
ALLKAMEAALL ALLNAMQAAGL
KRAAEALLEAE DEAAEALLKAR
ALLLAALLLAL AAEALAKALLR
And which are the better strands in:
VVKISVTIKSG LLKISLTIILI
VVTTVVTTVVTT VTVTVTVTVTV
VVICFFWIIFVI VKICFKSIYVR
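For comparing helix and strand candidates like the ones above, a crude first pass is to average the Chou and Fasman propensities over the whole peptide. The sketch below does only that. It is not the full Chou-Fasman procedure (no nucleation windows, no turn parameters), and the names CHOU_FASMAN and average_propensity are hypothetical. The numbers are the commonly quoted helix and strand propensities; published versions of the table differ slightly, so check them against the table used in this course.

```python
# residue: (helix propensity, strand propensity); values above 1 favour the
# element. Commonly quoted values, to be verified against your own table.
CHOU_FASMAN = {
    "A": (1.42, 0.83), "R": (0.98, 0.93), "N": (0.67, 0.89), "D": (1.01, 0.54),
    "C": (0.70, 1.19), "Q": (1.11, 1.10), "E": (1.51, 0.37), "G": (0.57, 0.75),
    "H": (1.00, 0.87), "I": (1.08, 1.60), "L": (1.21, 1.30), "K": (1.16, 0.74),
    "M": (1.45, 1.05), "F": (1.13, 1.38), "P": (0.57, 0.55), "S": (0.77, 0.75),
    "T": (0.83, 1.19), "W": (1.08, 1.37), "Y": (0.69, 1.47), "V": (1.06, 1.70),
}


def average_propensity(seq):
    """Return (mean helix propensity, mean strand propensity) of a peptide."""
    if not seq:
        raise ValueError("sequence must not be empty")
    helix = sum(CHOU_FASMAN[res][0] for res in seq) / len(seq)
    strand = sum(CHOU_FASMAN[res][1] for res in seq) / len(seq)
    return helix, strand


for peptide in ("ALLLAALLLAL", "VTVTVTVTVTV"):
    helix, strand = average_propensity(peptide)
    guess = "helix" if helix > strand else "strand"
    print(peptide, round(helix, 2), round(strand, 2), guess)
```

Averages like these ignore where the residues sit within the helix or strand, for example whether the hydrophobic residues end up on one face, so they complement rather than replace a sketch of the structure.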
4 Using secondary structure information in sequence alignment
Now, how do we use this information? Well, let's start with an example. Predict and sketch the structure of:
VTVTVTGNTVTVTV
and make the alignment with:
VTVTVSGVTVTV
That alignment requires two deletions in the middle. However, after you have made the alignment, also predict and sketch the secondary structure of VTVTVSGVTVTV. Finally, compare the secondary structure predictions (and sketches) with the alignment. Do you now see how secondary structure can help?
Align the two sequences:
LLAELALAAMKGSTPNGS
LLLEALMRGTTPNGG
Now predict the secondary structure of the first sequence and look at the alignment again. What is the problem? How do we solve this?
5 The last example
In this last example, we show everything in pictures again. The question with this example is again: "Write down in your own words why the green alignment is better than the red one, and why that seems funny at first."
If we have two sequences, with two different alignments:
A ALLELAMKLAIGNSGP A ALLELAMKLAIGNSGP
B1 ALLELAMK--IGNSGP B2 ALLELAMKIG--NSGP
then the left alignment looks much better, but look at the corresponding structures that are shown below:
Structure A: ALLELAMKLAIGNSGP
the structure that would lead to alignment B1
ALLELAMKLAIGNSGP
ALLELAMK--IGNSGP
the structure that would lead to alignment B2
ALLELAMKLAIGNSGP
ALLELAMKIG--NSGP
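The idea behind this example, that indels are better placed in loops than inside helices or strands, can be checked mechanically once a secondary structure is available for one of the sequences. The sketch below counts the gap columns that fall opposite helix or strand residues. The function name gaps_in_structure is hypothetical, and the secondary structure string for sequence A (helix for the first ten residues, loop for the rest) is an assumption made for illustration, since the pictures are what actually define it.

```python
def gaps_in_structure(ref, other, ss):
    """Count columns where `other` has a gap opposite a helix or strand residue.

    ref   -- aligned row of the sequence whose structure is known
    other -- aligned row of the second sequence
    ss    -- one character per residue of ref (gaps excluded):
             'H' helix, 'E' strand, '-' anything else
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have the same length")
    if len(ss) != len(ref.replace("-", "")):
        raise ValueError("ss needs one character per residue of ref")

    count = 0
    residue = 0  # index into ss, advanced only on residues of ref
    for x, y in zip(ref, other):
        if x == "-":
            continue
        if y == "-" and ss[residue] in ("H", "E"):
            count += 1
        residue += 1
    return count


seq_a = "ALLELAMKLAIGNSGP"
ss_a = "HHHHHHHHHH------"  # assumed: helix ALLELAMKLA, then a loop
print(gaps_in_structure(seq_a, "ALLELAMK--IGNSGP", ss_a))  # 2 (alignment B1)
print(gaps_in_structure(seq_a, "ALLELAMKIG--NSGP", ss_a))  # 0 (alignment B2)
```

Under that assumption, alignment B1 deletes two residues from the helix, while alignment B2 places the deletion in the loop.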
And, if by now it does not seem clear that knowledge about the structure can help with the fine-tuning of the alignment, you are in trouble.