Comp602 Bioinformatics Algorithms Labs Spring 2011

Comp602 Bioinformatics Algorithms – Laboratory Exercises – Spring 2011 (draft)

Laboratory Rules

Teams:

Scoring:

10 points when correct and submitted on time. Lose 2 points for each week late.

Laboratory Submissions:

Please submit labs by Blackboard as a single word or pdf document. Keep a copy for yourself. The document contains:

a) Problem statement. You can usually copy this from the lab assignment.

b) Algorithm used: You may refer to the textbook or some other source. The algorithm should be stated in standard algorithmic pseudocode as used in the Levitin book.

c) Big-O complexity information (if specified).

d)Program code. Submit only code you have written. Never submit generated code, object code or executables. Acceptable languages are C++ or Java. Some labs may allow submission in other languages including Perl and Matlab. All files should have a comment header including your name(s), the lab number, a brief description of the code and the date written. If the assignment is to modify code given to you, be sure to credit the original author in the header. The header should also contain a brief description of the modifications you made. Throughout the code clearly mark the modifications you made with comments. For example, use your initials as in: //@mw …

e)Sample runs. Include at least two. Some problems will be run against data sets consisting of fragments of genetic code or amino acid sequences. Clearly identify the data and state how it was obtained. Show the output from each run. Also give timing information. Use the system clock time or some more sophisticated method such as the query performance counter to show the time used.

f) Your comments (optional). You may wish to make suggestions on how to improve the algorithm.

Lab 1 Brute ForcePattern Matching

In this lab you will search for patterns in a string downloaded from a public dna database. You will use a brute force algorithm such as described in Levitin p. 104. You are not allowed to use built-in regular expression matching found in languages like Perl.

The brute-force patternmatching algorithm comparesthe pattern P with the text T

for each possible shift of Prelative to T, until eithera match is found, or all placements of the patternhave been tried. Brute-force pattern matchingruns in time O(nm), where n is the length of the test and m of the pattern.

Example of worst case:

T = aaa … ah

P = aaah

BruteForceMatch(T, P, m, n)

//Input text T of size n and patternP of size m

//Output starting index of asubstring of Tequal to P or -1

//if no such substring exists

for i  0 to n-1 do

/* test shift i of the pattern */

j  0;

while (j <m & T[i + j] = P[j])

j  j + 1;

if ( j = m)

return i ; /* match at i */

return -1; /* no match */

Part 1) Implement the BruteForceAlgorithm in C++ or Java.

Do at least 2 sample runs searching for the pattern: AAACTGAAAAAGAACGAAACTGTCin different chromosomes. The database resource is Genbank. You can search Nucleotide for NC_004353.

For the first sample run use the genome for chromosome 4 of the drosophila.

Part 2) Searching for Metalloenzymes Motifs

a) Go to NCBI and search the protein database for “zinc-dependent metalloprotease”.

b) Narrow the search results by pressing “Homo sapiens” under top Organisms.

c) Under the second result:

metalloprotease [Echis pyramidum] 617 aa protein, choose FASTA.

d) Use the “Find in this Sequence” tool to locate the pattern “HExxHxxGxxH”.

e) Try this search on some related species such as “Group III snake venom metalloproteinase [Echis ocellatus]”

f) Modify the brute force pattern matching program from Part 1 so that letter ‘x’ in the pattern matches any letter in the text.

g) Compare the results of your program with those from the “Find in this Sequence” tool on several sequences.

Lab 1A (Extra Credit) Boyer-Moore-Horspool Pattern Matching

Pattern matching is useful in working with dna sequences and also in searching medical and other literature. Read the research article Fast Exact String Pattern-matching Algorithms Adapted to the Characteristics of the Medical Language. Then repeat the pattern searches in Lab 1 using the Boyer-Moore-Horspool algorithm.

Brief description of Boyer-Moore-Horspool:

Brute force slides the pattern alongside the text, matching letters left-to-right until either:

all match – returns index of start of pattern in text
mismatch – slides pattern rightwards by 1

Boyer-Moore and similar algorithms do better by sliding the pattern more than 1 square at-a-time. Letters are matched right-to-left. How much to slide depends on the first text letter that doesn’t match. If it is not in the pattern at all, then slide by the length of the pattern. If it is in the pattern slide by the distance from the right of the first occurrence of the letter in the pattern.

For example, consider the pattern: TCGCT. If the text letter is anything but T, C or G then slide by 5. If the text and the pattern agree, don’t slide but continue to match right-to-left. If they don’t agree slide by 1 if the text is C, slide by 2 if G, by 4 if T.

Compare the time taken by brute force and Boyer-Moore-Horspool on the same data.

To achieve this easily, you need to preprocess the pattern to make a skip table, with entries for all possible letters, and the amount to slide for each. For example:

Table for “TCGCT” over the alphabet {A,C,G,T}

A / C / G / T
5 / 1 / 2 / 4

Here is the algorithm copied roughly from Levitin for preparing the skip table.

SkipTable(P[0 .. m-1])

//Input: Pattern P with length m

//Output: Table[0 .. n-1] with skip values for all n letters in alphabet

for i  0 to n-1 do

Table[i] = m

for j  0 to m-2 do

Table[P[j]] = m – 1 – j

return Table

To make the lookup fast, you can use the ASCII values of the letters. Make an array with entries indexed from 0 to 127, set all the values to the length of the pattern (in this case 5) then modify the values for the letters appearing in the pattern. In this case the entry for 65 is 5, for 67 is 1, etc. Then when deciding to slide, look up the text char in the array to see how much to slide by.

Lab 1B (Extra Credit)PROSITE pattern notation

a. Acquaint yourself with the PROSITE pattern notation. Start with the Sequence motifarticle in Wikipedia. It is excerpted here:

______

The PROSITE notation uses the IUPAC one-letter codes and conforms to the above description with the exception that a concatenation symbol, '-', is used between pattern elements, but it is often dropped between letters of the pattern alphabet.

PROSITE allows the following pattern elements in addition to those described previously:

The lower case letter 'x' can be used as a pattern element to denote any amino acid.
A string of characters drawn from the alphabet and enclosed in braces (curly brackets) denotes any amino acid except for those in the string. For example, {ST} denotes any amino acid other than S or T.
If a pattern is restricted to the N-terminal of a sequence, the pattern is prefixed with ''.
If a pattern is restricted to the C-terminal of a sequence, the pattern is suffixed with ''.
The character '' can also occur inside a terminating square bracket pattern, so that S[T>] matches both "ST" and "S>".
If e is a pattern element, and m and n are two decimal integers with m <= n, then:
e(m) is equivalent to the repetition of e exactly m times;
e(m,n) is equivalent to the repetition of e exactly k times for any integer k satisfying: m <= k <= n.

Some examples:

x(3) is equivalent to x-x-x.
x(2,4) matches any sequence that matches x-x or x-x-x or x-x-x-x.

______

b. Read some lecture notes on regular expressions and automata. Also on Pattern Matching.

Express the pattern signature of the C2H2-type zinc finger:

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

in the form of a finite state automaton.

c. Write a program to match this pattern in any protein sequence. The program should have a number of states and transitions corresponding to the automaton in Part b.

d. Is it possible to write a more general program? This program would take in any PROSITE pattern and generate the matching states and transitions. It could then be used against any sequence.

Lab 2 Keyword Trees

Read Sections 9.3, 9.4, 9.5.
Write a program to implement multiple pattern matching by use of keyword trees. Proceed as follows:
The space delimited list of patterns and the text are stored in two separate files. You need to be able to open the files and read the strings.
Construct the keyword tree by adding patterns one-at-a-time. Each node may have multiple children labeled by single letters. A flag should indicate if a node terminates one of the keywords. So you need to write a function like addChild(char c) which creates a new child node labeled by ‘c’ only if one doesn’t already exist.
You may choose to simplify the problem by assuming that no pattern is a prefix of any other in the set. This makes sense since if the longer pattern is matched, the prefix is matched too. This way all patterns terminate at leaves.
As described on page 319, traverse the keyword tree using letters from the text. If a terminal node (leaf) is found output the pattern matched and the position of the first letter in the text matching the pattern. Back up the text to the position following the last starting position whenever there is a mismatch or a leaf node has been reached. This way you can continue searching for other matches. i.e. Suppose pattern “that” is matched on a terminal leaf. Back up the text to the ‘h’ to match patterns such as “hatter”.
Test you code on some contrived examples as in Figure 9.4.
Further test using the set of patterns matching the amino acid arg see p.66) and a long string of DNA.

Lab 3 – Frequencies from the data due

The goal is to gain familiarity in working with DNA data sets.

1. You will divide a long DNA sequence into blocks, say 1000 letters long. For each block you will compute 4 counts, namely the numbers of A, C, G, T in the block. The user will enter the file name for the sequence and the desired block size. The output will be in the form of a text file, formatted so that it can be read into an Excel spreadsheet, as in this example.

>gi|17981852|ref|NC_001807.4| Homo sapiens mitochondrion, complete genome

Block size: 1000

Block Start A C G T

0 0 309 311 149 231

1 1000 353 253 188 206

2 2000 347 259 171 223

3 3000 283 329 149 239

4 4000 316 312 115 257

5 5000 312 317 122 249

6 6000 260 306 168 266

7 7000 299 300 138 263

8 8000 324 319 115 242

9 9000 271 322 139 268

10 10000 314 294 102 290

11 11000 297 337 111 255

12 12000 309 313 116 262

13 13000 299 359 112 230

14 14000 349 354 81 216

15 15000 297 322 124 257

2.Open the file you created in Part 1 in Excel. Use the chart wizard to create a chart. It should look like this.

3.You will now focus on transitions from one letter to the next, i.e. count the number of transitions from A to C, from A to A, etc. With a 4-letter DNA alphabet there are 16 possible transitions. As in Part 1, allow the user to input a block size and output a text file with the frequencies for each transition.

4.Focus on GC and AT transitions. Obtain the DNA for the virus Bacteriophage lambda, which is 48502 bases long. Create an Excel chart showing the frequency of GC and AT transitions. What do you notice from examining the chart?

Lab 4 Open Reading Frames due

An Open Reading Frame (ORF) begins with a START codon and ends with a STOP codon. In this lab you will examine a strand of RNA and find all possible ORF’s. Even though it is RNA, we will use the DNA alphabet ACTG instead of the RNA alphabet ACUG. An ORF may or may not code for a protein. Usually, we establish a threshold, say 70 residues long, and claim that ORFs above the threshold are unlikely to occur by chance alone (null hypothesis) and therefore represent a protein.

1.Write a program that reads a file of RNA and creates two arrays, the first containing the locations for all START (ATG) codons, and the second containing the locations of all STOP (TAA, TGA, TGA) codons.

2.Allow the user to input a threshold. Output a report of all ORF’s above the threshold. The report should simply list the START position in the sequence and the length (in residues) of the ORF.

3.Using the genetic code, translate the ORFs into proteins. Print a report showing all proteins thus obtained.

Test your code on the Sars virus. You can find it in GenBank, accession number AY274119.3.

Lab 5Partial Digest Problem due

Write a program to implement the partial digest problem using the algorithm on page 90 in the Pevzner book. Test your program on at least 2 data sets.

1. The data set in Figure 4.1 on p.85, namely: 2 2 3 3 4 5 6 7 8 10. Your algorithm should produce 2 solutions, the one shown in Figure 4.1, namely {0,2,4,7,10} and also its homometric twin: {[0,3,6,8,10}.

2. The dataset: 1 3 3 3 3 4 4 4 6 7 7 7 7 10 10 10 11 13 14 14 17. Be careful to get this right since I know the answers.

Hints: Create a class named Multiset. I did this in C++. You need to have a way to store the values. I simply used an array. The array index is the value. The number stored at that index is the count. So if a[5] is 2 that means there are 2 fives in the multiset. This is a bit crude since many array slots aren’t used but it gets the job done. I also used variables to hold the current count of values and the current largest value.

My class has the following operations:

Multiset(void); //builds an empty multiset

~Multiset(void);

void addValue(int value);//increments the count for that value

void removeValue(int value);// decrements the count for that value

void printMultiset(ostream& out);// prints the nonzero values and counts

int readValues(istream& in);//builds a multiset by reading a file of space delimited integers

bool subset(Multiset superset);// returns true if all values in this multiset are also in superset

Multiset* delta(int y);// Given a point y and set X, returns the multiset of distances (made positive) between y and elements in X

void subtract(Multiset* other);// subtracts values in other multiset form this one

void add(Multiset* other);// Adds values in other multiset to this one

int computeLargest(int oldLargest);// If the oldLargest is removed a new one is computed

The original sequence X that you are trying to recover is actually a set so you should make a set class too; I was lazy and used the Multiset class. Once you have created the Multiset class you can implement the PartialDigest(L) and Place(L,X) functions according to the algorithm. I made these stand-alone functions. Good Luck.

Lab 5a(Extra Credit) Working with public databases in biology

This lab is meant to familiarize you with some of the important public databases used in biology and medicine. The lab is taken from the textbook site reproduced here for your convenience.

Practical Problem Set #1: Working with Databases

Part A: Green Fluorescent Protein

One of the most well studied proteins in molecular biology is the green fluorescent protein, or GFP. For this problem, you will visit popular online biological databases websites and gather information on GFP.

Genbank:

Genbank is a database of nucleotide sequences. It can be accessed at the NCBI website (NationalCenter for Biotechnology Information) at In the search pull down menu at the top, make sure "nucleotide" is selected. In the text box at the top of the screen where it solicits input for searching, type "GFP" and hit the Go button.
This search will bring up over 1000 results. To narrow the search, click on "Limits" just below the box where you typed "GFP". Limit the search to "gene name" (in the dropdown box) and click the "Go" button again. You will now have approximately 50 results. Go to the end of the list (you will have to click "next" one time (the "next" link appears to the right).
The last two entries, M62653 and M62654, are from a seminal 1992 paper. Click on M62653 (the last entry), look over the Genbank record, and answer the following questions:
How long is the nucleotide sequence?
How many "guanines" appear in the gene's DNA sequence? Is there a bias towards any particular nucleotide?
What is the Latin name of the organism whose DNA was sequenced for this GFP?

Swissprot:

Swissprot is a database of amino acid sequences that can be accessed at At the Swissprot homepage, type GFP and click the Search button. The last link in the Swissprot section (not the trembl section) should be GFP_AEQVI P42212).
Examine the web page for this protein, and answer the following:
How many references are cited?
This Swissprot record has links to other databases. Pfam (Protein Families) is a database of multiple alignments. Pfam accession numbers begin with the letters PF, followed by five numbers (e.g. PF12345). What is the Pfam accession number for GFP_AEQVI? (NOTE: An accession number is simply a tag that you can use to refer to a particular item in a database. Many of the databases you will use will have accession numbers. There is no standard formatting for accession numbers across databases.)

The Swissprot database is available via ftp. To see the data in its textual format (i.e. what you get when you ftp), scroll down to the bottom of the GFP_AEQVI web page, and click the link that says "View entry in raw text format (no links)." Answer the following questions:
The first two letters on each line identify what kind of line it is (e.g. ID = Identifier, DT = date, etc.) Find the line that has the Latin name for the species. What two letters appear at the beginning of the line?
What two symbols, which appear on a line by themselves at the bottom of the file, indicate the end of the record for GFP_AEQVI? (In the ftp file that you can download, these symbols are the "record separators".)

Protein Data Bank:

The PDB (Protein Data Bank) is a database of protein structures at From that page, click the “SearchLite” link. On the resulting page, type “GFP” into the text box and click the “Search” button. Look at the first result (1EMB) and click the “Explore” link to the right. Then click on the “Download/Display File” link (on the left). Then, click on the link to display the structure file in PDB file format complete with coordinates as HTML.
In this file the majority of lines are “ATOM” lines. Scroll down until you see those lines and note how the atoms are numbered (in this case, 1 to 1908). Answer the following questions:
What kind of atom is #16 (3rd column)
What kind of amino acid is atom #16 in? (4th column)
What are the (x,y,z) coordinates of atom #16?

ENSEMBL:

ENSEMBL is web-based genomic resource available at The first website is the ENSEMBL home page. How many species are available on this website?
Search for anything with “zinc finger” (a structural motif in proteins). Find the first mouse GENE, and browse to that page. Please note that mouse is Mus musculus, and that the results are grouped in several ways. You are looking for GENE index. Follow the first link. Record the following:
The ensembl mouse gene id for this first link.
The genomic location for this mouse gene.
The cDNA transcript for this mouse gene - found by following link to view gene in genomic location and looking over the basepair view. Move the pointer slowly.
The ensembl human gene id for homologous protein (back to the mouse gene specific page and look down to homology).
The genomic location for this human gene.
How many cDNA transcripts are given? Record for ENSESTT00000205818.
What is the Hamming distance between the first ten nucleotides?
Human and mouse genomes can be partitioned into a large number of synteny blocks, with each human synteny block corresponding to a mouse synteny block. What mouse synteny block is the mouse gene located on? What human synteny block does the human homolog belong on? What is the correspondence between these two synteny blocks?

Part B: Searching A Nucleotide Sequence Database

Go to the following web page:
Copy the DNA sequence marked JurassicPark DinoDNA from the book JurassicPark. (Read the text to learn the story behind this particular DNA).
Go the NCBI Blast home page at Go to the link that says Nucleotide-nucleotide BLAST [blastn]
Paste the DinoDNA DNA sequence into the text box and hit the Blast! button.
What is the gi number of the first result?
What is the length of the match of the first result?
What is the e-value of the first result?
Is the DNA sequence in Jurassic park fictional (i.e. made up / random) or “borrowed” (i.e. copied from real DNA)?

Lab 6Dynamic programming and the Manhattan tourist problem due

Write a program to solve the Manhattan tourist problem. It should hard-code a graph consisting of nodes (street intersections) and edges (city blocks). Edges containing a tourist attraction should be weighted 1 (or n if there are n attractions). Other edges are wighted 0. A node edge has an int value field named score and an enum valued field named from. The enum values are {WEST, NORTH, DIAGANOL). The from field indicates the last direction used to reach the node. The idea is to fill in the node values starting from the northwest corner and ending at the southeast corner. When filling in a node consider the maximum score obtainable by reaching the node from the north, the west or from a diagonal (if there is one). At the end, the southeastern-most node contains the highest possible score. A path can be recovered by following the from pointers back to the northwest node.