Assignment 3 – Longest Common Subsequence
Dean Zeller[1]Due: Thursday, September 21st by 2:00 pm
CS3300110 points
Fall, 2006
Objective:The student will write a C++ program to implement a dynamic programming approach to the longest common subsequence problem.
Topics covered
This assignment is a review of your skills as a programmer using the string and two-dimensional array data structures.
Background: Bioinformatics
From Wikipedia:
Deoxyribonucleic acid (DNA) is a nucleic acid — usually in the form of a double helix — that contains the genetic instructions (or genocode) monitoring the biological development of all cellular forms of life. DNA is a long polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the genetic code of nucleotides. DNA is thought to date back to between approximately 3.5 to 4.6 billion years ago.
Computers execute machine code, a series of 0’s and 1’s. The machine code for living organisms is DNA, a sequence of four nucleotides: adenine, cytosine, guanine, and thymine. Machine code and DNA are very similar in theoretical structure. Thus, a technique that is useful in computer science can also be useful in genetics. DNA can be represented digitally by long strings of A, C, G, and T. One of the most simplistic tests to determine genetic similarity is the longest subsequence of nucleotides common to both species. Determining this similarity value is of great use to bioinformatic scientists to determine position on an evolutionary tree. (Evolution trees will be discussed later in the semester.)
It is a common problem in computer science to determine the longest common substring of two strings, particularly useful in web search engines. In bioinformatics, the largest subsequence of two strings is far more useful (and difficult) to determine. It is simple to write a brute-force program to solve the problem. DNA strings can be millions of characters long, and thus a brute-force method could take years (or even centuries) to give results for real-world data. This assignment implements a dynamic programming method of efficiently solving the longest common subsequence problem.
Program Requirements
InputPrompt the user to enter two strings, representing the DNA sequence of two species. Alternately, the strings can be read in from a file.
Error checkingBefore executing the algorithm, check each character in both strings to ensure only acceptable DNA letters are entered. Do not run the algorithm on input with erroneous data. In the event of an error, indicate how many characters are illegal, and recreate the string with the illegal letters highlighted (example: AGCTSACT agctSact).
OutputThe program should output the dynamic subsequence table and the length of the longest subsequence. The actual subsequence string can be printed for extra credit.
Program Architecture
FunctionsUse functions to modularize your code. The following is a suggestion on how to organize your functions.
functions.hheader file for functions
functions.cppimplementation of the functions
main.cppmain program implementation that calls the functions
DocumentationEach function must include a documentation block describing the input, output, and method of solving the problem. Create documentation for the main program describing the program overall. Use proper indentation and self-descriptive variable names.
Grading
You will be graded on the following criteria:
AccuracyCorrectly implementing and using the LCS algorithm
ReadabilityDocumentation, descriptive variable names, and indenting
TestingEnsuring the code works for a variety of inputs
CreativityUsing new methods to solve the problem
Extra Credit
Extra points will be given for including the following features:
File inputAllow the user to enter a filename with the two input strings
SubsequencePrint the longest common subsequence in addition to the length
Brute-force In addition to the dynamic programming method described in this assignment, also implement a brute-force solution to the problem. Create a short report that describes the program duration for a variety of input lengths.
Experiment Reconstruct the evolution tree on the species described in Input Set #5. Run your program on each pair of species in Input Set #5 below. Use the similarity values to reconstruct a possible evolution tree, in which species that are more similar are closer on the tree.
Turning in your assignment
- Print all C++ and header files.
- Print a test of your program with input sets given below.
Algorithm Pseudocode
Given below is the pseudocode to solve the problem. You are to implement the pseudocode in object-oriented C++ functions. Note that this pseudocode has not yet been tested.
Step 1: initialize variables
string s = string1
string t = string2
Twidth = s.length + 1
Theight = t.length + 1
int T[Twidth][Theight]
for (i=0; i<Twidth; i++)
T[i,0] = 0
end-for
for (j=0; j<Theight; j++)
T[0,j] = 0
end-for
Step 2: Run LCS algorithm
for (j=1; j<Theight; j++)
for (i=1; i<Twidth; i++)
if (s[i] == t[j]) // match in subsequence
T[i,j] = T[i-1,j-1] + 1
else // no match, use previous value
T[i,j] = max (T[i-1,j], T[i,j-1])
end-if
end-for
end-for
subsequenceLength = T[Twidth][Theight]
Step 3: Determine subsequence string (extra credit)
To recreate the string, you must keep track of the times the algorithm found a matching letter in the sequence. The exact method to keep track of the sequence matches is up to you.
Test Input Sets
Test your program on the following sets of input data.
Input Set #1
String1:ATCCGACAAC
String2:ATCGCATCTT
Output:7 (ATCGCAC)
Input Set #2
String1:AACGTTCOGMA
String2:GGATACCASAT
Output:errors in String1 (aacgttcOgMa)
error in String 2 (ggataccaSat)
Input Set #3
String1:AAAATTTT
String2:TAAATG
Output:4 (AAAT)
Input Set #4
String1:TAGTAGTAGTAGTAGTAG
String2:CATCATCATCATCA
Output:8 (ATATATAT) or 8 (CACACACA)
Example of Method
Rule for T[i,j]:
if (string1[i] == string2[j])
T[i,j] = T[i-1,j-1] + 1
otherwise
T[i,j] = maximum (T[i-1,j], T[i,j-1])
string1 = “ATCCGACAAC”
string2 = “ATCGCATCTT”
[1] Written by Dean Zeller. Edited by Jasmine Boscom, Daryl Popig, and John Withers.