Assignment 3 – Longest Common Subsequence

Dean Zeller[1]Due: Thursday, September 21st by 2:00 pm

CS3300110 points

Fall, 2006

Objective:The student will write a C++ program to implement a dynamic programming approach to the longest common subsequence problem.

Topics covered

This assignment is a review of your skills as a programmer using the string and two-dimensional array data structures.

Background: Bioinformatics

From Wikipedia:

Deoxyribonucleic acid (DNA) is a nucleic acid — usually in the form of a double helix — that contains the genetic instructions (or genocode) monitoring the biological development of all cellular forms of life. DNA is a long polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the genetic code of nucleotides. DNA is thought to date back to between approximately 3.5 to 4.6 billion years ago.

Computers execute machine code, a series of 0’s and 1’s. The machine code for living organisms is DNA, a sequence of four nucleotides: adenine, cytosine, guanine, and thymine. Machine code and DNA are very similar in theoretical structure. Thus, a technique that is useful in computer science can also be useful in genetics. DNA can be represented digitally by long strings of A, C, G, and T. One of the most simplistic tests to determine genetic similarity is the longest subsequence of nucleotides common to both species. Determining this similarity value is of great use to bioinformatic scientists to determine position on an evolutionary tree. (Evolution trees will be discussed later in the semester.)

It is a common problem in computer science to determine the longest common substring of two strings, particularly useful in web search engines. In bioinformatics, the largest subsequence of two strings is far more useful (and difficult) to determine. It is simple to write a brute-force program to solve the problem. DNA strings can be millions of characters long, and thus a brute-force method could take years (or even centuries) to give results for real-world data. This assignment implements a dynamic programming method of efficiently solving the longest common subsequence problem.

Program Requirements

InputPrompt the user to enter two strings, representing the DNA sequence of two species. Alternately, the strings can be read in from a file.

Error checkingBefore executing the algorithm, check each character in both strings to ensure only acceptable DNA letters are entered. Do not run the algorithm on input with erroneous data. In the event of an error, indicate how many characters are illegal, and recreate the string with the illegal letters highlighted (example: AGCTSACT  agctSact).

OutputThe program should output the dynamic subsequence table and the length of the longest subsequence. The actual subsequence string can be printed for extra credit.

Program Architecture

FunctionsUse functions to modularize your code. The following is a suggestion on how to organize your functions.

functions.hheader file for functions

functions.cppimplementation of the functions

main.cppmain program implementation that calls the functions

DocumentationEach function must include a documentation block describing the input, output, and method of solving the problem. Create documentation for the main program describing the program overall. Use proper indentation and self-descriptive variable names.

Grading

You will be graded on the following criteria:

AccuracyCorrectly implementing and using the LCS algorithm

ReadabilityDocumentation, descriptive variable names, and indenting

TestingEnsuring the code works for a variety of inputs

CreativityUsing new methods to solve the problem

Extra Credit

Extra points will be given for including the following features:

File inputAllow the user to enter a filename with the two input strings

SubsequencePrint the longest common subsequence in addition to the length

Brute-force In addition to the dynamic programming method described in this assignment, also implement a brute-force solution to the problem. Create a short report that describes the program duration for a variety of input lengths.

Experiment Reconstruct the evolution tree on the species described in Input Set #5. Run your program on each pair of species in Input Set #5 below. Use the similarity values to reconstruct a possible evolution tree, in which species that are more similar are closer on the tree.

Turning in your assignment

  1. Print all C++ and header files.
  2. Print a test of your program with input sets given below.
Algorithm Pseudocode
Given below is the pseudocode to solve the problem. You are to implement the pseudocode in object-oriented C++ functions. Note that this pseudocode has not yet been tested.
Step 1: initialize variables

string s = string1

string t = string2

Twidth = s.length + 1

Theight = t.length + 1

int T[Twidth][Theight]

for (i=0; i<Twidth; i++)

T[i,0] = 0

end-for

for (j=0; j<Theight; j++)

T[0,j] = 0

end-for

Step 2: Run LCS algorithm

for (j=1; j<Theight; j++)

for (i=1; i<Twidth; i++)

if (s[i] == t[j]) // match in subsequence

T[i,j] = T[i-1,j-1] + 1

else // no match, use previous value

T[i,j] = max (T[i-1,j], T[i,j-1])

end-if

end-for

end-for

subsequenceLength = T[Twidth][Theight]

Step 3: Determine subsequence string (extra credit)

To recreate the string, you must keep track of the times the algorithm found a matching letter in the sequence. The exact method to keep track of the sequence matches is up to you.

Test Input Sets
Test your program on the following sets of input data.

Input Set #1

String1:ATCCGACAAC

String2:ATCGCATCTT

Output:7 (ATCGCAC)

Input Set #2

String1:AACGTTCOGMA

String2:GGATACCASAT

Output:errors in String1 (aacgttcOgMa)

error in String 2 (ggataccaSat)

Input Set #3

String1:AAAATTTT

String2:TAAATG

Output:4 (AAAT)

Input Set #4

String1:TAGTAGTAGTAGTAGTAG

String2:CATCATCATCATCA

Output:8 (ATATATAT) or 8 (CACACACA)

Example of Method

Rule for T[i,j]:

if (string1[i] == string2[j])

T[i,j] = T[i-1,j-1] + 1

otherwise

T[i,j] = maximum (T[i-1,j], T[i,j-1])

string1 = “ATCCGACAAC”

string2 = “ATCGCATCTT”

[1] Written by Dean Zeller. Edited by Jasmine Boscom, Daryl Popig, and John Withers.