Bio/CS – 251
Final Project 2007
Computing Option
BACKGROUND: For your final project you will take the role of programmer and CS consultant for a group traveling in Africa to study the evolution of HIV-related retroviruses. Because the team will be working in areas without network access, and therefore without access to the large online databases, you have been asked to download the retroviral genomes that are relevant to the group’s study and to supply programs that let the team compare the sequences derived from their collected samples against the database you have downloaded onto their computers. Your task is to write and test the software for the group’s project.
ACADEMIC PURPOSE: You will be implementing versions of programs that already exist and are readily available through the NCBI Entrez web site. So while you will not be breaking new research ground, you will be wrapping up the semester-long study that blended biological techniques with the algorithms that have been designed to analyze the data obtained through the application of those techniques. You will be doing this by implementing some of the most important algorithms as computer programs. During the course of this project you will also be investigating retroviruses that are behind one of the world’s most insidious pandemics.
CREATE YOUR OWN BLAST PROGRAM
I. The contents of your target database will be five retrovirus (RTV) genomes that you will download and store in a file on your local machine. These five constitute a closely related family of RTVs that includes the human AIDS-causing viruses, HIV-1 and HIV-2, and the simian immunodeficiency viruses from which the HIV viruses evolved.
Point your browser to http://www.ncbi.nlm.nih.gov/. Under the “Hot Spots” column, click the Retrovirus Resources link. Set the Search window to Genome and find the complete genomes for the following primate lentiviruses:
HIV-1
HIV-2
SIV
SIV-2
Simian-Human Immunodeficiency Virus
- Copy the complete FASTA-formatted sequence for each genome.
- Store each genome as a separate record in a random-access file.
- Place a header on each genome that gives the record number and identification of the genome.
- Download the gag-pol protein from HIV-1 and the pol protein to your computer and save them in a separate file. These will be your query sequences for testing your BLAST program.
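The record layout for the random-access file is left to you. As a minimal sketch, assuming fixed-length records so that record n starts at byte n times the record size, the storage steps above might look like the following. The header layout, the 12000-byte sequence limit, and the function names are illustrative assumptions, not requirements of the assignment.

```python
import struct

# Hypothetical layout: 4-byte record number + 60-byte identifier, then the sequence.
HEADER = struct.Struct("<I60s")
SEQ_BYTES = 12000                      # assumed upper bound on genome length; adjust as needed
RECORD_BYTES = HEADER.size + SEQ_BYTES


def write_record(f, number, ident, sequence):
    """Write one genome as a fixed-length record at slot `number`."""
    if len(sequence) > SEQ_BYTES:
        raise ValueError("sequence longer than the fixed record size")
    f.seek(number * RECORD_BYTES)
    f.write(HEADER.pack(number, ident.encode("ascii")))
    # Pad with spaces so every record has the same length.
    f.write(sequence.encode("ascii").ljust(SEQ_BYTES, b" "))


def read_record(f, number):
    """Jump straight to slot `number` and return (number, identifier, sequence)."""
    f.seek(number * RECORD_BYTES)
    num, ident = HEADER.unpack(f.read(HEADER.size))
    seq = f.read(SEQ_BYTES).decode("ascii").rstrip()
    return num, ident.rstrip(b"\0").decode("ascii"), seq
```

With a file opened in binary mode (for example `open("genomes.dat", "w+b")`, a made-up file name), any record can then be read without scanning the records before it.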
- Using the sequences in your target database, create a HASH TABLE of the least common three-letter “words” in the file.
- Choose a hash function that distinguishes well among the three-letter words.
- The criterion for “least common” should be that a word appears no more than 10% of the time in the database.
- Each entry in the table should contain the word, the sequence containing the word, and the position of the word in that sequence.
- Save this Hash Table for use by your BLAST program.
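One possible shape for the word table is sketched below. It assumes nucleotide sequences held in a dictionary keyed by genome name, a base-4 encoding as the hash function (64 distinct values for three-letter words), and it reads the 10% criterion as "at most 10% of all word occurrences". All three choices, and the function names, are assumptions made for illustration.

```python
from collections import Counter, defaultdict

NUC_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_hash(word):
    """Base-4 encoding: each distinct nucleotide word gets its own bucket."""
    h = 0
    for ch in word:
        h = h * 4 + NUC_CODE[ch]
    return h


def build_word_table(genomes, k=3, max_fraction=0.10):
    """Return {hash: (word, [(genome_name, position), ...])} for uncommon words."""
    counts = Counter()
    positions = defaultdict(list)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip words containing ambiguity codes such as N.
            if all(ch in NUC_CODE for ch in word):
                counts[word] += 1
                positions[word].append((name, i))
    total = sum(counts.values())
    table = {}
    for word, n in counts.items():
        # Assumed reading of the criterion: share of all word occurrences.
        if n / total <= max_fraction:
            table[word_hash(word)] = (word, positions[word])
    return table
```

For example, `build_word_table({"g1": "AAAAAAAAAACGT"})` drops the frequent word AAA and keeps AAC, ACG, and CGT with their positions.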
- Write an ungapped BLAST program using the algorithm presented in class.
- Your program should accept the user’s preferences for word threshold and window cutoff as parameters. If the user does not specify them, your program should fall back on default values. These parameters should be assigned relative to the Jukes-Cantor scoring scheme.
- Of course, the program must require a query sequence as input.
- The output should also contain the names of all matching local alignments that are at least 50 characters long. In addition, report the Jukes-Cantor score for each alignment.
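The algorithm presented in class is not reproduced in this handout, so the following is only a generic sketch of the ungapped seed-and-extend step, not necessarily the in-class version. It assumes a uniform +1 match / -1 mismatch score as a stand-in for the Jukes-Cantor scoring scheme, and `drop_cutoff` is a hypothetical stand-in for the window cutoff parameter. Both have default values that a user could override.

```python
def extend_seed(query, target, q_pos, t_pos, k=3,
                match=1, mismatch=-1, drop_cutoff=5):
    """Extend a k-letter seed hit left and right without gaps.

    Returns (query_start, target_start, length, score) of the best extension.
    """
    def extend(qi, ti, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop_cutoff:
                break  # score fell too far below the best seen; stop extending
            qi += step
            ti += step
        return best, best_len

    seed_score = sum(match if query[q_pos + i] == target[t_pos + i] else mismatch
                     for i in range(k))
    right_score, right_len = extend(q_pos + k, t_pos + k, +1)
    left_score, left_len = extend(q_pos - 1, t_pos - 1, -1)
    return (q_pos - left_len, t_pos - left_len,
            left_len + k + right_len,
            seed_score + left_score + right_score)
```

A full program would call this for every seed found through the hash table and keep only extensions that meet the 50-character minimum.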
- (Extra Credit) Compute an e-value for your local alignment.
- Repeat the next three steps (scramble, score, save) twenty times.
- Randomly scramble the nucleotides in your query sequence
- Using the Jukes-Cantor scoring scheme, score the randomly scrambled sequence against the target sequence that was found in step C.
- Save this score as an element in an array.
- Using the list of saved scores, compute the mean and standard deviation of your 20 scores. Use a t-distribution based on these statistics.
- Your e-value will be the probability that the score you obtained for your query sequence was obtained by pure chance.
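The scrambling procedure above can be sketched as follows. This is one reading of the steps, under several assumptions: `simple_score` is a placeholder position-by-position scorer (swap in your own Jukes-Cantor scoring routine), the t statistic is taken as (observed - mean) / standard deviation with 19 degrees of freedom, and the tail probability is integrated numerically so that only the standard library is needed.

```python
import math
import random
import statistics


def simple_score(a, b, match=1, mismatch=-1):
    """Placeholder scorer: compare the two sequences position by position."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def t_upper_tail(t, df):
    """P(T >= t) for Student's t, integrating the density with Simpson's rule."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

    a = abs(t)
    steps = 2 * max(1000, int(50 * a))   # even number of intervals
    h = a / steps
    area = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3                        # integral of the density from 0 to |t|
    tail = max(0.5 - area, 0.0)
    return tail if t >= 0 else 1.0 - tail


def chance_probability(query, target, observed_score,
                       score_fn=simple_score, trials=20, seed=None):
    """Scramble the query `trials` times and estimate P(score >= observed)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        letters = list(query)
        rng.shuffle(letters)
        scores.append(score_fn("".join(letters), target))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        return mean, sd, (0.0 if observed_score > mean else 1.0)
    t = (observed_score - mean) / sd
    return mean, sd, t_upper_tail(t, trials - 1)
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing results between team members.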
- (Extra Credit) Read material about Gapped BLAST alignments and implement the algorithm to create such alignments.
COMPUTE JUKES-CANTOR DISTANCES
- Choose an ‘interesting’ sequence and use your BLAST program to find all significantly related sequences (e-value < .0005) in your genome database. NOTE: If you did not do the Extra Credit IV-D, use the NCBI BLAST to find an e-value.
- Compute the Jukes-Cantor Distances between each pair of sequences.
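For reference, the standard Jukes-Cantor distance is d = -(3/4) ln(1 - (4/3) p), where p is the fraction of aligned sites at which the two sequences differ. A minimal sketch follows, assuming the sequences have already been aligned to equal length and that sites with gaps or ambiguity codes are skipped. The function names are illustrative.

```python
import math
from itertools import combinations


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only sites where both sequences have an unambiguous nucleotide.
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in sites) / len(sites)
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


def pairwise_distances(aligned):
    """Map each pair of names to its distance; `aligned` is {name: sequence}."""
    return {(a, b): jukes_cantor_distance(aligned[a], aligned[b])
            for a, b in combinations(sorted(aligned), 2)}
```

Two sequences that differ at 1 site in 10 give p = 0.1 and a distance of about 0.107, slightly more than p because the formula corrects for repeated substitutions at the same site.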
CREATE A PHYLOGENETIC TREE
- Using the distances found in the previous section, create a phylogenetic guide tree for your sequences.
- Draw the tree indicated by this guide tree by hand and label each branch with its distance in terms of the nucleotide differences between the sequences. Assume that a molecular clock is in operation.
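Under a molecular clock assumption, one common way to build such a tree is UPGMA (average-linkage clustering). The handout does not name a method, so treat the sketch below as one option rather than the required one. It assumes the distances arrive as a dictionary keyed by name pairs, and it returns a Newick-style string whose branch lengths can be used to check a hand-drawn tree.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree; `dist` maps (name_a, name_b) pairs to distances."""
    sizes = {n: 1 for n in names}       # number of leaves in each cluster
    heights = {n: 0.0 for n in names}   # distance from each cluster root to its leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    active = list(names)
    while len(active) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        h = d[frozenset((a, b))] / 2    # molecular clock: both sides equally deep
        node = f"({a}:{h - heights[a]:.3f},{b}:{h - heights[b]:.3f})"
        sizes[node] = sizes[a] + sizes[b]
        heights[node] = h
        active = [c for c in active if c not in (a, b)]
        for c in active:
            # Size-weighted average distance from the merged cluster to c.
            d[frozenset((node, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / sizes[node]
        active.append(node)
    return active[0] + ";"
```

For three sequences with d(A,B) = 2 and d(A,C) = d(B,C) = 6, this returns `(C:3.000,(A:1.000,B:1.000):2.000);`, so A and B join at height 1 and C joins them at height 3.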