A motif-finding experiment (optional for extra credit) due May 12, 5:00pm
Please don't turn in your code, but your answers to these questions instead. If the GSI wants to see your code to verify it does what you say it does, she will ask you.
Get help early if you have any problems with this assignment.
1. The ideas in motif-finding algorithms are based on local improvement: trying to find a substring of length k from one of the n input strings, which most closely matches the motif instances identified in the other n-1 strings, until this procedure terminates. Here we will see how fast this procedure is, and how effective as well.
(a) Write a short program to generate n random DNA sequences of length m, each with a motif instance embedded in it. The way to do this is to start with completely random noise strings. Then define a k-letter motif model (Position Weight Matrices (PWMs) are a standard approach to describe such motifs that were described in class); you'll probably want to pick that randomly as well. Pick a place in each n-letter long string to place an instance of your motif, and replace the original noise string with the motif instance generated from your PWM. Describe how you did this, and if there are any interesting details to your implementation.
(b) Now, implement a simple motif-finding algorithm. Your algorithm will maintain a current guess of the motif, which is a substring from each of the n input strings, and will iteratively try to improve that guess, by kicking out one of the current guessed motif instances (you can pick how to do this), identifying the PWM that best describes the n motifs found.
Describe if there were any challenges in this implementation. Are there details in your program that are important to describe?
(c) Does your algorithm find the right answer to motif finding problems? If n = 10, m = 100, and k = 8, how often does the algorithm find the right motif?
(d) For especially weak embedded motifs, does the algorithm ever find a random motif that is stronger than the embedded one?
(e) What's the empirical runtime of your algorithm--what is it as a function of useful parameters, like the strength of the motif, the number of sequences, or their lengths? When it runs for a very long time, does that tend to mean it is successful at finding the correct motif instances, or not, or is that unrelated?
2. Summarize the motivation and basic strategy in motif detection viaa Phylogenetic Footprinting approach.