Pattern Matching with DNA Computers (Shah, Niemier)
Pattern Matching with DNA Computers – DRAFT
Shetu N. Shah, Michael T. Niemier
College of Computing, Georgia Institute of Technology, Atlanta, Georgia USA
14 September 2004
Introduction
We explore implementing a pattern matching application using DNA as the computational medium. Given a string of input and a specified pattern, the application should return the location of all matches of the pattern in the input. DNA will not simply serve as the substrate; it will be used to represent the inputs and to drive the logic for the application. While this implementation does not utilize a systolic approach, it shall exploit the inherent parallelism in DNA.
The inputs to the application will be two single-stranded oligonucleotides (short strings of nucleotides): one oligo will be the string we want to search, the other oligo—which is shorter than the search string—will be the DNA string of the pattern we want to find. The underlying procedure in our approach is the Sanger sequencing procedure (Sanger).
Background
Since 1994, when Leonard Adleman first published his work on solving an instance of the Hamiltonian Path Problem using custom-synthesized strands of DNA, many researchers have been exploring ways to harness the computational power of DNA (Adleman).
Briefly, DNA (deoxyribonucleic acid) is composed of nucleotides, or bases, strung together to form a chain. The four bases are adenine, thymine, guanine, and cytosine (abbreviated as A, T, G, and C). These bases form selective bonds: adenine pairs only with thymine, and guanine pairs only with cytosine. DNA computing is attractive because of its potential for massive parallelism. For example, if a probe strand of DNA is introduced into a test tube containing other strands of DNA, the probe will “test” itself against each of the other strands for a complement in parallel. That is, the probe strand will be chemically attracted to its complement strand and will not have to be tested against each strand individually. In terms of the number of laboratory steps, probing n strands is therefore O(1), versus O(n) for testing each strand sequentially.
Let us define the following:
The DNA alphabet ΣDNA = {A, T, G, C}.
The language LDNA = {s | s ∈ ΣDNA⁺}, the set of all nonempty strings over ΣDNA.
The search string w ∈ LDNA.
The pattern wpattern ∈ LDNA.
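As a point of reference for what the DNA procedure is expected to compute, the definitions above can be sketched in conventional software. This is only an illustrative specification, not part of the laboratory method; the function names `is_dna_string` and `find_all_matches` are hypothetical, and positions are reported as 0-based indices by assumption.

```python
from typing import List

DNA_ALPHABET = frozenset("ATGC")  # the alphabet of the four bases


def is_dna_string(s: str) -> bool:
    """Return True if s is a nonempty string over the DNA alphabet."""
    return len(s) > 0 and set(s) <= DNA_ALPHABET


def find_all_matches(w: str, w_pattern: str) -> List[int]:
    """Return the 0-based start index of every match of w_pattern in w.

    Overlapping matches are reported, since the application should
    return the location of all matches.
    """
    if not (is_dna_string(w) and is_dna_string(w_pattern)):
        raise ValueError("both inputs must be nonempty strings over A, T, G, C")
    matches = []
    start = w.find(w_pattern)
    while start != -1:
        matches.append(start)
        # Advance by one base only, so overlapping matches are not skipped.
        start = w.find(w_pattern, start + 1)
    return matches
```

For example, `find_all_matches("GCATGATCGG", "AT")` reports the two places where the pattern starts in the search string.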
Tools
DNA computing has borrowed many of its techniques and procedures from the fields of biochemistry and molecular biology. This section gives an overview of some of these tools; they are discussed in more detail in (Maley).
Anneal. Also known as “hybridization”, annealing occurs when two complementary single strands of DNA, suspended in solution, join to form a double strand of DNA. This happens through the hydrogen bonds that arise when complementary base pairs are brought into proximity (see Figure X).
Figure X: Annealing is when two complementary single strands of DNA join to form one double strand of DNA.
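The base-pairing rule behind annealing can be sketched as a small check. This is an idealized model that assumes perfect, full-length complementarity and ignores temperature, strand concentration, and partial hybridization; the names `reverse_complement` and `can_anneal` are hypothetical.

```python
# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5' to 3'.

    The two strands of a duplex run antiparallel, so the complement
    is read in the reverse direction.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def can_anneal(strand_a: str, strand_b: str) -> bool:
    """Idealized test: both strands are written 5' to 3' and anneal
    only if one is exactly the reverse complement of the other."""
    return strand_b == reverse_complement(strand_a)
```

Under this model, `can_anneal("CGTACTAGCC", "GGCTAGTACG")` holds, while two copies of the same strand do not anneal to each other.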
Melt. The temperature of a solution is raised beyond the point where the longest double strands of DNA are stable, and the weak hydrogen bonds are broken. The double-stranded DNA will separate into single-stranded DNA (see Figure X).
Figure X: Melting is when the hydrogen bonds in a double strand of DNA are broken, usually by heating the DNA, to result in two single strands of DNA.
Ligate. Ligation concatenates DNA strands. It is most efficiently performed by allowing single strands to anneal together and then using the enzyme ligase to form covalent bonds between the adjacent fragments.
Figure X: To connect strand y to strand x, a “glue strand” is used to bring the two strands into proximity via annealing. Then a ligase enzyme is added to fuse the two strands together.
Polymerase extension. When a short strand is annealed to a longer strand, polymerase enzymes can attach to the 3’ end of the shorter strand and “extend” it, one base at a time, building a sequence complementary to the longer strand (see Figure X).
Figure X: Polymerase extension is when a polymerase enzyme attaches to the 3’ end of a short primer sequence and allows the complement of the longer sequence to be constructed.
Amplify. Often an experiment is equipped only with a single strand of DNA, a very small and fragile sample to have at hand. The Polymerase Chain Reaction (PCR) is a process by which a single strand of DNA is replicated to create an exponentially large sample. PCR is often used to amplify DNA so that it can be seen by the naked eye through separation techniques, such as gel electrophoresis (see Figure X).
Figure X: PCR starts when a double-stranded DNA template is melted into two single strands. Primers then anneal to both strands. Polymerase enzymes extend the 3’ end of the primers to create double-stranded replicas of the template. This process is then repeated to exponentially increase the number of templates; it can be repeated as long as there are enough primers to catalyze the reaction.
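The exponential growth described in the caption can be made concrete with a short calculation. This sketch assumes an idealized reaction in which every template is copied in every cycle and primers never run out; the `efficiency` parameter and the function name `copies_after_pcr` are assumptions introduced for illustration.

```python
def copies_after_pcr(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate the number of template copies after a number of PCR cycles.

    efficiency is the assumed fraction of templates duplicated per cycle:
    1.0 means perfect doubling, 0.0 means no amplification.
    """
    if cycles < 0 or not 0.0 <= efficiency <= 1.0:
        raise ValueError("cycles must be >= 0 and efficiency must lie in [0, 1]")
    return initial_copies * (1.0 + efficiency) ** cycles
```

With perfect doubling, a single template grows to 2^30 copies (about one billion) after 30 cycles; real reactions fall short of this as primers and nucleotides are consumed.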
Chain-Termination Sequencing Procedure
The chain-termination sequencing procedure, developed by Frederick Sanger et al. in 1977, determines nucleotide sequences by generating populations of DNA fragments that all have one end in common and terminate at each possible position. The procedure uses in vitro DNA synthesis in the presence of specific chain-terminators. Specifically, 2’,3’-dideoxyribonucleoside 5’-triphosphate (ddXTP)—where the nucleoside (X) can take the form of either A, T, G, or C—is the most commonly used chain-terminator (Snustad).
The normal DNA precursors are 2’-deoxyribonucleoside 5’-triphosphate (dXTP), which has a hydroxyl group (OH) at the 3’ position. This hydroxyl group is an absolute requirement for chain elongation with DNA polymerase. The ddXTPs lack the 3’-OH; thus, chain elongation cannot continue, and the chain is said to terminate (see Figure X).
Figure X: Comparison of the structures of the normal DNA precursor 2’-deoxyribonucleoside triphosphate and the chain terminator 2’,3’-dideoxyribonucleoside triphosphate used in DNA sequencing reactions (Snustad).
Using ddATP, ddTTP, ddCTP, and ddGTP as the chain-terminators in a DNA synthesis reaction results in a population containing every prefix of the complement to the original strand, that is, every substring of the complement that starts opposite the first nucleotide at the 3’ end of the original strand (see Figure X). To obtain a suitably high probability that the population will contain fragments terminating at each respective base, the ratio of dXTP:ddXTP in a given reaction is approximately 100:1 (Sanger).
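To see why a ratio of roughly 100:1 gives fragments of every length, a simple probabilistic sketch helps. It rests on assumptions not stated in the text: that at each position a terminator is incorporated independently with probability 1/(ratio + 1), and that all template copies behave identically. The function names are hypothetical.

```python
def termination_probability(k: int, ratio: float = 100.0) -> float:
    """Probability that a growing chain terminates at exactly position k.

    Assumed model: at every position a ddXTP is incorporated with
    probability p = 1 / (ratio + 1), independently of other positions,
    so the chain must survive k - 1 positions and then terminate.
    """
    p = 1.0 / (ratio + 1.0)
    return (1.0 - p) ** (k - 1) * p


def prob_length_present(k: int, copies: int, ratio: float = 100.0) -> float:
    """Probability that at least one of `copies` chains terminates at position k."""
    q = termination_probability(k, ratio)
    return 1.0 - (1.0 - q) ** copies
```

Under these assumptions any single chain is unlikely to stop at a given position, but with a large number of template copies each length is represented with near certainty. A much lower ratio would favor short fragments and starve the long ones.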
3’ G C A T G A T C G G 5’
5’ C G T A C T A G C Cdd
5’ C G T A C T A G Cdd
5’ C G T A C T A Gdd
5’ C G T A C T Add
5’ C G T A C Tdd
5’ C G T A Cdd
5’ C G T Add
5’ C G Tdd
5’ C Gdd
5’ Cdd
+ + + + + + + + + + +
Figure X: As the template strand (blue) is replicated using the chain-termination procedure, all possible substrings (red) of the complement to the template that start with the first nucleotide at the 3’ end of the template are produced with almost certain probability. When these substrings are separated using gel electrophoresis, the shorter strands travel the farthest through the gel toward the positive charge.
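The fragment population in the figure can be reproduced with a short sketch. It assumes the idealized outcome in which a fragment of every length is produced exactly once, and it writes the template 3’ to 5’ as in the figure; the function name `chain_termination_fragments` is hypothetical.

```python
from typing import List

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def chain_termination_fragments(template_3_to_5: str) -> List[str]:
    """Return every chain-terminated fragment, shortest first.

    The template is written 3' to 5' (as in the figure), so its
    base-by-base complement reads 5' to 3'. Each fragment is a prefix
    of that complement; the final base of each fragment is the
    dideoxy terminator.
    """
    complement = "".join(COMPLEMENT[base] for base in template_3_to_5)
    return [complement[:k] for k in range(1, len(complement) + 1)]


if __name__ == "__main__":
    # Shortest fragments travel farthest through the gel.
    for fragment in chain_termination_fragments("GCATGATCGG"):
        print(len(fragment), fragment + "dd")
```

Sorting the fragments by length, as the list already is, corresponds to reading the gel from the positive end back toward the wells.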
After the population has been generated, the fragments are denatured (melted) from their template strands and separated by gel electrophoresis (see Figure X).
Gel electrophoresis is a method of separating DNA molecules by size. An agarose or acrylamide gel is used as the medium for this procedure; agarose gels are better sieves for larger molecules (larger than a few hundred nucleotides) while acrylamide gels yield better resolutions for separating smaller DNA molecules (Snustad). The DNA populations are loaded into wells at one end of the gel. Since DNA molecules hold a negative electric charge, a positive charge is applied to the opposite end of the gel to attract the DNA. The structure of the gel is similar to a sieve. As the DNA molecules travel toward the positive charge, the gel makes it more difficult for larger molecules to pass and easier for smaller molecules to pass. Therefore, the closer the DNA molecule is to the positive end of the gel, the shorter the oligonucleotide (Khalsa).
The gel-separated fragments are then transferred onto a membrane via Southern blotting (see Figure X), a technique developed by Edwin Southern (Sanger).
Figure X: Southern blot procedure used to transfer DNA strands, separated by gel electrophoresis, onto nylon membranes (Snustad).
A nitrocellulose membrane or a positively charged nylon membrane is typically used. Transfer is usually done by capillary action, although a vacuum blot apparatus may be used instead. The vacuum blot apparatus works similarly to capillary action, except that the vacuum draws more of the transfer solution (usually SSC, a solution containing sodium chloride and sodium citrate) through the gel and the membrane, so the transfer takes about an hour instead of the several hours needed with capillary action. Once the DNA fragments are transferred onto the membrane, they should be fixed in place with ultraviolet light, which creates covalent bonds between the DNA and the membrane (Khalsa). The membrane is now ready for probing with a radioactively labeled DNA strand (hybridization).
Matching the Pattern
The probe used is wpattern, which represents the pattern we wish to match, labeled with radioactive 32P. The probe will anneal to the immobilized DNA on the membrane due to the binding of complementary strands. The nonhybridized probe is then washed off the membrane, and the membrane is exposed to X-ray film to detect any presence of radioactivity from the probe.
Interpreting the Results
When the X-ray film is developed, the dark bands will show the positions of the DNA sequences that have hybridized with the probe. The film should be read from bottom to top (i.e., from the shortest fragment to the longest). If no dark bands are present, the pattern was not found in the string.
If the pattern is found, the position of the first band will reveal the position of the pattern, because this band holds the shortest fragment that contains the complete pattern. Note, however, that all fragments longer than this will also contain at least this one instance of the pattern, so the probe will have annealed to all subsequent fragments as well. By measuring the light intensities with a spectrophotometer, one can determine whether multiple instances of the pattern were found on the same strand. Because each probe emits radiation independently of other probes, the light intensity of a fragment with exactly two matches should be twice that of a fragment with only one match. Empirical data should show that a graph of light intensities is a step function, with each step up marking the end of a new instance of the pattern.
<figure of expected results of spectroscopy>
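The expected step function can be sketched in software. This model assumes that the relative intensity of a band equals the number of complete pattern instances its fragment contains, and that probes bind every instance with equal efficiency. For simplicity it counts occurrences of the pattern in prefixes of the search string, which is equivalent to counting binding sites on the complementary fragments. All function names are hypothetical.

```python
from typing import List


def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def band_intensities(w: str, w_pattern: str) -> List[int]:
    """Relative intensity of each band, ordered from shortest fragment to longest.

    Entry k - 1 is the number of pattern instances in the fragment of length k.
    """
    return [count_overlapping(w[:k], w_pattern) for k in range(1, len(w) + 1)]


def match_end_positions(intensities: List[int]) -> List[int]:
    """Fragment lengths (1-based) at which the intensity steps up.

    Each step marks the last base of one instance of the pattern.
    """
    ends = []
    previous = 0
    for length, level in enumerate(intensities, start=1):
        if level > previous:
            ends.append(length)
        previous = level
    return ends
```

For the search string `GCATGATCGG` and the pattern `AT`, the intensities rise from 0 to 1 at fragment length 4 and from 1 to 2 at length 7, so the two instances end at the fourth and seventh bases. Subtracting the pattern length and adding one gives the start of each instance.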
An Improvement
The aforementioned analysis, while feasible, can be rather tedious and complicated. A spectrophotometer is expensive and may not be readily available. If we could somehow guarantee that the probe would attach only to the 3’ end of the strand, then each band on the X-ray film would indicate a distinct instance of the pattern. Even if multiple instances of the pattern overlap in the string we want to search, each instance should be easily identifiable by the naked eye. This modification would provide more clarity and would eliminate the need for a spectrophotometer or other special tools.
To accomplish this, we must first dedicate one of the nucleotides as a “terminating nucleotide” to mark the 3’ end of the strand. If the probe ends in the complement of this terminating nucleotide, then the probe will anneal only to the 3’ end of the strand, as desired. The cost is that we must reserve T as our terminating nucleotide and reduce the set of nucleotides in the synthesized fragments to A, G, and C. We select these nucleotides because a higher ratio of Gs and Cs makes for more stable strands (reference needed). Note that the roles of A and T could be interchanged and still produce the same results. The chain-terminators Add, Gdd, and Cdd are therefore “upgraded” to ATdd, GTdd, and CTdd, respectively. The results of PCR with Sanger’s chain-terminating procedure are all prefixes of the complement to the original input strand, starting opposite the 3’ end of the original strand. Each of these substrands will have an extra Tdd at its 3’ end (see Figure X).
3’ G C T G T C G G 5’
5’ C G A C A G C C Tdd
5’ C G A C A G C Tdd
5’ C G A C A G Tdd
5’ C G A C A Tdd
5’ C G A C Tdd
5’ C G A Tdd
5’ C G Tdd
5’ C Tdd
+ + + + + + + + + +
Figure X: Results of gel electrophoresis with modified chain-terminators. Each substring now has a Tdd marking the 3’ end of the strand. Note, however, that the template is restricted to Σ = {T, G, C} to prevent Tdd from annealing to the strand.
We now create a probe consisting of wpattern concatenated with A, the complement of our terminating nucleotide T. Since each of the substrands has exactly one T, contributed by the chain-terminator, the A from the probe will pair with this T and cause the probe to anneal only at the 3’ ends of the substrands.
If these probes are radioactively labeled, we can detect all instances of our pattern by exposing the membrane to X-ray film in the same manner as before. If no bands appear on the film, then no instances of the pattern were found in the input strand. If any bands do appear, the length of the respective substrand reveals the position of the pattern on the strand.
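The improved readout can be sketched as follows. The model is idealized: it assumes the probe stays bound only when it pairs along its whole length, including the final A against the terminating T, and that partially bound probes are washed away. The template is written 3’ to 5’ over {T, G, C}, as in the figure; all function names are hypothetical.

```python
from typing import List

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
TERMINATOR = "T"  # the reserved terminating nucleotide


def terminated_fragments(template_3_to_5: str) -> List[str]:
    """Return all prefixes of the complement, each capped with the terminating T."""
    if not template_3_to_5 or not set(template_3_to_5) <= set("TGC"):
        raise ValueError("template must be a nonempty string over T, G, C")
    complement = "".join(COMPLEMENT[base] for base in template_3_to_5)
    return [complement[:k] + TERMINATOR for k in range(1, len(complement) + 1)]


def probe_binds(fragment: str, w_pattern: str) -> bool:
    """True if the probe (w_pattern followed by A) pairs with the 3' end of the fragment.

    The probe pairs base by base with complement(w_pattern) followed by T,
    and T occurs in a fragment only as its final base.
    """
    target = "".join(COMPLEMENT[base] for base in w_pattern) + TERMINATOR
    return fragment.endswith(target)


def labeled_band_lengths(template_3_to_5: str, w_pattern: str) -> List[int]:
    """Lengths (excluding the terminator) of the fragments that show a band.

    Each length is the 1-based position of the last base of one pattern instance.
    """
    if not w_pattern or not set(w_pattern) <= set("TGC"):
        raise ValueError("pattern must be a nonempty string over T, G, C")
    return [
        len(fragment) - 1
        for fragment in terminated_fragments(template_3_to_5)
        if probe_binds(fragment, w_pattern)
    ]
```

For the template in the figure, `labeled_band_lengths("GCTGTCGG", "TG")` yields a single band at length 4, and overlapping instances such as `GG` in `GGGG` each produce their own band, which is the property the modification is meant to provide.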
Size Estimates
Oligonucleotides can currently be chemically synthesized without much error up to a length of 70-80 nucleotides (Ogihara). While it is possible to synthesize oligos 150 nucleotides long and longer, the yield of long oligos is extremely low because of the coupling efficiency, the probability that each nucleotide will attach correctly. Additional losses result from purification processes (HPLC, PAGE, or other), which are highly recommended for long oligos (“Metabion”).
The maximum length that can be synthesized routinely and economically is approximately 80 nucleotides long (“Metabion”). This may be far too short to represent a search space of practical length. However, the past ten years have shown great advances in DNA synthesis, and additional progress in this area may increase the length of usable, synthesizable oligos.
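The effect of coupling efficiency on yield can be illustrated with a short calculation. The sketch assumes that each coupling step succeeds independently with the same probability and that an oligo of n nucleotides requires n - 1 couplings. The default efficiency of 0.99 is an illustrative assumption, not a figure taken from the cited sources, and the function name `full_length_yield` is hypothetical.

```python
def full_length_yield(length: int, coupling_efficiency: float = 0.99) -> float:
    """Estimated fraction of synthesized oligos that reach full length.

    Assumes length - 1 independent coupling steps, each succeeding with
    probability coupling_efficiency. Purification losses are ignored.
    """
    if length < 1 or not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("length must be >= 1 and efficiency must lie in [0, 1]")
    return coupling_efficiency ** (length - 1)


if __name__ == "__main__":
    for n in (20, 80, 150):
        print(n, round(full_length_yield(n), 3))
```

Under this assumed efficiency, somewhat less than half of the 80-nucleotide oligos and less than a quarter of the 150-nucleotide oligos would reach full length, before purification losses. Because the yield decays geometrically with length, even small improvements in coupling efficiency extend the usable length considerably.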