Fluorescent in situ Sequencing on Polymerase Colonies
Robi D. Mitra1, Jay Shendure1, Jerzy Olejnik2, and George M. Church1*
1Lipper Center for Computational Genetics and Department of Genetics, Harvard Medical School, 200 Longwood Ave., Boston, MA 02115.
2Ambergen
*Corresponding author:
Abstract
Integration of DNA isolation, amplification, and sequencing can be achieved by the use of polymerase colonies and cyclic fluorescent dNTP incorporation. Four innovations bring us closer to sequencing genomes cost-effectively. A polymerase trapping technique eliminates loss of polymerase during the cycles and enables efficient nucleotide extension by DNA polymerase in an acrylamide matrix. Two different types of reversibly dye-labeled nucleotides can be incorporated by DNA polymerase and the dye removed by thiols or light. We have used these nucleotides to sequence multiple femtoliter-volume polonies in parallel. By limiting free primer in the polony amplification reactions, we found that high densities of polonies can be achieved with minimal overlap between adjacent polonies (polony exclusion principle). Finally, we have developed software for automated image alignment and sequence calling.
Introduction
Kurzweil's version of Moore’s law observes that the number of calculations per second per dollar has been doubling every 2.3 years since 1900 1. Analogous exponential improvements have occurred in macromolecular sequencing since the 1940s 2. Can this trend continue? Although over one hundred genomes have been sequenced, we have barely scratched the surface of the valuable range of sequence space. The ability to routinely and cheaply sequence cell-lineage specific transcriptomes, the diversity of antibody responses, or microbial biomes would provide a treasure trove of information for biomedical research and diagnostics. Most exciting is the prospect that the cost of sequencing could be reduced to the point where the full determination of one’s genome would become part of routine health care. Of course, for this to be feasible, the cost of DNA sequencing must be reduced by three to five orders of magnitude, as re-sequencing just a single human genome is estimated to cost more than ten million dollars 3. Achieving this goal would have an almost immediate impact on our understanding of the genetics of common and rare diseases, as well as on comparative genomics, mRNA expression analysis, population genetics, and functional genomics.
Several groups are working to develop disruptive sequencing technologies to enable cost-effective whole-genome resequencing. Many of the approaches involve sequencing a single DNA molecule, either by repeated cycles of single base extension 4, real time monitoring of the incorporation of dye-labeled nucleotides by DNA polymerase 5-7, or monitoring conductivity changes in a nanopore as a single stranded DNA molecule is passed through it 8,9. Other approaches first amplify the DNA and then sequence the product by pyrosequencing 10,11, or by hybridizing DNA to a high density oligonucleotide microarray 12,13. These approaches all have the potential to greatly speed the rate of DNA sequencing; however, none are mature technologies, and many have yet to demonstrate proof of principle.
We are adapting polony technology to enable the cost-effective sequencing of genomes and transcriptomes 14,15. This technology enables the amplification of millions of individual DNA molecules via the polymerase chain reaction (PCR) in an acrylamide gel attached to the surface of a glass microscope slide. Because the acrylamide restricts the diffusion of the PCR amplification products, each single molecule included in the reaction produces a spherical colony of DNA, which we have termed a polony (for polymerase colony). A similar technology, the molecular colony technique, grows nucleic acid colonies in acrylamide using Q beta replicase or PCR 16,17. Here we describe our progress in developing a protocol for sequencing polonies in parallel by performing repeated cycles of primer extension with fluorescent deoxynucleotides. To enable this protocol, we developed and characterized two types of reversibly dye-labeled deoxynucleotide analogues. We found that these nucleotides were correctly incorporated by DNA polymerase, and that the fluorophore could be quickly removed. We also describe a technique, polymerase trapping, that greatly increases the efficiency of extension by DNA polymerase in acrylamide gels. Finally, we demonstrate short (8 base pair) sequencing reads on multiple polonies amplified from single DNA molecules, and discuss the remaining hurdles to fully enabling this technology. Because tens of millions of femtoliter-scale polonies 14 can be amplified on a single microscope slide, this technology could enable very low cost DNA sequencing.
Results
Overview of Polony Sequencing
Polony sequencing is performed in three steps:
Step 1: Make a library of linear DNA molecules with universal priming sites (Fig. 1a).
Each molecule in the library contains a variable region flanked by two constant regions. The constant regions contain primer-binding sites to allow universal amplification by PCR. This type of library was first created to perform SELEX experiments 18.
Step 2: Amplify polymerase colonies (polonies) in an acrylamide gel (Fig. 1a).
A thin polyacrylamide gel is poured on a glass microscope slide. The linear library made in step 1 is amplified in this gel by performing PCR. At the end of the reaction, each single template molecule gives rise to a polony. As many as 15 million polonies can be amplified on a single slide. An acrydite modification 19,20 is included at the 5' end of one of the primers, so that one strand of the amplified DNA is covalently attached to the polyacrylamide matrix, allowing further enzymatic manipulations to be performed. The details of polony amplification have been previously described 14.
Step 3: Sequence polonies by fluorescent in situ sequential quantitation (FISSEQ) (Fig. 1b).
The immobilized DNA is denatured, the unattached strand is washed away, and a universal primer is hybridized to the template. A primer extension reaction is performed 21,22. The gel is then scanned using a scanning fluorescence microscope. If a polony has incorporated the added base, it will fluoresce, revealing the identity of the template base immediately 3' of the annealed primer. The fluorescence is then removed by cleaving the linker between the fluorophore and the nucleotide, and washing away the fluorophore. The cycle is then repeated by adding a different dye labeled base, washing away unincorporated nucleotide and scanning the gel. In this way, the sequence of every polony on the gel can be determined in parallel.
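The cycle described above can be sketched as a small simulation. This is a hypothetical illustration of the readout logic only, not the authors' software: the function names, the example template, and the base-addition order are all assumptions. It also assumes that the labeled nucleotides do not terminate extension, so a run of identical template bases is filled in within one cycle.

```python
# Hypothetical illustration of the FISSEQ readout logic; not the authors' code.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def simulate_fisseq(template, base_order):
    """Return one (added_base, n_incorporated) tuple per cycle.

    `template` lists the template bases in the order the polymerase meets
    them, starting with the base paired by the first extension. Assumption:
    the labeled nucleotide is not a terminator, so a homopolymer run in the
    template is extended completely within a single cycle.
    """
    position = 0
    cycles = []
    for base in base_order:
        incorporated = 0
        while position < len(template) and COMPLEMENT[template[position]] == base:
            incorporated += 1
            position += 1
        cycles.append((base, incorporated))
    return cycles


def call_sequence(cycles):
    """Call one base for every cycle in which the polony fluoresced."""
    return "".join(base for base, n in cycles if n > 0)


template = "TGCAAG"   # hypothetical template
order = "ACGT" * 3    # hypothetical base-addition order
cycles = simulate_fisseq(template, order)
print(cycles)
print(call_sequence(cycles))
```

In this sketch the synthesized strand is ACGTTC, but a presence/absence call yields ACGTC: the two consecutive T incorporations occur in one cycle and can only be resolved if the fluorescent signal scales with the number of bases incorporated.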
The Efficiency of Nucleotide Extension
Because FISSEQ requires a number of sequential base extensions, the incorporation of the correct base must be highly efficient to enable accurate sequencing. For example, if only 85% of the primer:template molecules in a polony are extended each time a correct base is added, then after 6 extensions, only $(0.85)^6$, or 38%, of the primer:template molecules will have correctly incorporated the added nucleotides. The remaining 62% of the molecules will be "out of phase" because they did not incorporate the correct base at an earlier cycle. To estimate the efficiency of nucleotide incorporation by DNA polymerase, polyacrylamide spots containing immobilized oligonucleotide templates were polymerized on three glass slides, and a sequencing primer was annealed (Fig. 2a). One template required the addition of dTTP (or its analogue dUTP) for correct incorporation. The other, a negative control, required the addition of dCTP for correct incorporation. Slides were incubated with unlabeled dTTP and DNA polymerase for varying amounts of time, ranging from fifteen seconds to six minutes. A control slide was not extended with dTTP. Cy5 labeled dUTP and DNA polymerase were then added to all slides, and the amount of fluorescent signal on each slide was compared to estimate how efficiently the unlabeled dTTP was incorporated (Fig. 2a). If the extensions went to completion, we would expect to see no fluorescent signal on slides incubated with unlabeled dTTP before the fluorescent extension. However, even after 6 minutes of incubation, there was still significant fluorescent signal incorporated (Fig. 2a and 2b), indicating that the extension with dTTP did not go to completion. By quantifying the signal, we estimated the efficiency of the 6 minute dTTP extension to be approximately 85%. As discussed above, this is too low a value to sequence more than just a few bases. By plotting the measured signal versus the time of dTTP incubation (Fig. 2b), we observe that the fraction of primer:template molecules extended can be approximated by the function $F(t) = 1 - \left(0.66\,e^{-t/12.8\,\mathrm{s}} + 0.34\,e^{-t/500\,\mathrm{s}}\right)$, where t is the dTTP incubation time in seconds. From this equation, we see that even significantly longer extension times will not increase extension efficiency. We repeated the experiment using thermostable polymerase and extending at higher temperatures, but this did not increase extension efficiency (data not shown).
To solve this problem, we developed a technique called "polymerase trapping", in which DNA polymerase is allowed to bind to the primer:template molecule, acrylamide monomer is added to the reaction, and polymerization is initiated, trapping the polymerase in a complex with the primer:template molecule. The polyacrylamide matrix prevents the polymerase from diffusing away from the primer:template, so that every primer that is extended during the first cycle of extension will continue to be extended at later cycles. We repeated the experiment described above using polymerase trapping and estimated that 99.8% of the molecules that have a polymerase molecule trapped on them are correctly extended (Fig. 2c). Trapping the polymerase within a polyacrylamide matrix also has the advantage that the enzyme does not need to be replenished every time a nucleotide addition is performed. We determined that trapped polymerase will remain bound to the primer:template complex for over 72 hours without diffusing away (supplementary information).
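The dephasing arithmetic and the fitted extension curve above can be checked numerically. The following is a minimal sketch: the function names are ours, and the model assumes that every extension step succeeds independently with the same per-step efficiency. The 99.8% figure applies only to molecules that carry a trapped polymerase.

```python
import math


def in_phase_fraction(efficiency, n_extensions):
    """Fraction of primer:template molecules still in phase after n extensions.

    Assumption: each extension succeeds independently with the same efficiency.
    """
    return efficiency ** n_extensions


def fraction_extended(t_seconds):
    """Biexponential fit quoted in the text; t is the incubation time in seconds."""
    fast = 0.66 * math.exp(-t_seconds / 12.8)
    slow = 0.34 * math.exp(-t_seconds / 500.0)
    return 1.0 - (fast + slow)


# Without trapping: 85% per step leaves about 38% in phase after 6 steps.
print(round(in_phase_fraction(0.85, 6), 2))    # 0.38

# The fit evaluated at 6 minutes, close to the ~85% estimated from the signal.
print(round(fraction_extended(360), 2))        # 0.83

# With trapping (molecules carrying a trapped polymerase only).
print(round(in_phase_fraction(0.998, 6), 3))   # 0.988
```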
As a second method to estimate the extension efficiency when polymerase trapping is employed, we amplified an 87 bp synthetic oligonucleotide in a polony reaction. The immobilized polonies were denatured, a sequencing primer was annealed, and DNA polymerase was "trapped" on the primer:template complex. Serial, single nucleotide additions were performed with unlabeled nucleotides for a number of cycles until the next base required to extend the growing strand was either a C or a T. A mixture of FITC labeled dCTP (green in Fig. 3) and Cy3 labeled dUTP (red in Fig. 3) was then added to the slide to determine if the correct base was incorporated. If the primer:template molecules in each polony had become dephased during the course of the unlabeled extensions, one would expect to see high levels of the incorrect base incorporated and therefore signal in both the red and green channels; however, even after 26 single base additions, which caused the DNA polymerase to extend the sequencing primer by 34 bases, only the correct channel displayed significant signal (slide 4 in Fig. 3).
Characterization of fluorescent nucleotides with a sulfhydryl-cleavable crosslinker
We synthesized four fluorescent deoxynucleotide analogues containing a disulfide linker between the fluorophore and the nucleotide (Fig. 4a and supplementary information). The design of these analogues was inspired by the published structure of the chemically cleavable biotinylated nucleotide analogue, Bio-12-SS-dUTP 23. To determine if these fluorescent analogues would be recognized by DNA polymerase, we took one of them, Cy5-SS-dCTP, and performed a nucleotide extension reaction on amplified polonies. We used 5 different templates for growing polonies, with only one template requiring a "C" to correctly extend the sequencing primer. The scanned gel is shown in Fig. 4b. In this image, the green (532 nm) and red (635 nm) channels are merged. The sequencing primer was labeled with a Cy3 dye, so that all polonies on the slide can be seen in the green channel. A fraction of the polonies display fluorescent signal in the red channel (yellow polonies), indicating a successful extension. We verified that only the correct polonies incorporated the fluorescent base by performing additional single base extensions with dye-labeled nucleotides, and we were able to match every polony on the gel with one of the five template sequences. This experiment demonstrated that the Cy5-SS-dCTP nucleotide analogue is recognized by DNA polymerase. To establish that we could remove the incorporated fluorescent signal, we incubated the slide in a buffer containing beta-mercaptoethanol and monitored the fluorescence decay (Fig. 4c and d). The fluorescent signal decayed exponentially, with a time constant of seven seconds. To determine the incorporation efficiencies of these fluorescent analogues relative to the corresponding natural deoxynucleotide, we performed extension reactions using varying ratios of each and quantified the incorporated fluorescence (supplementary information).
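As a rough guide to what a seven second time constant implies for the cleavage step, the sketch below evaluates a single-exponential decay. It assumes the signal decays to a zero baseline with exactly one time constant. The function names are ours and do not come from the original analysis.

```python
import math

TAU_SECONDS = 7.0  # time constant reported in the text


def fraction_remaining(t_seconds, tau=TAU_SECONDS):
    """Fraction of the initial signal left after t seconds (single exponential)."""
    return math.exp(-t_seconds / tau)


def time_to_reach(fraction, tau=TAU_SECONDS):
    """Seconds of incubation needed for the signal to fall to `fraction`."""
    return -tau * math.log(fraction)


print(round(fraction_remaining(7.0), 3))  # 0.368 after one time constant
print(round(time_to_reach(0.01), 1))      # 32.2 seconds to reach 1% of the signal
```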
In addition to the Cy5-SS-dNTP molecules just described, we also synthesized a dye-labeled nucleotide with a photocleavable crosslinker 24, Cy5-PC-dUTP. This molecule was also recognized by DNA polymerase, and the dye could be removed by exposure to light. Its structure and characterization can be found in the supplementary materials.
Sequencing Polymerase Colonies
As a proof of principle for polony sequencing, we used five synthetic oligonucleotides (templates T1-T5, see methods) as templates for polony amplification on two slides. Approximately 40 polonies were amplified on each slide. We denatured the polonies, hybridized a Cy3-labeled sequencing primer, and trapped polymerase onto the immobilized DNA. We then performed serial base additions with the Cy5-SS-dNTPs according to the protocol outlined in Fig. 1b (no unlabeled nucleotides were used for this reaction). Slide images from the first four cycles of base addition are shown in Fig. 5a. Here, the green (Cy3 sequencing primer) and red (Cy5 labeled base) channels are merged. Polonies that have not incorporated the added base are green, and polonies that have incorporated the added base are yellow. We performed 10 extension cycles on slide 1 and 14 extension cycles on slide 2 before the polyacrylamide gel detached from the surface of the slide.
To computationally identify polonies and call a sequence for each polony, we wrote a software package, PolCall. This package aligns the images collected from each cycle, and determines a sequence for every pixel. Pixels are then combined into objects based on their signature sequence, and two filters are applied to remove objects due to background noise and objects that represent the overlap of two polonies. This software was used to analyze the images from the polony sequencing cycles. Only a portion of each slide was analyzed, as some regions of the gel were torn during the sequencing process. This analysis was performed for slide 1 (Fig. 5b, Table 1) and slide 2 (Fig. 5c, Table 2).
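The per-pixel calling and grouping steps described above can be sketched as follows. This is not PolCall itself: the function names, the fixed intensity threshold, the 4-connected grouping rule, and the minimum-size noise filter are all assumptions made for illustration. The images are assumed to be already aligned, and the overlap filter is omitted because its criteria are not specified here.

```python
from collections import deque


def call_pixels(cycle_images, base_order, threshold):
    """Build a signature string per pixel from aligned per-cycle images.

    A pixel gains the cycle's base whenever its intensity reaches the
    (hypothetical) threshold in that cycle.
    """
    rows, cols = len(cycle_images[0]), len(cycle_images[0][0])
    signatures = [["" for _ in range(cols)] for _ in range(rows)]
    for image, base in zip(cycle_images, base_order):
        for r in range(rows):
            for c in range(cols):
                if image[r][c] >= threshold:
                    signatures[r][c] += base
    return signatures


def group_objects(signatures, min_pixels):
    """Group 4-connected pixels that share a non-empty signature.

    Objects smaller than `min_pixels` are dropped as background noise.
    """
    rows, cols = len(signatures), len(signatures[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not signatures[r][c]:
                continue
            sig = signatures[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:  # breadth-first flood fill over matching neighbours
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and signatures[ny][nx] == sig):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:
                objects.append((sig, pixels))
    return objects


# Two toy 3x4 images, one per cycle (hypothetical intensities).
cycle_a = [[9, 9, 0, 0], [9, 9, 0, 0], [0, 0, 0, 9]]
cycle_c = [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0]]
signatures = call_pixels([cycle_a, cycle_c], "AC", threshold=5)
objects = group_objects(signatures, min_pixels=2)
print([(sig, len(pixels)) for sig, pixels in objects])
```

In this toy example the isolated bright pixel in the bottom-right corner is discarded by the size filter, leaving two four-pixel objects with signatures "A" and "C".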
In total, 24 polonies were analyzed, and 5 unique sequences were identified. The called sequences are labeled as S1-S5 in Table 1. Sequences S1-S3 exactly match the template sequences T1-T3, indicating that these molecules were accurately amplified and sequenced. Sequence S4 has the same base composition as the template sequence T4; however, we were unable to correctly identify repeated base pairs. This was expected since 100% fluorescent base was used and fluorophore quenching affects signal linearity. S5 is similar to the expected sequence T5, but there are two bases that were called present in S5 that are not present in the template sequence (Table 2): a G in the second position and a T in the seventh position. These errors were systematic, as a total of five polonies were all called as sequence S5. The longest accurate read achieved in this experiment was eight base pairs (sequences S2 and S3 on slide 2).
Polony Exclusion at High Densities
In the polony sequencing experiment described above, pixels corresponding to regions of the slide where two polonies overlapped were automatically removed from the analysis by the sequencing software. This is necessary to maximize accuracy. However, we would also like to maximize the number of useful pixels per scan because we project scanning to be the most costly component of this technology. We hypothesized that by decreasing the concentration of free primer in the polony amplification reaction, we could reduce the amount of overlap between adjacent polonies, because as the polonies grew near one another, the free primer would be consumed and further growth inhibited. To test this, we amplified two polony templates using a reduced concentration of free primer. Primer extension was performed using Cy3 and Cy5 labeled nucleotides to distinguish the two species. Figure 6 demonstrates that polony overlap can be reduced using this strategy, as numerous adjacent polonies display a deformed shape, indicating growth inhibition due to the presence of a neighbor.
Discussion
There are three biochemical sources of error that will likely determine the read length attainable using the FISSEQ protocol. These sources of error are mispriming, misincorporation, and incomplete extension. Mispriming occurs when the sequencing primer anneals non-specifically to the template molecule and is extended upon addition of nucleotide. Mispriming can also occur when the 3' end of the template molecule loops onto itself and forms a hairpin, which is then extended upon base addition. Misincorporation occurs when a nucleotide is incorporated opposite a non-complementary template base. Once this happens, subsequent incorporation will be less efficient. Incomplete extension occurs when the incorporation reaction does not go to completion so that only a fraction of the primer:template molecules in a polony have correctly incorporated the added nucleotide. It is important to minimize misincorporation and incomplete extension as these types of error accumulate exponentially with the number of extensions.