George: I have used the ‘Track changes’ function to mark my edits and suggestions. In case you are unfamiliar with this function, it can be found under Tools in the Standard toolbar. My queries are in bold type. Terms that could do with a glossary definition are in small caps.

(we can place the URL of your homepage in the online version of the article, where it can be linked to your affiliation via your autobiographical sketch (I’ll ask you to provide this later))

Personal genomes and the next generation of DNA sequencing technology.

Advances in DNA sequencing technology: methods and goals Personal Genome Project Technology

Jay Shendure^, Rob Mitra*, George M. Church^+

^ Harvard Medical School, 77 Ave Louis Pasteur, Boston, MA 02115, USA.

* Dept. of Genetics, Washington University School of Medicine, 4566 Scott Avenue, St. Louis, MO 63110, USA

+ Corresponding author, email: (contact: )


Preface

Nearly three decades have passed since the concurrent invention of the Maxam-Gilbert and Sanger methods for DNA sequencing. Automation and miniaturization of Sanger sequencing led to the development of high-throughput sequencing machines, the workhorses that delivered the human genome under budget and ahead of schedule. A new generation of technologies aspires to push DNA sequencing to a new level, such that genomes of individual humans can be sequenced at a cost compatible with routine health care. Here we review technologies under development, and discuss the potential impact of the “Personal Genome Project” on both the research community and society.

Introduction

In the Human Genome Project (HGP), early investments in the development of cost-effective sequencing methods undoubtedly contributed to its resounding success. Over the course of a decade, through the refinement, parallelization, and automation of established sequencing methods, the HGP motivated a 100-fold reduction of sequencing costs, from 10 dollars per finished base to 10 finished bases per dollar [Col03]. The relevance and utility of high-throughput sequencing and sequencing centers in the wake of the HGP have been a subject of recent debate. The number of species for which canonical genomes are sequenced and assembled is rapidly growing, but the list of species that are both interesting and relevant is ultimately finite. Nevertheless, a number of academic and commercial efforts are pushing for new ultra-low-cost sequencing (ULCS) technologies that aim to reduce the cost of DNA sequencing by several orders of magnitude. Why? First, whereas many biomedical and bioagricultural goals are not practical with current cost structures, they are quite justifiable at a cost such as 1 million bases per dollar. Second, sequencing at the low costs achieved by the HGP requires the involvement of a large sequencing center, and these are few in number and under heavy demand. Many of the new technologies aim to place high-throughput sequencing within reach of individual labs or core facilities. Finally, such technologies have the potential to bring genome sequencing out of the laboratory and into the clinic, as they approach costs at which the complete sequencing of “personal genomes” might be affordable.
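As a rough illustration of the cost figures quoted above, the sketch below converts a sequencing rate in bases per dollar into the price of a single pass over a human-sized genome. The genome size of about 3 billion bases, the absence of redundant coverage, and the function name cost_per_genome are assumptions made for illustration; they are not figures or methods taken from this article.

```python
# Assumption (not from the article): a haploid human genome of roughly
# 3 billion bases, read exactly once with no redundant coverage.
GENOME_BASES = 3_000_000_000


def cost_per_genome(bases_per_dollar: float, genome_bases: int = GENOME_BASES) -> float:
    """Return the dollar cost of reading genome_bases once at the given rate."""
    return genome_bases / bases_per_dollar


# Rates quoted in the text, expressed as bases per dollar.
scenarios = {
    "start of HGP (10 dollars per base)": 0.1,
    "end of HGP (10 bases per dollar)": 10.0,
    "ULCS example (1 million bases per dollar)": 1_000_000.0,
}

for label, rate in scenarios.items():
    print(f"{label}: ${cost_per_genome(rate):,.0f}")
```

Under these assumptions, 10 bases per dollar corresponds to roughly $300 million for one pass over a genome, and 1 million bases per dollar to roughly $3,000. Real projects read each base several times, so actual costs would be higher.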

Here we review novel sequencing technologies that are under development, and discuss their relative feasibility and progress to date. The technologies can be generally classified into one of the following: (a) micro-electrophoretic methods, (b) sequencing-by-hybridization, (c) cyclic-array sequencing, and (d) single-molecule sequencing. The field is still relatively small; probably fewer than 10 academic labs and fewer than 10 companies are focused on this area. All technologies are at an early stage of development, such that it is difficult to gauge the time-frame before any given method will truly be practical and live up to expectations. Nevertheless, significant progress has been made with relatively limited resources, such that substantial further investment is likely justified.

Boosting the capacity of research-oriented DNA sequencing was the original motivation for pursuing new technologies. Since 2001, the primary justification for these efforts has gradually shifted to the notion that the technology could become so affordable that sequencing the full genomes of individual patients would be justified from a health-care perspective [Jon01][Pra02][Pen02][Sal03]. Here we use the phrase “Personal Genome Project” (PGP) to describe this goal. What are the potential health-care benefits? At what cost threshold does the PGP become justified? With respect to issues such as consent, confidentiality, discrimination, and patient psychology, what are the risks?

Preface: please provide a short and snappy standalone preface – no more than 100 words.

Introduction

The introduction could be expanded a bit. See comments within.

The success of the Human Genome Project (HGP) illustrates how even modest investments in developing cost-effective sequencing methods can have payoffs for the whole biomedical community. Over the course of a decade, through the refinement, parallelization, and automation of established sequencing technologies, the HGP motivated a 100-fold reduction of sequencing costs, from 10 dollars per base to 10 finished bases per dollar [Col03]. However, for a wide range of biomedical and bioagricultural goals, a strong need is evolving to accelerate the pace of these advances. Please pause to explain what these biomedical and bioagricultural forces are. This has been accompanied by calls to reduce the cost of sequencing to less than $5000 per human genome, a price threshold below which individual patients could conceivably afford "personal sequences" [Jon01][Pra02][Pen02][Sal03]. Please explain succinctly why this goal might be desirable – ie do we need it or is there simply a demand for it? To meet this challenge for cheaper and more accurate high-throughput sequencing, scientists and engineers will have to further reduce DNA sequencing costs by a factor of 1,000. The PGP has to be introduced in more detail here – when was it first proposed, who are its proponents, what is the rationale behind the technological advances, how feasible is it and what is the timescale. Several technologies are being developed that could make the Personal Genome Project (PGP) a reality. This will impact health care, both directly by providing diagnostic and prognostic markers for the clinical setting, and indirectly by accelerating the pace of basic and clinical biomedical research.
Here we review potential PGP sequencing technologies and discuss the impact they could have on the biomedical community – could you expand this sentence a little – eg to say that the impact is not just technical or biomedical but also ethical.

We are increasingly seeing the potential fruits of genomics in the clinical setting, but clearly there is a long way to go. Sequencing large numbers of human genomes could itself be a means of expediting this process.

The idea that the cost of sequencing a full human genome (effectively, the sum of all possible genetic tests) could approach the current cost to patients of many single genetic tests is certainly appealing (that is, more for your money). For example, what are the potential benefits of sequencing the genome of a perfectly healthy baby? From multiple perspectives (anonymity, psychology, insurance, etc.), what are the risks?

A Brief History of Sequencing (This title is a bit grand! How about: ‘Traditional sequencing’ - ?)

In 1977 two research groups (one led by F. Sanger and the other by W. Gilbert) that were familiar with peptide and RNA sequencing methods made a technical leap forward for sequencing in general and DNA sequencing in particular, by harnessing the amazing separation power of gel electrophoresis [Gil81,San88]. Although dideoxy and Maxam–Gilbert sequencing were costly relative to today (can you give numbers, for comparison?), many scientists adopted and improved these techniques. By 1985, enough progress (qualify this progress please) had been made that a small group of scientists set themselves the audacious goal of sequencing the entire human genome between 1990 and 2005 [Coo89][Col03]. This declaration met with considerable resistance from the wider community. At the time, many felt the cost of DNA sequencing was far too high and the sequencing community too fragmented to complete such a vast undertaking. However, the critics did not account for the rapid pace of technical and organizational innovation – what, specifically, was achieved? Ie what allowed the draft human sequence to be produced so quickly and cheaply? In 2000, five years ahead of schedule, a useful draft sequence of the human genome was published (ref) at a cost of $300 million for the bulk of the sequence, which was significantly under budget. Possibly even more significant was the appearance of a culture of technology and "open" data and software [Col03]. This sentence is rather terse but makes an important point (and that is picked up on later in the piece) and so should be expanded.

Why continue sequencing? How many nucleotides are out there?

I have suggested a way of linking this section to the next – by all means change the link, provided we have one of some sort.

In addition to Homo sapiens, the genomes of over 160 organisms have been fully sequenced, as well as parts of the genomes of over 100,000 taxonomic species. There are currently 20 billion bases in international databases [Gen03], reflecting the complexity of life on earth. But is sequencing absolutely essential to understand life? Or are there other means by which we can learn about the biosphere and our own species? We argue that although sequencing the biosphere is unnecessary and impractical, whole-genome sequencing, once the technology is made faster and cheaper, will improve existing biological and biomedical investigations and help to develop new genomic and technological studies (see also Box 1).

Impact of sequencing on existing and new studies. By comparing genomic sequences between organisms, we are learning a great deal about our own molecular program, as well as that of the other organisms in the biosphere [Ure03,Car03]. This approach will become more powerful as we sequence more genomes [Tho03]. As we increasingly perceive the significance of inter-individual and tissue-specific variation in sequences and their interactions, it is both humbling and amusing to compare the current 20 billion bases in international databases [Gen03] to the complexity of sequences on earth. A global biomass of over 2e18 g contains an estimated total biopolymer sequence of 1e38 residues. While sequencing the biosphere is unnecessary and impractical, we clearly have a great deal to learn.

Why Continue Sequencing?

An ultra-low-cost sequencing (ULCS) technology would greatly change the way biologists work. Mutagenic or population genetic screens in model and non-model organisms would be more powerful if one could sequence the genomes in crosses to find the responsible mutations.

In addition to making well-established techniques more powerful, improved sequencing power could open up new research fields. For example, one could directly query the antibody diversity that is generated in response to disease. ULCS would benefit synthetic biology and genome engineering, ranging from selecting new enzymes to building??? new chromosomes, both of which are powerful tools for perturbing or designing complex biological systems. Whether the goal is the accurate synthesis of DNA/RNA?? or combinatorial scrambling (please paraphrase), these large syntheses will have many unknowns (meaning?), and ULCS would help determine or ensure engineered DNA design specifications. These engineered sequences are subject to the forces of mutation and selection, and it will be important to monitor base changes as they occur (meaning?). Even further beyond normal genomes than the above synthetics lie DNA computing [Bra02] and DNA as ultracompact memory (nm^3/bit rather than the 1e11 nm^3/bit of conventional storage media like DVDs) (ref?). Please say a bit more about what DNA computing is and the use of DNA in ultracompact memory - readers might not have heard of these applications.

The introduction mentions bioagricultural goals – could these be covered here?

The personal genome project and human health. Perhaps the most compelling reason to pursue low-cost, high-throughput sequencing technology is the impact it could have on human health. Inexpensive (defined as ?) sequencing would give us access not only to our diploid inherited genetic make-up but to many aspects of our daily changing internal environment, including how our body responds to pathogens and allergens, and the genetic changes associated with cancer and immune cells. It is occasionally claimed that all that we can afford (and hence all that we need) is information on "common" single nucleotide polymorphisms (SNPs), or the arrangements of these (haplotypes) [Gib03], in order to understand so-called "multifactorial" or "complex" diseases [Hol00]. In a non-trivial sense all diseases are "complex". As we get better at genotyping and phenotyping we simply get better at finding the factors contributing to ever lower penetrance and expressivity. A focus on "common" alleles will probably be successful for alleles maintained in human populations by heterozygote advantage (such as the textbook relationship between sickle cell anaemia and malaria) but would miss most of the genetic diseases documented so far [Vit03]. In any case, even for diseases that are amenable to the haplotype mapping approach, ultra-high-throughput sequencing would allow geneticists to move more quickly from a haplotype that is linked to the disease to the causative SNP. Additionally, many candidate loci could be investigated by sequencing them across large populations. For diseases caused by multiple but rare mutations, it is absolutely critical to directly sequence the causative SNP, and a whole-genome sequencing technology would provide geneticists with the ability to do this. What advances in terms of speed, cost & accuracy would make it possible to achieve this goal?

Another medically important area in which ULCS could make a large impact is cancer biology. Cancer is fundamentally a disease of the genome: the gradual accumulation of somatic mutations during normal cell division is thought to give rise to malignant cells. Epidemiology suggests that mutations in three to seven genes are necessary to cause cancer [Mil80]. It is now becoming clear that different sets of genes are mutated in different cancers, and the order in which these genes acquire mutations can vary. In addition, genomic instability may play a role in cancer progression [Raj03]. Therefore, there are many different paths to cancer. Although a large number of distinct genomic mutations and aberrations have been found in cancers, patterns are starting to emerge. For example, there are thought to be six essential features that a cancerous cell must acquire, and by looking for disrupted pathways rather than just individual genes, a better understanding of tumorigenesis has been acquired [Han00]. The ability to sequence and compare complete genomes from many normal, neoplastic, and malignant cells will greatly increase our understanding of tumorigenesis. The comprehensive detection of all somatic mutations (base changes, genomic rearrangements, and epigenetic variation) in the genome should capture most, if not all, of the factors involved in cancer initiation and progression, and allow us to exhaustively catalogue the molecular pathways and checkpoints that are inactivated as a tumor develops.

An ultra-low-cost sequencing (ULCS) technology would greatly change the way biologists work. Mutagenic or population genetic screens in model organisms (yeast, nematode, fly, zebrafish, mice) and non-model organisms would be more powerful if one could sequence the genomes in crosses to find the responsible mutations. In addition to making well-established techniques more powerful, ULCS would open up new possibilities. For example, one could directly query the antibody diversity generated in response to disease. ULCS would benefit synthetic biology and genome engineering, ranging from selecting new enzymes to new chromosomes, powerful tools to perturb and design complex biological systems. Whether the goal is accurate synthesis or combinatorial scrambling, these large syntheses will have many unknowns, and ULCS would help determine or ensure engineered DNA design specifications. These engineered sequences are subject to the forces of mutation and selection, and it will be important to monitor base changes as they occur. Even further beyond normal genomes than the above synthetics lie DNA computing [Bra02] and DNA as ultracompact memory (nm^3/bit rather than the 1e11 nm^3/bit of conventional storage media like DVDs).