A Probabilistic Approach to Single-Nucleotide Polymorphism Discovery

The Biology of Genomes Marth

INFORMATICS TOOLS FOR HUMAN GENOME RESEQUENCING

Gabor T. Marth, Michael Stromberg, Chip Stewart, Weichun Huang, Aaron Quinlan

Boston College, Chestnut Hill, MA 02467

Next-generation sequencing technologies are now capable of producing over a gigabase of useful data per machine per day. This vast throughput has led to the sequencing of several notable individual human genomes, and the 1000 Genomes Project is gearing up for en masse sequencing of thousands more individuals. Because of rapid technological changes, software tools for mammalian-scale resequencing analyses are currently in flux. As next-generation human resequencing becomes more routine, there is a growing need not only for efficient software but also for clear algorithmic behavior and well-characterized performance.

We developed a complete suite of software tools for mammalian-scale variation discovery. (1) For read alignment, we updated our aligner/assembler program, MOSAIK, to work with paired fragment-end reads, including 454 and dibase-encoded SOLiD sequences. The final MOSAIK read alignments are constructed with the Smith-Waterman algorithm, producing the gapped alignments necessary for short-INDEL discovery and for aligning reads that contain INDEL errors. This highly sensitive technique also allows us to accurately identify reads that map to a unique genome location, and to report every alignment position for reads that map to multiple regions, a behavior critical for accurate SNP calling. (2) We have completely re-engineered our polymorphism discovery program, POLYBAYES, for heterozygous SNP and short-INDEL detection in diploid, whole-genome short-read sequence data, and added algorithms for accurate individual genotype calling based on the aligned reads. (3) We developed a new program, SPANNER, for detecting structural variation events from paired-end read map positions, and for quantifying copy number from the depth of read coverage. (4) We customized our assembly viewer program, EAGLEVIEW, for visual data validation. These tools form an integrated informatics pipeline using efficient, standardized read, assembly, and annotation data file formats.
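To illustrate why gapped alignment matters for short-INDEL discovery, the following is a minimal textbook sketch of Smith-Waterman local alignment with a linear gap penalty. It is not MOSAIK's implementation: the function name `smith_waterman`, the scoring values (match 2, mismatch -1, gap -2), and the example sequences are all assumptions chosen for illustration.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Best local alignment of `read` against `ref` (linear gap penalty).

    Returns (score, aligned_read, aligned_ref, ref_start), where ref_start
    is the 0-based reference offset at which the alignment begins.
    Scoring values are illustrative assumptions, not MOSAIK's parameters.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores are floored at zero,
    # which is what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # match or mismatch
                score[i - 1][j] + gap,      # read base opposite a gap in ref
                score[i][j - 1] + gap,      # ref base opposite a gap in read
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    aligned_read, aligned_ref = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_read.append(read[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_read.append(read[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_read.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1

    return best, "".join(reversed(aligned_read)), "".join(reversed(aligned_ref)), j


# Hypothetical example: the read lacks one T relative to the reference.
ref = "TTTACGTTGCAGGG"
read = "ACGTGCA"
score, aligned_read, aligned_ref, ref_start = smith_waterman(read, ref)
print(score, ref_start)
print(aligned_read)
print(aligned_ref)
```

In this toy case an ungapped alignment can score at most 8, whereas the gapped alignment recovers all seven read bases with a single one-base gap, which is the kind of evidence a short-INDEL caller needs. Running the same routine against several candidate reference locations and keeping every location that reaches the top score is one simple way to distinguish uniquely mapped reads from multiply mapped ones.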

We also developed a benchmarking suite that allows us to test the performance of alignment and variation detection software based on synthetic datasets generated from informed models of sequence variations and technology-specific sequencing error profiles. Benchmarking has allowed us to measure, and subsequently improve, the accuracy and sensitivity of our analysis software, as we report in this presentation.
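The core idea of benchmarking against synthetic data can be sketched as follows: because the planted variants are known, any set of calls can be scored exactly. This is a simplified illustration rather than the benchmarking suite itself. The helper names `plant_snps` and `score_calls` are hypothetical, SNPs are planted uniformly at random instead of from an informed variation model, and precision is used here as a stand-in for "accuracy".

```python
import random


def plant_snps(reference, n_snps, seed=0):
    """Choose n_snps distinct positions and a non-reference allele for each.

    Returns the truth set as {(position, alt_base)}. Uniform sampling is an
    assumption made for brevity; a real benchmark would use a variation model.
    """
    rng = random.Random(seed)
    truth = set()
    for pos in rng.sample(range(len(reference)), n_snps):
        alt = rng.choice([b for b in "ACGT" if b != reference[pos]])
        truth.add((pos, alt))
    return truth


def score_calls(truth, calls):
    """Compare called variants with the known truth set."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # planted and called
    fp = len(calls - truth)   # called but never planted
    fn = len(truth - calls)   # planted but missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / (tp + fn) if truth else float("nan"),
        "precision": tp / (tp + fp) if calls else float("nan"),
    }


# Hypothetical run: a caller that finds 8 of 10 planted SNPs plus one false call.
reference = "ACGT" * 25
truth = plant_snps(reference, n_snps=10, seed=1)
calls = set(sorted(truth)[:8]) | {(10**6, "A")}
print(score_calls(truth, calls))
```

In a fuller version, reads would be simulated from the mutated reference with a technology-specific error profile, aligned, and passed to the variant caller, and `calls` would be that caller's output.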

We describe the application of our tools for SNP and short-INDEL discovery in whole-genome human short-fragment paired-end Illumina/Solexa sequencing reads collected from a normal human genome. We also demonstrate our pipeline for SV discovery in a whole-genome, normal human resequencing dataset consisting of 45 million 2x25-bp paired-end reads from ~2kb fragments (25x physical clone coverage) sequenced with the AB SOLiD system, and in SOLiD paired-end datasets from human disease genomes for which tiling microarray data is available to evaluate our computational structural variation candidates.
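The difference between sequence coverage and physical (clone) coverage in such a dataset can be made concrete with a short calculation. The sketch below rests on assumptions that are not stated in the abstract: a genome size of 3.1 Gb, "45 million" read as 45 million read pairs, and every pair counted. Its output therefore need not reproduce the quoted 25x exactly.

```python
def sequence_coverage(n_pairs, read_len, genome_size):
    """Average number of sequenced bases covering each genome position."""
    return n_pairs * 2 * read_len / genome_size


def physical_coverage(n_pairs, fragment_len, genome_size):
    """Average number of fragments (clones) spanning each genome position."""
    return n_pairs * fragment_len / genome_size


GENOME_SIZE = 3.1e9    # assumed human genome size in bp
N_PAIRS = 45e6         # assumes "45 million" refers to read pairs
READ_LEN = 25          # bp per read end (2x25 bp)
FRAGMENT_LEN = 2000    # approximate fragment length in bp (~2kb)

print(f"sequence coverage: {sequence_coverage(N_PAIRS, READ_LEN, GENOME_SIZE):.2f}x")
print(f"physical coverage: {physical_coverage(N_PAIRS, FRAGMENT_LEN, GENOME_SIZE):.1f}x")
```

Under these assumptions the sequenced bases cover the genome less than once on average, while the fragments span each position tens of times over. That contrast is why paired-end map positions can reveal structural variation even when base-level coverage is low.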