Supplemental Methods s3

Supplemental Methods:

Variant Calling: All scripts used in this pipeline are available at: https://bitbucket.org/soccin/hajava

They can be downloaded by cloning this repository. N.B., the scripts contain hard coded paths to our local computer system and code to run in our parallel environment (SGE) so they will not be transparently runnable. However there is an INSTAL doc provides installation instructions. Any script below described as custom is available in this repository. We however cannot redistribute code that belongs to others. In particular, the GATK and MUTECT JARs used may not be easily available.

The pipeline was modeled after the GATK best practice document (~2012), which unfortunately is no longer available. So we will attempt to reproduce the key portions of the pipeline here. The raw sequencing reads (FASTQ) files were preprocessed to remove the Illumina sequencing adapters. We used the following maximal common prefix sequence AGATCGGAAGAGC that is the common prefix for all barcodes and for both the P5 and P3 adapter sequence and also preserved over the various library prep kits used in this study. Note we discovered a strand specific artifact when using an adapter sequence that is too small that will suppress mutations that change the reference sequence to match the small adapter. This effect was seen with a hexamer but was replaced with the longer one used here. To do the clipping the fastx_clipper (ver 0.0.13) from the FASTX-Toolkit was used[1] called as follows:

fastx_clipper -M $MIN_ADAPTER_LEN -Q33 -v -n -a $ADAPTER

Reads shorter then 10bp after clipping were discarded. We then re-paired the two reads files to drop any read pair whose mate was discarded in clipping using a custom script (matchPE.py).

The clip reads were then mapped to the appropriate genome using BWA ALN (ver 0.5.9-r16). The three different mouse models were mapped each to a slightly different version of the genome. The Kras model was mapped to the standard mm9 (NCBI37) genome (download from UCSC) and all chromosomes (include _random ones) were used. For the human transgene models we constructed a hybrid genome of mm9 plus either the hEGFR or the hMYC construct. The addition of these sequences is necessary to eliminate false positive from reads arising from the human sequences that would be mapped to the homologous regions in mm9 in the absence of the human target sequence in the genome index. The mapper was called with default options:

bwa aln $GENOME_BWA $IN1 >$OUT1

bwa sampe -f $OUT12 $GENOME_BWA $OUT1 $OUT2 $IN1 $IN2

After mapping the output sam file was filtered with a custom script to remove improperly paired reads (filterProperPair.py). Besides checking that the reads were mapped in opposite orientation and that the proper pairing flags were set we removed reads that had a TLEN greater than 80bp or less than 500bp. Read groups were then added using the PICARD toolkits (ver 1.55) AddOrReplaceReadGroups which also sorted the BAM file in coordinate order. The next step in the single sample part of the pipeline is to mark duplicate reads using PICARD MarkDuplicates. We set the remove duplicates option to reduce the size of the output files. The BAM files were then filtered to remove reads whose MAPQ was less than 30 using samtools (ver 1.1.18) and finally BAM indices were built with PICARD BuildBamIndex.

The next part of the pipeline operates on tumor/normal BAM pairs and uses the Genome Analysis Tookit (GATK) suite of programs to post process the BAMs to prepare them for mutation calling. The first step is to realign the BAM files using a local realignment method to attempt to reduce mismatches. This is usually due to nearby insertions or deletions. This was done in a two step process using the GATK RealignerTargetCreator which finds problematic regions followed by the IndelRealigner which does the actual realignment in the previous determined target areas. We used GATK version 1.6-7-g2be5704 (with JAVA version 1.6) as follows and processed the BAM pairs by chromosomes independently. Note in the following program calls show key options or those that are different from default. Log files, parallel options, etc were removed for brevity. For the full set of options used in the pipeline please see the source code in the repository referenced above. The target creation and realignment steps were run as follows:

java –jar GenomeAnalysisTK.jar –T RealignerTargetCreator \

-R $GENOME_FASTQ -L $CHROM -S LENIENT \

-o ${CHROM}__output.intervals \

-I $NORMAL -I $TUMOR

java –jar GenomeAnalysisTK.jar -T IndelRealigner \

-R $GENOME_FASTQ -L $CHROM -S LENIENT \

-targetIntervals ${CHROM}___output.intervals \

--maxReadsForRealignment 500000 --maxReadsInMemory 3000000 \

-I $NORMAL -I $TUMOR \

-o ${CHROM}___Realign.bam

The next step in bam post-processing is to recalibrate the base level quality scores generated by the sequencing instrument. Again this is a two step process with GATK: first a table of covariates is computed using CountCovariates. This step must process all BAMS (tumor and normal and all chromosome parts) together.

java –jar GenomeAnalysisTK.jar -T CountCovariates \

-R $GENOME_FASTQ -L $TARGET_REGION -S LENIENT \

-cov ReadGroupCovariate \

-cov QualityScoreCovariate \

-cov CycleCovariate \

-cov DinucCovariate \

-cov MappingQualityCovariate \

-cov MinimumNQSCovariate \

--knownSites:BED $KNOWN_SNPS \

-I $INPUTS \

-recalFile recal_data.csv

We used the standard covariate recommended by the GATK best practices document and for the known sites we used dbDNP version 128, which was acquired from UCSC Table Browser. After the covariate table is computed the BAMs are recalibrated with the TableRecalibration module:

java –jar GenomeAnalysisTK.jar -T TableRecalibration \

-R $GENOME_FASTQ -S LENIENT \

-L ${CHROM}__TARGET.bed \

-recalFile recal_data.csv \

-I ${BAM} \

-o ${BAM%.bam},Recal.bam

This step is done in parallel over tumor and normal and individually over chromosomes. At this point the tumor and normal BAMS are each merged from their chromosome pieces and are now ready for mutation calling.

For mutation calling we used a combination of two callers. Evidence has been accumulating that combining the output of multiple mutation calling methods can results in higher performance than any individual caller. We should in our dilution series that combining the two callers we used resulted in better specificity. Others have shown similar results ref 21 and ref [[Ewing2015]][2]. In our pipeline we used two callers and took the intersection of events between the two to increase specificity. The first caller was MUTECT (ver 1.1.4), which was called as follows on the tumor/normal bam pairs:

java -jar $SDIR/bin/muTect-1.1.4.jar \

--analysis_type MuTect \

--read_filter BadCigar \

--reference_sequence $GENOME_FASTQ \

--dbsnp $DBSNP_VCF \

--intervals $TARGET_REGION \

--input_file:normal ${NORMAL}.bam \

--input_file:tumor ${TUMOR}.bam \

--enable_extended_output \

--vcf mutect.vcf \

--out mutect.out

The second caller was a custom caller built around GATK’s UnifiedGenotyper (UGT). It was used to call mutations in the tumor normal pairs and then a custom script was used to post-process this output to filter for somatic mutations that passed various quality thresholds. The first step was to run UGT on the BAM pairs:

java –jar GenomeAnalysisTK.jar –T UnifiedGenotyper \

-R $GENOME_FASTQ \

-L $TARGET_REGION \

-A DepthOfCoverage -A AlleleBalance \

-glm SNP \

-stand_call_conf $STAND_CALL_CONF \

-stand_emit_conf $STAND_EMIT_CONF \

-dcov $DCOV \

-mbq $MBQ \

$INPUTS \

-o ${SBASE}_UGT_SNP.vcf

This file will contain both germline and somatic mutations. To filter for somatic mutations a custom script was used (getSomaticEvents.py) which applied the following filters:

· Read depth of Tumor > 13 AND read depth of Normal > 7 (this is the same thresholds used by MUTECT)

· GenoType of Normal == 0/0 and GenoType of Tumor == 0/1 or 1/1

· NRAF of Normal < 0.05

· Strand Support; the mutations had to be seen on at least 1 read from both strands

There was no filtering on the NRAF of the tumor. The mutations from both algorithms were combined by taking the intersection of the two sets. This combination maximizes specificity. The merged list was then post-filtered to remove likely germline events and possible sequencing artifacts using a black list comprised of:

1. dbSNP128 (from UCSC)

2. JAX CGD Imputed SNPs database from Jackson Labs available at[3]
http://cgd.jax.org/datasets/popgen/imputed.shtml

3. Pooled normal database from the HaJaVa samples.

The third list was generated using all the normal samples in this study and calling mutations on them using the UnifiedGenotyper (same options as for the somatic calling above). Then a list was compiled from events that had:

· Coverage greater than or equal to 14 reads

· Occurred in more than 2 normal samples

· Kras G12D was removed from this list

We whitelisted Kras G12D, which was introduced into the normals of the Kras model. This final mutation list was annotated using Annovar (version 2012-10-23):

annotate_variation.pl -buildver mm9 -geneanno $EVENTS

annotate_variation.pl -buildver mm9 -filter -dbtype snp128 $EVENTS

The full mutation list is available as supplemental table eventsVer13__Intersect__20150528.xls

DNA Copy Number analysis

Copy number was determined by first computing normalized log ratios between tumors and matched normals. We used the bedtools multicov program from the bedtools package (version 2.16).[4]

echo –e "Probe\tChr\tPos\t"$BASEBAM >$OUT

bedtools multicov -p -q 1 -bed $BED -bams $BAM \

| awk -F"\t" '{print "'$BASEBED'___"$4,$1,int(($3+$2)/2),$5}' \

| tr ' ' '\t' >${OUT}

This tool will count the read coverage over all the target intervals in the BED file filtering for properly paired reads and a MAPQ greater than zero (i.e, remove multiple mapping reads). The reads counts are then assigned to the midpoint of target range. The targets used were the exon probe intervals for the capture array.[5] The coverage was computed for each tumor/normal pair of BAMs from the variant pipeline. The counts from the pair were normalized to account for GC fluctuations and different overall depth using R’s loess fitting function and fitting to the %GC content of the exon probe intervals:

model=loess(counts ~ pctGC,

dat=data.frame(counts=counts,pctGC=pctGC),

control=loess.control(surface="interpolate",

statistics="approximate",

trace.hat="approximate",

cell=0.2, iterations=4))

center=median(counts)

return(counts-model$fitted+center)

The log Ratio for each probe was computed using:

logR=lm(dT$normCounts ~ offset(dN$normCounts))$residuals

The logR was then segmented using the Circular Binary Segmentation algorithm (CBS) described in ref 27 and implemented in the R/Bioconductor package DNAcopy as follows:

cna=CNA(dd$logR,dd$Ch,dd$Pos,"logratio",sampleName)

smCNA=smooth.CNA(cna)

smOut=as.matrix(smCNA)

d=segment(smCNA,verbose=T,alpha=0.01,nperm=10000,

undo.splits='sdundo', undo.SD=1.75)

The output segmentation data was then post normalize to center the highest peak in the density of segment means to logR == 0 (diploid centering). Then to convert the logR values of the segments into copy number calls the Rae algorithm was used (ref 28), which computes a sample dependent soft threshold (sigmoid function) for each Tumor/Normal pair based on the noise of that pair. This gives a value of 0-1 for both amplifications and deletions, which roughly indicates fractional amount of each alteration. These values were then average over all samples to give the fraction of each region that was amplified or deleted in a given set of samples.

To look for regions of copy number recurrence we focused on the Kras model since this had the largest number of samples. However due to the low sample number (N=38) and more importantly the fact that many of the tumors share the same matched normal sample (tumors from the same mouse) computing significantly recurrent regions with standard methods does not work. Instead we used a simple thresholding scheme looking for contiguous regions that had a fractionally recurrence of 20%. To eliminate a large number of potential CNV’s we actually filter on difference between the fraction of the major event minus the minor event. For example, if a region was amplified in 25% and deleted in 7% of the samples then the corrected value of 25%-7% = 18% was compared to the threshold in this case not passing. This simple mechanism removed the majority of likely CNV’s. The remaining events are in Supplemental Table S5. The events where curated and annotated for other potential artifacts. Additional for each region we have indicated which samples have that event in Supplemental Table S6.

Standardization of variant identification using inbred murine strains

To generate an orthogonal set of validated events for testing we used a combination of three sources. The inbred mouse libraries were run on the Affymetrix Murine SNP arrays. From this array data we identified 2,996 true positive events and 10,966 true negative events. We noted significant overlap and non-overlap of these variants in all mixed and un-mixed libraries. Since the precision at which we could measure the false positive rate is directly proportional to the number of true negatives we expanded this set by using data from both the Sanger Mouse Genome project and the Jackson Laboratory SNP database. We took positions that were covered at 30x and homozygous for MM9 for both B6 and 129 and removed any event that was marked as homozygous mutated in the Jackson database. This resulted in 438,622 true negative sites.

Using this set of true and false positive we then used the dilutions libraries to compute the performance of both callers as a function of mixing ratio (effective tumor purity). Figure S1C and S1D show the performance of each caller individually and the combined version which takes the intersection of the two. muTect has significantly higher sensitivity especially as the tumor purity is decreased. However this increased sensitivity comes at the price of a higher false positive rate.

[1] http://hannonlab.cshl.edu/fastx_toolkit/

[2] Ewing AD, Houlahan KE, et. al; Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods. (2015) doi: 10.1038/nmeth.3407.

[3] We downloaded this database from http://cgd.jax.org/cgdsnpdb/ but at the time of writing this link now appears to be broken. We will distribute this file if permissible. The file at the current link is a more uptodate version of this database but is not exactly the same as the one used here.

[4] http://bedtools.readthedocs.org/en/latest/

[5] The file used is available in the hajava repository and in the data folder 110624_MM9_exome_L2R_D02_EZ_HX1___MERGE_SRTChr__FEATLabels.bed