Andrew T. Magis1,2, Cory C. Funk1, and Nathan D. Price1,2

SNAPR: a bioinformatics pipeline for efficient and accurate RNA-seq alignment and analysis

Andrew T. Magis1,2, Cory C. Funk1, and Nathan D. Price1,2§

1Institute for Systems Biology, Seattle, WA

2Center for Biophysics and Computational Biology, University of Illinois, Urbana, IL

Supplemental Section

SNAPR Alignment

SNAPR incorporates the annotation directly into the alignment process to improve alignment accuracy. Each mate (one for single-end, two for paired-end) is aligned to both the genome and the transcriptome independently, creating a set of unique putative alignment positions for mate(s) to both indices. The final alignment position and subsequent mapping score is determined based on criteria described below. Aligning to both the transcriptome and the genome simultaneously serves a dual purpose: it ensures that all possible alignment positions for a mate are considered, and it allows paired end reads to cross transcriptome/genome boundaries. While an annotation provides critical prior knowledge about the likelihood of alignment positions, gene boundaries are often truncated at the 5’ and 3’ ends. SNAPR can also align paired end reads for which one mate occurs in an intron and the other in an exon, a situation resulting from unannotated exons or sequenced pre-mRNA. The current version of SNAPR is designed for genomes with curated annotations, and does not attempt to find novel splice junctions.

SNAPR Error Model

The mapping quality (MAPQ) field in the SAM file format specification is defined as -10log10X where X is the probability that the mapping position is incorrect. This value is rounded off to the nearest integer in the SAM file. Accurately calculating MAPQ necessitates the identification of all alternative alignment positions for a read to a given assembly. Genomic read aligners such as BWA and novoalign estimate this value by finding as many alternative alignments positions as possible in a reasonable amount of time while reporting the highest quality alignment. Other factors such as repetitiveness of the genome and base qualities are also taken into account. As reported in the companion paper, SNAP effectively calculates the MAPQ value for genomic alignments such that Area Under the Receiver Operator Characteristic (AUROC) is among the highest of all aligners tested. This accuracy is enabled by SNAP’s ability to rapidly find alternative alignments using the hash index. Transcriptomic alignments present greater difficulty in estimating MAPQ, due to the fact that reads can be spliced in many different ways and therefore accurately estimating alternative alignment positions is difficult. Tophat, for example, only reports five values for MAPQ in the output SAM file: 0, 1, 2, 3 and 50. If two alignments are identified for a read, the probability of any one of the alignments being correct is assumed to be 0.5, irrespective of mismatches, yielding a MAPQ value of -10log100.5 = 3, rounded off to the nearest integer. A MAPQ value of 50 is reported if an alignment is unique. As such, the MAPQ value is not particularly useful in determining whether a given read is accurately aligned. Ultimately, most end users are interested in mapping uniqueness, but alignment uniqueness is not well defined when one allows for mismatches between reads and their target. Without allowing for mismatches, any read containing a sequencing error or derived from a transcript containing any form of a single nucleotide variant (SNV) will be discarded as an imperfect match. Alternatively, allowing for N mismatches in a read of length N enables the alignment of any read to any position, regardless of sequence. Aligners must therefore strike a balance between these two extremes, allowing for a reasonable number of mismatches that preserves the ability to identify variation while constraining alignments enough to yield meaningful information. Compounding the problem is the repetitiveness of the target genome, which leads to high sequence similarity in distal regions, potentially confounding even conservative alignment algorithms. SNAPR makes use of an error model in which the uniqueness of a particular alignment is determined through a confidence threshold that is defined by the user. All alignments are identified within a user-specified edit distance (-d) between the read and the reference, and reverse sorted by the number of mismatches. If the top-scoring alignment has at least n fewer mismatches than the second-best scoring alignment, where n is a confidence interval specified by the user (-c), the alignment is considered unique and is reported with the SNAP-computed MAPQ value. Otherwise, the alignment is considered non-unique and is reported with a MAPQ value of 1. All analyses in this paper were generated using the default confidence interval of 2.

Alignment Classification

A major feature of the SNAPR pipeline is the manner in which the annotation is incorporated into the categorization of putative alignments. All valid alignments for each read are compared to the list of annotated genes. If both mates align within the boundaries of an annotated gene, that alignment is categorized as an intra-gene alignment. Putative alignments that are not intra-gene but occur on the same chromosome are categorized as intra-chromosomal. Finally, putative alignments that cross chromosome boundaries are categorized as inter-chromosomal. Each read may have valid alignments that occur in one or more of these categories. SNAPR prioritizes alignments in the following order: intra-gene, intra-chromosomal, and inter-chromosomal. If a read generates an alignment that is categorized as intra-gene within the user-specified error model, no alignments in any of the other categories are considered valid, even if those alignments contain fewer mismatches with the reference genome. Similarly, if a read generates no valid alignment in the intra-gene category, but a valid alignment in the intra-chromosomal category exists, no alignments are considered from the inter-chromosomal category. This categorization scheme is designed to leverage the likelihood of alignments occurring within annotated genes while biasing the algorithm against inter-chromosomal alignments. As a result, any intra- or inter-chromosomal alignments that do pass through these filters are subsequently more likely to result from real biological events.

Preparation of genome and transcriptome indices

SNAPR indices were generated using the GRCh37 genome assembly and the Ensembl v68 human genome annotation using the following commands:

snapr index Homo_sapiens.GRCh37.68.genome.fa genome20 –t16

snapr transcriptome Homo_sapiens.GRCh37.68.gtf Homo_sapiens.GRCh37.68.genome.fa transcriptome20 –t16 –O1000

Preparation of contamination database

All sequenced viral genomes (1376 genomes), all sequenced fungal genomes (35 genomes), and all sequenced bacterial genomes (2646 genomes) were downloaded from the NCBI FTP site (ftp.ncbi.nih.gov/genomes). A custom Python script was used to select one bacterial genome from each genus to be part of the contamination database, for a total of 1145 genomes. If all bacterial genomes were included the size of the contaminant database would exceed 232 bases, the current limit on genome size for SNAPR. Note that we do not claim this method to be the most effective way to identify bacterial contaminants. A more targeted approach involving common or expected contaminants would be the most effective, and SNAPR allows the user to specify any contaminant database they create. Our default contaminant database is freely available on our Amazon EC2 snapshot.

The genome FASTA files described above were concatenated into a single file. All spaces in the header sequences for each genome were replaced by underscores so they would be part of the SNAPR index rather than truncated. The contamination database was built using the following command:

snapr index contamination_db.fa contamination20 –t16

Sample analysis

Each sample in BAM format was downloaded directly to an Amazon EC2 cr1.8xlarge instance using CGHub GeneTorrent v3.8.5 and the sample analysis_id. Analysis_ids were obtained as described for each cancer type below. SNAPR genome and transcriptome indices were copied to local (ephemeral) storage from an EBS volume. Each sample was processed using the following command (replacing ${UUID} with the analysis_id of the sample):

snapr paired /mnt/snap/genome20 /mnt/snap/transcriptome20 /mnt/snap/Homo_sapiens.GRCh37.68.gtf *.bam -o ${UUID}.snap.bam -t 16 -M -c 2 -d 4 -rg ${UUID} -I -ct /mnt/snap/contamination20

All output files were stored back to a shared EBS volume for downstream processing. No external software packages were used in any analysis step, although some custom Python scripts were used to compile the output for this study. All scripts are provided in our Amazon EC2 snapshot.

Generation of Mason Reads

The complete Ensembl v68 transcriptome was generated from the GRCh37 human genome assembly using a custom Python script. For each transcript in the annotation, a spliced FASTQ sequence was written to the output file transcriptome.fa

The VCF file containing the Venter genome was converted into transcriptome coordinates, such that if a variant overlapped with one or more annotated transcripts, one variant per transcript appears in the output VCF file. In this manner all variants were included in annotated transcripts. For example, the following heterozygous variant occurs in exon 4 of the gene EGFR (Ensembl ID: ENSG00000146648) in the Venter VCF file:

CHR / POS / ID / REF / ALT / QUAL / FILTER / INFO / FORMAT / GT
7 / 55214348 / . / C / T / 200 / PASS / . / GT / 1|0

There are eight Ensembl v68 EGFR transcripts that intersect this position, but only seven of them are in exons, so seven variants appear in the output file in transcriptome coordinates. POS fields now reflect the spliced distance from the beginning of the transcript.

CHR / POS / ID / REF / ALT / QUAL / FILTER / INFO / FORMAT / GT
ENST00000442591 / 634 / . / C / T / 200 / PASS / . / GT / 1|0
ENST00000275493 / 651 / . / C / T / 200 / PASS / . / GT / 1|0
ENST00000342916 / 720 / . / C / T / 200 / PASS / . / GT / 1|0
ENST00000344576 / 719 / . / C / T / 200 / PASS / . / GT / 1|0
ENST00000420316 / 718 / . / C / T / 200 / PASS / . / GT / 1|0
ENST00000450046 / 622 / . / C / T / 200 / PASS / . / GT / 1|0
ENST00000454757 / 498 / . / C / T / 200 / PASS / . / GT / 1|0

Reads were generated using Mason with the following lines of code, using the FASTQ file of the entire Ensembl v68 transcriptome as the input ‘genome’:

mason illumina –vcf venter.ensembl.vcf -N 20000000 --mate-pairs -n 100 -sq --output-file mason.venter.100.transcriptome.fastq transcriptome.fa

Genomic reads were generated from the standard Venter VCF coordinates and the GRCh37 genome assembly using the following command:

mason illumina -vcf venter.vcf -N 5000000 --mate-pairs -n 100 -sq --output-file mason.venter.100.genome.fastq Homo_sapiens.GRCh37.68.genome.fa

Pseudogene reads were generated in the same manner as ordinary transcriptome reads, except only sequences from the Ensembl v68 annotation were used that had a ‘source’ field of one of the following:

IG_C_pseudogene

IG_J_pseudogene

IG_V_pseudogene

polymorphic_pseudogene

processed_pseudogene

pseudogene

transcribed_processed_pseudogene

transcribed_unprocessed_pseudogene

TR_J_pseudogene

TR_V_pseudogene

unitary_pseudogene

unprocessed_pseudogene

Alignment with STAR, Tophat2/Bowtie2, and Subjunc

STAR

STAR v2.3.0e was used for all analyses. The STAR index was built using the GRCh37 human genome assembly and Ensembl v68 annotation using the following command, using an overhang of 99 as indicated by the STAR manual for optimal alignments of 100 bp reads:

STAR --runMode genomeGenerate --genomeDir star_db/GRCh37.overhang99 --genomeFastaFiles Homo_sapiens.GRCh37.68.genome.fa --sjdbGTFfile Homo_sapiens.GRCh37.68.gtf --sjdbOverhang 99 --runThreadN 16

STAR alignments were performed using the following command, replacing ${FILE1} and ${FILE} with the proper filenames and ${PREFIX} with the output filename prefix:

STAR --genomeDir star_db/GRCh37.overhang99 --readFilesIn ${FILE1} ${FILE2} --runThreadN 16 --outFilterMismatchNmax 8 --outFileNamePrefix ${PREFIX}.star. --outFilterMultimapNmax 10

Subjunc

Subread/Subjunc v1.3.6-p1 was used for all analyses. The Subread index was built using the GRCh37 human genome assembly using the following command:

subread-buildindex -o GRCh37 Homo_sapiens.GRCh37.68.genome.fa

Reads were aligned to this index with the following command, replacing ${FILE1} and ${FILE} with the proper filenames and ${PREFIX} with the output filename prefix:

subjunc -i GRCh37 -r ${FILE1} –R ${FILE1} -o ${PREFIX}.sam –T 16

Tophat2

Bowtie2 v2.1.0 and Tophat2 v2.0.9 were used for all analyses. The Bowtie2 index was built for the GRCh37 genome assembly using the following command:

bowtie2-build –f Homo_sapiens.GRCh37.68.genome.fa GRCh37.genome

Alignments with Tophat2 were performed using the following command (using the most sensitive search option and not considering novel splice junctions), replacing ${FILE1} and ${FILE} with the proper filenames and ${PREFIX} with the output filename prefix:

tophat --b2-very-sensitive --output-dir ${PREFIX} -p 16 -r 100 --mate-std-dev 50 -g 10 -N 4 --read-edit-dist 4 --library-type fr-unstranded --no-novel-juncs -G Homo_sapiens.GRCh37.68.gtf --transcriptome-index GRCh37.transcriptome GRCh37.genome ${FILE1} ${FILE2}

Sorting and BAM conversion (for STAR, Subjunc, and Tophat2)

SAMtools 0.1.19 was used for sorting and BAM conversion.

samtools view -uST Homo_sapiens.GRCh37.68.genome.fa ${PREFIX}.sam | samtools sort - ${PREFIX}

Read group assignment with Picard (not counted in running times)

SNAPR can automatically assign read groups to the output BAM file, as can Tophat, but STAR and Subjunc do not have this functionality. It should be noted that although read group assignment was not included in the running time for the paper, this step is still necessary for many downstream analyses.

java -jar AddOrReplaceReadGroups.jar I=${PREFIX}.bam O=${PREFIX}.RG.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=Illumina RGLB=Library RGPU=Platform RGID=1 RGSM=${PREFIX} VALIDATION_STRINGENCY=LENIENT

Mapping Quality Reassignment (not counted in running times)

STAR uses MAPQ=255 to refer to ‘uniquely’ mapped reads, similar to old versions of Tophat. The standard interpretation of MAPQ=255 is ‘unknown’ mapping quality, and GATK will ignore these reads by default. It is therefore necessary to reassign these mapping qualities. It should be noted that although reassignment of mapping qualities was not included in the running time for the paper, this step is still necessary for many downstream analyses.

java -jar GenomeAnalysisTK.jar -T PrintReads -rf ReassignOneMappingQuality -RMQT 60 -RMQF 255 -I ${PREFIX}.RG.bam -o ${PREFIX}.RG.MAPQ.bam

Variant Calling with Genome Analysis Toolkit UnifiedGenotyper

The reader may wonder why the UnifiedGenotyper was used instead of the HaplotypeCaller. At the time of this study, the HaplotypeCaller does not support the ‘N’ CIGAR operator which indicates splicing, and therefore cannot be used for RNA-seq data:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens.GRCh37.68.genome.fa -T UnifiedGenotyper -I ${PREFIX}.bam -o ${PREFIX}.vcf -stand_call_conf 30.0 -stand_emit_conf 30.0 -nt 16