Normalization of the cDNA library substantially enhanced the sensitivity of the transcriptome assembly

To evaluate the normalization efficiency, we first sequenced both the normalized and non-normalized cDNA library using Illumina GA. Approximately 5 million and 6 million 36 nt single end reads from the normalized and non-normalized library (Table 1) were then mapped on the S. mediterranea genome and used to estimate the abundance of the “MAKER” transcripts from SmedGB. As shown in Supplementary Figure 1, the abundance of highly (lowly) expressed genes was reduced (enhanced) after normalization.

Supplementary Figure 1. Efficiency of cDNA library normalization.

The evaluation of normalization efficiency was based on the Illumina single end sequencing results obtained from normalized and non-normalized cDNA libraries. Sequence reads were mapped to the genome reference sequences and used to estimate the abundance of MAKER transcripts in either normalized or non-normalized samples. 15,756 of 30,930 MAKER transcripts were detected both in the normalized and the non-normalized datasets and were binned into 200 groups of equal size based on their similar abundance in non-normalized sample. These groups were sorted by increasing abundance along the X axis representing mean number of reads per kilobase of transcript per million mapped reads (RPKM) (Mortazavi et al. 2008)of each group, whereas Y axis represents the average log2 fold change of the abundance within respective transcript group between normalized and non-normalized samples.

Although we have shown and, in fact, quantified to which degree we could normalize the cDNA library, this does not show how important the normalization was for the assembly per se. We therefore used computer simulations to show that the normalization dramatically increased the probability of sequencing lowly expressed transcripts and is essential to achieve high sensitivity.

The abundance of the BIMSB transcripts represented in our non-normalized and normalized cDNA library of S. mediterranea is estimated based on RPKM value determined by the single-end Illumina sequencing of the two libraries. Out of 18,618 transcripts, 16,734 transcripts with measurable expression in both the non-normalized and normalized library were used for simulation. The number of copies per transcript is the empirical RPKM value multiplied by a scaling factor to end up with a total of 10,000,000 transcript copies in each library. We then simulated the number of sonicated fragments derived from these 10,000,000 transcript copies based a poisson distribution with the lamda equaling to the transcript length divided by the average length of sonicated fragment, which was set to 600 nt here. The sonication breakpoints were sampled at each position of the transcript with equal probability. The resulting fragments with length between 300nt and 800 nt were retained to be sampled with equal probability for sequencing, in which the 454 sequencing reads could start from either end and extend up to 400nt. We incorporate sequencing errors based on the error model described in (Balzer et al.).

We then randomly sample 0.5 million, 1 million, 1.5 million and 1.95 million reads for assembly using Newbler with the default parameter as described before. The assembled reads were finally aligned with BIMSB transcripts to calculate the number of transcripts assembled at a certain minimum percentage of length.

As shown in the Supplementary Figure 2, for example, at a realistic depth of 1.5 million 454 sequencing reads an additional >60% of transcripts can be assembled at almost full (>=95%) length when sequencing the normalized library.

Supplementary Figure 2.. Transcriptome assembly sensitivity gained by cDNA library normalization.

X-axis represents the number of simulated 454 sequencing reads used for assembly (Methods). Y-axis marks the number of transcripts that can be recovered at certain minimum percentage (95%: red, 75%: blue, 50%: black) by sequencing the normalized (solid line) or non-normalized cDNA libraries (dash line).

Full Length cDNA library construction and normalization

First strand cDNA was synthesized using MINT cDNA synthesis kit (Evrogen) according to the manufacturer’s instructions with the following modifications: (a) the reverse transcription oligo-dT primer contained a GsuI restriction site and was biotinylated at its 5’-end

(5'-biotin-AAGCAGTGGTATCAACGCAGAGTCTGGAG(T)16VN-3') and (b) 5-methyl-dCTP was used instead of dCTP in the first strand cDNA synthesis. After removal of unincorporated dNTPs using Illustra MicroSpin S-300 HR Columns (GE Healthcare, Life Sciences), second strand cDNA was obtained using the RNase H-DNA polI-mediated second strand cDNA synthesis instead of PCR. The reaction mix containing 3 μl dNTPs (10 mM each), 10 U of E. coli DNA Ligase, 40 U of E. coli DNA Polymerase and 2 U of E. coli RNase H in a total volume of 150 μl was incubated at 16ºC for 2 hours. The resulting double stranded (ds) cDNA was then purified using the Qiaquick PCR purification kit (Qiagen).

GsuI (Fermentas) digestion was performed overnight at 30ºC. The enzyme was inactivated at 65ºC for 20 minutes and the digested products were purified on M-280 Streptavidin Dynabeads (Invitrogen) following the manufacturer’s instructions. The non-biotinylated ds cDNA lacking poly A:T tails was then ethanol precipitated and resuspended in nuclease free water. Adaptor ligation was performed in a reaction mix containing 10 μl 5X T4 DNA Ligase buffer, 4 µl adaptor (4 μM) (obtained by annealing the 2 following oligos: 5’- AAGCAGTGGTATCAACGCAGAGTT-3’; 5’-CTCTGCGTTGATACCACTGCTT’-3’), 29 μl ds cDNA, and 20 U T4 DNA Ligase (Invitrogen) in a total volume of 50 μl. After incubation at16ºC overnight, the ligation products were then amplified using advantage 2 polymerase mix (Clontech) and primer M1 (5’-AAGCAGTGGTATCAACGCAGAGT-3’) with optimization of the number of PCR cycles to minimize amplification biases (Evrogen manual). Amplified products were purified with Qiaquick PCR purification kit (Qiagen) and quantified by Nanodrop 7500 spectrophotometer to generate the non-normalized full-length cDNA library.

Normalization of the cDNA library was performed using the Trimmer kit (Evrogen) according to the manufacturer’s instructions. For each amplification steps, we used Advantage 2 polymerase mix (Clontech) and determine the minimum necessary number of cycles for amplifying to minimize amplification biases. Normalization efficiency was assessed by single-end cDNA sequencing using Illumina GAIIX.

Comparison of BIMSB assembly with reference sequence-guided transcript assembly

To further understand how well our de novo transcriptome assembly worked, we compared BIMSB transcripts with reference sequence-guided transcript assembly. Briefly, we first mapped all sequencing reads, consisting of 2 lanes of paired-end Solexa sequencing, 2 lanes of single-end Solexa sequencing and 1.5 runs of 454 sequencing with TopHat (Trapnell et al. 2009) using default parameters (version 1.2.0 BETA). Then, we used Cufflinks (Trapnell et al. 2010) to assemble transcripts using default parameters (version 0.9.3 BETA). In total, we obtained 32235 transcripts, with a mean length of 903nt and a maximum length of 18510nt (see supplementary table 1).

To compare BIMSB transcripts and the transcripts assembled by cufflinks (referred here as cufflinks transcripts), we performed mutual blat between the two sets and classified those with >90% sequence similarity along >80% of the shorter transcript length as ‘similar’ transcripts. With the criteria, 17,000 BIMSB transcripts can find counterparts with 24443 cufflinks transcripts. 1619 and 7792 transcripts remain to be BIMSB and Cufflinks specific. The length distribution of the overlapping, BIMSB specific, cufflinks specific transcripts were shown in supplementary table 1. Compared with BIMSB transcripts, Cufflinks set contained much more shorter contigs. The disadvantage of the relatively shorter contigs is apparent when we used Cufflinks transcripts to annotate our shortgun proteomic data (see below).

Finally, we analyzed our massive shotgun proteomic data using the protein database derived from Cufflinks transcripts. At a false discovery rate of 1%, we could identify 41,339 MSMS spectra with a positive match on the 3,782 transcripts, corresponding to 9,495 different peptide sequences. Comparing with BIMSB transcripts, not only fewer Cufflinks transcripts were supported with peptide matches, but also in average the number of peptide matches per transcript is much lower. The poor performance of Cufflinks transcripts is most likely due to their shorter length and therefore more fragmentary nature.

Supplementary table 1, Numbers and Length distribution of all BIMSB transcripts, all Cufflinks transcripts, overlapping BIMSB and cufflinks transcripts, BIMSB and cufflinks specific transcripts.

number / 1stQ (nt) / Median (nt) / Mean (nt) / 3rdQ (nt) / Max (nt)
BIMSB / 18619 / 643 / 1078 / 1228 / 1625 / 14740
CUFFLINKS / 32235 / 370 / 654 / 903 / 1163 / 18510
overlapping
BIMSB / 17000 / 683 / 1123 / 1335 / 1670 / 14740
CUFFLINKS / 24443 / 422 / 766 / 1018 / 1333 / 18510
specific
BIMSB / 1619 / 483 / 631 / 792.7 / 993 / 4077
CUFFLINKS / 7792 / 288 / 450 / 543.7 / 679 / 6271

Inference of gene loci

To evaluate the number of gene loci covered by the de novo assembled transcripts, the set of all candidate transcripts was aligned to the assembled genomic regions of S. mediterranea (Cantarel et al. 2008).

For this task, the spliced alignment tool SPALN (Gotoh 2008) was used with standard settings. Of the 26,669 candidate transcripts, 15,168 transcripts could be aligned to genomic sequence this way. In order to increase the sensitivity of gene loci identification, the sequences that could not be aligned by SPALN were mapped to the genome using BLAT (Kent 2002), eligible for discontinuous alignments with high sequence identity, run with default parameters. Since the assembly of genomic regions is presumably incomplete, a mapped locus was accepted, whenever at least half of the nucleotides of a candidate transcript could be aligned to the genome. The strand of the putative gene was inferred by examining splice junctions between adjacent alignment blocks. Among the junctions between all blocks belonging to the same transcripts the occurrence of the canonical splice junction GT/AG and the less frequent known junctions GC/AG, AT/AC were recorded for the transcript sequence and its reverse complement, respectively. The direction with the higher number of canonical splice junctions was assumed to indicate the strand of the respective gene. If no canonical junction was observed, the same procedure was repeated, first with GCAG and then with ATAC, supposing that the frequency of the former is higher than the one of the latter. If none of the candidate junctions was observed, always the same arbitrary strand was assigned in order to keep the gene locus.

Of the remaining 11,501 candidate transcripts that could not be aligned to the genome by SPALN, 9,481 could be mapped to the genome by BLAT, requiring at least 50% of the sequence of a given transcript to be aligned. Raising this fraction to 80% would decrease the number of valid alignments to 4,560. The incompleteness of the transcript alignments to the genome can be explained by the possible mis-assembly of the transcriptome as well as the state of the genome assembly. Without additional supporting information (e. g. orthology annotation) it is difficult to distinguish the influence of either source of errors (Table 3). Removing redundant transcripts from the set of 25,009 sequences that could be mapped by either SPALN or BLAT yielded 17,546 transcripts.

5’ and 3’ RACE validation

RACE (Rapid Amplification of cDNA Ends) was performed using SMARTer RACE cDNA Amplification Kit (Clontech) according to the manufacturer’s User Manual. Poly (A) RNA extracted from total RNA of planarian whole worms was used as starting material. Gene specific primers were chosen to be located 200-800 bp to either 5’ or 3’end of the assembled transcripts (primer sequence see supplementary table 6). RACE products were purified with Qiaquick PCR purification kit (Qiagen). The purified products were A tailed by Gotaq polymerase (Promega) and cloned using pCR4-TOPO TA Cloning Kit (Invitrogen) following the manufacturer’s instructions. The colonies picked out were screened by colony PCR. Plasmids with positive insertions were extracted by using Zippy Plasmid Miniprep Kit (Zymo Research) following the manufacturer’s instructions and were sequenced using conventional Sanger sequencing with M13 forward or M13 reverse as sequencing primer.

Identification of potential frame shift errors

We used Inparanoid (Remm et al. 2001) to infer orthology relations between the translated BIMSB transcripts and the human proteome (EnsEMBL release 57). We realigned the nucleotide sequence of each transcript and the corresponding human protein sequence with the 'protein2dna' model of Exonerate (Slater and Birney 2005). This model reports frameshifts in the nucleotide sequence, which we used to estimate frequency of frameshift for the whole data set.

Detection of chimera formation / transcripts fusion (over-assembly)

We performed BLASTP searches of the translated BIMSB transcripts versus the human proteome and a self-alignment BLASTP search of the human proteome (Score cutoff: 50 bits). Based on the Planaria-Human BLASTP results, we annotated the BIMSB transcripts with best local non-overlapping similarity regions on human proteins. Subsequently, we considered all BIMSB transcripts with non-overlapping matches to at least two human proteins as candidates for our chimera screen. We then assessed the intra-species similarity of the assigned human proteins for each candidate BIMSB transcript. We predicted a chimera or transcript fusion event if at least one pair of human proteins did not show any similarity to one another.

Detection of transcript fission (under-assembly)

We performed BLASTP searches of the human proteome (EnsEMBL 57) versus the translated BIMSB transcripts and a self-alignment BLASTP search of the translated BIMSB transcripts (Score cutoff: 50 bits). Based on the Human-Planaria BLASTP results we annotated the human proteins with best local non-overlapping similarity regions to BIMSB transcripts. We assigned each transcript to the best matching human protein record. Subsequently, we considered all the human proteins with at least two assigned BIMSB transcripts as candidates in our fission screen. We then assessed the intra-species similarity of the assigned BIMSB transcripts for each candidate. We predicted a fission event if at least one pair of transcripts did not show any similarity to one another.

Manual search in 50 randomly chosen transcripts for frame shift and fusion events

We randomly chose 50 transcripts from 18,619 BIMSB transcripts. Using blastx search with a cutoff e-value of 1e-10, we align the transcripts to protein sequences in the NCBI non redundant protein database (nr Release May 11, 2010)., We considered the transcripts with non-overlapping matches to at least two proteins from the same organism as candidates for fusion transctipts. We considered the transcript to contain frame shift errors if the peptides translated in its different ORFs can be mapped to the different parts of the same protein sequence.