-

Summary: High throughput RNA sequencing

High throughput RNA sequencing

Paolo Ribeca1, Vincent Lacroix1,2, Michael Sammeth1 and Roderic Guigó1

1Centre de Regulació Genòmica, Universitat Pompeu Fabra, Dr. Aiguader, 88, 08003 Barcelona, Spain

2Université de Lyon, Lyon ; Université Lyon 1 ; Projet BAMBOO, INRIA; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France

Abstract For many years, array-based methods have been successfully used to interrogate the transcriptome of the cell. These methods rely on the quantification of the hybridization intensity of the interrogated RNAs to predesigned probes. Recently emerging methods have adopted massively parallel sequencing technologies (also known as Next Generation Sequencing technologies, NGS) to interrogate the transcriptome. So-called RNA-seq experiments, in contrast to the array based methods, do not rely on hybridization probes, and therefore aim at the exhaustive identification and quantification of all the RNAs present in a sample – without the requirement of knowing their sequences a priori.The continuously increasing throughput of the NGS instruments allows to cope with the large dynamic range of RNA abundances within a cell, and the sequencing depth needed to detect even very rare transcripts should soon be available and affordable.

Bioinformatics challenges, however, arise from the very short length of the sequenced reads, which complicates their assignment to their parent transcript sequences due to the redundancy in alternatively transcribed sequences and sequencing errors. In this chapter we focus on computational methods that address these challenges in order to identify and quantify transcripts, variants and alternative splicing (AS) events from RNA-seq data.

Theoretical Background

On reads, mapping and mappability: An RNA-seq experiment typically produces millions of short sequences called reads. Depending on the sequencing platform employed, reads are derived from arbitrarily one (single reads), or from both ends (paired-ended reads) of a cDNA molecule; some specialized protocols additionally preserve the information about the strandedness of the original transcript. Additional variability usually comes from the different read length provided by each platform (typically 30-100nt for Illumina and SOLiD, and >200 for Roche 454).

In contrast to genome sequencing, reads from an RNA-seq experiment stem from different sequences (RNAs) that usually vary by 5 orders of magnitude in their abundance. Consequently, at the sequencing depth provided by the typical second-generation experiment attempts to assemble the original transcripts by exclusively overlapping reads usually fail, except for a small number of genes having high sequence coverage. Therefore, the first step in analyzing RNA-seq data usually consists in assuming that a reference genome or transcriptome exists, and aligning (or mapping) the reads to it. Clearly, the assumption of a transcriptomic reference is more demanding than the assumption of a reference genome, and its main drawback is that it makes impossible the discovery of novel exons or exon-junctions; however, transcriptome references usually provide the advantage of being shorter, and allow for the direct recovery of reads that do not entirely fall into single exons. On the other hand, when mapping exclusively to a genomic reference reads spanning introns need to be split-mapped: segments of a read are mapped to different genomic locations, with a consequent additional burden put on the alignment step.In general, combined mapping of RNA-seq to the transcriptome and to the genome (see below) can produce the quantification of known transcripts, and possibly identify novel exons and exon junctions.

In general, given the large number of reads and the size of a typical mammalian reference genome the alignment step may be computationally challenging. Although various mapping strategies can be devised, the most efficient mappers to date typically rely on a pre-indexing of the genome or transcriptome to speed up the subsequent mapping; this technique requires a few expensive pre-processing steps per organism (i.e. reference genome, transcriptome, etc.), but is otherwise able to provide an efficient read-wise alignment after pre-processing.

Different mappers adopt different assumptions as to which information should be considered to produce a reliable assignment of the reads. During mapping, the maximum number of allowed mismatches is a crucial parameter; a reasonable value for it depends on read length, sequence-identity between the sample and the reference, and quality of the experiment. In order to better account for sequencing errors, modern mappers often take into account qualities, that is confidence values for each base call or sequence call produced by the sequencing machines (Li et al., 2008).

One aspect which is particularly important is the way mappers deal with reads which, given a chosen set of mapping parameters, align to multiple locations. This situation occurs as an inevitable consequence of the redundancy of the reference sequence: in general, considering a specified read length and a number of allowed mismatches, each position in the genome will exhibit a different mappability, that is, a different number of locations to which a sequence of the given length starting there aligns in the reference. Pre-computed mappabilities of all the positions in a reference genome or transcriptome (Mangan et al., 2009) should always be taken into account when designing an NGS experiment and the subsequent bioinformatics pipeline, to make sure that the reference loci under consideration will be accessible to downstream data analysis (Derrien et al., submitted). Sometimes additional information from the experiment (like paired read mates, strandedness of reads or rescuing strategies as in Cloonan et al., 2009), can be used to reduce mapping ambiguities, but unfortunately there is no guarantee that in general such ambiguity will be avoided altogether. When resolving the origin of a read that maps to multiple locations, basically three strategies can been applied:

  1. the read is disregarded and consequently only unique mappings are considered (which makes the ~25% of the human genome inaccessible when adopting a read length of 36 nt and mapping with 2 mismatches)
  2. the read is replicated and distributed amongst all candidate locations, with a possible weight/probability to account for mapping redundancy
  3. the read is assigned to one of the possible locations: either to the first (with a bias according to the indexing order) or to a random one (with a bias according to the redundancy of a region  Li et al., 2008), by deriving extra information from reads that map uniquely in a certain region (Faulkner et al., 2008; Cloonan et al., 2009), or by considering locally all reads that (multi-)map in a certain region (as it is done in the Flux Capacitor, see below).

Quantification of AS events and read deconvolution: A common downstream analysis of RNA-seq data is to use aligned reads as a digital measure of abundance for genes, transcripts and AS events (for instance the commonly observed alternative inclusion of an exon, cfr. exon skipping); such abundance can then be compared between different conditions. Each AS event naturally consists of variants, that is two or more alternatively included parts (see Sammeth et al. 2008 for a precise defintion). In a straightforward approach, reads that align to regions which are uniquely assignable to one variant of an event are counted, and the statistical confidence of an expression change can be determined by a Fisher exact test as outlined in Fig. 1. However, given possibly short variants (which may comprise not more than one exon junction) and the lack of sequence coverage in lowly expressed genes, not all variants may be identifiable from unique reads. In these cases, reads that fall into exons flanking the event or more generally the reads that fall within exons of all alternative transcripts are usually recruited to detect significant changes in the expression of single variants w.r.t. the basal expression level of the gene.

/ Figure 1: Estimation of significant changes in the abundance of variants. Mapped reads from RNA-seq experiments on two experimental condition (top and middle) are used to quantify the alternative inclusion of an exon (white square) between flanking constitutive exons (green squares). In the first condition (top), 9 reads support the inclusion, and 1 read the exclusion of the alternative exon, in the second condition (middle) 2 inclusion and 4 exclusion reads are observed. A Fisher Exact Test is used to assess the statistical significance of the change in abundances between both conditions (p-value 0.036).

All the aforementioned techniques, however, are prone to errors which stem from statistical fluctuations of read falling within the local region of a variant, and from the fact that different RNA-seq experiments possibly show different read distributions. More advanced techniques aim at stabilizing read counts along the whole transcript; this requires to resolve ambiguities about the origin of reads that fall into regions shared by several transcripts, in a process called read deconvolution. Recently, various deconvolution methods have been proposed, either based on statistical models (Jiang and Wong, 2009), or on optimization frameworks (Bohnert et al. 2009) In general, aside from read coverage and distribution the feasibility of the deconvolution depends highly on the exon-intron structure of an alternatively spliced locus. For instance-under certain theoretical assumptions (i.e., completely uniform read distribution along transcripts, reads overlapping not more than one exon junction and expression of all annotated transcripts), the expression of some alternative transcripts in the GenCode (Harrow et al. 2006) annotation cannot be told apart regardless of the sequencing effort (Lacroix et al. 2008). A currently ongoing workshop (RGASP - RNA-seq Genome Annotation Assessment Project, within the ENCODE (ENCyclopedia Of DNA Elements) project (Birney et al. 2007) aims at the evaluation of the robustness and limitations shown by the methods available so far.

Protocol

The core of this section is devoted to the illustration of one possible strategy to quantify the abundance of transcripts after an RNAseq experiment. By way of introduction, however, we set off discussing how sample preparation can influence the downstream computational analysis.

Experimental protocol.

Different experimental protocols, which vary depending on the purpose of the experiment, have so far been proposed and employed for the preparation of the sample for subsequent RNA sequencing; here, we sketch a typical protocol for sequencing the poly-adenylated (poly-A+) RNA complement of a cell. By Watson-Crick base-pairing with dexoxy-thymidine bound to, for instance, cellulose, poly-A+ RNA is extracted (without additional preparation techniques) from an (ideally homogeneous) population of cells. Either total cellular RNA or RNA residing in particular compartmens or fractions in the cell (cytosol, nucleous, polysomes, etc) is extracted. From the resulting RNA sample, ribosomal RNAs that remain from internal base-pairings can be depleted, since they constitute the majority of the cellular RNA mass and therefore would lead to a high proportion of redundant ribosomal reads. To construct the sequencing library, the remaining RNAs are subject (following an order which depends on how the experiment is designed) to reverse transcription (RT) and fragmentation. The former requirement is due to the fact that, currently, all massively parallel sequencing instruments are able to sequence only DNA molecules; the latter emerges from the necessity of providing proper read coverage along the entire transcript sequence, although only small parts from the ends of the cDNA molecules in the library are sequenced. As mentioned in the Introduction, either one or both ends of the cDNA fragments are sequenced, and in addition some recent protocols are able to preserve the strandedness of the original transcript by adopting nucleotide modifications when one of the strands of the cDNA molecule is synthesised (Parkhomchuk et al. 2009).

Variations in the different protocols of cDNA preparation have been found to be responsible for most of the technical variability (Bullard et al., 2009). To this end, typical primers used for RT are random nucleotide polymers (usually of the technically favorable length of 6nt), or poly-dT primers, or a mix of both. While (depending on the length of transcripts) the use of poly-dT primers can result in 5’incomplete cDNAs, the use of random hexamers (depending on primer concentration) can lead to an over-representation of the 5’end of transcripts. The exploration of the causes of biases can be envisioned through simulations. To this end, and in order to assist experimental design, we have built an RNAseq experiment simulator (the Flux Simulator, see Figure 2).

/ Figure 2: Graphical user interface
of the Flux Simulator (
The simulation pipeline comprises different steps: selection of a reference transcriptome which will be sequenced, in silico gene expression, library construction and simulation of obtained sequencing reads. The histograms in the screenshot show prospective read distributions along transcripts, broken down by different transcript lengths and expression levels. A simulated northern blot summarizes the length distribution of obtained cDNA fragments.

Pre-processing of the reads

Depending on the platform, the results of NGS experiment can be made available in different formats; in particular, the information obtained by second-generation machines usually goes through a complicated series of pre-processing steps (base calling, quality calling, and so on) before becoming available to the end user. Here we assume that the final read data will come in either FASTA (only base calling) or FASTQ (base calling+qualities) format.

A word of caution: not all the FASTQ files are equivalent, since the encoding of the qualities may be different. While in most cases the qualities are represented as ASCII numbers (one per base), the actual encoding may vary (the most frequently used being the Phred one, with numbers ranging from 0 to 40, and the Solexa one, with a wider range). Since the FASTQ format does not contain any indication as to which quality scheme is employed, and this information cannot be uniquely reconstructed from the file itself, it is recommended that the user assumes information about the encoding from the data producers.

Mapping of the reads

A number of tools can be used to map RNAseq reads to a reference genome and/or transcriptome. These tools offer complementary advantages, depending on the goal of the experiment and the interests of the researchers (quantification of known transcripts and genes vs. discovery of novel isoforms, comparison of different conditions, etc.) Here we illustrate the typical steps involved in RNAseq mapping by using the GEM set of mapping tools, which we have developed (cfr. GEM, Genome Multitool, publicly available from We would like to emphasize that although different mapping programs may provide roughly equivalent functionality, such functionality is not exactly the same; in particular, some of the following analysis stages rely on various good properties (exhaustiveness, match counts, quality-driven alignment, possibility of outputting multi-maps, native algorithmic interfacing with the Flux Capacitor) which are not always (all) offered by all mappers.

Reference indexing. As the initial step, the reference to which reads will be mapped is to be indexed once for all subsequent mappings, i.e., a set of archive files which encode the original sequences needs to be created (as required by other programs like BLAST). The generation of the index (which is typically demanding in terms of the required computational time) is a crucial stage of the whole pipeline, because the choice of the reference essentially determines the computational efficiency and the detection limitations of the mapping stage which will be carried out afterwards. In the GEM software package, the index is created with the gem-do-index command; the -o flag (output) specifies the common prefix of the various files each archive is made of; the sequences to be indexed (some examples of possible references can be found in Table 1) are fed to the program as a multi-FASTA file specified by the -i flag (input).

Table 1: Possible reference indices and their application for read mapping. Six possible human reference sequences (A-F) with the sizes of their respective GEM indexes in Megabytes. Indices were composed from the release GRch17 of the human genome reference, and from transcripts, splice- and exon-junctions from the GENCODE annotation v3 (refs), respectively. Sequences for the reference “known exon junctions” (resp. “all exon junctions”) are formed by 32nt to the left and to the right of each annotated splice junctions (resp. of each exon junction formed by each pair of non overlapping exons in a locus). The rationale for choosing 32nt is to make sure that reads of 36nt mapping to a junction have at least 4nt on each side of the junction. These settings have been used in several studies (Mortazavi et al., 2008; Wang et al., 2008). A rule of thumb to select the half-width of a junction is to take L-M-2, with L the read length and M the number of mismatches allowed.

Reference / Index
Size / known
exons / known
junctions / novel
exons / novel
junctions
A known junctions / 267M / - / map / - / -
B transcriptome / 508M / map / map / - / -
C all exon junctions / 3,482M / - / map / - / map1
D genome +
known junctions / 3,382M / map / map / Map / -
E genome / 3,125M / map / split-map / Map / split-map
F genome +
all exon junctions / 6,607M / map / map / Map / map1

1 mapping to a combinatorially exhaustive set of intra-locus junctions from known exons permits to identify novel splice-junctions exclusively between exons that are known in the reference transcriptome and that belong to the same locus.

  1. Read mapping/Split-mapping Once an index for the reference(s) has been produced, the mapping of read file(s) to it can be performed by invoking either the GEM mapper (command gem-mapper), or the GEM 2-exon split mapper (command gem-split-mapper, see Ribeca et al, submitted), or both. Both programs require to specify:the index (parameter -I, input)
  2. a file containing reads in FASTA/FASTQ format (parameter -i, input)
  3. a file which will store the produced mappings (parameter -o, output);

in addition, it is possible to optionally specify further alignment parameters like the maximum number of allowed mismatches, whether qualities should be used or not, etc..