Comparative genomics methods for the prediction of small RNA binding sites

Rym Kachouri-Lafond1 and Mihaela Zavolan1

1Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Klingelbergstrasse 50-70, CH-4056 Basel, Switzerland.

One page summary:

Graphic overview

This chapter presents a comparative genomics approach to the identification of binding sites for regulatory RNAs.

1. Abstract

In the recent years new classes as well as new members of already known classes of small regulatory RNAs have been discovered. Examples are small nucleolar RNAs (snoRNAs) that act in ribonucleoprotein complexes, in which they guide modification or splicing of target transcripts, and microRNAs (miRNAs) that modulate the turnover rate and protein output of mRNAs. Although small RNAs recognize their targets through RNA-RNA hybridization, this interaction is constrained by protein factors within the ribonucleoprotein complexes, thus making target prediction challenging.

Genomic regions that are functionally important for the organism, be they protein-coding genes, non-coding RNA genes or regulatory sites, are conserved over longer evolutionary distances compared to regions that do not carry functional elements. As whole-genome sequencing became available, comparison of genomic sequences of different species has become instrumental to the prediction and identification of functional elements. Here we describe comparative genomics approaches that we applied in order to discover binding sites of small regulatory RNAs (summarized in the graphic overview at the beginning of the chapter). In these approaches, we first define a model that describes the interaction between the regulatory RNA and its target. Then we identify the putative binding sites of the regulatory RNA genome- or transcriptome-wide in any given species of interest. We then examine the orthologous genomic regions from other species to determine whether the site is conserved. In parallel, we apply this procedure to randomized variants of the regulatory RNAs that have similar properties (e.g. nucleotide composition). To the extent to which the number of putative sites that are conserved across species is larger for a real regulatory RNA compared to its randomized variants, we can infer that i) the RNA-RNA interaction model that we have defined is appropriate for describing productive interactions of the small RNA of interest, and that ii) we can distinguish functional, meaningful targets of the small RNAs from similar sequences in the reference genome.

2. Theoretical background

Initial efforts to clone and sequence small RNAs revealed numerous such molecules. They are encoded in the genome, from which they are transcribed, processed, and incorporated into various ribonucleoprotein (RNP) complexes [1-7]. Deep sequencing studies revealed in fact that much of the genome is transcribed, generating numerous types of non-protein-coding RNAs. Moreover, even molecules that have been extensively studied such as the snoRNAs appear to give rise to processing products that may have acquired novel functions [8].

2.1. SnoRNAs

SnoRNAs (small nucleolar RNAs) are relatively short RNAs that accumulate either in the nucleolus or the Cajal bodies (for reviews see [9-11]). When located in the Cajal bodies they are called small Cajal bodies RNAs or scaRNAs. Based on their structural features, two major classes of snoRNAs have been defined: the C/D box and the H/ACA box snoRNAs (Figure 1). They associate with specific sets of proteins, distinct for each snoRNA class, to form snoRNP complexes (reviewed in [12]). For instance, C/D box snoRNPs contain the protein fibrillarin (Nop1p in yeast), which is thought to be the methyltransferase component of C/D snoRNPs [12]. A third structure-defined class, found only among the scaRNAs, contains both the C/D and the H/ACA) domains. SnoRNA size ranges in human from 50 to 235 nucleotides for C/D snoRNAs, 120 to 250 for H/ACA snoRNAs, and 80 to 550 for scaRNAs (see also the database of [13]). In vertebrates, most of the snoRNAs are cotranscribed as intronic sequences in host precursor mRNAs, and only few are independently transcribed by RNA polymerase II (reviewed in[9, 14]).

Most of the snoRNAs base-pair with ribosomal RNA sequences (rRNAs), guiding either their methylation (C/D snoRNAs) or pseudouridylation (H/ACA snoRNAs). The U3 C/D box snoRNA guides instead pre-rRNA cleavage. Ribosomal rRNAs are not the only substrates of snoRNA-guided modifications. Other substrates of snoRNA-guided methylation and/or pseudouridylation include the small nuclear RNAs (snRNAs) U1, U2, U4, U5, U6 and U12 in vertebrates (summarized in database [13]), U2 and U5 in plants (database introduced in [16]), U2 snRNA in yeast ([17] and [18]). In addition, tRNAs in archae are processed with the help of snoRNAs [19, 20], which have been termed in this context sRNAs (sno-like RNAs).

Although hundreds of sequences that have the hallmarks of snoRNAs have been cloned and sequenced, for many of them no apparent target could be readily identified. For this reason, these snoRNAs were called “orphan” snoRNAs [1, 21-27]. Finding their targets is of great interest, particularly because the antisense boxes, which in prototypical snoRNAs are involved in rRNA recognition, are conserved in many of these snoRNAs, suggesting that they have a biological function.

It has recently been determined that among these “orphan” snoRNAs, the HBII-52 (also called SNORD115) targets in fact an mRNA, namely the mRNA encoding the serotonin receptor 2C [21]. The effect of the snoRNA-mRNA interaction has been reported to be either alternative splicing [28] or editing [29]. Together with the HBII-85 C/D box snoRNA (also called SNORD116), which also occurs in multiple copies in the genome, HBII-52 is part of a region of chromosome 15 that appears to be deleted in the Prader-Willi syndrome (PWS) [21, 22, 28, 30]. Consistent with the neurological phenotype of this genetic syndrome, the two families of snoRNAs have brain-specific expression. Moreover, they are both expressed from the paternal chromosome, whereas the corresponding allelic region in the maternal chromosome is silenced due to genomic imprinting [31].

The first predictions of snoRNA-rRNA interactions were part of a reverse-engineering approach for snoRNA identification in the yeast genome that started from known modified rRNA nucleotides [32, 33]. Following the study of Cavaille and Bachellerie [34] it was generally assumed that rRNA-snoRNA interactions require 10 to 21 nucleotides of complementarity (Watson-Crick or G-U) between the antisense box of the snoRNA and the rRNA target. The rRNA nucleotide that pairs with the 5th snoRNA nucleotide upstream of the D (or D’) box undergoes methylation.

Following the discovery of “orphan” snoRNAs and the reports that they may target mRNAs, some attempts have been made to predict such mRNA targets. snoTARGET [35] is a tool that was developed for this purpose. The criteria defining the putative targets sites were taken from previous studies that analyzed snoRNA/target interactions (rRNAs and snRNAs) [36, 37]. That is, the duplex length was required to be between 9 and 20 nucleotides, a maximum of three G-U pairs and a single mismatch, located either at position 2 or at a position >11 in the antisense box were allowed. The positions that appear to tolerate mismatches were inferred in a study of Chen et al. [37], in which more than 400 known rRNA and C/D snoRNA complementary sequences were analyzed. The first position was defined to be the first nucleotide after the D or D’ box and it did not participate in base-pairing interactions. The application of snoTARGET allowed the authors to infer that predicted target sites of HBII-85 snoRNAs are enriched near exons and preferentially in alternatively spliced genes, supporting a proposed role of these snoRNAs in alternative splicing.

2.2. MiRNAs

MiRNAs are genome-encoded 21-23 nucleotides-long RNAs. They are transcribed by polymerase II [38] either from independent genes or as part of the introns of protein-coding transcripts. Within the primary transcripts, the miRNAs are part of hairpin structures that are recognized and processed by the Drosha/DGCR8 complex to release individual pre-miRNAs [39-42], which are then transported by exportin 5 [43] to the cytoplasm. Here, the Dicer/TRBP complex releases from the pre-miRNAs the 21-23 nt-long duplexes, from which usually only one strand is incorporated into an miRNA-induced silencing complex (miRISC). This complex, in which the miRNA acts as a guide, contains additionally an Argonaute protein, and binds to target mRNAs to induce translational repression, deadenylation and degradation of the mRNA target [44, 45]. Structural studies revealed that the miRNA 5' end is anchored in the MID domain of the Argonaute protein, and that the 5' half of the miRNA is accessible and in a relatively rigid conformation for target recognition. This explains numerous previous observations of the 5' end of the miRNAs being most important for target recognition [46-51].

Although the principles of miRNA-target recognition are only partially understood, many methods for predicting targets have already been proposed. The variables that are taken into consideration range from evolutionary conservation [50, 52-55], the sequence composition of the environment of the putative target site, its location within the 3’UTR, the base-pairing pattern in the 3' region of the miRNA [56], the structural accessibility of the target site and the energy of interaction between miRNA and target [57, 58]. High-throughput measurements of mRNA [56, 59-62] or protein [61, 62] levels upon miRNA transfection enable one to evaluate the performance of prediction methods. Recent studies [63, 64] indicate that for miRNAs, comparative genomics-based target prediction methods with very few assumptions perform remarkably well in comparison to methods that aim to capture mechanistic information. The aim of comparative genomic analyses is to identify regions that are conserved in evolution, indicating that they are under selective constraint. For instance, Xie et al. [65] and Lewis et al. [50] predicted targets of miRNAs by identifying miRNA-complementary 3'UTR segments that were conserved among a chosen set of species. To be able to handle the continuously growing whole genome sequence data, various authors proposed methods for quantifying the degree of conservation of putative target sites [54, 55] or the probability that a putative target site has been under selection [53].

3. Protocol

In order to predict targets of non-coding RNAs, be they snoRNAs or miRNAs, we first need to define a model of interaction between the small RNA and the target. Because only one example of a snoRNA-mRNA interaction is known [28], in predicting mRNA targets of orphan snoRNAs we guided ourselves by principles of snoRNA-rRNA interaction. Cavaille and Bachelerie [34] showed that snoRNA-rRNA duplexes are usually 10-21 bp, with an average of 12-13 bp. A more recent study that included 415 rRNA and C/D snoRNA complementary sequences from animals, plants and yeasts showed that the constraint on duplex length can be further relaxes to 7-24 bp [37]. Thermodynamic stability of the duplex appears to play an important role, because short duplexes are always GC-rich, whereas long duplexes appear to tolerate G-U wobble and non-canonical base pairs, particularly when the duplex is GC rich [34].

We therefore defined a putative snoRNA target site as either a genomic region that has perfect complementarity to at least 10 contiguous nucleotides in the snoRNA antisense box, or a genomic region that can form a stable hybrid (free energy of hybridization lower than -15 kcal/mol) with the antisense box of the snoRNA. We performed pattern matching to identify regions of at least 10 nucleotides perfect complementarity, and we applied RNAhybrid [66] to predict stable hybrids. We imposed additional constraints on the hybrids, as suggested by the results of Cavaille and Bachelerie [34]. Namely, we only allowed a maximum of two bulged nucleotides in the snoRNA and/or the target sequence. The final step of our snoRNA target prediction in human was to extract the orthologous regions in rhesus maccaque, mouse, cow and dog and to determine whether they also contain putative snoRNA target sites as defined above. The final set of predictions included only putative target sites that were conserved in all these species. Because we were interested in the potential involvement of the snoRNAs in alternative splicing, we intersected the set of predictions with the loci of protein-coding genes.

For miRNAs we considered as models of interaction perfect complementarity between the target and the 1-8, 1-7 or 2-8 nucleotides of the miRNA. We further expanded on the relatively simple model described above, developing a Bayesian model to quantify the selection pressure on sites with particular patterns of conservation across species. This enables us to incorporate information from any number of species located at arbitrary evolutionary distances from each other, appropriately weighing conservation between species that are close in evolutionary distance and between species that are farther apart. Given a putative site in the species of interest, we are interested to know the probability that the site has been under evolutionary selection. The genome sequence data however, only enables us to determine whether the site has been conserved in other species, and it is of course more likely that the site is conserved in closely related species relative to more distantly related species. We thus set to estimate the probability that a site with an observed pattern of conservation across species has a pattern of selection in these species. A conservation pattern is defined as a vector of 0’s and 1’s, with 1 meaning that the site is present in a given species and can form the same base pairs with the miRNA as in the reference species, and 0 meaning that any of these two conditions does not hold. Similarly, a selection pattern is defined as a vector of +’s and –‘s, with a ‘+’ denoting that the site is under selection on the branch of the evolutionary tree leading to the given species, and ‘–‘ meaning the opposite. The vector having only ‘–‘ entries corresponds to the situation in which the site has not been under selection in any of the considered species, whatever its conservation pattern is. If we know in which species a site is under selection, we can infer the probabilities for observing any of the possible conservation patterns. These are simply given by That is, we identify all the conservation patterns that are consistent with the chosen selection pattern , and among these we compute the relative probability of conservation pattern . By a conservation pattern that is consistent with a selection pattern we mean that the site is conserved in all the species in which it is under selection, but in the species in which it is not under selection it may or may not be conserved. Conservation in species in which the site is not under selection is a chance occurrence whose probability we can estimate as follows. Ideally we would need 7- and 8-mer sequences that are evolving neutrally, i.e. are under no particular selection. Because in reality we do not know what parts of the transcripts are evolving neutrally, we rather estimate the probability to observe each possible conservation pattern over all possible 7- and 8-mers, only a small fraction of which corresponds to miRNA-complementary 7- or 8-mers. What we are interested in however is not the probability of a conservation pattern given a selection pattern, but rather the probability of a selection pattern given the observed conservation patterns. From Bayes’ theorem, we have that , where is the prior probability of selection pattern . We estimate these priors by maximizing the likelihood of the conservation patterns. That is, the likelihood of the conservation data is given by , where is the number of times conservation pattern has been observed in the data, and , the probability for conservation pattern , is given in terms of a quantity which can be computed as described above, , and the prior probabilities over selection patterns, . The probabilities over selection patterns are the parameters that we need to optimize. The number of these parameters grows exponentially with the number of species, i.e. as , where is the number of species that we take into consideration (including the reference species for which we want to predict sites). It is clear that as the number of fully sequenced genomes that we use in our inference grows, the number of parameters that we would have to estimate would quickly become too large. To circumvent this problem, we used the following approximation. We computed the probability of a selection pattern as a product of the selection patterns at every node in the evolutionary tree. With a further assumption that sites can only be lost in evolution, at each node in the tree we can have one of three situations: the site is under selection only along the left branch, only along the right branch, or in both branches originating at the node. In the branch linking the reference species with the rest of the evolutionary tree we have an additional probability that the site in under selection. Finally, we take into consideration the pattern of conservation of the miRNAs, reasoning that if a miRNA is absent in a species, regions that are complementary to the miRNA in this species cannot be under selection. A complete discussion of this model is presented in [53]. In the end, for each miRNA-complementary site in the reference species we estimate the probability that the site is under selection in any other of the considered species as . Here the vector denotes absence of selection in all the considered species.

4. Example of an experiment

With the methods described above we predicted targets for all miRNAs in human, mouse, rat, fish, fruitfly and worm, and targets of the orphan snoRNA HBII-52 in human mRNAs.

In the case of HBII-52, we obtained 222 predicted target sites located within or at most 200 nt nucleotides away from a known human exon. Many of these predictions were tested experimentally (Stamm group, submitted) by the transfection of neuronal cells (Neuro2A) with MBII-52 (the mouse ortholog of HBII-52). RT-PCRs of the isolated RNA revealed the splicing pattern of each of the candidate gene. As a control, a mutant MBII-52 construct with a scrambled antisense box was used. Based on these experiments, the Stamm group identified five additional targets, whose splicing pattern is affected by MBII-52 (Stamm group, submitted).Returning to the question of evidence of evolutionary selection on MBII-52-complementary sites, we found that applying the same algorithm to randomized sequences with the same dinucleotide composition as the real snoRNAs does not yield larger numbers of conserved putative target sites. This indicates that we have not yet captured the relevant determinants of functional snoRNA-mRNA interactions and that additional work will be needed to uncover these determinants. Alternatively, the number of targets of orphan snoRNAs may be too small to yield a statistical signal when compared to predicted targets of randomized snoRNA sequences.