Splicing regulation: a structural biology perspective

Antoine Cléry1 and Frédéric H.-T. Allain1,2

1 Institute for Molecular Biology and Biophysics, ETH Zürich, CH-8093 Zürich, Switzerland

2 To whom correspondence should be addressed:

Introduction

The spliceosome and its associated proteins form a highly dynamic RNP machine involving a complicated network of RNA-RNA, RNA-protein and protein-protein interactions. Mass spectrometric analyses of affinity-purified spliceosomal complexes indicate that the total number of spliceosome-associated factors is approximately 170 [1]. Among all the proteins involved in splicing, one can distinguish the proteins which are part of the spliceosome (the spliceosomal proteins) from the others, which are referred to as splicing factors. Three chapters have been dedicated to this nuclear macromolecular machinery in humans, yeasts and plants. Here, we focus on the large number of splicing factors involved in the regulation of splicing (also referred to as alternative-splicing). Recent estimates indicate that 80 to 95% of human multi-exon pre-mRNAs are alternatively spliced [2-4]. In higher eukaryotes, the high frequency of alternative-splicing events results from the presence of degenerate 5’ and 3’ splice-sites which fail to recruit the spliceosome efficiently. As a result, additional RNA sequences located in both exons and introns are necessary to stimulate or inhibit splicing. Most of these cis-acting RNA sequences are bound by splicing factors which either promote or prevent recruitment of the splicing machinery. The numerous splicing factors identified to date can be categorized into three main families: the SR proteins (containing serine/arginine-rich sequences), which mostly facilitate splice-site recognition; the hnRNP proteins, which are considered to have a largely antagonistic function; and the tissue-specific splicing factors, which can play both roles (reviewed recently by Chen and Manley [5]). All these alternative-splicing factors contain different types of RNA binding domains (mostly RRMs, KH domains and zinc fingers), often in multiple copies (Fig. 1 and Table 1), and all of them recognize RNA sequence-specifically.
In this chapter, we review the current knowledge on how alternative-splicing factors recognize RNA and proteins at the atomic level. Structural biology contributions have been essential over the last decade in helping to decipher this vast protein-RNA and protein-protein interaction network. Structures have explained how certain cis-acting elements can be discriminated by splicing factors, but also how RNA binding proteins can affect RNA structure. We have organized this chapter by grouping the splicing factors according to the types of RNA binding domains they contain rather than by protein family. We review successively the structures of alternative-splicing factors containing RRMs (the most common RNA binding domains found in splicing factors), zinc fingers and finally KH domains. We describe and compare the different structures and show how the structures of the alternative-splicing factors in complex with their different RNA and protein partners contribute to a better understanding of the mechanism of action of these proteins in splicing regulation.

1. The RRM: a versatile scaffold for interacting with multiple RNA sequences and also proteins

The RNA-recognition motif (RRM), also known as RBD (RNA binding domain) or RNP (ribonucleoprotein domain), is the most abundant RNA-binding domain in higher vertebrates (this motif is present in about 0.5%-1% of human genes) [6]. Over the last ten years, biochemical and structural studies have shown that this domain is involved not only in RNA/DNA recognition but also in protein-protein interaction. Both modes of interaction play crucial roles in splicing regulation.

1.1 RRM-RNA interaction and splicing regulation

An RRM is approximately 90 amino acids long with a typical βαββαβ topology that forms a four-stranded β-sheet packed against two α-helices (Fig. 2A). RRMs are found in almost all types of splicing factor families in a single copy or in multiple copies (Fig. 1). Although the β-sheet is most commonly used to bind single-stranded RNAs, an extreme structural diversity of modes of RRM-nucleic acid recognition has been selected during evolution, making RRMs a very versatile RNA binding platform [7, 8].

Most commonly, three aromatic side-chains belonging to the two signature sequences RNP1 and RNP2, and exposed on the β-sheet surface (Fig. 2A and 2B), accommodate two nucleotides as follows: the bases of the 5’ and of the 3’ nucleotides stack on an aromatic ring located in β1 (position 2 of RNP2) and in β3 (position 5 of RNP1), respectively (Fig. 2A). The third aromatic ring, which is usually located in β3 (position 3 of RNP1), is often inserted between the two sugar rings of the dinucleotide (Fig. 2A). However, deviations from this basic mode of binding are found in many RRM-RNA complexes, due to a role of the N- and C-terminal extensions of the domain, of the interdomain linker in the case of proteins containing multiple RRMs, or of additional protein cofactors that can also modulate the RNA-binding specificity [8]. The structures of several alternative-splicing factors containing one or multiple RRMs have been solved in complex with RNA over the years (Table 1), namely PTB (polypyrimidine-tract binding protein, 4 RRMs), HuD (2 of 3 RRMs), Sex-lethal (2 RRMs), hnRNP A1 (2 RRMs), U2AF65 (2 RRMs), Fox-1 (1 RRM), RBMY (1 RRM), SRp20 (1 RRM) and more recently RRM3 of CUG-BP.

1.1.1 RNA binding by splicing factors containing a single RRM

Splicing factors containing a single RRM are few in comparison with those containing multiple RRMs. With a single RRM, only SRp20, 9G8, SC35, SRp46, SRp54, SRrp86, RNPS1, Tra2α and Tra2β are found among SR and SR-like proteins, hnRNP C1/C2 and G among hnRNP proteins, and Fox-1 and Fox-2 among the tissue-specific splicing factors (Fig. 1). With a single RRM, one would expect these proteins to bind RNA with lower affinity and lower sequence-specificity than multi-RRM proteins. We will see that, although this is true for some (SRp20), it is not always the case (Fox-1). Among these factors, the structures of three single RRMs in complex with RNA have been determined, namely SRp20, Fox-1 and RBMY (a testis-specific protein with more than 80% identity with the RRM of hnRNP G).

The NMR structure of the human SR protein SRp20 in complex with the 5’-CAUC-3’ RNA sequence is still the first and only structure to date of an SR protein in complex with RNA [9]. The structure reveals the presence of an additional aromatic residue (a tryptophan) located on the β-sheet surface (on the β2-strand) that is responsible for the binding of the two most 3’ nucleotides (Fig. 2C). Although four nucleotides are bound, the affinity is rather weak (20 μM) due to the unusual semi sequence-specific mode of RNA recognition by this RRM. Indeed, the structure reveals a binding consensus sequence CNNC (where N can be any nucleotide) which is compatible with the sequence consensus established for this protein by in vitro and in vivo SELEX experiments [10]. This degenerate sequence-specificity of the SRp20 RRM allows the protein to bind more diverse RNA sequences, making the evolutionary pressure on the bound RNA weaker, which is ideal for exonic sequences containing natural SRp20 RNA targets [10]. The weak RNA binding affinity also allows more frequent association and dissociation of SRp20 from the RNA, which is important in the context of the highly dynamic processes involving this protein, which is present from RNA transcription to mRNA export.

The structure of the RRM of human Fox-1 (a tissue-specific alternative-splicing factor) in complex with the 5’-UGCAUGU-3’ RNA presents a radically different mode of binding compared to SRp20 [11]. Although both proteins contain a single RRM, the affinity of the Fox-1 RRM for 5’-UGCAUGU-3’ is extremely high (Kd in the subnanomolar range), reflecting a very high sequence-specificity for the central pentamer GCAUG. To accommodate seven RNA nucleotides on a single domain, the RRM of Fox-1 uses, in addition to the β-sheet surface, several loops joining secondary structure elements (Fig. 2D). In particular, the presence of a phenylalanine in the β1/α1 loop of the Fox-1 RRM is critical for binding RNA, as the first three nucleotides are wrapped around it (Fig. 2D) [11]. Although the mechanism of action of Fox-1 and Fox-2 in splicing regulation is not known, the clear sequence-specificity of the protein allowed a reliable mapping of its binding sites and the identification of strong correlations between the location of the Fox-1 binding site relative to splice-sites and its effect on splicing regulation [12, 13]. Considering the very high affinity of the Fox-1 RRM, one would expect Fox-1, contrary to SRp20, to remain bound to the RNA once the protein has found its target.
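The contrast between the degenerate CNNC consensus of SRp20 and the strict GCAUG core recognized by Fox-1 can be made concrete with a rough back-of-the-envelope calculation. The sketch below is only an illustration: it assumes equal base frequencies and independent positions, which real pre-mRNAs do not strictly obey, and the function name match_probability is hypothetical.

```python
from fractions import Fraction

# IUPAC RNA symbols used in this chapter: N = any nucleotide, Y = pyrimidine.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "Y": "CU", "N": "ACGU"}


def match_probability(motif: str) -> Fraction:
    """Chance that a random site matches the motif.

    Assumption (not from the chapter): the four bases are equally
    frequent and positions are independent.
    """
    probability = Fraction(1)
    for symbol in motif:
        probability *= Fraction(len(IUPAC[symbol]), 4)
    return probability


if __name__ == "__main__":
    for protein, motif in [("SRp20", "CNNC"), ("Fox-1", "GCAUG")]:
        p = match_probability(motif)
        print(f"{protein:6s} {motif:6s} about one site every {1 / p} positions")
```

Under these simplifying assumptions a CNNC site is expected far more often than a GCAUG site, which is consistent with the idea that SRp20 can find targets in many exonic sequences whereas Fox-1 binding sites can be mapped reliably.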

The structure of the single RRM of the human testis-specific RBMY in complex with RNA revealed features in common with both SRp20 and Fox-1 [14]. Considering the high sequence identity between the RRMs of human RBMY and hnRNP G, the structure suggests that hnRNP G can bind CAA motifs sequence-specifically on the β-sheet surface of the RRM (Fig. 2E). However, since hnRNP G and RBMY have a different β2/β3 loop (both in length and sequence), only RBMY has the ability to bind a stem-loop containing a CAA motif in the loop, by insertion of the β2/β3 loop in the major groove of the RNA stem (Fig. 2E). Although only putative targets have been identified for RBMY [14], it is interesting to note that the two tissue-specific splicing factors described here (RBMY and Fox-1) both bind RNA with high affinity and specificity using a single RRM.

1.1.2 RNA binding by splicing factors containing multiple RRMs

Most splicing factors contain multiple RRM copies (Table 1). Structures of the two RRMs of Sex-lethal, U2AF65 and hnRNP A1, of RRM3 of CUG-BP, of RRM1 and RRM2 of HuD and of the four RRMs of PTB have been determined in complex with RNA (Table 1). From these few structures it appears that RRMs belonging to the same protein chain generally bind very similar sequences, although not in an identical manner. This could be at the origin of the very repetitive sequences that have been observed in cis-acting elements regulating splicing [5]. There are of course exceptions to this rule, for example the five SR proteins ASF/SF2, SRp30c, SRp40, SRp55 and SRp75, which all contain two very different RRMs (a canonical RRM and a pseudo-RRM), each harboring a different RNA binding specificity [10, 15].

Recognition of pyrimidine-tracts by Sex-lethal, U2AF65 and PTB

The 3’ splice-site pyrimidine-tract is a major cis-acting element for both constitutive and alternative splicing. Several trans-acting factors have been shown to bind the pyrimidine-tract, resulting in activation (U2AF65) or repression (Sex-lethal and PTB) of splicing. The structures of the RRMs of the three proteins bound to pyrimidine-tracts revealed the nature of the RRM-RNA interaction and the molecular basis of the sequence-specificity of each protein.

The structure of all four RRMs of PTB was solved in complex with short 5’-CUCUCU-3’ pyrimidine-tracts [16]. It was found that each RRM of PTB can bind a short pyrimidine-tract, RRM1 and RRM4 binding three pyrimidines, RRM2 binding four and RRM3 binding five. RRM2 and RRM3 of PTB contain an additional fifth β-strand resulting in an extension of the β-sheet and therefore the binding of additional nucleotides (Fig. 3A) [16]. The structure revealed a similar although not identical sequence specificity for the four RRMs, as RRM1, 2, 3 and 4 specifically recognize YCU, CUNN, YCUNN and YCN sequences, respectively (Y is a pyrimidine and N any nucleotide). The dissociation constant (Kd) of each RRM for a CUCUCU sequence is around 1 μM but increases substantially for polyU sequences, confirming the sequence-specific binding preference for pyrimidine-tracts containing cytosines [17].
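The four consensus sequences quoted above can be turned into a simple pattern search, which makes it easy to see why a CU-tract offers sites to every PTB RRM while a polyU tract offers none. This is a minimal sketch of plain motif matching, not a predictor of binding affinity; the helper names find_sites and scan are hypothetical.

```python
import re

# IUPAC symbols as defined in the text: Y = pyrimidine, N = any nucleotide.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "Y": "[CU]", "N": "[ACGU]"}

# Consensus sequences reported for the four RRMs of PTB.
PTB_MOTIFS = {"RRM1": "YCU", "RRM2": "CUNN", "RRM3": "YCUNN", "RRM4": "YCN"}


def find_sites(motif: str, rna: str) -> list[int]:
    """Return 0-based start positions of all (overlapping) motif matches."""
    pattern = "".join(IUPAC[symbol] for symbol in motif)
    # A lookahead is used so that overlapping sites are all reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", rna)]


def scan(rna: str) -> dict[str, list[int]]:
    """Map each PTB RRM to the positions where its consensus matches."""
    return {rrm: find_sites(motif, rna) for rrm, motif in PTB_MOTIFS.items()}


if __name__ == "__main__":
    for tract in ("CUCUCU", "UUUUUU"):
        print(tract, scan(tract))
```

Because each of the four consensus sequences contains at least one cytosine, the polyU tract yields no match for any RRM, in line with the weaker binding to polyU described above.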

The structure of the pre-mRNA splicing factor U2AF65 in complex with a U-tract revealed a different mode of pyrimidine-tract recognition, although still using the β-sheet surface (Fig. 3B) [18]. This interaction is governed by hydrogen-bonds involving flexible side-chains of conserved U2AF65 residues and by water molecules mediating interactions between U2AF65 side-chains and the uracil bases. The use of flexible side-chains and the possible relocation of bound water molecules could explain how U2AF65 accommodates, at certain positions, the cytosines which are present in most 3’ splice-site pyrimidine-tracts. Like the PTB RRMs, the two RRMs of U2AF65 bind RNA independently, explaining the similarly weak affinity (Kd in the μM range) observed for this splicing factor [19].

These structural data allow a better understanding of how PTB and U2AF65 compete for binding on the 3’ splice-site pyrimidine-tract [20]. U2AF65 preferentially binds uracil-tracts but can adapt to bind any pyrimidine-tract due to its versatile mode of RNA binding, whereas PTB preferentially binds pyrimidine-rich sequences containing CU-tracts. This explains why alternative exons repressed by PTB and containing CU-tracts in the 3’ splice-site can be changed into constitutive exons, and therefore de-repressed, by several C to U changes [21, 22].

Binding of U-tracts by the two RRMs (RRM12) of Sex-lethal is quite different from that of the other two proteins. In the structure of the complex [23], Sex-lethal RRM12 recognizes sequence-specifically each nucleotide of 5’-UGUUUUUUU-3’ except U5, with RRM2 recognizing the 5’ UGU and RRM1 the 3’ UUUUUU sequences. Inter-RRM interactions upon RNA binding and contacts from the short interdomain linker to the RNA contribute to the overall high affinity (Kd in the nM range) of Sex-lethal for the RNA. Comparison between the two structures explains well how Sex-lethal can prevent U2AF65 binding to a U-tract, as observed in the Drosophila tra pre-mRNA [24]. Not only do the Sex-lethal RRMs discriminate uracils from cytosines better than U2AF65, but the two RRMs of Sex-lethal can also bind U-tracts cooperatively, while the two RRMs of U2AF65 cannot.

Although PTB, U2AF65 and Sex-lethal bind pyrimidine-tracts using similar RNA recognition motifs and the same interaction surface (the β-sheet), subtle variations in the side-chain composition on the β-sheet surface have allowed the RRMs of each protein to recognize UCU, YYY and UUU sequences, respectively. Additionally, the RRMs of Sex-lethal evolved to bind pyrimidine-tracts cooperatively, while the RRMs of PTB and U2AF65 appear to bind RNA independently.

Recognition of purine-pyrimidine tracts by CUG-BP and HuD

Several purine-pyrimidine tracts have been found to act as alternative-splicing regulatory cis-acting elements, for example CA-tracts, UG-tracts or CUG-tracts [25]. AU-rich elements were initially characterized for their importance in RNA stability and more recently in alternative-splicing regulation [26]. Several RRM-containing proteins have been identified as trans-acting factors binding these purine-pyrimidine tracts, for example hnRNP L binding CA-tracts, RBM35 and CELF proteins like CUG-BP binding UG-tracts and CUG-tracts [25], or ELAV proteins like HuD binding AU-tracts. The structures of HuD RRM1 and RRM2 bound to AU-rich RNA [27] and more recently of the CUG-BP RRM3 in complex with RNA [28] have been determined and have provided information on how such RNA tracts are recognized by RRMs.

HuD and CUG-BP have a similar domain organization, both proteins containing three RRMs, with the two most N-terminal RRMs (RRM1 and RRM2) being separated by a small interdomain linker (11 and 9 amino-acids, respectively) while the C-terminal RRM3 is found much further away from RRM2 (89 and 113 amino-acids, respectively). The two solved structures therefore provide indications on the RNA binding mode of the numerous splicing factors containing three RRMs (Fig. 1). Considering the high sequence similarity between RRM12 of HuD and RRM12 of Sex-lethal, it is perhaps not surprising that the structures of HuD bound to UAUUUAUUU [27] (Fig. 3C) and of Sex-lethal bound to UGUUUUUUU [23] adopt a very similar conformation. While most of the contacts with the pyrimidines are sequence-specific (Fig. 3C), the protein contacts to the adenines in HuD do not appear to be A-specific, similarly to the contacts to guanines in Sex-lethal [23]. It is therefore unclear how the purines are discriminated by these two RNA binding proteins. In the case of HuD, it was even suggested that adenines destabilize HuD binding [29]. As for Sex-lethal, it is very likely that RRM1 and RRM2 of HuD bind RNA in a cooperative fashion to increase RNA binding affinity and specificity [30].

The structure of RRM3 of CUG-BP1 was recently determined in complex with the hexamer UGUGUG [28]. This NMR structure revealed sequence-specific recognition of the central UGU motif, although all six nucleotides are bound by RRM3 (Fig. 3D). The 12 amino-acids immediately N-terminal to the RRM strongly interact with the surface of the RRM in its free form by running across the β-sheet. This N-terminal extension also contributes to RNA binding by interacting with G4 and U5 (Fig. 3D). This extension partly explains how six nucleotides can be bound by this isolated RRM, although the binding affinity remains modest (Kd = 1.9 μM).
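To put the micromolar dissociation constants quoted in this chapter on a common energetic scale, one can convert each Kd into a standard binding free energy using ΔG° = RT ln(Kd). The sketch below is illustrative only: it assumes simple 1:1 binding, a 1 M standard state and a temperature of 298.15 K, none of which are stated in the chapter, and the function name binding_free_energy is hypothetical. The Kd values are the ones given in the text (about 20 μM for SRp20, about 1 μM for each PTB RRM, 1.9 μM for CUG-BP RRM3).

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1
TEMPERATURE_K = 298.15           # assumed temperature (not from the chapter)

# Dissociation constants quoted in the text, converted to mol/L.
KD_MOLAR = {
    "SRp20 RRM / CAUC": 20e-6,
    "PTB RRM / CUCUCU": 1e-6,
    "CUG-BP RRM3 / UGUGUG": 1.9e-6,
}


def binding_free_energy(kd_molar: float, temperature_k: float = TEMPERATURE_K) -> float:
    """Standard binding free energy in kJ/mol, assuming 1:1 binding
    and a 1 M standard state: dG = R * T * ln(Kd)."""
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar)


if __name__ == "__main__":
    for complex_name, kd in KD_MOLAR.items():
        print(f"{complex_name:22s} Kd = {kd:.1e} M  dG = {binding_free_energy(kd):6.1f} kJ/mol")
```

On this scale a 20-fold difference in Kd, such as that between SRp20 and a PTB RRM, corresponds to only a few kJ/mol, whereas the nanomolar and subnanomolar affinities of Sex-lethal and Fox-1 lie considerably lower in free energy.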

The binding affinity and mode of sequence-specific binding of the two N-terminal RRMs of HuD and of the C-terminal RRM of CUG-BP are quite different. This possibly reflects the different roles played by these two parts in both proteins [30, 31], although this needs to be confirmed by structures of the proteins containing all three RRMs bound to RNA. It also remains to be seen whether, in this context, the three RRMs have the same RNA binding specificity or not.