Supplementary notes on Alu-transcript numbers, location, subfamilies and function

Supplementary Document 1:

The transcript repeat element: the human Alu sequence as a component of gene networks influencing disease

Paula Moolhuijzen1*, Jerzy K Kulski1, 2, 3 *, David S Dunn1, David Schibeci1, Roberto Barrero1, Takashi Gojobori4, Matthew Bellgard1

Plant and animal small RNA target genes

Small RNAs have been found in all plant and animal multicellular organisms investigated so far. Plant and animal miRNAs are evolutionarily ancient small RNAs, ∼19–24 nucleotides in length, generated by cleavage from larger, highly structured precursor molecules. In both plants and animals, miRNAs regulate gene expression post-transcriptionally through interactions with their target mRNAs. These miRNAs, excised from endogenously encoded hairpin RNAs, negatively regulate endogenous target genes by cleavage or translational inhibition of their mRNA (Millar and Waterhouse 2005).

To date, plant miRNAs have been found to share much higher complementarity with their target genes (zero to three mismatches) than animal miRNAs do with theirs, although in both cases high to perfect complementarity between the target mRNA and the 5′-half of the miRNA is required (Doench and Sharp 2004, Parizotto et al. 2004).
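The complementarity criterion described above can be illustrated with a minimal sketch that counts mismatches between a miRNA and a candidate target site. The function names and the toy sequence are hypothetical, the two sequences are assumed to be pre-aligned and of equal length, and G:U wobble pairs are counted as mismatches for simplicity.

```python
# Watson-Crick partners for RNA bases (assumption: G:U wobble is not allowed).
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return "".join(PAIR[base] for base in reversed(rna))


def count_mismatches(mirna: str, target_site: str) -> int:
    """Count non-complementary positions between a miRNA and a target site.

    Both sequences are written 5'->3' and must have equal length; they are
    paired antiparallel, so the miRNA 5' end faces the target-site 3' end.
    """
    if len(mirna) != len(target_site):
        raise ValueError("miRNA and target site must have equal length")
    return sum(PAIR[m] != t for m, t in zip(mirna, reversed(target_site)))


def five_prime_half_mismatches(mirna: str, target_site: str) -> int:
    """Count mismatches involving only the 5' half of the miRNA."""
    half = len(mirna) // 2
    return count_mismatches(mirna[:half], target_site[len(target_site) - half:])


# Hypothetical 22-nt miRNA and a site that is perfect except for one base
# opposite the miRNA 3' end.
toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
perfect_site = reverse_complement(toy_mirna)
one_off_site = "C" + perfect_site[1:]

total = count_mismatches(toy_mirna, one_off_site)
in_five_prime_half = five_prime_half_mismatches(toy_mirna, one_off_site)
```

Under this toy scoring, a site with zero to three mismatches in total would fall in the range reported for plant miRNAs, while the 5′-half count captures the requirement shared by plants and animals.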

Current knowledge of animal miRNA-binding sites is limited and biased because their discovery was based primarily on computer predictions using databases composed only of 3′-UTR sequences (Enright et al. 2003, Lewis et al. 2003, Xie et al. 2003). In comparison, plant miRNA-binding sites are found almost exclusively within the open reading frames of the target genes (Millar and Waterhouse 2005, Xie et al. 2003).

Supplementary notes on Alu-cDNA numbers, location, subfamilies and function

Alu-cDNA Functions

The functional categories of the Alu-cDNA collected from the H-Inv human cDNA database were analysed by Gene Ontology (GO), which classifies the known functions of gene products into at least 5,175 categories according to biological processes, cellular components and molecular functions (Ashburner et al. 2000). GO analysis was performed for cDNA with an NCBI gene identifier and CDS to determine the different metabolic and regulatory pathways that are possibly associated with the cDNA-Alu in normal and cancerous tissues. Over 8,000 cDNA-Alu from the H-Inv database were classified into nearly 3,000 of the GO functional categories, and then assigned to 35 GO Slim categories for the Homo sapiens taxon identifier 9606 (http://www.geneontology.org/GO.downloads.shtml#ont). Over 70% of the cDNA-Alu loci grouped into binding and catalytic activity molecular functions (Supplementary Figure 3), and more than 80% were involved in biological processes, such as physiological process, nucleotide metabolism, cell communication and transport. Less than 3% of the cDNA and cDNA-Alu loci were designated “molecular function unknown” or “biological process unknown”. The molecular function, biological process and cellular component of loci unique to tissue traits for the full cDNA and cDNA-Alu subsets showed no cDNA-Alu effect on the GO Slim category proportions. The proportion of loci unique to tissue traits did, however, vary for a few molecular functions between unique normal and disease traits for all cDNA and cDNA-Alu. The molecular functions of both cDNA and cDNA-Alu were associated mainly with binding (40%) and catalytic activity (20%). The number of loci products involved in the function of nucleic acid binding increased nearly two-fold for the disease loci, whereas loci associations with signal transducer activity and transporter activity decreased in the cancerous trait three-fold and two-fold, respectively. These shifts in function were significant by Fisher’s exact test at P<0.0001 (Supplementary Figure 3).
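The kind of comparison reported above can be sketched as a two-sided Fisher's exact test on a 2x2 table of locus counts (category versus trait). This is a minimal pure-Python illustration, not the authors' pipeline; the function name is hypothetical and the counts in the usage lines are invented placeholders, not values from the H-Inv analysis.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    Rows could be, for example, disease and normal loci; columns could be
    loci inside and outside one GO Slim category.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Placeholder counts (assumption): 120 of 1,000 disease loci and 60 of 1,000
# normal loci annotated with a given molecular function.
p_value = fisher_exact_two_sided(120, 880, 60, 940)
is_significant = p_value < 0.0001
```

The test conditions on the row and column totals, which makes it suitable when some category counts are small.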

In the GO analysis of approximately half of the cDNA-Alu loci, there was no difference between cDNA and cDNA-Alu for the biological and cellular processes or molecular function. However, the proportion of categories for function, process and cellular components of the expressed loci changed when the normal and cancerous traits of the tissues were compared, with the numbers in the cDNA and cDNA-Alu subsets increased for nucleic acid binding and decreased for signal transducer and transporter activity in the cancerous trait. In general, the majority of the cDNA isoforms containing full-length Alu in the coding regions of the transcripts translated into hypothetical proteins rather than known proteins. As the function of these cDNA-Alus is not known or is labelled ‘hypothetical’, it is proposed that they are untranslated, destined for rapid degradation, or serve a binding role against excessive retrotranspositional activity within the genome.

Alu Families and Subfamilies

When the Alu elements in the cDNA-Alu transcripts were categorized into their particular families and subfamilies, the contribution of the Alu families in cDNA was similar to the proportion of Alu families within the genome: AluS 54%, AluJ 26%, AluY 10%, Monomeric 8% and Alu 2% (Price et al. 2004) (Supplementary Figure 2B). The most frequent Alu subtype was AluSx. No family bias was detected between the cancerous and normal tissue sets.
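The family tally described above can be sketched as follows. This is a minimal illustration that assumes RepeatMasker-style repeat names such as those listed in Supplementary Table 1; the function names are hypothetical, and grouping the FLAM and FRAM names under "Monomeric" is an assumption about how that category was defined.

```python
from collections import Counter


def alu_family(repeat_name: str) -> str:
    """Map a repeat name (for example AluSx or FLAM_A) to a broad family."""
    # Assumption: free left/right Alu monomers form the "Monomeric" class.
    if repeat_name.startswith(("FLAM", "FRAM")):
        return "Monomeric"
    for family in ("AluS", "AluJ", "AluY"):
        if repeat_name.startswith(family):
            return family
    # Anything else is left in a generic, unassigned "Alu" class.
    return "Alu"


def family_percentages(repeat_names):
    """Return the percentage of repeat annotations in each family."""
    counts = Counter(alu_family(name) for name in repeat_names)
    total = sum(counts.values())
    return {family: 100 * n / total for family, n in counts.items()}


# Toy input using subfamily names that appear in Supplementary Table 1.
example = family_percentages(["AluSx", "AluSq", "AluJb", "FLAM_A"])
```

Running the same tally separately on the normal and cancerous sets would give the two distributions whose similarity is reported here.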

Alu Locations Within the cDNA

The relative position and size of Alu insertions within the cDNA sequences were determined for 17,819 of 17,861 distinct cDNA sequences with an annotated CDS that contained at least one Alu. The Alu positions were partitioned into three groups according to the regions in which they were found: (1) overlapping the coding sequence (CDS), (2) exclusively in the 5’ UTR, or (3) exclusively in the 3’ UTR. Supplementary Figure 1 shows the number of fragments and distinct transcripts in normal or cancerous tissues, and the transcripts with known functions are indicated in red.
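The three-way partition above can be expressed as a small interval test. The sketch below assumes 1-based inclusive coordinates on the transcript; the function name and the example coordinates are hypothetical.

```python
def classify_alu_region(alu_start: int, alu_end: int,
                        cds_start: int, cds_end: int) -> str:
    """Assign an Alu fragment to one of the three groups used in the text.

    All coordinates are 1-based and inclusive on the cDNA sequence.
    """
    if alu_start > alu_end or cds_start > cds_end:
        raise ValueError("start must not exceed end")
    if alu_end < cds_start:
        return "5'UTR"   # lies entirely upstream of the CDS
    if alu_start > cds_end:
        return "3'UTR"   # lies entirely downstream of the CDS
    return "CDS"         # any overlap with the CDS counts as group (1)


# Toy transcript with a CDS at positions 101-700.
groups = [
    classify_alu_region(10, 90, 101, 700),
    classify_alu_region(95, 300, 101, 700),
    classify_alu_region(750, 1040, 101, 700),
]
```

A fragment that straddles a UTR and the CDS is placed in the CDS group, which matches the wording "overlapping" versus "exclusively".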

The transcript Alu content was measured in two ways: as the proportion of transcripts containing an Alu fragment within the CDS, 5’ UTR or 3’ UTR, and as the Alu sequence length (base pairs) relative to the total sequence length represented in these three regions.
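One possible reading of the base-pair measure is the fraction of each region's length that is covered by Alu sequence. The sketch below follows that reading and is only an assumption about the calculation; the function names are hypothetical, coordinates are 1-based inclusive, and Alu intervals are assumed not to overlap one another.

```python
def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def alu_bp_fraction_by_region(transcript_len, cds_start, cds_end, alus):
    """Fraction of the 5' UTR, CDS and 3' UTR covered by Alu sequence.

    alus is a list of (start, end) intervals on the transcript.
    """
    regions = {
        "5'UTR": (1, cds_start - 1),
        "CDS": (cds_start, cds_end),
        "3'UTR": (cds_end + 1, transcript_len),
    }
    fractions = {}
    for name, (r_start, r_end) in regions.items():
        region_len = max(0, r_end - r_start + 1)
        alu_bp = sum(overlap(s, e, r_start, r_end) for s, e in alus)
        # An absent region (length 0) is reported as 0.0 rather than failing.
        fractions[name] = alu_bp / region_len if region_len else 0.0
    return fractions


# Toy transcript of 1,000 bp with a CDS at 101-700 and two Alu fragments.
content = alu_bp_fraction_by_region(1000, 101, 700, [(51, 100), (651, 750)])
```

Summing the Alu base pairs and region lengths over all transcripts before dividing would give the dataset-wide proportion instead of a per-transcript value.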

Alu sequence quantification within the cDNA

More than 28,000 fragmented or full-length Alu elements were found in the 17,000 cDNA-Alu sequences surveyed, an average of 1.6 Alu per sequence. The Alu elements ranged in size from 10 nucleotides to full length. All of these Alu fragments had RepeatMasker (RM) scores within the recommended range, outside the threshold for a false-positive match (http://www.repeatmasker.org/). Supplementary Figure 2A shows the number of Alu fragments plotted as a % of the consensus Alu sequence length for Alu subfamilies, and for transcripts represented in cancerous and normal tissues, with known and unknown (hypothetical) functions. The plots show that the number of cDNA-Alu in normal tissues is clearly higher than in cancerous tissues. In the cDNA-Alu plot, minor peaks represent a slightly increased number of the monomeric form of the Alu structure (110-130 bp), and major peaks above 90% of the full-length Alu consensus sequence represent a greatly increased number of the dimeric form of the Alu structure (>250 bp). This trend of minor and major peaks was similar for the cDNA-Alu in both the normal and diseased tissues.
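The length distribution described above can be sketched by expressing each fragment as a percentage of a consensus length and binning the result. The function names are hypothetical, and the consensus length of 300 bp used in the example is a placeholder assumption rather than the value used for any particular subfamily.

```python
from collections import Counter


def percent_of_consensus(fragment_len: int, consensus_len: int) -> float:
    """Fragment length as a percentage of the consensus, capped at 100."""
    return min(100.0, 100.0 * fragment_len / consensus_len)


def bin_fragments(fragment_lens, consensus_len, bin_width=10):
    """Count fragments per percentage bin, keyed by the bin's lower edge."""
    bins = Counter()
    for length in fragment_lens:
        pct = percent_of_consensus(length, consensus_len)
        # Fold exactly 100% into the top bin instead of opening a new one.
        lower = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        bins[lower] += 1
    return dict(sorted(bins.items()))


# Toy fragment lengths: one short piece, two monomer-sized, two near full length.
histogram = bin_fragments([10, 120, 125, 290, 300], consensus_len=300)

# Average reported in the text, from the rounded totals given there.
mean_alu_per_sequence = 28_000 / 17_000
```

With real data, the monomeric peak would appear in the middle bins and the dimeric peak in the bin above 90%, as described for Supplementary Figure 2A.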

While multiple fragments of Alu elements can overlap the CDS within a single transcript, the largest number of Alu contained in the CDS was four copies, in one particular transcript sequence. Full-length dimeric Alu (>280 bp) fragments overlapping the CDS, which represent half of the transcripts, potentially provide cryptic splice sites.

Alu-cDNA: small non-protein-coding RNA

There are several classes of small non-protein-coding RNA (npcRNA) that play important roles in cellular metabolism including mRNA decoding, RNA processing and mRNA stability. Indeed, altered expression of some of these npcRNAs has been associated with cancer, neurodegenerative diseases such as Alzheimer's disease, as well as various types of mental retardation and psychiatric disorders (Xie et al. 2008).

It has been demonstrated that both free and embedded Alu RNAs play a major role in post-transcriptional regulation of gene expression, for example by affecting protein translation, alternative splicing and mRNA stability (Hasler et al. 2007).

Because the evolution, frequencies and insertion mechanisms of transposable elements differ, there remains further opportunity to investigate other classes of retrotransposable elements.

Adenosine to inosine (A-to-I) RNA editing

DNA methylation and differences in histone modifications are the best-studied epigenetic control mechanisms involved in cancer development (Feinberg and Vogelstein 1983, Feinberg et al. 2006). Recently, RNA silencing mechanisms, such as RNA interference and microRNA regulation, were also linked to cancer (Lu et al. 2005; Feinberg et al. 2006). Adenosine to inosine (A-to-I) RNA editing is a site-specific modification in stem–loop structures within precursor mRNAs, catalyzed by members of the double-stranded RNA (dsRNA)-specific ADAR (adenosine deaminase acting on RNA) family (Bass 2002). ADAR-mediated RNA editing is essential for the normal development of both invertebrates (Palladino et al. 2000) and vertebrates (Higuchi et al. 2000; Wang et al. 2000; Hartner et al. 2004). The splicing and translational machineries recognize inosine (I) as guanosine (G). These A-to-I editing events occur in noncoding repetitive sequences, mostly Alu elements, and tend to undergo multi-editing in tight clusters. In addition, RNA editing was shown to be involved in the regulation of nuclear retention and in miRNA biogenesis (Prasanth et al. 2005; Blow et al. 2006; Yang et al. 2006). Paz et al. (2007) found reduced editing levels at Alu sequences in a human brain tumor. Alterations in editing and alternative splicing of serotonin receptor transcripts were also found to correlate with a decrease in the enzymatic activity of the editing enzyme ADAR2, as deduced from the analysis of ADAR2 self-editing. These results suggest a role for RNA editing in tumor progression (Maas et al. 2001). In the vast majority of edited RNAs, A-to-I substitutions are clustered within transcribed sense or antisense Alu sequences. Edited bases are primarily associated with retained introns, extended UTRs, or with transcripts that have no known corresponding gene. Therefore, Alu-associated RNA editing may be a mechanism for marking nonstandard transcripts to bypass the translational machinery (Kim et al. 2004).
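Because inosine is read as guanosine, candidate A-to-I editing sites appear as A-to-G mismatches when a cDNA is compared with its genomic sequence. The sketch below illustrates that idea and the clustering of sites; it is not a published detection method. The function names, the `max_gap` and `min_size` parameters and the toy sequences are hypothetical, and the two sequences are assumed to be already aligned, gap-free and on the same strand.

```python
def a_to_g_sites(genomic: str, cdna: str):
    """Return 0-based positions where the genome has A and the cDNA has G."""
    if len(genomic) != len(cdna):
        raise ValueError("sequences must be aligned to equal length")
    pairs = zip(genomic.upper(), cdna.upper())
    return [i for i, (g, c) in enumerate(pairs) if g == "A" and c == "G"]


def cluster_sites(sites, max_gap=10, min_size=2):
    """Group sorted positions into clusters of neighbouring candidate sites.

    A new cluster starts when the gap to the previous site exceeds max_gap;
    clusters with fewer than min_size sites are discarded.
    """
    clusters, current = [], []
    for pos in sites:
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Toy aligned sequences: three nearby A-to-G changes and one isolated change.
genomic_seq = "CAGATACA" + "T" * 20 + "GAC"
cdna_seq = "CGGGTGCA" + "T" * 20 + "GGC"

sites = a_to_g_sites(genomic_seq, cdna_seq)
clusters = cluster_sites(sites)
```

In practice, A-to-G mismatches also arise from SNPs and sequencing errors, so clustering within an annotated Alu is only supporting evidence for editing.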


Supplementary Table 1. Functional examination of a subset of transcript isoforms with a full-length Alu spanning the CDS.

accession / gene_id / loci / cancer / gene_id / gene_symbol / location / gene_description / repeat / isoformProtein
AK022432 / GID:950 / HIX0018713 / normal / GID:950 / SCARB2 / 4q21.1 / scavenger receptor class B, member 2 / FLAM_A
AK025390 / GID:25926 / HIX0014090 / normal / GID:25926 / NOL11 / 17q24.2 / nucleolar protein 11 / FLAM_C / hypothetical
AK026942 / GID:411 / HIX0018251 / cancer / GID:411 / ARSB / 5p11-q13 / arylsulfatase B / AluSx / hypothetical
AK024421 / GID:56996 / HIX0006933 / normal / GID:56996 / SLC12A9 / 7q22 / solute carrier family 12 (potassium/chloride transporters), member 9 / AluSx
AK024421 / GID:56996 / HIX0006933 / normal / GID:56996 / SLC12A9 / 7q22 / solute carrier family 12 (potassium/chloride transporters), member 9 / AluJb
AK024494 / GID:56996 / HIX0006933 / normal / GID:56996 / SLC12A9 / 7q22 / solute carrier family 12 (potassium/chloride transporters), member 9 / AluSx
AK024494 / GID:56996 / HIX0006933 / normal / GID:56996 / SLC12A9 / 7q22 / solute carrier family 12 (potassium/chloride transporters), member 9 / AluSx
AK054840 / GID:152641 / HIX0007805 / normal / GID:152641 / FLJ30277 / 4q35.1 / hypothetical protein FLJ30277 / AluSq / hypothetical
AK055885 / GID:1 / HIX0015537 / normal / GID:1 / A1BG / 19q13.4 / alpha-1-B glycoprotein / AluSq
AK091237 / GID:134266 / HIX0005298 / cancer / GID:134266 / GRPEL2 / 5q33.1 / GrpE-like 2, mitochondrial (E. coli) / FLAM_A / hypothetical
AK092255 / GID:169611 / HIX0035002 / cancer / GID:169611 / OLFML2A / 9q33.3 / olfactomedin-like 2A / AluSq / hypothetical
AK094948 / GID:286207 / HIX0008404 / normal / GID:286207 / C9orf117 / 9q34.11 / chromosome 9 open reading frame 117 / AluSx / hypothetical
AK000385 / GID:8635 / HIX0017958 / normal / GID:8635 / RNASET2 / 6q27 / ribonuclease T2 / AluSq / hypoth
AK098245 / GID:5549 / HIX0001490 / normal / GID:5549 / PRELP / 1q32 / proline/arginine-rich end leucine-rich repeat protein / FLAM_C / hypoth
BC011823 / GID:55652 / HIX0010579 / disease / GID:55652 / FLJ20489 / 12q13.11 / hypothetical protein FLJ20489 / AluSx / hypoth
BC010030 / GID:202020 / HIX0004117 / disease / GID:202020 / FLJ39653 / 4p15.32 / hypothetical protein FLJ39653 / AluSg / hypoth
AL162039 / GID:92597 / HIX0004271 / normal / GID:92597 / MOBKL1A / 4q13.3 / MOB1, Mps One Binder kinase activator-like 1A (yeast) / AluSx / hypoth
AF010144 / GID:27308 / HIX0028517 / normal / GID:27308 / AD7C-NTP / 1p36 / neuronal thread protein AD7c-NTP / AluSc / hypoth
AK129645 / GID:83546 / HIX0024449 / normal / GID:83546 / RTBDN / 19p12 / retbindin / AluSx / hypoth
AK124186 / GID:196743 / HIX0009334 / normal / GID:196743 / PAOX / 10q26.3 / polyamine oxidase (exo-N4-amino) / AluSx
AK125252 / GID:2523 / HIX0010735 / normal / GID:2523 / FUT1 / 19q13.3 / fucosyltransferase 1 (galactoside 2-alpha-L-fucosyltransferase, H blood group) / AluSx / hypoth
AK125657 / GID:142679 / HIX0002652 / disease / GID:142679 / DUSP19 / 2q32.1 / dual specificity phosphatase 19 / AluSq / hypoth
AK126660 / GID:59271 / HIX0027783 / normal / GID:59271 / C21orf63 / 21q22.11 / chromosome 21 open reading frame 63 / AluSq / hypoth
AK127614 / GID:9462 / HIX0023758 / normal / GID:9462 / RASAL2 / 1q24 / RAS protein activator like 2 / AluSx
AF218028 / GID:81607 / HIX0001228 / normal / GID:81607 / PVRL4 / 1q22-q23.2 / poliovirus receptor-related 4 / FLAM_A / hypoth
BC009467 / GID:158960 / HIX0056219 / disease / GID:158960 / LOC158960 / Xq28 / hypothetical protein BC009467 / FLAM_A / hypoth
BC024593 / GID:11165 / HIX0033170 / disease / GID:11165 / NUDT3 / 6p21.2 / nudix (nucleoside diphosphate linked moiety X)-type motif 3 / AluSg / hypoth
BC037327 / GID:22873 / HIX0011407 / normal / GID:22873 / DZIP1 / 13q32.1 / DAZ interacting protein 1 / FLAM_C
BC018643 / GID:83716 / HIX0020171 / disease / GID:83716 / CRISPLD2 / 16q24.1 / cysteine-rich secretory protein LCCL domain containing 2 / AluSx / hypoth
BC044913 / GID:92597 / HIX0004271 / normal / GID:92597 / MOBKL1A / 4q13.3 / MOB1, Mps One Binder kinase activator-like 1A (yeast) / AluSx / hypoth
BC060883 / GID:400581 / HIX0039163 / normal / GID:400581 / LOC400581 / 17p11.2 / GRB2-related adaptor protein-like / AluSx / hypoth
CR627453 / GID:57653 / HIX0008209 / normal / GID:57653 / KIAA1529 / 9q22.33 / KIAA1529 / AluYb8 / hypoth


Acknowledgment

The H-Inv human cDNA data was analysed mostly by Paula Moolhuijzen as part of her PhD requirements.