Title: Array analysis

Pierre de la Grange

GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, Paris, France

*Address correspondence to: Pierre de la Grange, GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, 1 avenue Claude Vellefaux, 75010 Paris, France; tel: +33 (0) 157276839; fax: +33 (0) 157276 831; E-mail:

  1. Abstract

Alternative splicing is the main mechanism that increases transcriptome diversity by generating multiple RNA isoforms from a single gene. This mechanism concerns more than 90% of human genes. Cells frequently change the usage of alternative exons in response to stimulation, and alternative splicing is altered in many diseases. Recently, expression microarrays have been developed that can detect changes in exon-level expression and splice site selection. Currently, the biggest challenges for expression microarrays dedicated to the study of alternative splicing are the bioinformatics analysis of array data, their RT-PCR validation and their subsequent biological interpretation. Despite these problems, microarrays have revealed an unexpected number of alternative splicing events, whatever the experimental conditions compared to find these events (for example, pathological states, tissues, hormonal treatment, knockout of a splicing or transcription factor, siRNA against a splicing or transcription factor, etc.). It is important to underline that arrays dedicated to splicing analysis also provide robust analysis of global gene expression (i.e., the transcriptional effect) and thus could, in the medium term, replace current standard technologies for large-scale gene expression analysis such as cDNA arrays or the Affymetrix U133 arrays.

  2. Theoretical background

2.1. Microarray: general principles

Microarrays use base pairing (hybridization between nucleic acids) as a principle to detect interactions between nucleic acids, similar to Northern blot analysis. The major difference from Northern blot analysis is that the expression of many genes can be detected simultaneously (from thousands of genes to the whole genome). The majority of arrays used to study alternative splicing use oligonucleotides that are attached to glass slides. Each oligonucleotide is designed to target a specific genomic region, and one spot on the array gathers several thousand copies of the same oligonucleotide. The signal of a spot depends on the number of oligonucleotides hybridized at this spot and should, in principle, be proportional to the amount of the corresponding RNA in the sample.

Microarray slides can be produced by ink-jet printing (Agilent, ExonHit, Jivan) or by photolithography (Affymetrix). The major advantage of ink-jet printing is that it can be easily customized, since it does not require the generation of photolithographic masks. The drawback is the smaller number of spots per array (currently around 250,000). The major advantages of arrays generated by photolithography are their high number of spots (currently around 6,500,000) and a high reproducibility between arrays.

2.2. Probe design of splicing microarrays: interest and limitation

Since a probe sequence can be designed to target a specific genomic region, all the different gene regions can be targeted, including exons, parts of exons, introns, and even exon-exon junctions. The information regarding the exon/intron gene structure and the different alternative events (i.e., the annotations) used for a microarray design (i.e., for the selection of the probes of the array) is often based on known annotations from publicly available databases (derived from EST and cDNA alignments against genomic sequence) and can also include information from predictions (e.g., in silico gene and exon predictions, cross-species conservation). A list of common databases is given in chapter 44 (de la Grange). Depending on the array used (see below), two major kinds of probes can be included on the chip: exon probes and exon-exon junction probes.

Exon probes detect the expression of a specific exon (e.g., a cassette exon) or of a specific exon part (e.g., the part corresponding to an alternative 3’ or 5’ splice site), as well as intron retention (i.e., the intronic region corresponding to an intron retention event). This kind of probe is also useful for studying global gene expression regulation (i.e., the transcriptional effect), as the probesets cover most of the available RNA information. Another advantage of this kind of probe is that events that are not known to be alternative can be detected (e.g., a “new” cassette exon). Annotations based on in silico predictions and cross-species conservation can also lead to the discovery of “new” genes and exons.

Exon-exon junction probes hybridize half to the end of one exon and half to the beginning of the next exon. Depending on the alternative splicing events located between these two exons, the set of exon-exon junction probes can vary (e.g., if exon 3 is known to be a cassette exon, exon-exon junction probes will be designed between exons 2 and 3, between exons 3 and 4, and between exons 2 and 4). This kind of design is particularly efficient for studying the regulation of alternative events but should not be used for studying global gene expression regulation. Another limitation of this design is that, because of limited array space, exon-exon junction probes are designed to target known alternative events only. However, custom microarrays designed to study the expression regulation of only a few genes can include all possible exon-exon junctions.
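The junction design described above can be illustrated with a minimal sketch. It assumes a gene whose exons are simply numbered 1 to n and models single cassette exon skipping only; the function names and the `half_length` parameter are hypothetical and do not correspond to the design rules of any particular manufacturer.

```python
def junction_pairs(n_exons, cassette_exons):
    """List the exon-exon junctions to probe for a gene with exons 1..n_exons.

    cassette_exons: set of exon numbers known to be skipped in some isoforms.
    Only single-exon skipping is modelled (an assumption of this sketch).
    """
    pairs = set()
    # Junctions between consecutive exons.
    for i in range(1, n_exons):
        pairs.add((i, i + 1))
    # Skipping junctions that bypass each known internal cassette exon.
    for exon in cassette_exons:
        if 1 < exon < n_exons:
            pairs.add((exon - 1, exon + 1))
    return sorted(pairs)


def junction_target(upstream_exon_seq, downstream_exon_seq, half_length=12):
    """Target sequence of a junction probe: end of one exon + start of the next.

    half_length is a hypothetical value chosen for illustration.
    """
    return upstream_exon_seq[-half_length:] + downstream_exon_seq[:half_length]


# Exon 3 of a four-exon gene is a known cassette exon.
print(junction_pairs(4, {3}))  # [(1, 2), (2, 3), (2, 4), (3, 4)]
```

With exon 3 as a cassette exon, the sketch returns the 2-3, 3-4 and 2-4 junctions of the example in the text, plus the 1-2 junction between the remaining consecutive exons.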

2.3. Available splicing microarrays

Two kinds of microarrays can be used for splicing studies: commercial microarrays and custom microarrays.

Three companies provide commercial microarrays for splicing studies. Affymetrix provides the GeneChip® “Exon Array” system, which is based on exon probes only but is able to detect the expression of more than one million exons for human, mouse and rat (the 200,000 known exons plus around 800,000 putative exons). Currently, it is the most widely used microarray for splicing studies.

ExonHit Therapeutics provides the “SpliceArray” (for Human, Mouse and Rat) that is based on both exon and exon-exon junction probes. As described above, annotations are based on known alternative events only.

JIVAN provides the “SpliceExpress” that is based on both exon and exon-exon junction probes. Annotations are also based on known alternative events.

Both Affymetrix and Agilent provide custom microarrays on demand. The Agilent technology allows flexible custom development (synthesis of the oligonucleotides to be printed on the chip) and comes with free software named “eArray” that greatly facilitates the design of probes.

Affymetrix is developing a new generation of microarrays for splicing studies (“Junction Array”). This array will include both exon and exon-exon junction probes. No information is available regarding when this microarray will be commercialized. Other companies are certainly developing, or will develop, new microarrays dedicated to splicing.

2.4. The different steps of the microarray data treatment

After data acquisition by chip scanning and signal quantification, several steps are necessary to obtain relevant results that can be further exploited by biologists (see figure 1). The first one is the normalization of data. This step is necessary to make the intensities from all the chips of a given experiment comparable, so that differences found in the analysis step come from biological effects and not from other factors (e.g., date of the experiment, technician who performed the experiment, technical variation).

Another step in the pre-treatment of data (before the statistical analysis) is background subtraction. Each spot of the chip yields a signal value. This signal can be separated into two parts: the first corresponds to the specific signal due to the expression of the targeted genomic region, and the second corresponds to a non-specific background signal. Thus, the goal of this step is to estimate the general background intensity and then to subtract this background from all probe intensities.

The objective of the statistical analysis step is to find relevant differences in gene expression between the tested experimental conditions at the gene level (transcriptional effect) and at the exon level (splicing effect).

Once the lists of regulated genes and exons have been obtained, a visual inspection of these results is necessary. It makes it possible to check the quality of the results (reproducibility, fold-change…) and also to start their biological interpretation (e.g., what kind of alternative event is regulated?).

A subsequent functional analysis can be performed to predict the functional consequences of the predicted regulations (e.g., in which pathways are the predicted regulated genes involved?).

  3. Protocol

A microarray project aiming to study gene expression regulation between two experimental conditions should include at least six arrays: for statistical reasons, each condition must be tested in triplicate in order to distinguish biological effects from technical variation. Taking this into account, a project with six Exon Arrays requires the analysis of around 40 million data points.
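The order of magnitude quoted above can be checked with a short calculation. It assumes that the approximate spot count given in section 2.1 for photolithography arrays (around 6,500,000) applies to each Exon Array; the variable names are illustrative.

```python
n_conditions = 2
replicates_per_condition = 3
spots_per_array = 6_500_000  # approximate figure from section 2.1 (assumption)

n_arrays = n_conditions * replicates_per_condition
total_points = n_arrays * spots_per_array

print(n_arrays)      # 6
print(total_points)  # 39000000, i.e. around 40 million
```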

3.1. Normalization

Most commonly, normalization is based on all genes on the array, under the assumption that the expression level of the majority of genes does not change between two conditions. Microarray intensities should always be examined on a log2 scale. This transformation should make the variance of the measurements roughly the same across the whole intensity range. Differences of log2 intensities give the log2 ratios (M values) for a comparison. Then, a robust estimate of a “rescaling” factor (e.g., the median of the differences) has to be computed. There are many normalization methods. Which method is most stable and gives the best results depends on the type of data, the image analysis program, etc. To determine the best method, it is a good idea to try several methods initially on a few datasets and to inspect the results visually using controls:

- Scale normalization: the simplest way to normalize data is to adjust the scale of the data, e.g., to set the median of the differences to 0. This does not take into account any region- or intensity-dependent effects;

- Quantile normalization: a similar idea to scale normalization but more drastic, as all the quantiles are adjusted and not only the 50% quantile (median). This type of normalization is the most commonly used for Affymetrix arrays;

- Other methods are also applicable but are not described here (Lowess, VSN…).
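The two methods described in this list can be sketched as follows. This is a minimal illustration using NumPy, assuming log2 intensities stored with one row per probe and one column per array; the function names are hypothetical, and the handling of ties in the quantile step is simplified compared with dedicated packages.

```python
import numpy as np


def scale_normalize(sample, reference):
    """Shift `sample` so that the median of (sample - reference) becomes 0.

    Both inputs are 1-D arrays of log2 intensities for the same probes.
    """
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    m_values = sample - reference          # log2 ratios (M values)
    return sample - np.median(m_values)    # robust rescaling factor


def quantile_normalize(log2_intensities):
    """Give every array (column) the same intensity distribution.

    The common distribution is the mean of the sorted columns. Ties are
    ranked in order of appearance, a simplification of this sketch.
    """
    x = np.asarray(log2_intensities, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    reference_distribution = np.sort(x, axis=0).mean(axis=1)
    return reference_distribution[ranks]


# Toy data: 4 probes (rows) measured on 2 arrays (columns), log2 scale.
data = np.array([[5.0, 7.0], [6.0, 6.5], [8.0, 9.5], [7.0, 8.0]])
shifted = scale_normalize(data[:, 1], data[:, 0])
normalized = quantile_normalize(data)
```

After scale normalization only the median of the M values is constrained, whereas after quantile normalization the sorted values of all columns are identical, which is why the latter is described as more drastic.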

3.2. Background subtraction

Each microarray contains many control probes that are used to estimate the background intensity. For example, in the case of the Affymetrix Exon Arrays, the background estimate is based on the GC content of the probes. The Affymetrix probe length is always 25 nt. For each GC content (from 1 to 25), there are around 1,000 control probes with the same GC content that do not target a transcribed genomic region. The corrected signal intensity of a given probe is obtained by calculating the median intensity of the control probes with the same GC content and then subtracting this value from the raw signal intensity of the probe.
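The GC-based correction described above can be sketched as follows. This is an illustration under stated assumptions, not the Affymetrix implementation: the function names are hypothetical, probe sequences are passed as plain strings, and probes without a matching control bin are returned as NaN rather than guessed.

```python
import numpy as np


def gc_count(sequence):
    """Number of G or C bases in a probe sequence."""
    return sum(base in "GC" for base in sequence.upper())


def subtract_gc_background(probe_seqs, probe_signals, control_seqs, control_signals):
    """Subtract from each probe the median signal of control probes
    that have the same GC count."""
    control_gc = np.array([gc_count(s) for s in control_seqs])
    control_signals = np.asarray(control_signals, dtype=float)

    # Median background for each GC count present among the control probes.
    background = {
        int(gc): float(np.median(control_signals[control_gc == gc]))
        for gc in np.unique(control_gc)
    }

    corrected = []
    for seq, signal in zip(probe_seqs, probe_signals):
        gc = gc_count(seq)
        # No control probe with this GC count: leave the value undefined.
        corrected.append(signal - background[gc] if gc in background else np.nan)
    return np.array(corrected)


# Toy example with 5-nt sequences (real Exon Array probes are 25 nt long).
controls = ["GCGCT", "CCGGA", "GGCCT", "AATTA", "TTAAT"]
control_values = [10, 30, 20, 5, 7]
result = subtract_gc_background(["GGCCA", "ATATA"], [120, 50], controls, control_values)
print(result)
```

The sketch does not clip negative values; whether corrected intensities below zero should be floored before a log2 transformation is a separate design choice.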

3.3. Statistical analysis

ExonHit Therapeutics and Jivan provide their own analysis systems for their chips, but Affymetrix does not provide any software for its Exon Arrays. Currently, several algorithms and software packages are available to analyze data from these arrays. The corresponding algorithms are based on several methods:

- The Splicing Index (the logarithm of the ratio of the exon signal to the total signal from the gene: log2[exon/total]) [1];

- PAC (Pattern-Based Correlation);

- MIDAS (Microarray Detection of Alternative Splicing);

- ASNOVA [2].

For more information about these methods, see the corresponding “white paper” on the Affymetrix website [3].
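As an illustration of the first method in this list, the Splicing Index can be sketched as follows. The sketch assumes linear-scale (not log-transformed) exon and gene signals for each replicate and compares the mean index between two conditions; how the gene-level signal is summarized and how significance is assessed are not addressed here, and the function names are hypothetical.

```python
import numpy as np


def splicing_index(exon_signal, gene_signal):
    """log2(exon / total gene signal), computed for each sample."""
    exon_signal = np.asarray(exon_signal, dtype=float)
    gene_signal = np.asarray(gene_signal, dtype=float)
    return np.log2(exon_signal / gene_signal)


def splicing_index_change(exon_treated, gene_treated, exon_control, gene_control):
    """Difference of mean Splicing Index between treated and control samples.

    A positive value suggests higher relative inclusion of the exon
    in the treated condition.
    """
    si_treated = splicing_index(exon_treated, gene_treated)
    si_control = splicing_index(exon_control, gene_control)
    return si_treated.mean() - si_control.mean()


# Toy triplicates: gene expression is unchanged, the exon signal increases.
change = splicing_index_change(
    exon_treated=[400, 420, 380], gene_treated=[200, 210, 190],
    exon_control=[100, 110, 90], gene_control=[200, 220, 180],
)
print(change)
```

Dividing by the gene signal is what separates the splicing effect from the transcriptional effect: an exon whose signal changes in proportion to the whole gene keeps the same index.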

Based on these methods, several commercial software packages/services are available. Three of them are the most widely used:

- EASANA (GenoSplice technology);

- Genomics Suite (Partek);

- XRay (Biotique Systems).

Genomics Suite and XRay are client software and must be installed on the user’s computer. It is important to underline that specific knowledge and/or some learning time may be necessary before these software packages can be used easily. EASANA is provided as a service: the user sends the CEL files (obtained after scanning the chips) to the GenoSplice server, and the company sends back the result files and provides assistance with biological interpretation and/or other personalized services. No specific knowledge of splicing or bioinformatics is necessary. EASANA is based on an algorithm developed within a EURASNET team.

3.4. Visualization of data

Generally, the analysis software provides a visualization system for checking the results. Most of these systems show the splicing index curve, for example Genomics Suite from Partek (figure 1A). The BLIS interface from Biotique Systems (the provider of XRay) displays the signal intensity of probesets in the different experiments (figure 1B). The EASANA visualization module from GenoSplice technology (developed in collaboration with EURASNET teams) displays both the mean intensity across all pairs of experiments (i.e., treatment vs. control) and the intensity in each pair of experiments, to show the reproducibility between experiments. A simple color system makes intensity variations at the gene and exon levels explicit (figure 1C). Interestingly, the signal intensities are displayed at the probe level (i.e., not at the probeset level, a probeset being the group of probes that target the same exon region).

3.5. Functional analysis of results

Functional analysis can be performed with powerful free software/websites. Two examples are DAVID (david.abcc.ncifcrf.gov) [4] and PANTHER [5]. For both tools, the user has to select a list of reference genes (e.g., the human RefSeq genes) and to input a list of genes of interest in order to analyze their functions and the pathways in which they are involved. In the case of an Affymetrix Exon Array project, this list can be the list of regulated genes (transcriptional effect) or the list of genes for which regulation at the exon level was predicted (splicing effect).
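The role of the reference list can be illustrated with a simple over-representation test. The sketch below assumes a plain one-sided hypergeometric test; it is not the exact statistic implemented by DAVID or PANTHER, and the function name and all the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_reference, n_annotated, n_selected, n_overlap):
    """Probability of observing at least `n_overlap` pathway genes
    when `n_selected` genes are drawn at random from the reference list.

    n_reference: number of genes in the reference list
    n_annotated: reference genes annotated to the pathway of interest
    n_selected:  number of genes in the list of interest
    n_overlap:   genes of interest that are annotated to the pathway
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_reference, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_reference - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 reference genes, 400 of them in a pathway,
# 700 regulated genes, 30 of which belong to the pathway.
p = enrichment_p_value(20_000, 400, 700, 30)
print(p)
```

The same list of genes of interest can give very different p-values depending on the reference list, which is why the choice of reference genes matters.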

  4. Example of an experiment

In order to identify exons regulated by the splicing factors PTB and nPTB, an Exon Array project led by Christopher W.J. Smith (Cambridge University, UK) was conducted. HeLa cells treated with siRNAs targeting PTB/nPTB were compared with cells treated with a control siRNA (to be published).

These microarray data were analyzed by different analysis systems, including EASANA from GenoSplice technology. This system provides lists of regulated genes and regulated exons, each with two levels of confidence (“high” and “low”). The “high” level only considers “high-quality probes” from well-annotated genes, by filtering probes according to their specificity in addition to their expression (GC content, overlap with repeat regions, cross-hybridization). The “low” level includes all probes corresponding to well-annotated genes. EASANA predicted 721 genes regulated by siPTB/nPTB using the “high” confidence level (280 up; 441 down) and 1,543 genes regulated by siPTB/nPTB using the “low” confidence level (450 up; 1,093 down). In terms of exon regulation, 218 exons were predicted to be regulated by siPTB/nPTB using the “high” confidence level and 2,273 using the “low” confidence level. In both lists, exon 15 of the KIAA0652 gene was predicted to be specifically included with siPTB/nPTB. This exon was also predicted by the other analysis systems that were run on this experiment. The visualization systems from Genomics Suite (Partek), BLIS (Biotique Systems) and EASANA (GenoSplice technology) are presented in figure 1 for this event.

  5. Troubleshooting

The major problem with array experiments is their poor agreement with other methods, notably with RT-PCR. The validation rate can be as low as 35% [6]. In the majority of cases it is around 50-80%. Since these numbers only address the false positive cases, the real error rates, which include false negatives, will be much higher. One reason for the poor reproducibility could be the large number of unknown RNAs that often overlap with known transcripts [7].

The next problem concerns the data analysis. Different software/algorithms do not provide the same results for the same project. Even if one or two software packages/services are better than the others, several systems should ideally be used in order to gather as many results as possible. The choice between software and services can be made according to the knowledge and capacities available within the team. In addition, it is also possible to develop an in-house analysis system. However, this can take a very long time, and the results may not be as relevant as those provided by existing solutions.

Another important constraint is that array experiments give no connectivity information between distant exons, even with junction probes. For example, if two alternative events are predicted to be regulated in the same gene, it is not possible to know whether the regulation of event #1 is associated with that of event #2 or not. These specific events can be analyzed by RT-PCR-based methods, described in chapters 18 (Smith) and 19 (Chabot).

Figure legends

Figure 1: Visualization systems for the Affymetrix Exon Array data: example of the siPTB/nPTB effect on exon 15 of the KIAA0652 gene