Supplementary notes

1/ The computational pipeline

Figure 1: Overview of the computational pipeline for the prediction of conserved direct target genes of a transcription factor.

To identify functional target sites, and consequently the target gene battery of a given transcription factor, in a genome-wide manner, we devised a multi-step procedure (figure 1) that relies on the evolutionary conservation of functionally relevant transcription factor binding sites. Because the procedure is applied at a genome-wide level, it allows whole sets of target genes to be predicted.

For the genome-wide analysis, we first reduce complexity by limiting the search space to the regions upstream of annotated genes, which we then search for the motif corresponding to the transcription factor binding site. Regions that do not contain the motif of interest are discarded in this step. We then perform a pairwise alignment of the selected regions with the corresponding regions of closely related genomes (e.g. human/mouse or human/rat) using promoterwise [1], an alignment tool adapted for the analysis of non-coding sequences (http://www.ebi.ac.uk/~birney/wise2/). Regions in which the identified binding site does not lie in a conserved stretch are discarded. In the next step, we scan orthologous regions in more diverged species for the presence of the motif. This additional filtering step is independent of any alignment, i.e. the motif does not have to lie in a conserved stretch.
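The first filtering step, scanning an upstream region for matches to a position weight matrix, can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the matrix `TOY_PWM` is a hypothetical toy example (not a Transfac matrix), the score is rescaled between the worst and best possible site (one plausible reading of a percentage match cutoff), and only the forward strand is scanned.

```python
from typing import Dict, List, Tuple

# Hypothetical toy matrix: one dict of base frequencies per motif position.
# These values are invented for illustration and are NOT a Transfac matrix.
TOY_PWM: List[Dict[str, float]] = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def relative_score(pwm: List[Dict[str, float]], site: str) -> float:
    """Score of a site rescaled to [0, 1] between the worst and best possible site."""
    raw = sum(column[base] for column, base in zip(pwm, site))
    best = sum(max(column.values()) for column in pwm)
    worst = sum(min(column.values()) for column in pwm)
    return (raw - worst) / (best - worst)


def scan(pwm: List[Dict[str, float]], sequence: str,
         cutoff: float = 0.85) -> List[Tuple[int, str, float]]:
    """Return (start, site, score) for every forward-strand window scoring >= cutoff."""
    width = len(pwm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(base not in "ACGT" for base in site):
            continue  # skip windows containing ambiguous bases such as N
        score = relative_score(pwm, site)
        if score >= cutoff:
            hits.append((start, site, score))
    return hits


def has_motif(pwm: List[Dict[str, float]], sequence: str, cutoff: float = 0.85) -> bool:
    """True if the region contains at least one hit; regions without hits are discarded."""
    return bool(scan(pwm, sequence, cutoff))


if __name__ == "__main__":
    print(scan(TOY_PWM, "AAATTCGCAAA"))
```

A real scan would also consider the reverse strand and would use the scoring scheme of the matrix database in question.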

For many of these upstream sequences, no significant alignment is detectable between diverged genomes, which makes it difficult to identify the orthologous region in the diverged genome. However, in most cases gene orthology is unambiguous and can be used as an anchor to identify the corresponding upstream region. For example, orthologous genes in fish (zebrafish and fugu) are identified and their upstream regions are then scanned for the presence of the motif. An upstream region passes this second filtering step only if the motif is present in at least one of the corresponding fish regions. Predicted target genes are then defined by virtue of their linked upstream region that contains the site.
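The orthology-anchored filter described above amounts to a simple lookup. The sketch below assumes two precomputed inputs that are not specified in these notes: a mapping from each human gene to its fish orthologs, and the set of fish genes whose upstream region contains the motif. All gene identifiers are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Set


def fish_filter(candidates: Iterable[str],
                fish_orthologs: Dict[str, List[str]],
                fish_genes_with_motif: Set[str]) -> List[str]:
    """Keep candidates for which at least one fish ortholog has the motif upstream.

    No alignment is involved: gene orthology alone anchors the comparison.
    """
    kept = []
    for gene in candidates:
        orthologs = fish_orthologs.get(gene, [])  # genes without orthologs are dropped
        if any(ortholog in fish_genes_with_motif for ortholog in orthologs):
            kept.append(gene)
    return kept


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    orthologs = {
        "humanA": ["zfA", "fuguA"],
        "humanB": ["zfB"],
        "humanC": [],
    }
    with_motif = {"fuguA"}
    print(fish_filter(["humanA", "humanB", "humanC", "humanD"], orthologs, with_motif))
```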

Benchmarking

We benchmarked our in silico procedure using the binding site of the transcription factor E2F (Transfac, Jaspar [2,3]) and comparing our data set with that obtained by chromatin immunoprecipitation (IP) [4] (see also Materials and Methods). The E2F position weight matrix (PWM) (Transfac, M00516) was used to search for E2F target genes as described in Materials and Methods. The sensitivity and specificity of the procedure were calculated for each cutoff of the matrix, and the receiver operating characteristic (ROC) curve was plotted (figure 2).

Figure 2: ROC curve. Two alternative PWMs for the E2F binding site were used (M00050 and M00516 from Transfac [2]). Variable cutoffs were analyzed, and the predicted target genes were compared with the data from [4] as described in Materials and Methods.
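The points of such a ROC curve can be computed from set overlaps. The following is a minimal sketch under stated assumptions: the experimentally analyzed genes form the universe, the experimentally bound genes form the positive set, and each cutoff yields one set of predicted genes. The function names and the toy gene identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def sensitivity_specificity(predicted: Iterable[str],
                            positives: Iterable[str],
                            universe: Iterable[str]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    universe_set: Set[str] = set(universe)
    predicted_set = set(predicted) & universe_set  # ignore genes never tested
    positive_set = set(positives) & universe_set
    negative_set = universe_set - positive_set

    tp = len(predicted_set & positive_set)
    fp = len(predicted_set & negative_set)
    fn = len(positive_set) - tp
    tn = len(negative_set) - fp

    sensitivity = tp / (tp + fn) if positive_set else float("nan")
    specificity = tn / (tn + fp) if negative_set else float("nan")
    return sensitivity, specificity


def roc_points(predictions_by_cutoff: Dict[float, Iterable[str]],
               positives: Iterable[str],
               universe: Iterable[str]) -> List[Tuple[float, float, float]]:
    """Return (cutoff, false positive rate, sensitivity), one tuple per cutoff."""
    positives = list(positives)
    universe = list(universe)
    points = []
    for cutoff in sorted(predictions_by_cutoff):
        sens, spec = sensitivity_specificity(predictions_by_cutoff[cutoff],
                                             positives, universe)
        points.append((cutoff, 1.0 - spec, sens))
    return points


if __name__ == "__main__":
    # Toy data: ten analyzed genes, four of them experimentally bound.
    universe = [f"g{i}" for i in range(1, 11)]
    positives = ["g1", "g2", "g3", "g4"]
    by_cutoff = {0.80: ["g1", "g2", "g3", "g5", "g6"], 0.85: ["g1", "g2", "g5"]}
    for point in roc_points(by_cutoff, positives, universe):
        print(point)
```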

Next, we set the PWM match cutoff to 85% using the matrix that performed best (M00516). To test whether the computational procedure significantly enriches for target genes experimentally identified by Ren et al. [4] as bound by human E2F1/E2F4 in their upstream regions, a randomization procedure was applied. A set of genes was randomly sampled from the genes analyzed experimentally. The number of genes in that random set corresponds to the number of genes found by the computational procedure that overlap with the set of genes analyzed by Ren et al. [4]. The overlap between the random set and the positive set was assessed and compared with the real overlap (obtained with the predicted target genes). This randomization procedure was repeated 100000 times (table 1 and table 2).
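The randomization procedure can be sketched in a few lines. This is an illustrative reimplementation, not the original code: the function name, the fixed random seed and the toy gene lists are assumptions, and the sizes of the analyzed and positive gene sets in the example are invented because they are not stated here.

```python
import random
from typing import Iterable, Tuple


def randomization_test(all_genes: Iterable[str],
                       positive_genes: Iterable[str],
                       n_predicted: int,
                       observed_overlap: int,
                       n_iter: int = 100000,
                       seed: int = 0) -> Tuple[float, float]:
    """Sample n_predicted genes from all_genes without replacement, n_iter times.

    Returns the mean overlap with the positive set and the fraction of random
    samples whose overlap is at least as large as the observed overlap.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility (assumption)
    pool = sorted(set(all_genes))      # sorted so results do not depend on set order
    positives = set(positive_genes)
    total_overlap = 0
    at_least_observed = 0
    for _ in range(n_iter):
        sample = rng.sample(pool, n_predicted)
        overlap = sum(1 for gene in sample if gene in positives)
        total_overlap += overlap
        if overlap >= observed_overlap:
            at_least_observed += 1
    return total_overlap / n_iter, at_least_observed / n_iter


if __name__ == "__main__":
    # Hypothetical universe: 1000 analyzed genes, 80 of them positive.
    all_genes = [f"g{i}" for i in range(1000)]
    positives = all_genes[:80]
    # 14 predicted genes with 12 in the positive set, as in the last column of table 1.
    mean_overlap, fraction = randomization_test(all_genes, positives, 14, 12, n_iter=10000)
    print(mean_overlap, fraction)
```

Sampling without replacement from the analyzed genes keeps the random sets the same size as the real prediction, so the two overlaps are directly comparable.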

Real data

Filtering steps / Motif present / Conserved with rodents / Conserved with fishes / Conserved both in rodents and fishes
In ALL / 165 / 29 / 26 / 14
In POS / 34 / 19 / 17 / 12
percentage in POS / 20.6 / 65.5 / 65.4 / 85.7

Table 1: Number of genes predicted at each filtering step that overlap with the genes analyzed by [4] (ALL) and with the positive set from [4] (POS). The filtering steps are the presence of a hit in the upstream region of the human gene (Motif present), the presence of the motif in a region conserved with rodents (Conserved with rodents), the presence of the motif in the orthologous region of fish (Conserved with fishes), or the full filtering procedure as described in Materials and Methods (Conserved both in rodents and fishes).

Randomization

Filtering steps / Motif present / Conserved with rodents / Conserved with fishes / Conserved both in rodents and fishes
In ALL / 165 / 29 / 26 / 14
In POS (average) / 13.5 / 2.4 / 2.1 / 1.1
percentage in POS / 8.5 / 8.3 / 8.0 / 7.1

Table 2: Genes within ALL were randomly sampled and the overlap with POS was calculated. For each step, the number of genes sampled is the same as the number found using the computational procedure, and the randomization was repeated 100000 times per filtering step.

In no case did the random dataset show an overlap equal to or better than that of the real dataset (P value < 0.00001). Furthermore, the percentage of predicted genes that overlap with the positive set increases at each filtering step. These results confirm that the filtering procedure significantly enriches for target genes and that this enrichment improves with each filtering step.
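The bound on the P value follows from the number of randomizations. As a worked illustration, assume the common convention of counting the real data as one of the permutations (these notes do not state which estimator was used). With $B$ randomizations, of which $r$ reach or exceed the real overlap ($B$ and $r$ are notation introduced here):

```latex
\hat{p} = \frac{r + 1}{B + 1}, \qquad
r = 0,\; B = 100000 \;\Rightarrow\;
\hat{p} = \frac{1}{100001} \approx 9.9999 \times 10^{-6} < 10^{-5}
```

Under this convention, the reported bound is the smallest one that 100000 randomizations can support.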

2/ Promoter analysis

We tested the ability of those constructs with an altered ATOH7 (Ath5) binding site to drive the expression of a GFP reporter in vivo in transgenic medaka embryos. Due to the high stability of the GFP protein, fluorescence is also detected in mature neurons, long after ath5 expression has ceased. Embryos injected with GFP reporter constructs in which the E-box consensus in the two conserved motifs is disrupted (HP) fail to efficiently express GFP in the endogenous domain (not shown, see also Materials and Methods).

Interestingly, the variations that change the binding motif without altering the E-box consensus (M, N and MN, see Table S3 for primer sequences) had a striking effect on the specificity of the promoter. In transgenic embryos carrying any of the three variants, ectopic GFP expression domains were detected in addition to the endogenous domain in the retina. These were the olfactory receptor neurons (Fig. S1C), as well as yolk cells and the tail region (data not shown). Ath5 is not normally expressed in these territories at any developmental stage, with the exception of Xenopus [5], where a single nucleotide alteration within one of the two conserved motifs is found (see Fig. 1A). Strikingly, the same alteration in the medaka promoter, or the Xenopus promoter itself, when analyzed in transgenic medaka embryos, drives GFP expression in both the retina and the olfactory receptor neurons (data not shown). Thus, the change of a single base pair in the E-box of the medaka promoter is sufficient to attract a new regulatory input, leading to the establishment of a new expression domain.

3/ Specificity of activation

While Xenopus Neurod4 (Xath3) can activate the Ath5::GFP reporter, it only partially activates the predicted Ath5 target genes (Fig. S3). Interestingly, this activation is mediated by the induction of endogenous Ath5 expression and is efficiently blocked in the presence of Ath5 morpholino oligos (Table S3). Xenopus Ash1 (Xash1), in contrast, activates neither the Ath5 targets nor the Ath5 reporter. Both are efficiently activated by Xenopus Xath5, an activation that is insensitive to the medaka Ath5 morpholino, indicating the specificity of the morpholino and of the interaction.

4/ Mutant analysis

To investigate whether the in silico predicted target genes are directly controlled by Ath5, we examined the expression of six predicted targets in the retina of the zebrafish mutant lakritz (lak), which does not express functional Ath5 [6]. In all cases analyzed (Brn3C, CD166, Adam11, Gfi-1, HuC and NN-1; Fig. S4), RGC expression of the target genes was specifically abolished in lak mutant retinae, indicating that lak/ath5 function is required for the expression of the target genes in this domain. The expression of these Ath5 target genes is not affected in their other expression domains outside the retina, indicating an Ath5-independent input into their transcriptional control.

References

1. Birney E, Clamp M, Durbin R (2004) GeneWise and Genomewise. Genome Res 14: 988-995.

2. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34: D108-110.

3. Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, et al. (2006) A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res 34: D95-97.

4. Ren B, Cam H, Takahashi Y, Volkert T, Terragni J, et al. (2002) E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16: 245-256.

5. Burns CJ, Vetter ML (2002) Xath5 regulates neurogenesis in the Xenopus olfactory placode. Dev Dyn 225: 536-543.

6. Kay JN, Finger-Baier KC, Roeser T, Staub W, Baier H (2001) Retinal Ganglion Cell Genesis Requires lakritz, a Zebrafish atonal Homolog. Neuron 30: 725-736.