Transcription Factor and Microrna Motif Discovery: the Amadeus Platform and a Compendium

Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets

Supplemental Notes

A. Data

Genomic sequences

Promoter sequences (repeat and exon masked) of human, mouse, fly (D. melanogaster) and worm (C. elegans) genes were extracted from the Ensembl (Birney et al. 2004) project (releases 42-44, Dec 2006 – Apr 2007). Each sequence includes 3 Kb upstream of the TSS, and the first two exons and introns. Yeast (S.cerevisiae) promoter sequences were downloaded from SGD( Full-length 3' UTR sequences (repeat masked) of human and mouse genes were downloaded from Biomart ( Ensembl release 45, Jun 2007). The total number of sequences for each species is listed in the table below.

Sequence type / Organism / Source / Number of sequences
Promoters / Human / Ensembl v42 / 22,972
Mouse / Ensembl v42 / 24,344
Fly / Ensembl v43 / 14,006
Worm / Ensembl v44 / 17,899
Yeast / SGD / 6,329
3' UTRs / Human / Ensembl v45 / 18,417
Mouse / Ensembl v45 / 16,591

Binding patterns of TFs and miRNAs

Yeast TFBS motifs were taken from MacIsaac et al. (MacIsaac et al. 2006); only TFs with known binding specificities were used. PWMs of known metazoan TFs were obtained from the Transfac database (Wingender et al. 1996) (release 10.2, June 2006). For miRNAs, we set the binding motif as the 8-mer seed of the mature miRNA, downloaded from miRBase (release 10.0, Aug 2007,

B. Benchmarks

Synthetic and small benchmarks

Many studies that present new motif discovery algorithms demonstrate the improved accuracy of their methods using simulated data and/or a small ad-hoc collection of datasets. Given the large number of datasets in the literature, any reasonable method is likely to perform well on a small sub-collection of datasets. Obviously, a reliable measure of the performance of an algorithm can only be induced from a large set of samples. As pointed out by Tompa et al. (Tompa et al. 2005), implanting BSs inside synthetically generated sequences is problematic as well, since we do not know the 'correct' stochastic process that nature uses. Indeed, cis-regulatory sequences contain multiple heterogeneous signals (e.g., active TF/miRNA BSs and evolutionary remnants of such sites, structure-based constraints, and more), and thus exhibit complex statistical characteristics and inter-dependencies that are not captured by computational models like Markov chains. A simple example is the significant phylum-specific local variation in G+C content over a large region around the TSS (Aerts et al. 2004). Moreover, the exact method used for generating the sequences and implanting the motifs inside them may favor certain types of algorithms and motif-evaluation scores over others.In order to compare the performance of different tools and estimate their accuracy on real data, we believe the benchmark must be based on a large number of experimentally-derived datasets.

Our benchmark setup

Our first benchmark consists of 173 gene sets from the yeast ChIP-chip datasets (Harbison et al. 2004), which correspond to 83 distinct TFs (most TFs were tested under several conditions). We included only sets with more than four genes and for which a binding motif has been reported in the literature. The second benchmark is comprised of 42TF and miRNA target sets in our metazoan compendium (see Fig. 3 and Table 1). In both benchmarks, we compared the performance of Amadeus and five extant motif finders. Each program was executed twice on each target set – once with motif length 8, and once with 10. We compared the two top-scoring motifs reported in each execution to the known Transfac/miRBase pattern(s) of the corresponding TF/miRNA. One may expect that in an optimal scenario the correct motif would always be the top-ranked motif. However, since often several TFs co-occur in the same promoters, forming cis-regulatory modules (CRMs), even an ideal motif finder might rank the correct motif below some of its CRM partners. In light of this, and since the event of a TF having more than one partner with near-perfect co-occurrence seems unlikely, we chose to consider the two top-scoring motifs found in each execution (this idea was proposed by Tompa et al. (Tompa et al. 2005)). A program was considered to successfully recover the real binding motif if at least one of the four motifs it reported (two for each motif length) matched the known pattern below a pre-defined PWM divergence cutoff (see below).

Comparison to the benchmark of Ettwiller et al.

Ettwiller et al. (Ettwiller et al. 2007) collected 10 mammalian ChIP-chip datasets and used them as a benchmark for motif discovery tools. Table 1 compares their collection to the other benchmarks described in our paper. They developed a new motif finder, called Trawler, and reported that it successfully recovered the correct motifin all 10 datasets, 9 of which are included in our compendium. However, in our experiments, Trawler identified the correct motif only in 4 of these datasets (E2F, Myod, Myog, NFKB), even when using a liberal PWM divergence cutoff. There are several possible reasons for the discrepancy in the results: (1) In some cases, there are large differences between the gene setsused by Ettwiller et al. and the corresponding sets in our study. For example, their CREB dataset contains only 200 of the 2,338 CREB targets reported by Zhang et al. (Zhang et al. 2005), whereas we used the entire target set (all the sets in our compendium were taken as-is from the corresponding study, with no additional processing or filtering). (2) We used the PWMs from the Transfac database as the correct TFBS motifs; Ettwiller et al. used motifs from the literature. For a few TFs, their motif is a consensus string that differs from the Transfac PWM. For instance, they use "ATCGAT" as the correct ONECUT (HNF6) motif, whereas the PWM (M00639) in Transfac has a core consensus of "AAATCAAT". (3) In order to ascertain whether a program found the correct motif, our benchmark computes the divergence between the correct PWM and each of the top two motifs reported by the program. Ettwiller et al. manually compared the reported motifsto those from the literature. (4) For reasons described earlier, in all our experiments we used the samepromoter region (from -1000bp to +200bp w.r.t. the TSS) anda fixed BG set per organism (2,000 genes, randomly chosen from the corresponding genome).Ettwiller et al. used a different promoter region and BG set for each dataset, according to the sequences represented on the corresponding microarray.

Execution parameters of motif discovery tools

We compared the performance of Amadeus to that of MEME (Bailey and Elkan 1994), AlignACE (Hughes et al. 2000), YMF (Sinha and Tompa 2002), Trawler (Ettwiller et al. 2007)and Weeder (Pavesi et al. 2001).We chose a set of popular tools that represent a wide range of algorithms, and, with the exception of Trawler, are publicly available as standalone executables (in order to allow running time comparison). Trawler, a new motif discovery pipeline, is accessible only via a web interface. We included it in our comparison since according to its developers it achieves high success rates on mammalian datasets.

The version and execution parameters of each program we tested are provided below:

MEME:MEME (v3.0.3) was run with the following parameters: "-dna –mod zoops –text –nostatus –maxsize 18830000 –nmotifs 2 –w ML –bfile BF", where "ML" is the motif length (8 or 10), and "BF" is the file with the BG model – frequencies of all 4-tuples in the BG sequences. For promoter analysis we also supplied the "-revcomp" parameter (to search on both strands).

AlignACE: AlignACE (v4.0) was executed with "-numcols ML–nocols –gcback BG", where "ML" is the motif length (8 or 10), and "BG" is the G+C frequency in the BG sequences.

YMF: The "statsvar" program from the YMF package (v3.0) was run with the "-sort" option. The BG model files ("organism tables") were prepared using the "preproc" program. The following changes were made to the supplied configuration file to enlarge the search space: (a) The category 1 numMotifs parameter was increased from 1000 to 3000; (b) The NUM_RYWS parameter was increased from 2 to 3; and (c) The "max #regions" value was set to 20,000 (to handle large target sets). The output of YMF is a list of top-scoring IUPAC consensus strings. We created a PWM from each IUPAC string, by assigning an equal probability to each base in corresponding IUPAC code (e.g., if the first letter in the IUPAC string was "Y", then the first column in the PWM had the values 0.5 for "C" and "T", and 0 for "A" and "G"). The list of motifs reported by YMF is highly redundant, i.e., contains many similar motifs. Therefore, in order to obtain two non-redundant motifs, we selected the top-scoring motif and then traversed the output motifs in decreasing order of their scores, and chose the motif, whose divergence from the first motif was at least 0.24.

Trawler: The Trawler program was run via its web interface ( since the standalone version is not available yet. For the metazoan datasets, we uploaded a background sequence file that contained 2,000 promoter (or 3' UTR) sequences, chosen at random from the entire genome (we tried to use larger BG sets, but most executions did not complete; using Trawler's built-in BG files did not improve the results); for the yeast benchmark, we uploaded all ~6,300 yeast promoters.The minimum motif length parameter was set to the tested motif length (8 or 10); in yeast, this yielded poor results, so we used the default length of 6 (using this value for the metazoan datasets gave poorer results than lengths 8 and 10), and we also changed the minimum number of motif occurrences to 5 for target sets with less than 20 genes. The rest of the parameters were set to their default values (increasing the maximum number of mismatches from 2 to 3 did not improve the results).For each run, we selected the top-scoring PWM from each of the two highest-scoring motif families (in yeast, we took the top four motifs, as we executed Trawler with a single motif length).

Weeder: The "weederlauncher" program (v1.3) was executed with the "medium T100" parameters. For promoter analysis, we specified the "S" flag (to search on both strands), and used the BG model files supplied with the software. For 3' UTR analysis, we generated BG model files with the 6- and 8-mers counts, computed on the coding strand of all 3' UTR sequences.

Amadeus: Amadeus (v1.0) was run using the default parameters ("Normal" running mode). For the yeast and metazoan benchmarks we used the HG enrichment score. We also used the binned enrichment score on the metazoan benchmark. For the genome-wide analyses we used the localization and chromosomal preference scores, as stated in the manuscript.

All programs except Trawler were executed on an Intel server (dual-core Xeon 5150 CPU (2.67GHz), 4GB RAM) with GNU/Linux operating system. Trawler was executed via the web, so the system on which it was executed and its running times are unknown.

Utilizing comparative sequence analysis

Focusing on promoter regions that are conserved among orthologous genes in related species could reduce the false-positive rate of BS prediction. However, several recent studies suggest a limited cross-species conservation of functional BSs. For example, Odom et al. studied several evolutionarily-conserved TFs in human and mouse hepatocytes, and found that most of their BSs (41-89%) are species specific; moreover, whena TF binds the promoters of orthologous genes, the BSs reside in aligned regions only in a third of the cases (Odom et al. 2007).Lin et al. identified genomic sequences bound by ESR1 (Estrogen Receptor alpha) in breast cancer cells, and found that only 23% of them are conserved among vertebrates (Lin et al. 2007). In another work, BSs of two TFs in three yeast species were shown to have diverged considerably faster than ortholog content, perhaps constituting the major cause of phenotypic diversity (Borneman et al. 2007). In light of such evidence, we suggest to utilize comparative genomics by means of searching for motifs that are concurrently over-represented in target sets of related species, identified by species-specific experiments or using orthology.

Amadeus supports such joint analysis of multiple target sets from one or more species – the motif search is performed on all target sets in parallel, and the scores attained by each motif on all sets are combined into a single p-value using the Z-transform. For example, HSF is represented in our compendium by two target sets from two species – human and fly (see Fig. 3). In both sets, none of the tested tools successfully recovered the correct motif with the default PWM divergence cutoff of 0.18 (the only successful recovery was that of Trawler on the fly set with divergence 0.21). When running Amadeus on both target sets together, the HSF binding pattern is reported as the top-scoring motif, with high similarity (divergence 0.179) to the known HSF motifs in Transfac (Supplemental Fig. 7). Similarly, Amadeus discovers Mef2 as the top-ranked motif when analyzing the mouse and fly Mef2 target sets together; no tool recovered Mef2 in either of these target sets.

C. Amadeus – algorithms, scores and software

Amadeus motif search algorithm

As described in the main text, the work in Amadeus is divided into several refinement phases: The first phase evaluates all k-mers, where k is the length of the motif; subsequent phases expand the best candidates by introducing degenerate positions into the k-mers and recursively merging pairs of motifs; a greedy EM-like procedure is then applied in order to optimize the motifs; finally, a non-redundant list of top-scoring motifs is reported, together with additional statistics and information, such as similar motifs from Transfac/miRBase and a histogram of the BSs locations. The following sections provide supplementary details on the main data-structures, features and scores we implemented.

Data structures in Amadeus

Amadues is written in Java and its high efficiency is achieved by using sophisticated data structures that were tailored for the specific demands of the algorithm. In order to analyze genome-wide sequence data of several species (e.g., 28 Mb for all 1.2Kb long human promoters), special data structures were developed to minimize memory consumption. One of the main data structures we implemented, called HitList, maintains the set of occurrences (or hits) of a motif. The gene, strand and location of each hit are binary-coded and packed into a single machine word. The hits in a HitList are held in a sorted integer array, in order to allow efficient implementation of the main operations performed on HitLists (e.g., merging two HitLists takes linear time, with very small constants), and to minimize memory consumption (each hit takes a single word, plus a small overhead for memory management).

Graphical interface features

Amadeus has an intuitive, user-friendly GUI (Graphical User Interface). The main screen, shown in Figure 2, allows the user to control the input parameters (organism, target set, type of analysis, scores to use, etc.) and displays the output in both textual and graphical formats. For each discovered motif, Amadeus displays the motif logo and three tables: (1) All the scores computed for the motif (scores not selected by the user are reported but not used for computing the motif's p-value); (2) Statistics on the number of hits and target genes of the motif; (3) Similar known motifs from Transfac/miRBase. Additional information is presented in other panels and pop-up screens, including: results of motif-pair analysis, the list of k-mers that comprise the reported motif, hits localization graph, logos of similar known motifs, a list of genes whose cis-regulatory sequences contain the motif, and a TFBS viewer that displays the target-set sequences and all motif occurrences within them. Some of these screens are shown in Supplemental Figure 2. All output information (textual printings, putative targets of reported motifs, and graphical output) can be saved to files for further analysis.

Sequence redundancy

Many genomes contain families of paralogous genes with highly similar promoters or 3' UTRs, for example histones and zinc fingers in human, and Y'-element helicases in yeast. As noted by Harbison et al. (Harbison et al. 2004), such gene families skew the results of motif discovery, since all subsequences in their cis-regulatory regions are over-represented. Amadeus overcomes this bias by comparing each pair of input sequences; when it identifies a pair that shares a long identical subsequence, it removes one of the two genes from the target and BG sets.

Length and GC-content biases in target sets

As noted in the paper, there are many cases where the promoters or 3' UTRs of a set of co-regulated genes are considerably biased in terms of their length and/or C+G content. Searching for enriched motifs in such sets without accounting for these biases often results in failure to discover the correct motif. Below we provide two examples, in addition to the Mef2 example given in the main text.

One of the datasets in our metazoan compendium is a group of genes down-regulated in human cells that were transfected with miR-16 (Linsley et al. 2007). The average 3’ UTR length in this set is ~1700, while the average length in the background (the whole genome) is only ~960. Applying Amadeus with the HG score on this set results in the signature of miR-16 as the top-scoring motif. However, many other motifs receive statistically significant scores as well (e.g., 27 motifs have p-value below 10-12). Reassuringly, when we repeated the same analysis with the binned enrichment score, the correct motif was ranked first and no spurious motifs received a high score (specifically, only the miR-16 motif scored below 10-12). Similar results were obtained for other miRNA target sets.