Vipin Narang1 Wing-Kin Sung1, Ankush Mittal2

Probabilistic Graphical Modeling of Cis-Regulatory Codes Modules Governing in Drosophila Development

Vipin Narang1 Wing-Kin Sung1, Ankush Mittal2

1Department of Computer Science, National University of Singapore, Singapore – 117543.

2Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee-247667, Uttaranchal, India.

Classification:

Physical Sciences – Statistics,

Biological Sciences – Genetics

Corresponding Author:

Vipin Narang,

c/o Graduate Office,

Department of Computer Science,

3 Science Drive 2,

National University of Singapore,

Singapore – 117543.

Email:

Manuscript information:

Number of text pages (including references and figure legends): 32

Number of figures: 7

Number of tables: 1

Number of words in abstract (max 250): 162

Number of characters in the paper (including spaces, max 47000): 45,310

Data note: Supplementary text included containing 10 supplementary figures

ABSTRACT

Cis-regulatory modules (CRMs) are enhancers or repressors that control the expression of genes in a particular tissue at a particular development stage. It is hypothesized that CRMs with similar motif module will control the same gene expression pattern. Such motif module is called the regulatory code. So far few regulatory codes are known which have been determined based on wet lab experiments. We present the first computational approach to learn regulatory codes de-novo from a repository of CRMs. We use a probabilistic graphical model to derive the regulatory codes and to predict novel CRMs. Using a training data of 356 non-redundant CRMs, 810 novel CRMs have been recovered from the Drosophila melanogaster genome regulating gene expression in different tissues at various stages of development. We also derived Specific specific regulatory codes are derived conferring for CRMs controlling gene expression in the drosophila embryonic mesoderm, the ventral nerve cord, the eye-antennal disc and the larval wing imaginal disc. Furthermore, the taget genes of CRMs with those regulatory codes are validated to express in the corresponding tissue at the corresponding development stage. Furthermore, 31 novel genes are implicated in the development of these tissues.

INDEX TERMS

Gene regulation, Cis-regulatory modules, Transcription factor binding motifs, Drosophila melanogaster, Bayesian networks.

INTRODUCTION

The recent sequencing and annotation of the genomes of several multicellular species has revealed that there is extensive similarity in their protein coding sequences. Thus, the complexity of higher species is attributed to a more elaborate system of gene regulation [1]. Transcriptional control appears to be the primary mechanism of gene regulation, where the essential feature is binding of trans-acting proteins called transcription factors (TF) to cis-regulatory DNA flanking the coding sequences. Cis-regulatory modules (CRMs) such as enhancers or repressors are a special class of cis-regulatory sequences that control the spatio-temporal specific expression of genes. They are short (0.5-1 kb) sequences with a high density of transcription factor binding sites (TFBS) and can control the expression of gene expressions located from several kilobases upstream or downstream of the gene or from within the introns. Recent studies have demonstrated their widespread role in regulating networks of gene expression, especially during early development [2, 3].

CRMs have been most widely studied in Drosophila melanogaster (fruit fly), which is a model organism for the study of development. The number of CRMs in the Drosophila genome is estimated to be 3-4 times greater than the number of genes themselves [1]. Several genes expressed during Drosophila development are known to be regulated by CRMs. CRMs regulate gene expression in almost all phases of cell differentiation in Drosophila from pluripotency to tissue formation. In the Drosophila embryo, the transcription factors are non-uniformly distributed in different locations regions of the embryo. At any given location, various TFs are present in different concentrations. Depending on the TF concentrations at a given region, specific CRMs are activated to express or repress their target genes. This results in differential expression of the zygotic genes in different locationsregions, ultimately leading to tissue differentiation. Thus CRMs are important components of developmental gene networks.

CRMs are difficult to characterize experimentally due to their short length and variable location in the genome up to several kilobases upstream or downstream of the target gene. In-vivo experimental validation of CRM activity using reporter gene assays is time and resource consuming. So far less than 700 CRMs have been experimentally characterized in approximately 200 Drosophila genes [4]. Therefore in-silico discovery of CRMs is an attractive option. Several recent studies have attempted computational discovery of novel CRMs [5-10]. The computational approach in essence characterizes the CRMs as short (~1 kb) genomic segments containing high density of binding sites for a set of co-acting TFs [5-15]. The binding sites are recognized with the help of positional weight matrices (PWMs) for the TFs [16]. More sophisticated approaches consider the contribution of both strong and weak binding sites within a CRM [7, 8]. Probability models such as hidden Markov models (HMM) have also been employed to recognize significant clusters of TFBSs within genomic sequences, sometimes additionally incorporating order and distance constraints among the TFBSs [13-15]. Genome comparison with related Drosophila species has also been used to aid CRM discovery, though with caution due to low conservation of the CRM sequences in general [9].

The main challenge for the computational approach is that the TFBS compositions of different CRMs regulating gene expression in different developmental stages and tissues are exceedingly varied. Each specific expression profile is governed by the interaction of a specific set of TFs. The TF set forms a regulatory code governing the expression pattern. For example the TFs bcd, cad, hb, kni, Kr, gt, dl, tll, etc are known to regulate gene expression in the blastoderm embryo [6, 8]. Whereas the set of TFs dl, twi, su(H), etc governs gene expression in the embrynoic neuroectoderm [10]. Presently the regulatory codes are extracted based on tedious wet-lab experiments and biological knowledge. This has limited their applicability to few specific CRM types. Another limitation is that good quality PWMs, which are required to predict the TFBS clusters, are available only for a few TFs. Currently the PWMs have been computed from a small number of experimental TFBS sequences determined by DNAse footprinting. Most of these PWMs lack sufficient sensitivity and specificity [17].

An alternative computational approach is sequence based modeling of CRMs such as using oligonucleotide frequencies [18]. This approach has shown some degree of success in modeling blastoderm CRMs. However, it is not applicable to modeling other types of CRMs [19]. Firstly, we do not have enough typed CRMs to train a sequence based model. Secondly, a model based on oligonucleotide motifs gives large number of false positives (as shown in this study).

We introduce a computational CRM modeling and prediction approach called Modulexplorer to perform de-novo learning of regulatory codes for Drosophila CRMs from CRMs of unknown types. The Modulexplorer pipeline is shown in Figure 1. The input to Modulexplorer is a database of known CRMs and a set of non-CRM background sequences. Modulexplorer first characterizes the TFBSs within the CRMs de-novo [5, 20-22]. It represents the TFBSs with dyad motifs in degenerate IUPAC alphabet to achieve high specificity [23-26]. Then using a probabilistic Bayesian network model, it learns the TFBS interactions which are over-represented in CRMs while under-represented in non-CRMs. These interactions describe the regulatory codes. The trained model is then used to discover novel CRMs. This paper reports the novel regulatory codes and CRMs discovered.

RESULTS

Training of Modulexplorer CRM Model

The Modulexplorer Bayesian network model for a CRM is shown in Figure 2. In the Bayesian network model, the TFBSs are the causal elements or parent nodes while the CRM is their common effect or child node. The basic idea is to consider a CRM as a cluster of TFBSs for cooperating TFs with certain distance and order constraints. The TFBS interactions are encoded in the probabilities at the edges of the Bayesian network. The interaction probabilities are learnt de-novo by the Bayesian network from training CRM and non-CRM sequences with unsupervised learning. Distance and order constraints are considered between pairs of closely interacting motifs. After training, the Bayesian network model functions as a classifier to discriminate between CRM and non-CRM sequences. During inference, the Bayesian network assigns high probability of CRM to a sequence which contains a combination of closely interacting TFBSs.

To construct the Modulexplorer model, we first developed a method to robustly identify TFBSs within CRMs. Drosophila CRMs show homotypic clustering of TFBSs, i.e. they contain multiple binding sites for the same transcription factor [5, 20-22]. Therefore TFBSs often appear as redundant subsequences within the CRM. We studied the correlation between redundant sites and TFBSs in 155 Drosophila CRMs for which TFBSs have been previously characterized by DNAse footprinting experiments [4, 17, 27]. These CRMs contained 3-6 binding sites per TF in general (Supplementary Figure 1(a)). The fluffy tail test statistic [28] of the CRMs also indicated a high level of sequence redundancy in the CRMs (Supplementary Figure 1(b)). We developed a novel algorithm to recover the redundant sites in a CRM with high accuracy (see Supplementary Figure 2). The predicted redundant sites overlapped the experimentally annotated TFBSs in the CRMs with an average sensitivity of 81% and false positive rate of 22% (Supplementary Figure 1(c)). The p-value for this correlation is 9×10-32 compared with random sites of the same length. Thus the method robustly identified the TFBSs in the CRMs de novo.

To learn the Modulexplorer Bayesian network model, we obtained experimentally validated CRMs from the REDfly database [4] (version 1.0, February 2006). A non-redundant and non-overlapping set of 356 training and 58 test CRMs was derived from the database as described in the Data section. The CRMs were of several different types as described in Supplementary Figure 3. In addition, we prepared three different background sets from exon, intron and intergenic sequences respectively. The background sequences were randomly selected from the Drosophila melanogaster genome. The number and lengths of the sequences in each background set was kept the same as the training CRMs.

Ten-fold cross-validation training of the Bayesian network was performed with the above training data. The discrimination achieved between CRM and background sequences in cross-validation training is shown in Figure 4. The result is compared with the current best performing algorithm HexDiff [18] and with a Markov model. Other CRM prediction algorithms (such as [7, 13, 29]) could not be included for comparison since they require prior biological knowledge of the CRM model (such as the PWMs of the TFs) and are specific to a subset of CRMs of the same type. As shown in Figure 4(a), all three models showed high discrimination between CRM and coding (exon) sequences. However for non-coding (intron and intergenic) sequences (Figure 4(b)), the Markov model showed no discrimination (area under ROC=0.37), HexDiff showed marginal discrimination (area under ROC=0.58), while Modulexplorer had the highest discrimination (area under ROC=0.75).

Modulexplorer’s prediction performance was then evaluated on a test dataset of 58 CRMs and 1000 random background sequences different from the training set. The test set is unbiased as it contains CRMs expressed in a variety of tissues and stages and is distinct from the training sequences (Figure 4(c)). The ROC, shown in Figure 4(d), resembles the performance on the training set (area under ROC=0.72).

Pairwise TF-TF Interactions Learnt De-novo by the Modulexplorer

We investigated the conditional probabilities associated with the edges of the Modulexplorer Bayesian network model to obtain insight into the pairwise TFBS interactions represented in the model. We took from the model the representative motifs for 61 known TFs which best matched their known binding sites (listed in Supplementary Figure 8). The strength of interaction between any two motifs M1, M2 was measured as the ratio of the marginal probabilities . The pairwise interaction matrix is shown in Figure 5. Based on the interaction matrix, the TFs were hierarchically clustered using UPGMA algorithm [30].

According to their known biological functions, the 61 TFs may be grouped into five broad categories. The TF Twist (twi) and its cofactors dl, sna, byn, slbo, prd, bin, su(H), su(Hw) in mesoderm and nervous system development [10, 31-33] are placed in the first group. The TFs sd, pan, nub, ap, grh and ara which are involved in the development of imaginal discs such as the wing disc were placed in the second group. The third and fourth groups were formed by the TFs of the antennapedia complex (Antp, abd-A, abd-B, ubx, dfd) and the blastoderm (bcd, hb, cad, kni, Kr, tll, gt) respectively. The fifth group consists of the TFs ey and toy involved in eye development.

In the TF-TF interaction matrix, the first four TF groups showed high mutual interaction values in general. The overlap is expected as these TFs are known to function cooperatively [34, 35]. However, closer analysis of the interaction matrix by hierarchical clustering indicated that the TFs of these four groups form three distinct clusters in the TF-TF interaction matrix. The first cluster included all TFs from the first group and the TFs sd, nub, grh and pan from the second group. The second cluster contained the remaining TFs from the second group and all TFs of the antennapedia complex. In addition, the TFs ems, vvl, en, exd, cf2-II, gl, Dref, zen and eve also came together with this cluster according to the hierarchical clustering. There are some supports that these TFs may be related to the known factors in the second cluster. Exd, en, ems, zen and eve are known regulators in the development of appendages including the legs and wings [34]. Vvl also has known function in wing development [36], while little information is available on the TFs gl and cf2-II. The primary function of The the TF Dref , is DNA replication. Though there is no support of its role with antennapedia complex, appeared clustered with the antennapedia complex. While its primary function is DNA replication, recent studies have shown its role isn many diversifye functions [37] and hence it may be related to the antennapeida complex.

The third cluster contains all the blastoderm TFs. Moreover the antennapedia pair-rule TF ftz was also found in this cluster. This association is not surprising as ftz cooperates with blastoderm TFs in known CRMs [38].

The TFs ey and toy of the fifth group appeared as a distinct cluster by the UPGMA clustering. The TF deaf1 was also found in this cluster. Little is known in the existing literature about deaf1. Surprisingly we found from a survey of recent literature that deaf1 seems to have a role in eye development [39].