In silico identification of novel cis-regulatory elements in Mesorhizobium loti

Feroz Khan (M.Tech. Biotech)1, Shipra Agrawal (Ph.D)2 and B. N. Mishra (Ph.D.)3§

1,3Department of Biotechnology, Institute of Engineering & Technology, Lucknow

2Jawahar Lal Nehru Center of Advanced Research, Bangalore, India

E-mail: , &

§Corresponding author

Dr. B.N. Mishra

Assistant Professor Head

Department of Biotechnology

Institute of Engineering & Technology,

Sitapur Road, Lucknow-226012 (U.P.), India

Phone: +91 522 2363220, 2733148 Ext. 206 (O), +91 522 2731636 (Fax)

Email:

ABSTRACT

A computational approach was designed to detect over-represented hexanucleotide(s) located within -400 bp upstream sequences of four data set of genes similar to cellular functional categories viz. Nitrogen fixation, Symbiosis, Nitrogen metabolism and Glutamate family in Mesorhizobium loti; a symbiont to model legume plant Lotus japonicus. The upstream sequences of these genes were analyzed for known transcription factor (TF) binding site(s) and then verified statistically along with experimental data comparisons using over-represented hexanucleotide frequencies. Finally eight families of known TF/binding sites were recognized in all sets as high affinity novel motif patterns. Genome wide occurrence of detected patterns was verified, which had several nif genes, nod genes, nitrogen metabolism related genes and amino acid biosynthetic genes. These findings in the genome of M. loti may lead to more intricate analysis of regulatory network involved in symbiotic interaction with the host plant L. japonicus.

Keywords: Regulatory binding motifs, TF binding sites in M. loti, Hexanucleotides motifs, Nitrogen fixing bacteria.

INTRODUCTION

The majority of computational analyses has been done in coding sequences or between proteins and less in non-coding sequences [van Helden, J, 2003]. Non-coding regions are of interest since they govern the regulation of gene expression. Regulatory profiles of known and unknown genes are already being determined experimentally at a genomic scale, by DNA microarray technology [Schena et al., 1995; Schena et al., 1996; Schena M, 1996; Goffeau et al., 1997; Lashkari et al., 1997; DeRisi et al., 1997]. Besides, several programs have been developed to isolate unknown patterns shared by sets of functionally related DNA sequences [Waterman et al., 1984; Galas et al., 1985; Mengeritsky G and Smith TF, 1987; Stormo GD & Hartzell GW, 1989; Hertz et al., 1990; Lawrence C & Reilly A, 1990; Cardon LR and Stormo GD 1992; Lawrence et al., 1993; Neuwald et al., 1995; Hertz G and Stormo G, 1995; Wolfertstetter et al., 1996; Bulyk et al., 2004]. These programs have been inspired by a particular type of signals and are generally highly efficient for the detection of elements.

In the present work, a simple analysis method was designed to detect over-represented novel hexanucleotides in the upstream sequences of four set of genes, belonging to co-related functional categories viz., Nitrogen fixation, symbiosis, Nitrogen metabolism and glutamate family in M. loti [Kaneko et al., 2000]. The upstream of these genes were analyzed for known transcription factor (TF) binding site(s) using prokaryotic transcription factor database (ooTFD) and identified motifs were further verified by statistical oligonucleotides analysis.

MATERIALS AND METHODS

Retrieval of genes

Retrieval of M. loti gene sets sequences [Kaneko et al., 2000], belonging to functional category viz., Symbiosis, Nitrogen metabolism, Glutamic acid family and Nitrogen fixation were taken from the Rhizobase server (http://www.kazusa.or.jp/rhizobase/). The purpose behind the selection of this gene was to relate it to different nitrogen metabolic pathways (Tables 1, 2, 3 and 4).

Retrieval of upstream sequences without overlapping ORF upstream

Due to the organization of genes into operons in bacteria, it is preferable to prevent overlap with upstream open reading frames (ORFs) and prevent including too many coding sequences. For example, in Escherichia coli, 25% of the genes have an upstream neighbour closer than 50 bp which may suggest that they belong to the same operons [van Helden J, 2003; Siegele et al., 1989; Bulyk et al., 2004].

Identification of known regulatory factor/ binding site

The upstream sequences of all four genes sets were analyzed for known TF-binding sites through ‘Tfsitescan’ search tool at object oriented transcription factors database (ooTFD) server [Ghosh D, 2000] (Tables 5, 6, 7 and 8).

Clustering of genes on the basis of identified TF- binding site(s) family: All studied sets of genes were categorized into regulatory families on the basis of identified known transcription factor/ binding site(s) along with corresponding known sites, their length and binding known transcription factor, if any (Table 9). The details of each known transcription factor/ binding site with their references are summarized in Table 10.

Hexanucleotide analysis

In M. loti, detection of potential upstream hexanucleotide patterns in each set were performed along with genome wide pattern (s) identification. Modules for oligo analysis can be accessed through Regulatory Sequence Analysis Tools (RSAT) server [van Helden et al., 2000; van Helden J, 2003] at web interface http://rsat.ulb.ac.be/rsat/. In the analysis, different statistical parameters were used for validation of true positive predictions. According to van Helden J. (1998) different statistics parameters can be defined as:

Expected occurrences (exp_occ): the number of occurrences expected for the considered oligonucleotide within the set of sequences. The calculation of this value depends on the probabilistic model.

Occurrence probability (occ_pro): the probability to have N or more occurrences, given the expected number of occurrences (where N is the observed number of occurrences).

Expected matching sequences (exp_ms): the expected number of sequences with at least one occurrence.

Matching sequence probability (ms_pro): the probability to have L or more sequences with at least one ocurrence of the oligonucleotide, given the probabilistic model (where L is the observed number of matching sequences).

Significance index (sig): this is a conversion of the occurrence probability, taking into account the number of possible oligonucleotides (which varies with oligo size) and doing a logarithmic transformation. The highest sig corespond to the most overrepresented oligonucleotide. Sig value higher than zero indicate overrepresentation.

PROBABILITIES

Various calibration models were used to estimate the probability of each oligonucleotide. From there, expected number of occurrences were calculated and compared to the observed number of occurrences. The significance of the observed number of occurrences is calculated with the binomial formulae (van Helden J., 1998):

Expected occurrences

Where, p = probability of the pattern

S = number of sequences in the sequence set.

Lj = length of the jth regulatory region

k = length of oligomer

T = the number of possible matching positions.

Probability of sequence matching

The probability to find at least one occurrence of the pattern within a single sequence is:

with the same abbreviations as above.

Expected number of matching sequences

In this counting mode, only the first occurrence of each sequence is taken into connsideration. We have thus to calculate a probability of first occurrence.

with the same abbreviations as above.

Correction for autocorrelation (from Mireille Regnier)

Where,

a = is the coefficient of autocorrelation

Probability of the observed number of occurrences (binomial)

The probability to observe exactly ‘obs’ occurrences in the whole family of sequences is calculated by the binomial as:

Where,

obs = is the observed number of occurrences,

p = is the expected frequency for the pattern,

T = is the number of possible matching positions,

The probability to observe ‘obs’ or more occurrences in the whole family of sequences is calculated by the sum of binomials:

E-value

The probability of occurrence by itself is not fully informative, because the threshold must be adapted depending on the number of patterns considered. The E-value represented the expected number of patterns which would be returned at random for a given P-value (probability).

Where,

NPO =is the number of possible oligomers of the chosen length.

Significance index or coefficient (“sig”)

The significance index is simply a negative logarithm conversion of the E-value (in base 10). The significance indexes are calculated as follows:

This index value is very convenient to interpret: highest values correspond to the most exceptional patterns.

RESULTS

Bacterial metabolism has been widely studied and provides many examples of known regulons [Fisher et al., 1988; Hovey AK & Frank DW, 1995; Householder et al., 1999; Bulyk et al., 2004]. On many cases the TF involved in the common response is known, as well as its binding site [Ow DW et al., 1983]. These families of coregulated genes provide ideal datasets to calibrate the analyses method which could be extended to families whose regulatory elements are unknown. In the studied work several transcription factor families were predicted on the basis of DNA motif conservation. On a co-related nitrogen element metabolic pathways criteria basis, genes were grouped together without a priori consideration on the content of their upstream regions and then analyzed for detection of potential known TFs and their binding sites (Table 9) and then further verified through comparison with detected hexanucleotides in each studied genes sets (Table 11, 12 and 13). Finally refinement of noise has been performed on the basis of common genes having both known TF binding sites and also hexanucleotide patterns (Table 14). In the present studied work results revealed that finally identified common genes with known and unknown novel binding sites or hexanucleotide patterns should have important role in the regulation of nitrogen metabolic pathways, thus identified genes were responsible for regulation of nitrogen fixing symbiosis process between nitrogen fixing bacteria and host leguminous plants. The descriptions of results are as follows:

Clusters of overlapping hexanucleotides reveal wider regulatory sequences

For each data set (Table 1,2,3,4) we extracted -400 bp upstream sequences relative to transcription start site and performed hexanucleotide analyses as described in methodology. To avoid false positive predictions all hexanucleotide patterns in all upstream sequences were retained by setting standard statistical parameters e.g. significance coefficient (briefly “sig”) > 0 and number of occurrences (briefly “occ”) ≥ 4 and with the chosen cut-off parameter very few genes upstream sequences showed hexanucleotide patterns e.g. 37 out of total 167 genes upstream in all four sets (Table 11 and 12). Different significant statistical values corresponding to each hexanucleotide patterns were summarized in Table 13. Finally genes upstream with both hexanucleotide patterns and known TF-binding sites were retained e.g. 20 out of total 37 genes which showed 8 known TF-binding sites (Table 14). In most pattern clusters families, the hexanucleotides with higher significance coefficient and occurrences were assumed to be the novel regulatory binding sites. Highly significant patterns were generally appeared in clusters with few additional overlapping hexanucleotide patterns that have a weaker significant coefficient e.g., hexanucleotide of MalT family e.g. GGCAGA (sig=0.32), which can be grouped with another strongly overlapping sequence: CCCCAC (sig=0.62). When combined the two patterns, most significant hexanucleotide pattern correspond to 7 bp conserved sequence e.g. CGGCAGA, which was highly matched with 6 bp known TF binding consensus sequence of Malt_Cs. Similarly in ExsA family, the most salient hexanucleotide was ATAAAA (sig=2.70) which can be grouped with other strongly overlapping sequence e.g. AAACGT (sig=1.12). When combined the two patterns, most significant hexanucleotide correspond to 9 bp conserved sequence e.g. ATAAAACGT, highly matched with 8 bp long known TF binding consensus sequence of ExsA_Cs transcription factor (TNAAAANA). In most families viz. MalT, PhoP, ExsA and MalT_malPp, the overlapping clusters reflect the fact that recognition domain of the transcription factor is wider than 6 nucleotides with conserved core region of hexanucleotides. The maximum significance coefficient value indicates the most conserved hexanucleotides core that usually corresponds to bases directly interacting with the transcription factor. The decrease of significance value for the lateral overlaps comes from the fact that these positions are less crucial for the TF binding.

Transcription Factor Families covering clusters of variable patterns

On the basis of clusters of pattern three TF families were categorized viz. MalT, PhoP and ExsA (Table 14). Description of each family and related genes are as follows:

1. MalT family

In the study 2 independent clusters of binding site (hexanucleotide pattern) for Malt_Cs transcription factor [Raibaud et al., 1985] were detected in which cluster-I showed higher affinity scoring pattern GGCAGA (sig=0.32 & occ=8) corresponding to known binding site (e.g. GGAKGA) for Malt_Cs TF and cluster-II with pattern ATAAAA (sig=2.70 & occ=6) defines low affinity consensus for the same TF.

2. PhoP family

For PhoP factor, it is reported that PhoP-PhoR two-component regulatory system controls the phosphate deficiency response in Bacillus subtilis. A number of pho regulon genes which require PhoP for activation or repression have been identified [Eder et al., 1999]. A similar situation was observed here in the PhoP family, where 2 independent clusters of binding site for PhoP TF were detected in which cluster-I showed higher affinity scoring pattern CGATCG (sig=1.39 & occ=6) corresponding to known binding consensus (e.g. TTHACA) for PhoP TF and cluster-II with pattern ATAAAA (sig=2.70 & occ=6) showed low affinity to the same TF.

3. ExsA family

It is reported that ExsA has been implicated as a central regulator of exoenzyme production by Pseudomonas aeruginosa [Hovey AK & Frank DW, 1995]. In the study we identified hexanucleotide pattern for the same TF. In ExsA family cluster-I showed high affinity pattern ATAAAA (sig=2.70 & occ=6) corresponding to known binding consensus (e.g. TNAAAANA) for the ExsA_Cs transcription factor, while cluster-II defines low affinity scoring consensus GGGATA (sig=0.42 & occ=4) for the same TF.

Putative unknown regulatory sites

Besides, known regulatory sites few unknown additional hexanucleotides were observed within families (Table 14). Based on the results for the known regulatory sites one can inferred that the ideal unknown site should appear as a cluster of overlapping hexanucleotides with higher significance coefficient. Fit with these criteria several unknown regulatory patterns were extracted from the hexanucleotide analysis and were considered as good candidates for new unknown regulatory sites on the basis of conservation. Putative predicted families of unknown hexanucleotides are as follows:

A. Families with cluster of unknown hexanucleotide patterns