Availability: the Source Code Implemented by Yiyu Zheng Is Freely Available At

Feature pattern discovery to predict nucleosome occupancy in yeast and human

Category
Featurepattern discovery to predict nucleosome occupancy in yeast and human
Yiyu Zheng and Haiyan Hu*
Department of Electrical Engineering and Computer Science, University Of Central Florida, Orlando, FL 32826, USA.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX

Feature pattern discovery to predict nucleosome occupancy in yeast and human

[*]abstract

Motivation:Understanding the deterministic factors for nucleosome-forming is important to gene regulation and cellular functions. Recent development of next-generation sequencing (NGS) techniques have made genome-wide nucleosome occupancydata available. Based on these data, various features including a number of DNA sequence features and structural features have been reported predictive for nucleosome forming. However, thedifferent contributions of differentDNA features to nucleosome-forming remain unclear in different species (human and yeast).

Results:We developed a pattern discovery and scoring method FFN to identify features and their combinations that influence nucleosome-forming and inhibition. Applying the FFN to gene promoter regions in yeast and human and grouping genes based on their transcriptional levels, we found different feature combinations that are predictive for different nucleosome-forming and nucleosome-depletion sequences in genes with different transcriptional activities. We also found that different feature values in the same feature combinations can have different predictive power for nucleosome-forming. We compared feature patterns identified in yeast and human and discovered a lot of them are conserved. Also, comparing with genes with lowhighly-transcriptional activities, highly low expressed genes contain more nucleosome-forming sequences that exhibit the identified feature patterns.

Availability: The source code implemented by Yiyu Zheng is freely available at

Contact:

1.introduction

Eukaryotic genomic DNA is packaged with histone and other proteins into a chromatin complex. The basic unit of DNA packaging is called nucleosome comprising of a histone octamer with ~147 bp DNA segments wrapping around it(Kornberg and Lorch, 1999; Luger, et al., 1997). The DNA segments around nucleosomes are often called nucleosomal DNA and those between two adjacent nucleosomes are often called linker DNA. The linker DNA can have variable lengths often ranging from 20 to 80 bp. The amount of histones occupying a unit-length of DNA segment in a population of cells is called the nucleosome occupancy level corresponding to the given DNA segment. It has been well known that nucleosome occupancy is not uniformly distributed across the genome and different nucleosome occupancies in different DNA regions can have important functional meanings(Daenen, et al., 2008; Tillo, et al., 2010). For example, highly occupied DNA regions are often difficult to be accessed by regulatory proteins such as transcription factors while nucleosome-depletion regions are more accessible to DNA binding proteins functioning in gene regulation(Sekinger, et al., 2005; Zhu and Thiele, 1996). Nucleosomes have also been reported important roles in epigenetic gene regulations (Jiang and Pugh, 2009; Li, et al., 2007). Therefore, understanding the cause of nucleosome forming on different genomic regions is important to understand gene regulation and uncover cellular mechanisms.

Experimental methods to measure nucleosome occupancy have been advanced by the recent next-generation sequencing technology(Shendure and Ji, 2008), especially the parallel sequencing of nucleosomal DNA (Albert, et al., 2007; Barski, et al., 2007; Field, et al., 2008; Kaplan, et al., 2009; Valouev, et al., 2008; Zhang, et al., 2009). A number of datasets on genome-scale measurement of nucleosome occupancy have become publicly available in multiple species including human and yeast (Lee, et al., 2007; Schones, et al., 2008; Wang, et al., 2008; Yuan, et al., 2005). Based on these datasets, dozens of literature studies have reported particular types of DNA sequences tend to have higher nucleosome occupancy levels in vivo and there potentially is a genomic code for nucleosome-forming sequences (Segal, et al., 2006). Accordingly, a number of computational methods have recently been developed to model and predict nucleosome occupancy based on the underlying features of the primary DNA sequences(Field, et al., 2008; Gupta, et al., 2008; Lee, et al., 2007; Peckham, et al., 2007; Reynolds, et al., 2010; Segal, et al., 2006). For example, Peckham et alhas incorporated variable-length k-mers into a SVM model to predict nucleosomal DNA in yeast. Lee et alhas applied lasso regression model to a set of DNA sequence and structural features for nucleosomal DNA prediction. Reynolds et al combines the nucleosomal DNA and the adjacent linker DNA sequences to characterize the nucleosome. Although the prediction accuracy is in general not sufficiently high leading to the common understanding that factors other than primary sequence may also influence nucleosome forming, these methods are able to identify a number of sequence features that in fact contribute to nucleosome occupancy prediction. The selected features however are not always consistent with each other. For example, several studies have confirmed~10 bp periodicity of certain dinucleotidesas predictive of nucleosome forming (Field, et al., 2008; Ioshikhes, et al., 2006; Kaplan, et al., 2009; Segal, et al., 2006). Lee et alclaims structural features such as tilt and propeller twist are the most effective features, and identified “tip, tilt, and propeller twist” as important structural features for nucleosome occupancy prediction in yeast.The diversity of effective features identified in nucleosome-prediction studies leads to our hypothesis that multipleDNA featurescan simultaneously or combinatorially influence nucleosome-occupancy.

In this paper, we developed a computational method FFN (Finding Featuresfor Nucleosomes) to identify features and feature combinations that are important for nucleosome occupancy prediction. By applying FFN to genome-wide nucleosome occupancy measurement data in yeast(Lee, et al., 2007) and human (Schones, et al., 2008), we found that a number of different features when combined have high predictive power for nucleosome-forming or nucleosome-depletion sequences. The prediction power of combined features can be affected by factors beyond static DNA sequences such as gene transcriptional activities. We also showed that certain structural features frequently appear in feature patterns in nucleosomeforming sequences.

2.methods

2.1Data source

Nucleosome positioning data in yeast is obtained from Lee et al, and the data in human is from Schones et al for both resting and activated human Tcells(Lee, et al., 2007; Schones, et al., 2008). The Laplacian of Gaussian (LOG) method (Ozsolak, et al., 2007; Zhang, et al., 2008)is then applied to the raw datato identify enriched regions. The parameters are chosen for best consistency between LOG results and HMM results(Lee, et al., 2007). These enriched regions are then defined as nucleosome-containing sequences (NCS), and the sequences between two NCSs are defined as linker-containing sequences (LCS).

For a given gene and its promoter region (1,000 bp upstream of the Transcription Start Site (TSS)), we definedanucleosomeas 0th nucleosome if it overlaps with the TSS, the nucleosomes in the upstream region of the TSS as -1stnucleosome, -2ndnucleosome and so on, and the nucleosomes in the downstream region of the TSS as 1st, 2nd nucleosomes and so on.

Thegene expression data in yeast is obtained from David et al. (David, et al., 2006).5736 annotated genes with gene expression measurements are kept for further identification of high-confidence transcripts. The yeast gene annotation is based on the annotation resource at UCSC. High-confidence transcripts are definedas these transcript segments that overlap greater than 50% with a non-dubious annotated coding region in the 5’ end(Lee, et al., 2007). In total, 5300 out of the 5736 genes are defined as high-confidence transcripts.

The gene expression data in human is obtained from GEO database (GSE10437), which measures whole genome gene expression under tworesting T cell conditions and two activated conditionsrespectively (Schones, et al., 2008). 19049 genes with gene expression measurements are obtained using gene annotation resources at UCSC (human hg18 build). Genes that are absent under bothof the resting conditions and present under both of theactivated conditionsare defined as induced genes. Similarly, genes that are present under both of the resting conditions and absent under both of theactivated conditionsare defined as repressed genes(Schones, et al., 2008). In total, 299 genes are defined as induced genes and 393 genes as repressed genes.We defined these induced genes and repressed genes together as perturbed genes.

2.2Feature compilation and sequence representation

Features that are relevant to nucleosome occupancy are compiled from literature. These compiled features can be categorized into twoclasses: (1) sequence features such as DNA k-mer frequency(Peckham, et al., 2007), poly(A) tracts, transcription factor binding sites and sequence repeat(Lee, et al., 2007); (2) DNA structural features from Lee et al and Abeel et al (Abeel, et al., 2008; Lee, et al., 2007). The structural feature valuesare computed based on the conversion tables (Supplementary Table1).The structural features that have large correlation with other structural features (the absolute value for Pearson correlation calculation is greater than 0.9) are removed resulting in 23 structural features (Supplementary Table 1). Finally, 766 features including 694 k-mer frequency features, 4 Poly tracts, 40 poly(dA/dT) tracts, 2 sequence repeat features, 23 structural features and 3 motifs (yeast only) are kept for further analysis (Supplementary Table 2).

We then performed LogitBoost(Friedman, et al., 2000) to further select the most relevant features. LogitBoost is a boosting algorithm for classification using logistic regression as cost function, and can assign weight to the features selected in the model so as to estimate the impact of the features on the model(Friedman, 2001).We used the implementation of LogitBoost in software “WEKA”(Hall et al., 2009). We chose top 1000 nucleosomes with highest profile score and top1000 linker sequences with lowest scores as training sets in yeast, resting and activated human Tcellsrespectively. We thenperformed 100 round iterations using all the features in LogitBoost. We kept the top 10 features in yeast, resting and activated human Tcells dataset as they generally take 60% weight out of all the features selected (Supplementary Table3). Also we includedthe top 10 features from Peckham et al(Peckham, et al., 2007)and three structural features from Lee et al (Lee, et al., 2007). Finally, after removing some duplicated features, we included 30 features in our consideration(Table 1).

We then used these selected features to represent the NCSs and LCSs as follows. We computed all the feature values for every 147bp-long subsequence in all the NCSs/LCSs. As the feature value generally conforms to normal distribution, we discretized each feature into mlevels (e.g. m=4) using cutoffs as μ-(m/2-1)*σ, μ-(m/2+1)*σ, …,μ, …, μ+(m/2-1)*σ. With the discretized feature values, every 147 bp subsequence in anNCS/LCSwas replaced bythe combination of itsdiscretized feature values, called feature profiles.

Table 1.Top 30 features

1 / Tip / 11 / AAT/ATT / 21 / GAC/GTC
2 / Minor groove mobility / 12 / ATTA/TAAT / 22 / CCCC/GGGG
3 / Tilt / 13 / TAA/TTA / 23 / ACAC/GTGT
4 / Z-DNA free energy / 14 / TAATA/TATTA / 24 / AATTA/TAATT
5 / Persistence length 1 / 15 / AAAA/TTTT / 25 / ATAT
6 / Slide / 16 / CGCC/GGCG / 26 / A/T
7 / Major groove mobility / 17 / AAG/CTT / 27 / TA
8 / Propeller twist / 18 / ACA/TGT / 28 / AAA/TTT
9 / ATA/TAT / 19 / AAATA/TATTT / 29 / AT
10 / ATAA/TTAT / 20 / CCGCC/GGCGG / 30 / AATA/TATT

30 features used in pattern discovery

2.3Discriminative score of a potential feature pattern

To determine the discriminative power of a potential feature pattern, say , where the first subscript represents the index of a feature and the second represents a specific level of the corresponding feature, we defined its z-score in NCSs and LCSs as Zn and Z1, respectively. The discriminative score of this feature pattern is then defined as a D-score = Zn- Zl, where Zn and Zl are calculated by the following formula (1[YZ1]):

(1)

To calculate the above and for a random sequence, we assume for a given feature, its level in a sequence of windows can be modeled by a first order Markov chain. This is because two neighboring windows overlap except at the last position, and consequently the level of a feature only depends on these overlapped positions and the last position, which is consistent with the property of the first order Markov chain. Based on this assumption, we calculate the two expectations for a random sequence of length hi in the following way:

(2)

(3)

Here and are the transition matrix and its power r matrix of the Markov chain modeling the feature respectively, and is the stationary probability of, which is calculated as following:

(4)

where is a column vector with all entries between [0, 1], and meets the requirement that:

(5)

In the end we keep the patterns with D-score above as discriminative patterns. As the ranges of D-score in different species are different, in each situation, we normalize the D-score of each pattern by dividing the original D-score by the highest D-score of all patterns to make the score in range [0, 1].

3.EXPERIMENTS AND results

To identify potential features that combinatorially characterize nucleosome/linker-containing sequences, we developed FFN algorithm (Algorithm 1.FFN algorithm). We then applied the FFN to discover feature and feature combinations in yeast and human Tcell data.

3.1The FFN algorithm

The FFN algorithm aims to discover combinations of features that are relevant to NCS/LCS-forming and can be used to distinguish NCSs from LCSs.Given all of the 147 bp-long subsequences in the obtained NCSs and LCSs, the FFN starts from enumerating all the possible two-feature combinations, called di-feature patterns. Only those di-feature patterns whose occurrence is larger than α percent of the totalnumberof NCS subsequences (e.g. α=20) are kept.The algorithm then searches for frequently co-occurred di-feature patterns using frequent pattern mining techniques. Next,FFNperforms an extension step, in which FFN investigates whether some of these di-feature patterns in the same cluster can be further extended into tri-feature or even longer patterns. This extension step is implemented by a pattern merging procedure. For example, given four di-feature patterns a1b1, a1c3, b1c3 and c3d4 in one cluster, meaning they are frequently co-occurring in the input sequences, we observe that they can in fact be merged into three tri-feature patterns a1b1c3, b1c3d4, a1c3d4 and one tetra-feature pattern a1b1c3d4. Note that the occurrence frequencies of these merged patterns may vary, but should be all greater than α% of the total NCS subsequences. After the pattern extension step, FFNevaluates the statistical significance of the patterns’ enrichment in the NCS and LCS subsequences using z-score(see Methods section). The D-score for a given pattern is then computed to determine whether it has sufficient discriminative power to distinguish NCSs from LCSs. In this way, we can identify NCS-specific feature patterns that are frequently found in NCSs but less frequently found in LCSs. Similarly, we can start FFN from LCS data to identify LCS-specific feature patterns.

Algorithm 1.FFNalgorithm

Input: NCS profile, LCS profile
Output: a set of discriminative patterns
Ck: candidate pattern set of length k
Lk: frequent pattern set of length k
Initial discovered pattern set: R is Ø
Start from length 1 pattern set L1: {all features}
While Lk is not null
Determine candidate pattern set Ck+1by merging patterns in Lk
FOR each profile item p in Input
FOR each candidate pattern c in Ck+1
IF p contains pattern c
Increment support(c)
Generate Lk+1 with all candidate patterns in Ck+1 withsupport > alpha
For each candidate patterns P in Lk
Calculate Zn:zScore(p) based on formula (1-5);
Calculate Zl: zScore(p) in LCS profiles
D-score = Zn-Zl
If (D-score > cutoff ())
put the pattern in the result set R
Return discovered pattern set R.

3.2Feature patterns identified in yeast data

Applying the FFNalgorithm to the yeast nucleosome data, we identified 88NCS-specificpatternswith D-score larger than 3√2 (Table 2, Supplementary Table 4). For example, “Minor grove mobility”at level 1, “Z-DNA free energy” at level 2, “Persistence length 1” at level 1 and “A/T” at level 2 form the pattern with highest D-score 80.56. This pattern occurs 228845times in NCSs with a z-score 526.30 and 162726times inLCSs with a z-score 445.74. A length-3 subpattern of this pattern containing “Z-DNA freeenergy” at level 2, “Persistencelength1” at level 1 and ”A/T” at level 2 is also identified as a discriminativepattern with D-score 57.04.We also identified several patterns with same feature combination but different feature levels, for example, pattern “Tip” at level 2, “CCGCC/GGCGG”at level 0, “TA”at level 2 (D-score=35.09) and “Tip” at level 1, “CCGCC/GGCGG” at level 0, “TA” at level 1 (D-score=13.23).These identified patterns suggest that various combinations of features at various levels can be considered for NCS prediction.

Table 2.Top 10Yeast Patterns Identified

Rank / Pattern
1 / Minor groove mobility at level 1, "Z-DNA free energy" at level 2, "Persistence length 1" at level 1, "A/T" at level 2
2 / Z-DNA free energy at level 2, "Persistence length 1" at level 1, "CCCC/GGGG" at level 0, "A/T" at level 2
3 / Z-DNA free energy at level 2, "Persistence length 1" at level 1, "CCGCC/GGCGG" at level 0, "A/T" at level 2
4 / Z-DNA free energyat level 2, "Persistence length 1" at level 1, "A/T" at level 2
5 / Minor groove mobility at level 1, "Z-DNA free energy" at level 2, "CCCC/GGGG" at level 0, "A/T" at level 2
6 / Minor groove mobility at level 1, "Persistence length 1" at level 1, "CCGCC/GGCGG" at level 0, "A/T" at level 2
7 / Minor groove mobility at level 1, "Z-DNA free energy" at level 2, "Persistence length 1" at level 1, "CCGCC/GGCGG" at level 0
8 / Minor groove mobility at level 1, "Z-DNA free energy" at level 2, "CCGCC/GGCGG" at level 0, "A/T" at level 2
9 / Z-DNA free energyat level 2, "CGCC/GGCG" at level 0, "CCGCC/GGCGG" at level 0, "A/T" at level 2
10 / Minor groove mobility at level 1, "Persistence length 1" at level 1, "A/T" at level 2

By applying FFN algorithm we identified 88 NCS-specific patterns. Three structural features frequently occur in the top 10 patterns.

We also compared the occurrences of features in NCS-specific patterns and LCS-specific patterns and found different feature preferences. Of all the 30 features, 17 features existed in NCS-specific patterns, and 11 existed in the LCS-specific patterns. There are 13 features that exclusively appear in NCS-specific patterns and 7 features only in LCS-specific patterns. We found that among the 13 NCS exclusive features, the feature of “A/T” is identified 40 out of the total 88 NCP-specific patterns but never in LCP-specific patterns. Especially, the feature “A/T” at level 2 frequently appears in the top scored feature patterns (9 out of top 10 patterns contains this feature, Table 2). The “A/T” feature has been identified as the most useful feature to distinguish NCSs from LCSs in Peckham et al. It frequently forms patterns with structural features such as “Minor groove mobility”, “Z-DNA free energy” and “Persistence length 1”. All top 20 patterns contain at least one of these four features. This observation is consistent with the discovery that structural features will help nucleosome occupancy prediction in yeast (Lee, et al., 2007). “Z-DNA free energy”, which is related to the free energy required for transition from B-DNA to Z-DNA transition (Ho, et al., 1990), has been identified 39 times (10 times in level 1 and 29 times in level 2). This feature is one of the structural features that are mostly negatively correlated with nucleosome occupancy(Gan, et al.)[YZ2]