Summer/Rotation/PhD Student Project: From a Physical Model of Nucleosome Organization toward Genome Annotation

This project is part of an effort to develop an accurate model for predicting nucleosome locations on an arbitrary strand of DNA.

Packaging DNA into nucleosomes affects its accessibility to transcription factors (TFs): while some TFs can bind nucleosome-packed DNA (e.g., GR, Sp1, USF, GAL4), the majority cannot. Knowledge of nucleosome positioning is therefore essential for understanding the mechanisms that regulate gene expression in eukaryotic cells; in addition, reliable prediction of nucleosome positioning may facilitate large-scale genome annotation.

Recently developed experimental approaches for high-throughput analysis of DNase I hypersensitive sites within nuclear chromatin, which correspond to nucleosome-free DNA regions, have allowed these sites to be mapped on a genome-wide scale. As genome-scale identification of nucleosome positioning in 45 kb of S. cerevisiae chromosome III showed (Ercan & Simpson, 2004, Mol Cell Biol, 24, 10026; Yuan et al., Science, 2005, 309, 626), most polymerase II-transcribed genes contain a nucleosome-free region ~200 base pairs upstream of the start codon, flanked on both sides by positioned nucleosomes. The presence of these regions did not correlate with the transcription rate from the corresponding promoters; moreover, such regions were not found in coding regions. In human chromatin, nucleosome-free regions are also enriched in gene regulatory regions, including promoters, CpG islands, and transcription termination sites, and are also found in introns (Crawford et al., Genome Res, 2006, 16:123). Analysis of an 80 kb genomic region (30 expressed genes) in Arabidopsis has likewise demonstrated the presence of nucleosome-free regions at the 5’ and/or 3’ ends of most genes, irrespective of their expression level (Kodama et al., Plant and Cell Physiology, 2007, 48(3):459).

Genomic DNA sequences vary widely in their binding affinity for the nucleosome core. Numerous attempts have been made to find sequence-dependent signals in DNA that determine the location and distribution of nucleosomes, and to build nucleosome positioning prediction models using various approaches. These include probabilistic models (Segal et al., Nature, 2006, 442, 772); a comparative genomics approach that aligns six Saccharomyces genomes and derives nucleosome positioning sequence patterns from the frequency of AA and TT dinucleotides (Ioshikhes et al., Nat Genet, 2006, 38, 1210); and an SVM classifier trained on 1000 nucleosome-forming and 1000 nucleosome-inhibiting DNA fragments from Yuan et al., Science, 2005, 309, 626 (Peckham et al., Genome Res, 2007, 17:1170). The results obtained with these and several other models (see references in Peckham et al., Genome Res, 2007, 17:1170) demonstrated that general properties of the DNA sequence can determine the preferred positions of individual nucleosomes. These properties include an enrichment of AT/TA dinucleotides in linker DNA and of GC/CG dinucleotides in nucleosome core DNA, a periodic repeat of the dinucleotides AA and TT every 10.4 bases in nucleosome-forming sequences, and a 10-bp periodicity of AA/TT/TA dinucleotides that oscillate in phase with each other and out of phase with 10-bp periodic GC dinucleotides.
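The sequence properties listed above (dinucleotide enrichment and the roughly 10-bp spacing of AA/TT) can be checked on any set of sequences with a few lines of code. The following is a minimal sketch in Python, not the procedure used in any of the cited papers; the function names and the 40-base distance cutoff are illustrative assumptions.

```python
from collections import Counter


def dinucleotide_positions(seq, dinucs=("AA", "TT")):
    """Return sorted 0-based start positions of the given dinucleotides in seq."""
    seq = seq.upper()
    targets = set(dinucs)
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] in targets]


def dinucleotide_frequency(seq, dinucs):
    """Fraction of all dinucleotide steps in seq that belong to dinucs."""
    total = len(seq) - 1
    if total <= 0:
        return 0.0
    return len(dinucleotide_positions(seq, dinucs)) / total


def distance_histogram(positions, max_dist=40):
    """Count pairs of occurrences separated by d bases, for d = 1..max_dist.

    hist[d - 1] holds the count for distance d. A peak near d = 10 (and its
    multiples) indicates the periodic AA/TT signal described in the text.
    max_dist = 40 is an arbitrary illustrative cutoff.
    """
    counts = Counter()
    for idx, p in enumerate(positions):
        for q in positions[idx + 1:]:
            d = q - p
            if d > max_dist:
                break  # positions are sorted, so later ones are farther away
            counts[d] += 1
    return [counts.get(d, 0) for d in range(1, max_dist + 1)]


if __name__ == "__main__":
    # Toy sequence (an assumption for illustration) with AA placed every 10 bases.
    toy = "AACGCGCGCG" * 5
    hist = distance_histogram(dinucleotide_positions(toy))
    print("pairs at distance 10:", hist[9])
    print("AT/TA frequency:", dinucleotide_frequency(toy, ("AT", "TA")))
    print("GC/CG frequency:", dinucleotide_frequency(toy, ("GC", "CG")))
```

On real data the histogram would be computed over many nucleosome-forming sequences and compared with the same histogram for linker or shuffled sequences; the non-integer 10.4-base period only becomes visible when many sequences are pooled.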

Applying their probabilistic model to the yeast genome, Segal et al. (Nature, 2006, 442, 772) showed that nucleosome density was low near the transcription start site, thus confirming the experimental evidence for this phenomenon (Sekinger et al., Mol Cell, 2005, 18, 735), and that the position of the lowest nucleosome occupancy was conserved in 16 fungal species. The conclusion drawn from the work of Segal et al. and Ioshikhes et al. is that genomes “encode” the positioning and stability of nucleosomes in regions that are critical for gene regulation; the conservation of this “code” across related species suggests an evolutionary selection that makes regulatory and coding regions distinct with respect to nucleosome occupancy. The next logical question is to what extent the structure and dynamics of chromatin have been imprinted in DNA sequence during evolution; that is, whether similar nucleosome positioning signals can be found within the genomes of distinct organisms.

The model of Segal et al. (Nature, 2006, 442, 772) can explain only ~50% of the in vivo nucleosome organization. Therefore, a more accurate nucleosome-DNA interaction model is still needed. Models built on physical principles are of particular interest because, in contrast to models that take into account the short-range (~160 bp) nucleosomal ordering in DNA sequence, they are able to explain the influence of experimentally known long-range correlations in genomic sequences on nucleosome organization. One such model has recently been suggested (Vaillant C, Audit B, and Arneodo A, Phys Rev Lett, 2007, 99, 218103); it reproduced the genome-wide experimental data (Yuan et al., Science, 2005, 309, 626) remarkably well.
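To make the idea of a physical occupancy model concrete, the sketch below shows a generic one-dimensional calculation in Python: nucleosomes are treated as non-overlapping rods placed on a given energy landscape, and the equilibrium occupancy is obtained from forward and backward partition functions. This is an illustration under stated assumptions, not the implementation of Vaillant et al.; the energy landscape is taken as an input (how it is derived from sequence is not specified here), and the function name, the default footprint of 147 bp, and the chemical potential parameter `mu` are assumptions.

```python
import math


def nucleosome_occupancy(energies, width=147, mu=0.0):
    """Equilibrium occupancy of non-overlapping rods on a 1D energy landscape.

    energies[i] is the energy (in units of kT) of a nucleosome whose first
    base is i; the sequence length is len(energies) + width - 1.
    width (assumed 147 bp) is the nucleosome footprint and mu (assumed) is
    the chemical potential in kT. Returns (start_prob, occupancy).
    Note: plain floats can overflow on very long sequences; a genome-scale
    version would need log-space arithmetic.
    """
    n_starts = len(energies)
    length = n_starts + width - 1
    weights = [math.exp(mu - e) for e in energies]

    # forward[j]: partition function of all configurations on bases 0..j-1
    forward = [1.0] * (length + 1)
    for j in range(1, length + 1):
        forward[j] = forward[j - 1]  # base j-1 left free
        if j >= width:
            # a nucleosome ends exactly at base j-1
            forward[j] += weights[j - width] * forward[j - width]

    # backward[j]: partition function of all configurations on bases j..length-1
    backward = [1.0] * (length + 1)
    for j in range(length - 1, -1, -1):
        backward[j] = backward[j + 1]  # base j left free
        if j + width <= length:
            # a nucleosome starts exactly at base j
            backward[j] += weights[j] * backward[j + width]

    z = forward[length]
    start_prob = [forward[i] * weights[i] * backward[i + width] / z
                  for i in range(n_starts)]

    # Base x is covered by any nucleosome starting in (x - width, x].
    occupancy = []
    running = 0.0
    for x in range(length):
        if x < n_starts:
            running += start_prob[x]
        if x - width >= 0:
            running -= start_prob[x - width]
        occupancy.append(running)
    return start_prob, occupancy


if __name__ == "__main__":
    # Toy landscape (an assumption): one strongly favourable start site.
    toy_energies = [0.0] * 300
    toy_energies[100] = -5.0
    _, occ = nucleosome_occupancy(toy_energies, width=147, mu=-2.0)
    print("occupancy at base 150:", round(occ[150], 3))
```

Because the rods exclude each other, a single strongly positioned nucleosome also constrains where its neighbours can sit, which is how an energy landscape can produce ordered arrays without an explicit positioning signal at every site.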

The goal of the proposed project is to test the hypothesis that similar nucleosome positioning signals can be found within the promoter regions of the genomes of distinct organisms. To attain this goal, the following aims are proposed: (1) implement the physical model of nucleosome occupancy of Vaillant et al. as a computer program; (2) test the program on available experimental data on nucleosome positioning, including genome-wide nucleosome maps; and (3) apply the model to the analysis of promoter regions in eukaryotic DNA sequences from different species.
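For aim (2), one simple way to compare predictions with an experimental map is to ask what fraction of mapped nucleosome centers have a predicted center nearby. The Python sketch below illustrates this; the function name and the 35 bp tolerance are assumptions chosen for illustration, and other measures (for example, correlation between predicted and measured occupancy profiles) may be equally appropriate.

```python
from bisect import bisect_left


def fraction_recovered(observed_centers, predicted_centers, tolerance=35):
    """Fraction of observed nucleosome centers matched by a prediction.

    An observed center counts as recovered if some predicted center lies
    within `tolerance` bp of it. tolerance = 35 is an assumed default.
    Coordinates are base positions on the same sequence.
    """
    if not observed_centers:
        return 0.0
    predicted = sorted(predicted_centers)
    hits = 0
    for center in observed_centers:
        k = bisect_left(predicted, center)
        # The nearest prediction is either just before or at/after `center`.
        neighbours = predicted[max(0, k - 1):k + 1]
        if any(abs(center - p) <= tolerance for p in neighbours):
            hits += 1
    return hits / len(observed_centers)


if __name__ == "__main__":
    # Toy coordinates (assumptions for illustration only).
    observed = [100, 300, 500, 700]
    predicted = [110, 350, 495]
    print(fraction_recovered(observed, predicted))
```

A meaningful evaluation would also report the same fraction for randomly placed predictions of equal density, since closely spaced nucleosomes make chance matches likely.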

Available experimental data:

1)  Nucleosome Positioning Region Database (NPRD): http://srs6.bionet.nsc.ru/srs6/.

2)  199 stably wrapped, center-aligned mononucleosome DNA sequences used to construct a probabilistic model of the DNA sequence preferences of nucleosomes in S. cerevisiae (Segal et al., Nature, 2006, 442, 772). Note that the 199 sequences are not provided in the paper or its supplement.

3)  Genome-wide nucleosome maps. Yeast: Yuan et al., Science, 2005, 309, 626; Bernstein et al., Genome Biol, 2004, 5, R62; Lee et al., Nat Genet, 2004, 36, 900. Arabidopsis: Kodama et al., Plant and Cell Physiology, 2007, 48(3):459. Human: Crawford et al., Genome Res, 2006, 16:123.

4)  Open reading frames and corresponding promoter regions in S. cerevisiae in the Saccharomyces Genome Database (http://www.yeastgenome.org).

5)  Eukaryotic promoter databases, including EPD, PLACE (plant promoters), DBTSS, AGRIS (Arabidopsis promoters), ABS (orthologous promoters), MPromDB (mammalian promoters), TiProD (human promoters), PlantPromoterDB.