Supplementary Methods
Validation of DNase-chip peaks by real-time PCR
Real-time PCR was used to validate DNase HS sites as previously described1,2. Briefly, primer sets designed to flank putative DNase HS sites were used to amplify genomic DNA from CD4 or GM06990 cells that had been either treated with DNase or left untreated. Primer sets that surround a valid DNase HS site display a shift in the number of cycles required for the DNase-treated sample to reach the same threshold level of product as the untreated sample. This shift is referred to here as the ∆Ct value. Primers were designed (n = 384 sets for CD4 and n = 288 sets for GM06990) to flank seven mutually exclusive DNase-chip peak categories: peaks present at all three DNase concentrations (ABC), at two of the three DNase concentrations (AB, BC, and AC), or at only a single DNase concentration (A only, B only, and C only). A set of 96 primer sets flanking non-repetitive random coordinates from the genome was used to identify background levels of DNase sensitivity across the genome. The stringent threshold for hypersensitivity was set at a level above the values of ~95% of the random primer sets. The threshold was not set higher because a small number of these random sites may represent true DNase HS sites. Confirmatory PCR reactions (n = 672) corresponding to DNase-chip peaks representing different DNase concentrations were performed a single time. PCR primers were also designed (n = 96 for CD4 and n = 96 for GM06990) to flank DNase-chip peaks that were present in one cell type but not the other (average of all 9 hybridizations for each cell type). Real-time PCR reactions for cell-type-specific DNase-chip peaks were performed in duplicate. To determine specificity, primer sets (n = 182) were designed to flank randomly selected unique coordinates in the ENCODE regions. Any primer set that generated a real-time PCR ∆Ct value below the stringent cutoff of 1.5 was scored as a true negative.
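The ∆Ct calculation and the background-derived threshold described above can be illustrated with a minimal sketch. It assumes that ∆Ct is taken as Ct(treated) minus Ct(untreated) and that the threshold is the nearest-rank 95th percentile of the random primer sets; the function names and all numerical values are hypothetical and are not taken from the study.

```python
def delta_ct(ct_treated, ct_untreated):
    """Shift in threshold cycle between DNase-treated and untreated DNA.

    Sign convention (an assumption): positive when the treated sample
    needs more cycles to reach the threshold level of product.
    """
    return ct_treated - ct_untreated


def background_threshold(random_dct, percent=95):
    """Nearest-rank percentile of the delta-Ct values from random primer sets."""
    ordered = sorted(random_dct)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling, 1-based rank
    return ordered[rank - 1]


def is_hypersensitive(dct, threshold):
    """A primer set is called hypersensitive if its delta-Ct exceeds the threshold."""
    return dct > threshold


# Hypothetical delta-Ct values for 20 random (background) primer sets.
random_dct = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.2, 0.3, 0.7, 0.4,
              0.5, 0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.9, 1.2, 2.6]
threshold = background_threshold(random_dct)

# Hypothetical Ct values for one primer set flanking a putative HS site.
candidate = delta_ct(ct_treated=27.5, ct_untreated=24.0)
print(threshold, candidate, is_hypersensitive(candidate, threshold))
```

In this toy example one of the 20 random primer sets lies above the threshold, which mirrors the reasoning that a few random sites may be true DNase HS sites.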
To determine sensitivity, primer sets (n = 180) were designed to flank MPSS clusters in ENCODE regions identified in a previous study3. These primer sets were used to confirm, by real-time PCR, which MPSS clusters represented valid DNase HS sites in either CD4+ T cells or the GM06990 lymphoblastoid cell line. A small subset of PCR reactions was performed in triplicate to show that this method is highly reproducible (standard deviation of replicate ∆Ct values < 0.2).
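One way to turn these validation sets into sensitivity and specificity estimates is sketched below. The definitions are assumptions made for illustration: sensitivity is taken as the fraction of PCR-validated HS sites that also overlap a DNase-chip peak, and specificity as the fraction of PCR true negatives (∆Ct < 1.5) that overlap no DNase-chip peak. All identifiers and values are hypothetical.

```python
from statistics import stdev

TRUE_NEGATIVE_CUTOFF = 1.5  # delta-Ct cutoff stated in the text


def sensitivity(validated_ids, called_ids):
    """Fraction of PCR-validated HS sites that were also called on the array."""
    validated = set(validated_ids)
    return len(validated & set(called_ids)) / len(validated)


def specificity(true_negative_ids, called_ids):
    """Fraction of PCR true negatives that were not called on the array."""
    negatives = set(true_negative_ids)
    return len(negatives - set(called_ids)) / len(negatives)


# Hypothetical delta-Ct values for primer sets at random coordinates.
random_dct = {"r1": 0.4, "r2": 1.1, "r3": 2.0, "r4": 0.9}
true_negatives = [k for k, v in random_dct.items() if v < TRUE_NEGATIVE_CUTOFF]

# Hypothetical array calls: which primer sets overlap a DNase-chip peak.
called_random = {"r2"}
called_mpss = {"m1", "m2", "m3"}
validated_mpss = {"m1", "m2", "m3", "m4"}

spec = specificity(true_negatives, called_random)
sens = sensitivity(validated_mpss, called_mpss)

# Reproducibility check on one hypothetical triplicate.
triplicate_sd = stdev([2.1, 2.2, 2.3])
print(sens, spec, triplicate_sd)
```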
Computational analyses
To determine whether DNase-chip peaks were significantly overrepresented within different annotated regions of the genome, we performed Monte Carlo simulations using 1,000 mock datasets. Each mock dataset contained a list of 887 randomly selected regions from the NimbleGen ENCODE tiled array, one region for each experimentally determined DNase-chip peak. Each region in a dataset was the same length as a DNase-chip peak and, like the peaks, did not begin or end in a gap on the array (repetitive region) or contain a gap longer than 179 base pairs.
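The sampling constraints can be sketched as rejection sampling. This is an illustration under stated assumptions, not the Perl scripts used in the study: the array is modelled as a sorted list of half-open tiled intervals in a single coordinate system, mock regions are allowed to overlap one another, and the tile coordinates and peak lengths are hypothetical.

```python
import random

MAX_GAP = 179   # longest gap (bp) a region may contain, from the text
N_MOCK = 1000   # number of mock datasets, from the text


def is_valid(start, end, tiles, max_gap=MAX_GAP):
    """Check a half-open region [start, end) against the tiled intervals.

    The region must begin and end on tiled sequence and must not span
    any gap between consecutive tiles that is longer than max_gap.
    """
    def tiled(pos):
        return any(s <= pos < e for s, e in tiles)

    if not (tiled(start) and tiled(end - 1)):
        return False
    for (_, gap_start), (gap_end, _) in zip(tiles, tiles[1:]):
        spans_gap = start < gap_end and gap_start < end
        if spans_gap and gap_end - gap_start > max_gap:
            return False
    return True


def mock_dataset(peak_lengths, tiles, rng, max_tries=10000):
    """Draw one random valid region per peak, matching each peak's length."""
    lo, hi = tiles[0][0], tiles[-1][1]
    regions = []
    for length in peak_lengths:
        for _ in range(max_tries):
            start = rng.randrange(lo, hi - length + 1)
            if is_valid(start, start + length, tiles):
                regions.append((start, start + length))
                break
        else:
            raise RuntimeError("no valid placement found for length %d" % length)
    return regions


# Hypothetical tiling: a 100-bp gap (allowed) and a 500-bp gap (not allowed).
tiles = [(0, 1000), (1100, 2000), (2500, 4000)]
peak_lengths = [200, 350, 500]  # hypothetical DNase-chip peak lengths

rng = random.Random(0)
mocks = [mock_dataset(peak_lengths, tiles, rng) for _ in range(N_MOCK)]
```

Rejection sampling keeps the length distribution of the mock regions identical to that of the observed peaks, which is the property the comparison relies on.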
We then compared the positions of the regions in the mock datasets and the positions of the DNase-chip peaks to the coordinates of features annotated on the genomic ENCODE regions. The coordinates of CpG islands and Gencode genes were downloaded from the UCSC Genome Browser ( in December 2005. Only the reference Gencode set (known, putative, and novel genes experimentally verified) that contained an open reading frame was used in this analysis. Transcriptional start and end sites were identified for every Gencode transcript, including all alternatively spliced isoforms. The coordinates of the first introns and first exons were extracted from the longest Gencode transcript for each gene. The sequence conservation track was derived from a combination of the TBA4, MLAGAN5, and MAVID6 multi-species alignment programs and a combination of the binCons7, phastCons8, and GERP9 sequence conservation programs. The multi-species conserved sequences used in this study (MCS_moderate) identify regions that are shared between two out of three alignment programs and between two out of three sequence conservation programs ( The Perl scripts used to generate the mock datasets and to compare the coordinates of the DNase-chip peaks and random regions to the annotated genomic features are available upon request.
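The coordinate comparison can be illustrated with a short sketch that counts regions overlapping an annotated feature and derives an empirical one-sided p-value from the mock datasets. The overlap rule (any shared base), the add-one correction in the p-value, and all coordinates and counts are assumptions made for illustration.

```python
def overlaps(region, feature):
    """True if two half-open intervals (start, end) share at least one base."""
    return region[0] < feature[1] and feature[0] < region[1]


def count_overlapping(regions, features):
    """Number of regions that overlap at least one annotated feature."""
    return sum(any(overlaps(r, f) for f in features) for r in regions)


def empirical_p(observed, mock_counts):
    """Fraction of mock datasets with a count at least as large as observed.

    The add-one correction keeps the estimate above zero when no mock
    dataset reaches the observed count.
    """
    hits = sum(c >= observed for c in mock_counts)
    return (hits + 1) / (len(mock_counts) + 1)


# Hypothetical annotated features (e.g. CpG islands) and DNase-chip peaks.
features = [(100, 200), (500, 600)]
peaks = [(150, 250), (550, 650), (800, 900)]
observed = count_overlapping(peaks, features)

# Hypothetical overlap counts from nine mock datasets.
mock_counts = [0, 1, 0, 2, 1, 0, 0, 1, 0]
p_value = empirical_p(observed, mock_counts)
print(observed, p_value)
```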
DNase HS sites previously identified using massively parallel signature sequencing (MPSS)3 are available at The numbers of DNase HS sites that mapped to ENCODE regions stratified by gene density and sequence conservation were normalized to the number of kilobases of sequence contained in each category (
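The per-kilobase normalization amounts to dividing each category's site count by its sequence length in kilobases. A minimal sketch follows; the category names, site counts, and sequence lengths are hypothetical.

```python
def sites_per_kb(site_count, category_bp):
    """DNase HS sites per kilobase of sequence in a category."""
    return site_count / (category_bp / 1000)


# Hypothetical categories: (number of HS sites, base pairs of sequence).
categories = {
    "high gene density": (120, 6_000_000),
    "low gene density": (30, 5_000_000),
}
density = {name: sites_per_kb(n, bp) for name, (n, bp) in categories.items()}
print(density)
```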
Expression analysis
Total RNA was purified with Trizol from three biological replicates of CD4+ T cells and the GM06990 line. RNA was labeled and hybridized to Affymetrix U133 Plus 2.0 arrays. Two technical replicates were performed on one biological replicate from each cell type. Arrays were normalized together using robust multi-array analysis10 (RMA) through the Bioconductor project Affymetrix package ( For expression analysis, gene expression datasets (encompassing the full range of potential expression values) from each cell type were averaged across biological and technical replicates. Transcription start sites for each gene transcript were mapped to the nearest DNase HS site detected in each cell type. A one-tailed, two-sample, unequal-variance t-test was performed to determine whether average gene expression for genes with a DNase HS site within 1 kb of the transcription start site was statistically different from average gene expression for genes without a HS site within 1 kb of the transcription start site. All expression data have been deposited on GEO ( accession # GSE4406.
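The mapping of transcription start sites to the nearest DNase HS site and the unequal-variance comparison can be sketched as follows. The sketch assumes each HS site is represented by a single coordinate on the same chromosome as the gene and that "within 1 kb" means a distance of at most 1,000 bp; gene names, coordinates, and expression values are hypothetical. It computes only the Welch t statistic and its Welch-Satterthwaite degrees of freedom; the one-tailed p-value would then be read from a t distribution, for example with SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")`.

```python
import bisect
from math import sqrt
from statistics import mean, variance

WINDOW = 1000  # 1 kb window around the transcription start site


def nearest_distance(tss, sorted_sites):
    """Distance from a TSS to the closest HS site in a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_sites, tss)
    candidates = sorted_sites[max(i - 1, 0): i + 1]
    return min(abs(tss - s) for s in candidates)


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical HS site coordinates and genes: name -> (TSS, mean expression).
hs_sites = sorted([1000, 5000, 20000])
genes = {
    "g1": (1200, 9.1), "g2": (5600, 8.4), "g3": (12000, 4.2),
    "g4": (30000, 5.0), "g5": (19500, 8.8), "g6": (40000, 3.9),
}

with_hs = [expr for tss, expr in genes.values()
           if nearest_distance(tss, hs_sites) <= WINDOW]
without_hs = [expr for tss, expr in genes.values()
              if nearest_distance(tss, hs_sites) > WINDOW]

t_stat, df = welch_t(with_hs, without_hs)
print(t_stat, df)
```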
1. Crawford, G. E. et al. Identifying gene regulatory elements by genome-wide recovery of DNase hypersensitive sites. Proc Natl Acad Sci U S A 101, 992-7 (2004).
2. McArthur, M., Gerum, S. & Stamatoyannopoulos, G. Quantification of DNaseI-sensitivity by real-time PCR: quantitative analysis of DNaseI-hypersensitivity of the mouse beta-globin LCR. J Mol Biol 313, 27-34 (2001).
3. Crawford, G. E. et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res 16, 123-31 (2006).
4. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14, 708-15 (2004).
5. Brudno, M. et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13, 721-31 (2003).
6. Bray, N. & Pachter, L. MAVID: constrained ancestral alignment of multiple sequences. Genome Res 14, 693-9 (2004).
7. Margulies, E. H., Blanchette, M., Haussler, D. & Green, E. D. Identification and characterization of multi-species conserved sequences. Genome Res 13, 2507-18 (2003).
8. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034-50 (2005).
9. Cooper, G. M. et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 15, 901-13 (2005).
10. Irizarry, R. A. et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31, e15 (2003).