Supplementary Information

Generation of the chromosome sequences

Both chromosomes were sequenced using a clone-by-clone shotgun sequencing strategy1 supported by the BAC-based whole genome physical map2. The quality of the chromosome 2 and 4 sequences was determined to exceed the 99.99% accuracy standard established by the International Human Genome Sequencing Consortium (IHGSC) for sequencing the human genome3. In addition to the centromeric gaps, there are 17 gaps in the chromosome 2 sequence and 12 gaps in the chromosome 4 sequence. Sequences extend into the centromere on both chromosomes and reach the p-arm telomeric sequence on both chromosomes. On 2q, the sequence reaches within 200kb of the telomere and within 10kb of the telomere on 4q4. Based on the size estimates of remaining gaps, the available sequence represents greater than 99.6% of the total euchromatic sequence (Supplementary Table 1).

Attempts were made to close all remaining gaps, first using probes derived from BAC-end sequences from clones flanking each gap and from available chimpanzee sequence (PanGSC, in preparation) against all available libraries (100-fold clone coverage of the genome, Supplementary Methods below) and second using sequence placement of fosmid paired end reads (Consortium, 2004). From the fosmid end placements, 76 clones were selected which added just over 500kb of sequence (now included in build35/hg17). Remaining gaps were sized using FISH and using comparative placement to mouse, rat and chimp. In only one case was additional chimpanzee sequence detectable that fell in the human gap region. Some of the remaining gaps are associated with segmental duplications (60%), local complex repetitive structures (20%), local increases in G+C content of 2 to 10% (50%), and locally high regions of G+C content (35%, with 10% >=55% G+C). The rapid local increases in GC content flanking gaps are consistent with the notion that GC-rich regions are difficult to clone5.

The integrity of the sequence and its assembly was tested using a variety of methods. First, as each clone was finished, an in silico digest of the sequence was compared for verification purposes to restriction digests of the clone DNA. Second, we checked the fully assembled sequence by performing in silico digests of clone-sized fragments across the chromosome against the underlying fingerprint data used to construct the physical map. In this way, we directly confirmed greater than 99.99% of the testable bands. Third, while not providing conclusive evidence of a “problem” and many times only identifying polymorphism, we examined BACs, fosmid and plasmid6 paired end sequences to assess consistency. We searched for two or more ends clustered and missing, or spanning inappropriate distances. Based on BAC placements, eight suspicious areas were identified, but fosmid and plasmid data through those regions were supportive of the genome sequence. For the 13 areas with inconsistent fosmid pairing data, we sequenced each of them and only two suggested modification of the genome sequence. Five were conclusively determined to be polymorphic (Supplementary Table 9). Finally, we compared the order of the placement of BAC ends against the chromosome 2 and 4 sequences to the order of the BACs within the fingerprint map2. No inconsistencies were found.

Orthology to mouse

The relationship between the human chromosome 2 and 4 sequences and the mouse genome could be readily defined for ~94% and ~97% of the sequenced chromosomes, respectively. The largest defined segment (44.1 Mb) on chromosome 2 includes two distal-less homeobox containing genes (DLX1 and DLX2) and the HOXD gene cluster. The largest defined segment (21.7 Mb) on chromosome 4 includes a replication-independent member of the histone 2A family, H2AFZ. In mice, this particular histone functions in embryonic development and lack of H2AFZ leads to embryonic lethality.

Supplementary mRNA information

As described in the text, we examined discrepancies between our genomic sequence and known mRNAs that created frameshifts and/or truncations in the genomic sequence. In cases where there was no support for the mRNA sequence variant, we attempted to construct a gene that avoided the sequence discrepancy. In cases where the site was determined to be an error in our BAC sequence (updates to our BAC sequence have also been submitted) or polymorphic, and without the base change we could not create a protein, the amino acids for the following GenBank7 gi identifiers were added to the chromosome gene index, even though the BAC-based sequence did not support those translations. The following mRNAs were not included in the gene index for the reasons indicated: 19923278 (missing nucleotide causes frameshift, confirmed BAC error), 21314676 (missing nucleotide causes frameshift, confirmed BAC error), 23510369 (extra nucleotide causes frameshift, BAC sequence supported by PCR), 24850108 (two missing nucleotides cause frameshift, confirmed BAC error), 27735096 (extra nucleotide causes frameshift, repetitive region could not be sampled by PCR for verification), 4557854 (several missing nucleotides cause frameshift, BAC sequence supported by PCR screening), 6006032 (several missing nucleotides cause frameshift, BAC sequence supported by PCR screening), 8394043 (several missing nucleotides cause frameshift, BAC sequence supported by PCR screening), 11496888 (frameshift causes internal stop, ESTs suggest it is a genomic sequencing error), 23468353 (frameshift, confirmed BAC error), and 16876981 (frameshift, confirmed BAC error).

One mRNA, gi17572816, spans from chr4 to chr4_random. In the hg17/build35/May, 2004 release of the chromosomal sequence, this problem has been fixed. One mRNA, gi21314625, was found to be incomplete (80 bases missing in the middle of the gene, a confirmed deletion in the BAC sequence which is found in an overlapping BAC). This has been corrected in the May, 2004, release. Two additional mRNAs (gi9507164 and gi5031798), as well as the TWIST2 gene (gi|17981707|ref|NM_057179.1|, gi|21620050|gb|BC033168.1|, gi|17389790|gb|BC017907.1|), which lies in the gap of NT_005120 and NT_022173, spanned a sequence gap with some exons falling in the gap and were not included in the gene set. For gi9507164, although the gap still exists in the May, 2004, release, we have extended sequence into that gap and can now account for all exonic sequence.

Based on placements of mRNAs against chromosomes 2 and 4, only one possible deletion was detected, and that one was on chromosome 4. The mRNA, BC029568, appears to indicate a deletion in AC018687. It was found completely in AC018692, a finished clone that was not used in build34. Basepairs 1-176 of that mRNA are not in AC018687. The data in AC018692 are currently represented (in build35) in chr4_random. In the next release of the human genome, we will update the main chromosome 4 sequence to include the data from AC018692.

Additionally, alignment with the mRNA NM_005277 indicates a 156 and a 236 basepair deletion in the clone AC093819.3. These were verified by alignment with an overlapping BAC clone and have been correctly updated in GenBank.

As described in the main text, but here provided in greater detail, we also investigated 161 mRNA regions where the matched genomic sequence contained substitution or insertion/deletion differences from the mRNA, 83 of which caused a frameshift and/or truncation of the protein product. To determine the origin of the difference, we re-sequenced the region of interest in a panel of 24 diverse individuals8, in the starting BAC and in some cases in overlapping BACs. A total of 78 substitution/triplet insertion discrepancies were examined (only 2 of the 78 were triplet insertion; one was polymorphic and in the other case all individuals agreed with the BAC). In eight cases, primers could not be chosen because the sequence was too repetitive. In eight cases, all genomic samples agreed with the BAC suggesting an error in the mRNA or a highly rare polymorphism. In two cases, all individuals matched the mRNA, but the sequence of the control DNA matched the BAC and other underlying BACs could not be sampled. Only one was determined to be an error in the BAC. A total of 44 were found to be polymorphic in the population of 24 individuals. An additional 15 were found to not be polymorphic within the panel of 24 individuals but of those 11 were polymorphic within the RP11 library and in four cases all overlapping RP11 clones sampled agreed with the BAC sequence.

Of the 83 possible frameshifts, two were too repetitive to obtain data. Eight were found to be errors in the genomic sequence. In 69 of the 83 frameshift cases, the sequence in the 24 genomic samples agreed with the sequence of the BAC, again suggesting errors in the underlying mRNA. Of those, 54 were simple single base insertion or deletion errors in the mRNA that shifted or truncated the reading frame in the cDNA relative to the genomic sequence. The other 15 had multiple base insertion/deletion differences such that the frame was eventually restored. In such cases, comparison with the related mouse sequence confirmed the genomic translation, with a more conserved match between mouse and the genomic translation than between mouse and the original mRNA translation. Most interestingly, four were determined to be polymorphic in the population (Table 1; Supplementary Table 5). The first polymorphic example is found on chromosome 2 in the protein PMS1 (postmeiotic segregation increase 1), identified by its homology to a yeast protein involved in DNA mismatch repair. Some cases of hereditary nonpolyposis colorectal cancer are associated with mutations in this gene. This particular single base deletion event in the BAC with respect to the mRNA causes a change of frame (4 amino acids before the stop codon) and extension of the open reading frame by an additional 15 amino acids. Another PMS1 mRNA (BC036376) avoids the genomic region of this frameshift with an alternative splicing event 71 basepairs upstream. Similarities to predicted mouse proteins end at amino acid 149, and this frameshift discrepancy is at amino acid 163.

The polymorphic frameshift in the final exon of the PAS domain containing serine/threonine kinase (PASK/ NM_015148) leads to a longer open reading frame. The amino acids extending past the stop indicated by NM_015148, however, do not match the orthologous predicted gene in the mouse genome.

The third polymorphism was identified in ancient ubiquitous protein 1 (AUP1/ NM_012103), one of three isoforms referred to as the longest isoform at 476 amino acids. The frameshift modifies the protein beginning at position 147 and then truncates the protein with a stop codon at amino acid 149. The similarity to mouse corresponds with an alternate isoform and does not extend through this region (ends at aa 113).

A fourth polymorphism, an insertion of four basepairs in the genomic sequence relative to the mRNA, was identified on chromosome 4 in TSARG2, testis and spermatogenesis cell related protein 2 and causes truncation of the protein at amino acid 281 out of 305 (frameshift occurs at amino acid position 279). Two underlying mRNAs agree with the genomic data and read through the region of this 4 basepair deletion without a frameshift, confirming the polymorphism. The similarity to the related mouse protein (ENSMUST00000033917) extends through the region of the frameshift (to amino acid 287) indicated by the mRNA sequence.

In addition to the frameshifts detected by comparisons of the mRNA with the genomic DNA, we also detected possible frameshifts by comparing all genomic coding regions to the random genomic shotgun data available through the HAPMAP project6. We detected 28 additional potential high quality frameshifts in coding regions of REFSEQ9 or MGC10 mRNAs on chromosome 2, only 6 of which were annotated in dbSNP11. On chromosome 4, we detected 15 possible frameshifts in coding regions of REFSEQ mRNAs, only one of which was annotated in dbSNP. In all cases but one, the frameshift led to truncation of the protein. This suggests an additional four possible polymorphic frameshifts or BAC errors where the chimp sequence agreed with the mRNA, each of which will require further validation.

Supplemental non-coding RNA information

The lower than expected density of ncRNA genes on these two chromosomes may reflect the fact that many ncRNA genes are strongly clustered, such as the large array of 18 cysteine tRNAs at 7q36 (Hillier et al., 2003), the tandem array of 5-25 U2 snRNAs at 17q21 (Pavelitz et al., 1999), or the complex array of ~30 U1 snRNA genes and tRNAs at 1p36 (Lindgren et al. 1985). Chromosomes 2 and 4 appear to lack any large clusters of this sort. There are seven small clusters of 2-3 genes each within <100 kb of each other: two tRNA pairs, one pair of miRNAs, and four sets of snoRNAs in the introns of coding genes (2 in EF-1-beta, 3 in nucleolin, 2 in APG16L, and 2 in ribosomal protein S3a).

Supplemental Segmental Duplication Analysis

Segmental duplications have long been noted for their potential role in the evolution of new genes12. To examine the transcriptional and coding potential of duplicated regions, we analyzed a non-overlapping set of known genes with or without additional spliced ESTs. For each group, we categorized every exon as unique or duplicated on the basis of its overlap with duplicated sequence (Supplementary Table 10). About 5.49% (1162 out of 21164) of exons of chromosome 2 are duplicated, while only 2.47% (277 out of 11213) of exons of chromosome 4 are duplicated. Our analysis shows that the relative number of transcribed exons is slightly greater for duplicated DNA when compared with non-duplicated DNA on chromosomes 2 and 4. These results support a previous observation13 that recently duplicated regions are rich in genes/transcripts.

Supplementary Methods

Physical Mapping

The human libraries listed here were probed for filling gaps in the chromosome 2 and 4 maps. In addition, we have probed against filters from the RP43 chimpanzee library.

Library Estimated Coverage

RPCI-4 RPCI Male PAC 4x

RPCI-5 RPCI Male PAC 6x

RPCI-11 Segments 1-5 Male BAC 32.2x (refs 14, 15)

RPCI-13 Segments 1-4 Female BAC 21.8x

Genome Systems BAC library 110x

Cal Tech Segments A, B 13x (ref. 16)

Cal Tech D 17x (ref. 16)

Broad/MIT fosmid library (H_AA) 15x

MRC Geneservice single chromosome cosmids/fosmids:

LL02NC02 – chromosome 2, 4x

LL02NC03 – chromosome 2, 2.6x

LA04NC01 – chromosome 4, 3x

RP libraries :

Cal Tech libraries :

Genome Systems libraries:

MRC libraries :

Sequence and Assembly Validation

Each chromosome sequence was segmented into 150-kb fragments, each with 40% overlap with the previous simulated clone. An in silico HindIII digest (ISD) of these simulated clones was performed; bands less than 600 bp were removed from consideration to be consistent with the fingerprint data. Fingerprint data from the human genome physical mapping effort2 were converted from mobilities to sizes so that clones from the fingerprinting effort could then be directly compared to the chromosome fragments.
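The segmentation and digest described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names are hypothetical, and the recognition site is an assumption (AAGCTT, the HindIII site, cut after the first A). The fragment length, 40% overlap and 600-bp cutoff come from the text.

```python
import re

FRAGMENT_LEN = 150_000   # simulated clone size (from the text)
OVERLAP_PERCENT = 40     # overlap with the previous simulated clone (from the text)
MIN_BAND_BP = 600        # bands smaller than this are dropped (from the text)
SITE = "AAGCTT"          # ASSUMPTION: HindIII recognition site
CUT_OFFSET = 1           # ASSUMPTION: HindIII cuts after the first A (A^AGCTT)


def simulated_clones(chrom_len, frag_len=FRAGMENT_LEN, overlap_pct=OVERLAP_PERCENT):
    """Return (start, end) coordinates of overlapping simulated clones."""
    step = frag_len * (100 - overlap_pct) // 100  # 90 kb for the defaults
    clones = []
    start = 0
    while start < chrom_len:
        end = min(start + frag_len, chrom_len)
        clones.append((start, end))
        if end == chrom_len:
            break
        start += step
    return clones


def in_silico_digest(seq, site=SITE, cut_offset=CUT_OFFSET, min_band=MIN_BAND_BP):
    """Return sorted fragment sizes (bp) of a digest, keeping bands >= min_band."""
    cuts = [m.start() + cut_offset for m in re.finditer(site, seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(s for s in sizes if s >= min_band)
```

With these defaults each simulated clone starts 90 kb after the previous one, which is in line with the roughly 2,000 to 2,700 simulated clones per chromosome reported in this section.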

The ISD of each 150-kb simulated clone (2,662 clones for chromosome 2; 2,094 for chromosome 4) was then compared to the overlapping clones in the fingerprint map. We required that each band in the simulated clone be confirmed by no more than two of the overlapping fingerprints. A band was considered confirmed if its size in the simulated clone differed by less than 3% from a band in the fingerprints under consideration. For bands between 600 and 700 bp (N=1828 bands in chromosome 2; N=1438 bands in chromosome 4) and for bands greater than 11 kb (N=2119 chromosome 2; N=1424 chromosome 4), we allowed the discrepancy to be <4% given the slightly higher variability of the fingerprint gels in those ranges17. However, only 405 and 226 additional bands on chromosomes 2 and 4, respectively, were confirmed using the 4% threshold rather than the 3% threshold.

An ISD of the full chromosome 2 sequence revealed 71,331 total bands of which 58,741 were greater than 600 bp. For chromosome 4, there were 58,729 total bands and 48,002 bands greater than 600 bp. Using the above method on chromosome 2, only 6 bands from 5 clones (2 of the clones were at contig ends) were not confirmed. For chromosome 4, there were only 6 unmatched bands (from 12 clones, 7 of which were at contig ends). In this way, 99.99% of the testable fingerprint bands were confirmed. Additionally, the order of placement of the in silico digests in the FPC database was consistent with the assembly position, with the only minor discrepancies found adjacent to contig ends where sequence had been used to resolve the final assembly ordering.
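The band-confirmation rule can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: the size difference is taken relative to the simulated band, and a band counts as confirmed if any one overlapping fingerprint contains a matching band. All function names are hypothetical.

```python
def tolerance_for(size_bp):
    """Relative tolerance: 4% for 600-700 bp and >11 kb bands, otherwise 3%."""
    if 600 <= size_bp <= 700 or size_bp > 11_000:
        return 0.04
    return 0.03


def band_confirmed(sim_band, fingerprint_bands):
    """True if some fingerprint band lies within the tolerance of sim_band.

    ASSUMPTION: the difference is measured relative to the simulated band size.
    """
    tol = tolerance_for(sim_band)
    return any(abs(sim_band - fp) / sim_band < tol for fp in fingerprint_bands)


def unconfirmed_bands(sim_bands, overlapping_fingerprints):
    """Return simulated bands with no match in any overlapping fingerprint.

    ASSUMPTION: a match in a single overlapping fingerprint is sufficient.
    """
    return [
        band
        for band in sim_bands
        if not any(band_confirmed(band, fp) for fp in overlapping_fingerprints)
    ]
```

For example, a 1,000-bp simulated band is confirmed by a 1,029-bp fingerprint band but not by a 1,031-bp band, whereas a 650-bp band is confirmed by a 675-bp band because the wider 4% tolerance applies in that size range.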

Tiling Path Verification, Overlap Polymorphisms and Assaying mRNA/genomic discrepancies

To evaluate clone overlaps where the rate of difference between overlapping clone sequences was higher than 1 in 1000 bases, a PCR product encompassing the differences was sequenced from each BAC in the overlap region and from each of a panel of 24 ethnically diverse genomic DNA samples8. If the 24 samples showed allelic variation, the overlap was judged to be correct, but if the 24 samples yielded persistent heterozygosity, the sequence was determined to be derived from a repeated sequence, with sequence differences between the copies.

To investigate discrepancies between mRNA and genomic sequence, PCR products were generated and resequenced, and the polymorphic bases were examined in the DNA from 24 individuals, the original BAC and in some cases other BACs.

Initially, we tested 21 and 32 overlaps for chromosomes 2 and 4, respectively, that contained at least three differences per kb. For chromosome 2, there were 3,547 differences in 955 kb, or 3.7 differences per kb on average. For chromosome 4, there were 8,785 differences in 2,186 kb, or 4.0 differences per kb on average. The highest rate of apparent variation identified was 5.6 per kb on chromosome 2 and 6.3 per kb on chromosome 4.
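The average rates quoted above follow directly from the reported totals. A small check, using a hypothetical helper name:

```python
def differences_per_kb(n_differences, overlap_kb):
    """Average number of sequence differences per kilobase of overlap."""
    return n_differences / overlap_kb


# Totals as reported in the text for the two sets of tested overlaps.
first_set = differences_per_kb(3547, 955)
second_set = differences_per_kb(8785, 2186)
print(round(first_set, 1), round(second_set, 1))  # prints: 3.7 4.0
```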

We further evaluated the overlap regions of 2,718 human clones (from chromosomes 2, 4 and 7) by aligning each overlap using cross_match (P. Green, unpublished) using the following parameters: -minmatch 50 -minscore 2000 -discrep_lists -tags -penalty -1. We counted the number of phred18 high quality (base quality value >19) substitution, insertion and deletion events (where each insertion or deletion event, regardless of the numbers of basepairs inserted, was counted as one polymorphic event). We counted events in 5kb non-overlapping windows. Of those 2,718 overlaps (8,309 windows), 678 had at least 5kb of overlap and more than 3 “polymorphic events” in at least one of the 5kb windows. The greatest number of events per window was 136.

We identified 24 different regions where there were at least 3 consecutive windows with more polymorphic events than two standard deviations from the mean. When considering substitution + insertion + deletion events, the mean number of polymorphic events per 5kb window was 5.23 (s.d. 6.44). There were 19 overlaps with at least 3 consecutive windows of > 2 standard deviations (>18). When considering only substitution events, the mean number of events per 5kb window was 4.20 (s.d. 5.57). There were 21 overlaps with at least 3 consecutive windows greater than 2 standard deviations (>15). When considering only insertion + deletion events, the mean number of events per 5kb window was 1.03 (s.d. 1.48). There were 4 overlaps with at least 3 consecutive windows of greater than 2 standard deviations. We examined these regions for features that may indicate balancing selection. Of the 24, 15 were in the region of a gene.
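The windowing and outlier screen described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: event positions are taken as 0-based offsets within an overlap, the standard deviation is computed as a population standard deviation over all windows (the text does not say which estimator was used), and all function names are hypothetical.

```python
from statistics import mean, pstdev

WINDOW_BP = 5_000  # non-overlapping window size (from the text)


def window_counts(event_positions, overlap_len, window=WINDOW_BP):
    """Count events in consecutive non-overlapping windows across one overlap."""
    n_windows = max(1, -(-overlap_len // window))  # ceiling division
    counts = [0] * n_windows
    for pos in event_positions:  # ASSUMPTION: 0 <= pos < overlap_len
        counts[pos // window] += 1
    return counts


def has_high_run(counts, threshold, min_run=3):
    """True if at least min_run consecutive windows exceed the threshold."""
    run = 0
    for count in counts:
        run = run + 1 if count > threshold else 0
        if run >= min_run:
            return True
    return False


def flag_overlaps(counts_by_overlap, min_run=3):
    """Return names of overlaps with a run of windows above mean + 2 s.d."""
    all_counts = [c for counts in counts_by_overlap.values() for c in counts]
    # ASSUMPTION: population s.d. over every window in every overlap.
    threshold = mean(all_counts) + 2 * pstdev(all_counts)
    return [
        name
        for name, counts in counts_by_overlap.items()
        if has_high_run(counts, threshold, min_run)
    ]
```

Using the summary statistics reported in the text, the cutoff for all event types combined is 5.23 + 2 x 6.44 = 18.11, which corresponds to the ">18" quoted above; for substitutions alone it is 4.20 + 2 x 5.57 = 15.34, corresponding to ">15".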