SUPPLEMENTARY INFORMATION

Clone mapping and sequencing of human Chromosome 17. Methods for generation of the clone path and sequence map at the Broad Institute (Whitehead/MIT Center for Genome Research) have been previously described1. The list of 664 clones comprising the tiling path is found in Table S2.

Clone mapping and sequencing of mouse Chromosome 11. Methods for generation of the clone path and sequence map at the Sanger Institute have been previously described2,3. The list of 867 clones comprising the tiling path is found in Table S5.

Quality assessment of clone path and sequence of human Chromosome 17.

We assessed the local accuracy of the clone path of human Chromosome 17 by aligning paired-end sequences from a human Fosmid library (designated WIBR2, representing 10x physical coverage) to the finished sequence4. By comparing the distance separating each pair of Fosmid ends in the finished sequence with the distance expected from insert size constraints, one can detect errors in the clone path4. This analysis revealed no aberrant clones. In addition, an independent quality assessment exercise commissioned by NHGRI5 estimated the accuracy of the finished sequence at better than 1 error in 100,000 bases (J. Schmutz, personal communication).
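The Fosmid end-pair check described above can be illustrated with a minimal sketch. It is not the pipeline used for this analysis (see reference 4 for that). The record layout, the function name and the 30-50 kb window of acceptable insert sizes are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class EndPair:
    clone_id: str
    left_pos: int      # position of one end read on the finished sequence (bp)
    right_pos: int     # position of the mate end read (bp)
    same_strand: bool  # True if both reads align to the same strand (unexpected)

def flag_aberrant(pairs, min_insert=30_000, max_insert=50_000):
    """Return IDs of clones whose end placements violate the constraints.

    min_insert and max_insert are hypothetical bounds on Fosmid insert size;
    a real analysis would derive them from the library's size distribution.
    """
    flagged = []
    for p in pairs:
        span = abs(p.right_pos - p.left_pos)
        if p.same_strand or not (min_insert <= span <= max_insert):
            flagged.append(p.clone_id)
    return flagged

# Illustrative input only; these are not real WIBR2 clones.
example = [
    EndPair("cloneA", 1_000, 41_000, False),   # consistent spacing
    EndPair("cloneB", 1_000, 200_000, False),  # ends too far apart
    EndPair("cloneC", 5_000, 45_000, True),    # wrong relative orientation
]
print(flag_aberrant(example))
```

A region where many independent clones are flagged would point to a possible path error, whereas an isolated flagged clone is more likely a chimeric or mis-tracked clone.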

Several analyses support that nearly the entire euchromatic region of Chromosome 17 is present and accurately represented in the finished sequence. Of the 1266 gene loci (encoding 5589 transcripts) in the well-curated RefSeq6 dataset that have been mapped to Chromosome 17, all are present and complete in the finished sequence. In addition, the finished sequence shows excellent alignment to genetic and radiation hybrid (RH) maps (Figure S1). The clones at 17ptel and 17qtel both contain telomeric (TTAGGG)n repeat sequences. The clones flanking the centromeric gap both contain significant arrays of alpha-centromeric repeats, although neither shows the higher order repeat structure of arrays that distinguishes the true centromeric regions. The genetic map7 shows perfect alignment, with no discrepancies among 179 sequence-based genetic markers (Table S7). The RH map8 shows good agreement, but contains local discrepancies as is expected from its lower resolution (Table S8).

Quality assessment of clone path and sequence of mouse Chromosome 11.

We assessed the local accuracy of the clone path of mouse Chromosome 11 by aligning paired-end sequences from a mouse Fosmid library (designated WIBR1, representing 10x physical coverage) to the finished sequence2. This analysis revealed no aberrant clones. In addition, an independent quality assessment exercise utilizing optical map data with >20x coverage of mouse Chromosome 11 highlighted 4 events for further investigation9 (D. Schwartz and S. Goldstein, personal communication). All 4 were resolved either by clone re-assembly or by replacement of deleted BACs within the tiling path.

Several analyses support that nearly the entire euchromatic region of Chromosome 11 is present and accurately represented in the finished sequence. Of the 1486 RefSeq6 genes that have been mapped to Chromosome 11, all are present and complete in the finished sequence. In addition, the finished sequence shows excellent alignment to genetic and RH maps (Figure S1). The genetic map10 shows perfect alignment, with no discrepancies among ~320 sequence-based genetic markers (Table S9). The RH map11 shows good agreement, but contains local discrepancies as is expected from its lower resolution (Table S10).

Gaps in the human Chromosome 17 clone path. The human Chromosome 17 clone path currently contains 9 euchromatic gaps, with an estimated 1.1% missing sequence, and a gap containing the centromeric heterochromatin (see Table S1); all ten are numbered below, with the centromeric gap listed as item 3. This is about 25% higher than the genome-wide average for both number of gaps and estimated sequence in gaps for NCBI build 35 (reference 4). All euchromatic gaps on the chromosome fall into one of three classes. Three gaps are associated with segmental duplications, which are refractory to mapping; two contain sequences that persistently delete from large insert clones; and for four gaps we were unable to identify spanning clones, as they are likely to contain sequences that are unclonable in E. coli. Four of the gaps also fall into regions associated with primate-specific breaks in conserved synteny, so that other mammalian genome data are not useful for sizing them.

Proceeding from 17ptel to 17qtel, the general character of the gaps and the methods by which their sizes were estimated are described below. Approximate positions are listed. See Table S1 for more details.

1.  2.96 Mb. The gap region is apparently refractory to cloning as we were unable to identify spanning clones despite screening of large insert clone libraries representing >53-fold physical coverage of the human genome. The size of this gap was estimated at 101kb by taking advantage of conserved synteny of markers between the human and mouse genomes.

2.  21.5 Mb. The gap region is apparently refractory to cloning as we were unable to identify spanning clones despite screening of large insert clone libraries representing >53-fold physical coverage of the human genome. The gap is in a region associated with primate-specific breaks in conserved synteny, and so cannot be sized by comparison to nonprimate genome sequence. We were also unable to determine the size of this gap by using the TNG radiation hybrid map. A default estimate of 100kb for this gap is used in calculating the size of the chromosome.

3.  22.2 Mb. This gap contains the centromeric heterochromatin, and is estimated at 3Mb in size.

4.  31.7 Mb. This gap is flanked by segmental duplications, and is thus refractory to current mapping methods. The gap is in a region associated with primate-specific breaks in conserved synteny, and so cannot be sized by comparison to nonprimate genome sequence. We were also unable to determine the size of this gap by using the TNG radiation hybrid map. A default estimate of 100kb for this gap is used in calculating the size of the chromosome.

5.  33.5 Mb. This gap is flanked by segmental duplications, and is thus refractory to current mapping methods. The gap is in a region associated with primate-specific breaks in conserved synteny, and so cannot be sized by comparison to nonprimate genome sequence. We were also unable to determine the size of this gap by using the TNG radiation hybrid map. A default estimate of 100kb for this gap is used in calculating the size of the chromosome.

6.  38.6 Mb. This gap region is flanked by segmental duplications, and contains sequences that consistently delete from large insert clones. It appears to contain many tandem copies of a 6kb HERV element. The size of the gap was estimated at 100kb by comparing the average size of BACs in the library with the sizes of BACs with ends on either side of the gap, and then accounting for the amount of sequence present between those ends.

7.  63.5 Mb. This gap is flanked by segmental duplications, and is thus refractory to current mapping methods. This gap is in a region associated with primate-specific breaks in conserved synteny. The size of the gap was estimated at 90kb by both fibre-FISH and by estimating from spacing of flanking markers from the TNG radiation hybrid panel.

8.  75.1 Mb. This gap region contains sequences that consistently delete from large insert clones. The size of the gap was estimated at 153kb by taking advantage of conserved synteny of markers between the human and mouse genomes.

9.  77.3 Mb. The gap region is apparently refractory to cloning as we were unable to identify spanning clones despite screening of large insert clone libraries representing >53-fold physical coverage of the human genome. The size of the gap was estimated at 65kb by taking advantage of conserved synteny of markers between the human and mouse genomes.

10.  78.7 Mb. The gap region is apparently refractory to cloning as we were unable to identify spanning clones despite screening of large insert clone libraries representing >53-fold physical coverage of the human genome. The size of this gap was estimated at 100kb from spacing of flanking markers from the TNG radiation hybrid panel.

Gaps in the mouse Chromosome 11 clone path. The mouse Chromosome 11 clone path currently contains two euchromatic gaps and a gap containing the acrocentric heterochromatin. These gaps remain despite screening of genomic libraries containing a combined ~40-fold physical coverage of C57BL/6. Proceeding from 11ptel to 11qtel, the sizes of the euchromatic gaps were estimated as follows. The gap at 34.4 Mb was estimated at 90 kb by fibre-FISH. The gap at 88.4 Mb was determined to be 6.8 kb by sequencing in the NOD DIL strain. Both gap sizes have now been confirmed using the optical map (D. Schwartz and S. Goldstein, personal communication). See Table S1 for more details.

Details of gene annotation for human Chromosome 17

The gene catalog was produced as previously described1. Databases of transcribed sequences, including RefSeq (release 1), Mammalian Gene Collection (MGC) (February 3, 2003), dbEST, and GenBank mRNA (December 29, 2002), were aligned to the genome assembly using BLAT12. GenPept protein sequences (February 3, 2003) from many species were aligned using BLASTX13 and GeneWise14. Gene models were then created by manual annotation based on this transcriptional evidence, with the models classified according to the HAWK2 transcript type conventions (www.sanger.ac.uk/Info/workshops/hawk2); see reference 1 for details. We note that 76% of ‘known’ genes have evidence of CpG islands in the region from 2 kb upstream to 1 kb downstream of the annotated 5’-end; this is similar to some recent reports1,15, but somewhat higher than previous reports in the range of 61-66%4,16-18. Gene symbols were assigned by the HUGO Gene Nomenclature Committee for biologically characterized loci; a complete list of gene symbols from this paper is found in Table S9. Our annotations are available from the Vertebrate Genome Annotation database (VEGA) (http://vega.sanger.ac.uk/Homo_sapiens).

According to the HAWK2 categorization scheme, Chromosome 17 contains 1017 ‘known’ genes, 123 ‘novel CDS’, 87 ‘novel transcripts’, 32 ‘putatives’, and 2 ‘gene fragments’. Only a small fraction of all loci, those in the ‘novel’ and ‘putative’ categories, were annotated as genes based on spliced EST evidence only. Some ‘putative’ and ‘novel’ loci may prove to be pseudogenes. Full-length transcripts of known genes have an average length of 2994 bp and contain an average of 10.6 exons, which is comparable to recent published reports of human chromosomes1,15-17,19. Internal exon lengths average 152 bp.

Extreme genes, human Chromosome 17. The longest gene on Chromosome 17 is ubiquitin specific protease 32 (USP32) [HsG16285] spanning 2,096,535 bp. The longest mature transcript is amiloride-sensitive cation channel 1, neuronal (degenerin) (ACCN1), transcript variant 2 [HsT34171, HsG15443] at 1,143,719 bp. This gene also contains the longest intron at 1,043,911 bp. The longest single exon is found in STE20-like kinase [HsG15270, HsT33652], a 9338 base terminal coding exon.

Duplication Class Clustering. Pairwise intrachromosomal duplications were defined as above. A pairwise duplication A~A’ was considered to be in the same class as another pairwise duplication B~B’ if B or B’ overlapped A or A’ by at least 150 bp and by at least 5% of the smaller element. We extended this by transitive closure to build maximally linked sets (i.e., if A~A’ overlapped with both B~B’ and C~C’, we clustered together the three duplications, even if B~B’ did not overlap C~C’). The number of duplications in a class is counted as the number of distinct pairwise alignments X~X’ that were clustered. The number of bases in a class is counted as the number of distinct bases covered by at least one pairwise duplication in that class.
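The clustering rule can be sketched with a union-find structure, which is one standard way to compute a transitive closure. This is an illustration under stated assumptions rather than the code used for the analysis: intervals are half-open (start, end) coordinates on one chromosome, "the smaller element" is read as the shorter of the two intervals being compared, and all function names are hypothetical.

```python
def overlap_ok(a, b, min_bp=150, min_frac=0.05):
    """True if intervals a and b overlap by >= min_bp and >= min_frac of the shorter one."""
    ov = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return ov >= min_bp and ov >= min_frac * shorter

def cluster_duplications(dups):
    """dups: list of (interval, interval) pairs. Returns classes as lists of indices."""
    parent = list(range(len(dups)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(dups)):
        for j in range(i + 1, len(dups)):
            # Link two duplications if either copy of one overlaps either copy of the other.
            if any(overlap_ok(x, y) for x in dups[i] for y in dups[j]):
                parent[find(i)] = find(j)

    classes = {}
    for i in range(len(dups)):
        classes.setdefault(find(i), []).append(i)
    return list(classes.values())

def bases_covered(intervals):
    """Number of distinct bases covered by at least one interval."""
    total, cur_end = 0, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            total += end - start
            cur_end = end
        elif end > cur_end:
            total += end - cur_end
            cur_end = end
    return total

# Illustrative coordinates only.
dups = [
    ((0, 1_000), (5_000, 6_000)),
    ((800, 2_000), (9_000, 10_200)),
    ((9_500, 11_000), (20_000, 21_500)),
    ((50_000, 51_000), (60_000, 61_000)),
]
for members in cluster_duplications(dups):
    intervals = [iv for k in members for iv in dups[k]]
    print(len(members), "duplications,", bases_covered(intervals), "bases")
```

In the example, the first and third duplications share no overlapping interval but land in the same class because each overlaps the second. The all-pairs comparison is quadratic; a sorted sweep would be preferable for large inputs.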

Syntenic Mapping, Ancestral Reconstruction, and Assignment of Breakpoints. We constructed pairwise maps of conserved synteny genome-wide between human4 (NCBI build 35), mouse2 (NCBI build 34), dog20 (CanFam 2), and opossum (MonDomV3; K. Lindblad-Toh, personal communication) using bidirectionally unique genomic anchors from DNA alignments with PatternHunter21. Segments of conserved synteny >100 kb in size made from consistently ordered and oriented anchors were retained. These are referred to as “segments” of “directed synteny”, meaning no rearrangement was observed within the bounds in either species. We also define “blocks” of “undirected synteny” as regions on the map in which all sequences between two endpoints on one genome map only to a single interval in the second genome, and vice versa. Thus a block may contain one or more segments.
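As a rough illustration of how segments of directed synteny might be assembled from anchors, the sketch below chains consecutive anchors that stay on the same target chromosome, keep the same orientation and advance in the expected direction. It is a simplification and not the procedure used here: it tolerates no stray anchors, it measures segment size in the first genome only, and the tuple layout and function name are assumptions.

```python
def directed_segments(anchors, min_size=100_000):
    """Chain anchors into consistently ordered and oriented runs.

    anchors: iterable of (pos_a, chrom_b, pos_b, strand), where strand is +1 or -1
    and pos_a is a coordinate on a single chromosome of genome A.
    Returns (start_a, end_a, chrom_b, strand) for runs spanning more than min_size.
    """
    runs, run = [], []
    for anc in sorted(anchors):
        if run:
            prev = run[-1]
            consistent = (
                anc[1] == prev[1]                       # same chromosome in genome B
                and anc[3] == prev[3]                   # same orientation
                and (anc[2] - prev[2]) * anc[3] > 0     # pos_b moves in the direction implied by strand
            )
            if not consistent:
                runs.append(run)
                run = []
        run.append(anc)
    if run:
        runs.append(run)
    return [
        (r[0][0], r[-1][0], r[0][1], r[0][3])
        for r in runs
        if r[-1][0] - r[0][0] > min_size
    ]

# Illustrative anchors only.
anchors = [
    (0, "11", 1_000_000, 1), (60_000, "11", 1_050_000, 1), (150_000, "11", 1_140_000, 1),
    (200_000, "11", 5_000_000, -1), (280_000, "11", 4_900_000, -1), (350_000, "11", 4_850_000, -1),
    (400_000, "5", 100, 1), (420_000, "5", 20_000, 1),
]
print(directed_segments(anchors))
```

Blocks of undirected synteny would then be obtained by merging neighbouring segments that map to the same interval of the second genome regardless of order or orientation; that step is not shown.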

The entirety of human 17 and the distal end of mouse 11 form a single block of undirected synteny. There are two longer such blocks between the two genomes, the X chromosome being the longest (145.5 Mb in defined conserved synteny on human X) and a block between human 4 and mouse 5 the second longest (88.2 Mb). Six other regions exceed 50 Mb in size, including the majority of human chromosome 20 (57.5 Mb, a small piece of mouse in this region maps to human 19). Modeling the breakpoints as random events and fitting to an exponential distribution shows that an unbroken block the size of human 17 is not a significant outlier.
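The reasoning behind the exponential model can be made concrete. If breakpoints fall at random along the genome, block lengths are approximately exponentially distributed, and the chance that at least one of n blocks reaches a given size L is 1 - (1 - exp(-L/mean))^n. The sketch below implements that calculation. The function names are hypothetical, the mean is estimated simply as the sample average, and no values from this study are used; the fitting procedure actually applied may have differed.

```python
import math

def fit_exponential_mean(block_sizes):
    """Maximum-likelihood estimate of the exponential mean (the sample average)."""
    return sum(block_sizes) / len(block_sizes)

def prob_any_block_at_least(size, mean, n_blocks):
    """P(at least one of n_blocks independent exponential lengths is >= size)."""
    p_single = math.exp(-size / mean)
    return 1.0 - (1.0 - p_single) ** n_blocks
```

Calling prob_any_block_at_least with the size of the block of interest, the fitted mean and the total number of blocks gives the probability of seeing a block at least that large by chance. A value that is not small is consistent with the statement that the block is not a significant outlier.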

We reconstructed the primate-rodent ancestor (Fig S2) starting with segments of directed synteny between human and mouse. We then used comparison to dog and opossum to determine the most parsimonious ancestral structure. If either dog or opossum agreed with the ordering of segments in one of human or mouse, that was taken to be the state of the primate-rodent ancestor. In no case did dog and opossum provide conflicting evidence. In one case, a break between 20.1 and 20.8 Mb in human, neither dog nor opossum supported either the human or the mouse ordering. This region appears to contain breakpoints used independently in multiple lineages. It is placed at the end of our ancestral reconstruction in the order and orientation suggested by dog, although if dog is rearranged at this site the orientation may be incorrect. Since all other junctions are supported by at least one outgroup, there is no other ordering consistent with a minimal rearrangement path; however, if the true rearrangement history is not minimal, there may be other valid orderings we have not explored.
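The outgroup rule applied at each junction can be written as a small decision function. This is a sketch of the logic as described in the text, not the software used for the reconstruction. Representing a junction state as a tuple of segment labels, the function name and the status strings are all assumptions made for illustration.

```python
def infer_ancestral(human, mouse, outgroups):
    """Decide the ancestral state of one junction.

    human, mouse: the segment ordering observed at the junction in each species
                  (any hashable value, e.g. a tuple of segment labels).
    outgroups:    orderings observed in outgroup species (here, dog and opossum).
    Returns (state, status); state is None when the rule cannot decide.
    """
    if human == mouse:
        return human, "shared"
    # Keep only outgroup observations that match one of the two ingroup states.
    supported = {state for state in outgroups if state in (human, mouse)}
    if len(supported) == 1:
        return supported.pop(), "outgroup-supported"
    if not supported:
        return None, "unresolved"  # neither outgroup matches human or mouse
    return None, "conflict"        # outgroups support different ingroup states

# Hypothetical segment labels.
human_order = ("s1", "s2")
mouse_order = ("s1", "s3")
print(infer_ancestral(human_order, mouse_order, [("s1", "s2"), ("s1", "s9")]))
```

Under this scheme, the junction between 20.1 and 20.8 Mb described above would fall in the "unresolved" category, and the "conflict" category would be empty.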