Supplementary Information
Generating a sequence-ready map of chromosome 7
The hierarchical clone-by-clone sequencing of chromosome 7 required the generation of a suitable BAC-based physical map. The starting framework for this effort was the YAC-based STS-content map of the chromosome1. This provided a mapped STS on average every 79 kb; however, the YACs themselves were not suitable templates. Initially, the STS-specific PCR assays were used to isolate clones and to construct BAC/PAC-based contig maps. This was followed by higher-throughput hybridization-based screening methods utilizing pooled labeled overgo probes2. The hybridization-based screening provided deep clone coverage, while the PCR-based screening provided clones with STS-content information. Isolated BACs were subjected to restriction enzyme digest-based fingerprint analysis3. The resulting regionally localized contigs were positioned on the longer-range physical map by their STS content, and minimal tiling paths of clones were selected for sequencing. With time, chromosome 7 clones were also selected from the genome-wide BAC physical map4. At that time, contig information was integrated from the chromosome 7 regional database to preserve the previously determined clone order and contig identification. Subsequent clone selection utilized the whole genome database. Attempts were made to close all gaps using overgo probes derived from BAC-end sequences from clones flanking each gap. All available BAC and PAC libraries were screened, which collectively represented 100-fold clone coverage of the genome (see supplementary methods).
Comparison of the chromosome 7 sequence to physical and genetic maps
We evaluated the completeness of the chromosome 7 sequence by looking for its representation of STSs from previously constructed physical and genetic maps of the chromosome, specifically a YAC-based STS-content map (2,150 STSs)1, the Genethon microsatellite-based genetic map (272 CA-repeat-containing STSs)5, and a chromosome 7-specific radiation-hybrid (RH) map (268 STSs)1. Using a combination of e-PCR6 and MegaBLAST7, 2,044 (95%), 268 (99%), and 248 (93%) STSs on the above three maps, respectively, could be uniquely assigned (see Supplementary Information). The small number of unassigned STSs included those from multi-copy sequences, sequence polymorphisms, regions within the remaining chromosome 7 sequence gaps, and clerical errors that preclude accurate matching of STS names to their true underlying sequence. For example, the majority (81/106) of the STSs present on the YAC-based map but unassigned to the chromosome 7 sequence were largely repetitive, making them difficult to localize reliably by computational methods. In the final analysis, only 9 of the STSs (0.4% of the total) reside within notably weak, suspicious regions of the YAC-based map (e.g., small, isolated contigs) or perhaps within the remaining sequence gaps.
We also used these maps to evaluate the assembly of the sequence. The positions of the identified STSs in the chromosome 7 sequence were plotted relative to their established map positions (main text Fig. 2). Of the 2,044 identified STSs in the YAC-based map, only 19 (<1%) are in serious disagreement (defined as >3 Mb) with their sequence position. There is some clustering of these grossly discordant STSs. For example, six of these STSs (sWSS1373, sWSS2115, sWSS1349, sWSS2734, sWSS2676, and sWSS2678) reside within the chromosomal region associated with the deletions causing the disorder Williams syndrome, an area that was not fully resolved in the YAC-based map. Most of the other grossly discordant STSs reside at the end of YAC contigs (often weakly connected to neighboring STSs), making it likely that their position on the YAC map was incorrect.
Higher-resolution analysis revealed 172 examples of minor, local discordance between the YAC-based map position(s) of a single STS or groups of STSs and their sequence position. These cases mostly involve single STSs mapped out of order (46/172) or pairs of markers whose positions are simply reversed (91/172). The remainder (35/172) of cases involve between 2 and 9 STSs mapped out of order. The only notable clustering of markers with minor discrepancies was found near the centromere and in the Williams syndrome region. Indeed, this level of minor, local discordance (involving only ~8% of STSs) is within the inherent limits of YAC-based mapping methods.
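One way to flag locally discordant markers of the kind described above is sketched here. This is an illustration only, not the procedure used for the analysis: it assumes each STS has a single sequence coordinate, that map order and sequence coordinate run in the same direction, and it reports every marker that falls outside a longest increasing subsequence of sequence coordinates. The function name `out_of_order_markers` and its inputs are hypothetical.

```python
from bisect import bisect_left

def out_of_order_markers(map_order, seq_pos):
    """Return markers whose sequence position breaks the map order.

    map_order: STS names listed in map order (hypothetical input).
    seq_pos:   dict mapping STS name -> sequence coordinate in bp.
    Markers outside one longest increasing subsequence of sequence
    coordinates (taken in map order) are reported as discordant.
    """
    coords = [seq_pos[name] for name in map_order]
    tail_idx = []    # tail_idx[k]: index ending the best increasing run of length k+1
    tail_val = []    # tail_val[k]: coordinate at tail_idx[k]
    prev = [-1] * len(coords)
    for i, c in enumerate(coords):
        k = bisect_left(tail_val, c)
        if k == len(tail_idx):
            tail_idx.append(i)
            tail_val.append(c)
        else:
            tail_idx[k] = i
            tail_val[k] = c
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    # Walk back from the end of the longest run to collect concordant markers.
    keep = set()
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        keep.add(i)
        i = prev[i]
    return [map_order[i] for i in range(len(coords)) if i not in keep]
```

For a reversed pair of neighbouring markers, this approach reports only one member of the pair, so counts obtained this way would not be directly comparable to the categories listed above.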
There is also strong agreement between the chromosome 7 sequence and both the Genethon genetic and RH maps. In each case, the order for >93% of the STSs is the same in the sequence and on the map, with only one and four instances, respectively, of an STS more than several megabases from its predicted position. The remaining cases of more local discordance are well within the expected accuracies of meiotic and RH mapping methods.
Together, these findings reveal an excellent overall concordance between the chromosome 7 sequence and previously constructed physical and genetic maps. The rare discrepancies can be largely accounted for by the inherent lower-resolution nature of the various mapping methods; however, it is also possible that some of the observed differences reflect polymorphisms between the different copies of chromosome 7 used for map construction and sequence generation. Nonetheless, these results, in conjunction with the robustness of the BAC-based physical map used for sequence generation4, provide strong support for the established chromosome 7 sequence.
Supplementary mRNA information
As described in the text, we examined discrepancies between our genomic sequence and known mRNAs that created frameshifts and/or truncations in the genomic sequence. In cases where there was no support for the mRNA sequence variant, we attempted to construct a gene that avoided the sequence discrepancy. In cases where the site was determined to be either an error in our BAC sequence (updates to our BAC sequence have also been submitted) or polymorphic, and where we could not create a protein without the base change, the amino acids for the following GenBank8 gi identifiers were added to the chromosome 7 gene index, even though the BAC-based sequence did not support those translations: gi7662219, gi4504950, gi13994299, gi15706488, gi16554448, gi5174616, gi12654784, gi4501850. Two additional mRNAs (gi13540581 and gi14043716) spanned a sequence gap, with some exons falling in the gap, and were not included in the gene set.
Of the 23 mRNAs which had no similarities to any mouse gene or to any protein in the database, 20 were in regions of chromosome 7 that did have a mapped orthologous region in mouse and 7 of those 20 were spliced. Overall, of the 23 mRNAs, there were 17 that were single-exon genes and 15 of the 17 matched only themselves when searched against the human genome sequence. The open reading frame of the set of 23 was generally short with an average length of 115 amino acids. Finally, overall 12 of the 23 matched only themselves when searched against the human genome sequence.
Some mRNAs thought to be located on chromosome 7 were not found in the analyses described here. However, the following mRNAs have been identified in the new data incorporated into chromosome 7 since the June 28, 2002 (NCBI build31), freeze of the data set used for the analyses presented in this paper: NM_006060, NM_003364, NM_030798, NM_148842, BC032712, NM_02979 (REFSEQ9 identifiers). Two additional mRNAs thought to be on chromosome 7 have still not been identified in the finished sequence and have therefore been the subject of hybridization experiments. First, NM_005542 found positive clones on chromosome 7 as well as on various other chromosomes; thus, the results are still not definitive. Second, hybridization experiments with NM_002607/NM_033023 identified one clone that helped to extend the pter-most chromosome 7 contig. Two clones were sequenced as a result. However, a gap still remains and the full genomic sequence for that mRNA has not been obtained.
Supplementary pseudogene analyses
The analysis of the distribution of non-processed pseudogenes compared with that of genes (window size = 2 Mb) confirmed their significant correlation (r = 0.68; p < 0.001), in accordance with their formation mechanism. In contrast, processed pseudogenes did not appear to share any significant correlation with genes (r = 0.26). This disagrees with what would be expected if we assume that these transposed elements insert more readily into genomic regions with higher transcriptional activity. Processed pseudogenes show a higher density near the telomeric regions (12.8 processed pseudogenes per Mb compared with an average of 4.3 per Mb for the whole chromosome). In addition, we observed a clear absence of processed pseudogenes within the TCR region, which presents the highest gene density of the whole chromosome (79.4 genes/Mb).
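A windowed density comparison of this kind can be sketched as follows. This is a minimal illustration, assuming that each gene or pseudogene is represented by a single 0-based coordinate smaller than the chromosome length, that windows are non-overlapping, and that the correlation is a Pearson coefficient; the function names are hypothetical and the significance test is omitted.

```python
from math import sqrt

WINDOW_BP = 2_000_000  # 2-Mb window, as stated in the text

def window_counts(positions, chrom_length, window=WINDOW_BP):
    """Count features per non-overlapping window (the last window may be partial)."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for p in positions:
        counts[p // window] += 1
    return counts

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

Calling `pearson_r(window_counts(gene_pos, L), window_counts(pseudogene_pos, L))` on the two position lists would give a coefficient comparable in kind to the r values reported above, although the exact windowing used for the analysis is not specified.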
Supplementary Methods
Physical Mapping
The following libraries were probed for filling gaps in the chromosome 7 map:
Library                               Estimated coverage
RPCI-4 RPCI Male PAC                  4x
RPCI-5 RPCI Male PAC                  6x
RPCI-11 Segments 1-5 Male BAC         32.2x (refs. 10, 11)
RPCI-13 Segments 1-4 Female BAC       21.8x
Genome Systems BAC library 1          10x
Cal Tech Segments A, B                13x (ref. 12)
Cal Tech D                            17x (ref. 12)
RP libraries: http://www.chori.org/bacpac/
Cal Tech libraries: http://informa.bio.caltech.edu/idx_www_tree.html
Genome Systems libraries: http://www.genomesystems.com/
Sequence and Assembly Validation
The chromosome 7 sequence was segmented into simulated 150-kb clones, each overlapping the previous simulated clone by 40%. An in silico HindIII digest (ISD) of these simulated clones was performed; bands smaller than 600 bp were removed from consideration to be consistent with the fingerprint data. Fingerprint data from the human genome physical mapping effort4 were converted from mobilities to sizes so that clones from the fingerprinting effort could then be directly compared to the chromosome 7 fragments.
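The segmentation and in silico digest can be illustrated with the short sketch below. It is not the code used for the analysis: it assumes a HindIII-type recognition site (AAGCTT, cut after the first A), treats the two end pieces of each simulated clone as ordinary bands, and uses hypothetical function names.

```python
def simulated_clones(seq_len, clone_len=150_000, overlap=0.40):
    """Return (start, end) intervals of simulated clones tiling the sequence.

    Consecutive clones share `overlap` of their length, so with the
    defaults each clone starts 90 kb after the previous one.
    """
    step = round(clone_len * (1 - overlap))
    clones = []
    start = 0
    while True:
        end = min(start + clone_len, seq_len)
        clones.append((start, end))
        if end == seq_len:
            break
        start += step
    return clones

def digest_sizes(seq, site="AAGCTT", cut_offset=1, min_size=600):
    """Return sorted fragment sizes (bp) from an in silico digest.

    Fragments shorter than `min_size` are dropped, mirroring the 600-bp
    cutoff in the text. The first and last pieces are bounded by the
    clone ends rather than by restriction sites.
    """
    seq = seq.upper()
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + cut_offset)
        i = seq.find(site, i + 1)
    bounds = [0] + cuts + [len(seq)]
    sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(s for s in sizes if s >= min_size)
```

Each interval from `simulated_clones` would be used to slice the chromosome sequence, and the slice passed to `digest_sizes` to obtain the band list for that simulated clone.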
The ISD of each 150-kb simulated clone was then compared to the overlapping clones in the chromosome 7 fingerprint map. We required that each band in the simulated clone be confirmed by at least two of the overlapping fingerprints. A band was considered confirmed if the size of the fragment in the simulated clone differed by less than 3% from a band in the fingerprints under consideration. For bands between 600 and 700 bp (N=1119 bands in chromosome 7 are in this range) and for bands greater than 11 kb (N=2002), we allowed the discrepancy to be <4% given the slightly higher variability of the fingerprint gels in those ranges13. However, only 24 additional bands were confirmed using the 4% threshold rather than the 3% threshold.
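The band-confirmation rule can be sketched as below. This is a hedged illustration rather than the original implementation: the size difference is taken relative to the simulated band (an assumption), the number of supporting fingerprints required is exposed as the parameter `min_support` (set to 2 here as an assumption), and all names are hypothetical.

```python
def tolerance(size_bp):
    """Allowed fractional size difference for a simulated band.

    4% for bands of 600-700 bp and for bands above 11 kb, 3% otherwise,
    following the thresholds given in the text.
    """
    if 600 <= size_bp <= 700 or size_bp > 11_000:
        return 0.04
    return 0.03

def band_matches(sim_size, fingerprint_sizes):
    """True if any fingerprint band lies within tolerance of the simulated band."""
    limit = tolerance(sim_size) * sim_size
    return any(abs(sim_size - s) < limit for s in fingerprint_sizes)

def unconfirmed_bands(sim_bands, fingerprints, min_support=2):
    """Return simulated bands supported by fewer than `min_support` fingerprints.

    sim_bands:    band sizes (bp) from the in silico digest of one simulated clone.
    fingerprints: one list of band sizes (bp) per overlapping fingerprinted clone.
    """
    return [
        band
        for band in sim_bands
        if sum(band_matches(band, fp) for fp in fingerprints) < min_support
    ]
```

Bands returned by `unconfirmed_bands` would then be candidates for closer inspection of the sequence assembly in that region.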
An ISD of the full chromosome 7 sequence revealed 44,870 total bands of which 36,798 were greater than 600 bp. Using the above method, only 27 bands were not confirmed. Of those, only 12 sequence bands were smaller than their related fingerprint band. Those bands ranged in size from 80 to 1740 basepairs (average=549 bp) and from 3.5% to 7.7% difference between the sequence and fingerprint band.
Sequence-Map Comparisons
The sequence-based markers (i.e., STSs) mapped to a YAC-based STS-content map1, the Genethon microsatellite-based genetic map5, and a chromosome 7-specific radiation-hybrid (RH) map1 were positioned within the finished chromosome 7 sequence by a combination of e-PCR6 (Schuler 1997) and MegaBLAST7 (Zhang et al., 2000). In a first stage, e-PCR was used [parameters: margin (M) set to 10, 50, or 200, allowing a mismatch (N) of 0, 1, or 2 bp] to test each STS using the corresponding PCR primers. Of the starting 2,169 non-redundant STSs, 1,678 could be localized to a single position in the chromosome 7 sequence by e-PCR. In a second stage, MegaBLAST was used to analyze the remaining STSs [parameters: expectation cutoff (e) set to 1e-25 and 1e-15]; an additional 300 STSs were localized to a single sequence position. In a third stage, the remaining 191 STSs were subjected to manual inspection and other comparisons, allowing the localization of an additional 59 STSs. Using WU-BLAST (W. Gish, personal communication; http://www.blast.wustl.edu) (-lcmask gapsepqmax=10000 sort_by_pvalue E=0.00000001 gapW=8 -hspmax=100000) and BLAT14, an additional 14 YAC-based STSs and 10 Genethon markers, respectively, could be localized. Thus, these analyses resulted in the confident positioning of 2,061 (95%) of the previously mapped STSs within the chromosome 7 sequence. Note that neither the primer sequences nor the intervening sequence could be obtained for 1 and 9 of the STSs on the Genethon genetic and chromosome 7 RH maps, respectively; these STSs were excluded from the above analyses.
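The staged counts above can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the stage labels and the function name `tally` are chosen here for illustration, and it assumes that the final 14 and 10 markers were drawn from the STSs still unplaced after manual inspection.

```python
def tally(total, stages):
    """Return (rows, localized, fraction).

    rows holds one (stage, newly localized, still unplaced) tuple per stage.
    """
    rows = []
    remaining = total
    for name, n in stages:
        remaining -= n
        rows.append((name, n, remaining))
    localized = total - remaining
    return rows, localized, localized / total

STAGES = [
    ("e-PCR", 1678),
    ("MegaBLAST", 300),
    ("manual inspection", 59),
    ("WU-BLAST/BLAT, YAC-based STSs", 14),
    ("WU-BLAST/BLAT, Genethon markers", 10),
]

rows, localized, fraction = tally(2169, STAGES)
for name, n, remaining in rows:
    print(f"{name}: +{n}, {remaining} unplaced")
print(f"{localized} localized ({fraction:.0%})")
```

The running totals reproduce the 191 STSs remaining after the MegaBLAST stage and the final figure of 2,061 (95%) localized STSs.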
Identifying Known Genes
Each mRNA in the human division of REFSEQ9 as of February 25, 2002, and in the human division of the Mammalian Full Length cDNA collection15 as of January 31, 2002, was searched against the full set of repeatmasked chromosome 7 sequences using WU-BLAST and the following parameters:
-lcmask -filter seg S=170 S2=150 W=13 gapW=4 gapS2=150 M=5 N=-11 Q=11 R=11 B=10000 V=10000 -hspmax 10000
For matches with >95% identity and a P-value more significant than e-49, the mRNA was aligned with the genomic sequence using SPIDEY16. The SPIDEY output was parsed to keep only those alignments where fewer than three of the exons had similarity less than 97% and there were fewer than three gaps in the alignment with the mRNA (where a gap is defined as basepairs in the mRNA which are not aligned with the genomic DNA). At this point, no penalty was applied for alignments extending off the end of each clone.
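The alignment filter described above can be expressed compactly. The sketch below assumes the SPIDEY output has already been parsed into per-exon percent identities and a gap count; the `Alignment` record and the function `passes_filter` are hypothetical and do not correspond to any SPIDEY data structure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Alignment:
    """Hypothetical summary of one parsed mRNA-to-genomic alignment."""
    mrna_id: str
    exon_identities: List[float]  # percent identity of each aligned exon
    n_gaps: int                   # stretches of mRNA not aligned to genomic DNA

def passes_filter(aln, min_exon_identity=97.0, max_count=3):
    """Keep alignments with fewer than three low-identity exons and fewer than three gaps."""
    low_exons = sum(1 for ident in aln.exon_identities if ident < min_exon_identity)
    return low_exons < max_count and aln.n_gaps < max_count
```

An alignment with two exons below 97% identity and two gaps would be retained under these thresholds, whereas a third low-identity exon or a third gap would exclude it.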
Each chromosome 7 candidate mRNA was searched against the Feb 23, 2002, version of the NCBI chromosome assemblies using WU-BLAST and the following parameters:
-lcmask -filter seg S=170 S2=150 W=13 gapW=4 gapS2=150 M=5 N=-11 Q=11 R=11 B=10000 V=10000 -hspmax 10000
Using those results (1169 mRNAs), it was established whether the chromosome 7 match was the best match in the genome for that mRNA by comparing alignments of the mRNAs to the contiguous chromosome sequences. Of the 1169 mRNAs that were potential chromosome 7 known genes, 1073 were confirmed as having their best match in the genome on chromosome 7.
The resulting 1073 mRNAs were then mapped (both the CDS and mRNA using SPIDEY and BLAT) back to the individual clone sequences and annotated there by manual review. All discrepancies were manually evaluated both in the genomic and mRNA sequences. As a part of the checking process, the annotated CDS was translated and searched against the translated mRNAs for verification.
After creation of the final non-redundant set of mRNAs, SPIDEY was evaluated. Each genomic region (extended by 2kb in each direction) was given to SPIDEY along with the mRNA sequence identical to that region. In 65% of the cases, the SPIDEY gene prediction was identical to the final mapping of the mRNA. Further, 79% of the exons were predicted exactly.
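An evaluation of this kind can be computed as sketched below. The sketch assumes each gene is represented by a list of (start, end) exon coordinates, that "identical" means the predicted and final exon sets are the same, and that the exon-level percentage is taken over the exons of the final mapping; these definitions and the function name `evaluate_predictions` are assumptions, not taken from the analysis.

```python
def evaluate_predictions(pairs):
    """Return (fraction of identical genes, fraction of exons predicted exactly).

    pairs: list of (predicted_exons, reference_exons), where each element
    is a list of (start, end) tuples in genomic coordinates.
    """
    identical_genes = 0
    exact_exons = 0
    total_exons = 0
    for predicted, reference in pairs:
        predicted_set = set(predicted)
        exact_exons += sum(1 for exon in reference if exon in predicted_set)
        total_exons += len(reference)
        if sorted(predicted) == sorted(reference):
            identical_genes += 1
    return identical_genes / len(pairs), exact_exons / total_exons
```

With one gene predicted perfectly and a second gene in which one of two exons has a shifted boundary, the function returns 0.5 at the gene level and 0.75 at the exon level.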
Only genes with identities to known mRNAs have been entered into the GenBank annotation for each clone. Genes in other classes are available on our web site.