Combined Supplementary Text

Supplementary Methods

For the shotgun phase1, pUC plasmids with inserts of mostly 1.4-2 kb were sequenced from both ends using the dideoxy chain termination method2 with different versions of big dye terminator chemistry3. The resulting sequencing reactions were analysed on various models of ABI sequencing machines, and the generated data were processed by a suite of in-house programs prior to assembly with the PHRED4,5 and PHRAP algorithms. For the finishing phase, we used the GAP4 program6 to help assess, edit and select reactions to eliminate ambiguities and close sequence gaps. Sequence gaps were closed by a combination of primer walking, PCR, short/long insert sublibraries7, oligo screens of such sublibraries and transposon sublibraries. Unless annotated otherwise, each clone has been finished according to the agreed international finishing standard. Based on internal and external8 quality checks, we estimate our sequence accuracy to exceed 99.99%. For a small number of clones, the initial draft sequence was produced by others, who are credited in the corresponding submissions to the EMBL/GenBank/DDBJ databases. All clones submitted by the Sanger Institute were subjected to rigorous quality checks as described elsewhere9. All DNA sequences have been deposited in the EMBL, GenBank or DDBJ databases.

The finished genomic sequence was first analysed automatically, clone by clone, using an ENSEMBL pipeline with modifications introduced specifically to aid the manual curation process. Each clone sequence was analysed for G+C content and CpG islands; the latter were identified using ‘cpg’ with the parameters used by ENSEMBL (Maximum CpG length = 400 bp, Minimum GC content = 50%, Minimum Observed/Expected ratio = 0.6). Interspersed and simple tandem repeats were masked using RepeatMasker and Tandem Repeats Finder10. The masked sequence was compared against vertebrate cDNAs and expressed sequence tags (ESTs) using WU-BLASTN, and against a protein database containing data from SwissProt and TrEMBL using WU-BLASTX. ESTs were realigned to clones using a combination of WU-BLASTN and EST2genome. Ab initio gene structures were predicted using FGENESH and GENSCAN. ENSEMBL predictions, including the EST genebuild, were also mapped to each clone if it was present in the NCBI 31 human genome assembly. Using these predictions, gene structures were manually annotated in accordance with the human annotation workshop (HAWK) guidelines. Once annotation had been completed, symbols for known and novel genes were, where possible, approved by or obtained from the HUGO Gene Nomenclature Committee11. Annotation is available in VEGA and in the EMBL, GenBank and DDBJ databases.
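The two compositional thresholds quoted above can be illustrated with a short sketch. It assumes the commonly used observed/expected definition (number of CpG dinucleotides times sequence length, divided by the product of the C and G counts); the ‘cpg’ program may compute the ratio differently, and the length criterion is not checked here. The function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio, assumed here to be (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def passes_thresholds(seq: str, min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply the GC-content and observed/expected thresholds quoted in the text."""
    return gc_content(seq) >= min_gc and cpg_obs_exp(seq) >= min_obs_exp
```

A candidate window would be passed to `passes_thresholds` as a plain string; how windows are chosen and extended is left to the actual island-finding program.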

The alignment of the mouse and rat genomes to the masked sequence of chromosome 1 was initially performed using BLASTZ12. Post-processing of the resulting matches was done with axtBest and subsetAxt (W. J. Kent). axtBest was used to select the best match (based on length and score) for regions with multiple alignments to the mouse and rat genomes. To increase exon specificity, the alignments were then re-scored and filtered using subsetAxt (the scoring matrix and threshold were as described in ref. 12).

T. nigroviridis sequences were aligned to the chromosome using WU-TBLASTX13 with the same scoring matrix, parameters and filtering strategy used in the Exofish procedure (Roest Crollius et al., 2000). Overlapping alignments to different T. nigroviridis sequences were merged to produce contiguous regions of sequence conservation, analogous to the "Ecores" reported by Exofish. Regions of conservation with the genomes of D. rerio and T. rubripes were produced in a similar manner. We define Evolutionarily Conserved Regions (ECRs) to be those regions of human chromosome 1 that align to all five reference genomes by the methods outlined above.
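The merging of overlapping alignments and the requirement of conservation in all reference genomes can be sketched as interval operations. This is only one possible reading of the procedure: it assumes alignments are reduced to half-open (start, end) coordinates on the chromosome, that abutting intervals are merged along with overlapping ones, and that "align to all five reference genomes" means a base-level intersection. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or abutting intervals into contiguous regions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, already merged interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def conserved_in_all(per_genome: Dict[str, List[Interval]]) -> List[Interval]:
    """Regions covered by alignments to every genome in per_genome."""
    merged = [merge_intervals(v) for v in per_genome.values()]
    if not merged:
        return []
    result = merged[0]
    for other in merged[1:]:
        result = intersect(result, other)
    return result
```

With one list of aligned intervals per reference genome, `conserved_in_all` returns the regions present in every list.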

Comparative resources used:

Mouse (Mus musculus): NCBI build 30

Rat (Rattus norvegicus): Assembly version 2.0.

Zebrafish (Danio rerio): Assembly version 2.

Fugu (Fugu rubripes): Assembly version 2.

Tetraodon (Tetraodon nigroviridis): Assembly version 6.

Chimpanzee: reads as present in the trace repository in 5/03.

SNP identification and mapping

A modification of the SSAHA program was used to identify SNPs from sequence overlaps determined as part of the sequencing project. High-quality base discrepancies (Q ≥ 23) in aligned clone overlaps were identified as candidate SNPs provided that the total overlap was >6,000 bp. A candidate SNP was rejected if any of the five neighbouring bases on either side had a Phrap quality value of <15 (bases of finished sequence were assumed to have a quality value of 40; that is, one error in 10,000 bp) or if fewer than nine of the ten neighbours matched. If the number of detected SNPs in one clique was greater than five in a 500 bp interval, then all SNPs were discarded for that interval. All SNPs for a clone overlap were similarly discarded if the SNP density for the entire overlap region was less than one SNP per 4,000 bp.
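The neighbourhood and local-density filters described above can be sketched as follows. This is an illustration under stated assumptions, not the modified SSAHA code: the two overlapping sequences are assumed to be aligned without gaps in shared coordinates, candidates too close to the ends of the overlap are rejected, and the 500 bp interval is treated as a sliding window. All function and parameter names are hypothetical.

```python
from typing import List, Sequence


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def passes_neighbourhood_filter(pos: int, seq_a: str, seq_b: str,
                                qual_a: Sequence[int], qual_b: Sequence[int],
                                flank: int = 5, min_qual: int = 15,
                                min_matches: int = 9) -> bool:
    """Check the `flank` bases on each side of a candidate SNP at `pos`."""
    neighbours = [p for p in range(pos - flank, pos + flank + 1) if p != pos]
    # Assumption: candidates without a full neighbourhood are rejected.
    if neighbours[0] < 0 or neighbours[-1] >= min(len(seq_a), len(seq_b)):
        return False
    if any(qual_a[p] < min_qual or qual_b[p] < min_qual for p in neighbours):
        return False
    matches = sum(seq_a[p] == seq_b[p] for p in neighbours)
    return matches >= min_matches


def drop_dense_windows(positions: List[int], window: int = 500,
                       max_snps: int = 5) -> List[int]:
    """Discard every SNP in any window of `window` bp holding more than max_snps."""
    positions = sorted(positions)
    discard = set()
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window until it spans fewer than `window` bp.
        while pos - positions[left] >= window:
            left += 1
        if right - left + 1 > max_snps:
            discard.update(positions[left:right + 1])
    return [p for p in positions if p not in discard]
```

For example, `phred_to_error_prob(40)` corresponds to the one-error-in-10,000 figure assumed for finished sequence.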

The flanking sequences for the set of SNPs mapped to chromosome 1 in dbSNP (release 115) were remapped, using SSAHA, to the contigs of the latest Sanger assembly. Cross_match was then used to map SNPs to exact positions by aligning each flanking sequence to one or more of those contigs. Any hit with a cross_match_score/flanking_sequence_length ratio of less than 0.5 was removed, and the highest-scoring match was retained. Where there were two equally good matches, both were retained.
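The hit-selection rule in this paragraph can be written as a small filter. The sketch below assumes each cross_match hit has already been parsed into a contig name, a position and a score; the `Hit` record and `select_hits` function are hypothetical and not part of cross_match itself.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    contig: str
    position: int
    score: float


def select_hits(hits: List[Hit], flank_length: int,
                min_ratio: float = 0.5) -> List[Hit]:
    """Drop hits with score/flank_length below min_ratio, then keep the best.

    Hits that tie for the highest score are all retained.
    """
    kept = [h for h in hits if h.score / flank_length >= min_ratio]
    if not kept:
        return []
    best = max(h.score for h in kept)
    return [h for h in kept if h.score == best]
```

A SNP whose flanking sequence returns an empty list would be left unmapped under this reading.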

1. Bankier, A. T., Weston, K. M. & Barrell, B. G. Random cloning and sequencing by the M13/dideoxynucleotide chain termination method. Methods Enzymol 155, 51-93 (1987).

2. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74, 5463-7 (1977).

3. Rosenblum, B. B. et al. New dye-labeled terminators for improved DNA sequencing patterns. Nucleic Acids Res 25, 4500-4 (1997).

4. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8, 175-85 (1998).

5. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8, 186-94 (1998).

6. Bonfield, J. K., Smith, K. & Staden, R. A new DNA sequence assembly program. Nucleic Acids Res 23, 4992-9 (1995).

7. McMurray, A. A., Sulston, J. E. & Quail, M. A. Short-insert libraries as a method of problem solving in genome sequencing. Genome Res 8, 562-6 (1998).

8. Felsenfeld, A., Peterson, J., Schloss, J. & Guyer, M. Assessing the quality of the DNA sequence from the Human Genome Project. Genome Res 9, 1-4 (1999).

9. Beck, S. in Encyclopedia of the Human Genome (Nature Publishing Group, 2003).

10. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573-80 (1999).

11. Wain, H. M., Lush, M., Ducluzeau, F. & Povey, S. Genew: the human gene nomenclature database. Nucleic Acids Res 30, 169-71 (2002).

12. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).

13. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403-10 (1990).