Additional methods and discussion

Generating the Draft Genome Sequence

Screening for contamination in the sequence. Decontamination of the sequence reads was carried out in multiple stages:

(1) In initial partial assemblies, we looked for unexpected clustering and for plates of reads whose coverage by the rest of the assembly was anomalous (too low or too high). Such suspicious plates were aligned against probable contaminants.

(2) After the final assembly, we discarded supercontigs whose reads tended to be plate-mates to reads residing in the bottom 5% of the assembly. More specifically:

(a) We scored each physical plate by assigning it the ratio (# of reads on plate whose supercontig has length ≥1 Mb) / (# of reads on plate whose supercontig has length <1 Mb).

(b) We scored each read by assigning it the corresponding plate score.

(c) We scored each supercontig by assigning it the median of its reads' scores.

(d) We discarded supercontigs having score <10. There were 446 such supercontigs, of which 66% had score 0. They contained a total of 11,850 reads, 81% of which were accounted for by the 18 largest of the discarded supercontigs, all of which had score 0.
Note: We also checked for single-center contigs having probability <10⁻⁶, but all had already been discarded.
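The plate and supercontig scoring in steps (a)-(c) can be sketched as follows. This is an illustrative sketch, not the production code; the function and variable names, and the treatment of plates with no reads in small supercontigs, are ours.

```python
from statistics import median

def supercontig_scores(read_plate, read_supercontig, supercontig_len,
                       big=1_000_000):
    """Score supercontigs as in steps (a)-(c); step (d) then discards
    those scoring below 10.

    read_plate:       read id -> physical plate id
    read_supercontig: read id -> supercontig id
    supercontig_len:  supercontig id -> length in bases
    """
    # (a) Plate score: (#reads whose supercontig is >=1 Mb) divided by
    #     (#reads whose supercontig is <1 Mb).
    big_reads, small_reads = {}, {}
    for read, plate in read_plate.items():
        length = supercontig_len[read_supercontig[read]]
        bucket = big_reads if length >= big else small_reads
        bucket[plate] = bucket.get(plate, 0) + 1
    plate_score = {}
    for plate in set(read_plate.values()):
        n_small = small_reads.get(plate, 0)
        # A plate with no reads in small supercontigs gets an
        # effectively infinite score and is never suspicious.
        plate_score[plate] = (big_reads.get(plate, 0) / n_small
                              if n_small else float("inf"))
    # (b) Each read inherits its plate's score; (c) each supercontig
    #     gets the median of its reads' scores.
    per_supercontig = {}
    for read, sc in read_supercontig.items():
        per_supercontig.setdefault(sc, []).append(plate_score[read_plate[read]])
    return {sc: median(scores) for sc, scores in per_supercontig.items()}
```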

(3) We deleted supercontigs having at least seven reads, all from the same library, and supercontigs that, together with all supercontigs linked to them, had at least seven reads, all from the same library. These steps were applied to the assembly prior to the deletion of tiny supercontigs, and the resulting list was translated back to the final assembly.

(4) Suppose that all the reads from a supercontig come from a single center, and moreover that all the reads in all the supercontigs it links to come from the same center. Then the supercontig was deleted. This was applied to the assembly prior to the deletion of tiny supercontigs, and the list was then translated to the final (reduced) assembly.
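The deletion rules of steps (3) and (4) share the same shape and can be sketched with one helper. This is our own simplified illustration: with reads labeled by library and min_reads = 7 it corresponds to step (3), and with reads labeled by center and min_reads = 1 to step (4).

```python
def single_source_supercontigs(sc_read_sources, links, min_reads=7):
    """Flag a supercontig for deletion when it, together with every
    supercontig linked to it, has at least `min_reads` reads that all
    carry the same source label (library or center).

    sc_read_sources: supercontig id -> list of source labels, one per read
    links:           supercontig id -> set of linked supercontig ids
    """
    flagged = set()
    for sc in sc_read_sources:
        group = {sc} | links.get(sc, set())
        labels = [lab for g in group for lab in sc_read_sources.get(g, [])]
        if len(labels) >= min_reads and len(set(labels)) == 1:
            flagged.add(sc)
    return flagged
```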

(5) We removed suspected contaminants based on alignments to human sequence. This included looking for regions of the human genome that had too many mouse reads aligning to them.

Almost all the contamination was removed by step (1). The remaining steps removed a total of 1,902 contigs, totaling 3.6 Mb of sequence. Very little was removed by step (5), although its method was used to tune steps (2), (3), and (4). In short, the decontamination methods were almost entirely intrinsic, based on internal inconsistencies rather than on alignment to specific contaminants. In spite of these precautionary steps, we note that the unanchored part of the assembly is necessarily enriched for errors of many kinds.

Genome size: Euchromatic genome size was estimated from the scaffolds and captured gaps, which suggest a genome size of 2.5 Gb. A small fraction of the unanchored part of the assembly will also contribute to the euchromatic size of the genome. It is currently hard to estimate exactly how much of the unanchored sequence will fall in uncaptured gaps and at centromeric and telomeric ends that contribute to the euchromatic part of the genome, but we believe the majority will fall in captured gaps or in heterochromatic regions. Thus, we suggest that the genome size is 2.5 Gb or slightly larger.

Comparison to Mural et al. Chromosome 16. Finished sequence used for comparison: B6 BACs AC079043, AC079044, AC083895, AC087541, AC087556, AC087840, AC087899, AC087900, AC098735, AC098883, and 129 BACs AC000096, AC003060, AC003062, AC003063, AC003066, AC005816, AC005817, AC006082, AC008019, AC008020, AC010001, AC012526. We note that the BACs used for evaluation purposes in Mural et al.1 were actually from Chromosome 6.

Conservation of Synteny Between Mouse and Human Genomes

Identification of orthologous landmarks. Full genomic alignments of the masked mouse (MGSCv3) and human (NCBI build 30) assemblies were carried out using the PatternHunter program2.

Only those alignments that were:

(1) high scoring, i.e., scoring ≥40 according to a standard additive scoring scheme: match = +1, mismatch = –1, gap open = –5, gap extend = –1; and

(2) bidirectionally unique at this scoring threshold

were used to identify orthologous landmarks in both genomes.
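The additive scoring scheme of criterion (1) can be written out directly. This is a sketch of the filter only; the function names are ours, and the bidirectional-uniqueness test of criterion (2) is not modeled.

```python
def alignment_score(matches, mismatches, gap_opens, gap_extends):
    """Additive score from criterion (1): match = +1, mismatch = -1,
    gap open = -5, gap extend = -1."""
    return matches - mismatches - 5 * gap_opens - gap_extends

def passes_threshold(matches, mismatches, gap_opens, gap_extends,
                     threshold=40):
    """An alignment is kept only if it scores >= 40; it must additionally
    be bidirectionally unique at this threshold (not checked here)."""
    return alignment_score(matches, mismatches,
                           gap_opens, gap_extends) >= threshold
```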

Identification of syntenic blocks and segments. We first identified syntenic blocks; from these we derived the collection of syntenic segments. Geometrically, syntenic blocks correspond to rectangular regions in the mouse/human dot plots, while segments are curves with clear directionality within each rectangle. Syntenic blocks are therefore defined by interchromosomal discontinuities, while syntenic segments are determined by intrachromosomal rearrangements, typically inversions.

A syntenic block of size X:

(1) is a contiguous region of at least size X along a mouse chromosome that is paired to a contiguous region in a human chromosome, also of size X or larger;

(2) for which all interruptions by other chromosomal regions (in either genome) are less than size X. Size can be measured either in terms of genomic extent in bases or as the number of consecutive orthologous landmarks.

Our methodology constructs low-resolution syntenic blocks (large size cutoff) from high-resolution blocks. For example, at the highest resolution possible, every anchoring alignment is allowed to define or interrupt a syntenic block. To then obtain blocks defined by at least two consecutive landmarks in both genomes, singletons in either genome are identified in the highest-resolution list and absorbed into pre-existing larger blocks.
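One coarsening step of this kind can be sketched as follows. This is a simplified illustration of our own (labels are partner-chromosome ids only; real blocks also track coordinates and the mouse/human pairing): group consecutive landmarks hitting the same partner chromosome, absorb runs below the cutoff, and re-merge equal neighbours.

```python
def coalesce_blocks(anchor_chroms, min_run=2):
    """Coalesce an ordered list of landmarks, each labelled with the
    partner chromosome it hits, into blocks at a given size cutoff
    (measured here in consecutive landmarks)."""
    # Group consecutive landmarks with the same partner chromosome.
    runs = []
    for chrom in anchor_chroms:
        if runs and runs[-1][0] == chrom:
            runs[-1][1] += 1
        else:
            runs.append([chrom, 1])
    # Drop runs below the cutoff (treated as absorbed interruptions),
    # then re-merge neighbouring runs with equal labels.
    merged = []
    for chrom, n in runs:
        if n < min_run:
            continue  # absorb the short interruption
        if merged and merged[-1][0] == chrom:
            merged[-1][1] += n
        else:
            merged.append([chrom, n])
    return [(c, n) for c, n in merged]
```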

In a similar manner, our methodology will coalesce smaller blocks into larger blocks for any size cutoff while keeping segment boundaries as stable as possible. In our algorithm, one genome is selected as the reference genome and determines the order in which the blocks are listed. However, the blocks themselves are independent of the choice of reference genome. In fact, changing reference frame from mouse to human provided a non-trivial consistency check on the construction of the syntenic blocks. A syntenic segment of size X in mouse:

(1) is always contained within a syntenic block;

(2) exhibits clear directionality, with at least four successive markers in strictly increasing or strictly decreasing order in both genomes;

(3) is interrupted only by segments smaller than X in mouse.

Note that there is no size restriction placed on the corresponding human extent.
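The directionality requirement of criterion (2) can be checked mechanically. The sketch below (our own illustration; function names are assumptions) walks the markers in mouse order and looks for a strictly monotonic run of partner-genome positions of the required length.

```python
import operator

def has_directionality(partner_positions, min_run=4):
    """Criterion (2): walking the markers in mouse order, require at
    least `min_run` successive markers whose partner-genome positions
    are strictly increasing or strictly decreasing."""
    def longest_run(cmp):
        best = run = 1
        for a, b in zip(partner_positions, partner_positions[1:]):
            run = run + 1 if cmp(a, b) else 1
            best = max(best, run)
        return best
    if not partner_positions:
        return False
    return max(longest_run(operator.lt), longest_run(operator.gt)) >= min_run
```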

Intrachromosomal rearrangements within syntenic blocks were grouped into syntenic segments by aggregating at successively coarser scales. More care is required when coalescing segments, compared to blocks, to ensure that the resulting segments are truly reciprocal (one mouse region paired to only one human region and conversely).

When defining segments, we excluded isolated outliers that seemed likely to be attributable to misassemblies or sequencing errors, a typical case being a single misplaced BAC. However, the fate of every syntenic landmark, including apparent outliers, was kept as part of a 'syntenic roadmap' to facilitate the coordinated, simultaneous navigation of both genomes.

Further details and dot plots can be found at:

Estimation of the minimal number of rearrangements. The estimate of the number of rearrangements is based on the Hannenhalli-Pevzner theory for computing a most parsimonious (minimum number of inversions) scenario to transform one uni-chromosomal genome into another. This approach was further extended to find a most parsimonious scenario for multi-chromosomal genomes under inversions, translocations, fusions, and fissions of chromosomes3,4. We used a fast implementation of this algorithm5, available via the GRIMM web server at:

to analyze the human–mouse rearrangement scenario. Although the algorithm finds a most parsimonious scenario, the real scenario is not necessarily a most parsimonious one, and the order of rearrangement events within a most parsimonious scenario often remains uncertain. Availability of three or more mammalian genomes could remedy some of these limitations and provide a means to infer the gene order in the mammalian ancestor6.

The key element of the Hannenhalli-Pevzner theory is the notion of the breakpoint graph, which captures the relationships between different breakpoints (versus the analysis of individual breakpoints in previous studies). The breakpoint graph provides insights into rearrangements that may have occurred in the course of evolution. Some of these rearrangements are almost 'obvious', while others involve long series of interacting breakpoints and provide evidence for extensive breakpoint re-use in the course of evolution.
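The raw ingredient underlying the breakpoint graph is the breakpoint count itself. The sketch below is our own illustration, not the Hannenhalli-Pevzner algorithm (which GRIMM implements by analyzing cycles in the breakpoint graph); it only counts breakpoints of a signed gene order relative to the identity.

```python
def breakpoints(signed_perm):
    """Count breakpoints of a signed gene order relative to the
    identity 1..n: frame the order with 0 and n+1, then count adjacent
    pairs (a, b) that are not consecutive, i.e. b != a + 1.  The
    Hannenhalli-Pevzner distance refines this count using the cycle
    structure of the breakpoint graph."""
    n = len(signed_perm)
    framed = [0] + list(signed_perm) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b != a + 1)
```

For example, the order [-3, -2, -1] is one inversion away from the identity and has two breakpoints, at its ends.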

5. Genome Landscape

GC Content. The human genome NCBI build 30 assembly and mouse genome MGSCv3 assembly were taken as genomic sequence. Both include runs of Ns for gaps in the sequence of known, estimated, or unknown length, including centromeric sequence.

For analyses done on a genome wide or chromosomal scale, GC content was measured as the total number of G or C bases in the genome or chromosome divided by the total number of non-N bases in the same sequence.

For windowed analyses (histograms, correlations), each genome sequence was broken into non-overlapping, abutting windows of fixed size (20 kb for GC distributions, ~100 kb for syntenic windows, 320 kb for correlation with gene density) starting at the centromere for acrocentric chromosomes and at the distal end of the p arm for metacentric chromosomes. All windows are of identical size except the last window on the distal end (or distal q end) of each chromosome, which contains the remainder bases regardless of number. Windows were analyzed for GC content without regard to number of non-N bases; however, any window with fewer non-N bases than 50% of the nominal window size was eliminated to prevent artificially high variance in the distribution. This eliminated no more than 1.7% of the non-null windows (2–7% of windows, depending on organism and window size, consisted entirely of Ns as placeholders for centromeres) or 0.7% of the total non-N bases for any organism/window combination, and never changed the global GC content value by more than 0.01%. The actual average number of non-N bases per remaining window was 19,155 for mouse and 19,872 for human for 20 kb windows, and 300,825 and 314,366 respectively for 320 kb windows.
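The windowed GC measurement with the 50% non-N filter can be sketched as follows (an illustration under our own naming, with GC measured over non-N bases only, as in the text):

```python
def window_gc(seq, window=20_000, min_frac=0.5):
    """GC content of fixed, abutting windows.  GC is counted over non-N
    bases only; a window with fewer non-N bases than min_frac times the
    nominal window size is eliminated.  The last window simply holds
    the remainder of the sequence."""
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window].upper()
        non_n = sum(1 for b in chunk if b != "N")
        if non_n < min_frac * window:
            continue  # too gappy (or too short) to score reliably
        gc = sum(1 for b in chunk if b in "GC")
        results.append((start, gc / non_n))
    return results
```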

Analysis of GC content of syntenic regions started with the high quality bidirectionally unique anchors within syntenic segments (see above). Windows were selected as for the single genome analysis, starting at the centromere of each mouse chromosome. Regions where no clear synteny was present were skipped. We then selected non-overlapping, abutting windows that were exactly 100 kb in the mouse sequence and interpolated the equivalent human position using syntenic anchors and the average anchor spacing over the region. Due to small inversions in the syntenic anchor order, some regions in the human may overlap. The actual average size of the human windows was 110 kb, as expected from the distribution of syntenic anchors. These regions were then analyzed for GC content separately in each organism as described above. Pairs of windows in which either organism had fewer than 50,000 non-N bases were discarded, effectively eliminating all regions that were 2-fold or more shorter in human than in mouse. Regions that were 2-fold longer in human were not eliminated, but account for only 3% of windows.

For binned analyses of syntenic GC content, one organism was taken as the reference organism and all of its windows binned in 1% increments centered on an integral percent GC (i.e., 39.5–40.5). The GC distribution statistics of the second organism were then calculated by window using all windows syntenic to each bin in the reference. No attempt was made to adjust for different sizes and fractions of non-N bases in the windows.

For correlation of GC content with gene density, we took the Ensembl sets of mouse and human gene predictions (mouse release 7.3b.2, July 12, 2002, and human release 8.30.1, September 2, 2002). This gave us 22,444 mouse genes and 22,920 human genes (60 human genes from the full Ensembl set could not be used because they were predicted on unlocalized contigs not included in NCBI genome build 30). These genes were then assigned to the same 320-kb bins in which GC content had been measured. If a gene spanned more than one bin, it was fractionally assigned in proportion to the fraction of its total transcript length lying within each bin (so the total over all genes in all bins is the total number of genes, but bins may contain fractional numbers of genes).
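The fractional assignment of genes to bins can be sketched as follows (our own illustration; genes are reduced to (start, end) transcript extents, end exclusive, on one chromosome):

```python
def fractional_gene_counts(genes, bin_size=320_000):
    """Assign each gene to bins in proportion to the part of its
    transcript extent falling in each bin, so a bin may hold a
    fractional gene count but the counts sum to the number of genes.

    genes: iterable of (start, end) transcript extents, end exclusive.
    Returns {bin_index: fractional gene count}.
    """
    bins = {}
    for start, end in genes:
        total = end - start
        b = start // bin_size
        while b * bin_size < end:
            # Overlap of the gene with bin b, as a fraction of the gene.
            lo = max(start, b * bin_size)
            hi = min(end, (b + 1) * bin_size)
            bins[b] = bins.get(b, 0.0) + (hi - lo) / total
            b += 1
    return bins
```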

CpG Islands. CpG islands were identified on masked versions of MGSCv3 and NCBI build 30 using a modification of the program used in ref. 7 (K. Worley and L. Hillier, personal communication). This program uses the definition of CpG islands proposed by Gardiner-Garden and Frommer8: at least 200 bp containing ≥50% GC and a ratio of observed to expected CpG sites of ≥0.6 (based on local GC content).
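The Gardiner-Garden and Frommer criteria for a single candidate window can be sketched as below. This is our own illustration of the published definition, not the modified program used here; a full island finder would slide windows of at least 200 bp and merge qualifying hits.

```python
def meets_cpg_criteria(window_seq, min_gc=0.5, min_obs_exp=0.6):
    """Test one window against the Gardiner-Garden and Frommer criteria:
    GC fraction >= 50% and observed/expected CpG >= 0.6, where the
    expectation is (#C * #G) / length, i.e. based on local GC content."""
    seq = window_seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n == 0 or c == 0 or g == 0:
        return False
    observed_cpg = seq.count("CG")
    expected_cpg = c * g / n
    return (c + g) / n >= min_gc and observed_cpg / expected_cpg >= min_obs_exp
```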

The calculations were also run independently varying minimum GC from 46–54%, o/e from 0.4–0.8, and length from 100–400 bp. While parameter shifts in o/e and length requirements significantly altered the total number of islands found in each organism, there was negligible effect on the ratio of islands between the two organisms. Changes to the minimum GC resulted in a very small change in the number of islands found, as the vast majority of islands in both organisms significantly exceed this threshold.

Expansion ratio. Syntenic windows were determined as above. The ratio mouse/human was calculated for all windows. Windows with a ratio <0.25 or >4 were excluded from calculations/plots.

6. Repeats

Additional legend Figure 10. Age distribution of interspersed repeats (IRs) in the mouse and human genomes. Bases covered by each repeat class were sorted by the estimated substitution level from their respective consensus sequences. Divergence levels from the RepeatMasker output were adjusted to account for 'mismatches' resulting from ambiguous bases in the consensus and genomic sequences. Sequencing gaps represented by strings of 100–500 Ns were often overlapped by the matches, which would lead to huge overestimates of the divergence levels if not adjusted for. Since CpG->TpG transitions are about 10-fold more likely to occur than all substitutions combined at another site, repeats with many CpG sites (like Alu) are more diverged than those of the same age with few CpGs. We estimated the divergence level excluding CpG->TpG transitions (Drest) from the adjusted observed divergence level (Dobs) and the CpG frequency in the consensus (Fcg) by Drest = Dobs/(1 + 9Fcg), with a minimum Drest of Dobs – Fcg. The substitution level K (which includes superimposed substitutions) was calculated with the simple Jukes-Cantor formula K = –(3/4)ln(1 – (4/3)Drest). Panels (b) and (d) show the repeats grouped into bins of approximately equal time periods. On average, the substitution level has been 2-fold higher in the mouse than in the human lineage (Table 6), but currently it may differ over 4-fold. Compared to the previous version, the scale on the x axis in panel (b) is larger, as we estimate in this paper that the substitution level in mouse since the human-mouse speciation is at least 35%. Also, the time periods in panels (b) and (d) are smaller, assuming a speciation time of 75–80 Mya rather than 100 Mya.
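The CpG adjustment and Jukes-Cantor correction described in the legend combine as follows (a direct transcription of the two formulas, with our own function name):

```python
from math import log

def substitution_level(d_obs, f_cg):
    """Substitution level K from the adjusted observed divergence d_obs
    and the CpG frequency f_cg of the consensus: first remove the
    CpG->TpG excess,
        D_rest = d_obs / (1 + 9 * f_cg), with a minimum of d_obs - f_cg,
    then apply the Jukes-Cantor formula,
        K = -(3/4) * ln(1 - (4/3) * D_rest)."""
    d_rest = max(d_obs / (1 + 9 * f_cg), d_obs - f_cg)
    return -0.75 * log(1 - (4.0 / 3.0) * d_rest)
```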

Additional legend Table 6. Divergence levels of 18 families of IRs that shortly predate the human-mouse speciation. Their copies are found at orthologous sites in mouse and human, while having a relatively low divergence level or representing the youngest members of the evolutionary tree of LINE1 and MaLR elements. Shown are the number of kilobases matched by each subfamily (kb), the median divergence (mismatch) level of all copies from the consensus sequence (div), the interquartile range of these mismatch levels (range), and a Jukes-Cantor estimate of the substitution level to which the median divergence level corresponds (JC). The two rightmost columns contain the ratio of the JC substitution level in mouse over human, and an 'adjusted ratio' of the mouse and human substitution levels after subtraction of the approximate fraction accumulated in the common human-mouse ancestor. Many factors influence these numbers. For example, AT-rich LINE1 copies appear less diverged than the GC-richer MaLR and DNA transposon families of the same age, primarily because GC->AT substitutions are more common than AT->GC substitutions, especially in the AT-rich DNA where most LINE copies reside7. Early rodent-specific L1 and MaLR subfamilies are not yet defined, so their copies were matched to the consensus sequences in the table (note that the youngest L1 subfamily, L1MA6, has a relatively large amount of DNA matched to it). The associated, unduly high mismatch levels (L1 evolves faster than the neutral rate!) will increase the rodent median and the substitution level ratio. On the other hand, inaccuracies in the consensus and unrepresented minor ancient subfamilies contribute equally to the observed mismatches in both species and cause the ratio to be smaller.

Three more important factors cause a significant underestimate of the substitution level in mouse compared to human. First, part of the substitutions in older families accumulated in the common ancestor. The difference in substitution level between a family and the least diverged family in its class estimates this fraction, which is subtracted before calculating the ratio in the last column of the table. Second, by assuming that all substitutions are equally likely, the Jukes-Cantor formula significantly underestimates the number of superimposed substitutions at higher divergence levels. For example, when realistic substitution patterns are considered, 30% mismatches to an average DNA sequence in an average environment correspond to a 41% rather than a 37–38% substitution level. Finally, there is undoubtedly an ascertainment bias toward the least diverged copies of a repeat family in mouse. Depending on the length of the match, a 30–35% mismatch level is about the maximum that RepeatMasker can detect, so the more diverged copies are not tallied. The above suggests that the ratio of substitution rates in the lineages leading to human and mouse is at least 2.0-fold.