Supplementary Figure S1. The WFDC locus in Muridae. (A) The long-range organization of the Wfdc locus on mouse chromosome 2qH3-qH4 is depicted [oriented relative to the centromere (CEN) and telomere (TEL)], and consists of the two indicated sub-loci separated by a 190-kb region containing unrelated genes (grey rectangle). The 195-kb centromeric Wfdc sub-locus contains the indicated eleven genes. The 181-kb telomeric Wfdc sub-locus contains the indicated eleven Wfdc genes. (B) A more detailed representation of the centromeric Wfdc sub-locus in mouse and rat is depicted, including the two genes immediately flanking this region on each side. Black arrows represent small serine protease inhibitor genes (Wfdc), open arrows represent seminal vesicle secreted protein genes (Svs), and hatched arrows represent non-related neighboring genes. The direction of all arrows reflects the transcriptional orientation. The two boxed areas reflect the Wfdc12-to-Svs4 and Svs3b-to-Svs5 paralogous duplicons, respectively. (C) The G+C content of the mouse centromeric Wfdc sub-locus is shown. The y-axis is scaled to reflect a minimum of 43% and maximum of 70% G+C content. Note the alternating genomic segments of low (I and III) and high (II and IV) G+C content. Also shown are the positions of CpG islands (defined as >200-bp stretches of sequence with a G+C content of >50% and an observed CpG to expected CpG ratio of >0.6). Note that panel C is drawn to scale relative to panel B and that this figure parallels Figure 1 (which depicts the same genomic region in primates).
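
The CpG-island criteria quoted in panel C can be illustrated with a minimal Python sketch. The legend gives only the three thresholds; the observed-to-expected formula used here (number of CpG dinucleotides × sequence length, divided by the product of the C and G counts) is the commonly used definition and is an assumption, as are the function names. The sketch tests a single candidate stretch and does not reproduce the window-scanning procedure used to generate the figure.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio (assumed formula, not stated in the legend):
    (number of CpG * length) / (number of C * number of G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so this count is exact
    return cpg_count * len(seq) / (c_count * g_count)


def meets_cpg_island_criteria(seq: str,
                              min_len: int = 200,
                              min_gc: float = 0.50,
                              min_ratio: float = 0.6) -> bool:
    """Apply the three thresholds from the legend to one candidate stretch."""
    return (len(seq) > min_len
            and gc_fraction(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)
```

For example, a hypothetical 320-bp stretch made of repeated "CCCCGGGG" units has 100% G+C content but an observed-to-expected CpG ratio of only 0.5, so it fails the third criterion.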

Supplementary Figure S2. Sequence conservation between paralogous segments in the centromeric WFDC sub-locus in different primates. Dot plots depict sequence conservation between the two paralogous segments that reside within the centromeric WFDC sub-locus in 12 primate species. For each primate species, the x- and y-axes represent the PI3-to-WFDC15d and WFDC12-to-PI3 paralogous duplicons, respectively. A schematic representation of the exon-based gene structures and pseudogenes is shown for each, with the positions of exons along the x-axis highlighted. Note that the letters and names for the genes are the same as in Figure 4. The two paralogous segments have significant coding and non-coding sequence conservation in all species (highlighted by dashed circles). This figure parallels Figure 2A (which depicts the human genomic region).

Supplementary Figure S3. Sequence conservation between orthologous segments in the centromeric WFDC sub-locus in different primates. The sequence of each primate centromeric WFDC sub-locus was divided into segments based on its architectural features (specifically, STK4-to-WFDC5, WFDC12-to-PI3, and PI3-to-SLPI). The sequence of each human segment was then compared with the corresponding orthologous sequence from each primate species using blastz (Schwartz et al. 2003b). The displayed dot plots depict sequence conservation between the human sequence (x-axis) and the corresponding orthologous segments (y-axis) for a representative group of primates (gorilla, colobus, owl monkey, and galago). The positions of exons along the x-axis are highlighted. The PI3-to-SLPI interval spans roughly 100 kb in human, 86 kb in colobus, 73 kb in owl monkey, and 82 kb in galago. Within this region, the fraction of the human sequence that aligns to the orthologous primate sequence is 97% for gorilla, 73% for colobus, 61% for owl monkey, and 30% for galago. The WFDC12-to-PI3 interval (45 kb in human) is drastically smaller than its paralogous PI3-to-SLPI counterpart in all primates, as shown for colobus (22 kb), owl monkey (17 kb), and galago (12 kb), species for which the fraction of aligning human sequence is only 49%, 42%, and 29%, respectively. In contrast, the regions flanking the centromeric WFDC sub-locus show more typical patterns of sequence conservation (e.g., with the fraction of aligning human sequence within the STK4-to-WFDC5 interval being 100% for gorilla, 99% for colobus, 94% for owl monkey, and 65% for galago). The estimated evolutionary divergence times since the last common ancestor for the species pairs shown are: human-gorilla, 7 million years ago (MYA); human-colobus, 25 MYA; human-owl monkey, 40 MYA; and human-galago, 63 MYA (Goodman et al. 1998).
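
One way to obtain a "fraction of aligning human sequence" such as those quoted above is to merge the human-coordinate intervals covered by local alignments and divide the covered length by the segment length. The legend does not state how the fractions were computed, so the following Python sketch is an assumption; the function name and the half-open (start, end) interval format are hypothetical, and no blastz output is parsed.

```python
from typing import Iterable, Tuple


def aligned_fraction(intervals: Iterable[Tuple[int, int]], seq_length: int) -> float:
    """Fraction of a reference segment covered by at least one alignment.

    intervals: half-open (start, end) positions on the reference segment
               (hypothetical input format).
    seq_length: length of the reference segment in bases.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # A gap precedes this interval: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the merged block so that
            # doubly aligned bases are counted only once.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / seq_length
```

Merging before summing matters in duplicated regions such as this one, where a single stretch of human sequence may align to more than one location in the other species.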

Supplementary Figure S4. Architecture of the SEMG-containing genomic region in different primates. Primates are grouped into four phylogenetic clades (hominoids, Old World monkeys, New World monkeys, and prosimians); note that the depicted branch lengths are not intended to reflect the evolutionary distances among the species. For the hominoids, all of the genes and pseudogenes within the WFDC12-to-WFDC15d interval are depicted and labeled; for the other three clades, all of the genes are depicted, but only the SEMG genes are labeled. Exon-based structures are represented for each gene. Grey X’s denote WFDC, trappin, and SEMG pseudogenes. The depiction of each SEMG-containing genomic region is not drawn to scale. The open rectangles indicate the positions of the four paralogous SEMG genomic blocks (Sgb1 to Sgb4), with the arrows below reflecting the relative orientation of each Sgb. Sgb3 and Sgb4 are truncated blocks that have lost all SEMG-coding sequences. Sgb1 is larger than previously reported (Ulvsback et al. 1992), and is frequently fragmented, as determined by comparison with the lemur sequence. Squirrel monkey has two SEMG1 genes, labeled “1a” and “1b”, and owl monkey has a single unique SEMG gene (a SEMG1-SEMG2 chimera, which is labeled with “*”). The SEMG gene fragments deleted in cotton top tamarin and owl monkey are depicted in light grey and demarcated by dashed lines. The cotton top tamarin genomic sequence was obtained from public databases (Lundwall and Olsson 2001). The hatched region within the galago SEMG2 gene corresponds to a roughly 3-kb galago-specific inserted sequence encoding 77 additional transglutaminase domains (each 13 amino acids in length). Note the rapid evolutionary divergence of the SEMG cluster in New World monkeys (see text for details).