Interspersed Repeats

2005-10-11925A

Supplemental Information

Interspersed repeats

The repeat content of HSA11 consists of 21.1% LINEs, 13.3% SINEs, 8.1% LTR elements and 2.7% DNA elements, with the remaining 2.7% composed of small RNAs, satellite repeats, simple repeats, low complexity sequence and VNTRs (Table S5). The percentage of Alu repeats (9.42%) is slightly lower than the genome average of 10.6%; all other elements are very close to the genome averages. The Tandem Repeat Finder (TRF) program1 is often used to detect tandem repeats larger than six bp; however, the bulk of its results overlap with those from the RepeatMasker program (A.F.A. Smit, R. Hubley & P. Green). In the case of HSA11, of the nearly 3 Mb predicted by TRF, 83% overlaps with RepeatMasker annotations as follows: satellite repeats - 36.87%, simple repeats - 20.94%, SINEs - 10.11%, LINEs - 4.52%, LTR elements - 4.00%, low complexity - 3.28%, unclassified repeats - 2.74%, and DNA elements - 0.58%. The remaining 17% (511 Kb) consists of tandem repeats that were predicted by TRF only.
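
As a quick consistency check on the figures above, the per-class overlap percentages can be summed and converted to an approximate length. This is only a sketch: the variable names are illustrative, and "nearly 3 Mb" is treated as exactly 3.0 Mb, which is an assumption rather than a value from the analysis.

```python
# Share of TRF-predicted sequence overlapping each RepeatMasker class (percent),
# copied from the paragraph above.
trf_overlap_pct = {
    "satellite repeats": 36.87,
    "simple repeats": 20.94,
    "SINEs": 10.11,
    "LINEs": 4.52,
    "LTR elements": 4.00,
    "low complexity": 3.28,
    "unclassified repeats": 2.74,
    "DNA elements": 0.58,
}

total_overlap_pct = sum(trf_overlap_pct.values())
trf_only_pct = 100.0 - total_overlap_pct

# Assumption: the rounded "nearly 3 Mb" TRF total is used as 3.0 Mb.
trf_total_mb = 3.0
trf_only_kb = trf_total_mb * 1000 * trf_only_pct / 100

print(f"overlap with RepeatMasker: {total_overlap_pct:.2f}%")
print(f"TRF only: {trf_only_pct:.2f}% (~{trf_only_kb:.0f} Kb with the rounded total)")
```

Because the TRF total is rounded here, the computed TRF-only length is close to, but not identical to, the 511 Kb quoted in the text.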

Gene catalog

The 1524 protein-coding genes have an average exon length of 301 bp. The largest number of exons, at least 64, was found in the ATM gene, which encodes a phosphatidylinositol-3 kinase; mutations in this gene affect DNA damage repair and can lead to the autosomal recessive disorder ataxia telangiectasia. The longest intron, just over 589 Kb, was found in the OPCML gene, which encodes a protein that binds opioid alkaloids in the presence of acidic lipids. The longest gene on HSA11 is DLG2, a membrane-associated guanylate kinase family member, which spans nearly 2.17 Mb over 28 exons and encodes an 870 amino acid protein.

Overlapping genes

We found many cases of internal genes (genes within genes) and overlapping genes (neighboring genes that overlap by at least one base pair, usually in their untranslated 5’- and 3’-ends, and most commonly in opposite orientation to one another). In most cases, the overlaps reported here are supported by multiple mRNA or EST sequences. In addition to 242 pseudogenes and RNA genes contained within another gene, we identified 369 protein-coding genes with some form of overlap with at least one other gene. Of these, 90 genes were found to be completely contained within 42 surrounding genes, 182 genes showed partial overlaps with other genes, 19 genes had a mixed overlap type (i.e., they contained at least one gene and partially overlapped another), and 36 genes were involved in co-transcription or read-through (that is, 12 cases, each involving the joining of a pair of genes). The total genomic overlap is 1.9 Mb, with the longest overlap, 43.3 Kb, occurring between BDNF (brain-derived neurotrophic factor) and BDNFOS (brain-derived neurotrophic factor opposite strand). BDNFOS is a non-coding RNA in reverse orientation to BDNF and may regulate the expression of this gene. Both genes have distinct CpG islands at their 5'-ends. BDNF is induced by cortical neuron activity and is necessary for survival of striatal neurons in the brain2,3. This gene may also be important in the regulation of the stress response and in the biology of mood disorders4. We emphasize that each of the above cases will require careful analysis to confirm the true boundaries of the genes; some cases of apparent overlap may reflect inaccurate determination of gene boundaries. The rate of overlapping genes found here, 24%, is similar to previously published observations5.
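
The overlap categories described above (contained versus partial) can be expressed as simple interval comparisons. The following is a minimal sketch, not the procedure used for the annotation: the `Gene` type, the function names, and all coordinates are hypothetical, and strand is carried along but not used in the classification.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str

def overlap_bp(a: Gene, b: Gene) -> int:
    """Number of shared base pairs between two genes (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def classify(a: Gene, b: Gene) -> str:
    """Label a pair as 'none', 'contained' (one inside the other) or 'partial'."""
    if overlap_bp(a, b) == 0:
        return "none"
    a_holds_b = a.start <= b.start and b.end <= a.end
    b_holds_a = b.start <= a.start and a.end <= b.end
    return "contained" if (a_holds_b or b_holds_a) else "partial"

# Hypothetical coordinates, for illustration only.
outer = Gene("geneA", 1_000, 50_000, "+")
inner = Gene("geneB", 10_000, 12_000, "-")
neighbor = Gene("geneC", 49_500, 80_000, "-")

print(classify(outer, inner))       # contained
print(classify(outer, neighbor))    # partial
print(overlap_bp(outer, neighbor))  # 501

# 369 overlapping genes out of 1524 protein-coding genes is consistent with the quoted 24%.
print(f"{369 / 1524:.1%}")
```

A gene with the mixed overlap type would simply be one that is labelled "contained" with respect to one partner and "partial" with respect to another.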

Clustered gene families

In addition to the olfactory receptor genes, we identified 142 genes (140 expressed, two pseudogenes) in 37 clusters with at least two members from the same gene family (Table S7). The cluster with the most members, 13, is from the MS4A family, proteins with at least four potential transmembrane domains and N- and C-terminal cytoplasmic domains. The apolipoprotein gene cluster on 11q23 consists of four members which are components of high-density lipoprotein. These genes are strongly associated with plasma triglyceride levels and are major risk factors for various disorders including coronary artery disease, hypertriglyceridemia, and systemic non-neuropathic amyloidosis. They have also been associated with blood glucose, plasma lipoprotein levels, total cholesterol, and triglycerides in a gender-specific manner. The beta hemoglobin gene cluster on 11p15.4 consists of five members, in 5' to 3' order HBE1 (epsilon 1), HBG2 (gamma G), HBG1 (gamma A), HBD (delta), and HBB (beta). Along with the alpha globin gene cluster located on chromosome 16, which also contains five members, these loci determine the structure of the two types of polypeptide chains in adult hemoglobin, Hb A. The normal adult hemoglobin tetramer consists of two alpha chains and two beta chains. Mutations in beta-globin can lead to a number of disorders, including sickle cell anemia, beta thalassemia and erythrocytosis. In addition, there are two ultrahigh-sulfur keratin-associated protein (KRTAP5) gene clusters at 11p15.5 (6 members) and 11q13.4 (7 members) which resulted from an intrachromosomal duplication on HSA116. These genes show preferential expression in human hair root, suggesting they are required for hair formation. All of the KRTAP5 genes are highly conserved in chimpanzee. Interestingly, in mouse they are part of two non-adjacent, significantly larger blocks of synteny to chromosome 7; however, there is only one KRTAP cluster on mouse chromosome 7. The two human clusters lie adjacent to synteny breakpoints which, in mouse, are found in proximity to the single cluster.

Pseudogenes and RNAs

We annotated 765 pseudogenes, including 205 olfactory receptor pseudogenes, 558 non-olfactory receptor pseudogenes, and two tRNA pseudogenes. Most of the non-olfactory receptor pseudogenes identified in this report were derived from the "Retroposed Genes" UCSC genome browser track (Supplemental Methods). The average pseudogene spans about 1.2 Kb, and all but three are currently annotated as processed pseudogenes. The 203 olfactory receptor pseudogenes are the exception and have arisen by duplications of large chromosomal domains followed by extensive gene duplication and divergence. 204 of the pseudogenes are internal to other expressed genes. TRIM5, a known gene which spans over 275 Kb, contains six olfactory receptor pseudogenes, the TRIMP1-2 pseudogene, nine expressed olfactory receptor genes, and the TRIM22 gene.

Of the 60 RNA genes, 38 are internal to other genes, with the most extreme cases being eight small nucleolar RNAs which are internal to the novel transcript predicted from accession AK095849, and nine (eight small nucleolar, one small Cajal body-specific) RNAs which are internal to the novel CDS KIAA1731. Eight predicted tRNAs are located in a small cluster around 59.08 Mb.

Gene deserts

According to the criteria of Ovcharenko et al.7, we annotated 19 gene deserts greater than 651 Kb, the longest being over 3.4 Mb (36,652,967-40,088,070 bp) flanked by genes RAG1 and AK127441, a novel transcript (Table S17). In total the 19 gene deserts, five of which are less than 100 Kb apart from one another, account for 23.4 Mb of the HSA11 sequence. These gene deserts contain 88 pseudogenes and one RNA, but there are no annotated expressed genes of any type within them.
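
To illustrate the length-based part of this definition, the sketch below scans a list of gene coordinates for gene-free intervals longer than the 651 Kb cutoff quoted above. It is a simplification under stated assumptions: the published criteria may involve more than a length threshold, the function name and the gene coordinates are hypothetical, and the toy input is constructed so that one gap coincides with the longest desert reported in the text.

```python
MIN_DESERT_BP = 651_000  # length cutoff quoted in the text

def gene_deserts(genes, min_len=MIN_DESERT_BP):
    """Return (start, end, length) for gene-free intervals longer than min_len.

    genes: iterable of (start, end) coordinates, 1-based inclusive.
    Overlapping or nested genes are handled by tracking the furthest end seen.
    """
    deserts = []
    furthest_end = None
    for start, end in sorted(genes):
        if furthest_end is not None and start > furthest_end + 1:
            gap_start, gap_end = furthest_end + 1, start - 1
            length = gap_end - gap_start + 1
            if length > min_len:
                deserts.append((gap_start, gap_end, length))
        furthest_end = end if furthest_end is None else max(furthest_end, end)
    return deserts

# Hypothetical gene coordinates, for illustration only.
toy_genes = [
    (36_500_000, 36_652_966),
    (36_600_000, 36_640_000),   # nested inside the first gene
    (40_088_071, 40_200_000),
    (40_300_000, 40_400_000),   # preceded by a gap far below the cutoff
]
print(gene_deserts(toy_genes))  # [(36652967, 40088070, 3435104)]
```

The single interval returned spans 3,435,104 bp, which matches the "over 3.4 Mb" length given for the longest desert.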

In a separate analysis8, we identified two neighboring large, ancient duplications, which are also conserved in mouse and dog. These duplications, from 22.8 to 24.8 Mb (2 Mb) and from 24.8 to 26.3 Mb (1.5 Mb), completely overlap with two of the gene deserts we identified here. These intrachromosomal duplications are composed of long intermittent sequences with similarity as low as 60%, and they suggest that some gene deserts originated from duplications of segments lacking genes in a mammalian common ancestor.

CpG islands

CpG islands are unmethylated regions of the genome that are associated with the 5'-ends of many house-keeping genes and regulated genes. Of the 1369 calculated CpG islands9 (Supplemental Methods, Table S11), 895 are associated with expressed genes, including 781 (58.7%) known genes, 53 (51%) novel CDSs, 60 (27.1%) novel transcripts, and one (25%) putative gene. 806 of these genes contain CpG islands in or near their 5'-ends, 23 in their 3'-ends, 60 internally, and six are completely encompassed by the CpG island. 291 genes share a CpG island with at least one other gene, which means they may be under the same regulatory control. Of the 157 shared CpG islands, 149 are shared by two genes and eight are shared by three genes. The longest CpG island on HSA11, which is highly conserved in chimp, dog, mouse, and rat, is 7,460 bp and is found at the 5'-end of CCND1. CCND1, or cyclin D1, is a member of the highly conserved cyclin family, whose members are characterized by a dramatic periodicity in protein abundance throughout the cell cycle. Mutations, amplification and overexpression of this gene, which alter cell cycle progression, are observed frequently in a variety of tumors and may contribute to tumorigenesis.
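
CpG island definitions are commonly based on two sequence statistics, GC content and the ratio of observed to expected CpG dinucleotides. The sketch below computes both for a sequence window. The thresholds and window lengths actually used for the CpG island set described here are not stated in this section, so none are applied; the function name and the toy sequences are hypothetical.

```python
def cpg_stats(seq: str):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    observed/expected = (count of CpG * length) / (count of C * count of G)
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

# Toy sequences, for illustration only.
print(cpg_stats("CGCGATAT"))  # (0.5, 4.0)
print(cpg_stats("ATATATAT"))  # (0.0, 0.0)
```

In practice such statistics would be evaluated over sliding windows along the chromosome and combined with a minimum-length requirement.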

Imprinting

While CpG islands are not normally methylated, there are several cases in the human genome where methylation occurs on one of the parental alleles, violating the usual rule of inheritance that both alleles in a heterozygote are equally expressed. This phenomenon is called genomic imprinting. When a gene is suppressed through imprinting from one parent, and the allele from the other parent is not expressed because of mutation, the child will be deficient for that gene. The 11p15.5 region of HSA11 contains two of the best-studied reciprocally imprinted genes, H19 and IGF2, whose disruption can lead to Beckwith-Wiedemann syndrome, the cardinal features of which are exomphalos, macroglossia, and gigantism in the neonate. Mutations in several imprinted genes in this region can lead to this syndrome, as well as to others. Table S18 lists all of the imprinting-related genes on HSA11 as derived from the OMIM database.

Duplication Analysis

We performed a detailed analysis of duplicated genomic sequence (≥90% sequence identity and ≥1 kb in length) comparing HSA11 against the May 2004 assembly of the human genome (Supplemental Methods). We estimated that 4.23% (5.55 Mb) of HSA11 consists of segmental duplications (Tables S19, S20). Compared to other finished chromosomes as well as the genome average (5.3%), HSA11 is not enriched for segmental duplications. Unlike the genome-wide distribution, in which the aligned base pairs of interchromosomal duplications are slightly lower than those of intrachromosomal duplications, duplications in HSA11 are predominantly interchromosomal: 14.14 Mb out of 17.75 Mb of aligned base pairs and 1399 out of 1667 pairwise alignments are with non-homologous chromosomes (Fig. S3, Table S19). While the duplications with higher divergence (>0.08) tended to be short, those with lower divergence are more scattered in the length distribution (Fig. S4). The alignments show a bimodal distribution of sequence identity: the majority of interchromosomal duplication alignments show 93-95% sequence identity, while intrachromosomal duplications show 95-97% sequence identity.
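
The interchromosomal share quoted above can be recomputed directly from the stated totals. This is a sketch with illustrative variable names; the implied chromosome length is a derived value, not a figure given in the text.

```python
# Totals quoted in the paragraph above.
aligned_total_mb, aligned_inter_mb = 17.75, 14.14
alignments_total, alignments_inter = 1667, 1399

inter_bp_share = aligned_inter_mb / aligned_total_mb
inter_count_share = alignments_inter / alignments_total

# Sequence length implied by "4.23% (5.55 Mb)"; derived here, not stated in the text.
implied_length_mb = 5.55 / 0.0423

print(f"interchromosomal share of aligned bases: {inter_bp_share:.1%}")
print(f"interchromosomal share of alignments:    {inter_count_share:.1%}")
print(f"implied analysed length: ~{implied_length_mb:.0f} Mb")
```

Both measures place roughly four fifths of the HSA11 duplication signal on non-homologous chromosomes.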

Segmental duplications are particularly clustered in the subtelomeric and pericentromeric regions of HSA11p, with the subtelomeric region accounting for 18.3% (305/1667) and the pericentromeric region accounting for 13.6% (226/1667) of the total alignments (Table S21). The subtelomeric region is clustered with interchromosomal duplications, mostly mapping to the subtelomeric or pericentromeric regions of other chromosomes. The pericentromeric region on HSA11p is mainly clustered with intrachromosomal duplications (Fig. S5). In addition, the overwhelming majority of the other segmental duplications are clustered in another 12 blocks (>100 kb and >5 duplication alignments) (Figs. S6, S7). These regions contain genes or fragments of genes, such as IFITM, FOLH, ALDH, NOX4, TRIM49 and NAALAD, as well as OR family members (Table S22). However, only 41 OR pseudogenes and 18 intact genes, all but two from class II, overlap with segmentally duplicated regions, and these are mostly scattered across the chromosome. This suggests that segmental duplication was not a major factor in the expansion of this large gene family, at least on HSA11. Copy number polymorphisms and macro insertions, deletions and inversions among different human populations have recently been reported10-13. We observed that at least four of the HSA11 duplication clusters overlap with or are adjacent to known copy number polymorphism sites, suggesting that the clustered duplications play a role in the generation of these polymorphisms.

Comparative biology

To further define the chromosomal landscape, we performed a comparative analysis of finished HSA11 versus the draft mouse14, rat15, dog16, chimpanzee17 and chicken18 genomes. Using DNA alignments we constructed a map of conserved synteny between HSA11 and the other genomes (Fig. S8, Methods). By scanning these regions for contiguous collinear nucleotide similarity, 36 blocks of conserved synteny larger than 250 kb were identified between human and mouse, with the longest segment being 17.4 Mb. Results for the other organisms can be found in Table S23. In the map of conserved synteny, the chicken seems to lack some of the gene-rich regions. However, this may simply be due to the exclusion of smaller blocks of conserved synteny. Indeed, comparative analysis using the Tblastx programs suggests that many of the genes in these regions are present in the chicken genome.

Additionally, we identified conserved non-coding elements (CNEs) by combining the blastn hits for mammals (mouse, rat and dog; 6218 CNEs), and separately for mammals plus chicken (942 CNEs) (Table S24). The CNEs defined among mammals only are fairly evenly distributed along the chromosome, with a slightly higher density in the region of 90-120 Mb on HSA11 (Fig. S9). However, the elements that are also conserved with chicken show a more skewed distribution, with a higher density of CNEs from 11p-tel to approximately 18 Mb and a much lower density from there to the centromere. The reasons for these trends are unclear, but they may partially reflect the lack of clearly defined synteny with chicken.

Out of 7,487 evolutionarily conserved regions (ECRs) with an average length of 112 bp covering 841,811 bp across HSA11 (data courtesy of O. Jaillon, Genoscope), 5,964 (79.66%) overlapped with potential protein-coding genes and 798 (10.66%) with pseudogenes. 661 (8.83%) of the ECRs overlapped with a CpG island, while only 154 (2.06%) fell within gene desert regions.
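
The proportions in this paragraph follow directly from the stated counts, as the short check below shows. The variable names are illustrative, and the categories are not mutually exclusive, so the percentages are not expected to sum to 100%.

```python
total_ecrs = 7_487
covered_bp = 841_811

# ECR counts per overlapping feature, copied from the paragraph above.
counts = {
    "protein-coding genes": 5_964,
    "pseudogenes": 798,
    "CpG islands": 661,
    "gene deserts": 154,
}

mean_length = covered_bp / total_ecrs
print(f"mean ECR length: {mean_length:.0f} bp")

shares = {label: n / total_ecrs for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.2%}")
```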

Supplemental Methods

Large insert clone sequencing and mapping at RIKEN Genomic Sciences Center

Large-insert BAC and fosmid clone DNA was prepared by the standard alkaline lysis method (Kurabo PI-1100). Shotgun libraries were constructed from randomly sheared DNA (1-2 kb) (HydroShear, GeneMachines) cloned into a plasmid vector. The template DNA was prepared either by PCR amplification of the insert DNA (TaKaRa Ex Taq, Biometra and ABI GeneAmp PCR System 9700), GenomiPhi amplification (Amersham Biosciences) or plasmid DNA isolation (Kurabo PI-1100). Cycle sequencing was performed with BigDye v3.1 chemistry on ABI3700 and ABI3730 sequencers (Applied Biosystems), and with ET chemistry on the MegaBACE1000 sequencer (Amersham Biosciences). Basecalling, quality assessment and assembly were carried out using the Phred/Phrap software package19,20. Assemblies of clones sequenced at 8-10 fold redundancy were visualized for finishing with Consed21 and Sequencher (Gene Codes Corp.). A combination of the following methods was used to close sequence gaps and resolve low-quality or problematic regions: nested deletion22, primer walking, PCR, direct sequencing of large-insert BAC and fosmid clones, and subcloning of BAC clones into fosmid vectors. The average accuracy of the finished sequence data was estimated to be greater than Phrap 40. Clones were finished according to the agreed international standard for the human genome. There were 41 sequence gaps, for an estimated total of 13.7 Kb, which could not be resolved by sequencing (Table S25). For the Human Genome Project (HGP), a number of quality control checks were performed to ensure the highest quality data and uniformity throughout the genome. Like all the other human chromosomes, HSA11 was inspected to make sure that none of the following applied: missing known genes, missing STSs, contamination, partially present genes, compressions or insertions, and false clone overlaps.
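
For readers unfamiliar with the quality scale, the sketch below converts between a Phred-scaled quality score and a per-base error probability, assuming that the "Phrap 40" figure above is on the standard Phred scale (Q = -10 log10 P). The function names are illustrative.

```python
import math

def error_probability(q: float) -> float:
    """Per-base error probability for a Phred-scaled quality score q."""
    return 10 ** (-q / 10)

def quality_score(p: float) -> float:
    """Phred-scaled quality score for a per-base error probability p."""
    return -10 * math.log10(p)

p40 = error_probability(40)
print(p40)                        # about 1e-4, i.e. roughly one error per 10,000 bases
print(round(quality_score(p40)))  # 40
```

Under that assumption, an accuracy greater than 40 corresponds to fewer than about one error per 10 Kb of finished sequence.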

One particularly difficult region to finish was the 566 Kb interval from 88.92-89.49 Mb, which consists of several large intra- and inter-chromosomal duplications (Fig. S10). There is a 350 Kb, near-perfect (99.65%) intrachromosomal duplication in this region which contains two copies each of the PSMAL and TRIM49 genes, and at least four retrotransposed processed pseudogenes, making this region also a challenge to annotate.

As noted in the main text, gaps were size-estimated by fiber-FISH analysis where possible. Initial size estimates were made roughly, to the nearest 1-10 Kb. In cases where additional sequence data were incorporated to reduce a gap, the estimated size of the gap was decreased by the exact amount of new sequence added, which leads to gap size estimates that appear very precise. Fiber-FISH analysis can at best give estimates in the 1-10 Kb range.

Large insert clone sequencing at Broad Institute/Whitehead Center for Genome Research

Subclone libraries of large-insert clones were prepared in M13 or one of several plasmid subclone vectors, and sequenced with the dideoxy chain termination method using one of several versions of BigDye chemistry23. Data were detected on several models of ABI sequencing machines and assembled with Phrap or Arachne24,25. Assemblies were visualized for finishing with either Gap426 or Consed21. A combination of the following methods was used to close sequence gaps and resolve low quality regions and misassemblies: transposon insertion-based sequencing, primer walking, PCR, and shattered insert libraries27. Finished sequence assemblies of all large insert clones were validated by comparison to restriction digestion patterns generated by three to five 6-cutter enzymes28.

STS markers on genetic maps

These data are from the UCSC Genome Browser.

Positions of STS markers are determined using both full sequences and primer information. Full sequences are aligned using blat29, while isPCR (Jim Kent) and ePCR are used to find locations using primer information. Both sets of placements are combined to give final positions. In nearly all cases, full sequence and primer-based locations are in agreement, but in cases of disagreement, full sequence positions are used. Sequence and primer information for the markers was obtained from the primary sites for each of the maps, and from UniSTS. This track was designed and implemented by Terry Furey.
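
The rule for combining the two kinds of placement can be sketched as follows. This is an illustration of the stated preference only, not the UCSC track code: the function name, the position tuple layout and all coordinates are hypothetical, and falling back to the primer-based hit when no full-sequence alignment exists is an assumption.

```python
from typing import Optional, Tuple

Position = Tuple[str, int, int]  # (chromosome, start, end)

def final_position(seq_hit: Optional[Position],
                   primer_hit: Optional[Position]) -> Optional[Position]:
    """Combine two placements, preferring the full-sequence alignment."""
    if seq_hit is not None:
        # Used both when the placements agree and when they disagree.
        return seq_hit
    # Assumption: with no full-sequence alignment, use the primer-based hit (may be None).
    return primer_hit

# Hypothetical coordinates, for illustration only.
seq_pos = ("chr11", 5_000, 5_250)
primer_pos = ("chr11", 9_100, 9_350)

print(final_position(seq_pos, seq_pos))     # agreement
print(final_position(seq_pos, primer_pos))  # disagreement: full sequence wins
print(final_position(None, primer_pos))     # primer information only
```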

Construction of gene catalog

Alignments of all available (as of August 2005) human RefSeq30 and GenBank31 messenger RNA sequences to the finished sequence were derived from the UCSC Genome Browser according to their methodology. Data from the following tracks were inspected manually to ensure accurate transcriptional start and stop sites, and to correct splice sites: Known Genes, RefSeq Genes, Human mRNAs, Ensembl Genes, CCDS, Retroposed Genes, and sno/miRNA. In addition, data from these tracks were reviewed: Human ESTs, Other RefSeq, Other mRNAs, and Other ESTs. Non-canonical splice sites were used only if supported by sufficient complementary DNA-based evidence. Partial transcripts (those containing a partial open reading frame) were annotated in cases for which there was firm evidence of their existence. All gene models (Table S6) were created manually using these aligned sequences as evidence, following HAWK2 transcript type conventions. Evidence was given relative priority as follows (high to low): RefSeq, other mRNAs, spliced ESTs, unspliced ESTs, non-human orthologous mRNAs. When there was more than one variant for a gene, we selected the longest genomic transcript as the representative model. Gene symbols for biologically characterized loci were assigned by the HUGO Gene Nomenclature Committee. Our annotations will be made available to the Vertebrate Genome Annotation database (VEGA,