Supporting Online Material

Supplemental research data (Versteeg et al.)

A. Supplementary figures:

Fig. 6: Profiles of gene expression, gene density, inverse intron length, GC content, SINE content, and inverse LINE content for all human chromosomes. The profiles show the moving medians at window size 49. The expression data are based on all combined SAGE libraries (SAGE all tissues).

Fig. 7: Regions of extremes in gene expression, gene density, intron length and GC content for all human chromosomes. The horizontal axis of the triangles represent the transcriptional units (TUs) on each chromosome. The vertical dimension is the window size, increasing from 19 to the maximal possible window size for each chromosome. All chromosomes are scaled to the same unit length. Red are regions of significantly increased gene expression, gene density, inverse intron length and GC content. Blue are regions of significantly decreased values of the same parameters.

Fig. 8: Gene expression profiles of human chromosome 6 in a series of different tissues. The median expression patterns (w=49) are shown for six different compound SAGE libraries of specific tissues. Expression data are based on combined SAGE libraries of a specific tissue, with combined library sizes ranging from 341,980 tags (Colon tumor all) to 862,017 tags (Brain tumor all). Red bars indicate a median expression level above 25 per 100,000 tags.

Fig. 9: The probability density function (pdf) of median gene expression from randomly permuted genomes, pR, is shown together with the pdf of medians from gene expression from the human genome, pG, at w=49 (see text). The False Discovery Rate FDR is the ratio of the areas (mm>mm0 (w)) residing in the tails of the functions pR and pG.

Fig. 10: Analysis of the Spearman Rank correlation between gene expression, gene density, intron length, GC content, SINE content, and LINE content. The three graphs show from top to bottom, the quantity ZD (see supplementary information section 4.3) as function of w for the paired quantities (expression,G+C), (expression, inverse intron length), (expression, gene density), (expression, SINE content), and (expression, inverse LINE content), respectively.

Fig. 11: Enrichment of ridges for genes with short introns, high gene density, high GC content, high SINE content and low LINE content. The red curves show the cumulative number of genesembedded in ridges up to the rank of expression, G+C content, gene
density, inverse intron length, SINE content, and inverse LINE content respectively. These curves are compared to the blue ones that represent the cumulative number of ridge genes as function of the rank of an uncorrelated quantity (see text sec. 4.4).

Fig. 12: Data model used to develop the sequence based Human Transcriptome Map.

B. Supplementary technical information

1.0 The HTMseq database

The construction of the HTMseq involved the following steps:

1.1 Mapping of transcripts to the draft human genome sequence

1.2. Identification of the UniGene numbers of mapped transcripts.

1.3 Selection for UniGene clusters with identified SAGE tag(s).

1.4 Splitting of UniGene clusters with more than one map position.

1.5 Merging of overlapping UniGene clusters

1.6 Removal of UniGene clusters without intron-exon structure.

1.7 Discarding transcriptional units (TUs) on contigs without a chromosomal assignment.

1.8 Rejection of tags for statistical analyses of HTMseq (and whole chromosome view and concise view of HTMseq website).

1.9 Annotation of TUs after splitting and merging.

1.10 Determining the expression level of the TUs

These steps are summarized in a data model (Fig. 12) and described in chapters 1.1 to 1.10.

1.1 Mapping of transcripts to the draft human genome sequence

As a basis for the map we used the UCSC Golden Path database of the draft human genome sequence (august 2001 freeze, http://genome.ucsc.edu). The UCSC database contains alignments of transcripts (mRNA and EST sequences) to the human genomic sequence.

1.2 Identification of the UniGene numbers of mapped transcripts.

We assigned each mapped transcript to its corresponding UniGene cluster via transcript accession numbers. We used the build 141 of the UniGene database. This gave us the map position(s) of UniGene clusters. We could identify a chromosomal map position for 81,975 UniGene clusters of the 96,470 UniGene clusters in build 141.

1.3 Selection for UniGene clusters with identified SAGE tag(s).

We analyzed which of the UniGene clusters had a valid SAGE tag(s). We therefore used the AMCtagmap program1. The AMCtagmap database lists all accession numbers with their validated SAGE tags. A total of 29,719 UniGene clusters had no identified SAGE tags in the AMCtagmap program and were discarded.

1.4 Splitting of UniGene clusters with more than one map position.

Many UniGene clusters were found to have multiple map positions, e.g. on two chromosomes or at very different positions on one chromosome. This was a result of 1) errors in the UniGene clustering, combining transcripts of multiple genes into one UniGene cluster (usually caused by hybrid cDNA sequences); 2) The occurrence of pseudo-genes in the genome, that are highly homologous to the true coding sequences of a gene; 3) The unfinished state of the genome (e.g., false genomic duplications); 4) true genomic duplications.

We designed an algorithm to split all UniGene clusters with multiple map positions in (sub)clusters for each map position. This was based on the map positions of the transcripts included in the UniGene cluster. Transcripts were considered to be located on different map positions if they were located on different chromosomes or if there existed gaps between exon alignments larger than 100.000bp (note that hybrid cDNA clones that consist of parts of two genes mapped to different locations are split in this way). The distance between successive exons on the genome was determined by subtracting the end and start position of these exons, while correcting for sequence gaps in the genomic sequence as annotated in the UCSC database. Very short introns (<10bp) were ignored as they may represent coincidentally matching exon sequences. The splitting procedure resulted in a collection of transcripts with a reliable chromosomal position, linked to 70,751 (split) UniGene clusters (corresponding to 52,256 UniGene clusters). The split UniGene clusters kept their UniGene cluster number, but with the extension ‘s‘ and an index number according to the number of transcripts included in the split UniGene cluster (S1 for the cluster with the highest number of transcripts, followed by S2 etc.).

1.5 Merging of overlapping UniGene clusters.

Inspection of the map locations of (split) UniGene clusters revealed that many clusters overlapped. This could represent: 1) Two different genes encoded on opposite strands; 2) Transcript fragments from one gene that were assigned to two different UniGene clusters; 3) A consequence of the splitting of UniGene clusters containing contaminating transcripts. The latter possibility occurs when a UniGene cluster erroneously includes two transcripts of an unrelated gene. The splitting-procedure (1.3) will re-assign the erroneously clustered transcripts to their true genomic position. There they will be found in overlap with the UniGene cluster of the gene that they truly belonged to.

We designed an algorithm to merge overlapping clusters when appropriate. We used the conditions that (1) the clusters had an identical strand assignment and (2) the genomic sequence of the UniGene clusters did not include a sequence gap. Gaps make strand assignments unreliable, as the orientation of the contigs on either side of a gap with respect to each other is arbitrary. Each transcript that has been aligned against the genomic sequence is assigned to either the + or – strand in the UCSC database. The assignment depends on the sequence orientation of the transcript in the GenBank database and, consequently, a single UniGene cluster may contain a mixture of sequences that are assigned to either the UCSC + or – strand. Therefore, strand assignments for UniGene clusters cannot be based on the UCSC strand assignment of transcripts. We therefore designed our own strand assignment algorithms. We used the information on the orientation of transcripts available in the AMCtagmap database (extraction of SAGE tags in this database is based on information of the poly-A tail and poly-adenylation signal and can therefore be used for strand assignment). We analyzed the strand of the SAGE tags of each UniGene cluster (using the collection of sense-tags). As SAGE tags are in the original mRNA by definition preceded by the sequence CATG, we expanded the tag with this sequence, giving a 14 bp sequence. The full genomic sequence of a (split) UniGene cluster was searched for these 14 bp sequences (1 substitution, deletion or insertion was allowed). Tags that did not align with a genomic sequence were split in all possible combinations to analyze whether they were derived from a position in the transcript that crossed an intron-exon boundary. Subsequently, each UniGene cluster was assigned to the strand to which the highest number of tags matched. In exceptional situations where both strands matched an equal number of tags, no strand assignment was made. To validate this method we compared the results with 7.029 sequences in the RefSeq database of the NCBI (http://www.ncbi.nlm.nih.gov:80/locuslink/refseq.html). We found that 95% of the clusters with one RefSeq sequence were assigned to the same strand as the RefSeq sequence.

Finally, 16,621 overlapping (split)UniGene clusters with an identical strand assignment and without a sequence gap in the genomic sequences were merged into 6,266 new clusters (for nomenclature see below). After the merging step, the map counted a total of 60,396 clusters. The resulting clusters are indicated as Transcriptional Units (TUs).

1.6 Removal of UniGene clusters without intron-exon structure.

Many genes have pseudogenes elsewhere in the genome. Most pseudogenes are characterized by the absence of intron sequences. We therefore selected against pseudogenes by rejecting all TUs that did not include at least one intron of at least 50 bp. This way we rejected TUs from 30,569 map positions.

1.7 Discarding TUs on contigs without a chromosomal assignment.

Some sequences of the draft human genome sequence are not yet tied to a specific chromosome or chromosomal position. An additional 435 TUs mapping to such sequences were discarded.

1.8 Rejection of tags for statistical analysis of HTMseq (and whole chromosome view and concise view).

Certain SAGE tags are unreliable as they result from technical artifacts or happen to belong to many different UniGene clusters. Some of them are found in high numbers in SAGE libraries. To improve the quality of the database and the statistical analysis of the data, we applied the following rules for these tags:

a) Tags that belong to more than three TUs: The expression level of these tags is shown in the ‘extended view’ of HTMseq. These tags are shown in red and the number of TUs to which they belong is shown. However, tags belonging to more than three TUs were discarded for the generation of the expression profiles shown in the ‘whole chromosome view’ and ‘concise view’ levels of HTMseq. These tags were also ignored for all statistical analyses of Ridges, anti-ridges and correlations of gene expression levels with other genomic parameters. We discarded 1,662 TUs which only contained tags belonging to >3 TUs.

b) Anti-sense tags: Also tags that have an anti-sense orientation relative to the majority of the tags from a UniGene cluster are unreliable (see ref. 1). They either are artifacts or represent true anti-sense transcripts of a UniGene sequence. They are shown in red with the warning ‘anti-sense’ in the ‘extended view’ option of HTMseq, but they were not used for the expression profiles shown in the ‘whole chromosome view’ and ‘concise view’ levels of HTMseq. These tags were also ignored for all statistical analyses of Ridges, anti-ridges and correlations of gene expression levels with other genomic parameters. We removed 1,465 TUs with only anti-sense tags from the indicated sections of the database.

c) Furthermore, a tag was considered unreliable when it was found

1) in only one transcript of a TU

AND

2) when this tag did not map on the genomic sequence of the TU.

These tags are likely to be produced by chimeric cDNA sequences of which the part that contained the tag did not align to the genomic sequence. A number of 1,262 TUs were removed on this ground.

d) Linker tags: they arise as technical artifacts from the construction of SAGE libraries. All linker tags that also happened to belong to a TU were fully discarded, as their expression data in SAGE libraries are unreliable. As these tags were also removed in the AMCtagmap program, this resulted not in a further decrease of the number of TUs at this step.

1.9 Annotation of UniGene clusters after splitting and merging.

The split Unigene clusters kept their UniGene cluster number, but with the extension ‘s‘ and an index number according to the number of transcripts included in the split UniGene cluster (s1 for the cluster with the highest number of transcripts, followed by s2 etc.). The corresponding description lines of UniGene clusters were extended with the number of transcripts in the split cluster and the number of transcripts in the original UniGene cluster. When two or more UniGene clusters were merged, the UniGene cluster number was replaced by the text ‘merged’ (Hs.merged). The description line was modified to indicate which UniGene clusters were merged into the new cluster (e.g. Hs.2345_s2 and Hs.77821).

1.10 Determining the expression levels of Transcriptional Units

We used the SAGE libraries published at the NCBI SAGEmap database (httl://www.ncbi.nlm.nih.gov/SAGE/). There are 116 human SAGE libraries from this database included in HTMseq, totaling to 5,301,177 tags. The SAGE libraries can be analyzed individually, but we also combined SAGE libraries from one tissue in ‘tissue type’ libraries (Normal tissues of brain, cerebellum, white matter, breast, colon, pancreas, ovary, heart, kidney, leukocyte, liver, lung, peritoneum, prostate, vascular endothelium). SAGE libraries from the same tumor tissues were likewise combined. In addition, we used 657,869 tags that we identified in different neuroblastoma SAGE libraries. Finally we made combined libraries of all normal tissues (‘normal tissues’; 1,785,370 tags), tumor tissues (‘tumor tissues’; 3,419,180 tags) and of all combined SAGE libraries (‘all tissues’). The ‘all tissues’ data (5,959,046 tags) were used to identify ridges and perform all other statistical analyses.