Assignment 7: Finding protein-coding genes

The purpose of this exercise is to illustrate some of the concepts in the lectures and readings by using web servers to annotate genes. As with all my assignments, if your interests lead you in a different direction, you are free to follow that direction as long as it deals with gene annotation. You may do the assignment on genomic regions from ANY organism (including bacteria, plants, and fungi) but you will probably have to do more independent investigation than if you choose to use the assigned sequence. Of course, please tell me what you did. The report from this exercise should be around two to four pages, including figures. Quantitative answers are preferable to qualitative ones. Describe your observations in your own words, and cite your sources for information.

Pick a genetic locus (single gene or multiple genes) that you are interested in. You can choose the locus from any organism. The following description of the assignment is based on a gene that almost everyone is interested in at some level, TP53. This gene encodes a transcription factor, “tumor protein p53”, that regulates several aspects of cell growth. It is also frequently mutated in many cancers. If you have no other preference, then work on TP53. It and some adjacent genes are located at chr17:7,550,001-7,608,000 in the GRCh37/hg19 assembly of the human genome. This 58 kb sequence (in FASTA format) is at the Angel course site.
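
If you want to confirm that the file you downloaded matches the stated interval, a short script can do the arithmetic. The following is only a minimal sketch in Python: the function names are made up for illustration, and it assumes the coordinates are 1-based and inclusive, as they are displayed in the UCSC Genome Browser position box.

```python
def parse_region(region):
    """Parse a browser-style position such as 'chr17:7,550,001-7,608,000'.

    Returns (chrom, start, end), treating the coordinates as 1-based and
    inclusive (an assumption that matches how the browser displays them).
    """
    chrom, span = region.split(":")
    start_text, end_text = span.replace(",", "").split("-")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return chrom, start, end


def region_length(start, end):
    """Number of bases in a 1-based, inclusive interval."""
    return end - start + 1


def fasta_sequence_length(lines):
    """Count sequence characters in FASTA text, skipping '>' header lines.

    'lines' can be any iterable of strings, including an open file handle.
    """
    total = 0
    for line in lines:
        if not line.startswith(">"):
            total += len(line.strip())
    return total


chrom, start, end = parse_region("chr17:7,550,001-7,608,000")
print(chrom, start, end, region_length(start, end))
# prints: chr17 7550001 7608000 58000
```

To check a downloaded file, pass an open handle to fasta_sequence_length and compare the result with the interval length, which is 58,000 for the assigned region.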

(1) Run the sequence through Genscan to find the predicted genes. Genscan and the associated server were developed by Chris Burge (now at MIT), and the server is still supported there:

http://genes.mit.edu/GENSCAN.html

If you are working with a bacterial sequence, try Glimmer (Salzberg lab); you can use the server at NCBI:

http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

Briefly state how the gene predictions were produced, and describe the results of the gene predictions.

(2) Now compare these results to (a) evidence of transcription and (b) gene models built by a comprehensive pipeline, such as “UCSC genes” or “GENCODE”. A good way to do this is to examine tracks in the UCSC Genome Browser for

- A comprehensive pipeline, such as “UCSC genes” or “GENCODE”

- mRNA data

- Genscan predictions

- results of RNA-seq

Describe the gene annotations from these different sources. What similarities and differences do you see? What is the basis for the differences? (This is asking about the power and limitations, the good points and not-so-good points, of the different methods.)
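
Since quantitative answers are preferred, one way to put numbers on the comparison is to measure how well the predicted exons overlap a reference set of exons. The sketch below is one possible approach, not a required method. It computes nucleotide-level sensitivity (the fraction of reference exon bases that were predicted) and what the gene-prediction literature often calls specificity (the fraction of predicted exon bases that fall in reference exons), and it lists the exons whose boundaries match exactly. The function names are hypothetical, and the coordinates are made up for illustration; they are not real TP53 annotations.

```python
def covered_bases(intervals):
    """Set of positions covered by 1-based, inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def nucleotide_accuracy(predicted, reference):
    """Return (shared_bases, sensitivity, specificity) at the nucleotide level.

    sensitivity = shared bases / reference exon bases
    specificity = shared bases / predicted exon bases
    """
    pred = covered_bases(predicted)
    ref = covered_bases(reference)
    shared = len(pred & ref)
    sensitivity = shared / len(ref) if ref else 0.0
    specificity = shared / len(pred) if pred else 0.0
    return shared, sensitivity, specificity


def exact_exon_matches(predicted, reference):
    """Exons whose start and end both agree between the two sets."""
    return sorted(set(predicted) & set(reference))


# Made-up exon coordinates, for illustration only.
predicted = [(100, 200), (300, 400), (500, 650)]
reference = [(100, 200), (320, 400), (700, 750)]

shared, sn, sp = nucleotide_accuracy(predicted, reference)
print(f"shared bases: {shared}")
print(f"sensitivity: {sn:.3f}  specificity: {sp:.3f}")
print("exact exon matches:", exact_exon_matches(predicted, reference))
```

In practice you would replace the two lists with exon coordinates taken from the Genscan output and from the reference track you chose. Keep in mind that a low score can reflect limitations in either source, so explain what the numbers mean.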

To help you get started, I have shared a “browser session” with you. Copying and pasting the following URL into your web browser will open a view at the UCSC Genome Browser.

http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=rosshardison&hgS_otherUserSessionName=TP53andFlanks

This is a good starting point, but I encourage you to explore these tracks, change the settings, open other tracks, etc. This is an opportunity to delve more deeply into the material we covered.
