Galaxy Workshop

Course on Genomic and Proteomic Approaches to Heart, Lung, Blood and Sleep Disorders

Jackson Laboratory, Bar Harbor, Maine

September 10, 2008

These exercises explore the public interface of Galaxy, which is designed for integrative, reproducible analysis of genomic data. Galaxy is a project of the Penn State Center for Comparative Genomics and Bioinformatics and is headed by Anton Nekrutenko and James Taylor.

Start at the homepage for Galaxy.

You can pick one of three paths (A, B, or C) through this workshop.

Option A. If you have not had much experience with the UCSC Table Browser, Ensembl's BioMart, or web-based tools in general, we provide a series of online tutorials that show you step by step how to use Galaxy. Under “Use it now” choose “Watch screencasts”.

Option B. Many of you will want to explore the web-based tools that Galaxy offers to get more insight into data that you have obtained, either in your own research or from other sources (e.g. UCSC Table Browser, Ensembl's BioMart). Click on “Use it now” and explore as your curiosity drives you. For those who would like a more focused approach, follow the exercise below.

Option C. Some of you may want to see how to add your own programs to Galaxy or set up your own platform. On the Galaxy homepage, under “Designed for biologists and developers”, choose the links under “Developers”.

Exercises for Option B.

1. Upload precomputed preCRMs and other files of interest into Galaxy.

I built a site (Fig. 1) to access several datasets relevant to regulation of transcription.

Fig. 1. Part of the web page with datasets on regulation of transcription.

Choose the datasets you are interested in, copy the URL for each dataset, and paste it into Galaxy. As shown in Fig. 2, use "Get Data" and "Upload File from your computer". This puts the datasets into Galaxy; you will see them in your history column. They are lists of chromosomal coordinates for the DNA intervals on the hg18 (2006) assembly of the human genome.

Fig. 2. Galaxy screen showing pasting of URL for dataset into the window for data uploads.

Note that the set of highly conserved regions is very large and will take a while to load. Choose at least two datasets.
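For readers who want to inspect such a dataset outside Galaxy, the following is a minimal sketch of reading a list of chromosomal coordinates. It assumes a whitespace-delimited, BED-like layout (chromosome, start, end in the first three columns), which may not match every file on the site. The names `Interval` and `parse_intervals` and the example coordinates are hypothetical and are not taken from the workshop datasets.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed 0-based, inclusive (BED convention)
    end: int    # assumed exclusive


def parse_intervals(text: str) -> List[Interval]:
    """Parse whitespace-delimited interval lines, skipping blanks and header lines."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines, comments, and UCSC-style track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        intervals.append(Interval(fields[0], int(fields[1]), int(fields[2])))
    return intervals


# Hypothetical example content, for illustration only.
example = "# chrom\tstart\tend\nchr11\t5200000\t5200500\nchr11\t5210000\t5210800\n"
intervals = parse_intervals(example)
for iv in intervals:
    print(iv.chrom, iv.start, iv.end, "length:", iv.end - iv.start)
```

Extra columns (names, scores, strand) are ignored in this sketch; only the first three fields are used.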

2. (Optional) Choose a region to study. You can also do a whole-genome analysis; it just takes longer.

If you are interested in a particular region, choose that one. Particularly deep information is available on the ENCODE pilot project regions (Birney et al. 2007). Here are some examples:

ENm001 CFTR and surrounding genes

ENm002 Interleukins

ENm003 Apolipoprotein cluster

ENm006 region on ChrX

ENm008 alpha-globin and related genes

ENm009 beta-globin and related genes

ENm010 HOXA cluster

ENm011 IGF2/H19 cluster

ENm012 FOXP2 region

You can get the coordinates for an interval from the UCSC Genome Browser or the Table Browser proxy under “Get Data” in Galaxy. Use these coordinates as one (very simple) data item in your history.

3. Use "Operate on Genomic Intervals", choosing the "Intersect" tool (Fig. 3) to:

a. Find the predicted CRMs (based on alignments or on direct biochemical data) in your genomic region of interest.

b. Compare the two sets of preCRMs. What fraction overlap, and what fraction are distinctive?

Fig. 3. Galaxy screen showing use of the Intersect command.
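To make the idea behind steps 3a and 3b concrete, here is a minimal sketch of interval intersection and of the "fraction overlapping" calculation. It illustrates the concept only and is not Galaxy's implementation. It assumes half-open coordinates and a minimum overlap of one base; the function names and the example intervals are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open coordinates assumed


def overlaps(a: Interval, b: Interval, min_bp: int = 1) -> bool:
    """True if a and b share at least min_bp bases on the same chromosome."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return shared >= min_bp


def intersecting(first: List[Interval], second: List[Interval]) -> List[Interval]:
    """Return the intervals of `first` that overlap at least one interval of `second`."""
    return [a for a in first if any(overlaps(a, b) for b in second)]


def fraction_overlapping(first: List[Interval], second: List[Interval]) -> float:
    """Fraction of intervals in `first` that overlap something in `second`."""
    if not first:
        return 0.0
    return len(intersecting(first, second)) / len(first)


# Hypothetical preCRM sets, for illustration only.
set_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
set_b = [("chr1", 150, 250), ("chr2", 500, 600)]

shared = intersecting(set_a, set_b)
print("overlapping:", shared)
print("fraction of set A overlapping set B:", fraction_overlapping(set_a, set_b))
```

The fraction is not symmetric: computing it with the two sets swapped answers the question for the other set of preCRMs. The all-against-all comparison here is quadratic and is meant only for small examples.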

4. Compute the phastCons scores for preCRMs of interest.

Under "Get Genomic Scores", choose "Aggregate datapoints" (Fig. 4), and follow the directions to get the mean, minimum, and maximum phastCons scores for each interval.

Fig. 4. Galaxy screen showing use of the “Aggregate datapoints” command.
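The aggregation in step 4 can be sketched as follows. This is an illustration of the calculation under stated assumptions, not the code behind the Galaxy tool: per-base scores are held in a dictionary keyed by (chromosome, position), coordinates are half-open, and positions without a score are skipped. The function name and the score values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[str, int, int]           # (chrom, start, end), half-open assumed
Summary = Tuple[float, float, float]      # (mean, minimum, maximum)


def aggregate_scores(
    intervals: List[Interval],
    scores: Dict[Tuple[str, int], float],
) -> List[Optional[Summary]]:
    """Summarize per-base scores over each interval; None if no base is scored."""
    results: List[Optional[Summary]] = []
    for chrom, start, end in intervals:
        values = [
            scores[(chrom, pos)]
            for pos in range(start, end)
            if (chrom, pos) in scores
        ]
        if values:
            results.append((sum(values) / len(values), min(values), max(values)))
        else:
            results.append(None)
    return results


# Hypothetical per-base scores, for illustration only.
scores = {("chr1", 100): 0.2, ("chr1", 101): 0.4, ("chr1", 102): 0.9}
intervals = [("chr1", 100, 103), ("chr1", 200, 203)]

summaries = aggregate_scores(intervals, scores)
for interval, summary in zip(intervals, summaries):
    print(interval, summary)
```

Intervals with no scored bases are reported as `None` here rather than as zero, so that missing data is not mistaken for low conservation.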

5. Plot the distribution of these scores: under “Graph/Display Data”, choose “Histogram” (Fig. 5).

Fig. 5. Galaxy screen showing use of the “Histogram” command.

Do these preCRMs seem to be homogeneous with respect to these scores?
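As a rough guide to reading the histogram, the sketch below bins a list of per-interval mean scores into equal-width bins and prints a text histogram. It assumes the scores lie between 0 and 1, as phastCons scores do; the function name, the number of bins, and the score values are hypothetical.

```python
from typing import List


def histogram(values: List[float], bins: int = 10, lo: float = 0.0, hi: float = 1.0) -> List[int]:
    """Count values in `bins` equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if v < lo or v > hi:
            continue  # ignore values outside the expected range
        # A value equal to hi is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical mean scores, one per preCRM, for illustration only.
mean_scores = [0.05, 0.12, 0.15, 0.55, 0.91, 0.95, 1.0]
counts = histogram(mean_scores, bins=5)

for i, count in enumerate(counts):
    left = i * 0.2
    print(f"{left:.1f}-{left + 0.2:.1f} | " + "*" * count)
```

If the counts pile up in two separate ranges, as in this made-up example, the set is probably not homogeneous with respect to the score; a single peak would suggest the opposite.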

Reference

Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., et al. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816.

Appendix: More tools and datasets for gene regulation in mammals

Tool or resource / Source / Section or URL / Track or comment
phastCons scores / UCSC Browser / Comparative Genomics / Conservation
phastCons highly conserved elements (HCEs) / UCSC Browser / Comparative Genomics / Most Conserved
Conserved TFBSs / UCSC Browser / Expression and Regulation / TFBS Conserved
Regulatory potential / UCSC Browser / Expression and Regulation / Reg Potential 7 species
Segments with high regulatory potential / PSU CCGB /
CpG islands / UCSC Browser / Expression and Regulation / CpG islands
Clusters of conserved TFBSs / PReMods / / “Search Predicted Modules”
Known motifs in a single sequence / TESS /
Known motifs in a single sequence / MOTIF (GenomeNet) /
Known motifs in a single sequence / MatInspector / / Note: requires registration and has limit on number of analyses
Precomputed evolutionarily conserved regions / ECR browser /
Conserved TFBSs / Consite /
Experimental tests on highly conserved noncoding sequences / Enhancer browser /
Experimentally investigated regulatory regions / ORegAnno / Expression and Regulation /
