Practical – Alignments and genome browsers

Table of Contents

DNA Motif discovery 1

BLAST 2

PSI-Blast and HMMer 4

InterProScan 5

The UCSC Genome Browser 5

IGV & Bowtie 8

Data that you will need for these assignments can be found at:

http://130.237.142.51/media/data/courses/align_browse/

DNA Motif discovery

A ChIP-seq experiment has been performed to find where Sox2 binds in neural stem/progenitor cells.

From the resulting genomic positions (which are accurate to on the order of 100 bp), the genomic

sequences around the ChIP-seq peaks have been extracted.

1. Download Sox2NPC_top500.fa from the assignment website.

2. Start running MEME-ChIP (http://meme.sdsc.edu/meme/cgi-bin/meme-chip.cgi), which is a tool

that both tries to create motifs from your data, and looks for known motifs from a database. It will take

hours to run (~2h for this file), so continue with the next step and following exercises, when the MEME output is finished, continue with step 5.

3. Run CisFinder (http://lgsun.grc.nia.nih.gov/CisFinder/), another de novo motif discovery tool.

Log in with username guest and no password.

To upload a file: 'Use file' Browse, then click the button Upload.

Select your file in the dropdown for 'Sequence file #1 (test)'

Select one of their control files in the dropdown for 'Sequence file #2 (control)'. This control is not

required, but get rid of motifs created by low-complexity regions of the genome, e.g. CA repeats.

Then click 'Identify motifs'.

Use the default settings on the next page and click continue.

Click 'Show elementary motifs' and 'Show clusters of motifs', look at the output

4. Look at Sox motifs in the transcription factor motif database Jaspar (http://jaspar.genereg.net/). The

family binds to pretty much the same set of sequences all of them. Click the vertebrata button. The

page has a search function, but it's easier to use your web browsers 'find on this page' search (ctrl+F).

Can you find a Sox motif among the enriched motifs in CisFinder's output?

If you look closer at the Jaspar motif labeled 'Sox2', the motif only matches CisFinder's Sox motif and

Jaspar's other motifs in one half. Jaspar's Sox motif comes from ES cell ChIP-seq, where the motif

discovery program reported a joint Sox2-Oct4 motif, but Jaspar's labeling missed that.

5. Check if MEME-ChIP finished. Browse the output from MEME-ChIP: DREME, AME and

DREME->TOMTOM.

Does it find a Sox motif?

Are there motifs that could correspond to binding partners?

BLAST

(http://blast.ncbi.nlm.nih.gov/)

For detailed information on NCBI Blast please have a look in the NCBI Handbook (http://www.ncbi.nlm.nih.gov/books/NBK21097/). In the whole blast server, each entry has a blue “?” next to each option, if you are unsure what each setting means, click on the question mark to find out more.

1. FETCH THE QUERY SEQUENCE

The sequence “citron” we will use can be found at the assignment website.

Choose what program to use. Blast is really a bundle of programs such as blastn (nucleotide blast), blastp (protein blast), blastx (compares a nucleotide query sequence translated in all reading frames against a protein sequence database) and many others.

In our case, we will compare protein sequences and therefore use blastp.

Choose “protein blast”.

2.CHOOSE DATABASE

You are now at the input page. Paste your sequence in the search field.

Almost as important as what program to use is to choose the correct database to perform your search against, this is done under “Chose search set”. The default is a database called nr. This database combines sequences from GenBank, RefSeq, EMBL, DDBJ, etc. in case of nucleotides, and Entrez protein, PDB, PIR PDB sequences etc. for proteins. The nucleotide db is not non-redundant (as the abbreviation implies).

The choice of database is important! If you were only interested in high-quality hits, it would be prudent to limit yourself to e.g. the SWISS-PROT database. On the other hand, if your query is hard to find, consider the possibility to search e.g. unordered BACS from the sequence projects (the htgs database). More information on the various databases can be found at the page.

For now, leave it at nr.

3. PROGRAM SELECTION

With protein blast there are several different programs to chose from. The default is regular protein blast (blastp) but you can also select PSI-Blast, PHI-Blast and Delta-Blast.

Make sure that blastp is selected.

4. PARAMETER SETTINGS

There are many options available to adjust, to see them, select the “+” next to Algorithm Parameters.

The most important ones:

E value:

The expected value cut-off: “Expect”, where the standard is 10. If you want only very close hits, change E to something smaller. If more you want more distant homologies reported, increase E.

Score-matrix:

Then standard is BLOSUM 62, but other matrices might be better in some cases.

Gap cost/extension:

The cost of opening and extending gaps in the alignments.

Filtering/masking:

Select if you want to filter out low-complexity regions or other

For now, leave all settings at default parameters.

For now, use the defaults. Push the BLAST button.

5. THE FORMAT PAGE

In the case of a protein sequence, you might be notified that there are some conserved domains in your sequence. Note that this has nothing whatsoever to do with blast itself, it is just an auxiliary service. Just wait a bit and the blast results will be displayed. If you are submitting longer sequences, it might be a good idea to make a note of the job id. That way you can retrieve the output the other day if you wish.

6. THE RESULTS PAGE

After waiting a while, you will get to the result page. Scroll down to see the result figure, which should say something in the line of “ Distribution of xxx Blast Hits on the Query Sequence”.

All hits are listed as lines, coloured after their quality (E-values and bit-scores). Bit scores are basically the scores of an alignment: the higher the bit score, the better.

E-values are an estimate of how likely it is to find an alignment with this score just by chance. The statistics here are rather complicated, but some things are easily understood: the size of the database clearly has an impact on the E-value, but not on the bit-scores. Therefore, E-values can change when different databases are queried, even though it is the same alignments.

The line colour indicates the alignment quality. All lines (=found sequences, or 'hits') have mouse-over capabilities. If moving the mouse over a line, a description of the specific sequence will appear in the text box at the top.

Scroll down further, and you will see a list of hits (the classical, non-graphical BLAST output).

Each text line is clickable; clicking the hit sequence name will give you the corresponding database sequence entry, and clicking the bit score to the right will bring you to the actual alignment of the query sequence to that hit sequence. As you can see, the top 10 entries are all very good, and most are cyclins from mouse or human. A pretty strong claim would be that the citron sequence is a cyclin, and that it is of a mammalian origin, if not human or mouse. To be more specific, a detailed study of the alignments would be appropriate.

7. ASSIGNMENTS

a) There are more sequences to try out on the assignment webpage: sallad and vanilj.

Have in mind what kind of sequences you submit and choose the BLAST program accordingly. Also, think about the following: is it given that you always will have a ‘red’ match? If not, why are there no good matches in some cases?

When you run nucleotide blast, the default setting is to run megablast, i.e. search for closely related sequences. If there are no close relatives, it may be a good idea to try blastn or blastx instead.

b) Retrieve the sequence AF227957 from Genbank at NCBI and BLAST it against the standard nucleotide databases. Repeat the analysis, but this time using the appropriate program to instead search the protein databases with the sequence.

Is there a difference in the hit distribution? Why?

PSI-Blast and HMMer

PSI-Blast can be performed at the NCBI site, however, their server is very slow and the download options are limited. Instead use the HMMER server at http://hmmer.janelia.org/. This server includes several search programs: Phmmer for protein alignments, Hmmscan for searching with a protein sequence against an hmm database (Pfam), Hmmsearch for searching with an hmm against a protein database and Jackhammer, which is PSI-Blast.

Task:

You have a protein sequence, from Aspergillus oryzae (NCBI GI 83766847), and you want to find out what it may be related to and if there are any homologous structures that you can use for structure modelling.

Use the following methods:

Hmmscan – protein sequence vs. profile-HMM database

Use your query sequence to see if you can detect any Pfam domains in your sequence. Are there any hits?

Phmmer – protein alignment (similar to blastp)

Search for homologs in nr and in PDB. Do you get any hits? Do any of the hits have names that give a clue to the function of your protein? Are there any closely related structures in PDB?

Jackhammer – PSI-Blast

To search for more distant homologs it is often useful to use PSI-Blast to create a profile using several similar sequences, instead of a single sequence. Run PSI-Blast with your sequence and NR as the database. PSI-Blast should run sufficient iterations until you do not get any more new hits (convergence), or until the number of hits expands too much and you might expect that you are including too many non-related hits.

Run five iterations and select an iteration that you think is appropriate to use for further searching. Take a look at the hit distribution (coloured bars above result list) to make sure that you do not include too many hits with low significance.

Select Download and HMM, this will create a Hidden Markov Model from all the sequences in your search.

Question: How many iterations do you have to run until you found a human homolog to the A. oryzae protein? ( Hint: follow the Taxonomy link.) In what phylum/phyla do you find most of the homologs?

Question: What domains can you find in the hits of the first iteration?

Question: What is the most conserved residue after 2 iterations? Scroll down to the bottom of the score page to see the sequence logo for the profile. Is the same residue equally conserved after five iterations?

HMMsearch

Now you can use your hmm to search for homologs in PDB. Select “Upload a file” and use the HMM that you created, make sure that you select PDB as database. Run HMMsearch and see if you can find any related structures. Did you find any similar structures that you could use for homology modelling?

Another approach to finding related structures could be to run PSI-Blast sufficient iterations until you have a structure among the hits. If you have the time, check all iterations for PDB structures. Click “Customize” above the result list and check the box for “Known structure”. Now you should be able to see if there were any hits to PDB in each iteration (Note: only check the significant hits, not the yellow fields with e-values below cutoff). This of course is time consuming to check all the list, but with automated scripts for running PSI-Blast and checking for hits in PDB, this is an efficient way of finding hits in PDB.

InterProScan

(http://www.ebi.ac.uk/Tools/pfa/iprscan/)

InterPro classifies protein motifs and domains from several different databases. To learn more about the different motifs in InterPro, please check out the tutorial at http://www.ebi.ac.uk/interpro/tutorial.html. InterProScan can be used to search for InterPro motifs in a query protein sequence. Use the sequence MDprotein at the assignment website and paste into the search field. Do you find any predicted domains in your sequence?

Do the domain assignments from the different methods agree well? For which domain is the agreement best?

The UCSC Genome Browser

(http://www.genome.ucsc.edu/)

Briefly, the genome browser is a concept where mRNA sequences and other information is ‘mapped’ on the genome sequence. Usually, information from one specific source (such as ‘mRNAs from genbank’ or ‘human-mouse conservation’) is in a separate ‘track’. The trick is how to select the information (the tracks) you are interested in, and not get overwhelmed by the rest.

Go to the UCSC Genome Bioinformatics website (http://www.genome.ucsc.edu). From the start page, you can click on the blue bar at the top of the screen to access the resources of main interest: Genomes, Blat, Tables, etc. The table browser provides a textual (i.e. non-graphical) interface to genomic data; this can be useful for larger, systematic analyses. You may want to have a look at the ‘Help’ page before moving on.

1.FINDING A GENE IN THE GENOME

a) Click “Genomes” on the blue bar at the top of the screen. This brings you to the Genome Browser Gateway, where you can select between different assemblies for different genomes. Select the human genome assembly from March 2006 (the most recent human assembly). In the box labeled “position or search term”, you can type in the name of a gene, an accession number or a chromosomal region. Some examples are given further down on the web page. For this exercise, we will investigate a gene called ADAM2, so enter that name in the position-box and click “Submit”.