Lab Exercise #3: BLAST, PSI-BLAST, and PHI-BLAST

CISC 4020 Bioinformatics Tuesday, February 22, 2011

(due March 1 – submit on Blackboard)

Resources:

NCBI BLAST:

(1) Perform a blastp search at NCBI using the following query of just 12 amino acids: PNLHGLFGRKTG. By default, the parameters are adjusted automatically for short queries (you can view the settings used via the “Search Summary” link). Inspect the search summary of the output. What is the E value cutoff? What is the word size? What is the scoring matrix? How do these settings compare to the default parameters of a standard blastp search?

(2) Protein searches are usually more informative than DNA searches. Do a blastp search using RBP4 (NP_006735), restricting the output to Arthropoda (the phylum that includes insects). Next, do a blastn search using the RBP4 nucleotide sequence (NM_006744; select only the nucleotides corresponding to the coding region of the DNA). Which search is more informative? How many database matches have an E value less than 1.0 in each search?

Hint: Go to Entrez Nucleotide, enter the query NM_006744, click on CDS (coding sequences) on the lower left part of the page, and select FASTA as the format. Using this query, search the Reference RNA sequence database, restricting the output to Arthropoda.

(3) “The Iceman” is a man who lived 5300 years ago and whose body was recovered from the Italian Alps in 1991. Some fungal material was recovered from his clothing and sequenced. To what modern species is the fungal DNA most related?

Hint: Search Entrez nucleotide with the query "iceman" and look for fungal entries. If you are not sure which entries are fungal, you can start by going to the Taxonomy home. On the left sidebar click "eukaryota" then scroll down and click on "fungi." Click "fungi" again and you will open the taxonomy page at the root of all fungi. There is a link to Entrez nucleotide entries; click it, add the query term iceman (so your query reads: txid4751[Organism:exp] AND iceman).
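The same query can also be submitted programmatically. The following is a minimal Python sketch, assuming the NCBI E-utilities ESearch endpoint and the "nuccore" database name; the helper function names are made up for illustration, and the sketch only builds the request URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fungal_keyword_term(taxid=4751, keyword="iceman"):
    """Build the Entrez query from the hint: all fungi (txid4751) AND a keyword."""
    return f"txid{taxid}[Organism:exp] AND {keyword}"


def build_esearch_url(term, db="nuccore", retmax=20):
    """Return an ESearch URL for the given query term (no network access here)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Paste the printed URL into a browser to see the matching record IDs.
    print(build_esearch_url(fungal_keyword_term()))
```

The web interface described in the hint is sufficient for this exercise; the sketch only shows how the query string maps onto a URL.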

(4) The malaria parasite Plasmodium vivax has a multigene family called vir that is specific to that organism (del Portillo et al., 2001). There are 600 to 1000 copies of these genes, and they may have a role in causing chronic infection through antigenic variation. Select vir1 and perform a blastp search of the nonredundant database. Then perform a PSI-BLAST search with the same entry.

(a) In the initial search, approximately how many proteins have an E value less than 0.002, and how many have an E value greater than 0.002?

(b) What is the score of the best new sequence that is added between the first iteration and the second iteration of PSI-BLAST?

(5) Provided for you are 4 protein accession numbers:
gi|151567676, gi|1680618, gi|4503761, gi|32699184

Steps to follow:

  • For each of the above protein IDs, use PSI-BLAST to find the protein family.
  • Under “Algorithm parameters” apply the Filter "Low complexity regions."
  • Iterate the search using the derived profile. Perform five iterations.
    (You will notice there is a "Run PSI-BLAST" option on the query results page.)
  • For every iteration, what are the top new proteins identified (they will be labeled “new”)?
  • Record the E-values of the top five new sequences and then compare the E-values of the first iteration to the fifth iteration.
  • On the fifth iteration, find the types of proteins you get as new top hits.
  • Search the Swiss-Prot database with each of the protein accession IDs above.
  • Name 5 of the reported hits that you get from Swiss-Prot.
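For bookkeeping across the five iterations, it can help to record the hit list of each iteration and compute which entries are new. The following is a minimal Python sketch, assuming you have copied the accession IDs of each iteration into lists by hand; the function name and the IDs shown (HIT_A, HIT_B, ...) are placeholders, not real results.

```python
def new_hits_per_iteration(iterations):
    """For each iteration after the first, list the hits absent from the previous one.

    `iterations` is a list of hit lists, one per PSI-BLAST iteration, in order.
    The order of hits within each iteration is preserved.
    """
    result = []
    for previous, current in zip(iterations, iterations[1:]):
        seen = set(previous)
        result.append([acc for acc in current if acc not in seen])
    return result


if __name__ == "__main__":
    # Placeholder IDs only; replace with the accessions from your own searches.
    recorded = [
        ["HIT_A", "HIT_B"],
        ["HIT_A", "HIT_C", "HIT_B"],
        ["HIT_A", "HIT_C", "HIT_B", "HIT_D"],
    ]
    for number, hits in enumerate(new_hits_per_iteration(recorded), start=2):
        print(f"iteration {number}: new hits {hits}")
```

This compares each iteration only with the one before it, which is an assumption about what counts as "new"; check it against the labels shown on the results page.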

(6) Explore PHI-BLAST using human RBP4 (NP_006735) as a query, restricting the output to bacteria and the RefSeq database. Use the PHI pattern GXW[YF]X[VILMAFY]A[RKH]. Perform this search, and save the results. Then repeat the search using the PHI pattern GXW[YF][EA][IVLM]. How do the results differ? Select one bacterial protein that appears in both sets of results and examine its pairwise alignment with the human RBP4 query; what are its E values in the two searches, and why do they differ?
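To see which stretch of a sequence a PHI pattern would anchor on, the pattern can be translated into a regular expression. The following is a minimal Python sketch, not part of PHI-BLAST itself; the function names and the toy sequence are made up for illustration, and the translation handles only single residue letters, X (any residue), bracketed residue sets, and optional hyphens.

```python
import re


def phi_pattern_to_regex(pattern):
    """Translate a simple PHI pattern such as GXW[YF]X[VILMAFY]A[RKH] into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":            # hyphens only separate positions
            i += 1
        elif ch in "xX":         # X stands for any residue
            parts.append(".")
            i += 1
        elif ch == "[":          # a bracketed set is copied through as a character class
            close = pattern.index("]", i)
            parts.append(pattern[i:close + 1].upper())
            i = close + 1
        else:                    # a fixed residue
            parts.append(ch.upper())
            i += 1
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(phi_pattern_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MKSGTWYAMAKKDP"  # toy fragment for illustration only
    for pattern in ("GXW[YF]X[VILMAFY]A[RKH]", "GXW[YF][EA][IVLM]"):
        print(pattern, find_pattern(pattern, toy_sequence))
```

The second pattern is shorter and constrains fewer positions, so more database sequences can contain it; keep this in mind when comparing the two result sets.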