Worksheets Must Be Completed and Handed in by the End of Class

BIO 345

Intro to Bioinformatics

Name:

mRNA Accession #:

Protein Accession #:

1) (5pts) Using your mRNA accession number, obtain the FASTA sequence file for your mRNA via NCBI Nucleotide. Using this sequence, ORF Finder, and UCSC BLAT, provide the following information (answers off by 1 nucleotide will be marked wrong):

What is the full name of your gene?

What organism is your sequence from?

What is the total length of your mRNA sequence?

What is the nucleotide position (residue #) of the first nucleotide in the START codon?

What is the nucleotide position (residue #) of the last nucleotide in the STOP codon?

What are the nucleotide positions of the 5’ UTR (where does it begin and end)?

What are the nucleotide positions of the 3’ UTR (where does it begin and end)?

What is the total length of the protein generated by this mRNA?

How many introns were removed from this sequence?

For each intron that was removed, what are the nucleotide positions (residue #) in the mRNA of the two nucleotides on either side of the site where the intron was removed?
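As an optional self-check on the coordinates asked for in question 1, the following is a minimal sketch of how the start codon, stop codon, UTR boundaries, and protein length relate to one another when positions are counted from 1. It assumes the coding region is the longest ATG-to-stop open reading frame on the forward strand, which is a simplification of what ORF Finder reports, and it does not address introns. The function names and the toy sequence are hypothetical and are not taken from any real accession.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF, 1-based inclusive.

    Only the forward strand is scanned; end is the last nucleotide of the
    stop codon. Returns None if no complete ORF exists.
    """
    seq = seq.upper().replace("U", "T")
    best = None
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for j in range(i, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                start, end = i + 1, j + 3  # convert to 1-based positions
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break
    return best

def summarize(seq):
    """Collect the 1-based coordinates requested in question 1."""
    orf = longest_orf(seq)
    if orf is None:
        raise ValueError("no complete ORF found")
    start, end = orf
    return {
        "mrna_length": len(seq),
        "start_codon_first_nt": start,
        "stop_codon_last_nt": end,
        "five_prime_utr": (1, start - 1) if start > 1 else None,
        "three_prime_utr": (end + 1, len(seq)) if end < len(seq) else None,
        # The stop codon is part of the ORF but does not encode an amino acid.
        "protein_length": (end - start + 1) // 3 - 1,
    }

# Toy sequence (invented): 6 nt 5' UTR, ATG GCT TGG AAA TAA, 6 nt 3' UTR.
toy = "GGCACCATGGCTTGGAAATAAGCTTTT"
print(summarize(toy))
```

Note that the 5’ UTR ends one position before the first nucleotide of the start codon, and the 3’ UTR begins one position after the last nucleotide of the stop codon; this is where off-by-one errors most often arise.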

2) (5pts) Using your protein accession number, perform a BLASTp search against the human RefSeq database to find potential paralogs. Create a multiple sequence alignment in Clustal Omega using your sequence and the 3 human genes most similar to your sequence (be sure that you have 4 unique genes; be careful not to use multiple isoforms of the same gene, which usually contain the words isoform/predicted/X1).

Does your protein contain any conserved domains? If so, what is the name of each domain? In your own words (not copy and pasted), what is the function of proteins that contain this domain?

Print your alignment with numbers and attach it to your worksheet (you can also paste it into the worksheet).

Attach the percent identity matrix to your worksheet.

Which sequence has the highest percent identity to your sequence?

Which sequence is most distant from your sequence?

Which two sequences in your matrix are most closely related?
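To illustrate how a percent identity matrix can be read, here is a minimal sketch that computes pairwise identities from already-aligned sequences. It assumes one common definition (identical residues divided by the number of columns in which neither sequence has a gap); the exact denominator used by Clustal Omega may differ, so the values will not necessarily match its output. The sequence names and the aligned strings are invented for illustration.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def identity_matrix(aligned):
    """Map each unordered pair of names to its percent identity."""
    return {
        frozenset((a, b)): percent_identity(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }

# Invented toy alignment: "query" stands in for your own sequence.
aligned = {
    "query": "MKT-LLVA",
    "geneA": "MKTALLVG",
    "geneB": "MRT-LIVA",
    "geneC": "MQSALIGA",
}

matrix = identity_matrix(aligned)
others = [name for name in aligned if name != "query"]

# Highest and lowest identity relative to the query sequence.
closest = max(others, key=lambda n: matrix[frozenset(("query", n))])
farthest = min(others, key=lambda n: matrix[frozenset(("query", n))])

# The most closely related pair anywhere in the matrix.
closest_pair = max(matrix, key=matrix.get)

print(closest, farthest, sorted(closest_pair))
```

The most closely related pair in the whole matrix does not have to include your own sequence, which is why it is looked up separately from the comparisons against the query.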

3) (5pts) Using your protein accession number, perform a BLASTp search to identify potential orthologous sequences in the RefSeq database (exclude human sequences). Identify 9 homologous sequences, each from a unique species (these can be predicted sequences). Only two of the nine sequences can come from primates (if you are not sure whether a sequence comes from a primate, look at the taxonomy information in the NCBI sequence record). To search specifically for non-primate sequences, go back to your BLASTp search and exclude primates in the organism option. Use these 10 sequences (including your human sequence) to generate a multiple alignment in Clustal Omega. Use this alignment file to generate a phylogenetic tree in Clustal_W2. The settings in Clustal_W2 should be as follows: Tree Format (Distance Matrix), Distance Correction (ON), Exclude Gaps (ON), Clustering Method (UPGMA) and PIM (ON). Display your tree as a Cladogram.

Copy and paste/attach your tree to this document.

What are the species represented in your tree (what are the common names of these species)?

How many sister taxa does your tree have?

List the pairs of species found in sister taxa.

Which species is most similar to your human sequence?

Which species is least similar to your human sequence?
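For intuition about the UPGMA clustering method named in question 3, the sketch below builds a tree from a small distance matrix and then lists the sister pairs. It is not the Clustal_W2 implementation: the species labels A through E and all distances are invented, branch lengths are ignored, and "sister taxa" is taken here to mean two species joined directly at the same node, which is an assumption about how the term is used on this worksheet.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster with UPGMA and return the tree as nested 2-tuples of names.

    dist maps a pair (a, b) of names to their distance.
    """
    size = {name: 1 for name in names}          # cluster -> number of leaves
    d = {frozenset(k): v for k, v in dist.items()}
    while len(size) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c == a or c == b:
                continue
            # Size-weighted average of the distances to the two joined clusters.
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))

def sister_pairs(tree):
    """Return every pair of leaves that are joined directly at one node."""
    if isinstance(tree, str):
        return []
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return [(left, right)]
    return sister_pairs(left) + sister_pairs(right)

# Invented distances between five hypothetical species.
names = ["A", "B", "C", "D", "E"]
dist = {
    ("A", "B"): 0.02, ("C", "D"): 0.10,
    ("A", "C"): 0.30, ("A", "D"): 0.32,
    ("B", "C"): 0.31, ("B", "D"): 0.33,
    ("A", "E"): 0.60, ("B", "E"): 0.60,
    ("C", "E"): 0.62, ("D", "E"): 0.62,
}

tree = upgma(names, dist)
print(tree)
print(sister_pairs(tree))
```

In this toy tree, species E joins last and has no single sister species, so it does not appear in any sister pair.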