
BIOINFORMATICS

BIO 208 Genetics, revised s10

Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned. The simplest tasks in bioinformatics concern the creation and maintenance of databases of biological information. Nucleic acid sequences (and the protein sequences derived from them) make up the majority of such databases. Bioinformatics also includes the development of new algorithms and statistics for assessing relationships among members of these large databases, as well as the analysis and interpretation of various data, including nucleotide and amino acid sequences, protein domains, and protein structures (computational biology).

Computational molecular biology includes:

  • finding the genes in the DNA sequences of various organisms
  • developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
  • clustering protein sequences into families of related sequences and the development of protein models
  • aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.

NCBI = National Center for Biotechnology Information. Established in 1988 as a national resource for molecular biology information, the NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information.

BLAST = Basic Local Alignment Search Tool. BLAST programs can be used to search both DNA and protein sequences on the NCBI server. The program you will use is BLASTp (p for protein).

GenBank contains over 6,000 whole genome sequences, and over 100,000,000 DNA sequences. Over 30,000 people per day access GenBank online.

OMIM = Online Mendelian Inheritance in Man, a comprehensive compendium of human genes and genetic phenotypes. OMIM contains information on all known Mendelian disorders and over 12,000 genes.

PubMed is a service of the U.S. National Library of Medicine that includes over 18 million citations from MEDLINE and other life science journals for biomedical articles back to 1948.

NCBI resources

Literature

  • Download a large, custom set of records from NCBI. Obtain the full text of an article
  • Find articles about a topic similar to that in a given article. Find published information on a gene or sequence

DNA & RNA

  • Download a large, custom set of records from NCBI
  • View/download features around an object or between two objects on a chromosome
  • Link from an object on a map to another resource. Obtain a genomic DNA clone for a gene
  • Retrieve all sequences for an organism or taxon
  • Find a curated version of a sequence record (NCBI Reference Sequence). Find transcript sequences for a gene
  • Design PCR primers and check them for specificity
  • Save text search and receive regular search results by e-mail. Find published information on gene or sequence

Proteins

  • Download a large, custom set of records from NCBI. View a mutation site in a 3D structure
  • View the 3D structure of a protein. Align two or more 3D structures to a given structure
  • Find the function of a gene or gene product. Link from an object on a map to another resource
  • Retrieve all sequences for an organism or taxon. Find a curated version of a sequence record
  • Find transcript sequences for a gene. Save a text search and/or receive regular search results by e-mail
  • Find published information on a gene or sequence

Sequence Analysis

  • Run BLAST software on a local computer. Design PCR primers and check them for specificity
  • Automate BLAST searches performed on NCBI servers. Run BLAST searches against custom, local databases
  • Submit multiple query sequences in a single BLAST search
  • Obtain genomic sequence for/near a gene, marker, transcript or protein

Genes & Expression

  • View all SNPs associated with a gene. View genotype frequency data for a gene, disease or SNP
  • Find genes associated with a phenotype or disease
  • Find human variants associated with a phenotype or disease as reported in the literature
  • Download a large, custom set of records from NCBI. Find the function of a gene or gene product
  • View/download features around an object or between two objects on a chromosome
  • Link from object on map to another resource. Find human variants with clinical association in SNP database
  • Find syntenic regions between the genomes of two organisms. Obtain a genomic DNA clone for a gene
  • Find transcript sequences for a gene. Save a text search and/or receive regular search results by e-mail
  • Find published information on a gene or sequence. Find a homolog for a gene in another organism
  • Obtain genomic sequence for/near a gene, marker, transcript or protein

Genomes

  • Download the complete genome for an organism
  • View/download features around an object or between two objects on a chromosome
  • Link from an object on a map to another resource. Find syntenic regions between the genomes of two organisms
  • Obtain a genomic DNA clone for a gene. Check the status of genome sequencing for an organism

Maps & Markers

  • View/download features around an object or between two objects on a chromosome
  • Link from an object on a map to another resource. Find syntenic regions between the genomes of two organisms
  • Obtain a genomic DNA clone for a gene

Domains & Structures

  • View a mutation site in a 3D structure. View the 3D structure of a protein. Align two or more 3D structures to a given structure
  • Find the function of a gene or gene product. Save a text search and/or receive regular search results by e-mail

Genetics & Medicine

  • View genotype frequency data for a gene, disease or SNP. Find genes associated with a phenotype or disease
  • Find human variants associated with a phenotype or disease as reported in the literature
  • Find human variants with a clinical association in the SNP database

Taxonomy

  • Retrieve all sequences for an organism or taxon. Find the complete taxonomic lineage for an organism
  • Generate a Common Tree for a set of taxa

Data & Software

  • Download the complete genome for an organism. Download a large, custom set of records from NCBI. Download NCBI software

Training & Tutorials

  • Learn about the basics of molecular biology and bioinformatics. Learn about an NCBI resource
  • Complete an NCBI tutorial. Find out what's new at NCBI

Homology

  • Find syntenic regions between the genomes of two organisms. Find a homolog for a gene in another organism

Small Molecules

  • Find bioassays in which a given drug is active. Find bioassays that test a particular disease or protein target

Variation

  • View all SNPs associated with a gene. View genotype frequency data for a gene, disease or SNP
  • Find genes associated with a phenotype or disease
  • Find human variants associated with a phenotype or disease as reported in the literature
  • Download a large, custom set of records from NCBI. View a mutation site in a 3D structure
  • Find human variants with a clinical association in the SNP database

PROBLEM CONTEXT (adapted from National Science Foundation)

You and your research partners attended a presentation at which you learned of an effective folk remedy used for the prevention of fungal disease in humans. The pasty nature of the remedy is provided by the structural protein keratin. During the presentation, evidence was presented suggesting that there is a lower incidence of breast cancer and heart disease among those who take the folk remedy.

The folk remedy contains the following:

Water

Salt

Pigeon feather extract

Muskmelon seeds

Southern copperhead snake venom

Your start-up biotech company is interested in the therapeutic effects of these agents and needs to identify the specific protein responsible for the observed anti-cancer and anti-heart disease effects.

After identifying the active protein, your company will isolate the gene encoding the protein. The gene will be engineered and cloned into bacteria. By growing the bacteria in culture, you will be able to purify large quantities of the protein, which can be used in FDA-regulated clinical trials.

Three amino acid sequences have been isolated from the folklore remedy. You will use the NCBI’s BLAST program to search the protein database to identify the protein from which these amino acid sequences were obtained. You will hypothesize as to which protein might possess the anti-cancer, anti-heart disease function desired.

Objectives:

To fully understand the purpose of the laboratory exercise including the problem context

To view some of the organism databases offered in the BLAST assembled genomes

To convert amino acid sequences into FASTA format

To utilize the BLAST searching tool to identify proteins using short amino acid sequences

To evaluate the contents of the folklore remedy with respect to their use in medicine

To identify a journal article relevant to the anti-cancer or anti-heart disease effects of the component(s) in the folk remedy

To explain why the selected protein is likely to have anti-cancer and anti-heart disease activity

I. IDENTITY OF THE UNKNOWN PROTEINS

All sequences for the BLAST alignment programs are entered in FASTA format. View the table on the last page of this handout to see the one letter FASTA code for each of the amino acids.

Q1. What is the one-letter code for the following amino acid sequence?

Methionine - Lysine - Leucine - Tyrosine - Serine - Leucine - Leucine - Serine - Leucine - Leucine - Phenylalanine - Leucine - Glycine - Valine - Leucine - Tryptophan - Arginine - Serine - Glutamic acid - Glycine - Valine - Alanine - Serine - Serine - Serine - Asparagine - Aspartic acid - Aspartic acid - Valine - Glycine

1 letter code (FASTA format)  ______
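The table lookup needed for this kind of conversion can also be sketched in a few lines of Python. The example below is only an illustration: the function name `names_to_one_letter` and the three-residue example peptide are hypothetical and are not one of the handout's sequences. The dictionary is copied from the amino acid symbols table at the end of this handout.

```python
# One-letter codes, copied from the "Amino acid symbols" table in this handout.
ONE_LETTER = {
    "alanine": "A", "arginine": "R", "asparagine": "N", "aspartic acid": "D",
    "cysteine": "C", "glutamic acid": "E", "glutamine": "Q", "glycine": "G",
    "histidine": "H", "isoleucine": "I", "leucine": "L", "lysine": "K",
    "methionine": "M", "phenylalanine": "F", "proline": "P", "serine": "S",
    "threonine": "T", "tryptophan": "W", "tyrosine": "Y", "valine": "V",
}


def names_to_one_letter(names):
    """Translate a list of full amino acid names into a one-letter string.

    Names are matched case-insensitively; a misspelled name raises KeyError,
    which is a useful check on the spelling of the input.
    """
    return "".join(ONE_LETTER[name.strip().lower()] for name in names)


# Hypothetical example peptide, chosen only for illustration.
example = ["Glycine", "Alanine", "Tryptophan"]
print(names_to_one_letter(example))  # GAW
```

Working the conversion by hand first and then checking it against a lookup like this is a quick way to catch a misread row in the table.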

The three amino acid sequences isolated from the folklore remedy are:

1: The above sequence from question 1

2: MSCYNPCLPC QPCGPTPLAN

3: dapanpccda atcklttgsq cadglccdqc
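Sequences 2 and 3 are printed in blocks of ten letters, and sequence 3 is in lower case. A minimal sketch of tidying them into FASTA-style text before pasting them into a search form is shown here. The function name `to_fasta` and the labels `protein_2` and `protein_3` are assumptions made for illustration. The `>` header line is the usual FASTA convention.

```python
import textwrap


def to_fasta(label, raw_sequence, width=60):
    """Return a FASTA-style record: a '>' header line, then the sequence.

    All whitespace is removed from the sequence, letters are upper-cased,
    and the sequence is wrapped at `width` characters per line.
    """
    cleaned = "".join(raw_sequence.split()).upper()
    body = "\n".join(textwrap.wrap(cleaned, width))
    return f">{label}\n{body}"


# The two sequences exactly as printed in the handout; labels are hypothetical.
print(to_fasta("protein_2", "MSCYNPCLPC QPCGPTPLAN"))
print(to_fasta("protein_3", "dapanpccda atcklttgsq cadglccdqc"))
```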

Searching the BLAST database

Access the NCBI home page at

Select BLAST (right column)

Select list all genomic species

Q2. Explore the species in GenBank by clicking on the organism groups (primates, rodents, monotremes, marsupials, invertebrates, protozoa, plants, fungi, etc). List the scientific name (Genus and species) of the following organisms.

Human

Chimpanzee

Mouse

Zebrafish

Fruit fly

Maize (corn)

Baker's yeast

Return and select protein BLAST (searches the protein database with an amino acid query)

Enter the amino acid sequence of protein 1 in FASTA format (see above)

Select the nr database = non-redundant protein sequences. Scroll down and click on BLAST.

View the color key for alignment scores. A score above 50 will be considered relevant in this exercise. Scroll down to sequences producing significant alignments and view the sequence that produced the strongest hit.
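The rule used in this exercise (keep hits scoring above 50, then take the highest score) can be written as a short Python sketch. The hit descriptions and scores below are made-up placeholders, not real BLAST output, and the names `Hit` and `best_relevant_hit` are hypothetical.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    description: str
    score: float


def best_relevant_hit(hits, threshold=50) -> Optional[Hit]:
    """Return the highest-scoring hit whose score is above `threshold`.

    Returns None when no hit clears the threshold.
    """
    relevant = [hit for hit in hits if hit.score > threshold]
    if not relevant:
        return None
    return max(relevant, key=lambda hit: hit.score)


# Placeholder data for illustration only; real scores come from the BLAST results page.
hits = [
    Hit("placeholder protein A", 42.0),
    Hit("placeholder protein B", 98.6),
    Hit("placeholder protein C", 61.2),
]
print(best_relevant_hit(hits))
```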

Q3. The sequence that has the best match has the highest score. What is the probable identity of the protein?

Q4. Select the link to the left of the sequence description (begins with sp or ref)

What is the source of the protein? List genus and species AND common name.

Q5. Scroll down to examine the amino acid sequence alignment. How many amino acids long is the protein?

Q6. Examine the amino acid sequence carefully to locate your original query sequence within it. Which amino acids (from what position to what position in the protein) correspond to your query sequence? ______to ______
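Locating a query inside a longer sequence is a substring search. Protein positions are conventionally counted from 1, whereas Python counts from 0, so the sketch below adds 1 to the start index. The function name `locate_query` and both sequences are hypothetical and unrelated to the proteins in this exercise.

```python
def locate_query(full_sequence, query):
    """Return the 1-based (start, end) positions of `query` in `full_sequence`.

    Only an exact match is found; returns None if the query does not occur.
    """
    start = full_sequence.find(query)
    if start == -1:
        return None
    return start + 1, start + len(query)


# Made-up sequences for illustration only.
full = "MAAAGKLTWQRS"
query = "GKLTW"
print(locate_query(full, query))  # (5, 9)
```

In the BLAST alignment view, the same information appears as the numbers at the two ends of the subject line.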

Determination of protein function

Search Wikipedia to identify the function of protein 1 in the organism you identified. The information in Wikipedia is accurate enough for the purposes of this bioinformatics exercise.

Q7. Examine the applications and ingredients of the folklore remedy. This will assist you in determining the function of the protein you have identified. What is the function of this protein in the folklore remedy? Based on the problem context, would your company pursue this protein?

Identification of Additional Proteins in the Folklore Medicine

Examine the FASTA sequence of protein 2. Use the table on the last page of the handout to determine the amino acid sequence of this protein.
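Going from one-letter codes back to three-letter abbreviations is the reverse table lookup. A minimal Python sketch follows, with the mapping copied from the amino acid symbols table at the end of this handout. The function name `one_to_three` and the short example string are hypothetical and are not protein 2.

```python
# Three-letter abbreviations, copied from the "Amino acid symbols" table in this handout.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def one_to_three(sequence):
    """Convert a one-letter sequence to hyphen-joined three-letter abbreviations.

    Whitespace is ignored and lower-case letters are accepted.
    """
    residues = "".join(sequence.split()).upper()
    return "-".join(THREE_LETTER[letter] for letter in residues)


# Hypothetical example string, chosen only for illustration.
print(one_to_three("gaw"))  # Gly-Ala-Trp
```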

Q8. What is the amino acid sequence of protein 2 (use 3 letter amino acid abbreviations):

Q9. Use BLAST to determine the identity of this protein. What is the identity of protein 2?

Q10. What is the genus and species of the organism that protein 2 was isolated from?

Q11. What is the common name of the organism?

Q12. Which one of the folklore ingredients did you identify?

Q13. Reread the problem context that describes the uses of the folk remedy. The protein you identified has a structural, or binding, function. Explain (Wikipedia can be used).

Q14. Use protein BLAST to determine the identity of protein 3. Although you may obtain a number of hits with a score of 100, observe only those that correspond to full sequences (not partial, or protein chains). What is the identity of protein 3?

Q15. What is the genus and species from which protein 3 was isolated?

Q16. What is the common name of the organism?

Q17. Which one of the folklore ingredients did you identify?

Q18. What is (are) the function(s) of the protein with respect to your interests in the biotech startup company described in the problem context? (Search Wikipedia and read at least 3 article links.)

Q19. List 4 journal names in which research biologists have published papers on this protein. Journal names can be found at the end of each article.

II. SUMMARY

Provide an analysis of your investigation. Of the 3 proteins analyzed, which do you think is responsible for reducing the risk of breast cancer and heart disease? Which element of the folklore remedy would you purify and pursue as a drug? What is the basis for your opinion? What are the roles of the 2 proteins not selected for further study in the folklore remedy?

Amino acid symbols

Amino acid / One letter symbol / Three letter symbol
alanine / A / Ala
arginine / R / Arg
asparagine / N / Asn
aspartic acid / D / Asp
cysteine / C / Cys
glutamic acid / E / Glu
glutamine / Q / Gln
glycine / G / Gly
histidine / H / His
isoleucine / I / Ile
leucine / L / Leu
lysine / K / Lys
methionine / M / Met
phenylalanine / F / Phe
proline / P / Pro
serine / S / Ser
threonine / T / Thr
tryptophan / W / Trp
tyrosine / Y / Tyr
valine / V / Val