Department of Molecular and Cellular Biology

MCB 415

Bioinformatics Exercises

University of Arizona

Department of Molecular and Cellular Biology

Tucson, Arizona 85721

Phone:

Fax:

Email:

Developed by: Dr. Frans Tax, Kimberly Finnell

Table of Contents

Introduction

A Primer of Important Things to Know When Analyzing DNA

1. Protein Sequence

Figure 1: DNA Polarity

Figure 2: Gene Colinearity

2. Protein Structure

Figure 3: Primary, Secondary, Tertiary, and Quaternary Structure

Figure 4: IGF-1 Protein Structure

3. Protein Function

A. Biochemical Functions

Figure 5: Location of Plasma Membranes

Table 1: Protein Functions in the Cell

B. Genetic Functions

Types of Bioinformatics Programs

1. Open Reading Frame Database

Table 2: Codon Table

2. ExPASy

3. PROSITE

4. BLOCKS

5. BLAST

Growth Factors

Gene 1 (IGF-2 gene)

Figure 6: IGF-2 Growth Regulation

Figure 7: Membrane Receptor/Ligand Dependent Tyrosine Kinase

Deletions in the IGF-2 gene

Gene 2 (Phytosulfokine gene)

Figure 8: Processing of the Phytosulfokine gene

Figure 9: Nucleotide and Amino Acid Sequence of AtPSK2 and AtPSK3

Figure 10: PSK Amino Acid Sequence Indicating Intron Location

Figure 11: Arabidopsis thaliana plant

Deletions in the Phytosulfokine gene

Database Procedures IGF-2 gene

Step 1 DNA sequence and PubMed

Step 2 ORF

Step 3 ExPASy and PROSITE (domains)

Step 5 BLAST

Database Procedures Phytosulfokine gene

Step 1 DNA Sequence and PubMed

Step 2 ORF

Step 3 ExPASy and PROSITE (domains)

Step 5 BLAST

List of other databases

Other applications for DNA analysis

Glossary

References

MCB415

Fall 2007

Bioinformatics is the science of managing and analyzing biological data using advanced computing programs.23 In this exercise our goal is to analyze the function and structure of a protein of interest by analyzing a sequence of DNA. First, we will take a DNA sequence and determine its protein-coding capability. This is done by entering a DNA sequence into a universal database provided by the National Center for Biotechnology Information (NCBI). Second, we will use the ORF (open reading frame) finder to locate a subset of the sequenced piece of DNA that begins with an initiation codon (ATG, which encodes methionine) and ends with a nonsense (stop) codon. These ORFs have the potential to encode our proteins of interest. Next, the PROSITE database will determine which protein family and domain this sequence fits into. Based on the information obtained, protein function can be analyzed. Finally, the DNA sequence will be compared to similar sequences using BLAST. Along the way, we will show you how sequence analysis is tied into the biological sciences literature database (PubMed). The NCBI sequence database can be linked to PubMed to find data concerning the specific function of a protein. PubMed has numerous citations and abstracts from published literature referenced by genetic sequence records.

You will be working with the two following growth factors.

Insulin-like Growth Factor (IGF)1

NM_000612 Insulin-like human growth factor gene (this is the NCBI nucleotide entry)

Phytosulfokine (PSK)1

NM_127851 Arabidopsis thaliana phytosulfokine-related gene (this is the NCBI nucleotide entry)

Both of these growth factors are small, secreted proteins or peptides that activate cell-surface receptors.
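The exercise itself uses the NCBI web pages, but the same two records can also be requested by accession number. The sketch below is only an illustration: it assumes the NCBI E-utilities `efetch` service with the `nuccore` database and FASTA output, and the helper names `fasta_url` and `fetch_fasta` are made up for this example. Only the URLs are printed; the download function needs network access and is not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service (assumed for this sketch).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fasta_url(accession):
    """Build a URL that requests one nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def fetch_fasta(accession):
    """Download the record as text (requires network access; not called here)."""
    with urlopen(fasta_url(accession)) as response:
        return response.read().decode("utf-8")

# The two accession numbers used in this exercise.
for accession in ("NM_000612", "NM_127851"):
    print(fasta_url(accession))
```

Pasting either printed URL into a browser should return the same sequence that the NCBI nucleotide entry page displays.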

A Primer of Important things to know when analyzing DNA.

1. DNA sequence – The two DNA strands in each of the chromosomes of an organism are antiparallel to each other. This means they have opposite chemical polarity: their sugar-phosphate backbones run in opposite directions, as shown in Figure 1 below. The 5’ end of a strand is the end where the 5th carbon of the sugar ring is exposed, and the 3’ end is the end where the 3rd carbon of the sugar ring is exposed. When DNA is copied into RNA, synthesis always occurs in the 5’ to 3’ direction. This is because each new nucleotide is joined, through its 5’ phosphate, to the free hydroxyl group on the 3’ carbon at the end of the growing chain. For some genes the top DNA strand is copied into mRNA, while other genes are encoded by the bottom strand.

Figure 1: DNA polarity 2
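The antiparallel arrangement can be illustrated with a few lines of Python. This is a minimal sketch with a made-up nine-base sequence; the function name `reverse_complement` is chosen for this example only. Given one strand written 5’ to 3’, it pairs each base with its partner and reverses the result, so that the second strand is also written 5’ to 3’.

```python
# Watson-Crick base pairing used to derive the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand_5_to_3):
    """Return the partner strand, also written 5' to 3'."""
    # Walk the input backwards and pair each base, so the output reads 5' -> 3'.
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3.upper()))

top = "ATGGCCTGA"                      # made-up example sequence
bottom = reverse_complement(top)

print("top    5'-" + top + "-3'")
print("bottom 3'-" + bottom[::-1] + "-5'")   # written backwards to align under the top strand
print("bottom strand read 5' to 3':", bottom)
```

Applying the function twice returns the original strand, which is a quick check that the two strands really are mirror-image partners.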

After we have the base sequence of the gene, we can determine the specific order of amino acids in the protein. Because a gene and its protein are colinear, the linear nucleotide sequence is translated into the linear amino acid sequence of the protein. This concept is called gene-protein colinearity.

Figure 2: Gene Colinearity

2. Protein structure – There are four levels of organization in the structure of a protein. The primary structure of the protein is the amino acid sequence. Sometimes a pattern of identical or similar amino acids with a particular spacing is a signature that a specific domain of a protein is conserved. The secondary structure is formed by alpha helices and beta sheets. The tertiary structure is the three-dimensional organization of a single polypeptide chain. The final level is the quaternary structure, which is formed by a complex of more than one polypeptide chain.26 By looking at a protein’s three-dimensional structure, protein function can often be inferred. For example, knowledge of the active sites of enzymes and the ligand binding sites of receptors will help show protein function. A protein domain is a discrete portion of the protein’s three-dimensional structure that is assumed to fold independently of the rest of the protein and to possess its own function. In some cases, a domain can still function when it is separated from the rest of the protein. A domain usually contains between 40 and 350 amino acids.26 Some proteins have similar three-dimensional structures but different amino acid sequences.

Figure 3: Primary, Secondary, Tertiary, and Quaternary protein structure.27

Figure 4: IGF protein three-dimensional structure 3

3. Protein function

A. Biochemical functions – The biochemical function of a protein may include catalytic and structural roles. One specialized function is helping cells respond to extracellular signals. For a given signaling pathway there can be components localized within cell membranes, attached to the cytosolic surface of the membrane, within the cytosol, or in the nucleus. Some proteins serve as a structural link between the intracellular and extracellular surfaces, others help to transfer molecules or ions in and out of the cell, some membrane proteins help to catalyze reactions, and other membrane proteins provide structural links between the extracellular matrix and the cell’s cytoskeleton.4

Figure 5: Location of membrane proteins 4

Table 1: This table represents some of the various roles proteins can play within and outside of the cell.

Protein Functions in the Cell
Protein Function / Protein Type / Protein Example
1. Signal Transduction and Metabolism: These proteins detect, amplify, and integrate extracellular proteins into ion channels and initiate gene expression.7 / Integral Membrane Proteins / Hormones and Receptors. For example: Phytosulfokine peptide growth factor and its LRR (leucine-rich repeat) receptor kinase.
2. Fibrous and Structural: Serve as the biological glue for keeping structures in place.7 Mechanical: Proteins that do mechanical work.7 / Fibrous and Structural: Cytoskeleton proteins. Mechanical: Contractile muscle proteins. / Fibrous and Structural: Intracellular: actin filaments and microtubules. Extracellular: collagen, fibrin, and keratin. Mechanical: Myosin and actin.
3. Immune Response: The antibody immune response involves several hundred genes to initiate an immune attack.5 / Antibodies: Proteins that protect the body against antigens. / An antibody protein is called an immunoglobulin. For example: The IgG antigen binding site is composed of 108 amino acids.5
4. Nucleic Acid Binding: These proteins take part in the DNA regulatory process. They regulate DNA packing, replication, and transcription. / Regulatory proteins are involved in recognizing specific DNA sequences. / These include the helix-turn-helix motif, the leucine zipper motif, and the zinc finger motif.
5. Catalyst: Increases the rate of a reaction.7 / Enzymes are proteins that act as biological catalysts. / Ligases are a class of enzymes that catalyze the joining of two molecules.
6. Storage and Transport: Storage proteins serve as reservoirs for amino acids and have good nutritional value for us. Transport proteins move substances against a diffusion gradient.6 / Storage proteins provide essential nutrients. Transport proteins transport glucose, fatty acids, and gases. / Ovalbumin (egg white) acts as a storage protein. Hemoglobin acts as a gas (O2) transporter in the cell.6

B. Genetic functions – The genetic function is defined by the role of the protein in the organism. When the gene is altered by mutation, changes in the organism’s phenotype may become apparent. For example, in one case study a 15-year-old boy who was homozygous for a deletion of the IGF-1 gene had delayed intrauterine growth, sensorineural deafness, and mild mental retardation. The effects of losing the IGF-1 gene indicate that this gene is necessary for fetal growth and development, including central nervous system development.8

4. ORF (open reading frame)

An ORF is a section of a sequenced piece of DNA or cDNA that begins with an initiation codon (ATG, which encodes methionine) and ends with a nonsense codon (TAA, TAG, or TGA). The ORF is defined by the placement of start and stop codons. These correspond to the sites on the mRNA where translation starts and stops. The region between the start and stop codons determines the protein sequence. The following table shows amino acids and their corresponding codons.

Table 2: Reverse codon table. This table shows the 20 amino acids used in proteins, together with the mRNA codons that code for them.9
Ala / GCU, GCC, GCA, GCG / Leu / UUA, UUG, CUU, CUC, CUA, CUG
Arg / CGU, CGC, CGA, CGG, AGA, AGG / Lys / AAA, AAG
Asn / AAU, AAC / Met / AUG
Asp / GAU, GAC / Phe / UUU, UUC
Cys / UGU, UGC / Pro / CCU, CCC, CCA, CCG
Gln / CAA, CAG / Ser / UCU, UCC, UCA, UCG, AGU, AGC
Glu / GAA, GAG / Thr / ACU, ACC, ACA, ACG
Gly / GGU, GGC, GGA, GGG / Trp / UGG
His / CAU, CAC / Tyr / UAU, UAC
Ile / AUU, AUC, AUA / Val / GUU, GUC, GUA, GUG
START / AUG, AUU*, GUG* / STOP / UAG, UGA, UAA

Note: The GUG and AUU start codons are for Prokaryotes only.28
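To see how Table 2 is used in practice, the sketch below enters the table as amino acid to codon lists, inverts it into a codon to amino acid lookup, and reads a short made-up mRNA three bases at a time. The names `REVERSE_TABLE`, `CODON_TO_AA`, and `translate` are chosen for this example, and the sketch assumes translation starts at the first AUG only (the prokaryotic AUU and GUG start codons are ignored).

```python
# Table 2 entered as amino acid -> codons (mRNA alphabet).
REVERSE_TABLE = {
    "Ala": "GCU GCC GCA GCG",         "Leu": "UUA UUG CUU CUC CUA CUG",
    "Arg": "CGU CGC CGA CGG AGA AGG", "Lys": "AAA AAG",
    "Asn": "AAU AAC",                 "Met": "AUG",
    "Asp": "GAU GAC",                 "Phe": "UUU UUC",
    "Cys": "UGU UGC",                 "Pro": "CCU CCC CCA CCG",
    "Gln": "CAA CAG",                 "Ser": "UCU UCC UCA UCG AGU AGC",
    "Glu": "GAA GAG",                 "Thr": "ACU ACC ACA ACG",
    "Gly": "GGU GGC GGA GGG",         "Trp": "UGG",
    "His": "CAU CAC",                 "Tyr": "UAU UAC",
    "Ile": "AUU AUC AUA",             "Val": "GUU GUC GUA GUG",
    "STOP": "UAG UGA UAA",
}

# Invert the table: each of the 64 codons points to one amino acid (or STOP).
CODON_TO_AA = {codon: aa
               for aa, codons in REVERSE_TABLE.items()
               for codon in codons.split()}

def translate(mrna):
    """Translate from the first AUG up to, but not including, the first stop codon."""
    mrna = mrna.upper().replace("T", "U")   # accept DNA letters as well
    start = mrna.find("AUG")
    if start == -1:
        return []                           # no start codon, nothing to translate
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TO_AA[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        peptide.append(amino_acid)
    return peptide

print(translate("GGAUGGCCUUUUAAGG"))   # made-up mRNA fragment
```

Because the lookup is built directly from Table 2, checking that it holds exactly 64 entries is a simple way to confirm that no codon was mistyped or left out.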

Each region of a double-stranded DNA sequence has six possible reading frames to choose from. For instance, the upper strand has the 0, +1, and +2 frames and the lower strand has the 0, -1, and -2 frames. Generally, since most proteins are large, the longest open reading frame found is assumed to be the correct reading frame.
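The six-frame search can be sketched in Python as follows. This is a simplified illustration and not the algorithm used by the NCBI ORF finder: it scans three offsets on the upper strand and three on the reverse complement (the lower strand), and reports the longest stretch that runs from an ATG to the first in-frame stop codon. The function names and the `("upper", offset)` frame labels are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Return the lower strand written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def orfs_in_frame(dna, offset):
    """Yield (start, end) for every ATG...stop stretch in one reading frame."""
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3             # stop codon closes it
            start = None

def longest_orf(dna):
    """Search all six frames; return (orf_sequence, (strand, offset))."""
    dna = dna.upper()
    best_seq, best_frame = "", None
    for strand_name, strand in (("upper", dna), ("lower", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand, offset):
                if end - start > len(best_seq):
                    best_seq, best_frame = strand[start:end], (strand_name, offset)
    return best_seq, best_frame

print(longest_orf("CCATGAAATTTGGGTAGCC"))   # made-up example sequence
```

An ORF that starts with ATG but runs off the end of the sequence without reaching a stop codon is not reported by this sketch; a real tool may handle that case differently.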

Why do we pick the longest open reading frame? Let’s think about why a long open reading frame is significant. In a random stretch of DNA, 3 of the 64 possible codons are stop codons. Thus, you would expect a stop codon about once every 21 codons (64/3). Therefore, an open reading frame of more than 80 codons is unlikely to occur by chance and has a good chance of encoding a protein. Of course, this is more useful when analyzing large proteins.
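The reasoning above can be checked with a short calculation. The sketch assumes a simplified model in which all 64 codons are equally likely and each codon is independent of its neighbors, which real genomes only approximate. Under that model the chance that a run of n codons contains no stop codon is (61/64) raised to the power n, which for n = 80 comes out at roughly 2%.

```python
STOP_CODONS = 3
ALL_CODONS = 64

# Chance that any one random codon is a stop codon.
p_stop = STOP_CODONS / ALL_CODONS

# Average number of codons between stop codons in random sequence (64/3).
expected_spacing = 1 / p_stop

def prob_no_stop(n_codons):
    """Probability that n random codons in a row contain no stop codon."""
    return (1 - p_stop) ** n_codons

print("average spacing between stops:", round(expected_spacing, 1), "codons")
for n in (20, 80, 200):
    print(n, "codons without a stop: probability", round(prob_no_stop(n), 4))
```

The longer the open stretch, the smaller this probability becomes, which is why a long ORF is taken as evidence of a real coding region.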

Types of Bioinformatics tools

1. ORF Finder (Open Reading Frame Finder) – At the NCBI web site, we will be using the ORF Finder. ORF Finder is a graphical analysis tool that helps to choose the most appropriate reading frame. The resulting amino acid sequence can then be used to search other databases for protein analysis.

2. ExPASy (Expert Protein Analysis System) allows one to translate a DNA sequence into a protein sequence and predicted structure.12

3. PROSITE is a database at the ExPASy web site that scans a new protein sequence and identifies which protein family and domain it belongs to. When the protein sequence does not have overall similarity with other proteins in the database, the system uses sequence patterns (motifs) within the protein to suggest its function. PROSITE contains a database of known protein domains.12 Protein domains that belong to a particular family of proteins share functional similarities and are derived from a common ancestor. These similarities may include an amino acid sequence and a three-dimensional conformation that resemble those of the other family members.11
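Motif scanning of this kind can be imitated with regular expressions. The sketch below is an illustration only and not PROSITE's own software: it converts a pattern written in PROSITE-style notation into a Python regular expression and lists where it matches in a made-up peptide. The example pattern `N-{P}-[ST]-{P}` is the commonly cited N-glycosylation site motif (asparagine, any residue except proline, serine or threonine, any residue except proline). The function names are chosen for this example, and only the basic notation is handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert basic PROSITE-style notation into a Python regular expression."""
    regex = pattern.rstrip(".")                            # patterns may end with a period
    regex = regex.replace("-", "")                         # '-' only separates positions
    regex = regex.replace("x", ".")                        # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")     # {P} = anything except P
    regex = regex.replace("(", "{").replace(")", "}")      # x(2,4) = 2 to 4 residues
    regex = regex.replace("<", "^").replace(">", "$")      # anchors to the sequence ends
    return regex

def find_motif(protein, prosite_pattern):
    """Return (position, matched_text) pairs; positions count from 0."""
    # The lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(prosite_pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein)]

peptide = "MKNGSAANPTQ"   # made-up sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif(peptide, "N-{P}-[ST]-{P}"))
```

In the made-up peptide the stretch NGSA fits the pattern, while NPTQ does not because a proline follows the asparagine.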

4. BLAST (Basic Local Alignment Search Tool) compares your DNA or protein sequence against nucleotide and protein sequence databases to find similar sequences. BLAST can be used to analyze functional and evolutionary relationships between sequences as well as to identify gene families. There are five different types of BLAST programs.

a. blastp, compares an amino acid query sequence against a protein sequence database.

b. blastn, compares a nucleotide query sequence against a nucleotide sequence database.

c. blastx, compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

d. tblastn, compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

e. tblastx, compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.10
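The choice among the five programs depends only on the type of the query sequence, the type of the database, and whether a nucleotide sequence is translated first. The small lookup below restates the list above in code; the function name `choose_blast` and its `translate_both` argument are made up for this sketch.

```python
# (query type, database type) -> program, for searches a to d in the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",      # query translated in six frames
    ("protein", "nucleotide"): "tblastn",     # database translated in six frames
}

def choose_blast(query, database, translate_both=False):
    """Pick a BLAST program from the query and database types."""
    if translate_both:
        # tblastx translates both the query and the database in six frames.
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return BLAST_PROGRAMS[(query, database)]

print(choose_blast("nucleotide", "protein"))
print(choose_blast("protein", "nucleotide"))
print(choose_blast("nucleotide", "nucleotide", translate_both=True))
```

For example, a newly sequenced piece of DNA with no known protein product would typically be searched against a protein database with blastx.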

How might you use a BLAST search for something other than identifying a protein for which you have partial sequence information?

Answer: BLAST also identifies other sequences that are homologous with the protein.

The proteins that we will be looking at are from a group of proteins called growth factors.

Growth Factors

Growth Factors are small proteins which bind specific transmembrane receptors that control cell growth and differentiation. Growth factors signal from the outside to initiate a response within cells. We will look at two different examples, one in animals (IGF) and one in plants (PSK).13, 14

IGF (Insulin-like growth factor)

IGF is a protein produced primarily in the liver in response to stimulation by growth hormone (GH). IGF provides the best indicator of growth hormone levels and optimal levels are linked to healthy bone, heart, thyroid, skin, and nervous system. IGF is produced by many different tissues within the body, such as the heart, lung, kidney, liver, pancreas, spleen, small intestines, large intestines, ovaries, placenta, testes, brain, bone, and pituitary.