Nifty Bioinformatics Assignments

By Dean Zeller

Kent State University

Summer, 2007

http://www.cs.kent.edu/~dzeller

Table of Contents

Assignment 1 – DNA Modeling 1

Assignment 2 – Evolution Models I (Binary Evolution Trees) 3

Assignment 3 – Evolution Models II (Incremental K-Leaf Root) 6

Assignment 4 – DNA Pattern Matching Statistics 8

Assignment 5 – DNA Visualization 13

Assignment 6 – Longest Common Subsequence 17

Introduction

The following is a collection of assignments relating to bioinformatics. All assignments have been written in the past two years and were tested in an actual classroom environment. Bioinformatics itself is an involved and complex field, combining concepts of biology, mathematics, and computer science. However, the introductory topics of bioinformatics are not difficult, and need not require a college level mentality. The prerequisite knowledge for these assignments are purposely low, and are intended to “spark” potential interest within students.

Copyright notice

These assignments and accompanying materials are copyright ©2007 Dean Zeller are all available at the author’s web site. Permission of these assignments is granted for the copy and use for personal or educational purposes. Please contact the author with questions, comments, or feedback.

Assignment 1 – DNA Modeling

Dean Zeller Due: ______

C&I 47330 10 points

Fall, 2006

Objective The student will create structurally accurate models of DNA out of pipe cleaners.

Materials pipe cleaners, scissors, 3x5 card (optional), hole punch (optional)

Level 0 – no experience required

Topics

This assignment introduces students to concepts of modeling using DNA as the modeled system. It is appropriate for any age over 5, is challenging, and contains of breadth of useful skills in a wide variety of areas. This assignment is suitable for any mathematics, science, or art class. No prior experience or knowledge is required to complete the assignment.

Background: Modeling

Chemists, mathematicians, and many other scientific professions use models as visual representations of structures. One important aspect of modeling is structural accuracy, i.e. the extent to which it represents key elements of the system it is modeling. In this assignment, the main element to represent about DNA is the bonding of AT and CG nucleotide pairs. As you increase in skills, there are other elements that can be added to increase accuracy.

When crafting a model, structural integrity (or stability) is important for the model to keep its shape. It cannot fall apart while being analyzed or modified. There are many materials used in modeling of this nature. For this assignment, the modeling material is pipe cleaners because of their ability to bend and keep their shape. The instability of a pipe cleaner model is in how the pieces are put together. Pieces can be connected through a variety of twisting, wrapping, and coiling. Doing this correctly takes skill and practice.

Instructions

Follow the steps below to create a single DNA strand of 10 nucleotides. Create additional strands as time permits. See the poster for picture diagrams of the instructions.

Step 0: Setup

A. Select colors for DNA stalks and nucleotides (A, T, C, G).

B. Cut the nucleotide pipe cleaners in fourths.

C. Put the nucleotides into AT and CG pairs.

Step 1: Create nucleotides

Use these steps to create ten (10) nucleotide pairs.

A. Select two pipe cleaner pieces of the appropriate colors to form a nucleotide pair (AT or CG).

B. Put in a V-shape with a small overlap (1/4”). Choose a long end and the corresponding short end on the same side of the V.

C. Coil the long end three or four revolutions around the other short end, covering it completely. There should be some extra left over to attach to a stalk.

D. Bend the remaining long and short ends 90° to form a T-shape.

E. Repeat step 1C with the remaining long and short ends.

Step 2: Connect Nucleotides to Stalks

Attach nucleotides to stalks to form a “ladder”

A. Attach one side of nucleotide to stalk by wrapping extra length around stalk twice (once on the left, and once on the right). Wrap any remaining length around itself.

B. Repeat for other nucleotides, spreading evenly across the stalk.

C. Attach opposite stalk in similar way. Leave enough room at ends to connect to other DNA strands.

Step 3: …and do the Twist

In “ladder” form, the DNA strand may look inconsistent in size. When twisted into a double-helix form, the inconsistencies tend to work themselves out.

Further Activities

· Create a model of an actual DNA sequence.

· Longer strands: strands of length 100 can be completed in about two hours, once the procedure is learned and practiced.

· In groups, use an “assembly line” approach to create a very long DNA strand.

· Class contests: longest DNA strand, strongest DNA strand

Grading

You will be graded on the following criteria:

Accuracy Correct representation of AT/CG pairs in the model

Stability Structural integrity of the model

Effort Quantity, length, and quality of models created

Extra credit will be given for the following:

· Use a 3x5 card to label the model with the appropriate model nucleotide characters (both sides of the DNA strand).

· Connect models together to create longer models

Assignment 2 – Evolution Models I (Binary Evolution Trees)

Dean Zeller Due: ______

CS10051 10 points

Spring, 2006

Objective The student will use a graphics package to create diagrams of binary evolution trees.

Level 1 – use of a graphics package and/or word processor required

Readings Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999.

Background: Evolution Trees

This assignment deals with the cutting-edge topic of bioinformatics. It is a complex field of graph theory with applications to mathematics, computer science, and genetics. An evolutionary tree (or phylogeny) is a tree-structure that demonstrates evolution of species over time. A tree consists of vertices (nodes) connected by edges (links). In evolution, a node represents a point in which a species population “splits” into two genetically different species. Once a species splits, the two species created are genetically unable to produce offspring. The leaves of the tree represent the extant (non-extinct) species.

Background: Isomorphism

Two trees are isomorphic if they contain the same structure and are different only through symmetry of a node. In order to simplify the problem, isomorphic trees are considered the same for purposes of this assignment.

Background: Classification labels

In order to easily name trees, they are given a text description called a classification label. This allows a text label instead of a visual picture to represent a tree structure. A good classification system has a unique name for each tree. For this assignment, the classification system is simply a listing of the number of nodes at each level. Trees A, B, C, and D above have the label (1242), indicating the first level has one node, the second has two nodes, the third has four nodes, and the fourth has two. Since the four trees are isomorphic, the same label represents all four trees. Ultimately, this classification system is incomplete. While simple to understand, it does not provide unique names for each tree at the higher levels. Trees E and F are not isomorphic, and thus are given separate labels (1244a and 1244b). This system will suffice for now, but can get confusing as the size of the trees increase.

Background: Assumptions

Introductory study of phylogenies makes another assumption that greatly simplifies the problem. For purposes of this assignment, the only non-leaf structure possible are nodes with exactly two offspring, representing a point in time in which a population “splits” into two populations. This is called a binary evolution tree. A node with a single offspring is called a redundant node and does not significantly add to the tree structure. Nodes with three or more children can be isomorphically approximated, and thus can be ignored at this stage with only a minimal loss of information. While all nodes in evolution trees are unique species, at this point only the leaf nodes need to be considered.

Task 1 – Draw Given Trees (5 points)

Given below are thirteen evolution trees. This represents the complete set of all non-isomorphic evolution trees of up to six leaves (11 nodes). Use a graphics package to recreate these trees. Give the classification for the tree and label its leaves with successive letters (a, b, c, etc…) Your design style may differ from the trees below, but the structure must be correct.

Task 2 – Generate Trees (5 points)

Use a graphics package to create the eleven isomorphically unique evolutionary trees of 7 leaves (13 nodes). The following labels are the classifications for the unique trees: 1222222, 122224, 122242, 122422, 12244a, 12244b, 124222, 12424, 12442a, 12442b, and 1246.

Grading:

You will be graded on the following criteria:

Accuracy Correctly drawing the evolutionary trees, attention to detail.

Creativity Style, appearance, and consistency of the trees

Extra credit will be given for including the following:

· Create the sixteen 8-leaf trees: 12222222, 1222224, 1222242, 1222422, 1224222, 1242222, 124224, 124242, 124422a, 124422b, 12444aa, 12444ab, 12444ba, 12444bb, 12462a, 12462b, 1248.

· Create the 9 leaf (17 node) non-isomorphic trees and classifications.

· Create the trees without the redundant node and isomorphic approximation assumptions, allowing for any number of offspring from a node. This will exponentially increase the number of possible trees.

Graphs Diagram Drawing Guidelines

Creating well-drawn graphs is an art form in itself. A diagram is a visual representation, and care should be taken to make sure diagrams are organized, readable, and easily understood. When using a graphics package to draw graphs, it is imperative that students use a consistent and visually pleasing style. Sloppy graph diagrams stick out like a sore thumb and can completely ruin a professional presentation. Follow the below guidelines for drawing your graphs. See the figures for examples of following and violating these guidelines.

Vertices All vertex markers should be the same size and shape. If necessary, different markers can represent types of vertices.

Labels It should be clear which vertex the label references. Use a reasonable location scheme for the labels. Vertex labels should not cover any vertex, edge, or other label. A logical labeling sequence may help in understanding the graph structure.

Edges All edges should be straight lines of the same thickness and connect to the center of the vertex markers. Edge highlighting can be done with thicker or grayed lines. Planar graphs should be drawn without any crossing edges. Curved edges are generally avoided, but may be used for artistic effect.

Graph The graph shape should be pleasing to the eye. Use a regular polygon or other meaningful shape whenever possible. Physical distances between vertices should be relatively consistent. Use the alignment and distribution tool within the graphics package.

Assignment 3 – Evolution Models II (Incremental k-Leaf Root)

Dean Zeller Due: Wednesday, March 22nd by 9:00 pm

CS10051 10 points

Spring, 2006

Objective The student will build on assignment 2 to create diagrams of the incremental k-leaf root evolution model.

Level 1 – use of a graphics package and/or word processor required

Readings Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999.

Background: k-leaf root, cliques

As an alternate to phylogenies, evolution can be modeled by a series of incremental graphs called k-leaf roots that contain visual information indicating the distances between species. The tree leaves serve as the graph vertices. Edges within the graph indicate the distance between the two species is no more than a specified k-value in a corresponding tree. A clique is defined as a subset of vertices such that all members are connected to all other members within the clique. An incremental k-leaf root model is essentially a series of (possibly overlapping) cliques. In the diagrams below, the corresponding cliques are labeled below the graphs.

Instructions

Use a word processor to create a table similar to the example below.
Set your document Page Setup to Landscape to give more space widthwise.
The first cell in each row should contain the phylogenies from assignment 2.
The column header should contain increasing k-values from two (2) up to the number of leaves in the tree.
Within each row…
Draw the k-leaf root for increasing k-values. Start at k=2 and end when the graph is fully connected. Draw a left arrow (ß) for graphs that show no change from the previous k-value.
Indicate the cliques created within each distance graph.
Create a phylogeny with at least 10 leaves and the corresponding incremental k-leaf root evolution model.
Review the graphs drawn so far. Indicate which distance graphs you think contain the most useful information for geneticists.

Grading

You will be graded on the following criteria:

Accuracy Correctly drawing the evolutionary trees.

Creativity Style, appearance, and consistency of the graphs

Extra credit will be given for any of the following:

· Create more complex phylogenies for the report.

2 leaves (3 vertices)
phylogeny / k = 2 / k = 3 / k = 4 / k = 5
3 leaves (5 vertices)
phylogeny / k = 2 / k = 3 / k = 4 / k = 5
4 leaves (7 vertices)
phylogeny / k = 2 / k = 3 / k = 4 / k = 5
5 leaves (9 vertices)
phylogeny / k = 2 / k = 3 / k = 4 / k = 5

Assignment 4 – DNA Pattern Matching Statistics

Dean Zeller Due: ______

CS10061 10 points

Spring, 2007

Objective The student will create a Python program to calculate pattern matching statistics on DNA sequences.

Level 2 – understanding and use of the Python programming language

Background: Bioinformatics

DNA can be represented digitally by long strings of A, C, G, and T. In this assignment, you will develop a program to perform statistics on a given DNA sequence.

Definitions

Sequence A long string of characters.

DNA Sequence A long string of A’s, C’s, G’s, and T’s representing a DNA strand.

Pattern A short string of characters to be searched within a sequence

Concepts Introduced

Strings as character arrays

# can access each character in a string individually

name = "Dean Zeller"

print name[0], name[1], name[2], name[3]