Assignment #7

This is a regular assignment worth 105 points, plus 40+10 points of extra credit.

  1. (15 points) Introductory Perl Programming
    • Access Perl on UH Unix (Perl resource list).
    • If you like, download and install the free Perl language for your PC.
    • Try the Perl Tutorial.
    • The Unix operating system keeps track of many environment variables to make your life easier: paths to programs you use, etc. You can see the list by typing env at the Unix prompt. Write Perl functions (code) to access that information and print it in 3 different ways. Turn this part in: your code and script files of it working.
      a. Display the associative %ENV array.
      b. Display the associative %ENV array sorted alphabetically by key (variable name).
      c. Display the associative %ENV array sorted alphabetically by value (of the variables).

The %ENV hash already exists in Perl as a built-in variable. It holds the environment variables that the Perl process inherited from its parent process (generally the shell). Modifying this hash changes the environment of the running Perl process, and those changes are then inherited by any new processes it starts.

For example, to add a new path in front of the existing path and then delete the IPS variable:

$ENV{'PATH'} = "/new/path:$ENV{'PATH'}";

delete $ENV{'IPS'};
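
The three displays requested in part 1 differ only in the order in which the keys of the hash are visited. The following is a minimal sketch of that idea, not a complete solution. The helper name format_hash and its 'none', 'key', and 'value' ordering labels are hypothetical names chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# format_hash is a hypothetical helper: it returns one "KEY=value" line
# per entry of the hash, ordered as requested.
#   'none'  - whatever order Perl's hash iteration happens to yield
#   'key'   - sorted by key
#   'value' - sorted by value, ties broken by key
# Perl's default string sort is by character code, so upper case letters
# come before lower case ones.
sub format_hash {
    my ($hash, $order) = @_;
    my @keys = keys %$hash;
    if ($order eq 'key') {
        @keys = sort @keys;
    }
    elsif ($order eq 'value') {
        @keys = sort { $hash->{$a} cmp $hash->{$b} or $a cmp $b } @keys;
    }
    return map { "$_=$hash->{$_}" } @keys;
}

for my $order (qw(none key value)) {
    print "--- order: $order ---\n";
    print "$_\n" for format_hash(\%ENV, $order);
}
```

Passing a reference (\%ENV) rather than reading %ENV directly inside the helper makes it easy to try the sorting on a small hand-made hash first.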

Perl Resources

  • Perl resource list
  • associative arrays
  • string functions
  • processing every word in a file
  • string matching
  • translation,
  • substr function
  1. (45 points) Just Enough Unix, Chapter 35: Scripting with Perl, Exercises 1, 2, 6, 7, and 9 (repeated below), pp. 481-482.
  • 1. Write a Perl program to accomplish each of the following on the file solar_system.txt (see link below):
    a. Print all records that do not list a discoverer in the eighth field.
    b. Print every record after erasing the second field. Note: It would be clearer to say "print every record, omitting the second field."
    c. Print the records for satellites that have negative orbital periods. (A negative orbital period simply means that the satellite orbits in a counterclockwise direction.)
    d. Print the data for the objects discovered by the Voyager2 space probe.
    e. Print each record with the orbital period given in seconds rather than days.
  • 2. The periapsis is the point of closest approach between a satellite and the object around which it orbits; the apoapsis is the point of greatest separation between the two. The periapsis distance and the apoapsis distance can be computed from the formulas
    P = alpha (1 - epsilon)
    A = alpha (1 + epsilon)
    where alpha is the semimajor axis and epsilon is the eccentricity of the orbit. For a perfect circle, epsilon = 0; for an ellipse, 0 < epsilon < 1. Print each record from solar_system.txt with the values of P and A inserted between the orbital radius and the orbital period.

Use the data file solar_system.txt.
Each line of this file contains 9 fields, and the lines are in alphabetical order by the name of the planet or moon (first field). The first line is:
Adrastea XV Jupiter 129000 0.30 0.00 0.00 Jewitt 1979
The fields in this file are listed below. The text in [] is the corresponding field from the line above.

  1. Name of planet or moon [Adrastea]
  2. Number of moon or planet (roman numerals) [XV]
  3. Name of the object around which the satellite orbits [Jupiter]
  4. Orbital radius (semimajor axis) in kilometers [129000]
  5. Orbital period in days [0.30]
  6. Orbital inclination in degrees [0.00]
  7. Orbital eccentricity [0.00]
  8. Discoverer [Jewitt]
  9. Year of discovery [1979]
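
With the field layout above, the periapsis and apoapsis formulas from exercise 2 can be applied to a single record as in the sketch below. It assumes that fields are separated by whitespace and that every record has all nine fields, which the sample line suggests but the file may not guarantee (exercise 1a implies some records lack a discoverer). The helper name add_apsides is hypothetical.

```perl
use strict;
use warnings;

# add_apsides (hypothetical name) inserts the periapsis P and apoapsis A
# between the orbital radius (field 4) and the orbital period (field 5).
# Assumes whitespace-separated fields, as in the sample line.
sub add_apsides {
    my ($line) = @_;
    my @f = split ' ', $line;
    my ($alpha, $epsilon) = ($f[3], $f[6]);   # semimajor axis, eccentricity
    my $peri = $alpha * (1 - $epsilon);       # P = alpha (1 - epsilon)
    my $apo  = $alpha * (1 + $epsilon);       # A = alpha (1 + epsilon)
    splice @f, 4, 0, $peri, $apo;             # insert before the period
    return join ' ', @f;
}

print add_apsides('Adrastea XV Jupiter 129000 0.30 0.00 0.00 Jewitt 1979'), "\n";
```

Because Adrastea's eccentricity is listed as 0.00, both inserted values equal the orbital radius for this record. Note that array indices are zero-based, so field 4 is $f[3] and field 7 is $f[6].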
  • 3. Write a Perl script named test.regexp.pl to determine whether a given string matches a regular expression. If the user types the command line
    test.regexp.pl /Ma/ Mars
    the script should print
    Match: /Ma/ matches "Mars"
  • 4. The rand(x) function computes pseudo-random numbers uniformly distributed between 0 and x (between 0 and 1 if no argument is supplied). Write a subroutine rand2 that computes a pseudo-random number x evenly distributed between two limits low and high, according to the formula
    x = random_number_between_0_and_1 *(high - low) + low
  • 5. Write a Perl script mygrade.pl that computes a student's grade given a score between 0 and 100. Thus, the command line
    mygrade.pl 90.1
    should produce the output You have an A!

Account for the following cases:
100 < score ==> error message (scores may not exceed 100%)
90 <= score <= 100 ==> A
80 <= score < 90 ==> B
70 <= score < 80 ==> C
60 <= score < 70 ==> D
0 <= score < 60 ==> F
score < 0 ==> error message (scores may not be less than 0%)
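
The case table can be expressed as a chain of range checks, testing the two error cases first and then working down from the highest grade. This is a sketch only: the helper name letter_grade is hypothetical, and it treats a score of exactly 100 as an A, which is an assumption about the intended boundary.

```perl
use strict;
use warnings;

# letter_grade (hypothetical name) returns the letter for a score,
# or undef when the score is outside the range 0..100.
# Assumption: a score of exactly 100 counts as an A.
sub letter_grade {
    my ($score) = @_;
    return undef if $score < 0 || $score > 100;
    return 'A' if $score >= 90;
    return 'B' if $score >= 80;
    return 'C' if $score >= 70;
    return 'D' if $score >= 60;
    return 'F';
}

for my $score (90.1, 59.9, 101) {
    my $grade = letter_grade($score);
    print defined $grade ? "$score: $grade\n" : "$score: out of range\n";
}
```

Because each check returns as soon as it succeeds, only the lower bound of each band needs testing. A full mygrade.pl would also take the score from @ARGV, check that it is numeric, and print the two distinct error messages listed above.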

  1. (40 points) Perl in Bioinformatics - extra credit
    Write a Perl program that reads in a file of DNA codes, translates them to RNA, and writes the result to a file.

DNA is the basic code of life, determining what proteins are produced and what their sizes and shapes are. All life on Earth has DNA built from only 4 nucleotide bases: A, C, G, and T.

Proteins are produced through a complex chemical process that can be summarized as follows. The DNA is read and transcribed into messenger RNA (mRNA). The mRNA enters the ribosomes and binds to complementary transfer RNA (tRNA) sequences with attached amino acids (each tRNA molecule binds to three complementary bases, or one codon). These amino acids are then chemically joined (via dehydration or condensation synthesis) and the tRNA leaves. All of these steps are themselves catalyzed by already existing proteins functioning as enzymes. Of course, this summary omits some steps (e.g., mRNA splicing).

RNA is synthesized in the 5' → 3' direction (from the point of view of the growing RNA transcript). Only one of the two DNA strands is transcribed. This strand is called the template strand, because it provides the template for ordering the sequence of nucleotides in an RNA transcript. The other strand is called the coding strand, because its sequence is the same as the newly created RNA transcript (except that the RNA has uracil in place of thymine). The DNA template strand is read 3' → 5' by RNA polymerase and the new RNA strand is synthesized in the 5' → 3' direction. RNA polymerase binds to the 3' end of a gene (promoter) on the DNA template strand and travels toward the 5' end.

Algorithm:

First, convert each template DNA base to its RNA complement (note that the complement of A is now U), as shown below. Note that the template strand of the DNA is the one the RNA is polymerized against; the other DNA strand would be the same as the RNA, but with thymine instead of uracil.

DNA -> RNA
A -> U
T -> A
G -> C
C -> G

Then split the RNA into triplets (groups of three bases, also called codons). Note that there are 3 translation "windows" (reading frames), depending on where you start reading the code. Finally, use the table at Genetic code to translate each triplet into its amino acid and write the result as a structural formula as used in chemistry.

This will give you the primary structure of the protein.
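
The first two steps of the algorithm (complementing the template strand and splitting the RNA into triplets) can be sketched as follows. The helper names transcribe and codons are hypothetical, and the sketch assumes the template strand is given as a clean string of A, C, G, and T written in the order in which it is read. The codon-to-amino-acid lookup is left out.

```perl
use strict;
use warnings;

# transcribe (hypothetical name): convert a template-strand DNA string to
# RNA using the complement table above (A->U, T->A, G->C, C->G).
sub transcribe {
    my ($dna) = @_;
    (my $rna = uc $dna) =~ tr/ATGC/UACG/;
    return $rna;
}

# codons (hypothetical name): split an RNA string into complete triplets,
# starting at offset 0, 1, or 2 (the three reading windows).
# A trailing group of fewer than three bases is dropped.
sub codons {
    my ($rna, $offset) = @_;
    return substr($rna, $offset) =~ /(.{3})/g;
}

my $rna = transcribe('TACGGCATT');
print "RNA: $rna\n";
for my $offset (0 .. 2) {
    print "window $offset: @{[ codons($rna, $offset) ]}\n";
}
```

The tr operator maps each character in one pass, which avoids the mistake of replacing A with U and then having a later substitution touch the result.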

However, proteins tend to fold, depending in part on hydrophilic and hydrophobic segments along the chain. Secondary structure can often still be guessed at, but the proper tertiary structure is often very hard to determine.

This approach may not give the correct amino acid composition of the protein, particularly when unconventional amino acids such as selenocysteine are incorporated. Selenocysteine is coded for by a conventional stop codon in combination with a downstream hairpin (SElenoCysteine Insertion Sequence, or SECIS).

FYI: Scientists at the Scripps Research Institute are attempting to find out what life would look like if DNA contained more than four nucleotide bases and proteins more than 20 amino acids. By reengineering DNA, RNA, and the proteins that interact with them, they hope to create synthetic organisms with a chemical makeup fundamentally different from all life that has existed on Earth for the last 3.8 billion years. If they succeed, their biochemical reengineering could have a profound effect on everything from basic molecular biology to industrial chemistry.

Perl code: (there is also Ruby code on the web)

Notes: Your program must do error checking: is the input in the correct format? Is it DNA, RNA, or amino acids? [How can you tell?] If the file is not valid, report that to the user. Ignore blank spaces, tabs, newlines, etc., which are usually added for human readability. If the first line is a description of the data, ignore it for error checking purposes. Copy the description to the output and indicate how the file was modified when it is translated.
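
One possible starting point for the format check is to strip the whitespace and see which alphabet the remaining letters fit. The sketch below is a heuristic under stated assumptions: the helper name classify_sequence is hypothetical, amino acids are assumed to be written in the 20 standard one-letter codes, and any description line is assumed to have been removed already.

```perl
use strict;
use warnings;

# classify_sequence (hypothetical name) strips all whitespace, then
# reports 'DNA', 'RNA', 'protein', or 'invalid' based only on which
# letters occur. Short sequences can be ambiguous (for example, "ACG"
# fits all three alphabets), so the checks run from the smallest
# alphabet to the largest and the first fit wins.
sub classify_sequence {
    my ($text) = @_;
    (my $seq = uc $text) =~ s/\s+//g;
    return 'invalid' if $seq eq '';
    return 'DNA'     if $seq =~ /^[ACGT]+$/;
    return 'RNA'     if $seq =~ /^[ACGU]+$/;
    return 'protein' if $seq =~ /^[ACDEFGHIKLMNPQRSTVWY]+$/;
    return 'invalid';
}

for my $sample ("TAC GGC\nATT", "AUG CCG", "MPK", "AUGT") {
    (my $shown = $sample) =~ s/\s+/ /g;
    print "$shown: ", classify_sequence($sample), "\n";
}
```

A sequence containing both T and U is reported as invalid, since it fits none of the three alphabets.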

Example: DNA and RNA to amino acid example

Bioinformatics Resources

  • ASCII version of table of codes DNA <-> Amino acids.
  • Tables of amino acid and protein codes.
  • A sample DNA source file.
  1. (20 points) - Text manipulation.
    Separate, count, and sort the words in medium-sized text files. Sort in the following orders:
    a. alphabetically (ignoring capitalization),
    b. alphabetically, with upper case words just in front of lower case words with the same initial characters,
    c. by frequency, from high to low (any order for equal frequency), and
    d. by frequency, with alphabetical order for words with the same frequency.

Your output should be nicely lined up in columns. Your program should work on moderately sized text files, like the ones below.
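
As an illustration of the last ordering (by frequency, then alphabetically), the sketch below counts words in a string and prints them in two aligned columns. The helper names count_words and by_frequency are hypothetical, and the definition of a word as a run of letters and apostrophes is an assumption, since the assignment leaves it open. Because this sketch folds everything to lower case, it does not cover the ordering that distinguishes upper case from lower case words.

```perl
use strict;
use warnings;

# count_words (hypothetical name): count words case-insensitively.
# Assumption: a "word" is a run of letters and apostrophes.
sub count_words {
    my ($text) = @_;
    my %count;
    $count{ lc($1) }++ while $text =~ /([A-Za-z']+)/g;
    return \%count;
}

# by_frequency (hypothetical name): words ordered by count, high to low,
# with alphabetical order among words that have the same count.
sub by_frequency {
    my ($count) = @_;
    return sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
}

my $count = count_words("The cat and the hat and THE bat");
printf "%-10s %5d\n", $_, $count->{$_} for by_frequency($count);
```

The printf format pads each word to 10 characters and right-aligns the count in 5, which keeps the columns lined up as long as no word is longer than the chosen width.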

  • Electricity : electricity.txt
    Example partial output for the electricity file electricity.out.txt
  • The GNU manifesto:
  • My Lisp Experiences and the Development of GNU Emacs (by Richard Stallman at ILC 2002)
  1. (10 points) - OS file conversion - extra credit
    Create Perl programs that will convert a text file stored as Unix, Mac, or DOS/Win, into any of the other formats. The file's name should be entered on the command line. Save the old version of the file under another name.
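
The core of the conversion is rewriting the line terminator: Unix uses LF, DOS/Win uses CR LF, and classic Mac OS uses CR. A minimal sketch of that step on an in-memory string follows. The helper name convert_eol and the format labels 'unix', 'mac', and 'dos' are hypothetical; reading the file, backing it up, and writing the result are left out.

```perl
use strict;
use warnings;

# Line terminators written as octal escapes so they mean the same bytes
# on every platform: LF = \012, CR = \015.
my %EOL = (unix => "\012", mac => "\015", dos => "\015\012");

# convert_eol (hypothetical name) rewrites every line ending in $text,
# whatever its current style, to the style named by $target.
sub convert_eol {
    my ($text, $target) = @_;
    my $eol = $EOL{$target} or die "unknown format: $target\n";
    # CR LF is listed first so that it is replaced as a single ending.
    $text =~ s/\015\012|\015|\012/$eol/g;
    return $text;
}

my $sample = "one\015\012two\015three\012";
my $unix   = convert_eol($sample, 'unix');
print length($sample), " bytes in, ", length($unix), " bytes out\n";
```

When this is applied to real files, calling binmode on both the input and output filehandles keeps Perl from translating line endings on its own on some platforms.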
  1. (25 points) - Perl quick reference.
    Create a 2 page quick reference to Perl in a format similar to the Emacs/vi/pico quick reference.

**** DO THIS PART AS YOU ARE WORKING ON THE REST 