Perl Laboratory Study Guide – Section I

1. Getting started

· Everyone should use ssh2 to connect to Watson, at: watson.ecs.baylor.edu

· Everyone is sharing a user space, so create a subdirectory using your name. The remainder of this work should be done within your subdirectory.

· To make sure that everything is working, write a simple perl script. Your script may be named anything you wish, but must have the .pl extension: for example, test.pl. Watson has vi, pico, and emacs editors. If you run the script with the ‘perl’ executable, there is no need to make the script itself executable.

#!/usr/bin/perl -w

$DNA = "ATCGATGA";

print $DNA, "\n";

exit;

· Perl compiles at run time. At the command prompt, run the program by typing: perl test.pl

· The result should be: ATCGATGA

Question: Is the command ‘perl’ required to run this program? Also, is the first line of the script, ‘#!/usr/bin/perl’, required?

2. Getting help

· Perl comes with many built-in help pages. In order to get familiar with these vast resources, explore the man pages. The main perl man page has a list of some helpful links.

o man perl

o man perlintro

o man perltoc

· In addition, perl comes with many built-in functions and the help pages that describe them. Try finding some common functions.

o perldoc -f reverse

o perldoc -f push

o perldoc -f shift
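Once you have read those three help pages, a short script is a quick way to check your understanding. The sketch below is only an illustration; the array contents and variable names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my @bases = ('A', 'T', 'C');

push @bases, 'G';               # append to the end: A T C G
my $first = shift @bases;       # remove from the front: $first is 'A'

# reverse in list context reverses the order of the elements.
my @flipped = reverse @bases;   # G C T

# reverse in scalar context reverses the characters of a string.
my $backwards = reverse 'ATCGATGA';

print "first: $first\n";
print "flipped: @flipped\n";
print "backwards: $backwards\n";
```

Note that reverse behaves differently depending on whether its result is assigned to an array or to a scalar.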

3. Getting a FASTA reference file

· Because we are going to use FASTA files as practice sets for opening and writing to files, we need to get a test file. The easiest way to do this is to point your browser to the NCBI homepage: http://www.ncbi.nlm.nih.gov/

· Search Entrez for your favorite gene. (I have many favorites; if you can’t think of one, try prkr or cos1.)

· On the results page, follow the link for the protein database. If one doesn’t exist, pick another gene.

· Using the drop-down menu, display the file in FASTA format and save it to your [yourname]/ directory. We will use this file to test our perl scripts.

· Repeat this process using a nucleotide file.

4. Printing the contents of a file

· The object of this section is to use perl to output the contents of a file to the screen using several different approaches. In each case, your script should open a filename given at a prompt and should include error catching. Save each step as a separate file under [yourname]/. Name each file appropriately: ex4-1.pl, ex4-2.pl, ex4-3.pl, etc.

4-1. Use a while (<FH>) loop. The output should reproduce the file line by line, for example:

> protein name | number | length

ACHYTCAHCYACHSGCETYAGCYSTGCA

ACTGACTACSHACSYFLASCHUICECIQUH
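One possible shape for the read loop and its error catching is sketched below. It is not the only way to do 4-1. The sub name print_file and the temporary file are assumptions made so the example runs on its own; in your script the name would come from a prompt, for example: chomp(my $filename = <STDIN>);

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Print every line of a file; die with the system error if it cannot be opened.
sub print_file {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open '$filename': $!\n";
    my $count = 0;
    while (my $line = <$fh>) {   # reads one line per pass, stops at end of file
        print $line;
        ++$count;
    }
    close($fh);
    return $count;               # number of lines printed
}

# Hypothetical stand-in for a downloaded FASTA file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp ">protein name | number | length\nACHYTCAHCY\nACTGACTACS\n";
close($tmp);

my $lines = print_file($path);
```

The sketch uses a lexical filehandle ($fh) and the three-argument form of open, which is a common modern habit; a bareword handle such as FH works the same way inside the while loop.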

4-2. Use an array to produce the same results as 4-1.

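4-3. Use an array and concatenate every line into one single line, removing all end-of-line characters and white space. This line might come in handy: $seq =~ s/\s//g; (The \s class already matches spaces, tabs, \r, and \n, so they do not need to be listed separately.) The output should look like: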

>proteinname|number|lengthACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYFLASCHUICECIQUH
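The joining and cleaning step can be tried on its own before you wire it to a file. This is a minimal sketch, assuming the lines have already been read into an array; the sub name join_lines and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Join a list of lines into one string and strip all whitespace.
sub join_lines {
    my @lines = @_;
    my $seq = join('', @lines);
    $seq =~ s/\s//g;    # \s matches spaces, tabs, \r and \n
    return $seq;
}

# Hypothetical lines, as they might come from: my @lines = <$fh>;
my @lines = (
    ">protein name | number | length\n",
    "ACHYTCAHCY\r\n",
    "ACTGACTACS\n",
);

print join_lines(@lines), "\n";
```

The second sample line ends in \r\n on purpose, to show that Windows-style line endings are removed as well.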

4-4. Modify 4-3 to (1) take the file name directly from the command line and (2) create a single line that does not include the FASTA header line. This method usually requires that you know something about regular expressions: $seq =~ m/text/i is an example of a pattern match that uses a regular expression. Because we know that by definition the first line of a FASTA file must begin with a ‘>’, we can write a regular expression that will skip this line: if ($seq !~ /^>/) {…}

ACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYFLASCHUICECIQUH
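The header test can be sketched as follows. This version skips the header with next if, which is equivalent to wrapping the rest of the loop in if ($line !~ /^>/). The sub name read_sequence and the in-memory sample data are assumptions for illustration only.

```perl
use strict;
use warnings;

# Read a FASTA filehandle and return the sequence as one string,
# skipping any header line (a line that starts with '>').
sub read_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # skip the FASTA header
        $line =~ s/\s//g;        # drop newlines and other whitespace
        $seq .= $line;
    }
    return $seq;
}

# In the exercise the name would come from the command line, for example:
#   my $filename = shift @ARGV or die "Usage: $0 file.fasta\n";
# Here an in-memory string stands in for the file (hypothetical data).
my $fasta = ">protein name | number | length\nACHYTCAHCY\nACTGACTACS\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!\n";
my $sequence = read_sequence($fh);
close($fh);

print $sequence, "\n";
```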

5. Determining the frequency of nucleotides

· To get more comfortable with perl data structures, it helps to write a few short scripts that count bases or amino acids. Create a script named ex5-1.pl that accomplishes the same task as the script written for exercise 4-4. If everything is working correctly, you will type at the prompt: perl ex5-1.pl filename.fasta and the result will be: ATCGATCGATCAGTCGATCGATGCATCGATCGCTGATGATCGTCGATCGATCGATCGATCGTACGATCGATCGATCGATCGATCGATCGCTGACTGATAGCTACGTACGATGACGT

· Now, we’re going to alter ex5-1.pl to count the numbers of A’s, G’s, C’s, and T’s. First, incorporate a command that splits this single string into a large array with one character per element; for example, @dna = split('', $string);

· Make sure your program initializes its counters.

$A_Number = 0;

$C_Number = 0;

$G_Number = 0;

$T_Number = 0;

$Error = 0;

· Loop through the bases, keeping count of the appropriate number of nucleotides.

foreach $base (@dna) {

    if ($base eq 'A') { ++$A_Number; }

    elsif ($base eq 'C') { ++$C_Number; }

    elsif ($base eq 'G') { ++$G_Number; }

    elsif ($base eq 'T') { ++$T_Number; }

    else {

        print "Error: I don't recognize the base: $base\n";

        ++$Error;

    }

}
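For comparison, the same split-and-count idea can be written with a hash instead of five separate scalar counters. This is an alternative technique, not the approach the exercise asks for, and the sub name count_bases and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Count bases in a string; returns a hash keyed by A, C, G, T and Error.
sub count_bases {
    my ($string) = @_;
    my %count = (A => 0, C => 0, G => 0, T => 0, Error => 0);
    my @dna = split('', $string);          # one element per character
    foreach my $base (@dna) {
        if ($base =~ /^[ACGT]$/) { ++$count{$base}; }
        else                     { ++$count{Error}; }
    }
    return %count;
}

my %count = count_bases('ATCGATGA');       # hypothetical sample sequence
foreach my $key (qw(A C G T Error)) {
    print "$key: $count{$key}\n";
}
```

The hash keeps all the counters in one variable, at the cost of being a little less explicit than the if/elsif chain.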

· Perl has many built-in shortcuts that make this easier to write, but harder to read at first. For example, in the first line above, the loop assigns each element of @dna to the temporary variable $base, but only because I named that variable. If I left $base out (foreach (@dna) {…}), perl would assign each element to the default variable $_ instead. The first test inside the loop would then read: if ($_ eq 'A') { ++$A_Number; }

· Another shortcut is that a pattern match can stand in for an equality test, and that a match applies to $_ by default. Instead of asking whether $_ is eq to 'A', we could ask whether the pattern A is found in the string: if ($_ =~ m/A/) {…}. Because each element here is a single character, the two tests give the same answer. And because each element is already assigned to $_ when no variable is named in the foreach line, we can drop $_ from the test entirely: if (/A/) {…}. This may be a bit confusing at first, so rewrite ex5-1.pl as ex5-2.pl using this shorthand approach.

· Next, we are going to count the bases without looping through an array, i.e. keeping the sequence as a string (here held in $dna). Use this method for script ex5-3.pl
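The two styles can be compared side by side. The sketch below counts only A’s, using a made-up sample sequence, so that you can see that the explicit and implicit forms agree.

```perl
use strict;
use warnings;

my @dna = split('', 'ATCGATGA');   # hypothetical sample sequence
my ($explicit, $implicit) = (0, 0);

# Explicit loop variable and string comparison.
foreach my $base (@dna) {
    if ($base eq 'A') { ++$explicit; }
}

# No loop variable: each element is aliased to $_,
# and a bare /A/ is matched against $_.
foreach (@dna) {
    if (/A/) { ++$implicit; }
}

print "explicit: $explicit, implicit: $implicit\n";
```

The two counts agree only because every element is a single character; on longer strings /A/ would match any string that contains an A anywhere.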

for ($position = 0; $position < length($dna); ++$position) {

    $base = substr($dna, $position, 1);

    if ($base =~ /a/i) { ++$A_Number; }

    elsif ($base =~ /g/i) { ++$G_Number; }

    elsif ($base =~ /c/i) { ++$C_Number; }

    elsif ($base =~ /t/i) { ++$T_Number; }

    else { ++$Error; }

}

· Finally, we are going to count the number of A’s, G’s, C’s, and T’s in the DNA string using the transliteration operator. Remember from lecture: $DNA =~ tr/AGCT/TCGA/ replaces each base with its complement. Also, tr/// is the same as y///. The operator returns the number of characters it matched, and when the replacement list is empty the string is left unchanged, so assigning the result to a scalar gives the number of occurrences of any character. Create a script, ex5-4.pl, that uses this approach to count and display the number of A’s, C’s, G’s, and T’s in your DNA sequence. For example:

$A_Number = $DNA =~ y/A//;
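A short sketch shows both uses of the operator on a made-up sequence: counting with an empty replacement list, and complementing with a full one. The variable names $GC_Number and $complement are introduced here for illustration only.

```perl
use strict;
use warnings;

my $DNA = 'ATCGATGA';   # hypothetical sample sequence

# With an empty replacement list and no /d modifier, tr/// changes nothing
# and simply returns how many characters it matched.
my $A_Number  = ($DNA =~ tr/A//);
my $GC_Number = ($DNA =~ tr/GC//);   # counts G's and C's together

# With a replacement list, each character is mapped to its partner.
# Copy first so that $DNA itself is not modified.
(my $complement = $DNA) =~ tr/AGCT/TCGA/;

print "A: $A_Number, G+C: $GC_Number\n";
print "original:   $DNA\n";
print "complement: $complement\n";
```

The parentheses around the match are optional, because =~ binds more tightly than =, but they make the order of operations easier to see.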

6. Writing out to files

· In this section you will learn to write text to a file. First, copy ex5-4.pl to ex6-1.pl

· Add a line that takes an output filename from the command line. For example, the command line should be something like: perl ex6-1.pl infile.fasta outfile.txt

· At the end of the script, add a couple of lines that open, and write to, a results file. Below is an example of what writing to a file might look like. Notice that the outfile name is preceded by ‘>’, which opens the file for writing: the file is created if it does not exist and overwritten if it does.

open(RESULTFILE, ">$outfile") or die("Error: $!");

print RESULTFILE "The results are overwriting everything that existed in $outfile\n";

close RESULTFILE;
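To see the overwrite behaviour for yourself, you can write to the same file twice and read it back. The sketch below is one way to do that, using the three-argument form of open and a lexical filehandle. The sub name write_results, the temporary directory, and the result text are all assumptions for illustration; in ex6-1.pl the output name comes from the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write one line of text to $outfile, replacing anything already there.
sub write_results {
    my ($outfile, $text) = @_;
    open(my $out, '>', $outfile) or die "Error opening '$outfile': $!\n";
    print {$out} $text, "\n";
    close($out) or die "Error closing '$outfile': $!\n";
}

# Hypothetical output location, removed automatically when the script ends.
my $dir = tempdir(CLEANUP => 1);
my $outfile = "$dir/outfile.txt";

write_results($outfile, 'first run');
write_results($outfile, 'A: 3 C: 1 G: 2 T: 2');   # replaces the first line

# Read the file back to confirm what it now holds.
open(my $in, '<', $outfile) or die "Error opening '$outfile': $!\n";
my @contents = <$in>;
close($in);

print @contents;
```

Only the text from the second call remains in the file, because ‘>’ truncates the file each time it is opened.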

· Use this opportunity to explore some of perl’s special variables.

o What does the variable $0 hold?

o Print out the contents of @ARGV