Perl Laboratory Study Guide – Section I
1. Getting started
· Everyone should use ssh2 to connect to Watson, at: watson.ecs.baylor.edu
· Everyone is sharing a user space, so create a subdirectory using your name. The remainder of this work should be done within your subdirectory.
· To make sure that everything is working, write a simple perl script. Your script may be named anything you wish, but must have the .pl extension: for example, test.pl. Watson has vi, pico, and emacs editors. If you run the script through the ‘perl’ executable (perl test.pl), there is no need to make your program executable.
#!/usr/bin/perl -w
$DNA = "ATCGATGA";
print $DNA, "\n";
exit;
· Perl compiles the script each time it is run, so there is no separate compile step. At the command prompt, run the program by typing: perl test.pl
· The result should be: ATCGATGA
Question: Is the command ‘perl’ required to run this program? Also, is the first line of the script, ‘#!/usr/bin/perl’, required?
2. Getting help
· Perl comes with many built-in help pages. In order to get familiar with these vast resources, explore the man pages. The main perl man page has a list of some helpful links.
o man perl
o man perlintro
o man perltoc
· In addition, perl comes with many built-in functions and the help pages that describe them. Try finding some common functions.
o perldoc -f reverse
o perldoc -f push
o perldoc -f shift
3. Getting a FASTA reference file
· Because we are going to use FASTA files as practice sets for opening and writing to files, we need to get a test file. The easiest way to do this is to point your browser to the NCBI homepage: http://www.ncbi.nlm.nih.gov/
· Search Entrez for your favorite gene. (I have many favorites; if you can’t think of one, try prkr or cos1.)
· On the results page, follow the link for the protein database. If one doesn’t exist, pick another gene.
· Using the drop-down menu, display the file in FASTA format and save it to your [yourname]/ directory. We will use this file to test our perl scripts.
· Repeat this process using a nucleotide file.
4. Printing the contents of a file
· The object of this section is to use perl to output the contents of a file to the screen using several different approaches. In each case, your script should open a filename given at a prompt and should include error catching. Save each step as a separate file under [yourname]/. Name each file appropriately: ex4-1.pl, ex4-2.pl, ex4-3.pl, etc.
4-1. Use a while (<FH>) loop. The output should match the contents of the file, for example:
> protein name | number | length
ACHYTCAHCYACHSGCETYAGCYSTGCA
ACTGACTACSHACSYFLASCHUICECIQUH
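· The while (<FH>) approach with error catching might be sketched as follows. This is only one possible layout, not the required solution: the helper name print_file and the error message are assumptions made for illustration, and an in-memory string stands in for the FASTA file so the sketch runs without a prompt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# print_file (hypothetical helper): print every line of a file to the screen.
# Returns the number of lines printed; dies if the file cannot be opened.
sub print_file {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open '$filename': $!\n";
    my $count = 0;
    while (my $line = <$fh>) {   # reads one line per pass until end of file
        print $line;
        ++$count;
    }
    close($fh);
    return $count;
}

# In the lab script the name would come from a prompt, for example:
#   print "File name: ";
#   chomp(my $filename = <STDIN>);
# Here a reference to a string is opened as an in-memory file instead.
my $fasta = ">protein name | number | length\n"
          . "ACHYTCAHCYACHSGCETYAGCYSTGCA\n"
          . "ACTGACTACSHACSYFLASCHUICECIQUH\n";
print_file(\$fasta);
```

The "or die" clause is the error catching: if open fails, $! holds the system's reason (for example, a missing file).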
4-2. Use an array to produce the same results as 4-1.
4-3. Read the file into an array, then concatenate every line into one single line, removing all end-of-line characters and whitespace. This line might come in handy: $seq =~ s/\s//g; (the \s class already matches spaces, tabs, \r, and \n). The result should look like:
>proteinname|number|lengthACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYFLASCHUICECIQUH
4-4. Modify 4-3 to (1) take the file name directly from the command line and (2) create a single line that does not include the FASTA header line. This method usually requires that you know something about regular expressions: $seq =~ m/text/i is an example of a regular expression. Because we know that by definition the first line of a FASTA format must include a ‘>’, we can write a regular expression that will skip this line: if ($seq !~ /^>/) {…}
ACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYFLASCHUICECIQUH
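· The array approach of 4-2 through 4-4 could be sketched like this. The helper name sequence_from_lines is an assumption made for illustration, and the array is filled by hand here; in the lab script it would be filled with @lines = <FH> after opening the file named in $ARGV[0].

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sequence_from_lines (hypothetical helper): join all non-header lines
# into one string and strip every whitespace character.
sub sequence_from_lines {
    my @lines = @_;
    my $seq = '';
    foreach my $line (@lines) {
        next if $line =~ /^>/;   # FASTA header lines start with '>'
        $seq .= $line;           # concatenate the sequence lines
    }
    $seq =~ s/\s//g;             # removes spaces, tabs, \r and \n
    return $seq;
}

# Stand-in for the contents of a FASTA file read with: my @lines = <FH>;
my @lines = (
    ">protein name | number | length\n",
    "ACHYTCAHCYACHSGCETYAGCYSTGCA\r\n",
    "ACTGACTACSHACSYFLASCHUICECIQUH\n",
);
print sequence_from_lines(@lines), "\n";
```

Removing the "next if" line gives the 4-3 behavior, where the header is kept and squeezed into the same single line.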
5. Determining the frequency of nucleotides
· In order to get more comfortable using perl data structures, writing a few short scripts that count bases or amino acids is important. Create a script named ex5-1.pl that accomplishes the same task as the script written for example 4-4. If everything is working correctly, you will type at the prompt: perl ex5-1.pl filename.fasta and the result will be: ATCGATCGATCAGTCGATCGATGCATCGATCGCTGATGATCGTCGATCGATCGATCGATCGTACGATCGATCGATCGATCGATCGATCGCTGACTGATAGCTACGTACGATGACGT
· Now, we’re going to alter ex5-1.pl to count the numbers of A’s, G’s, C’s, and T’s. First, incorporate a command that splits this single string into a large array of single characters; for example, @dna = split('', $string);
· Make sure your program initializes a counter for each base, plus one for unrecognized characters.
$A_Number = 0;
$C_Number = 0;
$G_Number = 0;
$T_Number = 0;
$Error = 0;
· Loop through the bases, keeping count of the appropriate number of nucleotides.
foreach $base (@dna) {
    if    ($base eq 'A') { ++$A_Number; }
    elsif ($base eq 'C') { ++$C_Number; }
    elsif ($base eq 'G') { ++$G_Number; }
    elsif ($base eq 'T') { ++$T_Number; }
    else {
        print "Error: I don't recognize the base $base\n";
        ++$Error;
    }
}
· Perl has many built-in shortcuts that make code shorter to write but sometimes harder to read. For example, the foreach line above assigns each element of @dna to the variable $base, but only because the variable is named explicitly. If $base were left out, perl would assign each element to the default variable $_ instead. The foreach line would then read: foreach (@dna) { and the first test would read: if ($_ eq 'A') { ++$A_Number; }
· Another shortcut is that a pattern match operates on $_ by default. Instead of asking whether $_ is eq to 'A', we could ask whether the pattern A is found in the string: if ($_ =~ m/A/) {…}. Because each element of @dna is a single character, the two tests give the same answer here. And because foreach places each element in $_ when no loop variable is named, the binding can be left out entirely: if (/A/) {…}. This may be a bit confusing at first, so practice it: re-write ex5-1.pl as ex5-2.pl, using this shorthand approach.
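· The $_ shorthand could look something like the sketch below. Two things here are choices of this sketch rather than part of the exercise: the helper name count_with_default_var, and the use of a hash in place of the five separate counter variables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_with_default_var (hypothetical helper): count bases using the
# default variable $_ and bare pattern matches.
sub count_with_default_var {
    my ($string) = @_;
    my %count = (A => 0, C => 0, G => 0, T => 0, Error => 0);
    foreach (split('', $string)) {      # no loop variable, so each base lands in $_
        if    (/A/) { ++$count{A}; }    # same as: if ($_ =~ m/A/)
        elsif (/C/) { ++$count{C}; }
        elsif (/G/) { ++$count{G}; }
        elsif (/T/) { ++$count{T}; }
        else        { ++$count{Error}; }
    }
    return %count;
}

my %c = count_with_default_var('ATCGATGA');
print "A=$c{A} C=$c{C} G=$c{G} T=$c{T} Error=$c{Error}\n";
# prints: A=3 C=1 G=2 T=2 Error=0
```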
· We are going to count the number of bases without looping through an array, i.e. keeping the sequence as a string. Use this method for script ex5-3.pl.
for ($position = 0; $position < length($dna); ++$position) {
    $base = substr($dna, $position, 1);
    # $base is a single character, so one case-insensitive test per base is enough
    if    ($base =~ /a/i) { ++$A_Number; }
    elsif ($base =~ /g/i) { ++$G_Number; }
    elsif ($base =~ /c/i) { ++$C_Number; }
    elsif ($base =~ /t/i) { ++$T_Number; }
    else                  { ++$Error; }
}
· Finally, we are going to count the number of A’s, G’s, C’s, and T’s in the DNA string using the transliteration operator. Remember from lecture: $DNA =~ tr/AGCT/TCGA/; Also, tr/// is the same as y///. The operator returns the number of characters it matched, so assigning its result to a scalar gives the number of occurrences of any character. When the replacement list is empty, the string itself is left unchanged. Create a script, ex5-4.pl, that uses this approach to count and display the number of A’s, C’s, G’s, and T’s in your DNA sequence. For example:
$A_Number = $DNA =~ y/A//;
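· Extended to all four bases, the counting idea might be sketched as below. The helper name tr_counts is an assumption made for illustration. The search list of tr/// is fixed when the script is compiled and does not interpolate variables, which is why each base gets its own statement. The /c modifier, which counts characters not in the list, is an addition of this sketch for catching unrecognized characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tr_counts (hypothetical helper): return the counts of A, C, G, T and
# of all other characters, in that order. The string is not modified
# because every replacement list is empty.
sub tr_counts {
    my ($dna) = @_;
    my $A_Number = ($dna =~ tr/A//);
    my $C_Number = ($dna =~ tr/C//);
    my $G_Number = ($dna =~ tr/G//);
    my $T_Number = ($dna =~ tr/T//);
    my $Error    = ($dna =~ tr/ACGT//c);   # /c counts everything NOT in the list
    return ($A_Number, $C_Number, $G_Number, $T_Number, $Error);
}

my ($A_Number, $C_Number, $G_Number, $T_Number, $Error) = tr_counts('ATCGATGA');
print "A=$A_Number C=$C_Number G=$G_Number T=$T_Number Error=$Error\n";
# prints: A=3 C=1 G=2 T=2 Error=0
```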
6. Writing out to files
· In this section you will learn to write text to a file. First, copy ex5-4.pl to ex6-1.pl
· Add a line that takes an output filename from the command line. For example, the command line should be something like: perl ex6-1.pl infile.fasta outfile.txt
· At the end of the script, add a couple of lines that open, and write to, a results file. Below is an example of what writing to a file might look like. Notice that the outfile name is preceded by ‘>’, which opens the file for writing: the file is created if it does not exist, and any existing contents are overwritten.
open(RESULTFILE, ">$outfile") or die("Error: $!");
print RESULTFILE "The results are overwriting everything that existed in $outfile\n";
close RESULTFILE;
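· The same write step can also be sketched with a lexical filehandle and the three-argument form of open, which keeps the mode separate from the file name. The helper name write_results and the sample counts are assumptions made for illustration, and a temporary file stands in for the name that ex6-1.pl would take from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# write_results (hypothetical helper): overwrite $outfile with the given
# lines and return how many lines were written.
sub write_results {
    my ($outfile, @lines) = @_;
    open(my $out, '>', $outfile) or die "Error opening '$outfile': $!\n";
    print {$out} @lines;
    close($out) or die "Error closing '$outfile': $!\n";
    return scalar @lines;
}

# In ex6-1.pl the name would come from the command line, e.g. $ARGV[1].
# A temporary file is used here so the sketch can run anywhere.
my ($tmp_fh, $outfile) = tempfile(UNLINK => 1);
close($tmp_fh);

write_results($outfile, "A: 3\n", "C: 1\n", "G: 2\n", "T: 2\n");

# Read the file back to confirm what was written.
open(my $in, '<', $outfile) or die "Error opening '$outfile': $!\n";
while (my $line = <$in>) {
    print $line;
}
close($in);
```

Calling write_results a second time on the same name replaces the earlier contents, which is the overwriting behavior of ‘>’ described above.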
· Use this opportunity to explore some of perl’s special variables.
o What does the variable $0 hold?
o Print out the contents of @ARGV