Basics of Perl Regular Expressions (“regexp”)
Jon Radoff // Biophysics 101, Fall 2002
A simple use of a regular expression:
$_ = "this is a test";
if (/est/)
{
    print "Match!\n";
}
In the above code, /est/ is the regular expression. It succeeds because est is a substring of this is a test. A regular expression may also contain “meta-characters” that let you specify special rules about how to match.
Meta-characters:
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
The most common meta-character in regular expressions is ., which matches any single character except a newline. For example, if you used /te.t/ as the regular expression in the above code, it would succeed, because the s in test is matched by the dot. /foo|test/ would also succeed, because | (read as “or”) matches any string that contains either foo or test. The [] operator lets you check for any one of a class of characters. For example, if you wanted to see if a codon contained AGU or AGC, you could use either /AG[UC]/ or /AGU|AGC/.
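The meta-characters described above can be tried out with a short script. This is a minimal sketch: the variable names and the list of codons are made up for illustration, and it uses the =~ operator, which applies a pattern to a named variable instead of $_.

```perl
use strict;
use warnings;

# Hypothetical test string; the patterns are the ones discussed above.
my $text = "this is a test";

my $dot_match    = ($text =~ /te.t/)     ? 1 : 0;  # 1: . matches the "s"
my $either_match = ($text =~ /foo|test/) ? 1 : 0;  # 1: "test" is present
my $start_match  = ($text =~ /^this/)    ? 1 : 0;  # 1: string begins with "this"
my $end_match    = ($text =~ /is$/)      ? 1 : 0;  # 0: string ends with "test"

# The character class and the alternation pick out the same codons.
my @codons     = ("AGU", "AGC", "AGA");
my @class_hits = grep { /AG[UC]/ } @codons;
my @alt_hits   = grep { /AGU|AGC/ } @codons;

print "dot=$dot_match either=$either_match start=$start_match end=$end_match\n";
print "class: @class_hits\n";
print "alternation: @alt_hits\n";
```

Both codon checks keep AGU and AGC and reject AGA; the character class is simply the shorter way to write it when the alternatives differ by one character.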
“Quantifiers” may be added to the regular expression to control how many times the preceding character, class, or group must occur.
Quantifiers:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Examples: /this.*test/ would succeed for any string containing this and, later, test, separated by any number of arbitrary characters (including none). /thi+s/ would succeed for a string containing th followed by one or more i characters followed by s.
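To compare the quantifiers side by side, the sketch below runs several patterns over the same small word list. The word list and variable names are invented for this example; the ^ and $ anchors are added to some patterns so that the whole word must fit.

```perl
use strict;
use warnings;

# Hypothetical words with zero, one, and three i characters.
my @words = ("ths", "this", "thiiis");

my @plus_hits  = grep { /thi+s/ } @words;         # one or more i
my @star_hits  = grep { /thi*s/ } @words;         # zero or more i
my @opt_hits   = grep { /^thi?s$/ } @words;       # zero or one i
my @range_hits = grep { /^thi{2,3}s$/ } @words;   # two or three i

# .* allows anything between the two words, but the order still matters.
my $in_order  = ("this is a test" =~ /this.*test/) ? 1 : 0;
my $reversed  = ("a test of this" =~ /this.*test/) ? 1 : 0;

print "+     : @plus_hits\n";
print "*     : @star_hits\n";
print "?     : @opt_hits\n";
print "{2,3} : @range_hits\n";
print "in order: $in_order, reversed: $reversed\n";
```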
Modifiers are appended to the end of a regular expression and apply special rules to your entire expression.
Modifiers:
i Do case-insensitive pattern matching.
g Match globally (in substitutions, replace every occurrence instead of only the first – see below)
m Treat string as multiple lines; ^ and $ match at the start and end of each line
s Treat string as single line; i.e., let . match newlines too
x Allow whitespace and comments in your regular expression
Example: /[acgt]+/i checks if a string contains at least one valid DNA sequence character of either case.
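The effect of each modifier is easiest to see by running the same pattern with and without it. The following is a sketch under assumed inputs: the sequence GATTACA and the two-line string are hypothetical test data, not part of the original handout.

```perl
use strict;
use warnings;

my $seq = "GATTACA";   # hypothetical uppercase sequence

my $without_i = ($seq =~ /[acgt]+/)  ? 1 : 0;  # 0: pattern is lowercase only
my $with_i    = ($seq =~ /[acgt]+/i) ? 1 : 0;  # 1: case is ignored

# x: whitespace and comments inside the pattern are ignored.
my $with_x = ($seq =~ / [acgt]+   # one or more DNA letters
                      /xi) ? 1 : 0;

my $two_lines = "first\nsecond";

my $dot_plain   = ($two_lines =~ /first.second/)  ? 1 : 0;  # 0: . will not match "\n"
my $dot_s       = ($two_lines =~ /first.second/s) ? 1 : 0;  # 1: with s, . matches "\n"
my $caret_plain = ($two_lines =~ /^second/)       ? 1 : 0;  # 0: ^ is start of string
my $caret_m     = ($two_lines =~ /^second/m)      ? 1 : 0;  # 1: with m, ^ is start of any line

print "i: $without_i -> $with_i, x: $with_x\n";
print "s: $dot_plain -> $dot_s, m: $caret_plain -> $caret_m\n";
```

Modifiers can be combined, as in /xi here, and their order does not matter.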
Using the caret (^) with a character class
In practice, it is often useful to check if a string contains anything except the characters of a particular class. The example above still succeeds even if the string also contains invalid DNA sequence characters, because a single valid character is enough to match. Insert a caret (^) at the beginning of the class to negate it, so that it matches any character that is not in the class.
Example: /[^acgt]+/i checks if a string contains anything except valid DNA sequence characters of either case.
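One way to put the negated class to work is a small validity check: a sequence is accepted only if the negated class finds nothing. The helper name is_valid_dna and the sample sequences are hypothetical, and rejecting the empty string is a choice made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $seq is non-empty and contains
# only a, c, g, or t (in either case), otherwise 0.
sub is_valid_dna {
    my ($seq) = @_;
    return 0 if $seq eq "";                 # nothing to validate
    return ($seq =~ /[^acgt]/i) ? 0 : 1;    # any other character fails
}

my $good = is_valid_dna("GATTaca");   # only valid characters
my $bad  = is_valid_dna("GATXACA");   # X is not a DNA character

print "GATTaca valid: $good\n";
print "GATXACA valid: $bad\n";
```

Note that the + quantifier is not needed here: finding a single invalid character is enough to reject the sequence.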
Substitutions with s///
In addition to matching strings, you may also use regular expressions to perform substitutions. Write s, then the regular expression between slashes, then the string you want to replace the match with, followed by a closing /. The general form is s/pattern/replacement/. Note that substitutions can be placed on a line of code by themselves (they do not need to be part of an assignment or a conditional statement).
Example:
$_ = "this will be a test";
s/will be/is/;
print "$_\n";
will output:
this is a test
By default, only the first match is replaced. To replace every match, append the g modifier. The example below also uses the i modifier so that baggins matches Baggins.
Example:
$_ = "Frodo Baggins and Bilbo Baggins are both hobbits.";
s/ baggins//gi;
print "$_\n";
will output:
Frodo and Bilbo are both hobbits.
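The difference between a single and a global substitution can also be shown on a sequence. In this sketch the sequence and variable names are made up; it replaces T with U, first once and then everywhere, and relies on the fact that s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

my $dna = "ATGTTTAGT";   # hypothetical sequence

my $first_only = $dna;
$first_only =~ s/T/U/;            # without g: only the first T is replaced

my $rna = $dna;
my $count = ($rna =~ s/T/U/g);    # with g: every T is replaced; $count is how many

print "$first_only\n";
print "$rna ($count substitutions)\n";
```

Copying $dna into a new variable before substituting keeps the original string unchanged, since s/// modifies the variable it is applied to.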