Basics of Pattern Matching & Substitutions with Perl Regular Expressions (Regexp)

Basics of Perl Regular Expressions (“regexp”)

Jon Radoff // Biophysics 101, Fall 2002

Simplistic use of a regular expression:

$_ = "this is a test";

if (/est/) {
    print "Match!\n";
}

In the above code, /est/ is the regular expression. It succeeds because est is a substring of “this is a test”. A regular expression may also contain “meta-characters” that let you specify special rules about how you would like to match.

Meta-characters:

\ Quote the next metacharacter

^ Match the beginning of the line

. Match any character (except newline)

$ Match the end of the line (or before newline at the end)

| Alternation

() Grouping

[] Character class

The most common meta-character in regular expressions is . which matches any single character except a newline. For example, if you used /te.t/ as the regular expression in the above code, it would succeed, because the dot matches the s character. /foo|test/ would succeed because | (read as “or”) matches any string that contains either foo or test. The [] operator lets you check for any one of a class of characters. For example, if you wanted to see whether a string contained the codon AGU or AGC you could use either /AG[UC]/ or /AGU|AGC/.
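The meta-characters above can be tried out in a short script. This is a minimal sketch rather than part of the original notes: the variable names and the codon list are made up for illustration, and it uses the =~ operator to apply a pattern to a named variable instead of the default $_.

```perl
use strict;
use warnings;

my $text = "this is a test";

# =~ applies a pattern to a named variable instead of the default $_
my $dot    = ($text =~ /te.t/)     ? 1 : 0;  # . stands in for the s
my $either = ($text =~ /foo|test/) ? 1 : 0;  # foo is absent, test is present
my $start  = ($text =~ /^this/)    ? 1 : 0;  # ^ anchors to the beginning
my $end    = ($text =~ /this$/)    ? 1 : 0;  # $ anchors to the end, so this fails

# Hypothetical codon list, for illustration only
my @codons  = ("AGU", "AGC", "AGA");
my @matched = grep { /AG[UC]/ } @codons;     # [UC] matches one U or one C

print "dot=$dot either=$either start=$start end=$end\n";
print "matched: @matched\n";
```

Only /this$/ fails here, because the string ends in test rather than this; AGA is left out of the matched list because A is not in the class [UC].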

“Quantifiers” may be added to the regular expression to control how many times the preceding character, class, or group must appear.

Quantifiers:

* Match 0 or more times

+ Match 1 or more times

? Match 1 or 0 times

{n} Match exactly n times

{n,} Match at least n times

{n,m} Match at least n but not more than m times

Examples: /this.*test/ would succeed for any string containing this followed later by test, separated by any number of arbitrary characters (other than newlines). /thi+s/ would succeed for a string containing th followed by one or more i characters followed by s.
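A small script makes the quantifier behavior concrete. The following is only a sketch: the test strings and variable names are hypothetical, chosen so that each pattern both succeeds and fails at least once.

```perl
use strict;
use warnings;

# Hypothetical test strings, for illustration only
my @strings = ("this is a test", "thiiis", "ths");

my %star;   # does /this.*test/ match?
my %plus;   # does /thi+s/ match?
for my $s (@strings) {
    $star{$s} = ($s =~ /this.*test/) ? 1 : 0;
    $plus{$s} = ($s =~ /thi+s/)      ? 1 : 0;
    print "$s: this.*test=$star{$s} thi+s=$plus{$s}\n";
}

# {n,m} bounds the repeat count: between 2 and 3 T characters here
my $two_t  = ("ATTG"   =~ /AT{2,3}G/) ? 1 : 0;  # 2 Ts: matches
my $four_t = ("ATTTTG" =~ /AT{2,3}G/) ? 1 : 0;  # 4 Ts: no match
print "ATTG=$two_t ATTTTG=$four_t\n";
```

Note that ths fails /thi+s/ because + requires at least one i, whereas /thi*s/ would accept it.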

Modifiers are appended to the end of a regular expression and apply special rules to your entire expression.

Modifiers:

i Do case-insensitive pattern matching.

g global (in substitutions, repeat substitution multiple times – see below)

m Treat string as multiple lines; i.e., let ^ and $ match at the start and end of each line

s Treat string as single line; i.e., let . match newlines as well

x Allow whitespace and comments in your regular expression

Example: /[acgt]+/i checks whether a string contains at least one valid DNA sequence character of either case.
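To see what the i and x modifiers change, the same character class can be run with and without them. This is an illustrative sketch only; the sequence and variable names are assumptions, not taken from the course material.

```perl
use strict;
use warnings;

my $seq = "GATTACA";   # hypothetical uppercase sequence

# Without /i the lowercase class [acgt] finds nothing in an uppercase string
my $case_sensitive = ($seq =~ /[acgt]+/) ? 1 : 0;

# With /i the same class also matches A, C, G and T
my $case_insensitive = ($seq =~ /[acgt]+/i) ? 1 : 0;

# /x lets the pattern be spread out and commented; /i still applies
my $commented = ($seq =~ /
    [acgt]+    # one or more DNA letters
/xi) ? 1 : 0;

print "case-sensitive=$case_sensitive case-insensitive=$case_insensitive commented=$commented\n";
```

Modifiers can be combined, as in /xi above. A comment inside an x pattern must not contain the / delimiter, since Perl would read it as the end of the pattern.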

Using the caret (^) with character class

In practice, it is often useful to check if a string contains anything except the characters of a particular class. The example above will still succeed even if the string also contains invalid DNA sequence characters. Insert a caret (^) at the beginning of the class to negate it, so that it matches any character that is not in the class.

Example: /[^acgt]+/i checks whether a string contains at least one character that is not a valid DNA sequence character of either case.
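The difference between the plain class and the negated class shows up when both are run against a clean and a contaminated sequence. The sketch below uses two hypothetical sequences and made-up variable names.

```perl
use strict;
use warnings;

# Hypothetical sequences: the second contains two invalid characters
my @seqs = ("ACGTtgca", "ACGTXX");

my %has_valid;    # /[acgt]+/i  : at least one valid character somewhere
my %has_invalid;  # /[^acgt]+/i : at least one character outside the class
for my $seq (@seqs) {
    $has_valid{$seq}   = ($seq =~ /[acgt]+/i)  ? 1 : 0;
    $has_invalid{$seq} = ($seq =~ /[^acgt]+/i) ? 1 : 0;
    print "$seq valid=$has_valid{$seq} invalid=$has_invalid{$seq}\n";
}
```

Both sequences pass the plain class, but only ACGTXX trips the negated one, so the negated class is the test that actually detects bad input. A newline at the end of a line read from a file also counts as a character outside the class, so such input would normally be chomped first.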

Substitutions with s///

In addition to matching strings, you may also use regular expressions to perform substitutions. Do this by prefixing the regular expression with s, then following it with the replacement string and a closing /, giving the form s/pattern/replacement/. Note that substitutions can be placed on a line of code by themselves (they do not need to be part of an assignment or a conditional statement).

Example:

$_ = "this will be a test";
s/will be/is/;
print "$_\n";

will output:

this is a test

By default, only the first match is replaced. To replace every match, append the g modifier.

Example:

$_ = "Frodo Baggins and Bilbo Baggins are both hobbits.";
s/ baggins//gi;
print "$_\n";

will output:

Frodo and Bilbo are both hobbits.
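The effect of the g modifier can also be seen on a sequence. The sketch below is an illustration, not part of the original notes: it replaces T with U in a hypothetical DNA fragment, applies s/// to named variables with =~ rather than to $_, and relies on the fact that s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

my $dna = "ATGTTT";   # hypothetical DNA fragment

# Without /g only the first T is replaced
my $first_only = $dna;
$first_only =~ s/T/U/;

# With /g every T is replaced; s/// returns the number of substitutions made
my $rna   = $dna;
my $count = ($rna =~ s/T/U/g);

print "$first_only\n";        # AUGTTT
print "$rna ($count)\n";      # AUGUUU (4)
```

Copying the string before substituting keeps the original in $dna intact, since s/// modifies the variable it is applied to.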