Regular Expression Basic Syntax Reference

Perl printf Function

Syntax
printf FILEHANDLE FORMAT, LIST
printf FORMAT, LIST
Definition and Usage
Prints the value of LIST interpreted via the format specified by FORMAT to the current output filehandle, or to the one specified by FILEHANDLE.
Effectively equivalent to print FILEHANDLE sprintf(FORMAT, LIST).
You can use print in place of printf if you do not require a specific output format.
The following formatting conversions are accepted:
Format / Result
%% / A percent sign
%c / A character with the given ASCII code
%s / A string
%d / A signed integer (decimal)
%u / An unsigned integer (decimal)
%o / An unsigned integer (octal)
%x / An unsigned integer (hexadecimal)
%X / An unsigned integer (hexadecimal using uppercase characters)
%e / A floating point number (scientific notation)
%E / A floating point number, uses E instead of e
%f / A floating point number (fixed decimal notation)
%g / A floating point number (%e or %f notation according to value size)
%G / A floating point number (as %g, but using "E" in place of "e" when appropriate)
%p / A pointer (prints the memory address of the value in hexadecimal)
%n / Stores the number of characters output so far into the next variable in the parameter list
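As a small illustration of the conversions listed above, the sketch below formats a few values with sprintf, which accepts the same FORMAT as printf but returns the string instead of printing it. The variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sprintf takes the same FORMAT as printf but returns the formatted string.
my $char  = sprintf "%c", 65;        # character with code 65: "A"
my $int   = sprintf "%d", -42;       # signed decimal integer: "-42"
my $octal = sprintf "%o", 255;       # octal: "377"
my $hex   = sprintf "%x", 255;       # lowercase hexadecimal: "ff"
my $upper = sprintf "%X", 255;       # uppercase hexadecimal: "FF"
my $fixed = sprintf "%f", 1234.5;    # fixed notation: "1234.500000"
my $sci   = sprintf "%e", 1234.5;    # scientific notation
my $pct   = sprintf "%d%%", 50;      # literal percent sign: "50%"

print join(" ", $char, $int, $octal, $hex, $upper, $fixed, $sci, $pct), "\n";
```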
Perl also supports flags that optionally adjust the output format. These are specified between the % and conversion letter. They are shown in the following table:
Flag / Result
space / Prefix positive number with a space
+ / Prefix positive number with a plus sign
- / Left-justify within field
0 / Use zeros, not spaces, to right-justify
# / Prefix non-zero octal with "0" and hexadecimal with "0x"
number / Minimum field width
.number / Specify precision (number of digits after decimal point) for floating point numbers
l / Interpret integer as C-type "long" or "unsigned long"
h / Interpret integer as C-type "short" or "unsigned short"
V / Interpret integer as Perl's standard integer type
v / Interpret the string as a series of integers and output as numbers separated by periods or by an arbitrary string extracted from the argument when the flag is preceded by *.
The # flag, placed between the percent sign and a non-decimal conversion (%x, %o or %b), makes printf add a prefix that shows which base the integer is in. For example, printing the number 255 with %d, %#x, %#o and %#b gives:
255 0xff 0377 0b11111111
Without the # flag, the same conversions (%d, %x, %o and %b) give only:
255 ff 377 11111111
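The flags in the table can be tried out with a short script such as the one below. It is only a sketch: the values and variable names are arbitrary, and sprintf is used instead of printf so that each result can be inspected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $plus  = sprintf "%+d", 7;              # "+7"
my $space = sprintf "% d", 7;              # " 7"
my $left  = sprintf "[%-5d]", 7;           # "[7    ]"
my $zero  = sprintf "%05d", 7;             # "00007"
my $prec  = sprintf "[%6.2f]", 3.14159;    # "[  3.14]"

# With and without the # flag for the non-decimal bases.
my $prefixed = sprintf "%d %#x %#o %#b", (255) x 4;   # "255 0xff 0377 0b11111111"
my $plain    = sprintf "%d %x %o %b",    (255) x 4;   # "255 ff 377 11111111"

# The v flag prints the code of each character, joined by periods.
my $codes = sprintf "%vd", "ABC";          # "65.66.67"

print "$_\n" for $plus, $space, $left, $zero, $prec, $prefixed, $plain, $codes;
```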
Return Value

0 on failure
1 on success

Example
Try out the following example:
#!/usr/bin/perl -w
printf "%d\n", 3.1415126;
printf "The cost is \$%6.2f\n", 499;
printf "Perl's version is v%vd\n", $^V;
printf "%04d\n", 20;
It produces the following output. The version line depends on the installed Perl; 5.36.0 is shown purely as an illustration:
3
The cost is $499.00
Perl's version is v5.36.0
0020

Regular Expression Basic Syntax Reference

Characters
Character / Description / Example
Any character except [\^$.|?*+() / All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they're part of a valid regular expression token (e.g. the {n} quantifier). / a matches a
\ (backslash) followed by any of [\^$.|?*+(){} / A backslash escapes special characters to suppress their special meaning. / \+ matches +
\Q...\E / Matches the characters between \Q and \E literally, suppressing the meaning of special characters. / \Q+-*/\E matches +-*/
\xFF where FF are 2 hexadecimal digits / Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes. / \xA9 matches © when using the Latin-1 code page.
\n, \r and \t / Match an LF character, CR character and a tab character respectively. Can be used in character classes. / \r\n matches a DOS/Windows CRLF line break.
\a, \e, \f and \v / Match a bell character (\x07), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively. Can be used in character classes.
\cA through \cZ / Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes. / \cM\cJ matches a DOS/Windows CRLF line break.
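A few of the escapes from the table above, sketched in Perl. The sample strings are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'cost: 1+1=2 (approx.)';

# \+ matches a literal plus sign instead of acting as a quantifier.
my $has_plus = $text =~ /1\+1/ ? 1 : 0;

# \Q...\E treats everything in between as literal text.
my $literal     = '(approx.)';
my $has_literal = $text =~ /\Q$literal\E/ ? 1 : 0;

# \x41 is the character with hexadecimal code 41, the letter A.
my $hex_match = 'A' =~ /\x41/ ? 1 : 0;

# \r\n and \cM\cJ both describe a CRLF line break.
my $crlf      = "line1\r\nline2";
my $by_escape = $crlf =~ /\r\n/   ? 1 : 0;
my $by_ctrl   = $crlf =~ /\cM\cJ/ ? 1 : 0;

print "$has_plus $has_literal $hex_match $by_escape $by_ctrl\n";   # 1 1 1 1 1
```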
Character Classes or Character Sets [abc]
Character / Description / Example
[ (opening square bracket) / Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except for a few character escapes that are indicated with "can be used inside character classes".
Any character except ^-]\ / All characters except the listed special characters add that character to the possible matches for the character class. / [abc] matches a, b or c
\ (backslash) followed by any of ^-]\ / A backslash escapes special characters to suppress their special meaning. / [\^\]] matches ^ or ]
- (hyphen) except immediately after the opening [ / Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [) / [a-zA-Z0-9] matches any letter or digit
^ (caret) immediately after the opening [ / Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [) / [^a-d] matches x (any character except a, b, c or d)
\d, \w and \s / Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes. / [\d\s] matches a character that is a digit or whitespace
\D, \W and \S / Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.) / \D matches a character that is not a digit
[\b] / Inside a character class, \b is a backspace character. / [\b\t] matches a backspace or tab character
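The character class rules above can be checked with a short Perl sketch like this one. The test strings are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# [a-zA-Z0-9] accepts any single letter or digit.
my @alnum = grep { /^[a-zA-Z0-9]$/ } ('a', 'Z', '5', '-', ' ');   # a, Z, 5

# A caret right after the opening bracket negates the class.
my $x_ok = 'x' =~ /^[^a-d]$/ ? 1 : 0;   # 1
my $b_ok = 'b' =~ /^[^a-d]$/ ? 1 : 0;   # 0

# Escaped ^ and ] are literal members of the class.
my @literal = grep { /^[\^\]]$/ } ('^', ']', 'a');   # ^, ]

# A hyphen right after the opening bracket is a literal hyphen.
my $hyphen = '-' =~ /^[-a]$/ ? 1 : 0;   # 1

# Shorthand classes can be combined inside a class.
my $count = () = "a1 b2" =~ /[\d\s]/g;   # 3 matches: "1", " ", "2"

print "@alnum | $x_ok $b_ok | @literal | $hyphen | $count\n";
```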
Dot
Character / Description / Example
. (dot) / Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too. / . matches x or (almost) any other character
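In Perl, the option that makes the dot match line breaks is the /s modifier. A minimal sketch with an invented two-line string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "a\nb";

my $default = $text =~ /a.b/  ? 1 : 0;   # 0: the dot does not match "\n"
my $dotall  = $text =~ /a.b/s ? 1 : 0;   # 1: with /s the dot matches "\n" too

print "$default $dotall\n";
```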
Anchors
Character / Description / Example
^ (caret) / Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well. / ^. matches a in abc\ndef. Also matches d in "multi-line" mode.
$ (dollar) / Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break. / .$ matches f in abc\ndef. Also matches c in "multi-line" mode.
\A / Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Never matches after line breaks. / \A. matches a in abc
\Z / Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks, except for the very last line break if the string ends with a line break. / .\Z matches f in abc\ndef
\z / Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks. / .\z matches f in abc\ndef
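The differences between the anchors show up most clearly on a string that ends with a line break. The following Perl sketch uses a made-up two-line string; /m is Perl's "multi-line" mode.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "abc\ndef\n";

my ($first)     = $text =~ /^(.)/;      # "a": ^ matches only at the string start
my @line_starts = $text =~ /^(.)/mg;    # ("a", "d"): with /m, ^ matches after line breaks
my @line_ends   = $text =~ /(.)$/mg;    # ("c", "f"): with /m, $ matches before line breaks
my ($before_nl) = $text =~ /(.)\Z/;     # "f": \Z matches before the final line break
my $strict_end  = $text =~ /f\z/    ? 1 : 0;   # 0: a "\n" still follows the f
my $with_nl     = $text =~ /f\n\z/  ? 1 : 0;   # 1: \z matches only at the very end

print "$first | @line_starts | @line_ends | $before_nl | $strict_end $with_nl\n";
```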
Word Boundaries
Character / Description / Example
\b / Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters. / .\b matches c in abc
\B / Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W). / \B.\B matches b in abc
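A common use of \b is matching whole words only. The Perl sketch below counts matches in an invented sample string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "cat concat cat's";

my $anywhere = () = $text =~ /cat/g;       # 3: every occurrence of "cat"
my $whole    = () = $text =~ /\bcat\b/g;   # 2: "cat" and the "cat" in "cat's"
my $inside   = () = $text =~ /\Bcat/g;     # 1: only the "cat" inside "concat"

print "$anywhere $whole $inside\n";
```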
Alternation
Character / Description / Example
| (pipe) / Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options. / abc|def|xyz matches abc, def or xyz
| (pipe) / The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression. / abc(def|xyz) matches abcdef or abcxyz
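Because the pipe has the lowest precedence, leaving out the group changes what is alternated. A short Perl sketch with an arbitrary word list:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words = qw(abcdef abcxyz abc def);

# Grouped: "abc" followed by either "def" or "xyz".
my @grouped = grep { /^abc(def|xyz)$/ } @words;   # abcdef, abcxyz

# Ungrouped: either "^abc" or "def$", so every word in the list matches.
my @ungrouped = grep { /^abc|def$/ } @words;      # abcdef, abcxyz, abc, def

print "@grouped\n@ungrouped\n";
```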
Quantifiers
Character / Description / Example
? (question mark) / Makes the preceding item optional. Greedy, so the optional item is included in the match if possible. / abc? matches ab or abc
?? / Makes the preceding item optional. Lazy, so the optional item is excluded in the match if possible. This construct is often excluded from documentation because of its limited use. / abc?? matches ab or abc
* (star) / Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all. / ".*" matches "def" "ghi" in abc "def" "ghi" jkl
*? (lazy star) / Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item. / ".*?" matches "def" in abc "def" "ghi" jkl
+ (plus) / Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once. / ".+" matches "def" "ghi" in abc "def" "ghi" jkl
+? (lazy plus) / Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item. / ".+?" matches "def" in abc "def" "ghi" jkl
{n} where n is an integer >= 1 / Repeats the previous item exactly n times. / a{3} matches aaa
{n,m} where n >= 0 and m >= n / Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times. / a{2,4} matches aaaa, aaa or aa
{n,m}? where n >= 0 and m >= n / Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times. / a{2,4}? matches aa, aaa or aaaa
{n,} where n >= 0 / Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only n times. / a{2,} matches aaaaa in aaaaa
{n,}? where n >= 0 / Repeats the previous item n or more times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item. / a{2,}? matches aa in aaaaa
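The greedy and lazy examples from the table can be reproduced in Perl as follows. This is a sketch using the table's own sample strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'abc "def" "ghi" jkl';

my ($greedy) = $text =~ /(".*")/;    # "def" "ghi"
my ($lazy)   = $text =~ /(".*?")/;   # "def"

my ($range_greedy) = 'aaaaa' =~ /(a{2,4})/;    # aaaa
my ($range_lazy)   = 'aaaaa' =~ /(a{2,4}?)/;   # aa

my ($opt_greedy) = 'abc' =~ /(abc?)/;    # abc
my ($opt_lazy)   = 'abc' =~ /(abc??)/;   # ab

print "$greedy\n$lazy\n$range_greedy $range_lazy $opt_greedy $opt_lazy\n";
```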

Perl's Rich Support for Regular Expressions

Perl was originally designed by Larry Wall as a flexible text-processing language. Over the years, it has grown into a full-fledged programming language, keeping a strong focus on text processing. When the World Wide Web became popular, Perl became the de facto standard for creating CGI scripts. A CGI script is a small piece of software that generates a dynamic web page, based on a database and/or input from the person visiting the website. Since a CGI script is basically a text-processing script, Perl was and still is a natural choice.

Because of Perl's focus on managing and mangling text, regular expression text patterns are an integral part of the Perl language. This is in contrast with most other languages, where regular expressions are available as add-on libraries. In Perl, you can use the m// operator to test if a regex can match a string, e.g.:

if ($string =~ m/regex/) {
    print 'match';
} else {
    print 'no match';
}

Performing a regex search-and-replace is just as easy:

$string =~ s/regex/replacement/g;

I added a "g" after the last forward slash. The "g" stands for "global", which tells Perl to replace all matches, and not just the first one. Options are typically indicated including the slash, like "/g", even though you do not add an extra slash, and even though you could use any non-word character instead of slashes. If your regex contains slashes, use another character, like s!regex!replacement!g.

You can add an "i" to make the regex match case insensitive. You can add an "s" to make the dot match newlines. You can add an "m" to make the dollar and caret match at newlines embedded in the string, as well as at the start and end of the string.

Together you would get something like m/regex/sim;
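The effect of each modifier, and of /g on a substitution, can be seen in a small sketch like the one below. The sample strings and variable names are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Hello\nWORLD";

my $plain    = $text =~ /^world$/           ? 1 : 0;   # 0: wrong case, and ^ is not at a line start
my $case     = $text =~ /world/i            ? 1 : 0;   # 1: /i ignores case
my $dotall   = $text =~ /Hello.WORLD/s      ? 1 : 0;   # 1: /s lets the dot match "\n"
my $multi    = $text =~ /^WORLD$/m          ? 1 : 0;   # 1: /m lets ^ match after "\n"
my $combined = $text =~ /hello.^world$/sim  ? 1 : 0;   # 1: all three together

# Without /g only the first match is replaced; with /g all of them are.
my $first_only = "a-b-c";
$first_only =~ s/-/+/;        # "a+b-c"
my $all = "a-b-c";
$all =~ s/-/+/g;              # "a+b+c"

# An alternative delimiter avoids escaping slashes in the pattern.
my $path = "/usr/local/bin";
$path =~ s!/!::!g;            # "::usr::local::bin"

print "$plain $case $dotall $multi $combined $first_only $all $path\n";
```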

Regex-Related Special Variables

Perl has a host of special variables that get filled after every m// or s/// regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last (highest-numbered) backreference. $& (dollar ampersand) holds the entire regex match.

@- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths).

$' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $` (dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script.

All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they had an implicit 'local' at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match, when that sub returns, your variables are still set as they were for the first match.
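The variables described above can be inspected after a successful match. The sketch below uses an invented sample string and a hypothetical helper sub, inner_match, to illustrate the dynamic scoping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "order 12345 shipped";
my ($whole, $id, $status, $start, $end, $id_end);

if ($string =~ /(\d+)\s+(\w+)/) {
    $whole  = $&;      # entire match: "12345 shipped"
    $id     = $1;      # first group: "12345"
    $status = $2;      # second group: "shipped"
    $start  = $-[0];   # offset where the whole match starts: 6
    $end    = $+[0];   # offset just past the whole match: 19
    $id_end = $+[1];   # offset just past the first group: 11
}

# A match inside a sub does not disturb the caller's variables.
sub inner_match {
    "xyz" =~ /(y)/;
    return $1;
}

my ($inner, $outer_after);
if ("abc" =~ /(b)/) {
    $inner       = inner_match();   # "y"
    $outer_after = $1;              # still "b"
}

print "$whole | $id | $status | $start-$end | $inner | $outer_after\n";
```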

Finding All Matches In a String

The "/g" modifier can be used to process all regex matches in a string. The first m/regex/g will find the first match, the second m/regex/g the second match, etc. The location in the string where the next match attempt will begin is automatically remembered by Perl, separately for each string. Here is an example:

while ($string =~ m/regex/g) {
    print "Found '$&'. Next attempt at character " . (pos($string) + 1) . "\n";
}

The pos() function retrieves the position where the next attempt begins. The first character in the string has position zero. You can modify this position by using the function as the left side of an assignment, like in pos($string) = 123;.
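The sketch below records pos() after each match and then sets it by hand. The sample string is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "a1b22c333";
my @next_positions;

while ($string =~ /\d+/g) {
    push @next_positions, pos($string);   # offset where the next attempt starts
}
# @next_positions now holds (2, 5, 9)

# The failed final attempt resets pos(); assign to it to resume at offset 4.
pos($string) = 4;
my $resumed = $string =~ /\d+/g ? $& : undef;   # "2", the second digit of "22"
my $after   = pos($string);                     # 5

print "@next_positions | $resumed | $after\n";
```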

Perl Extensions

Regular Expression
Class / Type / Meaning
\t / Character Set / tab
\n / Character Set / newline
\r / Character Set / return
\f / Character Set / form feed
\a / Character Set / alarm (bell)
\e / Character Set / escape
\033 / Character Set / octal character (here ESC)
\x1B / Character Set / hex character (here ESC)
\c[ / Character Set / control character (here ESC)
\l / Character Set / lowercase next character
\u / Character Set / uppercase next character
\L / Character Set / lowercase until \E
\U / Character Set / uppercase until \E
\E / Character Set / end case modification
\Q / Character Set / quote (disable) pattern metacharacters until \E
\w / Character Set / Match a "word" character
\W / Character Set / Match a non-word character
\s / Character Set / Match a whitespace character
\S / Character Set / Match a non-whitespace character
\d / Character Set / Match a digit character
\D / Character Set / Match a non-digit character
\b / Anchor / Match a word boundary
\B / Anchor / Match a non-(word boundary)
\A / Anchor / Match only at beginning of string
\Z / Anchor / Match only at EOS, or before newline
\z / Anchor / Match only at end of string
\G / Anchor / Match only where previous m//g left off
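Two of the less familiar entries in this table are the case-modification escapes and the \G anchor. The sketch below shows both. The tiny tokenizer and its token labels (NUM, WORD) are invented for the illustration; the /c modifier, which the table does not list, keeps pos() unchanged when a match attempt fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Case modification and quoting inside double-quoted strings.
my $title  = "\u\LHELLO wORLD\E!";    # "Hello world!"
my $shout  = "\Uquiet\E please";      # "QUIET please"
my $quoted = "\Qa.b\E";               # "a\.b"

# \G anchors each attempt at the position where the previous m//g stopped.
my $input = "12ab34";
my @tokens;
pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G(\d+)/gc)    { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([a-z]+)/gc) { push @tokens, "WORD:$1" }
    else                             { last }   # unrecognized input: stop
}

print "$title | $shout | $quoted | @tokens\n";
```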

Example of Perl Extended, multi-line regular expression

m{ \(
    (               # Start group
        [^()]+      # anything but '(' or ')'
    |               # or
        \( [^()]* \)
    )+              # end group