Regular Expressions

When searching for a text string using the Find, Replace, Replace Line Enders or Find Text in Disk Files commands, Boxer supports the use of Regular Expressions, a pattern matching grammar first popularized on the Unix operating system. Regular Expressions make it possible to specify a search string which can match many different target strings, or to restrict the ways in which a search string can be matched.

Boxer uses Perl-Compatible Regular Expressions as implemented by the increasingly popular PCRE 5.0 library. See the end of this topic for further information and acknowledgements.

A complete treatment of the topic of regular expressions could--and does--fill an entire book. Mastering Regular Expressions, by Jeffrey Friedl, is one such book, and a good one at that. This help topic was written to acquaint the typical user with the most common regular expression features, without getting too bogged down in fine details. The advanced reader is encouraged to seek out additional information on the web, or within the PCRE documentation itself. We have posted one such reference document on our site for your convenience.

Regular Expressions are very powerful, and can be more easily understood by studying several examples.

The dot (.) will match any single character, except the newline character. Example: p.t will match pat, pet, pit, pot, and put, and in fact any 3-character sequence with p and t at its ends and a single character in the middle.

The asterisk (*) will match zero or more occurrences of the preceding character. Example: zo*m will match zm, zom, zoom and zooooooooom, among others. Note that the character preceding the asterisk can be the dot, so zero or more occurrences of any character will be matched when the construction .* is used. Example: B.*r will match Boxer, Bowler, Bookmaker, Bookkeeper and Building Manager.

The plus sign (+) will match one or more occurrences of the preceding character. Example: ho+p will match hop, hoop and hooooooop, among others. Note that the character preceding the plus sign can be the dot, so that one or more occurrences of any character will be matched when the construction .+ is used.
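The dot, asterisk and plus sign can be tried out in any engine that shares this syntax. The sketch below uses Python's re module rather than Boxer itself; that choice is an assumption made purely for illustration (Python's engine is not PCRE, but it treats these three metacharacters the same way). The helper name full_matches is hypothetical.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [word for word in candidates if re.fullmatch(pattern, word)]

# The dot stands for exactly one character, so "pt" and "past" are rejected.
dot_hits = full_matches(r"p.t", ["pat", "pet", "p9t", "pt", "past"])

# The asterisk allows zero or more of the preceding "o".
star_hits = full_matches(r"zo*m", ["zm", "zom", "zoom", "zam"])

# The plus sign requires at least one "o", so "hp" is rejected.
plus_hits = full_matches(r"ho+p", ["hp", "hop", "hoop"])

print(dot_hits, star_hits, plus_hits)
```

Whole-string matching is used here only to keep the accepted and rejected candidates easy to compare; a search within a line would also find these patterns inside longer text.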

The caret (^) can be used to force a match to occur at the start of a line. Example: ^The will match any line beginning with The.

The dollar sign ($) can be used to force a match to occur at the end of a line. Example: result$ will match any line ending with the word result.
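As a rough illustration of the two anchors, the sketch below again uses Python's re module as a stand-in for Boxer (an assumption for illustration only). Python needs the re.MULTILINE flag before ^ and $ apply to each line of a multi-line string; the sample text is invented.

```python
import re

text = "The start\nnot The start\nfinal result\nresult pending"

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
starts_with_the = re.findall(r"^The.*", text, flags=re.MULTILINE)
ends_with_result = re.findall(r".*result$", text, flags=re.MULTILINE)

print(starts_with_the)   # lines beginning with "The"
print(ends_with_result)  # lines ending with "result"
```

Only the first line begins with The, and only the third line ends with result, even though both words appear elsewhere in the text.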

One or more characters can be placed within square brackets to designate the characters which can match in that position. Example: p[aeiou]t will match pat, pet, pit, pot and put. Note that digits are also characters, so an expression such as 201[1234] will match any of 2011, 2012, 2013 or 2014.

Characters can also be placed within square brackets with a dash between them to designate a range of characters. Example: [b-d]ent will match bent, cent and dent because the expression [b-d] is shorthand for all characters in that range. The character range can be entered in ascending or descending order; both [A-Z] and [Z-A] are allowed and are functionally equivalent.

The character set appearing within square brackets can be negated by using the caret (^) as the first character within the opening square bracket. Example: [^cb]ent will match tent, rent, sent, dent and others, but not cent or bent. The caret can also be applied to negate a character range within square brackets: [^a-e] will match all characters except a, b, c, d and e. If the caret appears anywhere else within the range expression, its meaning reverts to that of matching the caret itself.
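The bracket constructions described above can be compared side by side. This is a minimal sketch in Python's re module, used here only as a convenient stand-in for Boxer (an assumption); the word list is invented for the purpose.

```python
import re

words = ["pat", "pet", "pyt", "bent", "cent", "dent", "tent", "rent"]

# A set of characters: any one vowel between p and t.
vowel_set = [w for w in words if re.fullmatch(r"p[aeiou]t", w)]

# A range: b, c or d in the first position.
in_range = [w for w in words if re.fullmatch(r"[b-d]ent", w)]

# A negated set: any first character except c or b.
negated = [w for w in words if re.fullmatch(r"[^cb]ent", w)]

# A caret that is not first inside the brackets is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")

print(vowel_set, in_range, negated, caret_literal)
```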

The vertical bar (|) can be used to separate two or more regular expressions so that any of the patterns will match. Example: red|green|blue|yellow will match any of the color names that are separated by the vertical bars.

Left and right parentheses can be used to start and end a subpattern. Example: c(ar|en|oun)t will match cart, cent and count. In the absence of the parentheses, car|en|ount would match car, en or ount... a very different result.
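The effect of the parentheses on alternation is easy to check. The following sketch uses Python's re module in place of Boxer (an assumption for illustration) and tests each candidate against the whole pattern.

```python
import re

candidates = ["cart", "cent", "count", "car", "ount"]

# With parentheses, the alternatives are confined between c and t.
grouped = re.compile(r"c(ar|en|oun)t")
# Without them, each alternative is a complete pattern on its own.
ungrouped = re.compile(r"car|en|ount")

grouped_hits = [w for w in candidates if grouped.fullmatch(w)]
ungrouped_hits = [w for w in candidates if ungrouped.fullmatch(w)]

print(grouped_hits)
print(ungrouped_hits)
```

The grouped pattern accepts only the three complete words, while the ungrouped pattern accepts only the fragments car and ount from this list.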

The backslash can be used to remove significance from a pattern matching character. Example: if you need to search for an asterisk, use \*. To search for a dot, use \.. To search for a plus sign, use \+. To search for the backslash itself, use \\.

To force a pattern to find only those occurrences of a search string which appear as whole words, the pattern can be surrounded with a sequence that forces a match at a word boundary. Example: to find the word sign, but not words such as assign, signature or assignment, use \bsign\b.
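Escaping and whole-word matching can be sketched as follows, once more with Python's re module standing in for Boxer (an assumption for illustration). The sample strings are invented.

```python
import re

# A backslash removes the special meaning of *, . and \ itself.
asterisks = re.findall(r"\*", "2 * 3 * 4")
dots = re.findall(r"\.", "v1.2.3")
backslashes = re.findall(r"\\", r"C:\temp\notes")

sentence = "sign the assignment; the signature is a sign"

# Without word boundaries, "sign" is also found inside longer words.
anywhere = re.findall(r"sign", sentence)
# With \b on both sides, only the standalone word is found.
whole_word = re.findall(r"\bsign\b", sentence)

print(len(asterisks), len(dots), len(backslashes))
print(len(anywhere), len(whole_word))
```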

Several characters that are not readily typed from the keyboard can be matched using special character sequences:

\\	match a backslash character
\a	match a bell (alarm) character (ASCII 7)
\b	match a backspace character (ASCII 8) (only if used in a character class)
\cx	match character Control-x (x = any character)
\e	match an escape character (ASCII 27)
\f	match a formfeed character (ASCII 12)
\n	match a newline character (ASCII 10)
\r	match a carriage return character (ASCII 13)
\t	match a tab character (ASCII 9)
\ddd	match octal character ddd (d = any digit 0-7)
\xhh	match hexadecimal character hh (h = any hex digit)

There are several convenient shorthand sequences for matching common character classes:

\d	match a decimal digit (0-9), equivalent to: [0-9]
\D	match any character except a decimal digit, equivalent to: [^0-9]
\s	match any whitespace character, equivalent to: [\t\n\f\r ]
\S	match any character except whitespace, equivalent to [^\t\n\f\r ]
\w	match any word character, equivalent to: [_a-zA-Z0-9]
\W	match any character except a word character, equivalent to: [^_a-zA-Z0-9]
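The shorthand sequences can be checked against their bracketed equivalents. This sketch uses Python's re module rather than Boxer (an assumption for illustration). The re.ASCII flag is passed because Python would otherwise extend these classes to non-ASCII characters; Python's \s also accepts a vertical tab, which does not occur in the invented sample below.

```python
import re

sample = "Order 66:\tqty 12, snake_case\n"

# \d+ collects each run of decimal digits.
digits = re.findall(r"\d+", sample, flags=re.ASCII)

# \s finds each whitespace character; the bracketed form finds the same ones.
whitespace = re.findall(r"\s", sample, flags=re.ASCII)
explicit_whitespace = re.findall(r"[\t\n\f\r ]", sample)

# \S+ collects each run of non-whitespace characters.
tokens = re.findall(r"\S+", sample, flags=re.ASCII)

print(digits)
print(tokens)
```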

The following sequences can be used to force a match to occur only at a required position:

\b	match at a word boundary
\B	match when not at a word boundary
\A	match at start of subject
\Z	match at end of subject or before newline
\z	match at end of subject
\G	match at first matching position in subject
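A few of these position sequences can be illustrated with Python's re module, used here as a stand-in for Boxer (an assumption). Only \b, \B and \A are shown, because Python handles the end-of-subject sequences differently from PCRE and does not support \G.

```python
import re

two_lines = "The\nThe"

# ^ matches at the start of each line when re.MULTILINE is set...
line_starts = re.findall(r"^The", two_lines, flags=re.MULTILINE)
# ...but \A matches only at the very start of the subject.
subject_start = re.findall(r"\AThe", two_lines, flags=re.MULTILINE)

words = "sign assign design signal"

# \B requires that "sign" is NOT at the start of a word;
# the trailing \b requires that it ends one.
word_endings = re.findall(r"\Bsign\b", words)

print(len(line_starts), len(subject_start), len(word_endings))
```

Here \Bsign\b finds sign only inside assign and design, where it ends a word without starting one.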

The following examples illustrate some common constructions, and give examples of the utility--and complexity--of some advanced regular expressions:

.*	match zero or more occurrences of any character
.+	match one or more occurrences of any character
^$	match an empty line
^\s+$	match a line containing only whitespace
^\s+	match leading whitespace
\s+$	match trailing whitespace
[a-zA-Z]	match any alphabetic character
this|that	match 'this' or 'that'
\b(\w+)\s+\1\b	match repeated words (such as 'the the')
\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b	match most email addresses (requires a case-insensitive search)
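Several of the constructions in this table can be exercised together. The sketch below uses Python's re module in place of Boxer (an assumption for illustration); the sample lines and the address at example.com are invented. Because the email pattern lists only uppercase letters, the re.IGNORECASE flag is needed for it to match lowercase text.

```python
import re

lines = ["  indented", "trailing   ", "", "   "]

# ^$ matches only a line with nothing on it.
empty_lines = [i for i, line in enumerate(lines) if re.search(r"^$", line)]
# ^\s+$ matches a line made up entirely of whitespace.
whitespace_only = [i for i, line in enumerate(lines) if re.search(r"^\s+$", line)]

# Removing leading (^\s+) and trailing (\s+$) whitespace in one pass.
trimmed = [re.sub(r"^\s+|\s+$", "", line) for line in lines]

email_pattern = re.compile(
    r"\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b", flags=re.IGNORECASE
)
found = email_pattern.search("Contact: jane.doe@example.com today")

print(empty_lines, whitespace_only, trimmed)
print(found.group(0))
```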

A min/max quantifier can be used to control how many instances of the preceding entity are to be allowed within a match. The syntax for min/max quantifiers is summarized in this table:

{	start a min/max quantifier
}	end a min/max quantifier
{3}	match exactly 3 of the previous item
{3,}	match at least 3 of the previous item
{3,5}	match at least 3, but no more than 5 of the previous item

Example: the pattern [abc]{4,8} would match a sequence of characters consisting of the letters a, b or c, so long as at least 4 characters are present, and no more than 8 appear. Potential matches: aaaa, accb, abcabc, bbbbcccc. Non matches: aaa, abcd. Note that a longer run such as abcabcabc (9 characters) cannot be matched as a whole, but a search will still match its first 8 characters.
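The behavior of a min/max quantifier can be verified with a short sketch. Python's re module is used here as a stand-in for Boxer (an assumption for illustration); it distinguishes between matching a whole string and searching within one.

```python
import re

pattern = re.compile(r"[abc]{4,8}")
candidates = ["aaaa", "accb", "abcabc", "bbbbcccc", "aaa", "abcd", "abcabcabc"]

# Candidates that the pattern matches from start to end.
whole = [c for c in candidates if pattern.fullmatch(c)]

# A search inside the 9-character run still succeeds, stopping at the maximum of 8.
partial = pattern.search("abcabcabc").group(0)

# "abcd" contains only 3 qualifying characters in a row, so a search finds nothing.
no_match = pattern.search("abcd")

print(whole)
print(partial)
```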

One of the more powerful features of Perl regular expressions is the ability to make reference within a pattern to the string that matched a subpattern which occurred earlier in the pattern. Subpatterns are created when a portion of a pattern is enclosed in left and right parentheses. The first opening left parenthesis encountered starts a subpattern whose number is 1. The second left parenthesis creates subpattern 2, and so on. To make a back reference to a subpattern by number, use a backslash followed by the subpattern number, as in \1 for subpattern 1.

Referring to subpatterns by number can get confusing when a complex regular expression is being created. For this reason, named subpatterns are also permitted. To start a subpattern named 'foo', the syntax (?P<foo> would be used in place of the plain opening parenthesis, with the subpattern closed by a right parenthesis as usual.

Later on in the pattern, the string that matched subpattern 'foo' could be referenced using the syntax (?P=foo).

Consider the pattern \b(\w+)\s+\1\b, listed earlier as a way to find repeated words. The subpattern (\w+) matches any string that contains one or more word characters. In order for the entire pattern to match, that same string must appear again (due to the \1 reference) with one or more whitespace characters (\s+) in between. Finally, the \b sequences at each end ensure that the pattern matches only at a word boundary.
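Back references by number and by name can be compared in a short sketch. Python's re module is used in place of Boxer (an assumption for illustration); it happens to share the (?P<name>...) and (?P=name) syntax for named subpatterns. The subpattern name word and the sample sentence are invented.

```python
import re

text = "Paris in the the spring is is lovely"

# Back reference by number: \1 must repeat whatever subpattern 1 matched.
numbered = re.compile(r"\b(\w+)\s+\1\b")
# The same idea with a named subpattern and a named back reference.
named = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

numbered_hits = [m.group(0) for m in numbered.finditer(text)]
named_hits = [m.group("word") for m in named.finditer(text)]

print(numbered_hits)  # the full repeated phrases
print(named_hits)     # just the word that was repeated
```

Both patterns find the same two repetitions; the named form simply makes it clearer which subpattern the back reference points to.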

Finally, it's worth mentioning that any or all of the expressions presented above can be used within the same regular expression. This artificially complex example:

The Perl-Compatible Regular Expression (PCRE) package used by Boxer was written by Philip Hazel, and is used in accordance with the PCRE license:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of the University of Cambridge nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.