Regular Expressions

 

When searching for a text string using the Find, Replace, Replace Line Enders or Find Text in Disk Files commands, Boxer supports the use of Regular Expressions, a pattern matching grammar first popularized on the Unix operating system. Regular Expressions make it possible to specify a search string which can match many different target strings, or to restrict the ways in which a search string can be matched.

 

Boxer uses Perl-Compatible Regular Expressions as implemented by the increasingly popular PCRE 5.0 library.  See the end of this topic for further information and acknowledgements. 

 

A complete treatment of the topic of regular expressions could--and does--fill an entire book.  Mastering Regular Expressions, by Jeffrey Friedl is one such book, and a good one at that.  This help topic was written to acquaint the typical user with the most common regular expression features, without getting too bogged down in fine details.  The advanced reader is encouraged to seek out additional information on the web, or within the PCRE documentation itself.  We have posted one such reference document on our site for your convenience.

 

Regular Expressions are very powerful, and can be more easily understood by studying several examples.

 

Matching a Single Character

The dot (.) will match any single character, except the newline character.  Example: p.t will match pat, pet, pit, pot, and put, and in fact any 3-character sequence with p and t at its ends and a single character in the middle.

 

Matching with an Asterisk

The asterisk (*) will match zero or more occurrences of the preceding character. Example: zo*m will match zm, zom, zoom and zooooooooom, among others.  Note that the character preceding the asterisk can be the dot, so zero or more occurrences of any character will be matched when the construction .* is used.  Example: Bo.*r will match Boxer, Bowler, Bookmaker, Bookkeeper and Board Member.

 

Matching with a Plus Sign

The plus sign (+) will match one or more occurrences of the preceding character. Example: ho+p will match hop, hoop and hooooooop, among others.  Note that the character preceding the plus sign can be the dot, so that one or more occurrences of any character will be matched when the construction .+ is used.
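The three operators described so far can be tried out in a short sketch. It uses Python's re module rather than Boxer's PCRE engine (the two agree on these basic operators), and the word list and the helper name whole_matches are made up for illustration:

```python
import re

# Candidate words to test each pattern against (illustrative only).
words = ["pat", "pit", "pt", "zm", "zoom", "hp", "hop", "hoop"]

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

dot_hits = whole_matches(r"p.t", words)    # dot: exactly one character
star_hits = whole_matches(r"zo*m", words)  # asterisk: zero or more 'o'
plus_hits = whole_matches(r"ho+p", words)  # plus sign: one or more 'o'

print(dot_hits)   # ['pat', 'pit']
print(star_hits)  # ['zm', 'zoom']
print(plus_hits)  # ['hop', 'hoop']
```

Note that pt and hp are rejected: the dot and the plus sign both require at least one character, while the asterisk accepts none at all.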

 

Patterns that use either * or + can often result in more than one possible matching string. This concept is known as minimal or maximal matching.  You can control whether Boxer will return the shortest or longest matching string using the Maximal matching checkbox on any dialog where regular expressions are permitted.
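The difference between the shortest and longest match can be seen in a small sketch. Python's re module has no checkbox; as an assumption for illustration, a trailing ? on the quantifier stands in for minimal matching here. How Boxer's checkbox is implemented internally is not covered by this topic.

```python
import re

line = "Boxer and Bowler"

# Maximal (greedy) matching: .* consumes as much as it can.
longest = re.search(r"Bo.*r", line).group()

# Minimal (lazy) matching: .*? consumes as little as it can.
shortest = re.search(r"Bo.*?r", line).group()

print(longest)   # Boxer and Bowler
print(shortest)  # Boxer
```

Both results are valid matches for Bo.*r; the setting only decides which one is returned.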

 

Matching at Start of Line

The caret (^) can be used to force a match to occur at the start of a line. Example: ^The will match any line beginning with The.

 

You can also force a start-of-line match using the checkbox provided on the dialog.

 

Matching at End of Line

The dollar sign ($) can be used to force a match to occur at the end of a line. Example: result$ will match any line ending with the word result.

 

You can also force an end-of-line match using the checkbox provided on the dialog.
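As a rough illustration of the two anchors, the following sketch uses Python's re module on a made-up three-line text. An editor searches line by line, so the re.MULTILINE flag is used here to make ^ and $ apply to every line rather than only to the ends of the whole text:

```python
import re

# Hypothetical three-line text.
text = "The first result\nAnother result\nThe end"

# re.MULTILINE lets ^ and $ match at the start and end of each line.
starts_with_the = re.findall(r"^The.*", text, flags=re.MULTILINE)
ends_with_result = re.findall(r".*result$", text, flags=re.MULTILINE)

print(starts_with_the)   # ['The first result', 'The end']
print(ends_with_result)  # ['The first result', 'Another result']
```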

 

Character Classes or Range Expressions

One or more characters can be placed within square brackets to designate the characters which can match in that position.  Example: p[aeiou]t will match pat, pet, pit, pot and put.  Note that digits are also characters, so an expression such as 201[1234] will match any of 2011, 2012, 2013 or 2014.

 

Characters can also be placed within square brackets with a dash between them to designate a range of characters.  Example: [b-d]ent will match bent, cent and dent because the expression [b-d] is shorthand for all characters in that range.  The character range can be entered in ascending or descending order; both [A-Z] and [Z-A] are allowed and are functionally equivalent.

 

The character set appearing within square brackets can be negated by using the caret (^) as the first character within the opening square bracket.  Example: [^cb]ent will match tent, rent, sent, dent and others, but not cent or bent.  The caret can also be applied to negate a character range within square brackets: [^a-e] will match all characters except a, b, c, d and e.  If the caret appears anywhere else within the range expression, its meaning reverts to that of matching the caret itself.
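The character class examples above can be checked with a brief sketch in Python's re module (not Boxer's own engine). The word list is illustrative. One difference worth knowing: Python rejects a descending range such as [Z-A], so only ascending ranges are used here.

```python
import re

words = ["bent", "cent", "dent", "rent", "tent"]

# [b-d] is shorthand for the characters b, c and d.
in_range = [w for w in words if re.fullmatch(r"[b-d]ent", w)]

# A leading caret negates the set: anything except c or b.
negated = [w for w in words if re.fullmatch(r"[^cb]ent", w)]

# Digits are characters too, so they can appear in a class.
years = re.findall(r"201[1234]", "2010 2011 2013 2015")

print(in_range)  # ['bent', 'cent', 'dent']
print(negated)   # ['dent', 'rent', 'tent']
print(years)     # ['2011', '2013']
```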

 

Matching Multiple Strings

The vertical rule (|) can be used to separate two or more regular expressions so that any of the patterns will match.  Example: red|green|blue|yellow will match any of the color names that are separated by the vertical rules.

 

Subpatterns

Left and right parentheses can be used to start and end a subpattern.  Example: c(ar|en|oun)t will match cart, cent and count.  In absence of the parentheses, car|en|ount would match car, en or ount... a very different result.
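The effect of the parentheses is easy to confirm with a small sketch, written here in Python's re module as a stand-in for Boxer's search dialog. The candidate words are chosen only for illustration:

```python
import re

grouped = re.compile(r"c(ar|en|oun)t")   # alternation confined to the middle
ungrouped = re.compile(r"car|en|ount")   # alternation spans the whole pattern

grouped_hits = [w for w in ["cart", "cent", "count", "car"]
                if grouped.fullmatch(w)]
ungrouped_hits = [w for w in ["cart", "car", "en", "ount"]
                  if ungrouped.fullmatch(w)]

print(grouped_hits)    # ['cart', 'cent', 'count']
print(ungrouped_hits)  # ['car', 'en', 'ount']
```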

 

Escape Character

The backslash can be used to remove significance from a pattern matching character.  Example: if you need to search for an asterisk, use \*. To search for a dot, use \..  To search for a plus sign, use \+.  To search for the backslash itself, use \\.

 

You can also remove significance from pattern matching characters by placing them inside a range expression.  For example, [*+] could be used to match either an asterisk or a plus sign.
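A short sketch of both escaping techniques follows, using Python's re module. The sample text is made up, and re.escape is a Python convenience function with no counterpart described in this topic; it is shown only as one way to escape a whole literal string at once.

```python
import re

# Hypothetical sample; the raw string contains a single backslash.
text = r"2*3+4.5 in C:\temp"

stars = re.findall(r"\*", text)        # escaped asterisk
pluses = re.findall(r"\+", text)       # escaped plus sign
dots = re.findall(r"\.", text)         # escaped dot
backslashes = re.findall(r"\\", text)  # escaped backslash
operators = re.findall(r"[*+]", text)  # no escaping needed inside brackets

print(stars, pluses, dots)  # ['*'] ['+'] ['.']
print(len(backslashes))     # 1
print(operators)            # ['*', '+']

# re.escape builds an escaped pattern from a literal string.
literal = "1+1=2"
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None
print(escaped_ok)           # True
```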

 

Matching Whole Words

To force a pattern to find only those occurrences of a search string which appear as whole words, the pattern can be surrounded with a sequence that forces a match at a word boundary.  Example: to find the word sign, but not words such as assign, signature or assignment, use \bsign\b.

 

You can also force a whole word match using the checkbox provided on the dialog.
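The effect of the word boundary sequence can be sketched as follows, again using Python's re module and a made-up sentence:

```python
import re

text = "Please sign the assignment; your signature is a good sign."

# Without boundaries, 'sign' is also found inside longer words.
anywhere_count = len(re.findall(r"sign", text))

# With \b on both sides, only the standalone word is found.
whole_word_count = len(re.findall(r"\bsign\b", text))

print(anywhere_count)    # 4
print(whole_word_count)  # 2
```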

 

Matching Special Characters

Several characters that are not readily typed from the keyboard can be matched using special character sequences:

 

       \\                match a backslash character
       \a                match a bell (alarm) character (ASCII 7)
       \b                match a backspace character (ASCII 8) (only if used in a character class)
       \cx               match character Control-x (x = any character)
       \e                match an escape character (ASCII 27)
       \f                match a formfeed character (ASCII 12)
       \n                match a newline character (ASCII 10)
       \r                match a carriage return character (ASCII 13)
       \t                match a tab character (ASCII 9)
       \ddd              match octal character ddd (d = any digit 0-7)
       \xhh              match hexadecimal character hh (h = any hex digit)

 

 

Generic Character Types

There are several convenient shorthand sequences for matching common character classes:

 

       \d                match a decimal digit (0-9), equivalent to: [0-9]
       \D                match any character except a decimal digit, equivalent to: [^0-9]
       \s                match any whitespace character, equivalent to: [\t\n\f\r ]
       \S                match any character except whitespace, equivalent to: [^\t\n\f\r ]
       \w                match any word character, equivalent to: [_a-zA-Z0-9]
       \W                match any character except a word character, equivalent to: [^_a-zA-Z0-9]

 

A word character is considered to be any letter, digit or underscore.  No consideration is made for accented characters that reside above value 128 in the character set.  If you require such characters in a pattern, you'll need to name these characters explicitly, perhaps in a range expression that also uses \w.
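The shorthand sequences, and the advice about accented characters, can be illustrated with a sketch in Python's re module. This is an approximation of the behavior described above: Python treats accented letters as word characters by default, so the re.ASCII flag is used to mimic the ASCII-only definition given here. The sample text is made up.

```python
import re

# Hypothetical sample text; \u00e9 is the accented letter e-acute.
sample = "Order 66: caf\u00e9_au_lait\t(2 cups)"

# re.ASCII restricts \d, \w and \s to the ASCII ranges described above.
digits = re.findall(r"\d+", sample, flags=re.ASCII)
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# Naming the accented character explicitly alongside \w keeps the word whole.
extended_words = re.findall(r"[\w\u00e9]+", sample, flags=re.ASCII)

# \s+ splits on any run of whitespace, here a space or a tab.
fields = re.split(r"\s+", sample)

print(digits)          # ['66', '2']
print(ascii_words)     # the accented letter splits 'caf' from '_au_lait'
print(extended_words)  # the accented word stays in one piece
print(len(fields))     # 5
```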

 

Assertions

The following sequences can be used to force a match to occur only at a required position:

 

       \b                match at a word boundary
       \B                match when not at a word boundary
       \A                match at start of subject
       \Z                match at end of subject, or before a newline at the end of the subject
       \z                match at end of subject
       \G                match at first matching position in subject

 

 

Useful Constructions

The following examples illustrate some common constructions, and give examples of the utility--and complexity--of some advanced regular expressions:

 

       .*                match zero or more occurrences of any character
       .+                match one or more occurrences of any character
       ^$                match an empty line
       ^\s+$             match a line containing only whitespace
       ^\s+              match leading whitespace
       \s+$              match trailing whitespace
       [a-zA-Z]          match any alphabetic character
       this|that         match 'this' or 'that'
       \b(\w+)\s+\1\b    match repeated words (such as 'the the')
       \b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b  match a typical email address (when the search is not case sensitive)
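A few of these constructions are sketched below in Python's re module. The lines are processed one at a time, much as an editor would search them. The sample lines and the email addresses are invented, and re.IGNORECASE is used because the email pattern names only uppercase letters.

```python
import re

# Hypothetical lines of a file, without their line enders.
lines = ["alpha  ", "", "   ", "\tbeta"]

empty_lines = [i for i, line in enumerate(lines) if re.search(r"^$", line)]
blank_lines = [i for i, line in enumerate(lines) if re.search(r"^\s+$", line)]

# Remove leading and trailing whitespace in one pass.
trimmed = [re.sub(r"^\s+|\s+$", "", line) for line in lines]

# The email pattern from the table, searched without regard to case.
email = re.compile(r"\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b",
                   re.IGNORECASE)
found = email.findall("Write to jane.doe@example.com or bob@test")

print(empty_lines)  # [1]
print(blank_lines)  # [2]
print(trimmed)      # ['alpha', '', '', 'beta']
print(found)        # ['jane.doe@example.com']
```

The second address is not found because it lacks a dot followed by a two to four letter suffix, which the pattern requires.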

 

 

Min/Max Quantifiers

A min/max quantifier can be used to control how many instances of the preceding entity are to be allowed within a match.  The syntax for min/max quantifiers is summarized in this table:

 

       {                 start a min/max quantifier
       }                 end a min/max quantifier
       {3}               match exactly 3 of the previous item
       {3,}              match at least 3 of the previous item
       {3,5}             match at least 3, but no more than 5 of the previous item

 

Example: the pattern [abc]{4,8} would match a sequence of characters consisting of the letters a, b or c, so long as at least 4 characters are present, and no more than 8 appear.  Potential matches: aaaa, accb, abcabc, bbbbcccc.  Non matches: aaa, abcd, and abcabcabc as a whole (its first 8 characters would still match on their own).

 

Careful readers might observe that * is effectively shorthand for {0,} and + is shorthand for {1,}.
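The quantifier example can be verified with a short sketch in Python's re module (used here in place of Boxer's engine). It tests each candidate as a whole, then shows what an ordinary search finds inside the nine-character candidate:

```python
import re

pattern = re.compile(r"[abc]{4,8}")
candidates = ["aaaa", "accb", "abcabc", "bbbbcccc", "aaa", "abcd", "abcabcabc"]

# Candidates matched from first character to last.
whole = [c for c in candidates if pattern.fullmatch(c)]

# A plain search still finds the first 8 characters of the longest one.
partial = pattern.search("abcabcabc").group()

# * behaves like {0,} and + behaves like {1,}.
star_equivalent = re.fullmatch(r"zo{0,}m", "zm") is not None
plus_equivalent = re.fullmatch(r"ho{1,}p", "hp") is None

print(whole)    # ['aaaa', 'accb', 'abcabc', 'bbbbcccc']
print(partial)  # abcabcab
print(star_equivalent, plus_equivalent)  # True True
```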

 

Back References and Named Subpatterns

One of the more powerful features of Perl regular expressions is the ability to make reference within a pattern to the string that matched a subpattern which occurred earlier in the pattern.  Subpatterns are created when a portion of a pattern is enclosed in left and right parentheses.  The first opening left parenthesis encountered starts a subpattern whose number is 1.  The second left parenthesis creates subpattern 2, and so on.  To make a back reference to a subpattern by number, this syntax is used:

 

       \1                back reference to subpattern number 1

 

Referring to subpatterns by number can get confusing when a complex regular expression is being created.  For this reason, named subpatterns are also permitted.  To start a subpattern named 'foo', the following syntax would be used:

 

       (?P<foo>        start a subpattern named 'foo'

 

Later on in the pattern, the string that matched subpattern 'foo' could be referenced using this syntax:

 

       (?P=foo)        back reference to the subpattern named 'foo'

 

The example presented above that matches repeated words uses a back reference:

 

       \b(\w+)\s+\1\b

 

The subpattern (\w+) matches any string that contains one or more word characters.  In order for the entire pattern to match, that same string must appear again (due to the \1 reference) with one or more whitespace characters (\s+) in between.  Finally, the \b sequences at each end ensure that the pattern matches only at a word boundary.

 

Named subpattern references can also be used in the replace string of the Replace and Replace Line Enders commands, and with the ChangeStringRE() macro function.
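Numbered and named back references can be compared in a brief sketch. Python's re module accepts the same (?P<name>...) and (?P=name) syntax shown above. The \g<word> form used in the replacement string is Python's own notation and is only an assumption for illustration; the syntax Boxer expects in its replace string is not given in this topic. The sample sentence and the subpattern name word are made up.

```python
import re

text = "Paris in the the spring, and this is it"

numbered = re.compile(r"\b(\w+)\s+\1\b")
named = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

# Both patterns find the same doubled word.
numbered_hit = numbered.search(text).group()
named_word = named.search(text).group("word")

# Replace each doubled word with a single copy of it.
fixed = named.sub(r"\g<word>", text)

print(numbered_hit)  # the the
print(named_word)    # the
print(fixed)         # Paris in the spring, and this is it
```

Note that "this is" is not reported: the \b at the start prevents the pattern from matching the "is" inside "this".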

 

Closing Example

Finally, it's worth mentioning that any or all of the expressions presented above can be used within the same regular expression.  This artificially complex example:

 

       ^The\sq[^a]ic{1}k.*f[aeiou]x.*ov[a-e]r.*lazy\040dog\.$

 

would match the sentence:

 

       The quick brown fox jumped over the lazy dog.

 

if it appeared on a single line.
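The closing example can be confirmed with a minimal check in Python's re module, which interprets every element of this particular pattern, including the octal escape \040 for a space, in the same way. The second test string is an invented variation that lacks the final period.

```python
import re

pattern = re.compile(
    r"^The\sq[^a]ic{1}k.*f[aeiou]x.*ov[a-e]r.*lazy\040dog\.$"
)

sentence = "The quick brown fox jumped over the lazy dog."
no_period = "The quick brown fox jumped over the lazy dog"

matches_sentence = pattern.search(sentence) is not None
matches_no_period = pattern.search(no_period) is not None

print(matches_sentence)   # True
print(matches_no_period)  # False
```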

 

PCRE 5.0 License

The Perl-Compatible Regular Expression (PCRE) package used by Boxer was written by Philip Hazel, and is used in accordance with the PCRE license:

 

Copyright (c) 1997-2004 University of Cambridge

All rights reserved.

 

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

 

   * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

 

   * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

 

   * Neither the name of the University of Cambridge nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

 

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.