Regular Expressions
Fundamental Concepts of Text Pattern Matching
Regular expressions are everywhere — if you
know where to look. They are a powerful and versatile way to extract
information and manipulate text. This functionality is included in most
good programmer's editors, dedicated command line utilities like
grep
, and even word processors. Yet we still find that many
programmers have little or no skills in this area. This is an
introduction to the basic concepts and syntax of regular
expressions.
PREREQUISITES — You should already…
- have some programming or scripting experience, in a language like C++, C#, Java, JavaScript, PHP, Perl, Python, AWK, Bash.
Introduction
Here is a high-level overview of the background and concepts of regular expressions.
Overview
Historically, regular expression is a term
for a branch of mathematics (regular language). Today, however,
the term generally refers to a “pattern matching” language. Regular
expressions are included in many tools and libraries (often with names
abbreviated as RegEx
, Regex
,
regex
, or just re
). They are commonly used in
programming, and are very useful for command line text manipulation.
A regular expression language is not a general purpose language, unlike Java or C#. The ability to use regular expressions, however, is usually available as a library for such a programming language, or even built into the syntax of the language (JavaScript, Perl, AWK).
Some of the oldest and most-used implementations are found in
grep
, sed
and other venerable and mature
command line tools.
Basic Concepts
A regular expression is a pattern, or sequence, of characters. A regular expression language is a set of syntax rules for describing or creating these patterns.
We will use the term regex
in this document to refer to
a regular expression.
The simplest pattern is a literal string. The regex processing
software will then look for a match i.e. a sequence of
characters that corresponds to that pattern. A very basic implementation
of this is found in word processors: for example, you can search in a
document for all occurrences of the pattern cat
. The word
processor will find matches in words such as catch
,
catastrophe
, bobcats
etc.
Matching
The most common use case is to test whether an input string matches
some regex, i.e., true
, or false
. The match is
by default conducted as a case-sensitive, sliding algorithm — it starts
from the left and repeatedly attempts a match, moving on to the next
character, if it doesn't, until the end of the string. This sliding
“window” can match anywhere inside the search string, unless it is anchored (discussed later in this document).
A regular expression is traditionally line-oriented, but it does not have to be. By line-oriented, we mean that it will search for a match that is fully contained on a single line, but not for a match that starts on one line and ends on another. In a programming language, a string to match may contain newlines, but options are provided to perform matches across the newlines.
It is important to understand that for a regex to “match” successfully, some part of the string must satisfy the regex condition. For string manipulation, we often need to know exactly what part of the string matched. Consequently, all implementations will provide some mechanism to retrieve the matched part.
Compilation
Most implementations compile regular expressions, before they are applied to test if strings match. The compiled form is not machine code; it is just a structure that allows for faster matching. This can often be saved in an object or variable, and applied directly to test for matches. Compiled regular expressions are more efficient, especially when they are to be used several times.
Syntax
Regular expression syntax is relatively simple, even though it does not resemble a regular programming language. It does have the equivalent of keywords, operators with precedence, and grouping.
Special Characters
The “keywords” of the regular expression language are not keywords in the usual sense — simply, certain characters are treated as special, in that they have a special meaning. They serve the same purpose as keywords in more typical programming languages, and are more commonly referred to as metacharacters.
The special, or meta, characters are much the same across all implementations. The typical special characters are:
. [ ] ( ) { } ^ * + ? # \ $
Note that your language or tool may provide more, or fewer, metacharacters.
Escape Sequences
To mark a special character as non-special, it must be
escaped with a backslash (\
).
Your implementation may require that the regex is enclosed in a native string literal, specific to the language or tool.
The backslash character itself, must be escaped with another
backslash (\\
). The language may also use backslash as an
escape sequence (C, C++, Java, PHP, C#, etc.). In this case, you may
need to type four backslashes, so that 2 will be stored in memory for
use by the regex library. Or you can use your language's version of a
‘verbatim string‘ (‘raw string’ in Python), where backslashes have no
special meaning. In a POSIX shell, use single
quote delimited strings (also PHP, while Perl has special regex
delimiters). In C#, you can use @"…"
. In Python, use
R"…"
or r"…"
.
Some special characters like tab (\t
), newline
(\n
) and carriage return (\r
) can be
represented with the escape sequences shown.
The circumflex, or caret (^
), is special only when it is
used as an anchor or to negate a character class, both of which are
described later in this document.
Structure
In general, a regex will have three components:
- Character classes are used to match one or more characters.
- Quantifiers specify how many times a character class may be repeated.
- Anchors specify the position of the pattern in relation to a line of text.
We will look at all of these in more detail.
Character Classes
A fundamental concept is that of an expression that matches a single character. A character class means: “match one character”.
Effectively, any non-special character means “match this
character here”. So the regex ABC
means “match one
A
, followed by one B
, followed by one
C
, anywhere inside a string”.
General Character Class
Instead of only one possibility like the ABC
example
above, however, we can specify a set of potential matches. The
most general character class is the period (.
), which means
“match any character, except a newline”.
User-Defined Character Classes
For more flexibility, we can define the set of matches. A user-defined character class is created by specifying the potential matches between square brackets. For example:
[ABC]
means: “match either one A
,
or one B
, or one C
”.
[ABC][123]
means: “match one of either
A
, or B
, or C
; followed by one of
either 1
, 2
, or 3
”.
The first regex only matches one character. The second regex matches
two characters and would have 9 possible matches, including
A3
, B1
and C2
.
Ranges
Character classes may form a range, by separating the start and end
characters with a minus sign (-
):
[A-Z]
means “match any uppercase alphabetic
character”.
[A-Za-z]
means “match any alphabetic character”.
[0-9]
means “match any decimal digit”.
If the minus sign does not appear between two increasingly larger characters (based on ordinals), it will be treated as just another character that's eligible for matching. Thus:
[B-E]
means “match any uppercase alphabetic character in
the range from B to E”.
[E-B]
means “match one of either E
or
-
or B
”.
Negation
When the circumflex or caret (^
) appears as the
first character in a character class, it inverses, or negates,
the meaning of the character class.
[^A-Z]
means “match any character except
uppercase alphabetic characters”.
[^A-Za-z]
means “match any non-alphabetic
character”.
[^0-9]
means “match any non-digit character”.
Pre-defined Character Classes
Most implementations provide escape sequences for some common character classes:
\d
means “match a decimal digit”, and is shorthand for:
[0-9]
.
\D
means “match any character, but a
decimal digit”. Shorthand for: [^0-9]
.
\s
means “match any white space character” (tabs, newlines,
spaces).
\S
means “match any non-white space character”.
\w
means “match any word character”. Shorthand
for: [_0-9a-zA-Z]
.
\W
means “match any non-word character”. Shorthand
for: [^_0-9a-zA-Z]
.
Your implementation, language or library, may provide more convenient escape sequences.
POSIX Character Classes
In POSIX compliant systems, many regular
expression engines have named character classes in the form:
:
class:
, to better
accomodate Unicode and internationalisation. Possible values for class are:
alnum alpha ascii blank cntrl digit graph lower
print punct space upper word xdigit
Alternation and Grouping
A set of parentheses can be used to group a sequence of characters or other pattern.
The pipe (|
) can be used to specify alternates, i.e.
“either-or” matches.
We can combine groups and alternates in a number of useful ways:
AB|C
means “match one A
; followed by
either one B
or one
C
”.
(AB)|C
means “match either AB
,
or C
.
(AB)|(CD)
means “match either AB
,
or CD
.
Quantifiers
More powerful patterns will express the concept of how many (quantity). A quantifier is effectively a postfix operator — it operates on the character or expression that appears before it.
0 to 1 (Optional)
The question mark (?
) means “match 0
to
1
times”. It applies to the character or group before it,
so:
AB?C
means “match an A
; optionally followed
by a B
; followed by one C
”.
A.?C
means “match an A
; optionally followed by
any character; followed by one C
”.
(AB)?C
means “match optional AB
; followed by
one C
”.
\d?A
means “match one optional decimal digit; followed by
one A
”.
0 to Many, 1 to Many
By Many
, we effectively mean “as many as the
implementation allows”. The upper limit is unbounded.
To match 0
-Many
times, use the asterisk
(*
):
AB*C
means “match one A
; followed by zero
or more B
s; followed by one C
.”
(AB)*C
means “match zero or more AB
sequences;
followed by one C
.”
A.*C
means “match one A
; followed by zero or
more of any character; followed by one C
”.
To match 1
-Many
times, use the plus sign
(+
):
AB+C
means “match one A
; followed by one or
more B
s; followed by one C
.”
(AB)+C
means “match one or more AB
sequences;
followed by one C
.”
A.+C
means “match one A
; followed by one or
more of any character; followed by one C
”.
Any to Any
When you need more flexibility, you can explicitly specify the
minimum and maximum quantities. You do this by specifying the quantity
in curly braces, and separating the minimum and maximum values with a
comma. When the minimum and maximum values are equal i.e. you want an
exact quantity, only specify the first value, without a comma. If the
value before the comma is left out, it defaults to 0
. If
the value after the comma is left out, it defaults to
Many
.
Like the other quantifiers, this applies to the element before it.
A{,}B
means “match one A
, repeated
0
to Many
times; followed by one
B
.
A{3}B
means “match exactly 3
A
s;
followed by one B
.
A{,3}B
means “match 0
to 3
A
s; followed by one B
.
A{3,}B
means “match 3
or more A
s;
followed by one B
.
A{3,5}B
means “match 3
, or 4
, or
5
A
s; followed by one B
.
Greedy vs Non-greedy
The quantifiers +
and *
are greedy
— they match the longest sequence that still satisfies the match. For
example, when the regex: <e>.*</e>
, is applied
to the string:
This <e>is</e> a <e>very</e> stupid sentence.
it will match:
<e>is</e> a <e>very</e>
You can create a non-greedy version of the greedy quantifiers by
appending a question mark (?
) to either *
or
+
, e.g., A*?B
or A+?B
. As
example, the regex: <e>.*?</e>
when applied to
the above string, it will match:
<e>is</e>
The question mark (?
), although a quantifier, when
suffixed to an expression, does not involve “greediness”, since it has a
fixed maximum (1
) match value. When appending it to the
unlimited upper bound quantifiers however, it changes their
behaviour.
Anchors
You can anchor a regex to prevent it from matching anywhere in a search string. An anchor never matches a character. It is logically a position or location. If, by analogy, one treats a string as a sequence of physical blocks, the beginning of the sequence is the left edge of the first block. This analogy also illustrates the concept of “between characters”.
Beginning or End of String
The caret, when used at the beginning of the regex, means “anchor at the start”.
The dollar sign, $
, when used at the end of a regex,
means “anchor at the end”.
^ABC
means “match ABC
at the start
of the string only” (begins with ABC
).
ABC$
means “match ABC
at the end of
the string only” (ends with ABC
).
^ABC$
means “match ABC
at the start
and the end of the string (equality)”.
The last example is a rather expensive way to test for equality. Anchors do not match any characters — they match positions.
The circumflex (caret) or dollar signs are not treated as anchors, or special characters in any way, when they appear anywhere else in the regex.
Word Boundary
A word boundary is a special kind of anchor, which matches the start
or end of a sequence of word characters. Boundaries occur at
the beginning of a string, at the end of a string, before and after
punctuation, and before and after white space. A word boundary anchor is
specified with: \b
. The converse also exist: a non-word
boundary, specified with \B
. Thus:
\bABC\b
means “match the word ABC
”
(but will not match part of a word like ABCDEF
).
Back References
The grouping parentheses has an additional effect — every set
“remembers” what was matched. Each parentheses set stores its matches in
sequentially numbered memory locations, starting at 1
. To
“retrieve” a remembered value, you can refer back to it with
\‹
number
›
.
(.)\1
means “match any character, and remember in memory
location 1
; then match a copy of the character previously
matched and stored in memory location 1
”.
The above example will therefor match twins of any character. Another example:
(.)(.)\2\1
means “match any single character & store
in memory location 1
; followed by matching any single
character & store in memory location 2
; followed by
matching what is stored in memory location 2
; followed by
matching what is stored in memory location 1
.
The above example will match 4-letter palindromes, like
abba
, noon
, deed
, etc.
Conclusion
Apart from simple matches, various libraries provide functionality to perform tasks such as: search and replace; find all matches as a collection; return the location of a match; split strings based on regular expressions; name parentheses captures so that back references can use names instead of numbers, and more. The newer libraries also provide negative and positive look-ahead.
Many offline or online, commercial or free, regular expression testers and debuggers are available. Some are specific to one or more programming languages, others use a PCRE library.
Just about every programmer, or administrator, can use regular expressions to be more productive. If you are new to regular expressions, we hope this will help you to understand the concepts.
PCRE
You may see references to PCRE (Perl Compatible Regular Expressions) when reading regex documentation. The Perl regex library is very powerful and this term refers to libraries that follow the Perl implementation and syntax.
2022-03-03: Fix history message & add reference to Python's raw
strings. [brx]
2020-04-22: Added missing closing quotation marks. [brx]
2019-03-07: Added note about POSIX named
character classes. [brx]
2017-11-19: Update to new admonitions. [brx]
2017-10-04: Edited. [jjc]
2016-06-11: Created. Edited. [brx;jjc]