Introductory Programming in Python: Lesson 19
Regular Expressions

[Prev: Files for Input and Output] [Course Outline] [Next: Basic Parsing]

Patterns Defining Sets of Strings

Take the sentence "To collect your e-ticket you will need your ID document and/or your passport.". Congratulations, you've just seen and understood a regular expression. A regular expression is simply a pattern that defines a set of strings. Any string that fits the specified pattern is said to match the pattern, i.e. both "To collect your e-ticket you will need your ID document or your passport." and "To collect your e-ticket you will need your ID Document and your passport" match the pattern specified.

We can specify patterns in natural languages, e.g. The word 'apples', followed by a comma, and then either the word 'banana' or 'beer', optionally followed by a comma, followed any number (including zero) of the word 'chips'. Intuitively we understand that all of the following strings, among others, match this pattern

"apples,banana"
"apples,beer,chips"
"apples,banana,chips"
"apples,beerchipschipschips"

Already we can see that our pattern could expressed more concisely, and more importantly more accurately, using a few symbols, e.g. "apples,banana/beer[,][chips][...]". This is still somewhat ambiguous. And here we get to the point where regular expressions become useful. The difference between a regular expression and a simple pattern specification is the fact that the regular expression uses a well defined standardised grammar...

Regular Expression Syntax

Regular expressions have been around since the forties, and as such their syntax has evolved over time. In formal mathematics the syntax is very different from the practical syntaxes used in many computer languages. The most common form of regular expression syntax today is based on that used in the PERL language, and is known as PCRE (PERL Compatible Regular Expression) syntax. Originally regular expressions were developed to describe programming languages by defining the complete set of valid programs, however they were soon appropriated for the purposes of more general pattern matching. Python uses a slight variant of PCRE syntax.

Regular expressions are exactly that, expressions, meaning they can be composed using various operators from simpler regular expressions, so let's deal with the simplest regular expressions first ...

The simplest regular expression is a single character. This expression matches that character and only that character, e.g. the regular expression a matches "a" and only "a". Regular expressions are of course case sensitive. This in and of itself isn't very useful; even string equivalence tests are more advanced than this, so the next thing we want to be able to do is to extend the length of the pattern. The concatenation operator of regular expressions is implicit, meaning the pattern hello matches "hello", and is the concatenation of 5 separate regular expressions, namely h, e, l, l, and o.

So far so good, as far as understanding goes, but usefulness we still find lacking. So let's introduce a few special regular expressions that do stuff a little more fun.

.: Matches any single character except "\n", e.g. "a", "A", "?", ".", "z", "7", etc... all match.
^: Matches the beginning of the line or string, e.g. ^abc matches "abc" at the beginning of a line or string only.
$: Matches the end of the line or string, e.g. xyz$ matches "xyz" only at the end of a line or string.
[<character list>]: Matches any single character in the character list, e.g. [abc] matches "a", or "b", or "c". The inversion of the character list can be obtained using the caret ('^') in the first position of the character list, e.g. [^abc] matches any single character except "a", "b", or "c". The caret is obviously not included in the character list. If you wish to include a caret in a character list, place it in some position other than the first, e.g. [ab^], or for an inversion [^ab^]. In addition all characters in a range can be specified in a character list using the '-' character, e.g. [a-z] includes all the lowercase letters. If you wish to include the '-' character in the list, put it last.
\d: Matches any decimal digit; Equivalent to [0-9]
\w: Matches any 'word' character, i.e. alphanumeric and the '_' character; Equivalent to [a-zA-z0-9_]
\s: Matches any white space character; Equivalent to [ \t\n\r\v\f]

Right, that's more like it! We can now start doing some interesting stuff, e.g. ^Mary had a ...... [Ll]amb$ would match "Mary had a little lamb", "Mary had a little Lamb", and "Mary had a large_ lamb" amongst others. "Poem: Mary had a little lamb" would not match. Nor would "Mary had a little lamb.". Even searching for the regular expression within a larger string, the latter two examples would fail, because of the "^" and the "$", specifying essentially that the whole string must match.

The true power tools of the regular expression world, however, are the repetition operators.

?: Matches if the preceding regular expression occurs zero or one times in a row, e.g. a? matches "" or "a".
*: Matches if the preceding regular expression occurs zero or more times in a row, e.g. a* matches "", "a", "aa", "aaa", ...
+: Matches if the preceding regular expression occurs one or more times in a row, e.g. a+ matches "a", "aa", "aaa", ...
{<m>}: Matches if the preceding regular expression occurs exactly m times in a row, e.g. a{7} matches "aaaaaaa".
{<m>,<n>}: Matches if the preceding regular expression occurs at least m times and no more than n times in a row, e.g. a{3,7} matches "aaa", "aaaa", "aaaaa", "aaaaaa", and "aaaaaaa".

And or course we need a way to specify alternatives of more than one character in length.

<a>|<b>: Matches either the regular expression a or the regular expression b.

Just as with arithmetic expressions, regular expressions have a precedence of operators, listed below from highest precedence to lowest precedence

repetition operators
alternative operator
concatenation operator

The practical upshot of this is that if we want to specify alternative words, e.g. ID document or passport, we must parenthesize them so the concatenation operators override the alternative operator, e.g. (ID Document)|(passport) is what we want. ID Document|passport matches "ID Documentassport" or "ID Documenpassport".

Of course, since we are now using various characters to indicate operators, we come across the familiar problem of matching those characters themselves. The, by now familiar, solution is of course to escape those characters using '\', to remove their special meaning. As a result '\\' is used to match a backslash.

The outline of operators listed here is by no means complete, but it does cover most of the PCRE standard. Python includes a number of python specific extensions to the standard, which can be explored at leisure in the python docs.

Greedy Versus Non-greedy

When using a repetition operator in a regular expression, we introduce an ambiguity, which is somewhat ... distasteful. The purpose of regular expressions was to eliminate ambiguities, and to formally define grammars. This ambiguity is best illustrated by example. Suppose we had a regular expression a[ab]*ab[ab]*. Checking the pattern against a given string "abbabbabbaab", we suddenly want to know which sub-expressions match which bits of the string. Let's make the sub-expressions more obvious with colours.

a[ab]*ab[ab]*
Some possible match methods in our given string ("abbabbabbaab") are

abbabbabbaab
abbabbabbaab
abbabbabbaab

The question is which of these matches is the correct one. By definition, regular expressions are considered greedy. Greedy means that a particular regular expression (or sub-expression) will match as much as it possibly can, whilst still allowing the possibility of an overall match. Which means in our example, the red expression will favour match set 3, where the amount matched by red is largest. In the case where multiple repetition operators compete, the left most operator takes precedence.

Python gives us the ability to specify that a particular repetition operator should be non-greedy (i.e. favour match set 1), by suffixing the repetition operator with a '?' character. So

a[ab]*?ab[ab]*
has the following match set
abbabbabbaab

Match set 2 from the original list is left out of the equation, as regular expression syntax doesn't give us the opportunity to choose arbitrary levels of greediness. Given a longer string of 'a's and 'b's there would most likely be many possible match sets between least greedy and most greedy, and it would not be possible to enumerate these in advance, thus making it impossible to reference them, thus making it impossible to specify them in a regular expression.

The re Module

Python's regular expression tools are provided in the standard module "re". So the first thing we would normally do would be to import it.

Python 2.4.3 (#1, Oct	2 2006, 21:50:13) 
[GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>>

Because of the complexity of dealing with regular expressions, it often behoves us to compile the regular expression prior to use. If the regular expression is only going to be used once, we can safely ignore compilation, but using the same regular expression more than once begs compilation. We can compile a regular expression by using

>>> regexpobj = re.compile("a[ab]*ab[ab]*")
>>>

The compile function returns a regular expression object. Regular expression objects have the following methods

<regular expression object>.match(<string>) returns a match object if the entire string matches the pattern compiled into the regular expression object, otherwise returns None.
<regular expression object>.search(<string>) returns a match object if a portion of the string matches the pattern compiled into the regular expression object, otherwise return None.
<regular expression object>.findall(<string>) returns a list of all non-overlapping portions of the string that match the pattern compiled into the regular expression object. If no portions match, an empty list is returned. Note however that empty matches are returned, i.e. a* would return '' in between two 'b's ("bb") because technically a has appeared zero times which matches the 'zero or more' repetition.
<regular expression object>.sub(<replacement>,<string>) returns the string with every non-overlapping occurrence of the pattern compiled into the regular expression object (counted from the left) replaced by the replacement string. There's a lot more to this method, so read the python docs when you are ready.

For one-shot uses of regular expressions, the re module provides each of these methods as a function in the module for our convenience, where an additional parameter is expected (inserted in the first position) which is the regular expression. This regular expression is compiled, then has the appropriate method called on the given parameters, and the result is returned, e.g. re.match("a[ab]*ab[ab]*","abbabbabbaab").

Using Groups

re.match and re.search both return match objects. Their returning something other than None, means a successful match was made, but there is even more to the match object, and to regular expressions. In PCRE syntax, the brackets affect not only precedence, but perform a grouping function too. Whilst a regular expression is being evaluated over as string, when ever an open round bracket ('(') is encountered, a new group is formed, into which are placed all succeeding matching characters. Groups within groups are allowed, and outer groups have the characters matched within inner groups appended too. Once again, examples are probably easier to grasp. Given the regular expression ((a)b(c)) and the string "abc", we obtain three groups, numbered 1 through 3. Group 1 contains "abc", being everything matched between the first open bracket and it's paired closing bracket. Group 2 contains only "a", being everything matched between the second open bracket and its paired close bracket. Similarly, group 3, contains only "b", from the third pair of brackets.

>>> r = re.compile("((a)b(c))")
>>> m = r.match("abc")
>>> m.groups()
('abc', 'a', 'c')
>>> m.group(1)
'abc'
>>> m.group(2)
'a'
>>> m.group(3)
'c'
>>>

There are two uses for groups. Firstly, within a regular expression we can refer to groups already encountered, using \<number> (where number is the number of a group from 1 to 99). So the regular expression a(.)c\1 translates to: an "a", followed any character 'x', followed by a "c", followed by 'x' again.

>>> m = re.match("a(.)c\\1","abcb")
>>> m
>>> m.groups()
('b',)
>>> m.group(0)
'abcb'
>>> m = re.match("a(.)c\\1","abcd")
>>> print m
None
>>>

The second use for groups, is that they allow us in python to match lines of text, and extract sub-sections of those lines out using groups, in a single function call. We can access the groups of a match object using the following syntax

<match object>.groups() returns a tuple containing the groups as strings in the order they were in the regular expression. Group order is defined as the order in which their open brackets are encountered from left to right.
<match object>.group(<number>) returns the string of the group numbered 'number'. If 'number' is 0, then the entire match is returned.