Regular Expressions

The basics

A regular expression is a pattern that is searched for in a string. Read the Regular expressions HOWTO and, if you are hungry for more, you can find more detail in the Python documentation of regular expressions. This page just briefly summarises the basics and adds some exercises.

You can build arbitrarily complex and powerful regular expression patterns, but for now let us just look for the simplest possible pattern, a sequence of characters like "the " (notice the trailing space character). Python keeps its regular expression tools in the re module, so before you can do anything you need to import this module. Then you can compile the regular expression:

import re

regexp = re.compile(r"the ")

The regular expression "the " is now compiled and stored in the variable regexp. regexp is an object (like, e.g., a file object), and just like other objects it comes with methods that let us interact with its contents. Remember the 'r' right in front of the string specifying the regular expression (no space between: r"). This tells Python that the string is a raw string, which basically means that backslashes are kept exactly as you write them instead of being interpreted by Python as escape sequences.
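A small sketch of what the raw-string prefix changes. The variable names plain and raw are made up for this illustration; the point is only that Python itself rewrites "\b" unless the string is raw:

```python
# Without the r prefix, Python turns "\b" into one backspace character.
plain = "\bthe"

# With the r prefix, the backslash and the b stay two separate characters,
# so the re module gets to see the backslash.
raw = r"\bthe"

print(len(plain))    # 4
print(len(raw))      # 5
print(plain == raw)  # False
```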

To search a string, here called line, for the regular expression you use the search method:

match = regexp.search(line)

If the pattern is found, search(line) returns a match object. Like other objects it has methods, and we will come back to those later. If used in a boolean context the match object evaluates to True. If the pattern is not found, search(line) returns None which, as you know, evaluates to False in a boolean context.
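As a rough illustration of the two possible outcomes, here is a sketch that searches two made-up strings (they are not taken from alice.txt) and records whether search found the pattern:

```python
import re

regexp = re.compile(r"the ")

# Two made-up example strings; only the first one contains "the ".
lines = ["Alice saw the rabbit", "Down, down, down"]

results = []
for line in lines:
    match = regexp.search(line)
    if match:
        # a match object counts as True
        print("found:", line)
        results.append(True)
    else:
        # None counts as False
        print("not found:", line)
        results.append(False)
```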

In this example we read a file and print only the lines where the regular expression finds a match:

import re

# open a file
f = open("alice.txt","r")

# read all lines into a list:
text = f.readlines()

# close the file
f.close()

# compile the regular expression:
regexp = re.compile(r"the ")

# search each string in the list using the regular expression:
for line in text:
    if regexp.search(line):
        print(line)

Instead of the last three lines in the example you can use a list comprehension and join the resulting list, just to remind you of list comprehensions ;-):

print("".join([line for line in text if regexp.search(line)]))

Building regular expressions

Just searching for a simple string of characters like "the " is not a meaningful use of regular expressions. That is better done with the in operator on strings, like "the " in line. How to build regular expressions is covered in the reading material for this week. For your convenience I have listed some of the special characters and notations.

Matching of several different single characters:

.	Any single character except a newline
[qjk]	Either q or j or k
[^qjk]	Neither q nor j nor k
[a-z]	Anything from a to z inclusive
[^a-z]	Anything except a lower case letter
[a-zA-Z]	Any letter
\w	Any alphanumeric (word) character. The same as [a-zA-Z0-9_]
\W	Any non-word character. The same as [^a-zA-Z0-9_]
\d	Any digit. The same as [0-9]
\D	Any non-digit. The same as [^0-9]
\s	Any whitespace character: space, tab, newline, etc
\S	Any non-whitespace character
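To get a feeling for the character classes in the table, the following sketch tries a few patterns against a few strings. Both the sample strings and the patterns are made up for this illustration:

```python
import re

# Made-up sample strings and patterns, chosen only for illustration.
samples = ["jam", "jar 7", "JAM"]
patterns = [r"[qjk]a.", r"\d", r"[^a-z\s]", r"\s\d"]

results = {}
for pattern in patterns:
    regexp = re.compile(pattern)
    # keep the samples in which the pattern is found somewhere
    results[pattern] = [s for s in samples if regexp.search(s)]
    print(pattern, results[pattern])

# [qjk]a.   is found in "jam" and "jar 7"
# \d        is found in "jar 7"
# [^a-z\s]  is found in "jar 7" (the digit) and "JAM" (the capitals)
# \s\d      is found in "jar 7"
```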

Repeating stuff:

*	Zero or more of the previous character
+	One or more of the previous character
?	Zero or one of the previous character
{3}	Three times the previous character
{5,10}	Five to ten times the previous character
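The multipliers can be tried out in the same way. This is only a sketch with made-up sample strings; note that search looks for the pattern anywhere in the string:

```python
import re

# Made-up sample strings and patterns, chosen only for illustration.
samples = ["ll", "lol", "loool", "color", "colour", "room 12", "room 1234"]
patterns = [r"lo*l", r"lo+l", r"lo{2,3}l", r"colou?r", r"\d{3}"]

results = {}
for pattern in patterns:
    regexp = re.compile(pattern)
    results[pattern] = [s for s in samples if regexp.search(s)]
    print(pattern, results[pattern])

# lo*l      is found in "ll", "lol" and "loool"
# lo+l      is found in "lol" and "loool"
# lo{2,3}l  is found in "loool"
# colou?r   is found in "color" and "colour"
# \d{3}     is found in "room 1234"
```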

Representing special positions in the string (i.e. the spaces between characters):

^	The beginning of the string
$	The end of the string - immediately before the newline (if any)
\b	A word boundary; cannot be used inside [] (where it means a backspace character)
\B	Not a word boundary
[a-z]+	Any sequence of one or more lower case letters
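The position symbols do not match any character themselves; they only restrict where the rest of the pattern may match. A minimal sketch with made-up sample strings:

```python
import re

# Made-up sample strings and patterns, chosen only for illustration.
samples = ["the cat", "bathe", "other", "see the"]
patterns = [r"^the", r"the$", r"\bthe\b", r"\Bthe\B"]

results = {}
for pattern in patterns:
    regexp = re.compile(pattern)
    results[pattern] = [s for s in samples if regexp.search(s)]
    print(pattern, results[pattern])

# ^the      is found in "the cat"
# the$      is found in "bathe" and "see the"
# \bthe\b   is found in "the cat" and "see the"
# \Bthe\B   is found in "other"
```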

Special Characters:

\n	A newline character
\t	A tab character

Extra goodies:

jelly|cream	Either jelly or cream
(eg|le)gs	Either eggs or legs
(da)+	Either da or dada or dadada or...
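A short sketch of alternatives and repeated groups. The sample strings are made up, and the ^ and $ around (da)+ are added here only so that the repetition is visible; without them the pattern is found in any string containing "da":

```python
import re

# Made-up sample strings and patterns, chosen only for illustration.
samples = ["jelly", "ice cream", "eggs", "legs", "pegs", "dadada", "dad"]
patterns = [r"jelly|cream", r"(eg|le)gs", r"^(da)+$"]

results = {}
for pattern in patterns:
    regexp = re.compile(pattern)
    results[pattern] = [s for s in samples if regexp.search(s)]
    print(pattern, results[pattern])

# jelly|cream  is found in "jelly" and "ice cream"
# (eg|le)gs    is found in "eggs" and "legs"
# ^(da)+$      is found in "dadada" only
```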

Escapes for special characters

If you want your regular expression to match characters that have special meaning in regular expressions you put a backslash before the character. This tells the regular expression engine that e.g. '\+' is an actual '+' character and not a symbol for repeating the previous character.

\.	Full stop character
\|	Vertical bar character
\[	An open square bracket character
\)	A closing parenthesis character
\*	An asterisk character
\^	A caret symbol
\/	A slash character
\\	A backslash character
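The effect of the backslash is easiest to see by comparing a pattern with and without it. A minimal sketch with made-up sample strings:

```python
import re

# Made-up sample strings and patterns, chosen only for illustration.
samples = ["2+2", "222", "a.b", "axb"]
patterns = [r"2+2", r"2\+2", r"a.b", r"a\.b"]

results = {}
for pattern in patterns:
    regexp = re.compile(pattern)
    results[pattern] = [s for s in samples if regexp.search(s)]
    print(pattern, results[pattern])

# 2+2    means "one or more 2, then a 2": found in "222" only
# 2\+2   means "2, a plus sign, 2":       found in "2+2" only
# a.b    means "a, any character, b":     found in "a.b" and "axb"
# a\.b   means "a, a full stop, b":       found in "a.b" only
```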

Exercises

Download this linked file: alice.txt. Then write different regular expressions and retrieve the lines from alice.txt

with a full stop character '.'
that contain a three letter string consisting of "s", then any character, then "e" (such as "she").
that contain a word of any length that starts with "s" and ends with "e".
that start with "a".
with an odd digit followed by an even digit (e.g. 12 or 74).
with a date in the form "April 14. 2011". That is, a word with a capital first letter, a space, one or two digits, a dot, a space and four digits.
that do not contain "the ".

Retrieving the match

If you want to retrieve the string that was matched by the regular expression you use the group() method of the match object. This method takes an integer as argument. Use 0 to get the whole part of the string that matches the regular expression:

for line in text:
    match = regexp.search(line)
    if match:
        print("Match: %s, Line: %s" % (match.group(0), line))

Take a close look at the string formatting that produces the line we print.

Non-greedy Multipliers

By default the multipliers * and + are "greedy", which means they match as many characters as possible. A question mark after a multiplier forces it to be non-greedy, i.e. to match as few characters as possible. Searching the string "This is Miss Mississippi 2009" with the regular expression r"Miss.*i" matches "Miss Mississippi", whereas r"Miss.*?i" matches only "Miss Mi". Note that a non-greedy multiplier only shortens the end of the match: the search still starts at the leftmost possible position, so r"Miss.*?ippi" matches "Miss Mississippi" and not just "Mississippi".
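A minimal sketch of the difference, using the example string from above. The patterns r"Miss.*i" and r"Miss.*?i" are picked here only for illustration:

```python
import re

text = "This is Miss Mississippi 2009"

# greedy: .* grabs as much as possible, so the match ends at the last "i"
greedy = re.compile(r"Miss.*i").search(text)

# non-greedy: .*? grabs as little as possible, so the match ends at the first "i"
lazy = re.compile(r"Miss.*?i").search(text)

print(greedy.group(0))  # Miss Mississippi
print(lazy.group(0))    # Miss Mi
```

Both matches begin at the same place, the first "Miss" in the string; only the end of the match differs.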

Exercise

Use the code example above to print both match and line. Write a regular expression that matches "Alice's Adventures in Wonderland (commonly shortened to Alice in Wonderland" and one that matches "Alice's Adventures in Wonderland".

Remembering Patterns

Sometimes it is useful to be able to select substrings from the string that is matched by the regular expression. For example, the expression r"\d\d?\.\d\d?\.\d\d\d\d" matches a date format. To capture the day, month and year from the matching string we can put parentheses around these and retrieve the substrings as shown in the following example:

import re

regexp = re.compile(r"(\d\d?)\.(\d\d?)\.(\d\d\d\d)")

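# an example date string; any string containing a date in this format will do
date = "14.4.2011"

result = regexp.search(date)
if result:
    print("Day:", result.group(1))
    print("Month:", result.group(2))
    print("Year:", result.group(3))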

Notice that we use the group() method again, but now with arguments 1 or larger. With argument 1 you get the content of the first set of parentheses. Guess what you get with argument 2.
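To see how group(0) and the numbered groups relate, here is a sketch that searches a made-up sentence (not taken from alice.txt) with the same regular expression:

```python
import re

regexp = re.compile(r"(\d\d?)\.(\d\d?)\.(\d\d\d\d)")

# A made-up sentence containing a date in the expected format.
sentence = "The tea party was held on 14.4.2011 in the garden."

result = regexp.search(sentence)
whole = result.group(0)  # everything the regular expression matched
day = result.group(1)    # first set of parentheses
month = result.group(2)  # second set of parentheses
year = result.group(3)   # third set of parentheses

print(whole)             # 14.4.2011
print(day, month, year)  # 14 4 2011
```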

Exercise

Write a regular expression that matches two capitalized words like a name. Then retrieve and print the first name and the last name for each such match in lines in alice.txt.

Substitution

Instead of just printing the results of a search, you can also replace them with a string s using the sub method of the regular expression object regexp: regexp.sub(s, line) returns a copy of line in which every match is replaced by s. The behaviour is the one you know from search/replace in a word processor. The following code replaces "t" with "T". Note: whatever is searched for is a regular expression, but it is replaced by a string. That means r"t" is a regular expression in the example but "T" is a string.

import re

# open a file
f = open("alice.txt","r")

# compiling the regular expression:
regexp = re.compile(r"t")

# searching the file content line by line:
for line in f:
    print(regexp.sub("T", line), end="")

# close the file
f.close()
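The same method works on any string, not only on lines read from a file. As a small sketch, the following replaces every run of whitespace in a made-up string with a single space:

```python
import re

# one or more whitespace characters in a row
regexp = re.compile(r"\s+")

# A made-up string with irregular spacing.
messy = "Down,   down,\tdown."

# sub returns a new string; messy itself is not changed
tidy = regexp.sub(" ", messy)

print(tidy)  # Down, down, down.
```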

Exercises

Replace all instances of "the " with "the bloody ".
Delete all words with more than 3 characters. Hint: deleting means replacing with nothing.

Optional exercise:

Parentheses can also be used to match repeated substrings within one regular expression. In this case, the groups are denoted by \1, \2, \3. For example, r"(.)\1" matches any character that occurs twice in a row. Note that this is different from r"..", which means any two (possibly different) characters.
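A minimal sketch of the difference between r"(.)\1" and r"..", tried on a few made-up words. The third pattern, r"(\w)(\w)\2\1", is an extra illustration of using two groups at once; it looks for two characters followed by the same two in reverse order:

```python
import re

double = re.compile(r"(.)\1")          # the same character twice in a row
any_two = re.compile(r"..")            # any two characters
mirror = re.compile(r"(\w)(\w)\2\1")   # e.g. "noon" or "abba"

# Made-up sample words, chosen only for illustration.
words = ["moon", "sun", "noon"]

results = {}
for word in words:
    results[word] = (
        bool(double.search(word)),
        bool(any_two.search(word)),
        bool(mirror.search(word)),
    )
    print(word, results[word])

# moon (True, True, False)
# sun (False, True, False)
# noon (True, True, True)
```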

Exercise:

Print all lines in the alice.txt file that contain double characters like middle or good.