1. Half the Battle: How Python Understands Regular Expressions
Half the battle in writing a good regular expression (regex) is understanding how the computer -- not just the Python interpreter -- reads it. That understanding comes in three parts:
- Understanding how a computer reads and recognizes character strings in general
- Understanding how it understands meta-characters and regular expression symbols
- Understanding how to combine those characters and symbols into meaningful expressions to match a group of strings
2. How Python Views a String: Every Dog has Its Null Value
As you might expect, the computer reads strings the same way you do: one letter at a time. What is more, it recognizes character strings the same way you do: by reading until it matches the word and then moving to the next. However, there are parts of character strings which the computer reads differently and some which the computer alone sees. These are important to keep in mind as you form a regular expression.
As an example of strings read differently, you and I read "\n" as a backslash followed by the letter en. But the computer will read "\n" (without the quotes) as an indicator of a new line. Similarly, you and I read "     " as five blank spaces or possibly a tab. For the computer, however, a tab is represented by "\t", which we read as a backslash followed by a tee.
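The difference between what we read and what the computer stores can be checked directly. The following is a minimal sketch, assuming Python 3; the variable names are illustrative only. It also shows a raw string literal (the r prefix), which keeps the backslash and the letter as two separate characters and is the usual way to write regex patterns.

```python
# A regular string literal turns each escape sequence into one character.
newline = "\n"
tab = "\t"
print(len(newline))  # 1
print(len(tab))      # 1

# A raw string literal keeps the backslash and the letter as two characters.
raw_newline = r"\n"
print(len(raw_newline))  # 2
```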
Similarly, character strings have parts which only the computer can see and distinguish. In memory (RAM), every string ends with what is called a null value. This is how the computer knows where the string ends. (Note: how data is stored on a hard disk is determined largely by the file system in which the disk is formatted -- e.g., FAT32, NTFS, etc.) So, the string 'dog' actually looks like this image in memory ("|" represents the boundary between each block).
When the computer gets to the null value, it knows to go no further. In addition to this, there are incremental boundaries to string values that you should keep in mind.
3. How Did the Python Eat the String? One Byte at a Time
Beyond the null value, the computer notices all of the incremental locations which we as human readers seldom, if ever, think about. Perhaps the two most common of these are the spots immediately before and after a string's value. In this image, those two spots are highlighted in red.
Next we look at how Python views two words in one string.
4. Python Strings Along Dogs and Cats
This distinction also applies to word boundaries within a character string. This image shows how the character string "dog cat" would appear in memory. The word boundaries are highlighted in red and the string boundaries are highlighted in green. You will note that the medial blank space is open -- it is not null. The leftmost green boundary is the beginning of the character string. The red boundary that follows it is the beginning of the first word.
Now that we understand string boundaries and the boundaries of a string's contents, we can look at the metacharacters of regex formulation.
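The boundaries described above can be located with the re module itself. This is only a sketch, assuming Python 3; it uses the "\b", "^" and "$" symbols that are introduced later in this tutorial, and positions are counted from zero.

```python
import re

text = "dog cat"

# Positions where a word boundary (\b) occurs.
word_boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(word_boundaries)  # [0, 3, 4, 7]

# Positions of the start-of-string and end-of-string boundaries.
string_start = re.search(r"^", text).start()
string_end = re.search(r"$", text).start()
print(string_start, string_end)  # 0 7
```

Note that positions 0 and 7 are both a string boundary and a word boundary, while 3 and 4 sit on either side of the blank space.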
5. "I never metacharacter I didn't like."
As the saying goes, "I never metacharacter I didn't like." Once you learn how to use metacharacters, you will wonder how you ever got on without them. This is particularly true in Python. Python has a robust regular expression engine and employs a wide array of metacharacters. The following is a list of the symbols and their uses:
- .: any character except a newline.
- ^: the boundary at the beginning of a character string (the green line)
- $: the boundary at the end of a character string (the other green line)
- *: zero or more instances of a pattern
- +: one or more instances of a pattern
- ?: zero or one instance of a pattern
- \: the indicator of an escape sequence
- {}: used to specify parameters for the matching of the regular expression
- []: indicates a set of characters for a single position in the regex
- (): used to group regular expressions
- |: used as an 'or' between two possible matches.
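To see several of these symbols at work, here is a small sketch, assuming Python 3.4 or later (for re.fullmatch). The patterns and sample words are made up for illustration; each one is tested against the whole word.

```python
import re

# Each entry: (pattern, text, whether the whole text should match).
checks = [
    (r"d.g", "dog", True),        # . stands for any single character
    (r"do*g", "dg", True),        # * allows zero o's
    (r"do+g", "dg", False),       # + needs at least one o
    (r"do?g", "doog", False),     # ? allows at most one o
    (r"d[aeiou]g", "dig", True),  # [] lists the options for one position
    (r"dog|cat", "cat", True),    # | chooses between two alternatives
    (r"(do)+g", "dodog", True),   # () groups "do" so + repeats the pair
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]
for (pattern, text, expected), result in zip(checks, results):
    print(pattern, text, result)
```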
6. Python's Compound Metacharacters
While many of the symbols mentioned on the previous page have a stand-alone meaning, two of them are used to form compound symbols and expressions which change the meaning of regular expressions. They are the backslash ("\") and the curly braces ("{}").
The backslash is the symbol used to initiate an escape sequence. Obviously, there are only so many characters on a keyboard and repetition of those basic components is necessary in any system. So the backslash is used to convert "normal" or "regular" characters into "escaped" characters. The following are the escaped characters that one can use within a regular expression and their meanings:
- \n: a newline
- \t: a tab
- \A: the start of a string (similar to "^")
- \Z: the end of a string (similar to "$")
- \b: the boundary of a word (the red line)
- \B: the empty string that is neither at the beginning nor at the end of a word
- \d: any decimal digit (0 through 9)
- \D: any character that is not a decimal digit
- \s: any whitespace character (blank space, tab, etc.)
- \S: any non-whitespace character
- \w: any alphanumeric character and the underscore
- \W: any non-alphanumeric character (e.g., "&", "£", "!", etc.)
The curly braces, however, take on a different meaning depending on their contents.
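The escaped characters listed above can be tried out with re.findall, which returns every match in a string. This is a minimal sketch, assuming Python 3; the sample string is invented for the purpose.

```python
import re

sample = "Room 42,\tfloor_3 & up"

digits = re.findall(r"\d", sample)
print(digits)  # ['4', '2', '3']

whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t', ' ', ' ']

# \w+ collects runs of alphanumeric characters and underscores.
word_runs = re.findall(r"\w+", sample)
print(word_runs)  # ['Room', '42', 'floor_3', 'up']

# \W picks up everything \w leaves out, whitespace included.
non_word = re.findall(r"\W", sample)
print(non_word)  # [' ', ',', '\t', ' ', '&', ' ']
```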
7. Curly Braces in Python Regular Expressions
In general, the curly braces are used to indicate a range. If you use a single digit between them, a specific number of instances will be matched:
>>> import re
>>> words = ('coco', 'dodo', 'mumu', 'dada', 'haha', 'coha', 'domu', 'hada')
>>> # One non-whitespace character followed by exactly one "o" or "u"
>>> x = re.compile(r'\S[ou]{1}')
>>> for i in words:
...     if x.match(i):
...         print(i)
...
coco
dodo
mumu
coha
domu
It is important to note that "{1}" is redundant: every symbol is matched exactly once unless you say otherwise, so "\S[ou]{1}" behaves the same as "\S[ou]". If, however, you render the number of instances as zero (i.e., "{0}"), the bracketed set must appear zero times, which effectively removes it from the pattern. What remains is "\S", so every word in the list is matched.
>>> import re
>>> words = ('coco', 'dodo', 'mumu', 'dada', 'haha', 'coha', 'domu', 'hada')
>>> # {0} repeats the set zero times, leaving only \S to be matched
>>> x = re.compile(r'\S[ou]{0}')
>>> for i in words:
...     if x.match(i):
...         print(i)
...
coco
dodo
mumu
dada
haha
coha
domu
hada
8. Tightening Python's Grip: Ranges in Python Pattern Matching
Two numbers within curly braces denote a range in the number of times to repeat the preceding part of the regular expression. A single number such as "{3}" means exactly three repetitions and is shorthand for "{3,3}". Writing "{0,3}" instead allows anywhere from zero to three repetitions. If the first number is left out, as in "{,3}", Python assumes a lower bound of zero.
This is important when you want minimum and maximum thresholds on the number of repetitions. The numbers do not control how many times the regex will match within a string (for that you must use iteration). Rather, they control how many times the preceding symbol should be repeated when Python builds the regex for comparison.
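A quick way to check how the different brace forms behave is to run them against a short test string. The sketch below assumes Python 3; the strings "aaaa" and "aa" are arbitrary samples. It shows that "{3}" demands exactly three repetitions, while "{0,3}" and "{,3}" accept anything from zero to three.

```python
import re

# On a long enough string, both forms consume three characters.
exactly_three = re.match(r"a{3}", "aaaa").group()
up_to_three = re.match(r"a{0,3}", "aaaa").group()
print(exactly_three, up_to_three)  # aaa aaa

# The difference shows on a shorter string.
short_exact = re.match(r"a{3}", "aa")
short_range = re.match(r"a{0,3}", "aa").group()
print(short_exact)  # None
print(short_range)  # aa

# Leaving out the lower bound is the same as writing zero.
implicit_zero = re.match(r"a{,3}", "aa").group()
print(implicit_zero)  # aa
```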
For example, say you give re.compile the following regular expression formula:
"^\S"
It will match any string whose first character is not whitespace. However, say you give it this regular expression formula:
"^\S{2,5}"
Then, Python will construct a regular expression that behaves like any one of the following regular expression formulae:
"^\S\S"
"^\S\S\S"
"^\S\S\S\S"
"^\S\S\S\S\S"
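Note that strings beginning with only a single non-whitespace character are not matched. Be aware, though, that re.match only requires the pattern to fit at the start of the string: a string that begins with six or more non-whitespace characters still matches, because its first five characters satisfy the pattern. To exclude such strings, something must follow the braces in the pattern, such as "$" or "\s".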
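The behavior of "^\S{2,5}" can be checked against strings of different lengths. This is a sketch, assuming Python 3, with made-up sample strings. Keep in mind that re.match only needs the pattern to fit at the start of the string, so a string with more than five leading non-whitespace characters still produces a match on its first five; the second pattern adds "$" to rule that out.

```python
import re

pattern = re.compile(r"^\S{2,5}")
samples = ["a", "ab", "abcde", "abcdefgh", "a bcd"]

# Record the text each sample matched, or None when there was no match.
matched = {}
for s in samples:
    m = pattern.match(s)
    matched[s] = m.group() if m else None
print(matched)

# Anchoring the end as well rejects strings longer than five characters.
anchored = re.compile(r"^\S{2,5}$")
long_match = anchored.match("abcdefgh")
print(long_match)  # None
```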
9. An Example of Pattern Matching in Python
To clarify this even further, let's consider an example. Say you have a list of words that all begin with the same letter: deep, deer, deescalate, deface, de facto, defalcate, defame, defat, default, defeat, de haut en bas, Dei gratia, déjà vu. If you wanted all words that do not begin with the Latin preposition de, how would you find them? Require a minimum of three non-whitespace characters at the start. For purposes of illustration, what if you wanted only words that meet that minimum threshold and also have a vowel in the fifth position (or, because the range lets the opening run stretch to five characters, in the sixth or seventh)? You have to tack on an extra piece of regex as follows:
>>> import re
>>> words = ("deep", "deer", "deescalate", "deface", "de facto", "defalcate", "defame", "defat", "default", "defeat", "de haut en bas", "Dei gratia", "déjà vu")
>>> # Three to five non-whitespace characters, any one character, then a vowel
>>> x = re.compile(r"^\S{3,5}.[aeiou]")
>>> for i in words:
...     if x.match(i):
...         print(i)
...
deescalate
deface
defalcate
defame
default
defeat
If we wanted to leave off the vowel requirement but only wanted three or more non-whitespace characters, the regex could be "\S{3,}". Like list slices and other ranges, the curly brace syntax lets you omit either end: a missing upper bound means no limit, and a missing lower bound means zero.
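As a rough illustration of the open-ended form, the sketch below (assuming Python 3, with a shortened version of the word list) keeps every word that starts with at least three non-whitespace characters.

```python
import re

words = ("deep", "de facto", "defeat", "Dei gratia", "de haut en bas")

# {3,} means three or more, with no upper limit.
open_ended = re.compile(r"\S{3,}")
hits = [w for w in words if open_ended.match(w)]
print(hits)  # ['deep', 'defeat', 'Dei gratia']
```

"Dei gratia" is kept here because its first word has three letters. The two entries that start with the two-letter "de" followed by a space are dropped.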
If you are a bit unclear on how the matches were made, compare those items matched to the unmatched items to find a pattern. Then compare that pattern to the regular expression. You might also play with the numbers of the range to see how the hits change. If you are still thrown by all of this, we discuss how this regular expression works and how to form regex formulae on the next page.
10. An Explanation of Forming RegEx in Python
Let's look at the regex for a moment. There are some things here which have been briefly mentioned but not fully discussed.
>>> x = re.compile(r"^\S{3,5}.[aeiou]")
When forming a regular expression, it is important to remember that every character-matching symbol (such as "\S", "." or a bracketed set) equals exactly one character in the string to be matched. Anchors such as "^" mark a position and do not use up a character. The only time this 1-to-1 relationship changes is when you put quantifiers in curly braces or group alternatives together in square brackets, where several listed characters compete for a single position.
If you find yourself having difficulty in formulating the appropriate regex formula, write it out long-style -- on-screen or on paper. Then, take it one character at a time. Even if you want more than one match, put each one down explicitly.
So if we want to begin with a non-whitespace character, we write:
^\S
Then, because we want three of them, we write out all three:
^\S\S\S
Keep in mind here that these three symbols equate to three characters. If we then want the fifth character of the string to be a vowel, we need to enter a filler for the fourth position. Enter the period/full stop. Remember, the period/full stop equates to any character save the newline character ("\n").
^\S\S\S.
Then, if we want to add the vowels, we need to decide whether we want to include "y". Assuming no, we can place all the vowels together between a pair of square brackets and append them to the formula:
^\S\S\S.[aeiou]
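Once the long-style formula works, the repeated symbols can be folded back into curly braces. The sketch below, assuming Python 3 and a shortened word list chosen for illustration, compares the long form with the compact "{3}" form; both should pick out the same words.

```python
import re

long_form = re.compile(r"^\S\S\S.[aeiou]")
short_form = re.compile(r"^\S{3}.[aeiou]")

words = ("deep", "deface", "default", "defeat", "de facto", "Dei gratia")

long_hits = [w for w in words if long_form.match(w)]
short_hits = [w for w in words if short_form.match(w)]
print(long_hits)   # ['default', 'defeat']
print(short_hits)  # ['default', 'defeat']
```

With exactly three leading characters, only words whose fifth character is a vowel are kept, which is why "deface" drops out here even though the wider range "{3,5}" accepts it.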



