Python: Regular Expressions Part Two

Previously we discussed the introduction, simple metacharacters, and character classes. Check out part one for a better understanding. Today we will discuss more metacharacters, groups, special sequences, and email extraction.

Metacharacters

Some more metacharacters are *, +, ?, { and }.
These specify numbers of repetitions.
The metacharacter * means “zero or more repetitions of the previous thing”. It tries to match as many repetitions as possible. The “previous thing” can be a single character, a class, or a group of characters in parentheses.
Example:

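import re
pattern = r"egg(spam)*"
# "egg" followed by zero "spam"s: matches
if re.match(pattern, "egg"):
   print("Match 1")
# re.match only needs a match at the start, so the trailing "egg" is ignored: matches
if re.match(pattern, "eggspamspamegg"):
   print("Match 2")
# does not start with "egg": no match
if re.match(pattern, "spam"):
   print("Match 3")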

The example above matches strings that start with “egg”, followed by zero or more occurrences of “spam”.
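To see the "as many repetitions as possible" behaviour in practice, the short sketch below prints the text that was actually matched. It relies on match.group(), which returns the matched text (groups are covered later in this post); the variable names are only illustrative.

```python
import re

pattern = r"egg(spam)*"

# re.match anchors at the start of the string, so the trailing "egg" is
# simply left unmatched; * consumes as many "spam"s as it can.
greedy = re.match(pattern, "eggspamspamegg")
print(greedy.group())  # eggspamspam

# Zero repetitions are allowed, so a bare "egg" also matches.
bare = re.match(pattern, "egg")
print(bare.group())  # egg
```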

The metacharacter + is very similar to *, except it means “one or more repetitions”, as opposed to “zero or more repetitions”.
Example:

import re
pattern = r"g+"
if re.match(pattern, "g"):
   print("Match 1")
if re.match(pattern, "gggggggggggggg"):
   print("Match 2")
if re.match(pattern, "abc"):
   print("Match 3")

To summarize:
* matches 0 or more occurrences of the preceding expression.
+ matches 1 or more occurrences of the preceding expression.
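A small side-by-side comparison may make the difference clearer. In this sketch the same input is tried against g* and g+; the variable names are made up for illustration.

```python
import re

# g* may match zero characters, so it succeeds (with an empty match)
# even though "abc" does not start with "g".
star = re.match(r"g*", "abc")
print(repr(star.group()))  # ''

# g+ needs at least one "g" at the start, so there is no match at all.
plus = re.match(r"g+", "abc")
print(plus)  # None
```

This is why a pattern built only from * can "match" almost anything: an empty match still counts as a match.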

The metacharacter ? means “zero or one repetitions”.
Example:

import re
pattern = r"ice(-)?cream"
if re.match(pattern, "ice-cream"):
   print("Match 1")
if re.match(pattern, "icecream"):
   print("Match 2")
if re.match(pattern, "sausages"):
   print("Match 3")
if re.match(pattern, "ice--ice"):
   print("Match 4")

Curly Braces

Curly braces can be used to represent the number of repetitions between two numbers.
The regex {x,y} means “between x and y repetitions of something”.
Hence {0,1} is the same thing as ?.
If the first number is missing, it is taken to be zero. If the second number is missing, it is taken to be infinity.
Example:

import re
pattern = r"9{1,3}$"
if re.match(pattern, "9"):
   print("Match 1")
if re.match(pattern, "999"):
   print("Match 2")
if re.match(pattern, "9999"):
   print("Match 3")

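“9{1,3}$” matches a string made up of one to three nines: re.match anchors the pattern at the start of the string and $ anchors it at the end, so “9999” does not match.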
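The cases where one of the two numbers is left out can be sketched as follows. The helper whole_string_matches is a hypothetical name introduced only for this example; it uses re.fullmatch so that the entire string has to fit the pattern.

```python
import re

def whole_string_matches(pattern, text):
    """Return True if the entire text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# {2,}: second number missing, so "two or more"
print(whole_string_matches(r"9{2,}", "9"))      # False
print(whole_string_matches(r"9{2,}", "99999"))  # True

# {,2}: first number missing, so "zero to two"
print(whole_string_matches(r"9{,2}", ""))       # True
print(whole_string_matches(r"9{,2}", "999"))    # False
```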

Groups

A group can be created by surrounding part of a regular expression with parentheses.
This means that a group can be given as an argument to metacharacters such as * and ?.
Example:

import re
pattern = r"egg(spam)*"
if re.match(pattern, "egg"):
   print("Match 1")
if re.match(pattern, "eggspamspamspamegg"):
   print("Match 2")
if re.match(pattern, "spam"):
   print("Match 3")

(spam) represents a group in the example pattern shown above.

The content of groups in a match can be accessed using the group method of the match object.
A call of group(0) or group() returns the whole match.
A call of group(n), where n is greater than 0, returns the nth group from the left.
The method groups() returns a tuple of all groups from 1 upward.
Example:

import re
pattern = r"a(bc)(de)(f(g)h)i"
match = re.match(pattern, "abcdefghijklmnop")
if match:
   print(match.group())
   print(match.group(0))
   print(match.group(1))
   print(match.group(2))
   print(match.groups())

As you can see from the example above, groups can be nested.
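With nested groups, the numbering follows the order of the opening parentheses from left to right. The sketch below reuses the pattern from the example above and looks at the two groups it did not print; the names outer and inner are only illustrative.

```python
import re

match = re.match(r"a(bc)(de)(f(g)h)i", "abcdefghijklmnop")

outer = match.group(3)  # third opening parenthesis: (f(g)h)
inner = match.group(4)  # fourth opening parenthesis: (g), nested inside group 3
print(outer)            # fgh
print(inner)            # g
print(match.groups())   # ('bc', 'de', 'fgh', 'g')
```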

There are several kinds of special groups.
Two useful ones are named groups and non-capturing groups.
Named groups have the format (?P<name>…), where name is the name of the group and … is the content. They behave exactly the same as normal groups, except that they can be accessed by group(name) in addition to their number.
Non-capturing groups have the format (?:…). They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.
Example:

import re
pattern = r"(?P<first>abc)(?:def)(ghi)"
match = re.match(pattern, "abcdefghi")
if match:
   print(match.group("first"))
   print(match.groups())
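As a small follow-up sketch, the same pattern can be used to check how the numbering works: the named group is still group 1, and the non-capturing group is skipped, so (ghi) is group 2. Match objects also offer groupdict(), which returns only the named groups as a dictionary.

```python
import re

match = re.match(r"(?P<first>abc)(?:def)(ghi)", "abcdefghi")

print(match.group(1))     # abc (a named group still has a number)
print(match.group(2))     # ghi (the non-capturing group gets no number)
print(match.groupdict())  # {'first': 'abc'}
```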

Metacharacters

Another important metacharacter is |.
This means “or”, so red|blue matches either “red” or “blue”.
Example:

import re
pattern = r"gr(a|e)y"
match = re.match(pattern, "gray")
if match:
   print("Match 1")
match = re.match(pattern, "grey")
if match:
   print("Match 2")
match = re.match(pattern, "griy")
if match:
   print("Match 3")
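The parentheses in gr(a|e)y matter, because | applies to everything on either side of it unless a group limits its scope. The sketch below compares the grouped pattern with an ungrouped one; the variable names are illustrative, and re.fullmatch is used so that the whole string has to match.

```python
import re

grouped = r"gr(a|e)y"  # "gr", then "a" or "e", then "y"
ungrouped = r"gra|ey"  # either "gra" or "ey"

grouped_grey = re.fullmatch(grouped, "grey") is not None
ungrouped_grey = re.fullmatch(ungrouped, "grey") is not None
ungrouped_ey = re.fullmatch(ungrouped, "ey") is not None

print(grouped_grey)    # True
print(ungrouped_grey)  # False
print(ungrouped_ey)    # True
```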

Special Sequences

There are various special sequences you can use in regular expressions. They are written as a backslash followed by another character.
One useful special sequence is a backslash and a number between 1 and 99, e.g., \1 or \17. This matches the same text that was matched by the group with that number.
Example:

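import re
pattern = r"(.+) \1"
# the same text appears on both sides of the space: matches
match = re.match(pattern, "word word")
if match:
   print("Match 1")
# the repeated text does not have to be a word: matches
match = re.match(pattern, "?! ?!")
if match:
   print("Match 2")
# "cde" is not a repeat of "abc": no match
match = re.match(pattern, "abc cde")
if match:
   print("Match 3")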

Note that “(.+) \1” is not the same as “(.+) (.+)”, because \1 refers to the text that the first group actually matched, not to the group's regex pattern.
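The difference can be checked directly. The sketch below runs both patterns on the same input and then, as one possible use of backreferences, looks for an accidentally doubled word. The sample sentence is made up, and \w and \b are special sequences described further down in this post.

```python
import re

# \1 must repeat the exact text of group 1, so "abc cde" fails...
same = re.match(r"(.+) \1", "abc cde")
print(same)  # None

# ...whereas two independent groups accept any two pieces of text.
any_two = re.match(r"(.+) (.+)", "abc cde")
print(any_two.groups())  # ('abc', 'cde')

# Finding a doubled word in a sentence (sample text is invented).
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))  # is
```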

More useful special sequences are \d, \s, and \w.
These match digits, whitespace, and word characters respectively.
In ASCII mode they are equivalent to [0-9], [ \t\n\r\f\v], and [a-zA-Z0-9_].
In Unicode mode they match certain other characters, as well. For instance, \w matches letters with accents.
Versions of these special sequences with upper case letters – \D, \S, and \W – mean the opposite to the lower-case versions. For instance, \D matches anything that isn’t a digit.
Example:

import re
pattern = r"(\D+\d)"
match = re.match(pattern, "Hi 999!")
if match:
   print("Match 1")
match = re.match(pattern, "1, 23, 456!")
if match:
   print("Match 2")
match = re.match(pattern, " ! $?")
if match:
    print("Match 3")

(\D+\d) matches one or more non-digits followed by a digit.
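The difference between ASCII mode and Unicode mode can be sketched with an accented letter. In Python 3, string patterns use Unicode matching unless the re.ASCII flag is passed; the sample word is just an illustration, and it is written with an escape so that the accented letter is a single character.

```python
import re

word = "caf\u00e9"  # "café" with a precomposed accented e

# Unicode mode (the default for str patterns): the accented letter counts as \w.
unicode_match = re.match(r"\w+", word)
print(unicode_match.group())  # café

# ASCII mode: \w is limited to [a-zA-Z0-9_], so the match stops before the accent.
ascii_match = re.match(r"\w+", word, re.ASCII)
print(ascii_match.group())  # caf
```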

Additional special sequences are \A, \Z, and \b.
The sequences \A and \Z match the beginning and end of a string, respectively.
The sequence \b matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words.
The sequence \B matches the empty string anywhere else.
Example:

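import re
pattern = r"\b(cat)\b"
# "cat" is a separate word: matches
match = re.search(pattern, "The cat sat!")
if match:
   print("Match 1")
# ">" and "<" are not word characters, so "cat" still has boundaries: matches
match = re.search(pattern, "We s>cat<tered?")
if match:
   print("Match 2")
# "cat" is inside the word "scattered": no match
match = re.search(pattern, "We scattered.")
if match:
   print("Match 3")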

“\b(cat)\b” matches the word “cat” surrounded by word boundaries.
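The example above only exercises \b. A rough sketch of the other sequences in this group, \A, \Z, and \B, might look like this; the variable names and sample strings are only illustrative.

```python
import re

text = "The cat sat"

# \A and \Z tie the pattern to the very start and very end of the string.
starts = re.search(r"\AThe", text)
ends = re.search(r"sat\Z", text)
not_at_start = re.search(r"\Acat", text)
print(starts is not None, ends is not None, not_at_start)  # True True None

# \B is the opposite of \b: "cat" must sit inside a longer word.
inside = re.search(r"\Bcat\B", "We scattered.")
standalone = re.search(r"\Bcat\B", text)
print(inside is not None, standalone)  # True None
```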

Email Extraction

To demonstrate a sample usage of regular expressions, let's create a program to extract email addresses from a string.
Suppose we have a text that contains an email address:

str = "Please contact info@test.com for assistance"

Our goal is to extract the substring “info@test.com”.
A basic email address consists of a word and may include dots or dashes. This is followed by the @ sign and the domain name (the name, a dot, and the domain name suffix).
This is the basis for building our regular expression.

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"

[\w\.-]+ matches one or more word characters, dots, or dashes.
The regex above says that the string should contain a word (with dots and dashes allowed), followed by the @ sign, then another similar word, then a dot and another word.

Our regex contains three groups:
1 – first part of the email address.
2 – domain name without the suffix.
3 – the domain suffix.

Putting it all together:

import re
pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
str = "Please contact info@test.com for assistance"
match = re.search(pattern, str)
if match:
   print(match.group())

In case the string contains multiple email addresses, we could use the re.findall method instead of re.search, to extract all email addresses.
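One detail to keep in mind: because the pattern contains groups, re.findall returns a tuple of the three groups for each match rather than the whole address. A minimal sketch, using a made-up sentence with two sample addresses, shows this and one way to get the full addresses with re.finditer instead:

```python
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
# Invented sample text containing two addresses.
text = "Contact info@test.com or sales@example.org for help"

# With groups in the pattern, findall yields one tuple of groups per match.
parts = re.findall(pattern, text)
print(parts)  # [('info', 'test', '.com'), ('sales', 'example', '.org')]

# finditer yields match objects, so group() gives each whole address.
addresses = [m.group() for m in re.finditer(pattern, text)]
print(addresses)  # ['info@test.com', 'sales@example.org']
```

Another option would be to turn the three groups into non-capturing groups, in which case findall returns the full matches directly.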

The regex in this example is for demonstration purposes only.
A much more complex regex is required to fully validate an email address.

Courtesy: sololearn
