Python: Regular Expressions Part One

Regular expressions are a powerful tool for various kinds of string manipulation.
They are a domain-specific language (DSL) that is available as a library in most modern programming languages, not just Python.
They are useful for two main tasks:
– verifying that strings match a pattern (for instance, that a string has the format of an email address),
– performing substitutions in a string (such as changing all American spellings to British ones).

Domain-specific languages are highly specialized mini programming languages.
Regular expressions are a popular example, and SQL (for database manipulation) is another.
Private domain-specific languages are often used for specific industrial purposes.

Regular expressions in Python can be accessed using the re module, which is part of the standard library.
Once you have defined a regular expression, you can use the re.match function to determine whether it matches at the beginning of a string.
If it does, match returns an object representing the match; if not, it returns None.
To avoid confusion when working with regular expressions, we write patterns as raw strings, in the form r"expression".
In a raw string, backslashes are not treated as escape characters, which makes regular expressions easier to write.
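As a small illustration of what a raw string changes, the sketch below compares a normal string with a raw one. The variable names are made up for this example.

```python
# In a normal string, "\n" is one character (a newline).
# In a raw string, r"\n" is two characters: a backslash and the letter n.
normal = "\n"
raw = r"\n"

print(len(normal))  # 1
print(len(raw))     # 2

# A raw string is equivalent to doubling each backslash in a normal string.
print(raw == "\\n")  # True
```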

import re

pattern = r"spam"

if re.match(pattern, "spamspamspam"):
   print("Match")
else:
   print("No match")

The above example checks if the pattern “spam” matches the string and prints “Match” if it does.

Here the pattern is a simple word, but various characters have a special meaning when they are used in a regular expression.

Other functions to match patterns are re.search and re.findall.
The function re.search finds a match of a pattern anywhere in the string.
The function re.findall returns a list of all substrings that match a pattern.

Example:

import re

pattern = r"spam"

if re.match(pattern, "eggspamsausagespam"):
   print("Match")
else:
   print("No match")

if re.search(pattern, "eggspamsausagespam"):
   print("Match")
else:
   print("No match")
    
print(re.findall(pattern, "eggspamsausagespam"))

In the example above, the match function did not match the pattern, as it looks at the beginning of the string.
The search function found a match in the string.

The function re.finditer finds the same matches as re.findall, but it returns an iterator of match objects rather than a list of strings.
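A minimal sketch of re.finditer on the same string as above. Each item the iterator yields is a match object, so the position of every match is available as well as its text. The names text and spans are chosen only for this example.

```python
import re

pattern = r"spam"
text = "eggspamsausagespam"

# Collect the (start, end) position of every non-overlapping match.
spans = [m.span() for m in re.finditer(pattern, text)]
print(spans)  # [(3, 7), (14, 18)]

# re.findall gives only the matched text for the same matches.
print(re.findall(pattern, text))  # ['spam', 'spam']
```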

A successful call to re.search returns a match object with several methods that give details about the match.
These methods include group, which returns the matched string; start and end, which return the start and end positions of the match; and span, which returns the start and end positions as a tuple.

import re

pattern = r"pam"
match = re.search(pattern, "eggspamsausage")
if match:
    print(match.group())  # pam
    print(match.start())  # 4
    print(match.end())    # 7
    print(match.span())   # (4, 7)

Search & Replace
One of the most important functions in the re module is sub.
Syntax:

re.sub(pattern, repl, string, count=0)

This function replaces occurrences of the pattern in string with repl. It replaces all occurrences unless count is given, in which case at most count replacements are made. It returns the modified string.
Example:

import re
# Use a name other than str so the built-in str type is not shadowed.
text = "My name is David. Hi David."
pattern = r"David"
new_text = re.sub(pattern, "Amy", text)
print(new_text)
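To show how the number of replacements can be limited, here is a short sketch. In Python's re module the limiting argument of re.sub is named count. The variable names are chosen for this example only.

```python
import re

text = "My name is David. Hi David."

# With the default count=0, every occurrence is replaced.
all_replaced = re.sub(r"David", "Amy", text)
print(all_replaced)  # My name is Amy. Hi Amy.

# With count=1, only the first occurrence is replaced.
first_only = re.sub(r"David", "Amy", text, count=1)
print(first_only)  # My name is Amy. Hi David.
```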

Metacharacters:

Metacharacters are what make regular expressions more powerful than normal string methods.
They allow you to create regular expressions to represent concepts like “one or more repetitions of a vowel”.

The existence of metacharacters poses a problem if you want to create a regular expression (or regex) that matches a literal metacharacter, such as “$”. You can do this by escaping the metacharacter with a backslash in front of it.
However, this can cause problems, since backslashes also have an escaping function in normal Python strings. This can mean putting three or four backslashes in a row to do all the escaping.

To avoid this, you can use a raw string, which is a normal string with an “r” in front of it. We saw raw strings used in the earlier examples.
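The escaping described above might look like the following sketch, which matches a literal “$”. The sample text is invented for illustration. re.escape is a standard library helper that adds the backslashes for you.

```python
import re

text = "Cost: $5"

# Escaped: the backslash makes "$" match a literal dollar sign.
escaped = re.search(r"\$5", text)
print(escaped.group())  # $5

# Unescaped: "$" means "end of string", so nothing can follow it.
unescaped = re.search(r"$5", text)
print(unescaped)  # None

# Without a raw string, the backslash itself has to be doubled.
print(r"\$5" == "\\$5")  # True

# re.escape produces the escaped form of a literal string.
print(re.escape("$5"))  # \$5
```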

The first metacharacter we will look at is . (dot). It matches any single character other than a newline.
Example:

import re
pattern = r"gr.y"
if re.match(pattern, "grey"):
   print("Match 1")
if re.match(pattern, "gray"):
   print("Match 2")
if re.match(pattern, "blue"):
   print("Match 3")

The next two metacharacters are ^ and $. These match the start and end of a string, respectively.
Example:

import re
pattern = r"^gr.y$"
if re.match(pattern, "grey"):
   print("Match 1")
if re.match(pattern, "gray"):
   print("Match 2")
if re.match(pattern, "stingray"):
   print("Match 3")

The pattern “^gr.y$” means that the string should start with gr, continue with any single character except a newline, and end with y.
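The effect of the anchors is easier to see with re.search, which does not start at the beginning of the string by default. The following is a sketch with example strings chosen for illustration.

```python
import re

# Without anchors, search finds "gray" inside a longer word.
unanchored = re.search(r"gr.y", "stingray")
print(unanchored.group())  # gray

# With ^ and $, the pattern has to cover the string from start to end.
anchored = re.search(r"^gr.y$", "stingray")
print(anchored)  # None

# re.match already starts at the beginning, so $ is what rejects trailing text.
prefix_ok = re.match(r"gr.y", "greyhound") is not None
whole_ok = re.match(r"gr.y$", "greyhound") is not None
print(prefix_ok, whole_ok)  # True False
```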

Character Classes:

Character classes provide a way to match only one of a specific set of characters.
A character class is created by putting the characters it matches inside square brackets.
Example:

import re
pattern = r"[aeiou]"
if re.search(pattern, "grey"):
   print("Match 1")
if re.search(pattern, "qwertyuiop"):
   print("Match 2")
if re.search(pattern, "rhythm myths"):
   print("Match 3")

The pattern [aeiou] in the search function matches all strings that contain any one of the characters defined.

Character classes can also match ranges of characters.
Some examples:
The class [a-z] matches any lowercase alphabetic character.
The class [G-P] matches any uppercase character from G to P.
The class [0-9] matches any digit.
Multiple ranges can be included in one class. For example, [A-Za-z] matches a letter of any case.
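As a quick check of the ranges listed above, this sketch runs each class over one made-up sample string with re.findall.

```python
import re

text = "Kit 7 Zed"

lower = re.findall(r"[a-z]", text)
g_to_p = re.findall(r"[G-P]", text)
digits = re.findall(r"[0-9]", text)
letters = re.findall(r"[A-Za-z]", text)

print(lower)    # ['i', 't', 'e', 'd']
print(g_to_p)   # ['K']  (Z is outside the range G to P)
print(digits)   # ['7']
print(letters)  # ['K', 'i', 't', 'Z', 'e', 'd']
```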

Example:

import re
pattern = r"[A-Z][A-Z][0-9]"
if re.search(pattern, "LS8"):
   print("Match 1")
if re.search(pattern, "E3"):
   print("Match 2")
if re.search(pattern, "1ab"):
   print("Match 3")

The pattern in the example above matches strings that contain two alphabetic uppercase letters followed by a digit.

Place a ^ at the start of a character class to invert it.
This causes it to match any character other than the ones included.
Other metacharacters, such as $ and ., have no special meaning within character classes.
The metacharacter ^ has no special meaning unless it is the first character in a class.
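These rules can be checked with a short sketch. The sample strings are invented for this example.

```python
import re

# Inside a class, "." and "$" are ordinary characters.
literal_meta = re.findall(r"[.$]", "a.b$c")
print(literal_meta)  # ['.', '$']

# "^" is literal when it is not the first character of the class.
literal_caret = re.findall(r"[a^]", "a^b")
print(literal_caret)  # ['a', '^']

# As the first character, "^" inverts the class instead.
inverted = re.findall(r"[^a]", "a^b")
print(inverted)  # ['^', 'b']
```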

Example:

import re
pattern = r"[^A-Z]"
if re.search(pattern, "this is all quiet"):
   print("Match 1")
if re.search(pattern, "AbCdEfG123"):
   print("Match 2")
if re.search(pattern, "THISISALLSHOUTING"):
   print("Match 3")

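The pattern [^A-Z] matches any character that is not an uppercase letter, so the search fails only on strings made up entirely of uppercase letters.
Note that the ^ must be the first character inside the brackets to invert the character class.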

Courtesy: sololearn
