Mastering Python Regular Expressions


Product type Book
Published in Feb 2014
Publisher Packt
ISBN-13 9781783283156
Pages 110 pages
Edition 1st Edition

Chapter 3. Grouping

Grouping is a powerful tool that allows you to perform operations such as:

  • Creating subexpressions to apply quantifiers. For instance, repeating a subexpression rather than a single character.

  • Limiting the scope of the alternation. Instead of alternating the whole expression, we can define exactly what has to be alternated.

  • Extracting information from the matched pattern. For example, extracting a date from lists of orders.

  • Using the extracted information again in the regex, which is probably the most useful property. One example would be to detect repeated words.

Throughout this chapter, we will explore groups, from the simplest to the most complex ones. We'll review some of the previous examples in order to bring clarity to how these operations work.

Introduction


We've already used groups in several examples throughout Chapter 2, Regular Expressions with Python. Grouping is accomplished through two metacharacters, the parentheses (). The simplest use of parentheses is building a subexpression. For example, imagine you have a list of products in which each product ID is made up of two or three sequences of one digit, followed by a dash, followed by one alphanumeric character, such as 1-a2-b:

>>> re.match(r"(\d-\w){2,3}", ur"1-a2-b")
<_sre.SRE_Match at 0x10f690738>

As you can see in the preceding example, the parentheses indicate to the regex engine that the pattern inside them has to be treated like a unit.
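As a minimal sketch of the same idea (written for Python 3, so the ur prefix is dropped), the quantifier {2,3} applies to the whole group rather than to a single character. The names PRODUCT_ID and is_product_id are illustrative and not from the book, and re.fullmatch is used here as an assumption about how one might validate a complete ID:

```python
import re

# Product-ID pattern from the text: a digit, a dash and one alphanumeric
# character, repeated two or three times as a single unit.
PRODUCT_ID = re.compile(r"(\d-\w){2,3}")

def is_product_id(text):
    """Return True only if the whole string is two or three such units."""
    return PRODUCT_ID.fullmatch(text) is not None

print(is_product_id("1-a2-b"))        # True: two repetitions
print(is_product_id("1-a2-b3-c"))     # True: three repetitions
print(is_product_id("1-a"))           # False: only one repetition
print(is_product_id("1-a2-b3-c4-d"))  # False: four repetitions

# re.match only anchors the start, so on the longer string it accepts
# the first three units and ignores the rest.
print(PRODUCT_ID.match("1-a2-b3-c4-d").group())  # 1-a2-b3-c
```

Note the difference between the two calls: match succeeds as soon as a valid prefix is found, while fullmatch requires the pattern to cover the entire string.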

Let's see another example; in this case, we need to match whenever there is one or more ab followed by c:

>>> re.search(r"(ab)+c", ur"ababc")
<_sre.SRE_Match at 0x10f690a08>
>>> re.search(r"(ab)+c", ur"abbc")
None

So, you could use parentheses whenever you want to group meaningful subpatterns...

Backreferences


As we've mentioned previously, one of the most powerful functionalities that grouping gives us is the possibility of using the captured group inside the regex or other operations. That's exactly what backreferences provide. Probably the best known example to bring some clarity is the regex to find duplicated words, as shown in the following code:

>>> pattern = re.compile(r"(\w+) \1")
>>> match = pattern.search(r"hello hello world")
>>> match.groups()
('hello',)

Here, we're capturing a group made up of one or more alphanumeric characters, after which the pattern tries to match a whitespace character, and finally we have the \1 backreference, which means that it must match exactly the same text that the first group matched.
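Backreferences are also available in the replacement string of the sub operation. The following is a small sketch under a few assumptions: it targets Python 3, the names DUPLICATED and collapse_duplicates are made up for illustration, and the \b word boundaries are an addition to the pattern above so that the "is" at the end of "this" is not treated as the start of a duplicate:

```python
import re

# (\w+) captures a word; \1 requires the same text again after one space.
# The \b anchors keep the match from starting or ending inside a word.
DUPLICATED = re.compile(r"\b(\w+) \1\b")

def collapse_duplicates(text):
    """Replace each doubled word with a single copy of the captured word."""
    return DUPLICATED.sub(r"\1", text)

print(collapse_duplicates("hello hello world"))  # hello world
print(collapse_duplicates("this is is a test"))  # this is a test
```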

Backreferences can be used with the first 99 groups. Obviously, as the number of groups increases, reading and maintaining the regex becomes more complex. This is something that can...

Named groups


Remember from the previous chapter when we got a group through an index?

>>> pattern = re.compile(r"(\w+) (\w+)")
>>> match = pattern.search("Hello world")
>>> match.group(1)
'Hello'
>>> match.group(2)
'world'

We just learnt how to access the groups using indexes to extract information and to use it as backreferences. Using numbers to refer to groups can be tedious and confusing, and the worst thing is that it doesn't allow you to give meaning or context to the group. That's why we have named groups.

Imagine a regex in which you have several backreferences, let's say 10, and you find out that the third one is invalid, so you remove it from the regex. That means you have to change the index of every backreference from that one onwards. To solve this problem, Guido van Rossum designed named groups for Python 1.5 in 1997. This feature was offered to Perl for cross-pollination.

Nowadays, it can be found in almost any flavor. Basically...
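In Python's re module, a named group is written (?P<name>pattern). The sketch below, written for Python 3, rewrites the two-word pattern shown earlier with names; the group names first, second and word are illustrative choices, not taken from the book:

```python
import re

# Same two-word pattern as before, with names instead of positions.
pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
match = pattern.search("Hello world")

print(match.group("first"))   # Hello
print(match.group("second"))  # world
print(match.groupdict())      # {'first': 'Hello', 'second': 'world'}
print(match.group(1))         # Hello (numeric access still works)

# (?P=name) is the named form of a backreference inside the pattern.
duplicated = re.compile(r"(?P<word>\w+) (?P=word)")
print(duplicated.search("hello hello world").group("word"))  # hello

# In a replacement string, \g<name> refers to a named group.
print(pattern.sub(r"\g<second> \g<first>", "Hello world"))  # world Hello
```

Because the groups are addressed by name, removing or adding another group elsewhere in the pattern does not force any of these references to be renumbered.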

Non-capturing groups


As we've mentioned before, capturing content is not the only use of groups. There are cases when we want to use groups, but we're not interested in extracting the information; alternation would be a good example. That's why we have a way to create groups without capturing. Throughout this book, we've been using groups to create subexpressions, as can be seen in the following example:

>>> re.search("Españ(a|ol)", "Español")
<_sre.SRE_Match at 0x10e90b828>
>>> re.search("Españ(a|ol)", "Español").groups()
('ol',)

You can see that we've captured a group even though we're not interested in the content of the group. So, let's try it without capturing, but first we have to know the syntax, which is almost the same as in normal groups, (?:pattern). As you can see, we've only added ?:. Let's see the following example:

>>> re.search("Españ(?:a|ol)", "Español")
<_sre.SRE_Match at 0x10e912648>
>>> re.search("Españ(?:a|ol)", "Español").groups...
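To compare the two forms side by side, here is a short sketch written for Python 3. The variable names are illustrative, and the last pattern, (?:a|b)(c), is an extra example added to show how group numbering is affected:

```python
import re

capturing = re.search(r"Españ(a|ol)", "Español")
non_capturing = re.search(r"Españ(?:a|ol)", "Español")

# Both patterns match the same text...
print(capturing.group())       # Español
print(non_capturing.group())   # Español

# ...but only the capturing form stores the alternative that matched.
print(capturing.groups())      # ('ol',)
print(non_capturing.groups())  # ()

# A non-capturing group does not take a group number, so the first
# capturing group that follows it is still group 1.
m = re.search(r"(?:a|b)(c)", "bc")
print(m.group(1))              # c
```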

Special cases with groups


Python provides us with some special forms of groups that can help us to modify the behavior of a regular expression, or even to match a pattern only when a previous group took part in the match, much like an if statement.

Flags per group

There is a way to apply the flags we've seen in Chapter 2, Regular Expressions with Python, using a special form of grouping: (?iLmsux).

Letter    Flag
i         re.IGNORECASE
L         re.LOCALE
m         re.MULTILINE
s         re.DOTALL
u         re.UNICODE
x         re.VERBOSE

For example:

>>> re.findall(r"(?u)\w+", ur"ñ")
[u'\xf1']

The above example is the same as:

>>> re.findall(r"\w+", ur"ñ", re.U)
[u'\xf1']

We've seen what these examples do several times in the previous chapter.

Remember that a flag is applied to the whole expression.
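The equivalence between an inline flag and a flag argument can be checked with any of the letters in the table. The following sketch assumes Python 3, where strings are Unicode by default, so it uses (?i) rather than (?u). The sample text is made up, and the flag is placed at the very start of the pattern because recent Python 3 releases reject global inline flags that appear anywhere else:

```python
import re

text = "Hello HELLO hello"

# Inline flag at the start of the pattern...
inline = re.findall(r"(?i)hello", text)
# ...and the same flag passed as an argument.
argument = re.findall(r"hello", text, re.IGNORECASE)

print(inline)              # ['Hello', 'HELLO', 'hello']
print(inline == argument)  # True

# The compiled pattern records the flag either way.
print(bool(re.compile(r"(?i)hello").flags & re.IGNORECASE))  # True
```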

yes-pattern|no-pattern

This is a very useful case of groups. It tries to match the yes-pattern only if a previous group was found; if that group was not found, it tries the optional no-pattern instead. In short, it's like an if-else statement...
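In Python's re module, the conditional is written (?(id/name)yes-pattern|no-pattern), where id or name refers to an earlier group. The sketch below is adapted from the example in the Python re documentation rather than from this book: it accepts a simplified e-mail address either with both angle brackets or with none. The variable names are illustrative:

```python
import re

# (<)?                 group 1: an optional opening bracket
# (\w+@\w+(?:\.\w+)+)  group 2: a simplified e-mail address
# (?(1)>|$)            if group 1 matched, require '>';
#                      otherwise require the end of the string
pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "<user@host.com>",  # both brackets
    "user@host.com",    # no brackets
    "<user@host.com",   # opening bracket only
    "user@host.com>",   # closing bracket only
]
results = {c: bool(pattern.match(c)) for c in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
```

Only the first two candidates match: the closing bracket is required exactly when the opening one was captured.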

Overlapping groups


Throughout Chapter 2, Regular Expressions with Python, we've seen several operations where there was a warning about overlapping groups: for example, the findall operation. This is something that seems to confuse a lot of people. So, let's try to bring some clarity with a simple example:

>>> re.findall(r'(a|b)+', 'abaca')
['a', 'a']

What's happening here? Why does the preceding expression give us 'a' and 'a' instead of 'aba' and 'a'?

Let's look at it step by step to understand the solution:

Figure: Overlapping groups matching process

As we can see in the preceding figure, the characters aba are matched, but the captured group contains only a. This is because, even though our regex groups every character in turn, the group keeps only the last one it matched, the final a. Keep this in mind because it's the key to understanding how it works. Stop for a moment and think about it: we're asking the regex engine to capture all the groups made up of a or b, but just for one of the characters and that's the key...
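The behavior can be observed directly by varying where the capturing parentheses sit. This is a sketch for Python 3 and one possible way to get 'aba' and 'a' back, not necessarily the approach the rest of the chapter takes:

```python
import re

text = "abaca"

# One capturing group under a quantifier: findall returns the group,
# and the group only remembers its last repetition.
print(re.findall(r"(a|b)+", text))    # ['a', 'a']

# Non-capturing group: with no groups, findall returns whole matches.
print(re.findall(r"(?:a|b)+", text))  # ['aba', 'a']

# Outer group around the repetition: findall returns one tuple per
# match, (whole repetition, last inner repetition).
print(re.findall(r"((a|b)+)", text))  # [('aba', 'a'), ('a', 'a')]

# finditer exposes the full match and the group for each match.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"(a|b)+", text)]
print(pairs)                          # [('aba', 'a'), ('a', 'a')]
```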

Summary


Don't let the simplicity of the chapter fool you: what we have learned throughout this chapter will be very useful in your day-to-day work with regex, and it'll give you a lot of leverage.

Let's summarize what we have learned so far. First, we have seen how a group can help us when we need to apply quantifiers to only some part of the expression.

We have also learned how to use the captured groups in the pattern again or even in the replacement string in the sub operation, thanks to backreferences.

In this chapter, we have also viewed named groups, a tool for improving the readability and future maintenance of the regex.

Later on, we learned how to match a subexpression only when a previous group exists or, on the other hand, to match a different one when that group doesn't exist.

Now that we know how to use groups, it's time to learn a more complex subject closely related to groups: look around!
