You're reading from  Perl 6 Deep Dive

Product type: Book
Published in: Sep 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787282049
Edition: 1st Edition
Author (1)
Andrew Shitov

Andrew Shitov has been a Perl enthusiast since the end of the 1990s, and is the organizer of over 30 Perl conferences in eight countries. He worked as a developer and CTO in leading web-development companies, such as Art. Lebedev Studio, Booking dotCom, and eBay, and he learned from the "Fathers of the Russian Internet", Artemy Lebedev and Anton Nossik. Andrew has been following the Perl 6 development since its beginning in 2000. He ran a blog dedicated to the language, published a series of articles in the Pragmatic Perl magazine, and gives talks about Perl 6 at various Perl events. In 2017, he published the Perl 6 at a Glance book by DeepText, which was the first book on Perl 6 published after the first stable release of the language specification.

Regexes

Regular expressions are one of the most valuable features of Perl. In Perl 6, regular expressions were redesigned to make them more regular and powerful. The term also changed—regular expressions are more often called simply regexes now. In this chapter, we will go through all the elements of the syntax of regexes.

The following topics will be covered in this chapter:

  • Matching against regexes
  • Literals
  • Character classes
  • Quantifiers
  • Anchors
  • Alternation
  • Grouping
  • Capturing and named captures
  • Named regexes
  • The Match object
  • Assertions
  • Adverbs
  • Substitution

Matching against regexes

Regexes describe patterns of text. They provide us with a language in which we can express the structure of the text.

Consider an example. A phone number is a sequence of digits. The phrase "sequence of digits" can be written down as \d+. If we take into account the fact that phone numbers may be written with spaces and dashes, then we have to say that a phone number is a sequence of digits, delimited with spaces or dashes. This is already a more complex regex, which can be written differently, depending on how strict we are, for instance, if we allow two spaces together or if a dash can be followed by a space, or if a group of digits can consist of a single digit.

Let's be least strict and formalize it as (\d || \s || \-)+, that is, one or more digits (\d), spaces (\s), or dashes (\-). The double vertical bar stands for "...
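
As a minimal sketch of how loose this pattern is, the following snippet applies it to a made-up string. The sample text and the variable names are illustrative assumptions, not taken from the book. Because the pattern accepts spaces as well as digits, the match also picks up the spaces around the number:

```raku
# Hypothetical input; only the regex comes from the text above.
my $text = 'call 555 123-4567 now';

# One or more digits, whitespace characters, or dashes.
my $matched = $text ~~ / (\d || \s || \-)+ /;

# The brackets make the captured spaces visible.
say '[' ~ $matched ~ ']';   # [ 555 123-4567 ]
```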

Literals

The syntax of regexes is a small language within Perl 6. As there are many things to express, it uses some characters to convey the meaning. Letters, digits and underscores stand for themselves without any special meaning. These characters can be used as-is, as shown in the following example:

my $name = 'John';
say 'OK' if $name ~~ /John/; # OK

my $id = 534;
say 'OK' if $id ~~ /534/; # OK

If the string inside a regex contains other characters, for example, spaces, you should take care of them. One of the possibilities is to quote the whole string:

my $name = 'Smith Jr.';
say 'Junior' if $name ~~ /' Jr'/; # Junior

The literal string ' Jr' inside a regex contains a space that will have to be present in the variable $name.

Another alternative is to use a special character, prefixed by a backslash. For...
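
The two approaches can be compared side by side. The following is a small sketch, not the book's own listing; the variable names $quoted and $escaped are made up for the illustration, and it assumes that \s stands for a whitespace character and \. for a literal dot:

```raku
my $name = 'Smith Jr.';

# Quoted literal: the space and the dot are matched as they are.
# The ?( ) prefix turns the match result into True or False.
my $quoted = ?( $name ~~ / 'Smith Jr.' / );

# Backslash forms: \s matches the space, \. matches a literal dot.
my $escaped = ?( $name ~~ / Smith \s Jr \. / );

say $quoted;    # True
say $escaped;   # True
```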

Character classes

A character class in regexes is a special sequence that matches characters from some given set. For example, in the previous section, we already used a character class \s, which matches with an ASCII space as well as with some other whitespace characters, such as tabs. Let us explore character classes in regexes of Perl 6.

The . (dot) character

A very simple character, just a single dot, can match with any character in the string. This is often used when you do not care about some character between the two parts. For example, the following code will match with a string that has any two characters between a and d:

say 'OK' if 'abcd' ~~ / a . . d /; # OK
say 'OK' if 'aefd...
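
To see which strings such patterns accept, a regex can also be passed to grep. This is only a sketch with hypothetical word lists; it assumes nothing beyond the dot and the \s character class described in this section:

```raku
# Hypothetical candidates for the "a, two characters, d" pattern.
my @words = 'abcd', 'a--d', 'ad', 'abd';

# Keep only the strings with exactly two characters between a and d.
my @hits = @words.grep(/ a . . d /);
say @hits.join(', ');   # abcd, a--d

# \s matches a space as well as a tab.
my @spaced = ("a b", "a\tb", "ab").grep(/ a \s b /);
say @spaced.elems;      # 2
```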

Creating repeated patterns with quantifiers

Quantifiers modify the previous atom and request a particular number of repetitions. An atom is a character, a character class, a string literal, or a group (we will talk about groups later in the Extracting substrings with capturing section of this chapter).

The + quantifier allows the previous atom to be repeated one or more times. For example, the regex /a+/ matches with a single character a, as well as with a string containing two characters aa, or three, or more—aaaaaa. It will not, however, match with a string that does not contain the a character at all.

The * quantifier allows any number of repetitions, including zero. So, the /a*/ regex matches with strings such as bdef, abc, or baad. Of course, a single /a*/ may not be that useful; the * quantifier's more natural use case is between other substrings, such as...
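
The difference between the two quantifiers can be checked with a short sketch. The strings bd, bad, baaad, and bed and all variable names are hypothetical, chosen only to illustrate the rules above:

```raku
# + needs at least one a.
my $has-a = ?( 'aaaaaa' ~~ / a+ / );
my $no-a  = ?( 'bdef'   ~~ / a+ / );
say $has-a;   # True
say $no-a;    # False

# * also accepts zero repetitions, so it succeeds on a string
# without any a and matches an empty substring.
my $empty = 'bdef' ~~ / a* /;
say $empty.Str.chars;   # 0

# Between two literals, * allows any number of a characters, including none.
my @between = <bd bad baaad bed>.grep(/ba*d/);
say @between.join(', ');   # bd, bad, baaad
```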

Extracting substrings with capturing

Matching against regexes is not enough. The real power of regular expressions is not complete without the ability to extract the substrings that matched the regex pattern. Saving the parts of the string in special variables is called capturing.

Capturing groups

In Perl 6, capturing is achieved by placing part of a regex in parentheses. Parentheses have a dual meaning in regexes. We have already seen parentheses used for grouping alternatives in the phone number example.

Let us continue with the example of extracting values of HTML attributes. We now want to print the values. So, we need to create a regex and mark the borders of the data that we want to extract. Captured data is...
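
A minimal sketch of capturing might look as follows. The tag, the regex, and the variable names are assumptions made for this illustration and are not the book's own listing; the only feature relied on is that each pair of parentheses fills the next numbered variable, $0, $1, and so on:

```raku
# Hypothetical HTML fragment.
my $tag = '<img width="100">';

my ($attr, $value);
if $tag ~~ / (\w+) '="' (\w+) '"' / {
    # $0 holds the first group, $1 the second; ~ converts them to strings.
    ($attr, $value) = ~$0, ~$1;
}

say "$attr is $value";   # width is 100
```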

Using alternations in regexes

Let us look once again at our naïve regex for matching phone numbers:

rx/ \+? (\d || \s || \-)+ /

Vertical bars separate different variants within the group in parentheses. The group can match either \d, or \s, or \-. In the context of regexes, this is called alternation. Different variants are, correspondingly, called alternatives.

In Perl 6, there are two forms of alternation separator in regexes: the single | and the double || vertical bars. With a single vertical bar, the longest variant always wins. With the double bar, the first matched alternative wins.

In the phone number example, each alternative is exactly one symbol long. So, there is no difference between | and || there. In other cases, the choice of the operator may drastically change the result.

For example, take the two regexes from the following example and match the forms of an adjective...
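
The effect of the two separators can be sketched with a pair of regexes that differ only in the bar. The word bigger and the variable names are hypothetical and stand in for whichever adjective forms you want to test:

```raku
# Hypothetical adjective form.
my $word = 'bigger';

# Single bar: the longest alternative wins.
my $longest = $word ~~ / big | bigger /;

# Double bar: the first alternative that matches wins.
my $first = $word ~~ / big || bigger /;

say ~$longest;   # bigger
say ~$first;     # big
```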

Positioning regexes with anchors

In many cases, a regex has to be applied to the string in such a way that its beginning coincides with the beginning of the string. For example, if a phone number contains the + character, it can only appear in the first position.

Perl 6 regexes have so-called anchors: special characters that anchor a regex to either the beginning or the end of the string or of a logical line.

Matching at the start and at the end of lines or strings

Let us modify the phone number regex so that it forces the regex to match with the whole string containing a potential phone number:

/ ^ \+? <[\d\s\-]>+ $ /;

Here, ^ is the anchor that matches at the beginning of the string and does not consume any...
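
As a quick check of the anchored regex, the sketch below filters a few made-up inputs. The sample strings and the names $phone, @inputs, and @accepted are assumptions for illustration only:

```raku
# The anchored regex from above, stored in a variable.
my $phone = rx/ ^ \+? <[\d\s\-]>+ $ /;

# Hypothetical inputs: only the first one is a phone number from start to end.
my @inputs = '+31 20 555-1234', 'call +31 20', '31+20';

# grep keeps the strings that the regex matches as a whole.
my @accepted = @inputs.grep($phone);
say @accepted.join(' | ');   # +31 20 555-1234
```

The second string is rejected because of the text before the number, and the third because the + character appears in the middle.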

Looking forward and backward with assertions

Assertions are another way to control the flow of a regex. During the match process, the pattern consumes characters of the source string. Assertions make it possible to run checks at the current position without consuming characters.

There are two types of assertions in Perl 6 regexes—lookahead and lookbehind. Each of them can be negated. In the following table, all the possible combinations are listed:

             Positive assertion   Negative assertion
Lookahead    <?before X>          <!before X>
Lookbehind   <?after X>           <!after X>

When placed inside a regex, the lookahead assertion <?before X> checks whether the characters following the current position match X. If so, the assertion succeeds and the regex engine continues its work. Other assertions behave following the same logical considerations, for example...
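
The two positive assertions can be tried on a small example. This is a sketch under stated assumptions: the sample text and the variable names are invented, and only <?before X> and <?after X> from the table above are used:

```raku
# Hypothetical text with two numbers.
my $text = 'price: 100 USD, weight: 20 kg';

# Lookahead: take digits only if ' USD' follows them.
# The checked characters do not become part of the match.
my $amount = $text ~~ / \d+ <?before \s USD> /;

# Lookbehind: take digits only if 'weight: ' stands right before them.
my $weight = $text ~~ / <?after 'weight: '> \d+ /;

say ~$amount;   # 100
say ~$weight;   # 20
```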

Modifying regexes with adverbs

Adverbs are regex modifiers. They are colon-prefixed letters that change the behavior of regexes.

Adverbs exist in two forms—short and long—and appear in front of a regex, for example:

say 'OK' if 'ABCD' ~~ m:i/ abcd /;

Notice that when an adverb is applied to the whole regex, as in this example, m or rx is needed. Alternatively, an adverb can be put inside the regex. In this case, it starts its action from the position where it appeared. This is demonstrated in the examples in the next section about the :i adverb.
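
Both placements can be compared in a short sketch. The test strings and variable names are hypothetical; the snippet relies only on the :i adverb and its long form :ignorecase:

```raku
# Adverb in front of the regex: the whole pattern ignores case.
my $short = ?( 'ABCD' ~~ m:i/ abcd / );
my $long  = ?( 'ABCD' ~~ m:ignorecase/ abcd / );

# Without the adverb, letter case matters.
my $plain = ?( 'ABCD' ~~ / abcd / );

# Adverb inside the regex: only the part after :i ignores case.
my $tail-only = ?( 'abCD' ~~ / ab :i cd / );
my $head-off  = ?( 'ABcd' ~~ / ab :i cd / );

say $short;       # True
say $long;        # True
say $plain;       # False
say $tail-only;   # True
say $head-off;    # False
```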

The following table lists all the adverbs:

Short form   Long form     Description
:i           :ignorecase   Letters are matched case-insensitively
:s           :sigspace     Whitespace is significant
:p(N)        :pos(N)       Start at position N
:g           :global       Match globally
:c           :continue     Continue after the previous match
:r           :ratchet      Disable...

Substitution and altering strings with regexes

Matching strings with a regex often extracts some information from the given data. Another common task is to replace parts of the text with different characters. In Perl 6, the s built-in function does that.

It takes two arguments, a regex and a replacement. When a regex is applied to the source string and the pattern is matched, the part of the string that matches is replaced with the second argument.

Consider a simple example:

my $str = 'Its length is 10 mm';
$str ~~ s/<<mm>>/millimeters/;
say $str; # Its length is 10 millimeters

The regex here, /<<mm>>/, matches with the word mm. The second part tells Perl 6 to replace it with the full name of the measurement unit. The replacement happens in place, and the original string is modified.

Traditionally, s uses slashes as delimiters but different characters can...
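
By default, s replaces only the first match. The sketch below contrasts that with the :g adverb, which makes the substitution global. The sample string and the variable names are assumptions made for the illustration:

```raku
# Without adverbs, only the first occurrence is replaced.
my $first-only = 'Size: 10 mm by 20 mm';
$first-only ~~ s/ <<mm>> /millimeters/;
say $first-only;   # Size: 10 millimeters by 20 mm

# With :g, every occurrence is replaced.
my $everywhere = 'Size: 10 mm by 20 mm';
$everywhere ~~ s:g/ <<mm>> /millimeters/;
say $everywhere;   # Size: 10 millimeters by 20 millimeters
```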

Summary

In this chapter, we discussed regexes in Perl 6. They share many ideas with regular expressions in Perl 5 but also offer many fascinating new things. We examined the methods of constructing regexes and matching them against text, and learned how to extend the power of the regex engine with character classes, both built-in and user-defined. We also looked at the way Perl 6 stores results in the Match object and how to make substitutions and replacements in strings using regexes.

In the next chapter, we will meet an even more powerful tool that tremendously extends regexes: grammars.
