Special characters, also known as non-literals, are characters that have a special meaning in the context of a regular expression. These are sometimes also referred to as metacharacters, and here they are:
The dot or period (.): The dot character is the quintessential wildcard of regular expressions. This special RegEx character will match any single character.
For example, the RegEx:
port.
Strings that will match:
ports, portA, portC, portX, Port8, and so on. Note that in UAG, RegEx is not case sensitive.
Strings that will not match the above RegEx:
portable, airport, port12, and so on.
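The behaviour of the dot can be tried out with a small sketch. It uses Python's re module rather than UAG's own engine, so it is only an approximation: the examples above imply that the whole string must match and that case is ignored, which is imitated here with re.fullmatch and re.IGNORECASE. The helper name whole_match is hypothetical.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

for candidate in ["ports", "portA", "Port8", "port", "portable", "airport", "port12"]:
    # The dot stands for exactly one character, so "port" alone is too short
    # and "port12" is one character too long.
    print(candidate, whole_match(r"port.", candidate))
```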
The asterisk or star (*): This metacharacter denotes repetition in RegEx, instructing the engine to match the previous character zero or more times.
For example, the RegEx:
port*
(meaning the characters p, o, r, followed by t repeated zero times, or once, or multiple times)
Strings that will match:
por, port, portt, porttttt
Strings that will not match:
portable, airport, port12, and so on
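As a rough check of the star examples, here is a minimal sketch in Python. It assumes whole-string, case-insensitive matching (re.fullmatch with re.IGNORECASE) as a stand-in for the behaviour shown in the examples; it is not UAG's engine, and whole_match is a hypothetical helper.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

# The star applies only to the t, so "por" (zero t's) is accepted,
# while anything after the run of t's is not.
accepted = [s for s in ["por", "port", "portt", "porttttt"] if whole_match(r"port*", s)]
rejected = [s for s in ["portable", "airport", "port12"] if not whole_match(r"port*", s)]
print("accepted:", accepted)
print("rejected:", rejected)
```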
The combination of the dot and the star is going to become a good friend of yours when you read and write RegExes. This combination means "any character" (the dot), repeated zero or more times (the star), which in effect means "anything"!
The RegEx:
port.*
(meaning the characters p, o, r, t, followed by "any character" repeated zero or more times)
The plus sign (+): This is just like the star sign, except that the plus sign requires the preceding character to appear at least once (unlike the star sign, which also allows no repetition at all).
The RegEx:
server1+
Will match:
server1, server11, server111111, and so on
Will not match:
server, server12, server7, and so on
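The difference between the plus sign, the star, and the dot-star combination can be compared side by side. The following is a sketch in Python under the assumption of whole-string, case-insensitive matching; it approximates the behaviour described here and is not UAG's own engine. The helper whole_match is hypothetical.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

# Plus: at least one "1" is required, so a bare "server" is rejected.
plus_results = {s: whole_match(r"server1+", s)
                for s in ["server", "server1", "server111111", "server12", "server7"]}

# Star: zero repetitions are allowed, so a bare "server" is accepted.
star_accepts_bare = whole_match(r"server1*", "server")

# Dot-star: "port" followed by anything at all, including nothing.
dot_star_results = {s: whole_match(r"port.*", s)
                    for s in ["port", "ports", "portable", "port12", "airport"]}

print(plus_results)
print(star_accepts_bare)
print(dot_star_results)
```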
The question mark (?): This, too, is a repetition metacharacter, which allows the preceding character to appear once or not at all.
The RegEx:
appa?
Will match:
App and AppA (in any letter case), and nothing else
Will not match:
AppAA, AppAAA, AppAB, App5, and so on
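A short sketch can confirm what the question mark allows. It uses Python's re module with whole-string, case-insensitive matching as an assumed approximation of the behaviour described here; whole_match is a hypothetical helper.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

# The trailing "a" is optional: it may appear once or not at all.
for candidate in ["App", "AppA", "AppAAA", "AppAB", "App5"]:
    print(candidate, whole_match(r"appa?", candidate))
```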
The backslash (\): This is also known as "the escape character". It is placed in front of a special RegEx character when you want the RegEx engine to treat that character as a literal, and not as a special character.
The RegEx:
/images/logo\.gif
(meaning the dot before the string gif is a literal dot, not "any character")
Will match:
/images/logo.gif
Will not match:
/images/logosgif, and so on
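To see what the escape character changes, the escaped and unescaped forms can be compared. This is a sketch in Python, assuming whole-string, case-insensitive matching as an approximation of the behaviour described here. re.escape is Python's own helper for escaping a literal string and is not something UAG offers; whole_match is a hypothetical helper.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

escaped = r"/images/logo\.gif"    # the dot is a literal dot
unescaped = r"/images/logo.gif"   # the dot is "any character"

for path in ["/images/logo.gif", "/images/logosgif"]:
    print(path, whole_match(escaped, path), whole_match(unescaped, path))

# Python can add the backslashes for a literal string automatically.
auto_escaped = re.escape("/images/logo.gif")
print(whole_match(auto_escaped, "/images/logosgif"))
```

The unescaped form is the risky one: it accepts both paths, because the dot also matches the letter s.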
Note that the repeat expressions we just discussed do not necessarily refer only to the preceding single character. They repeat the preceding "token", which means the preceding character or the preceding character set or sub-expression. Let's see what these are:
Character set: As the name hints, this is a set of characters, where the RegEx engine can match any one of the characters in the set. A character set is delimited by the square brackets, which are both special characters.
The RegEx:
potato[es]
Will match:
potatoe and potatos (might come in handy when you need to cater for those who make spelling mistakes, like a certain VP back in 1992)
Will not match:
potatoes (the set stands for exactly one character, so the two-letter ending is one character too many)
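The character set example can be sketched in Python. The expression potato[es] is an assumption inferred from the listed matches, and whole-string, case-insensitive matching is assumed as an approximation of the behaviour described here. The second expression shows a repetition character applied to the whole set as a "token", as noted earlier. The helper whole_match is hypothetical.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

# A set stands for exactly one character out of those listed.
single = {s: whole_match(r"potato[es]", s)
          for s in ["potatoe", "potatos", "potatoes", "potato"]}

# With a plus after the set, the token "e or s" may repeat,
# so the correctly spelled plural is accepted as well.
repeated = whole_match(r"potato[es]+", "potatoes")

print(single)
print(repeated)
```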
A range of characters: This is a special character set, which encompasses a full range of characters, for example, the letters from A to M, or the digits from 0 to 6. You do not need to specify all the characters in the range; instead, you only specify the first one and the last one, separated by a dash (-). You can have more than one range in the set, and you can also mix-and-match between ranges and single characters in the same set. Note that the "-" character is a special character when found between square brackets, unless it is at the very beginning or very end of the set, which then makes it a literal.
The RegEx:
server[a-f1-5]
(meaning "server", followed by one of the letters a to f, or the letters A to F, or the digits 1 to 5)
Will match:
serverD, serverF, server2, server5, and so on
Will not match:
serverA1, serverAB, server05, and so on.
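The range example, and the rule about a dash at the end of a set, can be checked with a small sketch. It is written in Python and assumes whole-string, case-insensitive matching as an approximation of the behaviour described here; the expression [a-c-] and the helper whole_match are made up for illustration.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

names = ["serverD", "serverF", "server2", "server5",
         "serverG", "server6", "serverA1", "serverAB", "server05"]
in_range = [n for n in names if whole_match(r"server[a-f1-5]", n)]
print(in_range)

# A dash at the very end of the set is a literal dash, not a range marker.
print(whole_match(r"[a-c-]", "-"), whole_match(r"[a-c-]", "b"), whole_match(r"[a-c-]", "d"))
```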
The RegEx:
/scripts/[a-z0-9_-]+\.vbs
(means /scripts/, followed by one of the letters a to z or the letters A to Z or the digits 0 to 9 or the underscore or dash, any of these repeated one or multiple times (due to the plus sign), followed by a literal period, and then by vbs)
Will match:
/scripts/mapdrives.vbs, /scripts/home_017.vbs, and so on
Will not match:
/scriptsdrives.vbs, /scripts/test/run.vbs, /scripts/policy.cgi, and so on
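The script path expression combines a set, a repetition, and an escaped dot, and a quick sketch shows how each part rejects a different path. This is Python with whole-string, case-insensitive matching assumed as an approximation of the behaviour described here; the helper whole_match and the extra path /scripts/.vbs are illustrative additions.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

SCRIPT_RULE = r"/scripts/[a-z0-9_-]+\.vbs"

paths = [
    "/scripts/mapdrives.vbs",   # plain file name
    "/scripts/home_017.vbs",    # underscore and digits are in the set
    "/scriptsdrives.vbs",       # missing slash after "scripts"
    "/scripts/test/run.vbs",    # a slash is not in the set
    "/scripts/policy.cgi",      # wrong extension
    "/scripts/.vbs",            # the plus requires at least one character
]
allowed = [p for p in paths if whole_match(SCRIPT_RULE, p)]
print(allowed)
```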
The excluding character set: This is very similar to the character set, except that the character at that position must not be any of the characters specified in the set. The excluding character set is defined by a caret (^) sign immediately following the opening square bracket.
The RegEx:
server-[^br]ed
(meaning server-, followed by any single character except the letters b or r (as always, not case sensitive), and then followed by ed)
Will match:
server-Ted, server-Med, server-8ED, and so on
Will not match:
serverTed, server-Red, server-Bed, and so on
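The excluding set can be tried out in the same way. The sketch below is Python, assuming whole-string, case-insensitive matching as an approximation of the behaviour described here; under that assumption the exclusion of b and r also covers B and R. The helper whole_match is hypothetical.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

names = ["server-Ted", "server-Med", "server-8ED",
         "serverTed", "server-Red", "server-Bed"]
for name in names:
    # [^br] accepts any single character except b or r (in either case here).
    print(name, whole_match(r"server-[^br]ed", name))
```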
Alternatives and sub-expressions: Alternatives are more than one option that could result in a match. The options in a RegEx alternative are divided by the pipe symbol or vertical bar (|). The options can be single characters or groups of characters (for example: green|yellow), and they can also be grouped in a sub-expression, where the sub-expression delimiters are the opening and closing round parenthesis characters, ( and ).
The RegEx:
/images/[a-z0-9_-]+\.(gif|jpg|jpeg)
(meaning /images/, followed by any letter, or digit, or underscore or dash, repeated one or more times, followed by a period and then by either gif or jpg or jpeg)
Will match:
/images/logo_small.gif, /images/corp17.jpeg, /images/87329.jpg, and so on
Will not match:
/image/line.jpg, /logo.gif, /images/left(a).png, and so on
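The following sketch shows why the alternatives are wrapped in a sub-expression. It is Python with whole-string, case-insensitive matching assumed as an approximation of the behaviour described here; the ungrouped expression is a deliberately wrong variant added for contrast, and whole_match is a hypothetical helper.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    # Assumption: the expression must cover the entire string, ignoring case.
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

GROUPED = r"/images/[a-z0-9_-]+\.(gif|jpg|jpeg)"
UNGROUPED = r"/images/[a-z0-9_-]+\.gif|jpg|jpeg"  # the pipe now splits the whole expression

paths = ["/images/logo_small.gif", "/images/corp17.jpeg", "/images/87329.jpg",
         "/image/line.jpg", "/logo.gif", "/images/left(a).png"]
accepted = [p for p in paths if whole_match(GROUPED, p)]
print(accepted)

# Without the parentheses the three options are: a full .gif path, the bare
# word "jpg", or the bare word "jpeg".
print(whole_match(UNGROUPED, "/images/87329.jpg"), whole_match(UNGROUPED, "jpg"))
```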
Before we conclude, let's try to understand how this works in the real world. The last sample above shows an expression you might need. The default ruleset for the portal allows only JPG files to be used, and will block other file extensions. If you need to customize the appearance, and your design uses GIF files instead of JPG, you will have to change some of the portal rules using the pipe symbol to also accept GIF.
Another example is when publishing major web applications, where we often need to specify many servers during the application-publishing wizard. Even using copy-and-paste, this is still a tedious and error-prone task, but RegEx allows us to reduce the risk and save time. Let's say that your organization uses a cluster of eight servers, named HRWEB01 to HRWEB08. Instead of manually entering all their names, you can simply use the RegEx HRWEB0[1-8].
The tricky part is to choose the best combination of literals and non-literals so that we get the appropriate coverage for everything, but nothing (or as little as possible) more. With time and practice, you will master this too. While experimenting, you might benefit from using a RegEx Evaluator, which can be a stand-alone program or an online one. We cannot recommend a specific one here, but use your favourite search engine (Bing, we are guessing) to find one. Should you need it, you can find more information about the UAG RegEx++ syntax here: http://technet.microsoft.com/en-us/library/dd282903.aspx.
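One way to check that an expression such as HRWEB0[1-8] covers everything intended and nothing more is to run it against a list of candidate names. The sketch below does this in Python, which is only a stand-in for a RegEx evaluator and not UAG's engine; whole-string, case-insensitive matching is assumed, and the candidate names HRWEB00, HRWEB09, and HRWEB10 are made up to probe the edges.

```python
import re

# Assumption: the expression must cover the entire server name, ignoring case.
pattern = re.compile(r"HRWEB0[1-8]", re.IGNORECASE)

# Candidate names HRWEB00 through HRWEB10, including some outside the cluster.
candidates = [f"HRWEB{n:02d}" for n in range(0, 11)]

covered = [name for name in candidates if pattern.fullmatch(name)]
not_covered = [name for name in candidates if not pattern.fullmatch(name)]

print("covered:    ", covered)
print("not covered:", not_covered)
```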