Python Automation Cookbook

By Jaime Buelta
About this book
Have you been doing the same old monotonous office work over and over again? Or have you been trying to find an easy way to make your life better by automating some of your repetitive tasks? Through a tried and tested approach, understand how to automate all the boring stuff using Python. The Python Automation Cookbook helps you develop a clear understanding of how to automate your business processes using Python, including detecting opportunities by scraping the web, analyzing information to generate automated spreadsheet reports with graphs, and communicating via automatically generated emails. You'll learn how to get notifications via text messages and run tasks while your mind is focused on other important activities, followed by understanding how to scan documents such as résumés. Once you've become familiar with the fundamentals, you'll be introduced to the world of graphs, along with studying how to produce organized charts using Matplotlib. In addition to this, you'll gain in-depth knowledge of how to generate rich graphics showing relevant information. By the end of this book, you'll have refined your skills by attaining a sound understanding of how to identify and correct problems to produce superior and reliable systems.
Publication date: September 2018
Publisher: Packt
Pages: 398
ISBN: 9781789133806

 

Let Us Begin Our Automation Journey

In this chapter, we'll cover the following recipes:

  • Creating a virtual environment
  • Installing third-party packages
  • Creating strings with formatted values
  • Manipulating strings
  • Extracting data from structured strings
  • Using a third-party tool—parse
  • Introducing regular expressions
  • Going deeper into regular expressions
  • Adding command-line arguments
 

Introduction

The objective of this chapter is to lay down some of the basic techniques that will be useful throughout this book. The main idea is to be able to create a good Python environment to run the automation tasks that will follow, and to be able to parse text inputs into structured data.

Python has a good amount of tools installed by default, but it also makes it easy to install third-party tools that can simplify common operations when processing texts. In this chapter, we'll see how to import modules from external sources and use them to leverage the full potential of Python.

The ability to structure input data is critical in any automation task. Most of the data that we will process in this book will come from unformatted sources such as web pages or text files. As the old computing adage goes, garbage in, garbage out, which makes sanitizing inputs a very important task.

 

Creating a virtual environment

As a first step when working with Python, it is a good practice to explicitly define the working environment. This helps with detaching from the operating system interpreter and environment, and properly defining the dependencies that will be used. Not doing so tends to generate chaotic scenarios. Remember, explicit is better than implicit!

This is especially important in two scenarios:

  • When dealing with multiple projects on the same computer, as they can have different dependencies that clash at some point. For example, two versions of the same module cannot be installed in the same environment.
  • When working on a project that will be used on a different computer, for example, developing some code on a personal laptop that will ultimately run on a remote server.
A common joke among developers is responding to a bug with it runs on my machine, meaning that it appears to work on their laptop, but not on the production servers. Although a huge number of factors can produce this error, a good practice is to produce an automatically replicable environment, reducing uncertainty over what dependencies are really being used.

This is easy to achieve using the venv module, which sets up a virtual environment so that none of the installed dependencies are shared with the Python interpreter installed globally on the machine.

In Python 3, the venv module is included in the standard library, which was not the case in previous versions, where the third-party virtualenv package had to be installed separately.

Getting ready

To create a new virtual environment, do the following:

  1. Go to the main directory that contains the project.
  2. Type the following command:
$ python3 -m venv .venv
This creates a subdirectory called .venv that contains the virtual environment.
The directory containing the virtual environment can be located anywhere. Keeping it on the same root keeps it handy, and adding a dot in front of it avoids it being displayed when running ls or other commands.
  3. Before activating the virtual environment, check the version of pip that is installed. This varies depending on your operating system, for example, 9.0.3 for macOS High Sierra 10.13.4. It will be upgraded later. Also check the referenced Python interpreter, which will be the main operating system one:
$ pip --version
pip 9.0.3 from /usr/local/lib/python3.6/site-packages/pip (python 3.6)
$ which python3
/usr/local/bin/python3

Now, your virtual environment is ready to go.

How to do it...

  1. Activate the virtual environment by running this:
$ source .venv/bin/activate

You'll notice that the prompt will display (.venv), showing that the virtual environment is active.

  2. Notice that the Python interpreter used is the one inside the virtual environment, and not the general operating system one from step 3 of the Getting ready section. Check the locations within the virtual environment:
(.venv) $ which python
/root_dir/.venv/bin/python
(.venv) $ which pip
/root_dir/.venv/bin/pip
  3. Upgrade the version of pip and check the version:
(.venv) $ pip install --upgrade pip
...
Successfully installed pip-10.0.1
(.venv) $ pip --version
pip 10.0.1 from /root_dir/.venv/lib/python3.6/site-packages/pip (python 3.6)
  4. Get out of the environment and check the pip version and the Python interpreter again. They will be the ones from before activating the virtual environment, as shown in step 3 of the Getting ready section. Note that they are different pip versions!
(.venv) $ deactivate 
$ which python3
/usr/local/bin/python3
$ pip --version
pip 9.0.3 from /usr/local/lib/python3.6/site-packages/pip (python 3.6)

How it works...

Notice that inside the virtual environment you can use python instead of python3, although python3 is available as well. This will use the Python interpreter defined in the environment.

In some systems like Linux, it's possible that you need to use python3.7 instead of python3. Verify that the Python interpreter you're using is 3.7 or higher.

Inside the virtual environment, step 3 of the How to do it... section installs the most recent version of pip, without affecting the external installation.

The virtual environment contains all the Python data in the .venv directory, and the activate script points all the environment variables there. The best thing about it is that it can be deleted and recreated very easily, removing the fear of experimenting in a self-contained sandbox.
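If you want to double-check which environment a running interpreter is using, a quick sketch of our own (not part of the recipe) is to inspect sys.prefix, which points to the directory of the active virtual environment. The path below follows the /root_dir example used above:

(.venv) $ python
>>> import sys
>>> sys.prefix
'/root_dir/.venv'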

Remember that the directory name is displayed in the prompt. If you need to differentiate the environment, use a descriptive directory name, such as .my_automate_recipe, or use the --prompt option.

There's more...

To remove a virtual environment, deactivate it and remove the directory:

(.venv) $ deactivate
$ rm -rf .venv

The venv module has more options, which can be shown with the -h flag:

$ python3 -m venv -h
usage: venv [-h] [--system-site-packages] [--symlinks | --copies] [--clear]
[--upgrade] [--without-pip] [--prompt PROMPT]
ENV_DIR [ENV_DIR ...]
Creates virtual Python environments in one or more target directories.
positional arguments:
ENV_DIR A directory to create the environment in.

optional arguments:
-h, --help show this help message and exit
--system-site-packages
Give the virtual environment access to the system
site-packages dir.
--symlinks Try to use symlinks rather than copies, when symlinks
are not the default for the platform.
--copies Try to use copies rather than symlinks, even when
symlinks are the default for the platform.
--clear Delete the contents of the environment directory if it
already exists, before environment creation.
--upgrade Upgrade the environment directory to use this version
of Python, assuming Python has been upgraded in-place.
--without-pip Skips installing or upgrading pip in the virtual
environment (pip is bootstrapped by default)
--prompt PROMPT Provides an alternative prompt prefix for this
environment.
Once an environment has been created, you may wish to activate it, for example, by
sourcing an activate script in its bin directory.

A convenient way of dealing with virtual environments, especially if you often have to swap between them, is using the virtualenvwrapper module:

  1. To install it, run this:
$ pip install virtualenvwrapper
  2. Then, add the following variables to your shell startup script, normally .bashrc or .bash_profile. The virtual environments will be installed under the WORKON_HOME directory instead of in the same directory as the project, as shown previously:
export WORKON_HOME=~/.virtualenvs
source /usr/local/bin/virtualenvwrapper.sh

Sourcing the startup script or opening a new Terminal will allow you to create new virtual environments:

$ mkvirtualenv automation_cookbook
...
Installing setuptools, pip, wheel...done.
(automation_cookbook) $ deactivate
$ workon automation_cookbook
(automation_cookbook) $

For more information, check the documentation of virtualenvwrapper at: https://virtualenvwrapper.readthedocs.io/en/latest/index.html.

Hitting the Tab key after workon autocompletes with the available environments.

See also

  • The Installing third-party packages recipe
  • The Using a third-party tool—parse recipe
 

Installing third-party packages

One of the strongest capabilities of Python is the ability to use an impressive catalog of third-party packages that cover an amazing amount of ground in different areas, from modules specialized in performing numerical operations, machine learning, and network communications, to command-line convenience tools, database access, image processing, and much more!

Most of them are available on the official Python Package Index (https://pypi.org/), which has more than 130,000 packages ready to use. In this book, we'll install some of them, and in general spending a little time researching external tools when trying to solve a problem is time well spent. It's very likely that someone else has created a tool that solves all, or at least part, of the problem.

As important as finding and installing a package is keeping track of which packages are being used. This greatly helps with replicability, meaning the ability to start the whole environment from scratch in any situation.

Getting ready

The starting point is to find a package that will be of use in our project.

A great one is requests, a module that deals with HTTP requests and is known for its easy and intuitive interface, as well as its great documentation. Take a look at the documentation, which can be found here: http://docs.python-requests.org/en/master/.

We'll use requests throughout this book when dealing with HTTP connections.

The next step will be to choose the version to use. In this case, the latest (2.18.4, at the time of writing) will be perfect. If the version of the module is not specified, by default, it will install the latest version, which can lead to inconsistencies in different environments.

We'll also use the great delorean module for time handling (version 1.0.0 http://delorean.readthedocs.io/en/latest/).

How to do it...

  1. Create a requirements.txt file in our main directory, which will specify all the requirements for our project. Let's start with delorean and requests:
delorean==1.0.0
requests==2.18.4
  2. Install all the requirements with the pip command:
$ pip install -r requirements.txt
...
Successfully installed babel-2.5.3 certifi-2018.4.16 chardet-3.0.4 delorean-1.0.0 humanize-0.5.1 idna-2.6 python-dateutil-2.7.2 pytz-2018.4 requests-2.18.4 six-1.11.0 tzlocal-1.5.1 urllib3-1.22
  3. You can now use both modules when using the virtual environment:
$ python
Python 3.6.5 (default, Mar 30 2018, 06:41:53)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import delorean
>>> import requests

How it works...

The requirements.txt file specifies the module and version, and pip performs a search on pypi.org.

Note that creating a new virtual environment from scratch and running the following will completely recreate your environment, which makes replicability very straightforward:

$ pip install -r requirements.txt

Note that step 2 of the How to do it... section automatically installs other modules that are dependencies, such as urllib3.

There's more...

If any of the modules needs to be changed to a different version, for example because a new version is available, change it in the requirements.txt file and run the install command again:

$ pip install -r requirements.txt

This is also applicable when a new module needs to be included.

At any point, the freeze command can be used to display all the installed modules. freeze returns the modules in a format compatible with requirements.txt, which makes it possible to generate a file that captures our current environment:

$ pip freeze > requirements.txt

This will include dependencies, so expect a lot more modules in the file.

Finding great third-party modules is not easy sometimes. Searching for specific functionality can work well, but sometimes there are great modules that are a surprise because they do things you never thought of. A great curated list is Awesome Python (https://awesome-python.com/), which covers a lot of great tools for common Python use cases, such as cryptography, database access, date and time handling, and so on.

In some cases, installing packages may require additional tools, such as compilers or a specific library that supports some functionality (for example, a particular database driver). If that's the case, the documentation will normally explain the dependencies.

See also

  • The Creating a virtual environment recipe
  • The Using a third-party tool—parse recipe
 

Creating strings with formatted values

One of the basic abilities when dealing with creating text and documents is to be able to properly format the values into structured strings. Python is quite smart in presenting good defaults, such as properly rendering a number, but there are a lot of options and possibilities.

We'll discuss some of the common options when creating formatted text with the example of a table.

Getting ready

The main tool to format strings in Python is the format method. It works with a defined mini-language to render variables this way:

result = template.format(*parameters)

The template is a string that gets interpreted based on the mini-language. At its easiest, it replaces the values between curly brackets with the parameters. Here are a couple of examples:

>>> 'Put the value of the string here: {}'.format('STRING')
"Put the value of the string here: STRING"
>>> 'It can be any type ({}) and more than one ({})'.format(1.23, str)
"It can be any type (1.23) and more than one (<class 'str'>)"
>>> 'Specify the order: {1}, {0}'.format('first', 'second')
'Specify the order: second, first'
>>> 'Or name parameters: {first}, {second}'.format(second='SECOND', first='FIRST')
'Or name parameters: FIRST, SECOND'

In 95% of cases, this formatting will be all that's required; keeping things simple is great! But for more complicated cases, such as aligning strings automatically and creating good-looking text tables, the format mini-language has more options.

How to do it...

  1. Write the following script, recipe_format_strings_step1.py, to print an aligned table:
# INPUT DATA
data = [
    (1000, 10),
    (2000, 17),
    (2500, 170),
    (2500, -170),
]
# Print the header for reference
print('REVENUE | PROFIT | PERCENT')

# This template aligns and displays the data in the proper format
TEMPLATE = '{revenue:>7,} | {profit:>+7} | {percent:>7.2%}'

# Print the data rows
for revenue, profit in data:
    row = TEMPLATE.format(revenue=revenue, profit=profit, percent=profit / revenue)
    print(row)
  2. Run it to display the following aligned table. Note that PERCENT is correctly displayed as a percentage:
REVENUE | PROFIT | PERCENT
  1,000 |     +10 |   1.00%
  2,000 |     +17 |   0.85%
  2,500 |    +170 |   6.80%
  2,500 |    -170 |  -6.80%

How it works...

The TEMPLATE constant contains three columns, each one properly named (revenue, profit, and percent). This makes it more explicit and straightforward to apply the template in the format call.

After the name of the parameter, there's a colon that separates it from the format definition. Note that all of this is inside the curly brackets. In all columns, the format specification sets the width to seven characters and aligns the values to the right with the > symbol:

  • Revenue adds a thousands separator with the , symbol—[{revenue:>7,}].
  • Profit adds a + sign for positive values. A - for negatives is added automatically—[{profit:>+7}].
  • Percent displays a percent value, with a precision of two decimal places—[{percent:>7.2%}]. This is done through .2 (the precision) and adding a % symbol for the percentage. Each specifier can also be tried on its own, as shown in the short interpreter session below.
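To see how each specifier behaves on its own, here is a brief interpreter sketch (not part of the original recipe) applying the same format specifications to individual values:

>>> '{:>7,}'.format(2500)
'  2,500'
>>> '{:>+7}'.format(-170)
'   -170'
>>> '{:>7.2%}'.format(-170 / 2500)
' -6.80%'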

There's more...

You may have also seen the Python formatting available with the % operator. While it works for simple formatting, it is less flexible than the format mini-language, and it is not recommended for use.

A great new feature since Python 3.6 is to use f-strings, which perform a format action using defined variables this way:

>>> param1 = 'first'
>>> param2 = 'second'
>>> f'Parameters {param1}:{param2}'
'Parameters first:second'

This simplifies a lot of the code and allows us to create very descriptive and readable code.

Be careful when using f-strings to ensure that the string is replaced at the proper time. A common problem is that the variable to be rendered is not yet defined. For example, TEMPLATE, defined previously, couldn't be written as an f-string, as revenue and the rest of the parameters are not available at that point.
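As a minimal sketch of that pitfall (the variable name is made up for illustration), an f-string fails immediately if the value isn't defined yet, while a regular template string can be rendered later with format:

>>> f'Revenue: {revenue_total:>7,}'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'revenue_total' is not defined
>>> template = 'Revenue: {revenue_total:>7,}'
>>> revenue_total = 2500
>>> template.format(revenue_total=revenue_total)
'Revenue:   2,500'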

If you need to write a literal curly bracket, you'll need to duplicate it. Note that each duplication will be displayed as a single curly bracket, plus a curly bracket for the value replacement, making a total of three brackets:

>>> value = 'VALUE'
>>> f'This is the value, in curly brackets {{{value}}}'
'This is the value, in curly brackets {VALUE}'

This allows us to create meta templates—templates that produce templates. In some cases, that will be useful, but try to limit their use, as they'll get complicated very quickly, producing code that will be difficult to read.
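As a quick, illustrative sketch of our own, an f-string can build a template that is then rendered later with format:

>>> column = 'revenue'
>>> meta_template = f'{{{column}:>10,}}'
>>> meta_template
'{revenue:>10,}'
>>> meta_template.format(revenue=1234567)
' 1,234,567'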

The Python Format Specification mini-language has more options than the ones shown here.

As the language tries to be quite concise, sometimes it can be difficult to determine the position of the symbols. You may sometimes ask yourself questions like: does the + symbol go before or after the width parameter? Read the documentation with care and remember to always include a colon before the format specification.

Please check the full documentation and examples on the Python website (https://docs.python.org/3/library/string.html#formatspec).

See also

  • The Template Reports recipe in Chapter 5, Generating Fantastic Reports
  • The Manipulating strings recipe
 

Manipulating strings

A basic ability when dealing with text is to be able to properly manipulate that text. That means being able to join it, split it into regular chunks, or change it to uppercase or lowercase. We'll discuss more advanced methods for parsing and separating text later, but in lots of cases it is useful to divide a paragraph into lines, sentences, or even words. Other times, words will need to have some characters removed, or replaced with a canonical version, so that they can be compared with a predetermined value.

Getting ready

We'll split a basic piece of text into its main components, and then we'll reconstruct it. As an example, a report needs to be transformed into a new format to be sent via email.

The input format we'll use in this example will be this:

    AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
WITH THE OBJECTIVES FOR THE YEAR. THE MAIN DRIVER OF THE SALES HAS BEEN
THE NEW PACKAGE DESIGNED UNDER THE SUPERVISION OF OUR MARKETING DEPARTMENT.
OUR EXPENSES HAS BEEN CONTAINED, INCREASING ONLY BY 0.7%, THOUGH THE BOARD
CONSIDERS IT NEEDS TO BE FURTHER REDUCED. THE EVALUATION IS SATISFACTORY
AND THE FORECAST FOR THE NEXT QUARTER IS OPTIMISTIC. THE BOARD EXPECTS
AN INCREASE IN PROFIT OF AT LEAST 2 MILLION DOLLARS.

We need to redact the text to eliminate any references to numbers. It also needs to be properly formatted by adding a new line after each period, wrapped at 80 characters, and transformed into ASCII for compatibility reasons.

The text will be stored in the INPUT_TEXT variable in the interpreter.

How to do it...

  1. After entering the text, split it into individual words:
>>> INPUT_TEXT = '''
... AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
... HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
...
'''
>>> words = INPUT_TEXT.split()
  2. Replace any numerical digits with an 'X' character:
>>> redacted = [''.join('X' if w.isdigit() else w for w in word) for word in words]
  3. Transform the text into pure ASCII (note that the name of the company contains a letter, ñ, which is not ASCII):
>>> ascii_text = [word.encode('ascii', errors='replace').decode('ascii')
... for word in redacted]
  4. Group the words into 80-character lines:
>>> newlines = [word + '\n' if word.endswith('.') else word for word in ascii_text]
>>> LINE_SIZE = 80
>>> lines = []
>>> line = ''
>>> for word in newlines:
...     if line.endswith('\n') or len(line) + len(word) + 1 > LINE_SIZE:
...         lines.append(line)
...         line = ''
...     line = line + ' ' + word
  5. Format all lines as titles and join them as a single piece of text:
>>> lines = [line.title() for line in lines]
>>> result = '\n'.join(lines)
  6. Print the result:
>>> print(result)
After The Close Of The Second Quarter, Our Company, Casta?Acorp Has Achieved A
Growth In The Revenue Of X.Xx%.

This Is In Line With The Objectives For The Year.

The Main Driver Of The Sales Has Been The New Package Designed Under The
Supervision Of Our Marketing Department.

Our Expenses Has Been Contained, Increasing Only By X.X%, Though The Board
Considers It Needs To Be Further Reduced.

The Evaluation Is Satisfactory And The Forecast For The Next Quarter Is
Optimistic.

How it works...

Each of the steps performs a specific transformation of the text:

  • The first one splits the text on the default separators: whitespace and newlines. This splits it into individual words, with no newlines or multiple spaces separating them.
  • To replace the digits, we go through every character of each word. For each one, if it's a digit, an 'X' is returned instead. This is done with two list comprehensions, one to run on the list, and another on each word, replacing only if there's a digit—['X' if w.isdigit() else w for w in word]. Note that the words are joined together again.
  • Each of the words is encoded into an ASCII byte sequence and decoded back again into the Python string type. Note the use of the errors parameter to force the replacement of unknown characters such as ñ.
The difference between strings and bytes is not very intuitive at first, especially if you never have to worry about multiple languages or encoding transformation. In Python 3, there's a strong separation between strings (internal Python representation) and bytes, so most of the tools applicable to strings won't be available in byte objects. Unless you have a good idea of why you need a byte object, always work with Python strings. If you need to perform transformations like the one in this task, encode and decode in the same line so that you keep your objects in the comfortable realm of Python strings. If you are interested in learning more about encodings, you can check out this brief article (https://eli.thegreenplace.net/2012/01/30/the-bytesstr-dichotomy-in-python-3) and this other longer and more detailed one (http://www.diveintopython3.net/strings.html).
  • This step first adds an extra newline character (the \n character) for all words ending with a period. This marks the different paragraphs. After that, it creates a line and adds the words one by one. If an extra word will make it go over 80 characters, it finishes the line and starts a new one. If the line already ends with a new line, it finishes it and starts another one as well. Note that there's an extra space added to separate the words.
  • Finally, each of the lines is capitalized as a Title (the first letter of each word is upper cased) and all the lines are joined through new lines.

There's more...

Some other useful operations that can be performed on strings are as follows:

  • Strings can be sliced like any other list. This means that 'word'[0:2] will return 'wo'.
  • Use .splitlines() to separate lines by newline character.
  • There are .upper() and .lower() methods, which return a copy with all the characters set to uppercase or lowercase. Their use is very similar to .title():
>>> 'UPPERCASE'.lower()
'uppercase'
  • For easy replacements (for example, change all A to B or change mine to ours), use .replace(). This method is useful for very simple cases, but replacements can get tricky easily. Be careful with the order of replacements to avoid collisions and case sensitivity issues. Note the wrong replacement in the following example:
>>> 'One ring to rule them all, one ring to find them, One ring to bring them all and in the darkness bind them.'.replace('ring', 'necklace')
'One necklace to rule them all, one necklace to find them, One necklace to bnecklace them all and in the darkness bind them.'

This is similar to the issues we'll see with regular expressions matching unexpected parts of our strings.

There are more examples to follow later. Refer to the regular expressions recipes for more information.

If you work with multiple languages, or with any kind of non-English input, it is very useful to learn the basics of Unicode and encodings. In a nutshell, given the vast amount of characters in all the different languages in the world, including alphabets not related to the Latin one, such as Chinese or Arabic, there's a standard to try and cover all of them so that computers can properly understand them. Python 3 greatly improved this situation, making the strings internal objects to deal with all of those characters. The encoding that Python uses, and the most common and compatible one, is currently UTF-8.

Dealing with encodings is still relevant when reading from external files that can be encoded in different encodings (for example, CP-1252 or windows-1252, which is a common encoding produced by legacy Microsoft systems, or ISO 8859-15, which is the industry standard).
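As a small sketch (the filename here is hypothetical), the encoding of an external file can be stated explicitly when opening it, so that its content arrives as a proper Python string:

>>> # 'legacy_report.txt' is a made-up example file
>>> with open('legacy_report.txt', encoding='iso-8859-15') as source:
...     content = source.read()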

See also

  • The Creating strings with formatted values recipe
  • The Introducing regular expressions recipe
  • The Going deeper into regular expressions recipe
  • The Dealing with Encodings recipe in Chapter 4, Searching and Reading Local Files
 

Extracting data from structured strings

In a lot of automated tasks, we'll need to treat input text that's in a particular format and extract the relevant information. For example, a spreadsheet may define a percentage in text (such as 37.4%) that we want to retrieve in numerical format to apply it later (0.374, as a float).

In this recipe, we'll see how to process sale logs that contain inline information about a product, such as sold, price, profit, and some other information.

Getting ready

Imagine that we need to parse information stored in sales logs. We'll use a sales log with the following structure:

[<Timestamp in iso format>] - SALE - PRODUCT: <product id> - PRICE: $<price of the sale>

For example, a specific log may look like this:

[2018-05-05T10:58:41.504054] - SALE - PRODUCT: 1345 - PRICE: $09.99

Note that the price has a leading zero. All prices will have two digits for the dollars, and two for the cents.

We need to activate our virtual environment before we start:

$ source .venv/bin/activate

How to do it...

  1. In the Python interpreter, make the following imports. Remember to activate your virtualenv, as described in the Creating a virtual environment recipe:
>>> import delorean
>>> from decimal import Decimal
  2. Enter the log to parse:
>>> log = '[2018-05-05T11:07:12.267897] - SALE - PRODUCT: 1345 - PRICE: $09.99'
  3. Split the log into its parts, which are divided by - (note the space before and after the dash). We ignore the SALE part as it doesn't add any relevant information:
>>> divide_it = log.split(' - ')
>>> timestamp_string, _, product_string, price_string = divide_it
  4. Parse the timestamp into a datetime object:
>>> timestamp = delorean.parse(timestamp_string.strip('[]'))
  5. Parse the product_id into an integer:
>>> product_id = int(product_string.split(':')[-1])
  6. Parse the price into a Decimal type:
>>> price = Decimal(price_string.split('$')[-1])
  7. Now, you have all the values in native Python formats:
>>> timestamp, product_id, price
(Delorean(datetime=datetime.datetime(2018, 5, 5, 11, 7, 12, 267897), timezone='UTC'), 1345, Decimal('9.99'))

How it works...

The basic idea is to isolate each of the elements and then parse them into the proper type. The first step is to split the full log into smaller parts. The ' - ' string is a good divider, as it splits the log into four parts—a timestamp, one with just the word SALE, the product, and the price.

In the case of the timestamp, we need to isolate the ISO format, which is between brackets in the log. That's why the brackets are stripped off. We use the delorean module (introduced earlier) to parse it into a datetime object.

The word SALE is ignored. There's no relevant information there.

To isolate the product ID, we split the product part at the colon. Then, we parse the last element as an integer:

>>> product_string.split(':')
['PRODUCT', ' 1345']
>>> int(' 1345')
1345

To divide the price, we use the dollar sign as a separator, and parse the result as a Decimal:

>>> price_string.split('$')
['PRICE: ', '09.99']
>>> Decimal('09.99')
Decimal('9.99')

As described in the next section, do not parse this value into a float type.

There's more...

These log elements can be combined together into a single object, helping with parsing and aggregating them. For example, we could define a class in Python code in the following way:

class PriceLog(object):
    def __init__(self, timestamp, product_id, price):
        self.timestamp = timestamp
        self.product_id = product_id
        self.price = price

    def __repr__(self):
        return '<PriceLog ({}, {}, {})>'.format(self.timestamp,
                                                self.product_id,
                                                self.price)

    @classmethod
    def parse(cls, text_log):
        '''
        Parse from a text log with the format
        [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
        to a PriceLog object
        '''
        divide_it = text_log.split(' - ')
        tmp_string, _, product_string, price_string = divide_it
        timestamp = delorean.parse(tmp_string.strip('[]'))
        product_id = int(product_string.split(':')[-1])
        price = Decimal(price_string.split('$')[-1])
        return cls(timestamp=timestamp, product_id=product_id, price=price)

So, the parsing can be done as follows:

>>> log = '[2018-05-05T12:58:59.998903] - SALE - PRODUCT: 897 - PRICE: $17.99'
>>> PriceLog.parse(log)
<PriceLog (Delorean(datetime=datetime.datetime(2018, 5, 5, 12, 58, 59, 998903), timezone='UTC'), 897, 17.99)>

Avoid using float types for prices. Float numbers have precision problems that may produce strange errors when aggregating multiple prices, for example:

>>> 0.1 + 0.1 + 0.1 
0.30000000000000004

Try these two options to avoid problems:

  • Use integer cents as the base unit: This means multiplying currency inputs by 100 and transforming them into integers (or whatever fractional unit is correct for the currency used). You may still want to change the base when displaying them.
  • Parse into the Decimal type: The Decimal type keeps the fixed precision and works as you'd expect. You can find further information about the Decimal type in the Python docs at https://docs.python.org/3.6/library/decimal.html.
If you use the Decimal type, parse the results directly into Decimal from the string. If you transform the value into a float first, you may carry the precision errors over to the new type. A short sketch of both approaches follows this note.
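Here is a minimal, illustrative sketch (not from the recipe) of both approaches applied to the price string parsed earlier:

>>> from decimal import Decimal
>>> raw_price = '09.99'
>>> # Option 1: work in integer cents
>>> int(Decimal(raw_price) * 100)
999
>>> # Option 2: keep the value as a Decimal
>>> Decimal(raw_price) + Decimal('0.01')
Decimal('10.00')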

See also

  • The Creating a virtual environment recipe
  • The Using a third-party tool—parse recipe
  • The Introducing regular expressions recipe
  • The Going deeper into regular expressions recipe
 

Using a third-party tool—parse

While manually parsing data, as seen in the previous recipe, works very well for small strings, it can be very laborious to tweak the exact formula to work with a variety of inputs. What if the input sometimes has an extra dash? Or what if it has a variable-length header depending on the size of one of the fields?

A more advanced option is to use regular expressions, as we'll see in the next recipe. But there's a great module in Python called parse (https://github.com/r1chardj0n3s/parse) that allows us to reverse format strings. It is a fantastic tool that's powerful, easy to use, and that greatly improves the readability of the code.

Getting ready

Add the parse module to the requirements.txt file in our virtual environment and reinstall the dependencies, as shown in the Creating a virtual environment recipe.

The requirements.txt file should look like this:

delorean==1.0.0
requests==2.18.4
parse==1.8.2

Then, reinstall the modules in the virtual environment:

$ pip install -r requirements.txt
...
Collecting parse==1.8.2 (from -r requirements.txt (line 3))
Using cached https://files.pythonhosted.org/packages/13/71/e0b5c968c552f75a938db18e88a4e64d97dc212907b4aca0ff71293b4c80/parse-1.8.2.tar.gz
...
Installing collected packages: parse
Running setup.py install for parse ... done
Successfully installed parse-1.8.2

How to do it...

  1. Import the parse function:
>>> from parse import parse
  2. Define the log to parse, in the same format as in the Extracting data from structured strings recipe:
>>> LOG = '[2018-05-06T12:58:00.714611] - SALE - PRODUCT: 1345 - PRICE: $09.99'
  3. Analyze it and describe it as you would when trying to print it, like this:
>>> FORMAT = '[{date}] - SALE - PRODUCT: {product} - PRICE: ${price}'
  4. Run parse and check the results:
>>> result = parse(FORMAT, LOG)
>>> result
<Result () {'date': '2018-05-06T12:58:00.714611', 'product': '1345', 'price': '09.99'}>
>>> result['date']
'2018-05-06T12:58:00.714611'
>>> result['product']
'1345'
>>> result['price']
'09.99'
  5. Note that the results are all strings. Define the types to be parsed:
>>> FORMAT = '[{date:ti}] - SALE - PRODUCT: {product:d} - PRICE: ${price:05.2f}'
  6. Parse once again:
>>> result = parse(FORMAT, LOG)
>>> result
<Result () {'date': datetime.datetime(2018, 5, 6, 12, 58, 0, 714611), 'product': 1345, 'price': 9.99}>
>>> result['date']
datetime.datetime(2018, 5, 6, 12, 58, 0, 714611)
>>> result['product']
1345
>>> result['price']
9.99
  7. Define a custom type for the price to avoid issues with the float type:
>>> from decimal import Decimal
>>> def price(string):
... return Decimal(string)
...
>>> FORMAT = '[{date:ti}] - SALE - PRODUCT: {product:d} - PRICE: ${price:price}'
>>> parse(FORMAT, LOG, {'price': price})
<Result () {'date': datetime.datetime(2018, 5, 6, 12, 58, 0, 714611), 'product': 1345, 'price': Decimal('9.99')}>

How it works...

The parse module allows us to define a format, as a string, that reverses the format method when parsing values. A lot of the concepts that we discussed when creating strings apply here: put values in curly brackets, define the type after a colon, and so on.

By default, as seen in step 4, the values are parsed as strings. This is a good starting point when analyzing text. The values can be parsed into more useful native types, as shown in steps 5 and 6 in the How to do it... section. Please note that while most of the parsing types are the same as the ones in the Python Format Specification mini-language, there are some others available, such as ti for timestamps in ISO format.

If native types are not enough, our own parsing can be defined, as demonstrated in step 7 in the How to do it... section. Note that the definition of the price function gets a string and returns the proper format, in this case a Decimal type.

All the issues about floats and price information described in the There's more section of the Extracting data from structured strings recipe apply here as well.

There's more...

The timestamp can also be translated into a delorean object for consistency. Also, delorean objects carry over timezone information. Adding the same structure as in the previous recipe gives the following object, which is capable of parsing logs:

class PriceLog(object):
    def __init__(self, timestamp, product_id, price):
        self.timestamp = timestamp
        self.product_id = product_id
        self.price = price

    def __repr__(self):
        return '<PriceLog ({}, {}, {})>'.format(self.timestamp,
                                                self.product_id,
                                                self.price)

    @classmethod
    def parse(cls, text_log):
        '''
        Parse from a text log with the format
        [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
        to a PriceLog object
        '''
        def price(string):
            return Decimal(string)

        def isodate(string):
            return delorean.parse(string)

        FORMAT = ('[{timestamp:isodate}] - SALE - PRODUCT: {product:d} - '
                  'PRICE: ${price:price}')
        formats = {'price': price, 'isodate': isodate}
        result = parse.parse(FORMAT, text_log, formats)
        return cls(timestamp=result['timestamp'],
                   product_id=result['product'],
                   price=result['price'])

So, parsing it returns similar results:

>>> log = '[2018-05-06T14:58:59.051545] - SALE - PRODUCT: 827 - PRICE: $22.25'
>>> PriceLog.parse(log)
<PriceLog (Delorean(datetime=datetime.datetime(2018, 6, 5, 14, 58, 59, 51545), timezone='UTC'), 827, 22.25)>

This code is contained in the GitHub file Chapter01/price_log.py.

All parse supported types can be found in the documentation at https://github.com/r1chardj0n3s/parse#format-specification.

See also

  • The Extracting data from structured strings recipe
  • The Introducing regular expressions recipe
  • The Going deeper into regular expressions recipe
 

Introducing regular expressions

A regular expression, or regex, is a pattern to match text. In other words, it allows us to define an abstract string (typically the definition of a structured kind of text) to check other strings against, to see whether they match it or not.

It is better to describe them with an example. Think of defining a pattern of text as a word that starts with an uppercase A and contains only lowercase Ns and As after that. The word Anna matches it, but Bob, Alice, and James do not. The words Aaan, Ana, Annnn, and Aaaan will also be matches, but ANNA won't.
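Jumping ahead slightly to the re module introduced below, a minimal sketch of that example pattern (the regex itself is our own illustration) could look like this:

>>> import re
>>> pattern = r'\bA[an]*\b'
>>> [bool(re.search(pattern, word)) for word in ('Anna', 'Bob', 'Alice', 'ANNA', 'Aaan')]
[True, False, False, False, True]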

If this sounds complicated, that's because it is. Regexes can be notoriously complicated because they may be incredibly intricate and difficult to follow. But they are very useful, because they allow us to perform incredibly powerful pattern matching.

Some common uses of regexes are as follows:

  • Validating input data: For example, that a phone number is only numbers, dashes, and brackets.
  • String parsing: Retrieve data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
  • Scraping: Find the occurrences of something in a long text. For example, find all emails in a web page.
  • Replacement: Find and replace a word or words with others. For example, replace the owner with John Smith. A short re.sub sketch follows the quote below.
"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
– Jamie Zawinski
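As a quick, illustrative sketch of the replacement use case listed above (using re.sub, which is not covered in detail in this chapter):

>>> import re
>>> re.sub(r'the owner', 'John Smith', 'Contact the owner for details')
'Contact John Smith for details'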

Regular expressions are at their best when they are kept very simple. In general, if there is a specific tool to do it, prefer it over regexes. A very clear example of this is HTML parsing; check Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.

Some text editors allow us to search using regexes as well. While most are editors aimed at writing code, such as Vim, BBEdit, or Notepad++, they're also present in more general tools, such as MS Office, Open Office, or Google Documents. But be careful, as the particular syntax may be slightly different.

Getting ready

The Python module to deal with regexes is called re. The main function we'll cover is re.search(), which returns a match object with information about what matched the pattern.

As regex patterns are also defined as strings, we'll differentiate them by prefixing them with an r, such as r'pattern'. This is the Python way of labeling a string as a raw string literal, meaning that the string within is taken literally, without any escaping. This means that a \ is used as a plain backslash instead of starting an escape sequence. For example, without the r prefix, \n means the newline character.
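A quick interpreter check (a small sketch of our own) shows the difference: the escaped version is a single character, while the raw version keeps the backslash and the n:

>>> len('\n'), len(r'\n')
(1, 2)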

Some characters are special, and refer to concepts such as the end of the string, any digit, any character, any whitespace character, and so on.

The simplest form is just a literal string. For example, the regex pattern r'LOG' matches the string 'LOGS', but not the string 'NOT A MATCH'. If there's no match, search returns None:

>>> import re
>>> re.search(r'LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'LOG', 'NOT A MATCH')
>>>

How to do it...

  1. Import the re module:
>>> import re
  2. Then, match a pattern that is not at the start of the string:
>>> re.search(r'LOG', 'SOME LOGS')
<_sre.SRE_Match object; span=(5, 8), match='LOG'>
  3. Match a pattern that is only at the start of the string. Note the ^ character:
>>> re.search(r'^LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'^LOG', 'SOME LOGS')
>>>
  4. Match a pattern only at the end of the string. Note the $ character:
>>> re.search(r'LOG$', 'SOME LOG')
<_sre.SRE_Match object; span=(5, 8), match='LOG'>
>>> re.search(r'LOG$', 'SOME LOGS')
>>>
  5. Match the word 'thing' (not excluding things), but not something or anything. Note the \b at the start of the second pattern:
>>> STRING = 'something in the things she shows me'
>>> match = re.search(r'thing', STRING)
>>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
('some', 'thing', ' in the things she shows me')
>>> match = re.search(r'\bthing', STRING)
>>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
('something in the ', 'thing', 's she shows me')

  6. Match a pattern that's only numbers and dashes (for example, a phone number). Retrieve the matched string:
>>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(20, 32), match='1234-567-890'>
>>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890').group()
'1234-567-890'
  7. Match an email address naively:
>>> re.search(r'\S+@\S+', 'my email is email.123@test.com').group()
'email.123@test.com'

How it works...

The re.search function matches a pattern, no matter its position in the string. As explained previously, this will return None if the pattern is not found, or a match object.

The following special characters are used:

  • ^: Marks the start of the string
  • $: Marks the end of the string
  • \b: Marks the start or end of a word
  • \S: Marks any character that's not a whitespace, including special characters

More special characters are shown in the next recipe.

In step 6 in the How to do it... section, the r'[0123456789-]+' pattern is composed of two parts. The first one is between square brackets, and matches any single character between 0 and 9 (any number) and the dash (-) character. The + sign after that means that this character can be present one or more times. This is called a quantifier in regexes. This makes a match on any combination of numbers and dashes, no matter how long it is.

Step 7 again uses the + sign to match as many characters as necessary before the @ and again after it. In this case, the character match is \S, which matches any non-whitespace character.

Please note that the pattern for emails described here is very naive, as it will match invalid emails such as john@smith@test.com. A better regex for most uses is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)". You can go to http://emailregex.com/ to find it, along with links to more information.

Note that parsing a valid email including corner cases is actually a difficult and challenging problem. The previous regex should be fine for most uses covered in this book, but in a general framework project such as Django, email validation is a very long and very unreadable regex.

The resulting matching object returns the position where the matched pattern starts and ends (using the start and end methods), as shown in step 5, which splits the string into matched parts, showing the distinction between the two matching patterns.

The difference displayed in step 5 is a very common one. Trying to capture GP can end up capturing eggplant and bagpipe! Similarly, thing\b won't capture things. Be sure to test and make the proper adjustments, such as capturing \bGP\b for just the word GP.

The specific matched pattern can be retrieved by calling group(), as shown in step 6. Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, such as by splitting the phone number into groups by dashes, for example:

>>> match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
>>> [int(n) for n in match.group().split('-')]
[1234, 567, 890]

There's more...

Dealing with regexes can be difficult and complex. Please allow time to test your matches and be sure that they work as you expect in order to avoid nasty surprises.

You can check your regexes interactively with some tools. A good one that's freely available online is https://regex101.com/, which displays each of the elements and explains the regex. Double-check that you're using the Python flavor. For the \bthing pattern from step 5, for example, the explanation panel shows that \b matches a word boundary (the start or end of a word), and that thing matches those literal characters.

Regexes, in some cases, can be very slow, or even produce what's called a regex denial-of-service: a string crafted to confuse a particular regex so that it takes an enormous amount of time to evaluate, in the worst case even blocking the computer. While automating tasks probably won't get you into those problems, keep an eye out in case a regex takes too long.

See also

  • The Extracting data from structured strings recipe
  • The Using a third-party tool—parse recipe
  • The Going deeper into regular expressions recipe
 

Going deeper into regular expressions

In this recipe, we'll see more about how to deal with regular expressions. After introducing the basics, we will dig a little deeper into pattern elements, introduce groups as a better way to retrieve and parse strings, see how to search for multiple occurrences of the same string, and deal with longer texts.

How to do it...

  1. Import re:
>>> import re
  2. Match a phone pattern as part of a group (in brackets). Note the use of \d as a special character for any digit:
>>> match = re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
>>> match.group()
'the phone number is 1234-567-890'
>>> match.group(1)
'1234-567-890'
  3. Compile a pattern and capture a case insensitive pattern with a yes|no option:
>>> pattern = re.compile(r'The answer to question (\w+) is (yes|no)', re.IGNORECASE)
>>> pattern.search('Naturaly, the answer to question 3b is YES')
<_sre.SRE_Match object; span=(10, 42), match='the answer to question 3b is YES'>
>>> _.groups()
('3b', 'YES')
  4. Match all the occurrences of cities and state abbreviations in the text. Note that they are separated by a single character and the name of the city always starts with an uppercase letter. Only four states are matched for simplicity:
>>> PATTERN = re.compile(r'([A-Z][\w\s]+).(TX|OR|OH|MI)')
>>> TEXT ='the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
>>> list(PATTERN.finditer(TEXT))
[<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>, <_sre.SRE_Match object; span=(73, 85), match='Corvallis OR'>, <_sre.SRE_Match object; span=(113, 122), match='Toledo.OH'>, <_sre.SRE_Match object; span=(157, 172), match='Grand Rapids,MI'>]
>>> _[0].groups()
('Odessa', 'TX')

How it works...

The new special characters that were introduced are as follows. Note that the same letter in uppercase or lowercase means the opposite match; for example, \d matches a digit, while \D matches a non-digit:

  • \d: Marks any digit (0 to 9).
  • \s: Marks any character that's a whitespace, including tabs and other whitespace special characters. Note that this is the reverse of \S, introduced in the previous recipe.
  • \w: Marks any letter (includes digits, but excludes characters such as periods).
  • .: Marks any character.

To define groups, put the part of the pattern to capture in brackets. Groups can be retrieved individually, making them perfect for matching a bigger pattern that contains a variable part that we'll treat later, as demonstrated in step 2. Note the difference with the step 6 pattern in the previous recipe. In this case, the pattern is not only the number, but includes the prefix, even if we then extract only the number. Check out this difference, where there's a number that's not the number we want to capture:

>>> re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(4, 36), match='the phone number is 1234-567-890'>
>>> _.group(1)
'1234-567-890'
>>> re.search(r'[0123456789-]+', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(0, 2), match='37'>
>>> _.group()
'37'

Remember that group 0 (.group() or .group(0)) is always the whole match. The rest of the groups are ordered as they appear.

Patterns can be compiled as well. This saves some time if the pattern needs to be matched over and over. To use it that way, compile the pattern and then use that object to perform searches, as shown in steps 3 and 4. Some extra flags can be added, such as making the pattern case insensitive.
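
As a quick sketch of that reuse (the phone lines below are made up for illustration), a compiled pattern can be applied line by line without re-parsing the expression each time:

>>> import re
>>> PHONE = re.compile(r'[\d-]+')
>>> lines = ['call me at 1234-567-890', 'no number here', 'office: 321-654-987']
>>> [PHONE.search(line).group() for line in lines if PHONE.search(line)]
['1234-567-890', '321-654-987']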

Step 4's pattern requires a bit of explanation. It's composed of two groups separated by a single character. The special character . matches any character; in our example, it matches a period, a whitespace, and a comma. The second group is a straightforward selection of defined options, in this case US state abbreviations.

The first group starts with an uppercase letter ([A-Z]), and accepts any combination of letters or spaces ([\w\s]+), but not punctuation marks such as periods or commas. This matches the cities, including when composed of more than one word.

Note that this pattern starts at any uppercase letter and keeps matching until it finds a state, unless it's separated by a punctuation mark, which may not be what's expected, for example:

>>> re.search(r'([A-Z][\w\s]+).(TX|OR|OH|MI)', 'This is a test, Escanaba MI')
<_sre.SRE_Match object; span=(16, 27), match='Escanaba MI'>
>>> re.search(r'([A-Z][\w\s]+).(TX|OR|OH|MI)', 'This is a test with Escanaba MI')
<_sre.SRE_Match object; span=(0, 31), match='This is a test with Escanaba MI'>

Step 4 also shows how to find more than one occurrence in a long text. While the .findall() method exists, it doesn't return the full match object, whereas .finditer() does. .finditer() returns an iterator that can be used in a for loop or list comprehension. Note that .search() returns only the first occurrence of the pattern, even if more matches appear:

>>> PATTERN.search(TEXT)
<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>
>>> PATTERN.findall(TEXT)
[('Odessa', 'TX'), ('Corvallis', 'OR'), ('Toledo', 'OH'), ('Grand Rapids', 'MI')]

There's more...

The special characters can be reversed by swapping their case. For example, the reverses of the ones we used are as follows:

  • \D: Marks any non-digit
  • \W: Marks any non-letter
  • \B: Marks any character that's not at the start or end of a word

The most commonly used special characters are typically \d (digits) and \w (letters and digits), as they mark common patterns to search for, usually combined with the plus sign (+) to match one or more occurrences.
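
As a quick illustration (the sample string is made up, not from the recipe steps), compare \w, \d, and the uppercase reverse \W:

>>> import re
>>> re.findall(r'\w+', 'the answer is 42!')
['the', 'answer', 'is', '42']
>>> re.findall(r'\d+', 'the answer is 42!')
['42']
>>> re.findall(r'\W+', 'the answer is 42!')
[' ', ' ', ' ', '!']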

Groups can be assigned names as well. This makes them more explicit, at the expense of making the pattern more verbose, using the form (?P<groupname>PATTERN). Groups can then be referred to by name with .group('groupname'), or all retrieved at once by calling .groupdict(), while still keeping their numeric positions.

For example, the step 4 pattern can be described as follows:

>>> PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MI)')
>>> match = PATTERN.search(TEXT)
>>> match.groupdict()
{'city': 'Odessa', 'state': 'TX'}
>>> match.group('city')
'Odessa'
>>> match.group('state')
'TX'
>>> match.group(1), match.group(2)
('Odessa', 'TX')

Regular expressions are a very extensive topic. There are whole technical books devoted to them, and they can be notoriously deep. The Python documentation (https://docs.python.org/3/library/re.html) is a good reference and a good place to learn more.

If you feel a little intimidated at the start, that's perfectly natural. Analyze each pattern with care, dividing it into its different parts, and it will start to make sense. Don't be afraid to run a regex interactive analyzer!

Regexes can be really powerful and generic, but they may not be the proper tool for what you are trying to achieve. We've seen some caveats and patterns that have subtleties. As a rule of thumb, if a pattern starts to feel complicated, it's time to search for a different tool. Remember the previous recipes as well and the options they presented, such as parse.

See also

  • The Introducing regular expressions recipe
  • The Using a third-party tool—parse recipe
 

Adding command-line arguments

A lot of tasks are best structured as a command-line interface that accepts different parameters to change the way it works, for example, scraping one web page or another. Python includes the powerful argparse module in the standard library to create rich command-line argument parsing with minimal effort.

Getting ready

The basic use of argparse in a script can be shown in three steps:

  1. Define the arguments that your script is going to accept, generating a new parser.
  2. Call the defined parser, returning an object with all the resulting arguments.
  3. Use the arguments to call the entry point of your script, which will apply the defined behavior.

Try to use the following general structure for your scripts:

IMPORTS

def main(main parameters):
    DO THINGS

if __name__ == '__main__':
    DEFINE ARGUMENT PARSER
    PARSE ARGS
    VALIDATE OR MANIPULATE ARGS, IF NEEDED
    main(arguments)

The main function makes it easy to know what the entry point for the code is. The section under the if statement is only executed if the file is called directly, but not if it's imported. We'll follow this for all the steps.

How to do it...

  1. Create a script that will accept a single integer as a positional argument and will print a hash symbol that number of times. The recipe_cli_step1.py script is as follows; note that we are following the structure presented previously, and the main function just prints the argument:
import argparse

def main(number):
    print('#' * number)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('number', type=int, help='A number')
    args = parser.parse_args()

    main(args.number)
  2. Call the script and see how the parameter is presented. Calling the script with no arguments displays the usage line and an error. Use the automatic argument -h to display the extended help:
$ python3 recipe_cli_step1.py
usage: recipe_cli_step1.py [-h] number
recipe_cli_step1.py: error: the following arguments are required: number
$ python3 recipe_cli_step1.py -h
usage: recipe_cli_step1.py [-h] number
positional arguments:
  number      A number

optional arguments:
  -h, --help  show this help message and exit
  3. Calling the script with extra parameters works as expected: a valid integer prints the hash symbols, while an invalid value produces an error:
$ python3 recipe_cli_step1.py 4
####
$ python3 recipe_cli_step1.py not_a_number
usage: recipe_cli_step1.py [-h] number
recipe_cli_step1.py: error: argument number: invalid int value: 'not_a_number'
  4. Change the script to accept an optional argument for the character to print. The default will be '#'. The recipe_cli_step2.py script will look like this:
import argparse

def main(character, number):
    print(character * number)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('number', type=int, help='A number')
    parser.add_argument('-c', type=str, help='Character to print',
                        default='#')

    args = parser.parse_args()
    main(args.c, args.number)
  5. The help is updated, and using the -c flag allows us to print different characters:
$ python3 recipe_cli_step2.py -h
usage: recipe_cli_step2.py [-h] [-c C] number

positional arguments:
  number      A number

optional arguments:
  -h, --help  show this help message and exit
  -c C        Character to print
$ python3 recipe_cli_step2.py 4
####
$ python3 recipe_cli_step2.py 5 -c m
mmmmm
  6. Add a flag that changes the behavior when present. The recipe_cli_step3.py script is as follows:
import argparse

def main(character, number):
    print(character * number)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('number', type=int, help='A number')
    parser.add_argument('-c', type=str, help='Character to print',
                        default='#')
    parser.add_argument('-U', action='store_true', default=False,
                        dest='uppercase',
                        help='Uppercase the character')
    args = parser.parse_args()

    if args.uppercase:
        args.c = args.c.upper()

    main(args.c, args.number)
  7. Calling it uppercases the character if the -U flag is added:
$ python3 recipe_cli_step3.py 4 -c f
ffff
$ python3 recipe_cli_step3.py 4 -c f -U
FFFF

How it works...

As described in step 1 in the How to do it… section, the arguments are added to the parser through .add_argument(). Once all the arguments are defined, calling parse_args() returns an object that contains the results (or exits if there's an error).

Each argument should add a help description, but their behavior can change greatly:

  • If an argument starts with a -, it is considered an optional parameter, like the -c argument in step 4. If not, it's a positional argument, like the number argument in step 1.
For clarity, always define a default value for optional parameters. It will be None if you don't, but this may be confusing.
  • Remember to always add a help parameter with a description of the parameter; help is automatically generated, as shown in step 2.
  • If a type is present, it will be validated, for example, number in step 3. By default, the type will be string.
  • The actions store_true and store_false can be used to generate flags, arguments that don't require any extra parameters. Set the corresponding default value as the opposite Boolean. This is demonstrated by the -U argument in steps 6 and 7; a store_false sketch is shown right after this list.
  • The name of the property in the args object will be, by default, the name of the argument (without the dash, if it's present). You can change it with dest. For example, in step 6, the command-line argument -U is stored as uppercase.
Changing the name of an argument for internal usage is very useful when using short arguments, such as single letters. A good command-line interface will use -c, but internally it's probably a good idea to use a more verbose label, such as configuration_file. Explicit is better than implicit!
  • Some arguments can work in coordination with others, as shown in step 6. Perform all the required operations so that the main function receives clear and concise parameters. For example, in step 6, only two parameters are passed to main, but one of them may have been modified.
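
As a complement to the recipe's store_true example, here is a minimal store_false sketch; the --no-color flag and the color attribute are hypothetical names chosen for illustration:

import argparse

parser = argparse.ArgumentParser()
# The value is True by default; passing --no-color stores False in args.color
parser.add_argument('--no-color', action='store_false', default=True,
                    dest='color', help='Disable colored output')

print(parser.parse_args([]).color)              # True
print(parser.parse_args(['--no-color']).color)  # False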

There's more...

You can create long arguments as well with double dashes, for example:

parser.add_argument('-v', '--verbose', action='store_true', default=False,
                    help='Enable verbose output')

This will accept both -v and --verbose, and it will store the name verbose.

Adding long names is a good way of making the interface more intuitive and easy to remember. It's easy to remember after a couple of times that there's a verbose option, and it starts with a v.
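
Here's a minimal sketch of how both spellings end up in the same attribute (parse_args is given explicit argument lists just to show the results):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-v', '--verbose', action='store_true', default=False,
                    help='Enable verbose output')

# Both the short and the long form set the same attribute, named after the long option
print(parser.parse_args(['-v']).verbose)         # True
print(parser.parse_args(['--verbose']).verbose)  # True
print(parser.parse_args([]).verbose)             # False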

The main inconvenience when dealing with command-line arguments is ending up with too many of them, which creates confusion. Try to keep your arguments as independent as possible and avoid creating too many dependencies between them; otherwise, handling all the combinations can be tricky.

In particular, try not to create more than a couple of positional arguments, as they won't have mnemonics. Positional arguments also accept default values, but most of the time that won't be the expected behavior; a sketch of how to do it follows.
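
If you do need a positional argument with a default, it has to be declared with nargs='?' so it can be omitted on the command line. This is a minimal sketch, not one of the recipe scripts:

import argparse

parser = argparse.ArgumentParser()
# nargs='?' makes the positional optional; the default is used when it's omitted
parser.add_argument('number', type=int, nargs='?', default=1,
                    help='A number (defaults to 1)')

print(parser.parse_args([]).number)     # 1
print(parser.parse_args(['4']).number)  # 4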

For advanced details, check the Python documentation of argparse (https://docs.python.org/3/library/argparse.html).

See also

  • The Creating a virtual environment recipe
  • The Installing third-party packages recipe
About the Author
  • Jaime Buelta

    Jaime Buelta is a Software Architect who has been a professional programmer since 2002 and a Python enthusiast since 2010. He has developed software for a variety of fields, focusing, in the last 10 years, on developing web services in Python in the gaming, finance and education industries. He is a strong proponent of automating everything to make computers do most of the heavy lifting, so humans can focus on the important stuff. He published his first book, "Python Automation Cookbook", in 2018 (recently with an extended second edition), followed a year later by "Hands-On Docker for Microservices with Python" describing how to migrate to a microservice architecture. He is currently living in Dublin, Ireland, and is a regular speaker at PyCon Ireland.
