Python Automation Cookbook - Second Edition

By Jaime Buelta

About this book

In this updated and extended version of Python Automation Cookbook, each chapter now comprises the newest recipes and is revised to align with Python 3.8 and higher. The book includes three new chapters that focus on using Python for test automation, machine learning projects, and for working with messy data.

This edition will enable you to develop a sharp understanding of the fundamentals required to automate business processes through real-world tasks, such as developing your first web scraping application, analyzing information to generate spreadsheet reports with graphs, and communicating with automatically generated emails.

Once you grasp the basics, you will acquire the practical knowledge to create stunning graphs and charts using Matplotlib, generate rich graphics with relevant information, automate marketing campaigns, build machine learning projects, and execute debugging techniques.

By the end of this book, you will be proficient in identifying monotonous tasks and resolving process inefficiencies to produce superior and reliable systems.

Publication date: May 2020
Publisher: Packt
Pages: 526
ISBN: 9781800207080

 

Let's Begin Our Automation Journey

The objective of this chapter is to lay down some of the basic techniques that will be useful throughout the whole book. The main idea is to create a good Python environment to run the automation tasks that follow and be able to parse text inputs into structured data.

Python has a good number of tools installed by default, but it also makes it easy to install third-party tools that can simplify many common operations. We'll see how to import modules from external sources and use them to leverage the full potential of Python.

We will install and use tools to help process texts. The ability to structure input data is critical in any automation task. Most of the data that we will process in this book will come from unformatted sources such as web pages or text files. As the old computer adage says, garbage in, garbage out, making the sanitizing of inputs a very important task.

In this chapter, we'll cover the following recipes:

  • Activating a virtual environment
  • Installing third-party packages
  • Creating strings with formatted values
  • Manipulating strings
  • Extracting data from structured strings
  • Using a third-party tool—parse
  • Introducing regular expressions
  • Going deeper into regular expressions
  • Adding command-line arguments

We will start by creating our own self-contained environment to work in.

 

Activating a virtual environment

As a first step when working with Python, it is a good practice to explicitly define the working environment.

This helps you to detach from the operating system interpreter and environment and properly define the dependencies that will be used. Not doing so tends to generate chaotic scenarios. Remember, explicit is better than implicit!

Explicit is better than implicit is one of the most quoted parts of the Zen of Python. The Zen of Python is a list of general guidelines for Python, to provide clarity on what is considered Pythonic. The full Zen of Python can be invoked from the Python interpreter by calling import this.

This is especially important in two scenarios:

  • When dealing with multiple projects on the same computer, as they can have different dependencies that clash at some point. For example, two versions of the same module cannot be installed in the same environment.
  • When working on a project that will be used on a different computer, for example, developing some code on a personal laptop that will ultimately run in a remote server.

A common joke among developers is responding to a bug with "it runs on my machine," meaning that it appears to work on their laptop, but not on the production servers. Although a huge number of factors can produce this error, a good practice is to produce an automatically replicable environment, reducing uncertainty over what dependencies are really being used.

This is easy to achieve using the venv module, which sets up a local virtual environment. None of the installed dependencies will be shared with the Python interpreter installed on the machine, creating an isolated environment.

In Python 3, the venv tool is installed as part of the standard library. This was not the case in Python 2, where you had to install the external virtualenv package.

Getting ready

To create a new virtual environment, do the following:

  1. Go to the main directory that contains the project:
    $ cd my-directory
    
  2. Type the following command:
    $ python3 -m venv .venv
    

    This creates a subdirectory called .venv that contains the virtual environment.

    The directory containing the virtual environment can be located anywhere. Keeping it in the project's root directory keeps it handy, and the leading dot in its name hides it from ls and similar commands.

  3. Before activating the virtual environment, check the installed version of pip. It will differ depending on your operating system and installed packages, and it may be upgraded later. Also check which Python interpreter is being used, which at this point is the main operating system one:
    $ pip --version
    pip 10.0.1 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)
    $ which python3
    /usr/local/bin/python3
    

Note that which may not be available in your shell. In Windows, for example, where can be used.

Now your virtual environment is ready to go.

How to do it…

  1. Activate the virtual environment if you use Linux or macOS by running:
    $ source .venv/bin/activate
    

    Depending on your operating system (for example, Windows) and shell (for example, fish), you may need a different command. View the documentation of venv in the Python documentation here: https://docs.python.org/3/library/venv.html.

    You'll notice that the shell prompt will display (.venv), showing that the virtual environment is active.

  2. Notice that the Python interpreter used is the one inside the virtual environment, and not the general operating system one from step 3 of the Getting ready section. Check the location within a virtual environment:
    (.venv) $ which python
    /root_dir/.venv/bin/python
    (.venv) $ which pip
    /root_dir/.venv/bin/pip
    
  3. Upgrade the version of pip and then check the version:
    (.venv) $ pip install --upgrade pip
    ...
    Successfully installed pip-20.0.2
    (.venv) $ pip --version
    pip 20.0.2 from /root_dir/.venv/lib/python3.8/site-packages/pip (python 3.8)
    

    An alternative is to run python -m ensurepip -U, which will ensure that pip is installed.

  4. Deactivate the virtual environment, then check the Python interpreter and the pip version again. Both revert to the system-wide ones shown in step 3 of the Getting ready section, so the pip version differs from the upgraded one inside the virtual environment:
    (.venv) $ deactivate
    $ which python3
    /usr/local/bin/python3
    $ pip --version
    pip 10.0.1 from /usr/local/lib/python3.8/site-packages/pip (python 3.8)
    

How it works…

Notice that inside the virtual environment you can use python instead of python3, although python3 is available as well. This will use the Python interpreter defined in the environment.

In some systems, like Linux, it's possible that you'll need to use python3.8 instead of python3. Verify that the Python interpreter you're using is 3.8 or higher.
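
The same checks can be made from inside Python. The following is a minimal sketch, not part of the recipe: it assumes the environment was created with venv, where sys.prefix differs from sys.base_prefix, and the function names are illustrative.

```python
import sys


def in_virtual_environment():
    """Return True when running inside a venv-created environment."""
    # Inside a virtual environment, sys.prefix points at the environment
    # directory, while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def is_supported_version(minimum=(3, 8)):
    """Return True when the interpreter is at least the given version."""
    return sys.version_info >= minimum


print('Interpreter:', sys.executable)
print('Virtual environment active:', in_virtual_environment())
print('Python 3.8 or higher:', is_supported_version())
```

Running it before and after activating the environment should show a different interpreter path each time.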

Inside the virtual environment, step 3 of the How to do it section installs the most recent version of pip, without affecting the external installation.

The virtual environment contains all the Python data in the .venv directory, and the activate script points all the environment variables there. The best thing about it is that it can be deleted and recreated very easily, removing the fear of experimenting in a self-contained sandbox.
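
To illustrate how disposable an environment is, here is a small sketch that creates and deletes one using the standard library venv module directly. The helper names are hypothetical, and with_pip=False is an assumption made to keep the example fast; the python3 -m venv command bootstraps pip by default.

```python
import shutil
import venv
from pathlib import Path


def create_sandbox(env_dir):
    """Create a virtual environment in env_dir and return its path."""
    # with_pip=False skips bootstrapping pip, which keeps this sketch fast.
    venv.create(str(env_dir), with_pip=False)
    return Path(env_dir)


def remove_sandbox(env_dir):
    """Delete the environment directory, like rm -rf on the command line."""
    shutil.rmtree(str(env_dir))
```

A created environment contains a pyvenv.cfg file at its root, which is a simple way to confirm that the creation worked.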

Remember that the directory name is displayed in the prompt. If you need to differentiate the environment, use a descriptive directory name, such as .my_automate_recipe, or use the --prompt option.

There's more…

To remove a virtual environment, deactivate it and remove the directory:

(.venv) $ deactivate
$ rm -rf .venv

The venv module has more options, which can be shown with the -h flag:

$ python3 -m venv -h
usage: venv [-h] [--system-site-packages]
            [--symlinks | --copies] [--clear]
            [--upgrade] [--without-pip]
            [--prompt PROMPT]
            ENV_DIR [ENV_DIR ...]
Creates virtual Python environments in one or more target directories.
positional arguments:
  ENV_DIR               A directory to create the
                        environment in.
optional arguments:
  -h, --help            show this help message and
                        exit
  --system-site-packages
                        Give the virtual
                        environment access to the
                        system site-packages dir.
  --symlinks            Try to use symlinks rather
                        than copies, when symlinks
                        are not the default for the
                        platform.
  --copies              Try to use copies rather
                        than symlinks, even when
                        symlinks are the default
                        for the platform.
  --clear               Delete the contents of the
                        environment directory if it
                        already exists, before
                        environment creation.
  --upgrade             Upgrade the environment
                        directory to use this
                        version of Python, assuming
                        Python has been upgraded
                        in-place.
  --without-pip         Skips installing or
                        upgrading pip in the
                        virtual environment (pip is
                        bootstrapped by default)
  --prompt PROMPT       Provides an alternative
                        prompt prefix for this
                        environment.
Once an environment has been created, you may wish
to activate it, e.g. by sourcing an activate script
in its bin directory.

A convenient way of dealing with virtual environments, especially if you often have to swap between them, is to use the virtualenvwrapper module:

  1. To install it, run this:
    $ pip install virtualenvwrapper
    
  2. Then, add the following variables to your shell startup script, which is normally .bashrc or .bash_profile. The virtual environments will be installed under the WORKON_HOME directory instead of in the same directory as the project, as shown previously:
    export WORKON_HOME=~/.virtualenvs
    source /usr/local/bin/virtualenvwrapper.sh
    

Sourcing the startup script or opening a new terminal will allow you to create new virtual environments:

$ mkvirtualenv automation_cookbook
...
Installing setuptools, pip, wheel...done.
(automation_cookbook) $ deactivate
$ workon automation_cookbook
(automation_cookbook) $

Hitting the Tab key after workon autocompletes the command with the available environments.

For more information, view the documentation of virtualenvwrapper at https://virtualenvwrapper.readthedocs.io/en/latest/index.html.

An alternative tool for defining environments is Poetry (https://python-poetry.org/). This tool is designed for creating consistent environments with clear dependencies, and provides commands for upgrades and managing dependency packages. Check it out to see whether it's useful in your use case.

See also

  • The Installing third-party packages recipe, covered later in the chapter.
  • The Using a third-party tool—parse recipe, covered later in the chapter.
 

Installing third-party packages

One of the strongest capabilities of Python is the ability to use an impressive catalog of third-party packages that cover an amazing amount of ground in different areas, from modules specialized in performing numerical operations, machine learning, and network communications, to command-line convenience tools, database access, image processing, and much more!

Most of them are available on the official Python Package Index (https://pypi.org/), which has more than 200,000 packages ready to use. In this book, we'll install some of them. In general, it's worth spending a little time researching external tools when trying to solve a problem. It's very likely that someone else has already created a tool that solves all, or at least part, of the problem.

More important than finding and installing a package is keeping track of which packages are being used. This greatly helps with replicability, meaning the ability to start the whole environment from scratch in any situation.

Getting ready

The starting point is to find a package that will be of use in our project.

A great one is requests, a module that deals with HTTP requests and is known for its easy and intuitive interface, as well as its great documentation. Take a look at the documentation, which can be found here: https://requests.readthedocs.io/en/master/.

We'll use requests throughout this book when dealing with HTTP connections.

The next step will be to choose the version to use. In this case, the latest (2.22.0, at the time of writing) will be perfect. If the version of the module is not specified, by default it will install the latest version, which can lead to inconsistencies in different environments as newer versions are released.

We'll also use the great delorean module for time handling (version 1.0.0: http://delorean.readthedocs.io/en/latest/).

How to do it…

  1. Create a requirements.txt file in our main directory, which will specify all the requirements for our project. Let's start with delorean and requests:
    delorean==1.0.0
    requests==2.22.0
    
  2. Install all the requirements with the pip command:
    $ pip install -r requirements.txt
    ...
    Successfully installed babel-2.8.0 certifi-2019.11.28 chardet-3.0.4 delorean-1.0.0 humanize-0.5.1 idna-2.8 python-dateutil-2.8.1 pytz-2019.3 requests-2.22.0 six-1.14.0 tzlocal-2.0.0 urllib3-1.25.7
    

    Show the available modules installed using pip list:

    $ pip list
    Package         Version
    --------------- ----------
    Babel           2.8.0
    certifi         2019.11.28
    chardet         3.0.4
    Delorean        1.0.0
    humanize        2.0.0
    idna            2.8
    pip             19.2.3
    python-dateutil 2.8.1
    pytz            2019.3
    requests        2.22.0
    setuptools      41.2.0
    six             1.14.0
    tzlocal         2.0.0
    urllib3         1.25.8
    
  3. You can now use both modules when using the virtual environment:
    $ python
    Python 3.8.1 (default, Dec 27 2019, 18:05:45)
    [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import delorean
    >>> import requests
    

How it works…

The requirements.txt file specifies the module and version, and pip performs a search on pypi.org.

Note that creating a new virtual environment from scratch and running the following will completely recreate your environment, which makes replicability very straightforward:

$ pip install -r requirements.txt

Note that step 2 of the How to do it section automatically installs other modules that are dependencies, such as urllib3.

There's more…

If any of the modules need to be changed to a different version, for example because a new version is available, change the version in requirements.txt and run the install command again:

$ pip install -r requirements.txt

This is also applicable when a new module needs to be included.

At any point, the freeze command can be used to display all of the installed modules. freeze returns the modules in a format compatible with requirements.txt, making it possible to generate a file with our current environment:

$ pip freeze > requirements.txt

This will include dependencies, so expect a lot more modules in the file.
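
A similar listing can be produced from inside Python. The sketch below is only an approximation of pip freeze (it does not handle editable installs or pip's own exclusions) and assumes Python 3.8 or higher, where importlib.metadata is part of the standard library. The function name is illustrative.

```python
from importlib import metadata  # in the standard library since Python 3.8


def frozen_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get('Name')
        if name:  # skip entries with missing metadata
            lines.add(f'{name}=={dist.version}')
    return sorted(lines, key=str.lower)


for requirement in frozen_requirements():
    print(requirement)
```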

Finding great third-party modules is sometimes not easy. Searching for specific functionality can work well, but, sometimes, there are great modules that are a surprise because they do things you never thought of. A great curated list is Awesome Python (https://awesome-python.com/), which covers a lot of great tools for common Python use cases, such as cryptography, database access, date and time handling, and more.

In some cases, installing packages may require additional tools, such as compilers or a specific library that supports some functionality (for example, a particular database driver). If that's the case, the documentation will explain the dependencies.

See also

  • The Activating a virtual environment recipe, covered earlier in this chapter.
  • The Using a third-party tool—parse recipe, covered later in this chapter, to learn how to use one installed third-party module.
 

Creating strings with formatted values

One of the basic skills when creating text and documents is properly formatting values into structured strings. Python presents good defaults, such as properly rendering a number, but there are a lot of options and possibilities.

We'll discuss some of the common options when creating formatted text using the example of a table.

Getting ready

The main tool to format strings in Python is the format method. It works with a defined mini-language to render variables this way:

result = template.format(*parameters)

The template is a string that is interpreted based on the mini-language. At its simplest, templating replaces the values between the curly brackets with the parameters. Here are a couple of examples:

>>> 'Put the value of the string here: {}'.format('STRING')
'Put the value of the string here: STRING'
>>> 'It can be any type ({}) and more than one ({})'.format(1.23, 'STRING')
'It can be any type (1.23) and more than one (STRING)'
>>> 'Specify the order: {1}, {0}'.format('first', 'second')
'Specify the order: second, first'
>>> 'Or name parameters: {first}, {second}'.format(second='SECOND', first='FIRST')
'Or name parameters: FIRST, SECOND'

In 95% of cases, this formatting will be all that's required; keeping things simple is great! But for more complicated cases, such as aligning strings automatically and creating good-looking text tables, the format mini-language has more options.

How to do it…

  1. Write the following script, recipe_format_strings_step1.py, to print an aligned table:
    # INPUT DATA
    data = [
        (1000, 10),
        (2000, 17),
        (2500, 170),
        (2500, -170),
    ]
    # Print the header for reference
    print('REVENUE | PROFIT | PERCENT')
    # This template aligns and displays the data in the proper format
    TEMPLATE = '{revenue:>7,} | {profit:>+6} | {percent:>7.2%}'
    # Print the data rows
    for revenue, profit in data:
        row = TEMPLATE.format(revenue=revenue, profit=profit, percent=profit / revenue)
        print(row)
    
  2. Run it to display the following aligned table. Note that PERCENT is correctly displayed as a percentage:
    REVENUE | PROFIT | PERCENT
      1,000 |    +10 |   1.00%
      2,000 |    +17 |   0.85%
      2,500 |   +170 |   6.80%
      2,500 |   -170 |  -6.80%
    

How it works…

The TEMPLATE constant defines three columns, each one defined by a parameter named revenue, profit, and percent. This makes it explicit and straightforward to apply the template to the format call.

After the name of the parameter, there's a colon that separates the format definition. Note that everything is inside the curly brackets. In each column, the format specification sets the width to match the header (seven characters for revenue and percent, six for profit) and aligns the values to the right with the > symbol:

  • revenue adds a thousands separator with the , symbol: {revenue:>7,}.
  • profit adds a + sign for positive values. A - sign for negatives is added automatically: {profit:>+6}.
  • percent displays a percent value with a precision of two decimal places: {percent:>7.2%}. This is done through .2 (precision) and adding a % symbol for the percentage.
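
To see each specification in isolation, the following short sketch applies them one at a time to single values. The variable names are only for illustration.

```python
# Each specification from the template, applied to a single value.
revenue_cell = '{:>7,}'.format(1000)      # width 7, thousands separator
profit_cell = '{:>+6}'.format(10)         # width 6, explicit + sign
loss_cell = '{:>+6}'.format(-170)         # the - sign needs no extra option
percent_cell = '{:>7.2%}'.format(0.0085)  # times 100, two decimals, % suffix

print(repr(revenue_cell))  # '  1,000'
print(repr(profit_cell))   # '   +10'
print(repr(loss_cell))     # '  -170'
print(repr(percent_cell))  # '  0.85%'
```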

There's more…

You may have also seen Python formatting done with the % operator. While it works for simple formatting, it is less flexible than the format mini-language, and it is not recommended for use.

A great new feature since Python 3.6 is to use f-strings, which perform a format action using defined variables:

>>> param1 = 'first'
>>> param2 = 'second'
>>> f'Parameters {param1}:{param2}'
'Parameters first:second'

This simplifies a lot of the code and allows us to create very descriptive and readable code.

Be careful when using f-strings to ensure that the string is rendered at the proper time. A common problem is that a variable to be rendered is not yet defined. For example, TEMPLATE, defined previously, can't be written as an f-string, as revenue and the rest of the parameters are not available at that point. All variables in scope where the string is defined will be available, both local and global.
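
The difference in timing can be sketched as follows. The shortened template and the render_row function are illustrative and not part of the recipe.

```python
# Deferred rendering: a plain string template can be defined before any
# data exists, and filled in later with format().
TEMPLATE = '{revenue:>7,} | {profit:>+6}'


def render_row(revenue, profit):
    # Immediate rendering: an f-string is evaluated as soon as it is reached,
    # so it has to live where revenue and profit are already defined.
    return f'{revenue:>7,} | {profit:>+6}'


deferred = TEMPLATE.format(revenue=1000, profit=10)
immediate = render_row(1000, 10)
print(deferred)
print(immediate)
```

Both calls produce the same row; the only difference is when the values need to exist.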

If you need to write a literal curly bracket, you'll need to double it. Each doubled bracket is displayed as a single one, so wrapping a replaced value in literal brackets takes three brackets on each side (two for the literal bracket, plus one for the value replacement):

>>> value = 'VALUE'
>>> f'This is the value, in curly brackets {{{value}}}'
'This is the value, in curly brackets {VALUE}'

This allows us to create meta templates—templates that produce templates. In some cases, this will be useful, but they get complicated very quickly. Use with care, as it's easy to produce code that will be difficult to read.
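
As a small illustration of a meta template, the sketch below uses an f-string to build a template that is later filled in with format(). The make_column_template helper is hypothetical.

```python
def make_column_template(name, width):
    """Build a str.format template that right-aligns `name` to `width`."""
    # {{ and }} become literal brackets; {name} and {width} are replaced now.
    return f'{{{name}:>{width}}}'


template = make_column_template('product', 10)
print(template)                              # {product:>10}
print(repr(template.format(product='abc')))  # '       abc'
```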

Representing characters that have a special meaning usually requires some sort of special way to define them, for example, by duplicating the curly bracket like we see here. This is called "escaping" and it's a common process in any code representation.

The Python Format Specification mini-language has more options than the ones shown here.

As the language tries to be quite concise, sometimes it can be difficult to determine the position of the symbols. You may sometimes ask yourself questions, like "is the + symbol before or after the width parameters?" Read the documentation with care and remember to always include a colon before the format specification.
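
As a reminder of the order, here is a short sketch that combines several options in one specification and shows what happens when the sign is misplaced. The comment lists only the most common fields; the documentation has the complete grammar.

```python
# Order after the colon: [[fill]align][sign][width][grouping][.precision][type]
formatted = '{:*>+10,.2f}'.format(1234.5)
print(formatted)  # *+1,234.50

# Putting the sign after the width is rejected when the value is formatted.
try:
    '{:>7+}'.format(1)
except ValueError as error:
    print('Invalid specification:', error)
```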

Please refer to the full documentation and examples on the Python website, https://docs.python.org/3/library/string.html#formatspec, or at this fantastic web page, https://pyformat.info, which shows lots of examples.

See also

  • The Template Reports recipe in Chapter 5, Generating Fantastic Reports, to learn more advanced template techniques.
  • The Manipulating strings recipe, covered later in this chapter, to learn more about working with text.
 

Manipulating strings

When dealing with text, it's often necessary to manipulate and process it; that is, to be able to join it, split it into regular chunks, or change it to be uppercase or lowercase. We'll discuss more advanced methods for parsing text and separating it later; however, in lots of cases, it is useful to divide a paragraph into lines, sentences, or even words. Other times, words will require some characters to be removed or a word will need to be replaced with a canonical version to be able to compare it with a predetermined value.

Getting ready

We'll define a basic piece of text and transform it into its main components; then, we'll reconstruct it. As an example, a report needs to be transformed into a new format to be sent via email.

The input format we'll use in this example will be this:

    AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
    HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
    WITH THE OBJECTIVES FOR THE YEAR. THE MAIN DRIVER OF THE SALES HAS BEEN
    THE NEW PACKAGE DESIGNED UNDER THE SUPERVISION OF OUR MARKETING DEPARTMENT.
    OUR EXPENSES HAS BEEN CONTAINED, INCREASING ONLY BY 0.7%, THOUGH THE BOARD
    CONSIDERS IT NEEDS TO BE FURTHER REDUCED. THE EVALUATION IS SATISFACTORY
    AND THE FORECAST FOR THE NEXT QUARTER IS OPTIMISTIC. THE BOARD EXPECTS
    AN INCREASE IN PROFIT OF AT LEAST 2 MILLION DOLLARS.

We need to redact the text to eliminate any references to numbers. It needs to be properly formatted by adding a new line after each period, justified with 80 characters, and transformed into ASCII for compatibility reasons.

The text will be stored in the INPUT_TEXT variable in the interpreter.

How to do it…

  1. After entering the text, split it into individual words:
    >>> INPUT_TEXT = '''
    ...     AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
    ...     HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
    ...
    ... '''
    >>> words = INPUT_TEXT.split()
    
  2. Replace any numerical digits with an 'X' character:
    >>> redacted = [''.join('X' if w.isdigit() else w for w in word) for word in words]
    
  3. Transform the text into pure ASCII (note that the name of the company contains the letter ñ, which is not ASCII):
    >>> ascii_text = [word.encode('ascii', errors='replace').decode('ascii')
    ...               for word in redacted]
    
  4. Group the words into 80-character lines. Note the final append after the loop, which stores the last pending line so that the closing sentence is not lost:
    >>> newlines = [word + '\n' if word.endswith('.') else word for word in ascii_text]
    >>> LINE_SIZE = 80
    >>> lines = []
    >>> line = ''
    >>> for word in newlines:
    ...     if line.endswith('\n') or len(line) + len(word) + 1 > LINE_SIZE:
    ...         lines.append(line)
    ...         line = ''
    ...     line = line + ' ' + word
    ...
    >>> lines.append(line)
    
  5. Format all of the lines as titles and join them as a single piece of text:
    >>> lines = [line.title() for line in lines]
    >>> result = '\n'.join(lines)
    
  6. Print the result:
    >>> print(result)
     After The Close Of The Second Quarter, Our Company, Casta?Acorp Has Achieved A
     Growth In The Revenue Of X.Xx%.

     This Is In Line With The Objectives For The Year.

     The Main Driver Of The Sales Has Been The New Package Designed Under The
     Supervision Of Our Marketing Department.

     Our Expenses Has Been Contained, Increasing Only By X.X%, Though The Board
     Considers It Needs To Be Further Reduced.

     The Evaluation Is Satisfactory And The Forecast For The Next Quarter Is
     Optimistic.

     The Board Expects An Increase In Profit Of At Least X Million Dollars.
    

How it works…

Each step performs a specific transformation of the text:

  • The first step splits the text on the default separators: whitespace and new lines. This leaves individual words, with no line breaks or repeated spaces between them.
  • To replace the digits, we go through every character of each word and return an 'X' in place of any digit. This is done with two nested comprehensions: the outer one runs over the list of words, and the inner one runs over the characters of each word, ('X' if w.isdigit() else w for w in word). Note that the characters of each word are joined together again.
  • Each of the words is encoded into an ASCII byte sequence and decoded back again into the Python string type. Note the use of the errors parameter to force the replacement of unknown characters such as ñ.

    The difference between strings and bytes is not very intuitive at first, especially if you never have to worry about multiple languages or encoding transformations. In Python 3, there's a strong separation between strings (internal Python representation) and bytes. So most of the tools applicable to strings won't be available in byte objects. Unless you have a good idea of why you need a byte object, always work with Python strings. If you need to perform transformations like the one in this task, encode and decode in the same line so that you keep your objects within the comfortable realm of Python strings. If you are interested in learning more about encodings, you can refer to this brief article: https://eli.thegreenplace.net/2012/01/30/the-bytesstr-dichotomy-in-python-3 and this other, longer and more detailed one: http://www.diveintopython3.net/strings.html.

  • The next step adds an extra newline character (the \n character) for all words ending with a period. This marks the different paragraphs. After that, it creates a line and adds the words one by one. If an extra word will make it go over 80 characters, it finishes the line and starts a new one. If the line already ends with a new line, it finishes it and starts another one as well. Note that there's an extra space added to separate the words.
  • Finally, each of the lines is capitalized as a Title (the first letter of each word is uppercased) and all the lines are joined through new lines.
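
The steps above can be collected into a single function, as in the following sketch. The function name and the shortened sample text are illustrative; unlike a loop that only appends when a line is completed, this version also appends the last pending line after the loop.

```python
def redact_and_wrap(text, line_size=80):
    """Redact digits, force ASCII, and wrap text with one sentence per paragraph."""
    words = text.split()
    redacted = [''.join('X' if char.isdigit() else char for char in word)
                for word in words]
    ascii_words = [word.encode('ascii', errors='replace').decode('ascii')
                   for word in redacted]
    # A trailing newline marks the end of each sentence.
    marked = [word + '\n' if word.endswith('.') else word
              for word in ascii_words]

    lines = []
    line = ''
    for word in marked:
        if line.endswith('\n') or len(line) + len(word) + 1 > line_size:
            lines.append(line)
            line = ''
        line = line + ' ' + word
    if line:
        lines.append(line)  # keep the last pending line
    return '\n'.join(line.title() for line in lines)


# \u00d1 is the letter Ñ, which is not ASCII.
sample = ('OUR COMPANY, CASTA\u00d1ACORP, GREW 7.47%. '
          'THE BOARD EXPECTS 2 MILLION DOLLARS.')
print(redact_and_wrap(sample))
```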

There's more…

Some other useful operations that can be performed on strings are as follows:

  • Strings can be sliced like any other sequence, such as a list. This means that "word"[0:2] will return "wo".
  • Use .splitlines() to split a string into a list of lines at each newline character.
  • There are .upper() and .lower() methods, which return a copy with all of the characters set to uppercase or lowercase. Their use is very similar to .title():
    >>> 'UPPERCASE'.lower()
    'uppercase'
    
  • For easy replacements (for example, changing all As to Bs or changing mine to ours), use .replace(). This method is useful for very simple cases, but replacements can get tricky easily. Be careful with the order of replacements to avoid collisions and case sensitivity issues. Note the wrong replacement in the following example:
    >>> 'One ring to rule them all, one ring to find them, One ring to bring them all and in the darkness bind them.'.replace('ring', 'necklace')
    'One necklace to rule them all, one necklace to find them, One necklace to bnecklace them all and in the darkness bind them.'
    

This is similar to the issues we'll encounter with regular expressions matching unexpected parts of your text. There are more examples to follow later. Refer to the regular expressions recipes for more information.
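
One way to avoid the wrong replacement above is to match whole words only. The following sketch uses the standard library re module with word boundaries; it is an alternative to .replace(), shown here only as a preview of the regular expressions recipes.

```python
import re

text = ('One ring to rule them all, one ring to find them, '
        'One ring to bring them all and in the darkness bind them.')

# \b matches a word boundary, so the "ring" inside "bring" is left alone.
fixed = re.sub(r'\bring\b', 'necklace', text)
print(fixed)
```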

To wrap text lines, you can use the textwrap module included in the standard library, instead of manually counting characters. View the documentation here: https://docs.python.org/3/library/textwrap.html.
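
As a minimal sketch of that approach, textwrap.fill() wraps a paragraph to a maximum width in a single call. The sample paragraph is illustrative.

```python
import textwrap

paragraph = ('The main driver of the sales has been the new package designed '
             'under the supervision of our marketing department. ') * 3

# fill() returns a single string with newlines inserted so that
# no line is longer than the given width.
wrapped = textwrap.fill(paragraph, width=80)
print(wrapped)
```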

If you work with multiple languages, or with any kind of non-English input, it is very useful to learn the basics of Unicode and encodings. In a nutshell, given the vast number of characters in all the different languages in the world, including writing systems not related to the Latin alphabet, such as Chinese or Arabic, there's a standard (Unicode) that tries to cover all of them so that computers can properly understand them. Python 3 greatly improved this situation by making its internal string objects able to deal with all of those characters. The default encoding that Python uses, and the most common and compatible one, is currently UTF-8.

A good article to learn about the basics of UTF-8 is this blog post: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.

Dealing with encodings is still relevant when reading from external files, which can be stored in different encodings (for example, CP-1252, also known as windows-1252, a common encoding produced by legacy Microsoft systems, or ISO 8859-15, an ISO standard for Western European languages).
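
The following sketch shows why the encoding matters, using the company name from this recipe. The same text produces different bytes in CP-1252 and UTF-8, and decoding with the wrong one fails unless an errors strategy is given.

```python
company = 'CASTA\u00d1ACORP'  # CASTAÑACORP

legacy_bytes = company.encode('cp1252')  # Ñ is stored as one byte
utf8_bytes = company.encode('utf-8')     # Ñ is stored as two bytes
print(len(legacy_bytes), len(utf8_bytes))

# Decoding with the encoding that produced the bytes recovers the text.
print(legacy_bytes.decode('cp1252'))

# Decoding with the wrong encoding raises an error...
try:
    legacy_bytes.decode('utf-8')
except UnicodeDecodeError as error:
    print('Wrong encoding:', error)

# ...unless invalid bytes are replaced with the U+FFFD marker.
recovered = legacy_bytes.decode('utf-8', errors='replace')
print(recovered)
```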

See also

  • The Creating strings with formatted values recipe, covered earlier in the chapter, to learn the basics of string creation.
  • The Introducing regular expressions recipe, covered later in the chapter, to learn how to detect and extract patterns in text.
  • The Going deeper into regular expressions recipe, covered later in the chapter, to further your knowledge of regular expressions.
  • The Dealing with encodings recipe in Chapter 4, Searching and Reading Local Files, to learn about different kinds of encodings.
 

Extracting data from structured strings

In a lot of automated tasks, we'll need to treat input text structured in a known format and extract the relevant information. For example, a spreadsheet may define a percentage in a piece of text (such as 37.4%) and we want to retrieve it in a numerical format to apply it later (0.374, as a float).

In this recipe, we'll learn how to process sale logs that contain inline information about a product, such as whether it has been sold, its price, profit made, and other information.

Getting ready

Imagine that we need to parse information stored in sales logs. We'll use a sales log with the following structure:

[<Timestamp in iso format>] - SALE - PRODUCT: <product id> - PRICE: $<price of the sale>

For example, a specific log may look like this:

[2018-05-05T10:58:41.504054] - SALE - PRODUCT: 1345 - PRICE: $09.99

Note that the price has a leading zero. All prices will have two digits for the dollars and two for the cents.

The standard ISO 8601 defines standard ways of representing the time and date. It is widely used in the computing world and can be parsed and generated by virtually any computer language.
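
As a quick sketch of that claim, the standard library alone can read and write this format. The recipe itself uses delorean, so datetime.fromisoformat (available from Python 3.7) is used here only as an illustration:

```python
from datetime import datetime

# Timestamp taken from the example log line, without the brackets
timestamp = datetime.fromisoformat('2018-05-05T10:58:41.504054')
print(timestamp.year, timestamp.month, timestamp.microsecond)

# isoformat() generates the ISO 8601 text again
round_trip = timestamp.isoformat()
print(round_trip)
```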

We need to activate our virtual environment before we start:

$ source .venv/bin/activate

How to do it…

  1. In the Python interpreter, make the following imports. Remember to activate your virtualenv, as described in the Creating a virtual environment recipe:
    >>> import delorean
    >>> from decimal import Decimal
    
  2. Enter the log to parse:
    >>> log = '[2018-05-05T11:07:12.267897] - SALE - PRODUCT: 1345 - PRICE: $09.99'
    
  3. Split the log into its parts, which are divided by - (note the space before and after the dash). We ignore the SALE part as it doesn't add any relevant information:
    >>> divide_it = log.split(' - ')
    >>> timestamp_string, _, product_string, price_string = divide_it
    
  4. Parse the timestamp into a datetime object:
    >>> timestamp = delorean.parse(timestamp_string.strip('[]'))
    
  5. Parse the product_id into an integer:
    >>> product_id = int(product_string.split(':')[-1])
    
  6. Parse the price into a Decimal type:
    >>> price = Decimal(price_string.split('$')[-1])
    
  7. Now you have all of the values in native Python format:
    >>> timestamp, product_id, price
    (Delorean(datetime=datetime.datetime(2018, 5, 5, 11, 7, 12, 267897), timezone='UTC'), 1345, Decimal('9.99'))
    

How it works…

The basic idea is to isolate each of the elements and then parse them into the proper type. The first step is to split the full log into smaller parts. The ' - ' string is a good divider, as it splits the log into four parts: the timestamp, one with just the word SALE, the product, and the price.

In the case of the timestamp, we need to isolate the ISO format, which is in brackets in the log. That's why the timestamp has the brackets stripped from it. We use the delorean module (introduced earlier) to parse it into a datetime object.

The word SALE is ignored. There's no relevant information there.

To isolate the product ID, we split the product part at the colon. Then, we parse the last element as an integer:

>>> product_string.split(':')
['PRODUCT', ' 1345']
>>> int(' 1345')
1345

To divide the price, we use the dollar sign as a separator, and parse the last element as a Decimal type:

>>> price_string.split('$')
['PRICE: ', '09.99']
>>> Decimal('09.99')
Decimal('9.99')

As described in the next section, do not parse this value into a float type, as it will change the precision.

There's more…

These log elements can be combined together into a single object, helping to parse and aggregate them. For example, we could define a class in Python code in the following way:

class PriceLog(object):
  def __init__(self, timestamp, product_id, price):
    self.timestamp = timestamp
    self.product_id = product_id
    self.price = price
  def __repr__(self):
    return '<PriceLog ({}, {}, {})>'.format(self.timestamp,
                                            self.product_id,
                                            self.price)
  @classmethod
  def parse(cls, text_log):
    '''
    Parse from a text log with the format
    [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
    to a PriceLog object
    '''
    divide_it = text_log.split(' - ')
    tmp_string, _, product_string, price_string = divide_it
    timestamp = delorean.parse(tmp_string.strip('[]'))
    product_id = int(product_string.split(':')[-1])
    price = Decimal(price_string.split('$')[-1])
    return cls(timestamp=timestamp, product_id=product_id, price=price)

So, the parsing can be done as follows:

>>> log = '[2018-05-05T12:58:59.998903] - SALE - PRODUCT: 897 - PRICE: $17.99'
>>> PriceLog.parse(log)
<PriceLog (Delorean(datetime=datetime.datetime(2018, 5, 5, 12, 58, 59, 998903), timezone='UTC'), 897, 17.99)>

Avoid using float types for prices. Float numbers have precision problems that may produce strange errors when aggregating multiple prices, for example:

>>> 0.1 + 0.1 + 0.1 
0.30000000000000004

Try these two options to avoid any problems:

  • Use integer cents as the base unit: This means multiplying currency inputs by 100 and transforming them into integers (or whatever fractional unit is correct for the currency used). You may still want to change the base when displaying them.
  • Parse into the decimal type: The Decimal type keeps the fixed precision and works as you'd expect. You can find further information about the Decimal type in the Python documentation at https://docs.python.org/3.8/library/decimal.html.

If you use the Decimal type, parse the results directly into Decimal from the string. If transforming it first into a float, you can carry the precision errors to the new type.
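
Both options can be sketched in a few lines. The to_cents helper is hypothetical, written for this example, and assumes prices always have exactly two decimal places:

```python
from decimal import Decimal

# Parsing directly from the string keeps the exact value
from_string = Decimal('0.1')

# Going through a float first carries over the float's binary approximation
from_float = Decimal(0.1)
print(from_string == from_float)   # False

# Adding Decimal values parsed from strings gives the exact result
decimal_total = Decimal('0.1') + Decimal('0.1') + Decimal('0.1')


def to_cents(price_text):
    '''Hypothetical helper: converts text such as '09.99' into integer cents'''
    dollars, cents = price_text.split('.')
    return int(dollars) * 100 + int(cents)


# Integer cents: convert each price once, then add plain integers
total_cents = to_cents('09.99') + to_cents('17.99')
print(decimal_total, total_cents)
```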

See also

  • The Creating a virtual environment recipe, covered earlier in the chapter, to learn how to start a virtual environment with installed modules.
  • The Using a third-party tool—parse recipe, covered later in the chapter, to further your knowledge of how to use third-party tools to deal with text.
  • The Introducing regular expressions recipe, covered later in the chapter, to learn how to detect and extract patterns from text.
  • The Going deeper into regular expressions recipe, covered later in the chapter, to further your knowledge of regular expressions.
 

Using a third-party tool—parse

While manually parsing data, as seen in the previous recipe, works very well for small strings, it can be very laborious to tweak the exact formula to work with a variety of inputs. What if the input has an extra dash sometimes? Or it has a variable length header depending on the size of one of the fields?
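
To make the problem concrete, here is a small sketch of how the manual approach breaks. The log line with the extra REFUND field is hypothetical, invented for this example:

```python
# Hypothetical log line with an extra ' - ' separated field
log = ('[2018-05-05T11:07:12.267897] - SALE - REFUND - '
       'PRODUCT: 1345 - PRICE: $09.99')

parts = log.split(' - ')

try:
    # The unpacking used in the previous recipe expects exactly four parts
    timestamp_string, _, product_string, price_string = parts
    unpacked = True
except ValueError:
    # Five parts cannot be unpacked into four names
    unpacked = False

print(len(parts), unpacked)
```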

A more advanced option is to use regular expressions, as we'll see in the next recipe. But there's a great module in Python called parse (https://github.com/r1chardj0n3s/parse), which allows us to reverse format strings. It is a fantastic tool that's powerful, easy to use, and greatly improves the readability of code.

Getting ready

Add the parse module to the requirements.txt file in our virtual environment and reinstall the dependencies, as shown in the Creating a virtual environment recipe.

The requirements.txt file should look like this:

delorean==1.0.0
requests==2.22.0
parse==1.14.0

Then, reinstall the modules in the virtual environment:

$ pip install -r requirements.txt
...
Collecting parse==1.14.0
  Downloading https://files.pythonhosted.org/packages/4a/ea/9a16ff916752241aa80f1a5ec56dc6c6defc5d0e70af2d16904a9573367f/parse-1.14.0.tar.gz
...
Installing collected packages: parse
  Running setup.py install for parse ... done
Successfully installed parse-1.14.0

How to do it…

  1. Import the parse function:
    >>> from parse import parse
    
  2. Define the log to parse, in the same format as in the Extracting data from structured strings recipe:
    >>> LOG = '[2018-05-06T12:58:00.714611] - SALE - PRODUCT: 1345 - PRICE: $09.99'
    
  3. Analyze it and describe it as you would do when trying to print it, like this:
    >>> FORMAT = '[{date}] - SALE - PRODUCT: {product} - PRICE: ${price}'
    
  4. Run parse and check the results:
    >>> result = parse(FORMAT, LOG)
    >>> result
    <Result () {'date': '2018-05-06T12:58:00.714611', 'product': '1345', 'price': '09.99'}>
    >>> result['date']
    '2018-05-06T12:58:00.714611'
    >>> result['product']
    '1345'
    >>> result['price']
    '09.99'
    
  5. Note the results are all strings. Define the types to be parsed:
    >>> FORMAT = '[{date:ti}] - SALE - PRODUCT: {product:d} - PRICE: ${price:05.2f}'
    
  6. Parse once again:
    >>> result = parse(FORMAT, LOG)
    >>> result
    <Result () {'date': datetime.datetime(2018, 5, 6, 12, 58, 0, 714611), 'product': 1345, 'price': 9.99}>
    >>> result['date']
    datetime.datetime(2018, 5, 6, 12, 58, 0, 714611)
    >>> result['product']
    1345
    >>> result['price']
    9.99
    
  7. Define a custom type for the price to avoid issues with the float type:
    >>> from decimal import Decimal
    >>> def price(string):
    ...   return Decimal(string)
    ...
    >>> FORMAT = '[{date:ti}] - SALE - PRODUCT: {product:d} - PRICE: ${price:price}'
    >>> parse(FORMAT, LOG, {'price': price})
    <Result () {'date': datetime.datetime(2018, 5, 6, 12, 58, 0, 714611), 'product': 1345, 'price': Decimal('9.99')}>
    

How it works…

The parse module allows us to define a format, as a string, that reverses the format method when parsing values. A lot of the concepts that we discussed when creating strings apply here—put values in brackets, define the type after a colon, and so on.

By default, as seen in step 4, the values are parsed as strings. This is a good starting point when analyzing text. The values can be parsed into more useful native types, as shown in steps 5 and 6 in the How to do it section. Please note that while most of the parsing types are the same as the ones in the Python Format Specification mini-language, there are some others available, such as ti for timestamps in ISO format.

Though we are using timestamp in this book in a more liberal way as a replacement for "Date and time," in the strictest sense, it should only be used for numeric formats, such as Unix timestamp or epoch, defined as the number of seconds since a particular time.

The usage of a timestamp that includes other formats is common anyway as it's a clear and understandable concept, but be sure to agree to formats when sharing information with others.

If native types are not enough, our own parsing can be defined, as demonstrated in step 7 of the How to do it section. Note that the definition of the price function gets a string and returns the proper format, in this case, a Decimal type.

All the issues about floats and price information described in the There's more section of the Extracting data from structured strings recipe apply here as well.

There's more…

The timestamp can also be translated into a delorean object for consistency. Also, delorean objects carry over time zone information. Adding the same structure as in the previous recipe gives the following object, which is capable of parsing logs:

import parse
from decimal import Decimal
import delorean
class PriceLog(object):
  def __init__(self, timestamp, product_id, price):
    self.timestamp = timestamp
    self.product_id = product_id
    self.price = price
  def __repr__(self):
    return '<PriceLog ({}, {}, {})>'.format(self.timestamp,
                                            self.product_id,
                                            self.price)
  @classmethod
  def parse(cls, text_log):
    '''
    Parse from a text log with the format
    [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
    to a PriceLog object
    '''
    # Custom converters used by the parse module for the named types
    def price(string):
      return Decimal(string)
    def isodate(string):
      return delorean.parse(string)
    FORMAT = ('[{timestamp:isodate}] - SALE - PRODUCT: {product:d} - '
              'PRICE: ${price:price}')
    formats = {'price': price, 'isodate': isodate}
    result = parse.parse(FORMAT, text_log, formats)
    return cls(timestamp=result['timestamp'],
               product_id=result['product'],
               price=result['price'])

So, parsing it returns similar results:

>>> log = '[2018-05-06T14:58:59.051545] - SALE - PRODUCT: 827 - PRICE: $22.25'
>>> PriceLog.parse(log)
<PriceLog (Delorean(datetime=datetime.datetime(2018, 6, 5, 14, 58, 59, 51545), timezone='UTC'), 827, 22.25)>

This code is contained in the GitHub file, https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/blob/master/Chapter01/price_log.py

All supported parse types can be found in the documentation at https://github.com/r1chardj0n3s/parse#format-specification.

See also

  • The Extracting data from structured strings recipe, covered earlier in this chapter, to learn how to use simple processes to get information from text.
  • The Introducing regular expressions recipe, covered later in this chapter, to learn how to detect and extract patterns from text.
  • The Going deeper into regular expressions recipe, covered later in this chapter, to further your knowledge of regular expressions.
 

Introducing regular expressions

A regular expression, or regex, is a pattern to match text. In other words, it allows us to define an abstract string (typically, the definition of a structured kind of text) to check with other strings to see if they match or not.

It is better to describe them with an example. Think of defining a pattern of text as a word that starts with an uppercase A and contains only lowercase "n"s and "a"s after that. Let's show some possible comparisons and results:

Text to compare    Result
Anna               Match
Bob                No match (No initial A)
Alice              No match (l is not n or a after initial A)
James              No match (No initial A)
Aaan               Match
Ana                Match
Annnn              Match
Aaaan              Match
ANNA               No match (N is not n or a)

Table 1.1: A pattern matching example

If this sounds complicated, that's because it is. Regexes can be notoriously complicated because they may be incredibly intricate and difficult to follow. But they are also very useful because they allow us to perform incredibly powerful pattern matching.

Some common uses of regexes are:

  • Validating input data: For example, a phone number that is only numbers, dashes, and brackets.
  • String parsing: Retrieve data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
  • Scraping: Find the occurrences of something in a long piece of text. For example, find all of the emails in a web page.
  • Replacement: Find and replace a word or words with others. For example, replace the owner with John Smith.

Getting ready

The Python module to deal with regexes is called re. The main function we'll cover is re.search(), which returns a match object with information about what matched the pattern.

As regex patterns are also defined as strings, we'll differentiate them by prefixing them with an r, such as r'pattern'. This is the Python way of marking a string as a raw string literal, meaning that the string within is taken literally, without any escaping. This means that a "\" is treated as a literal backslash instead of the start of an escape sequence. For example, without the r prefix, \n means a newline character.
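
The difference is easy to check by comparing the length of both kinds of string, as in this short sketch:

```python
# Without the r prefix, \n is a single newline character
newline_length = len('\n')

# With the r prefix, it is two characters: a backslash and an n
raw_length = len(r'\n')

# A raw string is equivalent to escaping each backslash by hand
same_string = r'\n' == '\\n'

print(newline_length, raw_length, same_string)
```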

Some characters are special and refer to concepts such as the end of the string, any digit, any character, any whitespace character, and so on.

The simplest form is just a literal string. For example, the regex pattern r'LOG' matches the string 'LOGS', but not the string 'NOT A MATCH'. If there's no match, re.search returns None. If there is, it returns a special Match object:

>>> import re
>>> re.search(r'LOG', 'LOGS')
<re.Match object; span=(0, 3), match='LOG'>
>>> re.search(r'LOG', 'NOT A MATCH')
>>>

How to do it…

  1. Import the re module:
    >>> import re
    
  2. Then, match a pattern that is not at the start of the string:
    >>> re.search(r'LOG', 'SOME LOGS')
    <re.Match object; span=(5, 8), match='LOG'>
    
  3. Match a pattern that is only at the start of the string. Note the ^ character:
    >>> re.search(r'^LOG', 'LOGS')
    <re.Match object; span=(0, 3), match='LOG'>
    >>> re.search(r'^LOG', 'SOME LOGS')
    >>>
    
  4. Match a pattern only at the end of the string. Note the $ character:
    >>> re.search(r'LOG$', 'SOME LOG')
    <re.Match object; span=(5, 8), match='LOG'>
    >>> re.search(r'LOG$', 'SOME LOGS')
    >>>
    
  5. Match the word 'thing' (not excluding things), but not something or anything. Note the \b at the start of the second pattern:
    >>> STRING = 'something in the things she shows me'
    >>> match = re.search(r'thing', STRING)
    >>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
    ('some', 'thing', ' in the things she shows me')
    >>> match = re.search(r'\bthing', STRING)
    >>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
    ('something in the ', 'thing', 's she shows me')
    
  6. Match a pattern that's only numbers and dashes (for example, a phone number). Retrieve the matched string:
    >>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
    <re.Match object; span=(20, 32), match='1234-567-890'>
    >>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890').group()
    '1234-567-890'
    
  7. Match an email address naively:
    >>> re.search(r'\S+@\S+', 'my email is john.smith@test.com').group()
    'john.smith@test.com'
    

How it works…

The re.search function matches a pattern, no matter its position in the string. As explained previously, this will return None if the pattern is not found, or a Match object if it is.

The following special characters are used:

  • ^: Marks the start of the string
  • $: Marks the end of the string
  • \b: Marks the start or end of a word
  • \S: Marks any character that's not a whitespace, including characters like * or $

More special characters are shown in the next recipe, Going deeper into regular expressions.

In step 6 of the How to do it section, the r'[0123456789-]+' pattern is composed of two parts. The first one is between square brackets, and matches any single character that is a digit from 0 to 9 or the dash (-) character. The + sign after that means that this character can be present one or more times. This is called a quantifier in regexes. This makes a match on any combination of digits and dashes, no matter how long it is.

Step 7 again uses the + sign to match as many characters as necessary before the @ and again after it. In this case, the character match is \S, which matches any non-whitespace character.

Please note that the pattern for emails described here is very naive, as it will match invalid emails such as john@smith@test.com. A better regex for most uses is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)". You can go to http://emailregex.com/ to find it, along with links to more information.

Note that parsing a valid email including corner cases is actually a difficult and challenging problem. The previous regex should be fine for most uses covered in this book, but in a general framework project such as Django, email validation is a very long and hard-to-read regex.
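
The difference between the naive pattern and a stricter one can be sketched as follows. The stricter pattern is the one listed for Python at emailregex.com, and both addresses are illustrative values, not real data:

```python
import re

# Naive pattern: non-whitespace, an @ sign, more non-whitespace
NAIVE = re.compile(r'\S+@\S+')

# Stricter pattern, assumed here to be the one published at emailregex.com
STRICTER = re.compile(r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)')

# Illustrative addresses: the second one has two @ signs
valid = 'john.smith@test.com'
invalid = 'john@smith@test.com'

naive_accepts_invalid = NAIVE.search(invalid) is not None
stricter_accepts_invalid = STRICTER.search(invalid) is not None
stricter_accepts_valid = STRICTER.search(valid) is not None

print(naive_accepts_invalid, stricter_accepts_invalid, stricter_accepts_valid)
```

Note that the stricter pattern is anchored with ^ and $, so it validates a whole string rather than finding an email inside a longer text.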

The resulting matching object returns the position where the matched pattern starts and ends (using the start and end methods), as shown in step 5, which splits the string into matched parts, showing the distinction between the two matching patterns.

The difference displayed in step 5 is a very common one. Trying to capture GP (as in General Practitioner, for a medical doctor) can end up capturing eggplant and bagpipe! Similarly, thing\b won't capture things. Be sure to test and make the proper adjustments, such as capturing \bGP\b for just the word GP.
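
The GP case can be checked with a short sketch. It uses re.findall, which returns the list of matched strings, and the re.IGNORECASE flag, both described in the next recipe. The sample sentence is made up for this example:

```python
import re

# Made-up sentence that contains GP as a word and inside other words
TEXT = 'the GP ate eggplant while playing the bagpipe'

# A case-insensitive search for the bare letters finds them inside words too
loose = re.findall(r'gp', TEXT, re.IGNORECASE)

# Word boundaries on both sides keep only the standalone word
strict = re.findall(r'\bgp\b', TEXT, re.IGNORECASE)

print(loose, strict)
```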

The specific matched pattern can be retrieved by calling group(), as shown in step 6. Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, such as by splitting the phone number into groups by dashes, for example:

>>> match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
>>> [int(n) for n in match.group().split('-')]
[1234, 567, 890]

There's more…

Dealing with regexes can be difficult and complex. Please allow time to test your matches and be sure that they work as you expect in order to avoid nasty surprises.

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

– Jamie Zawinski

Regular expressions are at their best when they are kept very simple. In general, if there is a specific tool to do it, prefer it over regexes. A very clear example of this is with HTML parsing; refer to Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.

Some text editors allow us to search using regexes as well. While most are editors aimed at writing code, such as Vim, BBEdit, or Notepad++, they're also present in more general tools, such as MS Office, Open Office, or Google Documents. But be careful, as the particular syntax may be slightly different.

You can check your regexes interactively with some tools. A good one that's freely available online is https://regex101.com/, which displays each of the elements and explains the regex. Double-check that you're using the Python flavor:

Figure 1.1: An example using RegEx101

Note that the EXPLANATION box in the preceding image describes that \b matches a word boundary (the start or end of a word), and that thing matches literally these characters.

Regexes, in some cases, can be very slow, or even susceptible to what's called a regex denial-of-service attack, a string created to confuse a particular regex so that it takes an enormous amount of time. In the worst-case scenario, it can even block the computer. While automating tasks probably won't get you into those problems, keep an eye out in case a regex takes too long to process.

See also

  • The Extracting data from structured strings recipe, covered earlier in the chapter, to learn simple techniques to extract information from text.
  • The Using a third-party tool—parse recipe, covered earlier in the chapter, to use a third-party tool to extract information from text.
  • The Going deeper into regular expressions recipe, covered later in the chapter, to further your knowledge of regular expressions.
 

Going deeper into regular expressions

In this recipe, we'll learn more about how to deal with regular expressions. After introducing the basics, we will dig a little deeper into pattern elements, introduce groups as a better way to retrieve and parse strings, learn how to search for multiple occurrences of the same string, and deal with longer texts.

How to do it…

  1. Import re:
    >>> import re
    
  2. Match a phone pattern as part of a group (in brackets). Note the use of \d as a special character for any digit:
    >>> match = re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
    >>> match.group()
    'the phone number is 1234-567-890'
    >>> match.group(1)
    '1234-567-890'
    
  3. Compile a pattern and capture a case-insensitive pattern with a yes|no option:
    >>> pattern = re.compile(r'The answer to question (\w+) is (yes|no)', re.IGNORECASE)
    >>> pattern.search('Naturally, the answer to question 3b is YES')
    <re.Match object; span=(11, 43), match='the answer to question 3b is YES'>
    >>> pattern.search('Naturally, the answer to question 3b is YES').groups()
    ('3b', 'YES')
    
  4. Match all the occurrences of cities and state abbreviations in the text. Note that they are separated by a single character, and the name of the city always starts with an uppercase letter. Only four states are matched for simplicity:
    >>> PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')
    >>> TEXT ='the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
    >>> list(PATTERN.finditer(TEXT))
    [<re.Match object; span=(31, 40), match='Odessa,TX'>, <re.Match object; span=(73, 85), match='Corvallis OR'>, <re.Match object; span=(113, 122), match='Toledo.OH'>, <re.Match object; span=(157, 172), match='Grand Rapids,MI'>]
    >>> _[0].groups()
    ('Odessa', 'TX')
    

How it works…

The new special characters that were introduced are as follows:

  • \d: Marks any digit (0 to 9).
  • \s: Marks any character that's a whitespace, including tabs and other whitespace special characters. Note that this is the reverse of \S, which was introduced in the previous recipe.
  • \w: Marks any letter (this includes digits and the underscore, but excludes characters such as periods).
  • . (dot): Marks any character (except a newline, by default).

Note that the same letter in uppercase or lowercase means the opposite match, for example, \d matches a digit, while \D matches a non-digit.

To define groups, enclose the relevant part of the pattern in parentheses. Groups can be retrieved individually. This makes them perfect for matching a bigger pattern that contains a variable part to be processed in the next step, as demonstrated in step 2. Note the difference with the step 6 pattern in the previous recipe. In this case, the pattern is not only the number, but it includes the prefix text, even if we then extract only the number:

>>> re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
<re.Match object; span=(4, 36), match='the phone number is 1234-567-890'>
>>> _.group(1)
'1234-567-890'
>>> re.search(r'[0123456789-]+', '37: the phone number is 1234-567-890')
<re.Match object; span=(0, 2), match='37'>
>>> _.group()
'37'

Remember that group 0 (.group() or .group(0)) is always the whole match. The rest of the groups are ordered as they appear.

Patterns can be compiled as well. This saves some time if the pattern needs to be matched over and over. To use it that way, compile the pattern and then use that object to perform searches, as shown in steps 3 and 4. Some extra flags can be added, such as making the pattern case insensitive.
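
A compiled pattern is typically defined once and reused in a loop. The following sketch applies a simplified version of the step 3 pattern to a few hypothetical lines invented for this example:

```python
import re

# Compile once and reuse; re.IGNORECASE is one of the available flags
PATTERN = re.compile(r'question (\w+) is (yes|no)', re.IGNORECASE)

# Hypothetical answers, used only to exercise the pattern
LINES = ['Question 1 is Yes', 'question 2a is NO', 'question 3 is maybe']

answers = {}
for line in LINES:
    match = PATTERN.search(line)
    if match:  # search returns None when the line does not match
        number, answer = match.groups()
        answers[number] = answer.lower()

print(answers)
```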

Step 4's pattern requires a little bit of explanation. It's composed of two groups, separated by a single character. The special character "." (dot) matches any character. In our example, it matches a period, a whitespace, and a comma. The second group is a straightforward selection of defined options, in this case, US state abbreviations.

The first group starts with an uppercase letter ([A-Z]) and accepts any combination of letters or spaces ([\w\s]+?), but not punctuation marks such as periods or commas. This matches the cities, including those that are composed of more than one word.

The final +? makes the match of letters non-greedy, matching as few characters as possible. This avoids some problems such as when there are no punctuation symbols between the cities. Take a look at the result where we don't include the non-greedy qualifier for the second match and how it includes two elements:

>>> PATTERN = re.compile(r'([A-Z][\w\s]+).(TX|OR|OH|MI)')
>>> TEXT ='the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
>>> list(PATTERN.finditer(TEXT))[1]
<re.Match object; span=(73, 122), match='Corvallis OR and the mud hens come from Toledo.OH'>

Note that this pattern starts on any uppercase letter and keeps matching until it finds a state, unless separated by a punctuation mark, which may not be what's expected, for example:

>>> re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test, Escanaba MI')
<re.Match object; span=(16, 27), match='Escanaba MI'>
>>> re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test with Escanaba MI')
<re.Match object; span=(0, 31), match='This is a test with Escanaba MI'>

Step 4 also shows you how to find more than one occurrence in a long text. While the .findall() method exists, it doesn't return the full match object, while .finditer() does. As is common in Python 3, .finditer() returns an iterator that can be used in a for loop or list comprehension. Note that .search() returns only the first occurrence of the pattern, even if more matches appear. Using the non-greedy pattern from step 4:

>>> PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')
>>> PATTERN.search(TEXT)
<re.Match object; span=(31, 40), match='Odessa,TX'>
>>> PATTERN.findall(TEXT)
[('Odessa', 'TX'), ('Corvallis', 'OR'), ('Toledo', 'OH'), ('Grand Rapids', 'MI')]
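
The practical difference between the two methods can be sketched with a shorter text. The sentence below is a fragment of the step 4 text:

```python
import re

PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')
TEXT = ('the knights are native of Corvallis OR and '
        'the mud hens come from Toledo.OH')

# findall returns only the captured groups, as tuples of strings
pairs = PATTERN.findall(TEXT)

# finditer yields Match objects, so the positions are available as well
spans = [(m.group(1), m.start(), m.end()) for m in PATTERN.finditer(TEXT)]

for city, start, end in spans:
    # Slicing the text with the span recovers the full matched string
    print(city, TEXT[start:end])
```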

There's more…

The special characters can be reversed if they are case swapped. For example, the reverse of the ones we used are as follows:

  • \D: Marks any non-digit.
  • \W: Marks any non-letter.
  • \B: Marks a position that's not at the start or end of a word. For example, r'thing\B' will match things but not thing.

The most commonly used special characters are typically \d (digits) and \w (letters and digits), as they mark common patterns to search for.
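
A short sketch of the reversed characters in action follows. The sample strings are reused from earlier examples in this chapter:

```python
import re

# \D+ grabs the run of non-digits, \d+ the run of digits that follows
match = re.search(r'(\D+)(\d+)', 'PRODUCT: 1345')
label, number = match.groups()

# \B requires that 'thing' is NOT followed by a word boundary
inside = re.search(r'thing\B', 'things')
at_end = re.search(r'thing\B', 'thing')

print(repr(label), number, inside is not None, at_end is not None)
```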

Groups can be assigned names as well. This makes them more explicit at the expense of making the group more verbose in the following shape— (?P<groupname>PATTERN). Groups can be referred to by name with .group(groupname) or by calling .groupdict() while maintaining its numeric position.

For example, the step 4 pattern can be described as follows:

>>> PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MI)')
>>> match = PATTERN.search(TEXT)
>>> match.groupdict()
{'city': 'Odessa', 'state': 'TX'}
>>> match.group('city')
'Odessa'
>>> match.group('state')
'TX'
>>> match.group(1), match.group(2)
('Odessa', 'TX')
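
Named groups combine well with .finditer(), as each match can be turned into a dictionary. This is a minimal sketch with a shortened, made-up text:

```python
import re

PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MI)')

# Shortened, made-up text with two city and state pairs
TEXT = 'teams play in Odessa,TX and in Grand Rapids,MI'

# One dictionary per match, keyed by group name
cities = [match.groupdict() for match in PATTERN.finditer(TEXT)]
print(cities)
```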

Regular expressions are a very extensive topic. There are whole technical books devoted to them and they can be notoriously deep. The Python documentation is a good reference to use (https://docs.python.org/3/library/re.html) and to learn more.

If you feel a little intimidated at the start, it's a perfectly natural feeling. Analyze each of the patterns with care, dividing them into smaller parts, and they will start to make sense. Don't be afraid to run a regex interactive analyzer!

Regexes can be really powerful and generic, but they may not be the proper tool for what you are trying to achieve. We've seen some caveats and patterns that have subtleties. As a rule of thumb, if a pattern starts to feel complicated, it's time to search for a different tool. Remember the previous recipes as well and the options they presented, such as parse.

See also

  • The Introducing regular expressions recipe, covered earlier in the chapter, to learn the basics of using regular expressions.
  • The Using a third-party tool—parse recipe, covered earlier in the chapter, to learn a different technique to extract information from text.
 

Adding command-line arguments

A lot of tasks can be best structured as a command-line interface that accepts different parameters to change the way it works, for example, scraping a web page from one URL or another, depending on the argument provided. Python includes a powerful argparse module in the standard library to create rich command-line argument parsing with minimal effort.

Getting ready

The basic use of argparse in a script can be shown in three steps:

  1. Define the arguments that your script is going to accept, generating a new parser.
  2. Call the defined parser, returning an object with all of the resulting arguments.
  3. Use the arguments to call the entry point of your script, which will apply the defined behavior.

Try to use the following general structure for your scripts:

IMPORTS
def main(main parameters):
    DO THINGS
if __name__ == '__main__':
    DEFINE ARGUMENT PARSER
    PARSE ARGS
    VALIDATE OR MANIPULATE ARGS, IF NEEDED
    main(arguments)

The main function makes it easy to know what the entry point for the code is. The section under the if statement is only executed if the file is called directly, but not if it's imported. We'll follow this for all the steps.

How to do it…

  1. Create a script that accepts a single integer as a positional argument and prints a hash symbol that number of times. The recipe_cli_step1.py script is as follows. Note that it follows the structure presented previously, and that the main function just prints the argument:
    import argparse
    def main(number):
        print('#' * number)
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('number', type=int, help='A number')
        args = parser.parse_args()
        
        main(args.number)
    
  2. Call the script and check how the parameter is presented. Calling the script with no arguments displays a short usage message and an error, because number is required. Use the automatic argument -h to display the extended help:
    $ python3 recipe_cli_step1.py
    usage: recipe_cli_step1.py [-h] number
    recipe_cli_step1.py: error: the following arguments are required: number
    $ python3 recipe_cli_step1.py -h
    usage: recipe_cli_step1.py [-h] number
    positional arguments:
      number      A number
    optional arguments:
      -h, --help  show this help message and exit
    
  3. Calling the script with the extra parameters works as expected:
    $ python3 recipe_cli_step1.py 4
    ####
    $ python3 recipe_cli_step1.py not_a_number
    usage: recipe_cli_step1.py [-h] number
    recipe_cli_step1.py: error: argument number: invalid int value: 'not_a_number'
    
  4. Change the script to accept an optional argument for the character to print. The default will be "#". The recipe_cli_step2.py script will look like this:
    import argparse
    def main(character, number):
        print(character * number)
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('number', type=int, help='A number')
        parser.add_argument('-c', type=str, help='Character to print',
                            default='#')
        args = parser.parse_args()
        main(args.c, args.number)
    
  5. The help is updated, and using the -c flag allows us to print different characters:
    $ python3 recipe_cli_step2.py -h
    usage: recipe_cli_step2.py [-h] [-c C] number
    positional arguments:
      number      A number
    optional arguments:
      -h, --help  show this help message and exit
      -c C        Character to print
    $ python3 recipe_cli_step2.py 4
    ####
    $ python3 recipe_cli_step2.py 5 -c m
    mmmmm
    
  6. Add a flag that changes the behavior when present. The recipe_cli_step3.py script is as follows:
    import argparse
    def main(character, number):
        print(character * number)
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('number', type=int, help='A number')
        parser.add_argument('-c', type=str, help='Character to print',
                            default='#')
        parser.add_argument('-U', action='store_true', default=False,
                            dest='uppercase',
                            help='Uppercase the character')
        args = parser.parse_args()
        if args.uppercase:
            args.c = args.c.upper()
        main(args.c, args.number)
    
  7. Calling it uppercases the character if the -U flag is added:
    $ python3 recipe_cli_step3.py 4 -c f
    ffff
    $ python3 recipe_cli_step3.py 4 -c f -U
    FFFF
    

How it works…

As shown in step 1 of the How to do it section, the arguments are added to the parser through .add_argument. Once all of the arguments are defined, calling parse_args() returns an object that contains the results (or exits if there's an error).
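To see both outcomes without leaving the interpreter, parse_args() can be given an explicit list of strings instead of reading the real command line. This is a minimal sketch that reuses the parser from step 1; the build_parser helper is a name introduced here for illustration only:

```python
import argparse


def build_parser():
    # Same argument as recipe_cli_step1.py (helper name is hypothetical)
    parser = argparse.ArgumentParser()
    parser.add_argument('number', type=int, help='A number')
    return parser


parser = build_parser()

# Passing a list replaces reading the arguments from sys.argv
args = parser.parse_args(['4'])
print(args.number)

# An invalid value makes argparse print the error and raise SystemExit
try:
    parser.parse_args(['not_a_number'])
except SystemExit as exc:
    exit_code = exc.code
    print('parser exited with code', exit_code)
```

Catching SystemExit is only useful for experiments and tests like this one. In a real script, letting the parser exit is normally the desired behavior.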

Each argument should add a help description, but their behavior can change greatly:

  • If an argument starts with a -, it is considered an optional parameter, like the -c argument in step 4. If not, it's a positional argument, like the number argument in step 1.
  • For clarity, always define a default value for optional parameters. It will be None if you don't, but this may be confusing.
  • Remember to always add a help parameter with a description of the parameter; help is automatically generated, as shown in step 2.
  • If a type is present, it will be validated, for example, number in step 3. By default, the type will be a string.
  • The actions store_true and store_false can be used to generate flags, that is, arguments that don't require any extra parameters. Set the corresponding default value as the opposite Boolean. This is demonstrated by the -U argument in steps 6 and 7.
  • The name of the property in the args object will be, by default, the name of the argument (without the dash, if it's present). You can change it with dest. For example, in step 6, the command-line argument -U is stored as uppercase.

Changing the name of an argument for internal usage is very useful when using short arguments, such as single letters. A good command-line interface will use -c, but, internally, it's probably a good idea to use a more verbose label, such as configuration_file. Remember, explicit is better than implicit!

  • Some arguments can work in coordination with others, as shown in step 6. Perform all of the required operations before calling the main function, so that it receives clear and concise parameters. For example, in step 6, only two parameters are passed, but one may have been modified.
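The points above can be checked with a short sketch. It reuses the arguments from the recipe, except that -c is deliberately left without a default to show the None value, and the -q flag (stored as show_output) is a hypothetical addition that illustrates store_false:

```python
import argparse

parser = argparse.ArgumentParser()
# A type is given, so the value is validated and converted to int
parser.add_argument('number', type=int, help='A number')
# No default on purpose: the attribute is None when -c is absent
parser.add_argument('-c', help='Character to print')
# store_true flag, stored under a more descriptive name through dest
parser.add_argument('-U', action='store_true', default=False,
                    dest='uppercase', help='Uppercase the character')
# Hypothetical store_false flag: the default is the opposite Boolean
parser.add_argument('-q', action='store_false', default=True,
                    dest='show_output', help='Do not print anything')

absent = parser.parse_args(['3'])
present = parser.parse_args(['3', '-c', 'x', '-U', '-q'])

print(absent.c, absent.uppercase, absent.show_output)
print(present.c, present.uppercase, present.show_output)
```

Note how the code after parsing reads args.uppercase and args.show_output, which is easier to follow than single-letter names.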

There's more…

You can create long arguments as well with double dashes, for example:

 parser.add_argument('-v', '--verbose', action='store_true', default=False,
                     help='Enable verbose output')

This will accept both -v and --verbose, and it will store the result under the name verbose.
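A quick way to confirm this is to parse both spellings. In this sketch, the --dry-run option is a hypothetical addition, included to show that argparse converts dashes inside a long name into underscores for the attribute name:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-v', '--verbose', action='store_true', default=False,
                    help='Enable verbose output')
# Hypothetical option: stored as dry_run, because "-" is not valid in a name
parser.add_argument('--dry-run', action='store_true', default=False,
                    help='Show what would be done without doing it')

short_form = parser.parse_args(['-v'])
long_form = parser.parse_args(['--verbose', '--dry-run'])

print(short_form.verbose, short_form.dry_run)
print(long_form.verbose, long_form.dry_run)
```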

Adding long names is a good way of making the interface more intuitive and easy to remember. It's easy to remember after a couple of times that there's a verbose option, and it starts with a v.

The main inconvenience when dealing with command-line arguments may be that you end up with too many of them. This creates confusion. Try to make your arguments as independent as possible and don't make too many dependencies between them; otherwise, handling the combinations can be tricky.
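When two options cannot be combined, argparse can enforce that for you with a mutually exclusive group, so the check doesn't have to be written by hand. This is only a sketch of one possible approach; the -L flag (stored as lowercase) is a hypothetical companion to the -U flag from the recipe:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('number', type=int, help='A number')

# Only one of the arguments in the group may appear on the command line
group = parser.add_mutually_exclusive_group()
group.add_argument('-U', action='store_true', default=False,
                   dest='uppercase', help='Uppercase the character')
group.add_argument('-L', action='store_true', default=False,
                   dest='lowercase', help='Lowercase the character')

args = parser.parse_args(['4', '-U'])
print(args.uppercase, args.lowercase)

# Using both flags is reported as an error, and the parser exits
try:
    parser.parse_args(['4', '-U', '-L'])
except SystemExit as exc:
    conflict_code = exc.code
    print('parser exited with code', conflict_code)
```

Anything more elaborate than this kind of simple exclusion is usually a sign that the interface has too many interdependent arguments.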

In particular, try to not create more than a couple of positional arguments, as they won't have mnemonics. Positional arguments also accept default values, but most of the time, that won't be the expected behavior.
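For reference, a positional argument only uses its default when it is also made optional, which is done with nargs='?'. The following sketch adapts the number argument from the recipe; the default of 1 is an arbitrary choice for illustration:

```python
import argparse

parser = argparse.ArgumentParser()
# nargs='?' means "zero or one value"; default is used when none is given
parser.add_argument('number', type=int, nargs='?', default=1,
                    help='A number (defaults to 1)')

without_value = parser.parse_args([])
with_value = parser.parse_args(['4'])

print(without_value.number)
print(with_value.number)
```

The script no longer complains when the number is missing, which is why this is rarely what a user expects from a positional argument.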

For advanced details, refer to the Python documentation of argparse (https://docs.python.org/3/library/argparse.html).

See also

  • The Creating a virtual environment recipe, covered earlier in this chapter, to learn how to create an environment for installing third-party modules.
  • The Installing third-party packages recipe, covered earlier in this chapter, to learn how to install and use external modules in the virtual environment.

About the Author

  • Jaime Buelta

    Jaime Buelta has been a full-time Python developer since 2010 and is a regular speaker at PyCon Ireland. He has been a professional programmer for over two decades, with exposure to many different technologies throughout his career. He has developed software for a variety of fields and industries, including aerospace, networking and communications, industrial SCADA systems, video game online services, and finance services.
