Numbers, Strings, and Tuples

We'll cover these recipes to introduce basic Python data types:

Creating meaningful names and using variables
Working with large and small integers
Choosing between float, decimal, and fraction
Choosing between true division and floor division
Rewriting an immutable string
String parsing with regular expressions
Building complex strings with "template".format()
Building complex strings from lists of characters
Using the Unicode characters that aren't on our keyboards
Encoding strings – creating ASCII and UTF-8 bytes
Decoding bytes – how to get proper characters from some bytes
Using tuples of items

Introduction

This chapter will look at some central types of Python objects. We'll look at the different kinds of numbers, working with strings, and using tuples. We'll look at these first because they're the simplest kinds of data Python works with. In later chapters, we'll look at data collections.

Most of these recipes assume a beginner's level of understanding of Python 3. We'll be looking at how we use the essential built-in types available in Python—numbers, strings, and tuples. Python has a rich variety of numbers, and two different division operators, so we'll need to look closely at the choices available to us.

When working with strings, there are several common operations that are important. We'll explore some of the differences between bytes—as used by our OS files, and Strings—as used by Python. We'll look at how we can exploit the full power of the Unicode character set.

In this chapter, we'll show the recipes as if we're working from the >>> prompt in interactive Python. This is sometimes called the read-eval-print loop (REPL). In later chapters, we'll look more closely at writing script files. The goal is to encourage interactive exploration because it's a great way to learn the language.

Creating meaningful names and using variables

How can we be sure our programs make sense? One of the core elements of making expressive code is to use meaningful names. But what counts as meaningful? In this recipe, we'll review some common rules for creating meaningful Python names.

We'll also look at some of Python's assignment statement variations. We can, for example, assign more than one variable in a single statement.

Getting ready

The core issue when creating a name is to ask ourselves the question what is this thing? For software, we'd like a name that's descriptive of the object being named. Clearly, a name like x is not very descriptive, it doesn't seem to refer to an actual thing.

Vague, non-descriptive names are distressingly common in some programming. It's not helpful to others when we use them. A descriptive name helps everyone.

When naming things, it's also important to separate the problem domain—what we're really trying to accomplish—from the solution domain. The solution domain consists of the technical details of Python, OS, and Internet. Anyone who reads the code can see the solution; it doesn't require deep explanation. The problem domain, however, can be obscured by technical details. It's our job to make the problem clearly visible. Well-chosen names will help.

How to do it...

We'll look at names first. Then we'll move on to assignment.

Choosing names wisely

On a purely technical level, Python names must begin with a letter. They can include any number of letters, digits, and the _ character. Python 3 is based on Unicode, so a letter is not limited to the Latin alphabet. While the A-Z Latin alphabet is commonly used, it's not required.

When creating a descriptive variable, we want to create names that are both specific and articulate the relationships among things in our programs. One widely used technique is to create longer names in a style that moves from particular to general.

The steps to choosing a name are as follows:

The last part of the name is a very broad summary of the thing. In a few cases, this may be all we need; context will supply the rest. We'll suggest some typical broad summary categories later.
Use a prefix to narrow this name around your application or problem domain.
If needed, put more narrow and specialized prefixes on this name to clarify how it's distinct from other classes, modules, packages, functions, and other objects. When in doubt about prefixing, remember how domain names work. Think of mail.google.com—the name flows from particular to general. There's no magic about the three levels of naming, but it often happens to work out that way.
Format the name depending on how it's used in Python. There are three broad classes of things we'll put names on, which are shown as follows:
- Classes: A class has a name that summarizes the objects that are part of the class. These names will (often) use CapitalizedCamelCase. The first letter of a class name is capitalized to emphasize that it's a class, not an instance of the class. A class is often a generic concept, rarely a description of a tangible thing.
- Objects: A name for an object usually uses snake_case - all lowercase with multiple _ characters between words. In Python, this includes variables, functions, modules, packages, parameters, attributes of objects, methods of classes, and almost everything else.
- Script and module files: These are really the OS resources, as seen by Python. Therefore, a filename should follow the conventions for Python objects, using letters, the _ characters and ending with the .py extension. It's technically possible to have pretty wild and free filenames. Filenames that don't follow Python rules can be difficult to use as a module or package.

How do we choose the broad category part of a name? The general category depends on whether we're talking about a thing or a property of a thing. While the world is full of things, we can create some board groupings that are helpful. Some of the examples are Document, Enterprise, Place, Program, Product, Process, Person, Asset, Rule, Condition, Plant, Animal, Mineral, and so on.

We can then narrow these with qualifiers:

    FinalStatusDocument
    ReceivedInventoryItemName

The first example is a class called Document. We've narrowed it slightly by adding a prefix to call it a StatusDocument. We narrowed it even further by calling it a FinalStatusDocument. The second example is a Name that we narrowed by specifying that it's a ReceivedInventoryItemName. This example required a four-level name to clarify the class.

An object often has properties or attributes. These have a decomposition based in the kind of information that's being represented. Some examples of terms that should be part of a complete name are amount, code, identifier, name, text, date, time, datetime, picture, video, sound, graphic, value, rate, percent, measure, and so on.

The idea is to put the narrow, more detailed description first, and the broad kind of information last:

    measured_height_value
    estimated_weight_value
    scheduled_delivery_date
    location_code

In the first example, height narrows a more general representation term value. And measured_height_value further narrows this. Given this name, we can expect to see other variations on height. Similar thinking applies to weight_value, delivery_date and location_code. Each of these has a narrowing prefix or two.

Some things to avoid: Don't include detailed technical type information using coded prefixes or suffixes. This is often called Hungarian Notation; we don't use f_measured_height_value where the f is supposed to mean a floating-point. A variable like measured_height_value can be any numeric type and Python will do all the necessary conversions. The technical decoration doesn't offer much help to someone reading our code, because the type specification can be misleading or even incorrect. Don't waste a lot of effort forcing names to look like they belong together. We don't need to make SpadesCardSuit, ClubsCardSuit, and so on. Python has many different kinds of namespaces, including packages, modules, and classes, as well as namespace objects to gather related names together. If you combine these names in a CardSuit class, you can use CardSuit.Spades, which uses the class as namespace to separate these names from other, similar names.

Assigning names to objects

Python doesn't use static variable definitions. A variable is created when a name is assigned to an object. It's important to think of the objects as central to our processing, and variables as little more than sticky notes that identify an object. Here's how we use the fundamental assignment statement:

Create an object. In many of the examples we'll create objects as literals. We'll use 355 or 113 as literal representations of integer objects in Python. We might use a string like FireBrick or a tuple like (178, 34, 34).
Write the following kind of statement: variable = object. Here are some examples:

      >>> circumference_diameter_ratio = 355/113
      >>> target_color_name = 'FireBrick'
      >>> target_color_rgb = (178, 34, 34)

We've created some objects and assigned them to variables. The first object is the result of a calculation. The next two objects are simple literals. Generally, objects are created by expressions that involve functions or classes.

This basic statement isn't the only kind of assignment. We can assign a single object to multiple variables using a kind of duplicated assignment like this:

>>> target_color_name = first_color_name = 'FireBrick'

This creates two names for the same string object. We can confirm this by checking the internal ID values that Python uses:

>>> id(target_color_name) == id(first_color_name)
True

This comparison shows us that the internal identifiers for these two objects are the same.

A test for equality uses ==. Simple assignment uses =.

When we look at numbers and collections, we'll see that we can combine assignment with an operator. We can do things like this:

>>> total_count = 0
>>> total_count += 5
>>> total_count += 6
>>> total_count
11

We've augmented assignment with an operator. total_count += 5 is the same as total_count = total_count + 5. This technique has the advantage of being shorter.

How it works...

This approach to creating names follows the pattern of using narrow, more specific qualifiers first and the wider, less-specific category last. This follows the common convention used for domain names and e-mail addresses.

For example, a domain name like mail.google.com has a specific service, a more general enterprise, and finally a very general domain. This follows the principle of narrow-to-wider.

As another example, service@packtpub.com starts with a specific destination name, has a more general enterprise, and finally a very general domain. Even the name of destination (PacktPub) is a two-part name with a narrow enterprise name (Packt) followed by a wider industry (Pub, short for publishing). (We don't agree with those who suggest it stands for Public House.)

The assignment statement is the only way to put a name on an object. We noted that we can have two names for the same underlying object. This isn't too useful right now. But in Chapter 4, Built-in Data Structures – list, set, dict we'll see some interesting consequences of multiple names for a single object.

There's more...

We'll try to show descriptive names in all of the recipes.

We have to grant exceptions to existing software which doesn't follow this pattern. It's often better to be consistent with legacy software than impose new rules even if the new rules are better.

Almost every example will involve assignment to variables. It's central to stateful object-oriented programming.

We'll look at classes and class names in Chapter 6, Basics of Classes and Objects; we'll look at modules in Chapter 11, Application Integration.

Working with large and small integers

Many programming languages make a distinction between integers, bytes, and long integers. Some languages include distinctions for signed versus unsigned integers. How do we map these concepts to Python?

The easy answer is that we don't. Python handles integers of all sizes in a uniform way. From bytes to immense numbers with hundreds of digits, it's all just integers to Python.

Getting ready

Imagine you need to calculate something really big. For example, calculate the number of ways to permute the cards in a 52-card deck. The number 52! = 52 × 51 × 50 × ... × 2 × 1, is a very, very large number. Can we do this in Python?

How to do it...

Don't worry. Really. Python behaves as if it has one universal type of integer, and this covers all of the bases from bytes to numbers that fill all of the memory. Here are the steps to using integers properly:

Write the numbers you need. Here are some smallish numbers: 355, 113. There’s no practical upper limit.

Creating a very small value—a single byte—looks like this:

      >>> 2
      2

Or perhaps this, if you want to use base 16:

      >>> 0xff
      255

In later recipes, we'll look at a sequence of bytes that has only a single value in it:

      >>> b'\xfe'
      b'\xfe'

This isn't—technically—an integer. It has a prefix of b' that shows us it's a 1-byte sequence.

Creating a much, much bigger number with a calculation might look like this:

      >>> 2**2048 
      323...656

This number has 617 digits. We didn't show all of them.

How it works...

Internally, Python uses two kinds of numbers. The conversion between these two is seamless and automatic.

For smallish numbers, Python will generally use 4 or 8 byte integer values. Details are buried in CPython's internals, and depend on the facilities of the C-compiler used to build Python.

For largish numbers, over sys.maxsize, Python switches to large integer numbers which are sequences of digits. Digit, in this case, often means a 30-bit value.

How many ways can we permute a standard deck of 52 cards? The answer is 52! ≈ 8 × 10⁶⁷. Here's how we can compute that large number. We'll use the factorial function in the math module, shown as follows:

>>> import math
>>> math.factorial(52)
80658175170943878571660636856403766975289505440883277824000000000000

Yes, these giant numbers work perfectly.

The first parts of our calculation of 52! (from 52 × 51 × 50 × ... down to about 42) could be performed entirely using the smallish integers. After that, the rest of the calculation had to switch to largish integers. We don't see the switch; we only see the results.

For some of the details on the internals of integers, we can look at this:

>>> import sys
>>> import math
>>> math.log(sys.maxsize, 2)
63.0
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)

The sys.maxsize value is the largest of the small integer values. We computed the log to base 2 to find out how many bits are required for this number.

This tells us that our Python uses 63-bit values for small integers. The range of smallish integers is from -2⁶⁴ ... 2⁶³ - 1. Outside this range, largish integers are used.

The values in sys.int_info tells us that large integers are a sequence of numbers that use 30-bit digits, and each of these digits occupies 4 bytes.

A large value like 52! consists of 8 of these 30-bit-sized digits. It can be a little confusing to think of a digit as requiring 30 bits to represent. Instead of 10 symbols used to represent base 10 numbers, we'd need 2**30 distinct symbols for each digit of these large numbers.

A calculation involving a number of big integer values can consume a fair bit of memory. What about small numbers? How can Python manage to keep track of lots of little numbers like one and zero?

For the commonly used numbers (-5 to 256) Python actually creates a secret pool of objects to optimize memory management. You can see this when you check the id() value for integer objects:

>>> id(1)
4297537952
>>> id(2)
4297537984
>>> a=1+1
>>> id(a)
4297537984

We've shown the internal id for the integer 1 and the integer 2. When we calculate a value, the resulting object turns out to be the same integer 2 object that was found in the pool.

When you try this, your id() values may be different. However, every time the value of 2 is used, it will be the same object; on the author's laptop, it's id = 4297537984. This saves having many, many copies of the 2 object cluttering up memory.

Here's a little trick for seeing exactly how huge a number is:

>>> len(str(2**2048))
617

We created a string from a calculated number. Then we asked what the length of the string was. The response tells us that the number had 617 digits.

There's more...

Python offers us a broad set of arithmetic operators: +, -, *, /, //, %, and **. The / and // are for division; we'll look at these in a separate recipe named Choosing between true division and floor division. The ** raises a number to a power.

For dealing with individual bits, we have some additional operations. We can use &, ^, |, <<, and >>. These operators work bit-by-bit on the internal binary representations of integers. These compute a binary AND, a binary Exclusive OR, Inclusive OR, Left Shift, and Right Shift respectively.

While these will work on very big integers, they don't really make much sense outside the world of individual bytes. Some binary files and network protocols will involve looking at the bits within an individual byte of data.

We can play around with these operators by using the bin() function to see what's going on.

Here's a quick example of what we mean:

>>> xor = 0b0011 ^ 0b0101
>>> bin(xor)
'0b110'

We've used 0b0011 and 0b0101 as our two strings of bits. This helps to clarify precisely what the two numbers have as their binary representation. We applied the exclusive or (^) operator to these two sequences of bits. We used the bin() function to see the result as a string of bits. We can carefully line up the bits to see what the operator did.

We can decompose a byte into portions. Say we want to separate the left-most two bits from the other six bits. One way to do this is with bit-fiddling expressions like these:

>>> composite_byte = 0b01101100
>>> bottom_6_mask =  0b00111111
>>> bin(composite_byte >> 6)
'0b1'
>>> bin(composite_byte & bottom_6_mask)
'0b101100'

We've defined a composite byte which has 01 in the most significant two bits, and 101100 in the least significant six bits. We used the >> shift operator to shift the value by six positions, removing the least significant bits and preserving the two most significant bits. We used the & operator with a mask. Where the mask has 1 bit, a position's value is preserved in the result, where a mask has 0 bits, the result position is set to 0.

Choosing between float, decimal, and fraction

Python offers us several ways to work with rational numbers and approximations of irrational numbers. We have three basic choices:

Float
Decimal
Fraction

With so many choices, when do we use each of these?

Getting ready

It's important to be sure about our core mathematical expectations. If we're not sure what kind of data we have, or what kinds of results we want to get, we really shouldn't be coding. We need to take a step back and review things with pencil and paper.

There are three general cases for math that involve numbers beyond integers, which are:

Currency: Dollars, cents, or euros. Currency generally has a fixed number of decimal places. There are rounding rules used to determine what 7.25% of $2.95 is.
Rational Numbers or Fractions: When we're working with American units for feet and inches, or cooking measurements in cups and fluid ounces, we often need to work in fractions. When we scale a recipe that serves eight, for example, down to five people, we're doing fractional math using a scaling factor of 5/8 . How do we apply this to 2/3 cup of rice and still get a measurement that fits an American kitchen gadget?
Irrational Numbers: This includes all other kinds of calculations. It's important to note that digital computers can only approximate these numbers, and we'll occasionally see odd little artifacts of this approximation. The float approximations are very fast, but sometimes suffer from truncation issues.

When we have one of the first two cases, we should avoid floating-point numbers.

How to do it...

We'll look at each of the three cases separately. First, we'll look at computing with currency. Then we'll look at rational numbers, and finally irrational or floating-point numbers. Finally, we'll look at making explicit conversions among these various types.

Doing currency calculations

When working with currency, we should always use the decimal module. If we try to use Python's built-in float values, we'll have problems with rounding and truncation of numbers.

To work with currency, we'll do this. Import the Decimal class from the decimal module:

      >>> from decimal import Decimal

Create Decimal objects from strings or integers:

      >>> from decimal import Decimal
      >>> tax_rate = Decimal('7.25')/Decimal(100)
      >>> purchase_amount = Decimal('2.95')
      >>> tax_rate * purchase_amount
      Decimal('0.213875')

We created the tax_rate from two Decimal objects. One was based on a string, the other based on an integer. We could have used Decimal('0.0725') instead of doing the division explicitly.

The result is a hair over $0.21. It's computed out correctly to the full number of decimal places.

If you try to create decimal objects from floating-point values, you'll see unpleasant artifacts of float approximations. Avoid mixing Decimal and float. To round to the nearest penny, create a penny object:

      >>> penny=Decimal('0.01')

Quantize your data using this penny object:

      >>> total_amount = purchase_amount + tax_rate*purchase_amount
      >>> total_amount.quantize(penny)
      Decimal('3.16')

This shows how we can use the default rounding rule of ROUND_HALF_EVEN.

Every financial wizard has a different style of rounding. The Decimal module offers every variation. We might, for example, do something like this:

>>> import decimal
>>> total_amount.quantize(penny, decimal.ROUND_UP)
Decimal('3.17')

This shows the consequences of using a different rounding rule.

Fraction calculations

When we're doing calculations that have exact fraction values, we can use the fractions module. This provides us handy rational numbers that we can use. To work with fractions, we’ll do this:

Import the Fraction class from the fractions module:

      >>> from fractions import Fraction

Create Fraction objects from strings, integers, or pairs of integers. If you create fraction objects from floating-point values, you may see unpleasant artifacts of float approximations. When the denominator is a power of 2, things can work out exactly:

      >>> from fractions import Fraction
      >>> sugar_cups = Fraction('2.5')
      >>> scale_factor = Fraction(5/8)
      >>> sugar_cups * scale_factor
      Fraction(25, 16)

We created one fraction from a string, 2.5. We created the second fraction from a floating-point calculation, 5/8. Because the denominator is a power of 2, this works out exactly.

The result, 25/16, is a complex-looking fraction. What's a nearby fraction that might be simpler?

    >>> Fraction(24,16)
    Fraction(3, 2)

We can see that we'll use almost a cup and a half to scale the recipe for five people instead of eight.

Floating-point approximations

Python's built-in float type is capable of representing a wide variety of values. The trade-off here is that float often involves an approximation. In some cases—specifically when doing division that involves powers of 2—it can be as exact as a fraction. In all other cases, there may be small discrepancies that reveal the differences between the implementation of float and the mathematical ideal of an irrational number.

To work with float, we often need to round values to make them look sensible. Recognize that all calculations are an approximation:

      >>> (19/155)*(155/19)
      0.9999999999999999

Mathematically, the value should be 1. Because of the approximations used for float, the answer isn't exact. It's not wrong by much, but it's wrong. When we round appropriately, the value is more useful:

      >>> answer= (19/155)*(155/19)
      >>> round(answer, 3)
      1.0

Know the error term. In this case, we know what the exact answer is supposed to be, so we can compare our calculation with the known correct answer. This gives us the general error value that can creep into floating-point numbers:

      >>> 1-answer
      1.1102230246251565e-16

For most floating-point errors, this is the typical value—about 10^-16. Python has clever rules that hide this error some of the time by doing some automatic rounding. For this calculation, however, the error wasn't hidden.

This is a very important consequence.

Don't compare floating-point values for exact equality.

When we see code that uses an exact == test between floating-point numbers, there are going to be problems when the approximations differ by a single bit.

Converting numbers from one type to another

We can use the float() function to create a float value from another value. It looks like this:

>>> float(total_amount)
3.163875
>>> float(sugar_cups * scale_factor)
1.5625

In the first example, we converted a Decimal value to float. In the second example, we converted a Fraction value to float.

As we just saw, we're never happy trying to convert float to Decimal or Fraction:

>>> Fraction(19/155)
Fraction(8832866365939553, 72057594037927936)
>>> Decimal(19/155)
Decimal('0.12258064516129031640279123394066118635237216949462890625')

In the first example, we did a calculation among integers to create a float value that has a known truncation problem. When we created a Fraction from that truncated float value, we got some terrible looking numbers that exposed the details of the truncation.

Similarly, the second example tried to create a Decimal value from a float.

How it works...

For these numeric types, Python offers us a variety of operators: +, -, *, /, //, %, and **. These are for addition, subtraction, multiplication, true division, truncated division, modulus, and raising to a power. We'll look at the two division operators in the Choosing between true division and floor division recipe.

Python is adept at converting numbers between the various types. We can mix int and float values; the integers will be promoted to floating-point to provide the most accurate answer possible. Similarly, we can mix int and Fraction and the results will be Fractions. We can also mix int and Decimal. We cannot casually mix Decimal with float or Fraction; we need to provide explicit conversions.

It's important to note that float values are really approximations. The Python syntax allows us to write numbers as decimal values; that's not how they're processed internally.

We can write a value like this in Python, using ordinary base-10 values:

>>> 8.066e+67
8.066e+67

The actual value used internally will involve a binary approximation of the decimal value we wrote.

The internal value for this example, 8.066e+67, is this:

>>> 6737037547376141/2**53*2**226
8.066e+67

The numerator is a big number, 6737037547376141. The denominator is always 2⁵³. Since the denominator is fixed, the resulting fraction can only have 53 meaningful bits of data. Since more bits aren't available, values might get truncated. This leads to tiny discrepancies between our idealized abstraction and actual numbers. The exponent (2²²⁶) is required to scale the fraction up to the proper range.

Mathematically, 6737037547376141 * 2²²⁶/2⁵³.

We can use math.frexp() to see these internal details of a number:

>>> import math
>>> math.frexp(8.066E+67)
(0.7479614202861186, 226)

The two parts are called the mantissa and the exponent. If we multiply the mantissa by 2⁵³, we always get a whole number, which is the numerator of the binary fraction.

The error we noticed earlier matches this quite nicely: 10^-16 ≈ 2^-53 .

Unlike the built-in float, a Fraction is an exact ratio of two integer values. As we saw in the Working with large and small integers recipe, integers in Python can be very large. We can create ratios which involve integers with a large number of digits. We're not limited by a fixed denominator.

A Decimal value, similarly, is based on a very large integer value, and a scaling factor to determine where the decimal place goes. These numbers can be huge and won't suffer from peculiar representation issues.

Why use floating-point? Two reasons: Not all computable numbers can be represented as fractions. That's why mathematicians introduced (or perhaps discovered) irrational numbers. The built-in float type is as close as we can get to the mathematical abstraction of irrational numbers. A value like √2, for example, can't be represented as a fraction. Also, float values are very fast.

There's more...

The Python math module contains a number of specialized functions for working with floating-point values. This module includes common functions such as square root, logarithms, and various trigonometry functions. It has some other functions such as gamma, factorial, and the Gaussian error function.

The math module includes several functions that can help us do more accurate floating-point calculations. For example, the math.fsum() function will compute a floating-point sum more carefully than the built-in sum() function. It's less susceptible to approximation issues.

We can also make use of the math.isclose() function to compare two floating-point values to see if they're nearly equal:

>>> (19/155)*(155/19) == 1.0
False
>>> math.isclose((19/155)*(155/19), 1)
True

This function provides us with a way to compare floating-point numbers meaningfully.

Python also offers complex data. This involves a real and an imaginary part. In Python, we write 3.14+2.78j to represent the complex number 3.14 + 2.78 √-1. Python will comfortably convert between float and complex. We have the usual group of operators available for complex numbers.

To support complex numbers, there's a cmath package. The cmath.sqrt() function, for example, will return a complex value rather than raise an exception when extracting the square root of a negative number. Here's an example:

>>> math.sqrt(-2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error
>>> cmath.sqrt(-2)
1.4142135623730951j

This is essential when working with complex numbers.

Choosing between true division and floor division

Python offers us two kinds of division operators. What are they, and how do we know which one to use? We'll also look at the Python division rules and how they apply to integer values.

Getting ready

There are several general cases for doing division:

A div-mod pair: We want two parts—the quotient and the remainder. We often use this when converting values from one base to another. When we convert seconds to hours, minutes, and seconds, we'll be doing a div-mod kind of division. We don't want the exact number of hours, we want a truncated number of hours, the remainder will be converted to minutes and seconds.
The true value: This is a typical floating-point value—it will be a good approximation to the quotient. For example, if we're computing an average of several measurements, we usually expect the result to be floating-point, even if the input values are all integers.
A rational fraction value: This is often necessary when working in American units of feet, inches, and cups. For this, we should be using the Fraction class. When we divide Fraction objects, we always get exact answers.

We need to decide which of these cases apply, so we know which division operator to use.

How to do it...

We'll look at the three cases separately. First we'll look at truncated floor division. Then we'll look at true floating-point division. Finally, we'll look at division of fractions.

Doing floor division

When we are doing the div-mod kind of calculations, we might use floor division, //, and modulus, %. Or, we might use the divmod() function.

We'll divide the number of seconds by 3600 to get the value of hours; the modulus, or remainder, can be converted separately to minutes and seconds:

      >>> total_seconds = 7385
      >>> hours = total_seconds//3600
      >>> remaining_seconds = total_seconds % 3600

Again, using remaining values, we'll divide the number of seconds by 60 to get minutes; the remainder is a number of seconds less than 60:

      >>> minutes = remaining_seconds//60
      >>> seconds = remaining_seconds % 60
      >>> hours, minutes, seconds
      (2, 3, 5)

Here's the alternative, using the divmod() function:

Compute quotient and remainder at the same time:

      >>> total_seconds = 7385
      >>> hours, remaining_seconds = divmod(total_seconds, 3600)

Compute quotient and remainder again:

      >>> minutes, seconds = divmod(remaining_seconds, 60)
      >>> hours, minutes, seconds
      (2, 3, 5)

Doing true division

A true value calculation gives as a floating-point approximation. For example, about how many hours is 7386 seconds? Divide using the true division operator:

>>> total_seconds = 7385
>>> hours = total_seconds / 3600
>>> round(hours,4)
2.0514

We provided two integer values, but got a floating-point exact result. Consistent with our previous recipe for using floating-point values, we rounded the result to avoid having to look at tiny error values.

This true division is a feature of Python 3. We'll look at this from a Python 2 perspective in the next sections.

Rational fraction calculations

We can do division using Fraction objects and integers. This forces the result to be a mathematically exact rational number:

Create at least one Fraction value:

      >>> from fractions import Fraction
      >>> total_seconds = Fraction(7385)

Use the Fraction value in a calculation. Any integer will be promoted to a Fraction:

      >>> hours = total_seconds / 3600
      >>> hours
      Fraction(1477, 720)

If necessary, convert the exact fraction to a floating-point approximation:

      >>> round(float(hours),4)
      2.0514

First, we created a Fraction object for the total number of seconds. When we do arithmetic on fractions, Python will promote any integers to be fractions; this promotion means that the math is done as exactly as possible.

How it works...

Python 3 has two division operators.

The / true division operator always tries to produce a true, floating-point result. It does this even when the two operands are integers. This is an unusual operator in this respect. All other operators try to preserve the type of the data. The true division operation - when applied to integers - produces a float result.
The // truncated division operator always tries to produce a truncated result. For two integer operands, this is the truncated quotient. For two floating-point operands, this is a truncated floating-point result:

      >>> 7358.0 // 3600.0
      2.0

By default, Python 2 only has one division operator. For programmers still using Python 2, we can start using these new division operators with this:

>>> from __future__ import division

This import will install the Python 3 division rules.

Rewriting an immutable string

How can we rewrite an immutable string? We can't change individual characters inside a string:

    >>> title = "Recipe 5: Rewriting, and the Immutable String"
    >>> title[8]= ''
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' object does not support item assignment

Since this doesn't work, how do we make a change to a string?

Getting ready

Let's assume we have a string like this:

>>> title = "Recipe 5: Rewriting, and the Immutable String"

We'd like to do two transformations:

Remove the part before the :
Replace the punctuation with _, and make all the characters lowercase

Since we can't replace characters in a string object, we have to work out some alternatives. There are several common things we can do, shown as follows:

A combination of slicing and concatenating a string to create a new string.
When shortening, we often use the partition() method.
We can replace a character or a substring with the replace() method.
We can expand the string into a list of characters, then join the string back into a single string again. This is the subject for a separate recipe, Building complex strings with a list of characters.

How to do it...

Since we can't update a string in place, we have to replace the string variable's object with each modified result. We'll use a statement that looks like this:

    some_string = some_string.method()

Or we could even use:

    some_string = some_string[:chop_here]

We'll look at a number of specific variations on this general theme. We'll slice a piece of a string, we'll replace individual characters within a string, and we'll apply blanket transformations such as making the string lowercase. We'll also look at ways to remove extra _ that show up in our final string.

Slicing a piece of a string

Here's how we can shorten a string via slicing:

Find the boundary:

      >>> colon_position = title.index(':')

The index function locates a particular substring and returns the position where that substring can be found. If the substring doesn't exist, it raises an exception. This is always true of the result title[colon_position] == ':'.

Pick the substring:

      >>> discard_text, post_colon_text = title[:colon_position], title[colon_position+1:]
      >>> discard_text
      'Recipe 5'
      >>> post_colon_text
      ' Rewriting, and the Immutable String'

We've used the slicing notation to show the start:end of the characters to pick. We also used multiple assignment to assign two variables, discard_text and post_colon_text, from two expressions.

We can use partition() as well as manual slicing. Find the boundary and partition:

>>> pre_colon_text, _, post_colon_text = title.partition(':')
>>> pre_colon_text
'Recipe 5'
>>> post_colon_text
' Rewriting, and the Immutable String'

The partition function returns three things: the part before the target, the target, and the part after the target. We used multiple assignment to assign each object to a different variable. We assigned the target to a variable named _ because we're going to ignore that part of the result. This is a common idiom for places where we must provide a variable, but we don't care about using the object.

Updating a string with a replacement

We can use replace() to remove punctuation marks. When using replace to switch punctuation marks, save the results back into the original variable. In this case, post_colon_text:

>>> post_colon_text = post_colon_text.replace(' ', '_')
>>> post_colon_text = post_colon_text.replace(',', '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'

This has replaced the two kinds of punctuation with the desired _ characters. We can generalize this to work with all punctuation. This leverages the for statement, which we'll look at in Chapter 2, Statements and Syntax.

We can iterate through all punctuation characters:

>>> from string import whitespace, punctuation
>>> for character in whitespace + punctuation:
...     post_colon_text = post_colon_text.replace(character, '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'

As each kind of punctuation character is replaced, we assign the latest and greatest version of the string to the post_colon_text variable.

Making a string all lowercase

Another transformational step is changing a string to all lowercase. As with the previous examples, we'll assign the results back to the original variable. Use the lower() method, assigning the result to the original variable:

>>> post_colon_text = post_colon_text.lower()

Removing extra punctuation marks

In many cases, there are some additional steps we might follow. We often want to remove leading and trailing _ characters. We can use strip() for this:

>>> post_colon_text = post_colon_text.strip('_')

In some cases, we'll have multiple _ characters because we had multiple punctuation marks. The final step would be something like this to cleanup up multiple _ characters:

>>> while '__' in post_colon_text:
...    post_colon_text = post_colon_text.replace('__', '_')

This is yet another example of the same pattern we've been using to modify a string in place. This depends on the while statement, which we'll look at in Chapter 2, Statements and Syntax.

How it works...

We can't—technically—modify a string in place. The data structure for a string is immutable. However, we can assign a new string back to the original variable. This technique behaves the same as modifying a string in place.

When a variable's value is replaced, the previous value no longer has any references and is garbage collected. We can see this by using the id() function to track each individual string object:

>>> id(post_colon_text) 
4346207968
>>> post_colon_text = post_colon_text.replace('_','-')
>>> id(post_colon_text) 
4346205488

Your actual id numbers may be different. What's important is that the original string object assigned to post_colon_text had one id. The new string object assigned to post_colon_text has a different id. It's a new string object.

When the old string has no more references, it is removed from memory automatically.

We made use of slice notation to decompose a string. A slice has two parts: [start:end]. A slice always includes the starting index. String indices always start with zero as the first item. It never includes the ending index.

The items in a slice have an index from start to end-1. This is sometimes called a half-open interval.

Think of a slice like this: all characters where the index, i, are in the range start ≤ i < end.

We noted briefly that we can omit the start or end indices. We can actually omit both. Here are the various options available:

title[colon_position]: A single item, the : we found using title.index(':').
title[:colon_position]: A slice with the start omitted. It begins at the first position, index of zero.
title[colon_position+1:]: A slice with the end omitted. It ends at the end of the string, as if we said len(title).
title[:]: Since both start and end are omitted, this is the entire string. Actually, it's a copy of the entire string. This is the quick and easy way to duplicate a string.

There's more...

There are more features to indexing in Python collections like a string. The normal indices start with 0 at the left end. We have an alternate set of indices using negative names that work from the right end of a string.

title[-1] is the last character in the title, g
title[-2] is the next-to-last character, n
title[-6:] is the last six characters, String

We have a lot of ways to pick pieces and parts out of a string.

Python offers dozens of methods for modifying a string. Section 4.7 of the Python Standard Library describes the different kinds of transformations that are available to us. There are three broad categories of string methods. We can ask about a string, we can parse a string, and we can transform a string. Methods such as isnumeric() tell us if a string is all digits.

Here's an example:

>>> 'some word'.isnumeric()
False
>>> '1298'.isnumeric()
True

We've looked at parsing with the partition() method. And we've looked at transforming with the lower() method.

String parsing with regular expressions

How do we decompose a complex string? What if we have complex, tricky punctuation? Or—worse yet—what if we don't have punctuation, but have to rely on patterns of digits to locate meaningful information?

Getting ready

The easiest way to decompose a complex string is by generalizing the string into a pattern and then writing a regular expression that describes that pattern.

There are limits to the patterns that regular expressions can describe. When we're confronted with deeply-nested documents in a language like HTML, XML, or JSON, we often run into problems, and can't use regular expressions.

The re module contains all of the various classes and functions we need to create and use regular expressions.

Let's say that we want to decompose text from a recipe website. Each line looks like this:

>>> ingredient = "Kumquat: 2 cups"

We want to separate the ingredient from the measurements.

How to do it...

To write and use regular expressions, we often do this:

Generalize the example. In our case, we have something that we can generalize as:

      (ingredient words): (amount digits) (unit words)

We've replaced literal text with a two-part summary: what it means and how it's represented. For example, ingredient is represented as words, amount is represented as digits. Import the re module:

      >>> import re

Rewrite the pattern into Regular Expression (RE) notation:

      >>> pattern_text = r'(?P<ingredient>\w+):\s+(?P<amount>\d+)\s+(?P<unit>\w+)'

We've replaced representation hints such as words with \w+. We've replaced digits with \d+. And we've replaced single spaces with \s+ to allow one or more spaces to be used as punctuation. We've left the colon in place, because in the regular expression notation, a colon matches itself.

For each of the fields of data, we've used ?P<name> to provide a name that identifies the data we want to extract. We didn't do this around the colon or the spaces because we don't want those characters.

REs use a lot of \ characters. To make this work out nicely in Python, we almost always use raw strings. The r' prefix tells Python not to look at the \ characters and not to replace them with special characters that aren't on our keyboards.

Compile the pattern:

      >>> pattern = re.compile(pattern_text)

Match the pattern against input text. If the input matches the pattern, we'll get a match object that shows details of the matching:

      >>> match = pattern.match(ingredient)
      >>> match is None
      False
      >>> match.groups()
      ('Kumquat', '2', 'cups')

This, by itself, is pretty cool: we have a tuple of the different fields within the string. We'll return to the use of tuples in a recipe named Using tuples.

Extract the named groups of characters from the match object:

      >>> match.group('ingredient')
      'Kumquat'
      >>> match.group('amount')
      '2'
      >>> match.group('unit')
      'cups'

Each group is identified by the name we used in the (?P<name>...) part of the RE.

How it works...

There are a lot of different kinds of string patterns that we can describe with RE.

We've shown a number of character classes:

\w matches any alphanumeric character (a to z, A to Z, 0 to 9)
\d matches any decimal digit
\s matches any space or tab character

These classes also have inverses:

\W matches any character that's not a letter or a digit
\D matches any character that's not a digit
\S matches any character that's not some kind of space or tab

Many characters match themselves. Some characters, however, have special meaning, and we have to use \ to escape from that special meaning:

We saw that + as a suffix means to match one or more of the preceeding patterns. \d+ matches one or more digits. To match an ordinary +, we need to use \+.
We also have * as a suffix which matches zero or more of the preceding patterns. \w* matches zero or more characters. To match a *, we need to use \*.
We have ? as a suffix which matches zero or one of the preceding expressions. This character is used in other places, and has a slightly different meaning. We saw it in (?P<name>...) where it was inside the () to define special properties for the grouping.
The . matches any single character. To match a . specifically, we need to use \.

We can create our own unique sets of characters using [] to enclose the elements of the set. We might have something like this:

    (?P<name>\w+)\s*[=:]\s*(?P<value>.*)

This has a \w+ to match any number of alphanumeric characters. This will be collected into a group with the name of name.

It uses \s* to match an optional sequence of spaces.

It matches any character in the set [=:]. One of the two characters in this set must be present.

It uses \s* again to match an optional sequence of spaces.

Finally, it uses .* to match everything else in the string. This is collected into a group named value.

We can use this to parse strings like this:

    size = 12 
    weight: 14

By being flexible with the punctuation, we can make a program easier to use. We'll tolerate any number of spaces, and either an = or a : as a separator.

There's more...

A long regular expression can be awkward to read. We have a clever Pythonic trick for presenting an expression in a way that's much easier to read:

>>> ingredient_pattern = re.compile(
... r'(?P<ingredient>\w+):\s+' # name of the ingredient up to the ":"
... r'(?P<amount>\d+)\s+'      # amount, all digits up to a space
... r'(?P<unit>\w+)'           # units, alphanumeric characters
... )

This leverages three syntax rules:

A statement isn't finished until the () characters match
Adjacent string literals are silently concatenated into a single long string
Anything between # and the end of the line is a comment, and is ignored

We've put Python comments after the important clauses in our regular expression. This can help us understand what we did, and perhaps help us diagnose problems later.

Building complex strings with "template".format()

Creating complex strings is, in many ways, the polar opposite of parsing a complex string. We generally find that we'll use a template with substitution rules to put data into a more complex format.

Getting ready

Let's say we have pieces of data that we need to turn into a nicely formatted message. We might have data including the following:

>>> id = "IAD"
>>> location = "Dulles Intl Airport"
>>> max_temp = 32
>>> min_temp = 13
>>> precipitation = 0.4

And we'd like a line that looks like this:

IAD : Dulles Intl Airport : 32 / 13 / 0.40

How to do it...

Create a template string from the result, replacing all of the data items with {} placeholders. Inside each placeholder, put the name of the data item.

      '{id} : {location} : {max_temp} / {min_temp} / {precipitation}'

For each data item, append :data type information to the placeholders in the template string. The basic data type codes are:
- s for string
- d for decimal number
- f for floating-point number

It would look like this:

       '{id:s}  : {location:s} : {max_temp:d} / {min_temp:d} / {precipitation:f}'

Add length information where required. Length is not always required, and in some cases, it's not even desirable. In this example, though, the length information assures that each message has a consistent format. For strings and decimal numbers, prefix the format with the length like this: 19s or 3d. For floating-point numbers use a two part prefix like this: 5.2f to specify the total length of five characters with two to the right of the decimal point. Here's the whole format:

      '{id:3d}  : {location:19s} : {max_temp:3d} / {min_temp:3d} / {precipitation:5.2f}'

Use the format() method of this string to create the final string:

      >>> '{id:3s}  : {location:19s} :  {max_temp:3d} / {min_temp:3d} / {precipitation:5.2f}'.format(
      ... id=id, location=location, max_temp=max_temp,
      ... min_temp=min_temp, precipitation=precipitation
      ... )
      'IAD  : Dulles Intl Airport :   32 /  13 /  0.40'

We've provided all of the variables by name in the format() method of the template string. This can get tedious. In some cases, we might want to build a dictionary object with the variables. In that case, we can use the format_map() method:

>>> data = dict(
... id=id, location=location, max_temp=max_temp,
... min_temp=min_temp, precipitation=precipitation
... )
>>> '{id:3s}  : {location:19s} :  {max_temp:3d} / {min_temp:3d} / {precipitation:5.2f}'.format_map(data)
'IAD  : Dulles Intl Airport :   32 /  13 /  0.40'

We'll return to dictionaries in Chapter 4, Build-in Data Structures – list, set, dict.

The built-in vars() function builds a dictionary of all of the local variables for us:

>>> '{id:3s}  : {location:19s} :  {max_temp:3d} / {min_temp:3d} / {precipitation:5.2f}'.format_map(
...    vars()
... )
'IAD  : Dulles Intl Airport :   32 /  13 /  0.40'

The vars() function is very handy for building a dictionary automatically.

How it works...

The string format() and format_map() methods can do a lot of relatively sophisticated string assembly for us.

The basic feature is to interpolate data into a string based on names of keyword arguments or keys in a dictionary. Variables can also be interpolated by position—we can provide position numbers instead of names. We can use a format specification like {0:3s} to use the first positional argument to format().

We've seen three of the formatting conversions—s, d, f—there are many others. Details are in Section 6.1.3 of the Python Standard Library. Here are some of the format conversions we might use:

b is for binary, base 2.
c is for Unicode character. The value must be a number, which is converted to a character. Often, we use hexadecimal numbers for this so you might want to try values such as 0x2661 through 0x2666 for fun.
d is for decimal numbers.
E and e are for scientific notations. 6.626E-34 or 6.626e-34 depending on which E or e character is used.
F and f are for floating-point. For not a number the f format shows lowercase nan; the F format shows uppercase NAN.
G and g are for general. This switches automatically between E and F (or e and f,) to keep the output in the given sized field. For a format of 20.5G, up to 20-digit numbers will be displayed using F formatting. Larger numbers will use E formatting.
n is for locale-specific decimal numbers. This will insert , or . characters depending on the current locale settings. The default locale may not have a thousand separators defined. For more information, see the locale module.
o is for octal, base 8.
s is for string.
X and x is for hexadecimal, base 16. The digits include uppercase A-F and lowercase a-f, depending on which X or x format character is used.
% is for percentage. The number is multiplied by 100 and includes the %.

We have a number of prefixes we can use for these different types. The most common one is the length. We might use {name:5d} to put in a 5-digit number. There are several prefixes for the preceding types:

Fill and alignment: We can specify a specific filler character (space is the default) and an alignment. Numbers are generally aligned to the right and strings to the left. We can change that using <, >, or ^. This forces left alignment, right alignment, or centering. There's a peculiar = alignment that's used to put padding after a leading sign.
Sign: The default rule is a leading negative sign where needed. We can use + to put a sign on all numbers, - to put a sign only on negative numbers, and a space to use a space instead of a plus for positive numbers. In scientific output, we must use {value: 5.3f}. The space makes sure that room is left for the sign, assuring that all the decimal points line up nicely.
Alternate form: We can use the # to get an alternate form. We might have something like {0:#x}, {0:#o}, {0:#b} to get a prefix on hexadecimal, octal, or binary values. With a prefix, the numbers will look like 0xnnn, 0onnn, or 0bnnn. The default is to omit the two character prefix.
Leading zero: We can include 0 to get leading zeros to fill in the front of a number. Something like {code:08x) will produce a hexadecimal value with leading zeroes to pad it out to eight characters.
Width and precision: For integer values and strings, we only provide the width. For floating-point values we often provide width.precision.

There are some times when we won't use a {name:format} specification. Sometimes we'll need to use a {name!conversion} specification. There are only three conversions available.

{name!r} shows the representation that would be produced by repr(name)
{name!s} shows the string value that would be produced by str(name)
{name!a} shows the ASCII value that would be produced by ascii(name)

In Chapter 6, Basics of Classes and Objects, we'll leverage the idea of the {name!r} format specification to simplify displaying information about related objects.

There's more...

A handy debugging hack this:

print("some_variable={some_variable!r}".format_map(vars()))

The vars() function—with no arguments—collects all of the local variables into a mapping. We provide that mapping for format_map(). The format template can use lots of {variable_name!r} to display details about various objects we have in local variables.

Inside a class definition we can use techniques such as vars(self). This looks forward to Chapter 6, Basics of Classes and Objects:

>>> class Summary:
...     def __init__(self, id, location, min_temp, max_temp, precipitation):
...         self.id= id
...         self.location= location
...         self.min_temp= min_temp
...         self.max_temp= max_temp
...         self.precipitation= precipitation
...     def __str__(self):
...         return '{id:3s}  : {location:19s} :  {max_temp:3d} / {min_temp:3d} / {precipitation:5.2f}'.format_map(
...             vars(self)
...         )
>>> s= Summary('IAD', 'Dulles Intl Airport', 13, 32, 0.4)
>>> print(s)
IAD  : Dulles Intl Airport :   32 /  13 /  0.40

Our class definition includes a __str__() method. This method relies on vars(self) to create a useful dictionary of just the attribute of the object.

Building complex strings from lists of characters

How can we make very complex changes to an immutable string? Can we assemble a string from individual characters?

In most cases, the recipes we've already seen give us a number of tools for creating and modifying strings. There are yet more ways in which we can tackle the string manipulation problem. We'll look at using a list object. This will dovetail with some of the recipes in Chapter 4, Built-in Data Structures – list, set, dict.

Getting ready

Here's a string that we'd like to rearrange:

>>> title = "Recipe 5: Rewriting an Immutable String"

We'd like to do two transformations:

Remove the part before the :
Replace the punctuation with _, and make all the characters lowercase

We'll make use of the string module:

>>> from string import whitespace, punctuation

This has two important constants:

string.whitespace lists all of the common whitespace characters, including space and tab
string.punctuation lists the common ASCII punctuation marks. Unicode has a larger list of punctuation marks; that's also available based on your locale settings

How to do it...

We can work with a string exploded into a list. We'll look at lists in more depth in Chapter 4, Built-in Data Structures – list, set, dict.

Explode the string into a list object:

      >>> title_list = list(title)

Find the partition character. The index() method for a list has the same semantics as the index() method for a list. It locates the position with the given value:

      >>> colon_position = title_list.index(':')

Delete the characters no longer needed. The del statement can remove items from a list. Lists are a mutable data structures:

      >>> del title_list[:colon_position+1]

We don't need to carefully work with the useful piece of the original string. We can remove items from a list.

Replace punctuation by stepping through each position. In this case, we'll use a for statement to visit every index in the string:

      >>> for position in range(len(title_list)):
      ...    if title_list[position] in whitespace+punctuation:
      ...        title_list[position]= '_'

The expression range(len(title_list)) generates all of the values between 0 and len(title_list)-1. This assures us that the value of position will be each value index in the list. Join the list of characters to create a new string. It seems a little odd to use zero-length string, '', as a separator when concatenating strings together. However, it works perfectly:

      >>> title = ''.join(title_list)
      >>> title
      '_Rewriting_an_Immutable_String'

We assigned the resulting string back to the original variable. The original string object, which had been referred to by that variable, is no longer needed: it's removed from memory. The new string object replaces the value of the variable.

How it works...

This is a change in representation trick. Since a string is immutable, we can't update it. We can, however, convert it into a mutable form; in this case, a list. We can do whatever changes are required to the mutable list object. When we're done, we can change the representation from a list back to a string.

Strings provide a number of features that lists don't. Conversely, strings provide a number of features a list doesn't have. We can't convert a list to lowercase the way we can convert a string.

There's an important trade-off here:

Strings are immutable, that makes them very fast. Strings are focused on Unicode characters. When we look at mappings and sets, we can use strings as keys for mappings and items in sets because the value is immutable.
Lists are mutable. Operations are slower. Lists can hold any kind of item. We can't use a list as a key for a mapping or an item in a set because the value could change.

Strings and lists are both specialized kinds of sequences. Consequently, they have a number of common features. The basic item indexing and slicing features are shared. Similarly a list uses the same kind of negative index values that a string does: list[-1] is the last item in a list object.

We'll return to mutable data structures in Chapter 4, Built-in Data Structures – list, set, dict.

There's more

Once we've started working with a list of characters instead of a string, we no longer have the string processing methods. We do have a number of list-processing techniques available to us. In addition to being able to delete items from a list, we can append an item, extend a list with another list, and insert a character into the list.

We can also change our viewpoint slightly, and look at a list of strings instead of a list of characters. The technique of doing ''.join(list) will work when we have a list of strings as well as a list of characters. For example, we might do this:

>>> title_list.insert(0, 'prefix')
>>> ''.join(title_list)
'prefix_Rewriting_an_Immutable_String'

Our title_list object will be mutated into a list that contains a six-character string, prefix, plus 30 individual characters.

Using the Unicode characters that aren't on our keyboards

A big keyboard might have almost 100 individual keys. Fewer than 50 of these are letters, numbers and punctuation. At least a dozen are function keys that do things other than simply insert letters into a document. Some of the keys are different kinds of modifiers that are meant to be used in conjunction with another key—we might have Shift, Ctrl, Option, and Command.

Most operating systems will accept simple key combinations that create about 100 or so characters. More elaborate key combinations may create another 100 or so less popular characters. This isn't even close to covering the million characters from the world's alphabets. And there are icons, emoticons, and dingbats galore in our computer fonts. How do we get to all of those glyphs?

Getting ready

Python works in Unicode. There are millions of individual Unicode characters available.

We can see all the available characters at https://en.wikipedia.org/wiki/List_of_Unicode_characters and also http://www.unicode.org/charts/.

We'll need the Unicode character number. We might also want the Unicode character name.

A given font on our computer may not be designed to provide glyphs for all of those characters. In particular, Windows computer fonts may have trouble displaying some of these characters. Using the Windows command to change to code page 65001 is sometimes necessary:

chcp 65001

Linux and Mac OS X rarely have problems with Unicode characters.

How to do it...

Python uses escape sequences to extend the ordinary characters we can type to cover the vast space of Unicode characters. The escape sequences start with a \ character. The next character tells exactly how the Unicode character will be represented. Locate the character that's needed. Get the name or the number. The numbers are always given as hexadecimal, base 16. They're often written as U+2680. The name might be DIE FACE-1. Use \unnnn with up to a four-digit number. Or use \N{name} with the spelled-out name. If the number is more than four digits, use \Unnnnnnnn with the number padded out to eight digits:

Yes, we can include a wide variety of characters in Python output. To place a \ character in the string, we need to use \\. For example, we might need this for Windows filenames.

How it works...

Python uses Unicode internally. The 128 or so characters we can type directly using the keyboard all have handy internal Unicode numbers.

When we write:

'HELLO'

Python treats it as shorthand for this:

'\u0048\u0045\u004c\u004c\u004f'

Once we get beyond the characters on our keyboards, the remaining millions of characters are identified only by their number.

When the string is being compiled by Python, the \uxx, \Uxxxxxxxx, and \N{name} are all replaced by the proper Unicode character. If we have something syntactically wrong—for example, \N{name with no closing }—we'll get an immediate error from Python's internal syntax checking.

Back in the String parsing with regular expressions recipe, we noted that regular expressions use a lot of \ characters and we specifically do not want Python's normal compiler to touch them; we used the r' prefix on a regular expression string to prevent the \ from being treated as an escape and possibly converted to something else.

What if we need to use Unicode in a Regular Expression? We'll need to use \\ all over the place in the Regular Expression. We might see this '\\w+[\u2680\u2681\u2682\u2683\u2684\u2685]\\d+'. We skipped the r' prefix on the string. We doubled up the \ used for Regular Expressions. We used \uxxxx for the Unicode characters that are part of the pattern. Python's internal compiler will replace the \uxxxx with Unicode characters and the \\ with a single \ internally.

When we look at a string at the >>> prompt, Python will display the string in its canonical form. Python prefers to use the ' as a delimiter even though we can use either ' or " for a string delimiter. Python doesn't generally display raw strings, instead it puts all of the necessary escape sequences back into the string: >>> r"\w+" '\\w+' We provided a string in raw form. Python displayed it in canonical form.

Encoding strings – creating ASCII and UTF-8 bytes

Our computer files are bytes. When we upload or download from the Internet, the communication works in bytes. A byte only has 256 distinct values. Our Python characters are Unicode. There are a lot more than 256 Unicode characters.

How do we map Unicode characters to bytes for writing to a file or transmitting?

Getting ready

Historically, a character occupied 1 byte. Python leverages the old ASCII encoding scheme for bytes; this sometimes leads to confusion between bytes and proper strings of Unicode characters.

Unicode characters are encoded into sequences of bytes. We have a number of standardized encodings and a number of non-standard encodings.

Plus, we also have some encodings that only work for a small subset of Unicode characters. We try to avoid this, but there are some situations where we'll need to use a subset encoding scheme.

Unless we have a really good reason, we almost always use the UTF-8 encoding for Unicode characters. Its main advantage is that it's a compact representation for the Latin alphabet used for English and a number of European languages.

Sometimes, an Internet protocol requires ASCII characters. This is a special case that requires some care because the ASCII encoding can only handle a small subset of Unicode characters.

How to do it...

Python will generally use our OS's default encoding for files and Internet traffic. The details are unique to each OS:

We can make a general setting using the PYTHONIOENCODING environment variable. We set this outside of Python to assure that a particular encoding is used everywhere. Set the environment variable as:

      export PYTHONIOENCODING=UTF-8

Run Python:

      python3.5

We sometimes need to make specific settings when we open a file inside our script. We'll return this in Chapter 8, Input/Output, Physical Format, Logical Layout. Open the file with a given encoding. Read or write Unicode characters to the file:

      >>> with open('some_file.txt', 'w', encoding='utf-8') as output:
      ...     print( 'You drew \U0001F000', file=output )
      >>> with open('some_file.txt', 'r', encoding='utf-8') as input:
      ...     text = input.read()
      >>> text
      'You drew �'

We can also manually encode characters, in the rare case that we need to open a file in bytes mode; if we use a mode of wb, we'll need to use manual encoding:

>>> string_bytes = 'You drew \U0001F000'.encode('utf-8')
>>> string_bytes
b'You drew \xf0\x9f\x80\x80'

We can see that a sequence of bytes (\xf0\x9f\x80\x80) was used to encode a single Unicode character, U+1F000, .

How it works...

Unicode defines a number of encoding schemes. While UTF-8 is the most popular, there are also UTF-16 and UTF-32. The number is the typical number of bits per character. A file with 1000 characters encoded in UTF-32 would be 4000 8-bit bytes. A file with 1000 characters encoded in UTF-8 could be as few as 1000 bytes, depending on the exact mix of characters. In the UTF-8 encoding, characters with Unicode numbers above U+007F require multiple bytes.

Various OS's have their own coding schemes. Mac OS X files are often encoded in Mac Roman or Latin-1. Windows files might use CP1252 encoding.

The point with all of these schemes is to have a sequence of bytes that can be mapped to a Unicode character. And—going the other way—a way to map each Unicode character to one or more bytes. Ideally, all of the Unicode characters are accounted for. Pragmatically, some of these coding schemes are incomplete. The tricky part is to avoid writing any more bytes than is necessary.

The historical ASCII encoding can only represent about 250 of the Unicode characters as bytes. It's easy to create a string which cannot be encoded using the ASCII scheme.

Here's what the error looks like:

>>> 'You drew \U0001F000'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f000' in position 9: ordinal not in range(128)

We may see this kind of error when we accidentally open a file with a poorly chosen encoding. When we see this, we'll need to change our processing to select a more useful encoding; ideally, UTF-8.

Bytes vs Strings Bytes are often displayed using printable characters. We'll see b'hello' as a short-hand for a five-byte value. The letters are chosen using the old ASCII encoding scheme. Many byte values from about 0x20 to 0xFE will be shown as characters. This can be confusing. The prefix of b' is our hint that we're looking at bytes, not proper Unicode characters.

Decoding bytes – how to get proper characters from some bytes

How can we work with files that aren't properly encoded? What do we do with files written in the ASCII encoding?

A download from the Internet is almost always in bytes—not characters. How do we decode the characters from that stream of bytes?

Also, when we use the subprocess module, the results of an OS command are in bytes. How can we recover proper characters?

Much of this is also relevant to the material in Chapter 8, Input/Output, Physical Format, Logical Layout. We've included the recipe here because it's the inverse of the previous recipe, Encoding strings – creating ASCII and UTF-8 bytes.

Getting ready

Let's say we're interested in offshore marine weather forecasts. Perhaps because we own a large sailboat. Or perhaps because good friends of ours have a large sailboat and are departing the Chesapeake Bay for the Caribbean.

Are there any special warnings coming from the National Weather Services office in Wakefield, Virginia?

Here's where we can get the warnings: http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ.

We can download this with Python's urllib module:

>>> import urllib.request
>>> warnings_uri= 'http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ'
>>> with urllib.request.urlopen(warnings_uri) as source:
...     warnings_text= source.read()

Or, we can use programs like curl or wget to get this. We might do:

curl -O http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ
mv national.php\?prod\=SMW AKQ.html

Since curl left us with an awkward file name, we needed to rename the file.

The forecast_text value is a stream of bytes. It's not a proper string. We can tell because it starts like this:

>>> warnings_text[:80]
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or'

And goes on for a while providing details. Because it starts with b', it's bytes, not proper Unicode characters. It was probably encoded with UTF-8, which means some characters could have weird-looking \xnn escape sequences instead of proper characters. We want to have the proper characters.

Bytes vs Strings Bytes are often displayed using printable characters. We'll see b'hello' as a short-hand for a five-byte value. The letters are chosen using the old ASCII encoding scheme. Many byte values from about 0x20 to 0xFE will be shown as characters. This can be confusing. The prefix of b' is our hint that we're looking at bytes, not proper Unicode characters.

Generally, bytes behave somewhat like strings. Sometimes we can work with bytes directly. Most of the time, we'll want to decode the bytes and create proper Unicode characters.

How to do it..

.Determine the coding scheme if possible. In order to decode bytes to create proper Unicode characters, we need to know what encoding scheme was used. When we read XML documents, there's a big hint provided within the document:

      <?xml version="1.0" encoding="UTF-8"?>

When browsing web pages, there's often a header with this information:

      Content-Type: text/html; charset=ISO-8859-4

Sometimes an HTML page may include this as part of the header:

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In other cases, we're left to guess. In the case of US Weather data, a good first guess is UTF-8. Other good guesses include ISO-8859-1. In some cases, the guess will depend on the language.

Section 7.2.3, Python Standard Library lists the standard encodings available. Decode the data:

      >>> document = forecast_text.decode("UTF-8")
      >>> document[:80]
      '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or'

The b' prefix is gone. We've created a proper string of Unicode characters from the stream of bytes.

If this step fails with an exception, we guessed wrong about the encoding. We need to try another encoding. Parse the resulting document.

Since this is an HTML document, we should use Beautiful Soup. See http://www.crummy.com/software/BeautifulSoup/.

We can, however, extract one nugget of information from this document without completely parsing the HTML:

>>> import re
>>> title_pattern = re.compile(r"\<h3\>(.*?)\</h3\>")
>>> title_pattern.search( document )
<_sre.SRE_Match object; span=(3438, 3489), match='<h3>There are no products active at this time.</h>

This tells us what we need to know: there are no warnings at this time. That doesn't mean smooth sailing, but it does mean that there aren't any major weather systems that can cause catastrophes.

How it works...

See the Encoding strings – creating ASCII and UTF-8 bytes recipe for more information on Unicode and the different ways that Unicode characters can be encoded into streams of bytes.

At the foundation of the operating system, files and network connections are built up from bytes. It's our software that decodes the bytes to discover the content. It might be characters, or images, or sounds. In some cases, the default assumptions are wrong and we need to do our own decoding.

Using tuples of items

What's the best way to represent simple (x,y) and (r,g,b) groups of values? How can we keep things which are pairs such as latitude and longitude together?

Getting ready

In the String parsing with regular expressions recipe, we skipped over an interesting data structure.

We had data that looked like this:

>>> ingredient = "Kumquat: 2 cups"

We parsed this into the meaningful data using a regular expression like this:

>>> import re
>>> ingredient_pattern = re.compile(r'(?P<ingredient>\w+):\s+(?P<amount>\d+)\s+(?P<unit>\w+)')
>>> match = ingredient_pattern.match( ingredient )
>>> match.groups()
('Kumquat', '2', 'cups')

The result is a tuple object with three pieces of data. There are lots of places where this kind of grouped data come in handy.

How to do it...

We'll look at two aspects to this: putting things into tuples and getting things out of tuples.

Creating tuples

There are lots of places where Python creates tuples of data for us. In the Getting ready section of the String Parsing with Regular Expressions recipe we showed how a regular expression match object will create a tuple of text that was parsed from a string.

We can create our own tuples, too. Here are the steps:

Enclose the data in ().
Separate the items with a ,.

      >>> from fractions import Fraction
      >>> my_data = ('Rice', Fraction(1/4), 'cups')

There's an important special case for the one-tuple, or singleton. We have to include an extra , even when there's only one item in the tuple.

>>> one_tuple = ('item', )
>>> len(one_tuple)
1

The () characters aren't always required. There are a few times where we can omit them. It's not a good idea to omit them, but we can see funny things when we have an extra comma: >>> 355, (355,)

The extra comma after 355 makes the value into a singleton tuple.

Extracting items from a tuple

The idea of a tuple is to be a container with a number of items that's fixed by the problem domain: for example, (red, green, blue) color numbers. The number of items is always three.

In our example, we've got an ingredient, and amount, and units. This must be a three-item collection. We can look at the individual items two ways:

By index position: Positions are numbered starting with zero from the left:

      >>> my_data[1]
      Fraction(1, 4)

Using multiple assignment:

      >>> ingredient, amount, unit = my_data
      >>> ingredient
      'Rice'
      >>> unit
      'cups'

Tuples—like strings—are immutable. We can't change the individual items inside a tuple. We use tuples when we want to keep the data together.

How it works...

Tuples are one example of the more general class of Sequence. We can do a few things with sequences.

Here's an example tuple that we can work with:

>>> t = ('Kumquat', '2', 'cups')

Here are some operations we can perform on this tuple:

How many items in t?

      >>> len(t)
      3

How many times does a particular value appear in t?

      >>> t.count('2')
      1

Which position has a particular value?

      >>> t.index('cups')
      2
      >>> t[2]
      'cups'

When an item doesn't exist, we'll get an exception:

      >>> t.index('Rice')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      ValueError: tuple.index(x): x not in tuple

Does a particular value exist?

      >>> 'Rice' in t
      False

There's more

A tuple, like a string, is a sequence of items. In the case of a string, it's a sequence of characters. In the case of a tuple, it's a sequence of many things. Because they're both sequences, they have some common features. We've noted that we can pluck out individual items by their index position. We can use the index() method to locate the position of an item.

The similarities end there. A string has many methods to create a new string that's a transformation of a string, plus methods to parse strings, plus methods to determine the content of the strings. A tuple doesn't have any of these bonus features. It's—perhaps—the simplest possible data structure.