Python Object-Oriented Programming - Fourth Edition

By Steven F. Lott, Dusty Phillips

About this book

Python Object-Oriented Programming, Fourth Edition dives deep into the various aspects of OOP, Python as an OOP language, common and advanced design patterns, and hands-on data manipulation and testing of more complex OOP systems. These concepts are consolidated by open-ended exercises, as well as a real-world case study at the end of every chapter, newly written for this edition. All example code is now compatible with Python 3.9+ syntax and has been updated with type hints for ease of learning.

Steven and Dusty provide a friendly, comprehensive tour of important OOP concepts, such as inheritance, composition, and polymorphism, and explain how they work together with Python’s classes and data structures to facilitate good design. UML class diagrams are generously used throughout the text for you to understand class relationships. Beyond the book’s focus on OOP, it features an in-depth look at Python’s exception handling and how functional programming intersects with OOP. Not one, but two very powerful automated testing systems, unittest and pytest, are introduced in this book. The final chapter provides a detailed discussion of Python's concurrent programming ecosystem.

By the end of the book, you will have a thorough understanding of how to think about and apply object-oriented principles using Python syntax and be able to confidently create robust and reliable programs.

Publication date: July 2021
Publisher: Packt
Pages: 714
ISBN: 9781801077262

 

Objects in Python

We have a design in hand and are ready to turn that design into a working program! Of course, it doesn't usually happen this way. We'll be seeing examples and hints for good software design throughout the book, but our focus is object-oriented programming. So, let's have a look at the Python syntax that allows us to create object-oriented software.

After completing this chapter, we will understand the following:

  • Python's type hints
  • Creating classes and instantiating objects in Python
  • Organizing classes into packages and modules
  • How to suggest that people don't clobber an object's data, invalidating the internal state
  • Working with third-party packages available from the Python Package Index, PyPI

This chapter will also continue our case study, moving into the design of some of the classes.

 

Introducing type hints

Before we can look closely at creating classes, we need to talk a little bit about what a class is and how we're sure we're using it correctly. The central idea here is that everything in Python is an object.

When we write literal values like "Hello, world!" or 42, we're actually creating instances of built-in classes. We can fire up interactive Python and use the built-in type() function to see the class that defines the properties of these objects:

>>> type("Hello, world!")
<class 'str'>
>>> type(42)
<class 'int'>

The point of object-oriented programming is to solve a problem via the interactions of objects. When we write 6*7, the multiplication of the two objects is handled by a method of the built-in int class. For more complex behaviors, we'll often need to write unique, new classes.

Here are the first two core rules of how Python objects work:

  • Everything in Python is an object
  • Every object is defined by being an instance of at least one class

These rules have many interesting consequences. A class definition we write, using the class statement, creates a new object of class type. When we create an instance of a class, the class object will be used to create and initialize the instance object.
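These consequences can be checked directly. The following is a minimal sketch; the class name Sample is a hypothetical stand-in of ours, not one of the book's examples.

```python
class Sample:
    """A hypothetical, empty class used only for this illustration."""


# The class statement created an object; that object's class is type.
print(type(Sample))           # <class 'type'>

# Calling the class object creates and initializes an instance.
s = Sample()
print(type(s) is Sample)      # True

# Every instance is also an instance of the built-in object class.
print(isinstance(s, object))  # True
```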

What's the distinction between class and type? The class statement lets us define new types. Because the class statement is what we use, we'll call them classes throughout the text. See Python objects, types, classes, and instances - a glossary by Eli Bendersky: https://eli.thegreenplace.net/2012/03/30/python-objects-types-classes-and-instances-a-glossary for this useful quote:

"The terms "class" and "type" are an example of two names referring to the same concept."

We'll follow common usage and call the annotations type hints.

There's another important rule:

  • A variable is a reference to an object. Think of a yellow sticky note with a name scrawled on it, slapped on a thing.

This doesn't seem too earth-shattering but it's actually pretty cool. It means the type information – what an object is – is defined by the class(es) associated with the object. This type information is not attached to the variable in any way. This leads to code like the following being valid but very confusing Python:

>>> a_string_variable = "Hello, world!"
>>> type(a_string_variable)
<class 'str'>
>>> a_string_variable = 42
>>> type(a_string_variable)
<class 'int'>

We created an object using a built-in class, str. We assigned a long name, a_string_variable, to the object. Then, we created an object using a different built-in class, int. We assigned this object the same name. (The previous string object has no more references and ceases to exist.)

Here are the two steps, shown side by side, showing how the variable is moved from object to object:

Figure 2.1: Variable names and objects

The various properties are part of the object, not the variable. When we check the type of a variable with type(), we see the type of the object the variable currently references. The variable doesn't have a type of its own; it's nothing more than a name. Similarly, asking for the id() of a variable shows the ID of the object the variable refers to. So obviously, the name a_string_variable is a bit misleading if we assign the name to an integer object.
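Here is a small sketch of the sticky-note idea, assuming two illustrative names of our own (name and alias). It shows that id() and type() report on the object, whichever name we use to reach it.

```python
name = "Hello, world!"
alias = name  # a second sticky note on the same string object

# Both names refer to one object, so the identities match.
print(id(name) == id(alias))  # True

# Rebinding one name moves only that sticky note; the other name
# still refers to the original string object.
name = 42
print(type(name).__name__, type(alias).__name__)  # int str
```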

Type checking

Let's push the relationship between object and type a step further, and look at some more consequences of these rules. Here's a function definition:

>>> def odd(n):
...     return n % 2 != 0
>>> odd(3)
True
>>> odd(4)
False

This function does a little computation on a parameter variable, n. It computes the remainder after division, the modulo. If we divide an odd number by two, we'll have one left over. If we divide an even number by two, we'll have zero left over. This function returns a true value for all odd numbers.

What happens when we fail to provide a number? Well, let's just try it and see (a common way to learn Python!). Entering code at the interactive prompt, we'll get something like this:

>>> odd("Hello, world!")
Traceback (most recent call last):
  File "<doctestexamples.md[9]>", line 1, in <module>
    odd("Hello, world!")
  File "<doctestexamples.md[6]>", line 2, in odd
    return n % 2 != 0
TypeError: not all arguments converted during string formatting

This is an important consequence of Python's super-flexible rules: nothing prevents us from doing something silly that may raise an exception. This is an important tip:

Python doesn't prevent us from attempting to use non-existent methods of objects.

In our example, the % operator provided by the str class doesn't work the same way as the % operator provided by the int class, raising an exception. For strings, the % operator isn't used very often, but it does interpolation: "a=%d" % 113 computes a string 'a=113'; if there's no format specification like %d on the left side, the exception is a TypeError. For integers, it's the remainder in division: 355 % 113 returns an integer, 16.
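The two meanings of % can be seen side by side in a short sketch. The variable names here are ours, chosen only for illustration.

```python
# For strings, % performs interpolation.
interpolated = "a=%d" % 113
print(interpolated)  # a=113

# For integers, % computes the remainder after division.
remainder = 355 % 113
print(remainder)  # 16

# A string with no format specification rejects the extra operand.
try:
    "Hello, world!" % 2
except TypeError as error:
    message = str(error)
    print(message)  # not all arguments converted during string formatting
```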

This flexibility reflects an explicit trade-off favoring ease of use over sophisticated prevention of potential problems. This allows a person to use a variable name with little mental overhead.

Python's internal operators check that operands meet the requirements of the operator. The function definition we wrote, however, does not include any runtime type checking. Nor do we want to add code for runtime type checking. Instead, we use tools to examine code as part of testing. We can provide annotations, called type hints, and use tools to examine our code for consistency among the type hints.

First, we'll look at the annotations. In a few contexts, we can follow a variable name with a colon, :, and a type name. We can do this in the parameters to functions (and methods). We can also do this in assignment statements. Further, we can also add -> syntax to a function (or a class method) definition to explain the expected return type.

Here's how type hints look:

>>> def odd(n: int) -> bool:
...     return n % 2 != 0

We've added two type hints to our little odd() function definition. We've specified that argument values for the n parameter should be integers. We've also specified that the result will be one of the two values of the Boolean type.

While the hints consume some storage, they have no runtime impact. Python politely ignores these hints; this means they're optional. People reading your code, however, will be more than delighted to see them. They are a great way to inform the reader of your intent. You can omit them while you're learning, but you'll love them when you go back to expand something you wrote earlier.
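A quick sketch shows both halves of this claim: the hints are stored on the function object, and Python does not enforce them when the function is called. The float argument is our own choice for the demonstration.

```python
def odd(n: int) -> bool:
    return n % 2 != 0


# The hints are kept on the function object for tools and readers.
print(odd.__annotations__)
# {'n': <class 'int'>, 'return': <class 'bool'>}

# Nothing checks the hints at run time: a float is accepted even
# though the hint says int.
print(odd(3.0))  # True
```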

The mypy tool is commonly used to check the hints for consistency. It's not built into Python, and requires a separate download and install. We'll talk about virtual environments and installation of tools later in this chapter, in the Third-party libraries section. For now, you can use python -m pip install mypy or conda install mypy if you're using the conda tool.

Let's say we had a file, bad_hints.py, in a src directory, with these two functions and a few lines to call the main() function:

def odd(n: int) -> bool:
    return n % 2 != 0
def main():
    print(odd("Hello, world!"))
if __name__ == "__main__":
    main()

When we run the mypy command at the OS's terminal prompt:

% mypy --strict src/bad_hints.py

The mypy tool is going to spot a bunch of potential problems, including at least these:

src/bad_hints.py:12: error: Function is missing a return type annotation
src/bad_hints.py:12: note: Use "-> None" if function does not return a value
src/bad_hints.py:13: error: Argument 1 to "odd" has incompatible type "str"; expected "int"

The def main(): statement is on line 12 of our example because our file has a pile of comments not shown above. In your version, the line numbers will likely be different. Here are the two problems:

  • The main() function doesn't have a return type; mypy suggests including -> None to make the absence of a return value perfectly explicit.
  • More important is line 13: the code will try to evaluate the odd() function using a str value. This doesn't match the type hint for odd() and indicates another possible error.
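One possible repair for both problems is sketched below. It assumes the intent was to test an integer; the value 42 is our choice, not the book's. We would expect mypy to accept this version, but we have not reproduced its output here.

```python
def odd(n: int) -> bool:
    return n % 2 != 0


def main() -> None:
    # An int argument now matches the hint on odd().
    print(odd(42))


if __name__ == "__main__":
    main()
```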

Most of the examples in this book will have type hints. We think they're always helpful, especially in a pedagogical context, even though they're optional. Because most of Python is generic with respect to type, there are a few cases where Python behavior is difficult to describe via a succinct, expressive hint. We'll steer clear of these edge cases in this book.

Python Enhancement Proposal (PEP) 585 covers some new language features to make type hints a bit simpler. We've used mypy version 0.812 to test all of the examples in this book. Any older version will encounter problems with some of the newer syntax and annotation techniques.
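As a brief illustration of what PEP 585 allows: from Python 3.9 onward, built-in collection classes such as list and dict can be used directly as generic type hints. The function and variable names below are hypothetical.

```python
def odd_values(numbers: list[int]) -> list[int]:
    # list[int] is written with the built-in list class itself.
    return [n for n in numbers if n % 2 != 0]


counts: dict[str, int] = {"odd": len(odd_values([1, 2, 3, 4, 5]))}
print(counts)  # {'odd': 3}
```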

Now that we've talked about how parameters and attributes are described with type hints, let's actually build some classes.

 

Creating Python classes

We don't have to write much Python code to realize that Python is a very clean language. When we want to do something, we can just do it, without having to set up a bunch of prerequisite code. The ubiquitous hello world in Python, as you've likely seen, is only one line.

Similarly, the simplest class in Python 3 looks like this:

class MyFirstClass: 
    pass 

There's our first object-oriented program! The class definition starts with the class keyword. This is followed by a name (of our choice) identifying the class and is terminated with a colon.

The class name must follow standard Python variable naming rules (it must start with a letter or underscore, and can only be composed of letters, underscores, or digits). In addition, the Python style guide (search the web for PEP 8) recommends that classes should be named using what PEP 8 calls CapWords notation (start with a capital letter; any subsequent words should also start with a capital).

The class definition line is followed by the class contents, indented. As with other Python constructs, indentation is used to delimit the classes, rather than braces, keywords, or brackets, as many other languages use. Also, in line with the style guide, use four spaces for indentation unless you have a compelling reason not to (such as fitting in with somebody else's code that uses tabs for indents).

Since our first class doesn't actually add any data or behaviors, we simply use the pass keyword on the second line as a placeholder to indicate that no further action needs to be taken.

We might think there isn't much we can do with this most basic class, but it does allow us to instantiate objects of that class. We can load the class into the Python 3 interpreter, so we can interactively play with it. To do this, save the class definition mentioned earlier in a file named first_class.py and then run the python -i first_class.py command. The -i argument tells Python to run the code and then drop to the interactive interpreter. The following interpreter session demonstrates a basic interaction with this class:

>>> a = MyFirstClass()
>>> b = MyFirstClass()
>>> print(a)
<__main__.MyFirstClass object at 0xb7b7faec>
>>> print(b)
<__main__.MyFirstClass object at 0xb7b7fbac>

This code instantiates two objects from the new class, assigning the objects to the variable names a and b. Creating an instance of a class is a matter of typing the class name, followed by a pair of parentheses. It looks much like a function call; calling a class will create a new object. When printed, the two objects tell us which class they are and what memory address they live at. Memory addresses aren't used much in Python code, but here, they demonstrate that there are two distinct objects involved.

We can see they're distinct objects by using the is operator:

>>> a is b
False

This can help reduce confusion when we've created a bunch of objects and assigned different variable names to the objects.

Adding attributes

Now, we have a basic class, but it's fairly useless. It doesn't contain any data, and it doesn't do anything. What do we have to do to assign an attribute to a given object?

In fact, we don't have to do anything special in the class definition to be able to add attributes. We can set arbitrary attributes on an instantiated object using dot notation. Here's an example:

class Point: 
    pass 
p1 = Point() 
p2 = Point() 
p1.x = 5 
p1.y = 4 
p2.x = 3 
p2.y = 6 
print(p1.x, p1.y) 
print(p2.x, p2.y) 

If we run this code, the two print statements at the end tell us the new attribute values on the two objects:

5 4
3 6

This code creates an empty Point class with no data or behaviors. Then, it creates two instances of that class and assigns each of those instances x and y coordinates to identify a point in two dimensions. All we need to do to assign a value to an attribute on an object is use the <object>.<attribute> = <value> syntax. This is sometimes referred to as dot notation. The value can be anything: a Python primitive, a built-in data type, or another object. It can even be a function or another class!
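To make the last point concrete, here is a sketch that attaches a function and a class to an instance. The describe() function and the attribute names label and kind are hypothetical, invented for this illustration.

```python
class Point:
    pass


def describe() -> str:
    return "a point in two dimensions"


p = Point()
p.x = 5
p.label = "start"
p.describe = describe  # a function stored as an ordinary attribute
p.kind = Point         # the class object itself

# A function stored on the instance is called as-is; no self is passed.
print(p.describe())     # a point in two dimensions
print(p.kind is Point)  # True
print(sorted(vars(p)))  # ['describe', 'kind', 'label', 'x']
```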

Creating attributes like this is confusing to the mypy tool. There's no easy way to include the hints in the Point class definition. We can include hints on the assignment statements, like this: p1.x: float = 5. In general, there's a much, much better approach to type hints and attributes that we'll examine in the Initializing the object section, later in this chapter. First, though, we'll add behaviors to our class definition.

Making it do something

Now, having objects with attributes is great, but object-oriented programming is really about the interaction between objects. We're interested in invoking actions that cause things to happen to those attributes. We have data; now it's time to add behaviors to our classes.

Let's model a couple of actions on our Point class. We can start with a method called reset, which moves the point to the origin (the origin is the place where x and y are both zero). This is a good introductory action because it doesn't require any parameters:

class Point: 
    def reset(self): 
        self.x = 0 
        self.y = 0 
p = Point() 
p.reset() 
print(p.x, p.y) 

This print statement shows us the two zeros on the attributes:

0 0

In Python, a method is formatted identically to a function. It starts with the def keyword, followed by a space, and the name of the method. This is followed by a set of parentheses containing the parameter list (we'll discuss that self parameter, sometimes called the instance variable, in just a moment), and terminated with a colon. The next line is indented to contain the statements inside the method. These statements can be arbitrary Python code operating on the object itself and any parameters passed in, as the method sees fit.

We've omitted type hints in the reset() method because it's not the most widely used place for hints. We'll look at the best place for hints in the Initializing the object section. We'll look a little more at these instance variables, first, and how the self variable works.

Talking to yourself

The one difference, syntactically, between methods of classes and functions outside classes is that methods have one required argument. This argument is conventionally named self; I've never seen a Python programmer use any other name for this variable (convention is a very powerful thing). There's nothing technically stopping you, however, from calling it this or even Martha, but it's best to acknowledge the social pressure of the Python community codified in PEP 8 and stick with self.

The self argument to a method is a reference to the object that the method is being invoked on. The object is an instance of a class, and this is sometimes called the instance variable.

We can access attributes and methods of that object via this variable. This is exactly what we do inside the reset method when we set the x and y attributes of the self object.

Pay attention to the difference between a class and an object in this discussion. We can think of the method as a function attached to a class. The self parameter refers to a specific instance of the class. When you call the method on two different objects, you are calling the same method twice, but passing two different objects as the self parameter.

Notice that when we call the p.reset() method, we do not explicitly pass the self argument into it. Python automatically takes care of this part for us. It knows we're calling a method on the p object, so it automatically passes that object, p, to the method of the class, Point.

For some, it can help to think of a method as a function that happens to be part of a class. Instead of calling the method on the object, we could invoke the function as defined in the class, explicitly passing our object as the self argument:

>>> p = Point() 
>>> Point.reset(p) 
>>> print(p.x, p.y)

The output is the same as in the previous example because, internally, the exact same process has occurred. This is not really a good programming practice, but it can help to cement your understanding of the self argument.
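We can also peek at what p.reset is before it gets called. This sketch relies on the __self__ and __func__ attributes that Python's bound method objects carry; the name bound is ours.

```python
class Point:
    def reset(self):
        self.x = 0
        self.y = 0


p = Point()
bound = p.reset  # not called yet; this is a bound method object

# The bound method remembers the instance and the underlying function.
print(bound.__self__ is p)            # True
print(bound.__func__ is Point.reset)  # True

# Calling it supplies p as the self argument automatically.
bound()
print(p.x, p.y)  # 0 0
```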

What happens if we forget to include the self argument in our class definition? Python will bail with an error message, as follows:

>>> class Point:
...     def reset():
...         pass
...
>>> p = Point()
>>> p.reset()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: reset() takes 0 positional arguments but 1 was given

The error message is not as clear as it could be ("Hey, silly, you forgot to define the method with a self parameter" could be more informative). Just remember that when you see an error message complaining about an unexpected number of arguments, the first thing to check is whether you forgot the self parameter in the method definition.

More arguments

How do we pass multiple arguments to a method? Let's add a new method that allows us to move a point to an arbitrary position, not just to the origin. We can also include a method that accepts another Point object as input and returns the distance between them:

import math
class Point:
    def move(self, x: float, y: float) -> None:
        self.x = x
        self.y = y
    def reset(self) -> None:
        self.move(0, 0)
    def calculate_distance(self, other: "Point") -> float:
        return math.hypot(self.x - other.x, self.y - other.y)

We've defined a class with two attributes, x and y, and three separate methods, move(), reset(), and calculate_distance().

The move() method accepts two arguments, x and y, and sets the values on the self object. The reset() method calls the move() method, since a reset is just a move to a specific known location.

The calculate_distance() method computes the Euclidean distance between two points. (There are a number of other ways to look at distance. In the case study for Chapter 3, When Objects Are Alike, we'll look at some alternatives.) For now, we hope you understand the math. The definition is $\sqrt{(x_s - x_o)^2 + (y_s - y_o)^2}$, where the subscripts s and o stand for self and other; this is what the math.hypot() function computes. In Python we'll use self.x, but mathematicians often prefer to write $x_s$.
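The Euclidean distance is the square root of the sum of the squared coordinate differences. As a small check, the sketch below compares that spelled-out computation with math.hypot() for one pair of points; the coordinates and variable names are ours.

```python
import math

x_s, y_s = 3, 4  # coordinates of self
x_o, y_o = 5, 0  # coordinates of other

# Square root of the sum of squared differences, written out by hand.
by_formula = math.sqrt((x_s - x_o) ** 2 + (y_s - y_o) ** 2)

# The same distance, computed by math.hypot().
by_hypot = math.hypot(x_s - x_o, y_s - y_o)

print(math.isclose(by_formula, by_hypot))  # True
```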

Here's an example of using this class definition. This shows how to call a method with arguments: include the arguments inside the parentheses and use the same dot notation to access the method name within the instance. We just picked some random positions to test the methods. The test code calls each method and prints the results on the console:

>>> point1 = Point()
>>> point2 = Point()
>>> point1.reset()
>>> point2.move(5, 0)
>>> print(point2.calculate_distance(point1))
5.0
>>> assert point2.calculate_distance(point1) == point1.calculate_distance(
...    point2
... )
>>> point1.move(3, 4)
>>> print(point1.calculate_distance(point2))
4.47213595499958
>>> print(point1.calculate_distance(point1))
0.0

The assert statement is a marvelous test tool; the program will bail if the expression after assert evaluates to False (or zero, empty, or None). In this case, we use it to ensure that the distance is the same regardless of which point called the other point's calculate_distance() method. We'll see a lot more use of assert in Chapter 13, Testing Object-Oriented Programs, where we'll write more rigorous tests.
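To see what "bail" means here, this sketch deliberately asserts something false and catches the resulting AssertionError. The optional message after the comma is our own wording.

```python
import math

try:
    # This expression is False, so assert raises AssertionError.
    assert math.hypot(3, 4) == 6.0, "distance should be 5.0, not 6.0"
except AssertionError as error:
    message = str(error)
    print(f"assertion failed: {message}")
```

Note that Python skips assert statements entirely when it is run with the -O option, so they should not be relied on for essential processing.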

Initializing the object

If we don't explicitly set the x and y positions on our Point object, either using move or by accessing them directly, we'll have a broken Point object with no real position. What will happen when we try to access it?

Well, let's just try it and see. Try it and see is an extremely useful tool for Python study. Open up your interactive interpreter and type away. (Using the interactive prompt is, after all, one of the tools we used to write this book.)

The following interactive session shows what happens if we try to access a missing attribute. If you saved the previous example as a file or are using the examples distributed with the book, you can load it into the Python interpreter with the python -i more_arguments.py command:

>>> point = Point()
>>> point.x = 5
>>> print(point.x)
5
>>> print(point.y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Point' object has no attribute 'y'

Well, at least it threw a useful exception. We'll cover exceptions in detail in Chapter 4, Expecting the Unexpected. You've probably seen them before (especially the ubiquitous SyntaxError, which means you typed something incorrectly!). At this point, simply be aware that it means something went wrong.

The output is useful for debugging. In the interactive interpreter, it tells us the error occurred at line 1, which is only partially true (in an interactive session, only one statement is executed at a time). If we were running a script in a file, it would tell us the exact line number, making it easy to find the offending code. In addition, it tells us that the error is an AttributeError, and gives a helpful message telling us what that error means.

We can catch and recover from this error, but in this case, it feels like we should have specified some sort of default value. Perhaps every new object should be reset() by default, or maybe it would be nice if we could force the user to tell us what those positions should be when they create the object.
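For completeness, here is a sketch of what catching and recovering might look like, using an empty Point class and a fallback value of 0 that we picked for illustration. It is shown only to make the alternative concrete, not as a recommendation.

```python
class Point:
    pass


point = Point()
point.x = 5

# Recover from the missing attribute by falling back to a default.
try:
    y = point.y
except AttributeError as error:
    print(error)  # 'Point' object has no attribute 'y'
    y = 0

# The built-in getattr() accepts a default and avoids the exception.
y_default = getattr(point, "y", 0)
print(y, y_default)  # 0 0
```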

Interestingly, mypy can't determine whether y is supposed to be an attribute of a Point object. Attributes are – by definition – dynamic, so there's no simple list that's part of a class definition. However, Python has some widely followed conventions that can help name the expected set of attributes.

Most object-oriented programming languages have the concept of a constructor, a special method that creates and initializes the object when it is created. Python is a little different; it has a constructor and an initializer. The constructor method, __new__(), is rarely used unless you're doing something very exotic. So, we'll start our discussion with the much more common initialization method, __init__().

The Python initialization method is the same as any other method, except it has a special name, __init__. The leading and trailing double underscores mean this is a special method that the Python interpreter will treat as a special case.
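The division of labor between the two special methods can be observed with a sketch like the following. The Traced class and its calls list are hypothetical, written only to record the order in which Python invokes __new__() and __init__().

```python
class Traced:
    # A class-level list, shared by all instances, recording the calls.
    calls: list[str] = []

    def __new__(cls, *args, **kwargs):
        cls.calls.append("__new__")
        # Only the class is passed along; object.__new__ creates the
        # bare instance.
        return super().__new__(cls)

    def __init__(self, value: int) -> None:
        self.calls.append("__init__")
        self.value = value


t = Traced(42)
print(Traced.calls)  # ['__new__', '__init__']
print(t.value)       # 42
```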

Never name a method of your own with leading and trailing double underscores. It may mean nothing to Python today, but there's always the possibility that the designers of Python will add a function that has a special purpose with that name in the future. When they do, your code will break.

Let's add an initialization function on our Point class that requires the user to supply x and y coordinates when the Point object is instantiated:

class Point:
    def __init__(self, x: float, y: float) -> None:
        self.move(x, y)
    def move(self, x: float, y: float) -> None:
        self.x = x
        self.y = y
    def reset(self) -> None:
        self.move(0, 0)
    def calculate_distance(self, other: "Point") -> float:
        return math.hypot(self.x - other.x, self.y - other.y)

Constructing a Point instance now looks like this:

point = Point(3, 5) 
print(point.x, point.y) 

Now, our Point object can never go without both x and y coordinates! If we try to construct a Point instance without including the proper initialization parameters, it will fail with a not enough arguments error similar to the one we received earlier when we forgot the self argument in a method definition.
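A sketch of that failure follows, using a pared-down Point that sets the attributes directly instead of calling move(). The exact wording of the message varies between Python versions, so we only show the exception type.

```python
class Point:
    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y


try:
    Point()  # both required arguments are missing
except TypeError as error:
    failure = error
    print(type(error).__name__)  # TypeError
```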

Most of the time, we put our initialization statements in an __init__() function. It's very important to be sure that all of the attributes are initialized in the __init__() method. Doing this helps the mypy tool by providing all of the attributes in one obvious place. It helps people reading your code, also; it saves them from having to read the whole application to find mysterious attributes set outside the class definition.

While they're optional, it's generally helpful to include type annotations on the method parameters and result values. After each parameter name, we've included the expected type of each value. At the end of the definition, we've included the two-character -> operator and the type returned by the method.

Type hints and defaults

As we've noted a few times now, hints are optional. They don't do anything at runtime. There are tools, however, that can examine the hints to check for consistency. The mypy tool is widely used to check type hints.

If we don't want to make the two arguments required, we can use the same syntax Python functions use to provide default arguments. The syntax appends an equals sign and a default value after each parameter's name and hint. If the caller does not provide an argument, then the default value is used instead. The parameters will still be available to the function, but they will have the values specified in the parameter list. Here's an example:

class Point:
    def __init__(self, x: float = 0, y: float = 0) -> None:
        self.move(x, y)

The definitions for the individual parameters can get long, leading to very long lines of code. In some examples, you'll see this single logical line of code expanded to multiple physical lines. This relies on the way Python joins physical lines until an opening parenthesis is matched by its closing parenthesis. We might write this when the line gets long:

class Point:
    def __init__(
        self, 
        x: float = 0, 
        y: float = 0
    ) -> None:
        self.move(x, y)

This style isn't used very often, but it's valid and keeps the lines shorter and easier to read.
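With defaults in place, a caller can supply both coordinates, one, or none. The sketch below uses a simplified Point that sets the attributes directly instead of calling move(); the variable names are ours.

```python
class Point:
    def __init__(self, x: float = 0, y: float = 0) -> None:
        self.x = x
        self.y = y


origin = Point()        # both defaults are used
on_x_axis = Point(3)    # x is given positionally; y takes its default
on_y_axis = Point(y=4)  # y is given by keyword; x takes its default

print(origin.x, origin.y)        # 0 0
print(on_x_axis.x, on_x_axis.y)  # 3 0
print(on_y_axis.x, on_y_axis.y)  # 0 4
```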

The type hints and defaults are handy, but there's even more we can do to provide a class that's easy to use and easy to extend when new requirements arise. We'll add documentation in the form of docstrings.

Explaining yourself with docstrings

Python can be an extremely easy-to-read programming language; some might say it is self-documenting. However, when carrying out object-oriented programming, it is important to write API documentation that clearly summarizes what each object and method does. Keeping documentation up to date is difficult; the best way to do it is to write it right into our code.

Python supports this through the use of docstrings. Each class, function, or method definition can have a standard Python string as the first indented line after the header (the line that ends in a colon).

Docstrings are Python strings enclosed within apostrophes (') or quotation marks ("). Often, docstrings are quite long and span multiple lines (PEP 8 suggests limiting docstring lines to 72 characters); these can be formatted as multi-line strings, enclosed in matching triple apostrophe (''') or triple quote (""") characters.

A docstring should clearly and concisely summarize the purpose of the class or method it is describing. It should explain any parameters whose usage is not immediately obvious, and is also a good place to include short examples of how to use the API. Any caveats or problems an unsuspecting user of the API should be aware of should also be noted.

One of the best things to include in a docstring is a concrete example. Tools like doctest can locate and confirm these examples are correct. All the examples in this book are checked with the doctest tool.

To illustrate the use of docstrings, we will end this section with our completely documented Point class:

import math

class Point:
    """
    Represents a point in two-dimensional geometric coordinates

    >>> p_0 = Point()
    >>> p_1 = Point(3, 4)
    >>> p_0.calculate_distance(p_1)
    5.0
    """

    def __init__(self, x: float = 0, y: float = 0) -> None:
        """
        Initialize the position of a new point. The x and y
        coordinates can be specified. If they are not, the
        point defaults to the origin.

        :param x: float x-coordinate
        :param y: float y-coordinate
        """
        self.move(x, y)

    def move(self, x: float, y: float) -> None:
        """
        Move the point to a new location in 2D space.

        :param x: float x-coordinate
        :param y: float y-coordinate
        """
        self.x = x
        self.y = y

    def reset(self) -> None:
        """
        Reset the point back to the geometric origin: 0, 0
        """
        self.move(0, 0)

    def calculate_distance(self, other: "Point") -> float:
        """
        Calculate the Euclidean distance from this point
        to a second point passed as a parameter.

        :param other: Point instance
        :return: float distance
        """
        return math.hypot(self.x - other.x, self.y - other.y)

Try typing or loading this file into the interactive interpreter (remember, it's python -i point_2.py). Then, enter help(Point)<enter> at the Python prompt.

You should see nicely formatted documentation for the class, as shown in the following output:

Help on class Point in module point_2:
class Point(builtins.object)
 |  Point(x: float = 0, y: float = 0) -> None
 |  
 |  Represents a point in two-dimensional geometric coordinates
 |  
 |  >>> p_0 = Point()
 |  >>> p_1 = Point(3, 4)
 |  >>> p_0.calculate_distance(p_1)
 |  5.0
 |  
 |  Methods defined here:
 |  
 |  __init__(self, x: float = 0, y: float = 0) -> None
 |      Initialize the position of a new point. The x and y
 |      coordinates can be specified. If they are not, the
 |      point defaults to the origin.
 |      
 |      :param x: float x-coordinate
 |      :param y: float y-coordinate
 |  
 |  calculate_distance(self, other: 'Point') -> float
 |      Calculate the Euclidean distance from this point
 |      to a second point passed as a parameter.
 |      
 |      :param other: Point instance
 |      :return: float distance
 |  
 |  move(self, x: float, y: float) -> None
 |      Move the point to a new location in 2D space.
 |      
 |      :param x: float x-coordinate
 |      :param y: float y-coordinate
 |  
 |  reset(self) -> None
 |      Reset the point back to the geometric origin: 0, 0
 |  
 |  ----------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Not only is our documentation every bit as polished as the documentation for built-in functions, but we can run python -m doctest point_2.py to confirm the example shown in the docstring.
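The doctest machinery can also be driven from inside Python. The following is a sketch under our own assumptions: the midpoint() function is a hypothetical stand-in, and we parse and run only its docstring rather than a whole file.

```python
import doctest


def midpoint(a: float, b: float) -> float:
    """
    Return the value halfway between a and b.

    >>> midpoint(2, 4)
    3.0
    """
    return (a + b) / 2


# Extract the examples from the docstring and run them, much as
# "python -m doctest" does for every docstring in a file.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    midpoint.__doc__, {"midpoint": midpoint}, "midpoint", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)
```

A failed count of zero means every example in the docstring produced exactly the output shown.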

Further, we can run mypy to check the type hints. Use mypy --strict src/*.py to check all of the files in the src folder. If there are no problems, the mypy application doesn't produce any output. (Remember, mypy is not part of the standard installation, so you'll need to add it. Check the preface for information on extra packages that need to be installed.)

 

Modules and packages

Now we know how to create classes and instantiate objects. You don't need to write too many classes (or non-object-oriented code, for that matter) before you start to lose track of them. For small programs, we generally put all our classes into one file and add a little script at the end of the file to start them interacting. However, as our projects grow, it can become difficult to find the one class that needs to be edited among the many classes we've defined. This is where modules come in. Modules are Python files, nothing more. The single file in our small program is a module. Two Python files are two modules. If we have two files in the same folder, we can load a class from one module for use in the other module.

The Python module name is the file's stem, that is, the name without the .py suffix. A file named model.py is a module named model. Module files are found by searching a path that includes the local directory and the installed packages.

The import statement is used for importing modules or specific classes or functions from modules. We've already seen an example of this in our Point class in the previous section. We used the import statement to get Python's built-in math module and use its hypot() function in the distance calculation. Let's start with a fresh example.

If we are building an e-commerce system, we will likely be storing a lot of data in a database. We can put all the classes and functions related to database access into a separate file (we'll call it something sensible: database.py). Then, our other modules (for example, customer models, product information, and inventory) can import classes from the database module in order to access the database.

Let's start with a module called database. It's a file, database.py, containing a class called Database. A second module called products is responsible for product-related queries. The classes in the products module need to instantiate the Database class from the database module so that they can execute queries on the product table in the database.

There are several variations on the import statement syntax that can be used to access the Database class. One variant is to import the module as a whole:

>>> import database
>>> db = database.Database("path/to/data")

This version imports the database module, creating a database namespace. Any class or function in the database module can be accessed using the database.<something> notation.

Alternatively, we can import just the one class we need using the from...import syntax:

>>> from database import Database
>>> db = Database("path/to/data")

This version imported only the Database class from the database module. When we have a few items from a few modules, this can be a helpful simplification to avoid using longer, fully qualified names like database.Database. When we import a number of items from a number of different modules, this can be a potential source of confusion when we omit the qualifiers.

If, for some reason, products already has a class called Database, and we don't want the two names to be confused, we can rename the class when used inside the products module:

>>> from database import Database as DB
>>> db = DB("path/to/data")

We can also import multiple items in one statement. If our database module also contains a Query class, we can import both classes using the following code:

from database import Database, Query
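Because the database module is only imagined at this point, here is a sketch of the same import variants using the standard library's math module as a stand-in. The alias euclidean is our own choice for illustration.

```python
# Variant 1: import the module; names are qualified.
import math

# Variant 2: import one name directly.
from math import hypot

# Variant 3: import a name under an alias to avoid a clash.
from math import hypot as euclidean

# Variant 4: import several names in one statement.
from math import floor, ceil

# The first three spellings refer to the same function object.
same_function = math.hypot is hypot is euclidean
distance = euclidean(3, 4)
bounds = (floor(2.5), ceil(2.5))
print(same_function, distance, bounds)
```

Whichever form we choose, only one copy of the module is loaded; the variants differ only in the names they create in our namespace.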

We can import all classes and functions from the database module using this syntax:

from database import * 

Don't do this. Most experienced Python programmers will tell you that you should never use this syntax (a few will tell you there are some very specific situations where it is useful, but we can disagree). One way to learn why to avoid this syntax is to use it and try to understand your code two years later. We can save some time and two years of poorly written code with a quick explanation now!

We've got several reasons for avoiding this:

  • When we explicitly import the Database class at the top of our file using from database import Database, we can easily see where the Database class comes from. We might use db = Database() 400 lines later in the file, and we can quickly look at the imports to see where that Database class came from. Then, if we need clarification as to how to use the Database class, we can visit the original file (or import the module in the interactive interpreter and use the help(database.Database) command). However, if we use the from database import * syntax, it takes a lot longer to find where that class is located. Code maintenance becomes a nightmare.
  • If there are conflicting names, we're doomed. Let's say we have two modules, both of which provide a class named Database. Using from module_1 import * and from module_2 import * means the second import statement overwrites the Database name created by the first import. If we used import module_1 and import module_2, we'd use the module names as qualifiers to disambiguate module_1.Database from module_2.Database.
  • In addition, most code editors are able to provide extra functionality, such as reliable code completion, the ability to jump to the definition of a class, or inline documentation, if normal imports are used. The import * syntax can hamper their ability to do this reliably.
  • Finally, using the import * syntax can bring unexpected objects into our local namespace. Sure, it will import all the classes and functions defined in the module being imported from, but unless a special __all__ list is provided in the module, this import will also import any classes or modules that were themselves imported into that file!
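The last point in the list above can be sketched with two throwaway in-memory modules. The module names leaky_db and tidy_db are hypothetical, and building modules with types.ModuleType is only a convenience for this demonstration; real modules would be ordinary files.

```python
import sys
import types

MODULE_SOURCE = """
import math

class Database:
    pass
"""


def make_module(name: str, source: str) -> types.ModuleType:
    """Build an in-memory module and register it so it can be imported."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


# Two copies of the same module; only the second defines __all__.
make_module("leaky_db", MODULE_SOURCE)
make_module("tidy_db", MODULE_SOURCE + '\n__all__ = ["Database"]\n')

leaky_names: dict = {}
exec("from leaky_db import *", leaky_names)

tidy_names: dict = {}
exec("from tidy_db import *", tidy_names)

# Without __all__, the module's own import of math leaks through.
print("math" in leaky_names, "math" in tidy_names)
```

Both namespaces receive Database, but only the module without an __all__ list also hands over the math module it happened to import for its own use.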

Every name used in a module should come from a well-specified place, whether it is defined in that module, or explicitly imported from another module. There should be no magic variables that seem to come out of thin air. We should always be able to immediately identify where the names in our current namespace originated. We promise that if you use this evil syntax, you will one day have extremely frustrating moments of "where on earth can this class be coming from?"

For fun, try typing import this into your interactive interpreter. It prints a nice poem (with a couple of inside jokes) summarizing some of the idioms that Pythonistas tend to practice. Specific to this discussion, note the line "Explicit is better than implicit." Explicitly importing names into your namespace makes your code much easier to navigate than the implicit from module import * syntax.

Organizing modules

As a project grows into a collection of more and more modules, we may find that we want to add another level of abstraction, some kind of nested hierarchy on our modules' levels. However, we can't put modules inside modules; one file can hold only one file after all, and modules are just files.

Files, however, can go in folders, and so can modules. A package is a collection of modules in a folder. The name of the package is the name of the folder. We need to tell Python that a folder is a package to distinguish it from other folders in the directory. To do this, place a (normally empty) file named __init__.py in the folder. If we forget this file, the folder is not a regular package; Python 3.3 and later can still import from it as a namespace package, but that is rarely what we intend, and some tools may not find our modules.

Let's put our modules inside an ecommerce package in our working folder, which will also contain a main.py file to start the program. Let's additionally add another package inside the ecommerce package for various payment options.

We need to exercise some caution in creating deeply nested packages. The general advice in the Python community is "flat is better than nested." In this example, we need to create a nested package because there are some common features to all of the various payment alternatives.

The folder hierarchy will look like this, rooted under a directory in the project folder, commonly named src:

src/
 +-- main.py
 +-- ecommerce/
     +-- __init__.py
     +-- database.py
     +-- products.py
     +-- payments/
     |   +-- __init__.py
     |   +-- common.py
     |   +-- square.py
     |   +-- stripe.py
     +-- contact/
         +-- __init__.py
         +-- email.py

The src directory will be part of an overall project directory. In addition to src, the project will often have directories with names like docs and tests. It's common for the project parent directory to also have configuration files for tools like mypy among others. We'll return to this in Chapter 13, Testing Object-Oriented Programs.

When importing modules or classes between packages, we have to be cautious about the structure of our packages. In Python 3, there are two ways of importing modules: absolute imports and relative imports. We'll look at each of them separately.

Absolute imports

Absolute imports specify the complete path to the module, function, or class we want to import. If we need access to the Product class inside the products module, we could use any of these syntaxes to perform an absolute import:

>>> import ecommerce.products
>>> product = ecommerce.products.Product("name1") 

Or, we could specifically import a single class definition from the module within a package:

>>> from ecommerce.products import Product 
>>> product = Product("name2") 

Or, we could import an entire module from the containing package:

>>> from ecommerce import products 
>>> product = products.Product("name3") 

The import statements use the period operator to separate packages or modules. A package is a namespace that contains module names, much in the way an object is a namespace containing attribute names.

These statements will work from any module. We could instantiate a Product class using this syntax in main.py, in the database module, or in any of the payments modules. Indeed, assuming the packages are available to Python, it will be able to import them. For example, the packages can also be installed in the Python site-packages folder, or the PYTHONPATH environment variable could be set to tell Python which folders to search for packages and modules it is going to import.
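To try the absolute import forms without building the whole project, we can sketch a tiny version of the layout in a temporary folder. The body of the Product class and the use of a temporary folder are assumptions made only for this demonstration; adding the folder to sys.path plays the role that the src directory or PYTHONPATH plays in a real project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a tiny stand-in for the src/ecommerce layout.
src = Path(tempfile.mkdtemp())
package = src / "ecommerce"
package.mkdir()
(package / "__init__.py").write_text("")
(package / "products.py").write_text(
    "class Product:\n"
    "    def __init__(self, name: str) -> None:\n"
    "        self.name = name\n"
)

# Forget any previously imported package with the same name, then
# put our src folder at the front of the module search path.
for name in [n for n in sys.modules if n.split(".")[0] == "ecommerce"]:
    del sys.modules[name]
sys.path.insert(0, str(src))
importlib.invalidate_caches()

import ecommerce.products
from ecommerce.products import Product
from ecommerce import products

# All three spellings reach the same class object.
same_class = ecommerce.products.Product is Product is products.Product
product = Product("name1")
print(same_class, product.name)
```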

With these choices, which syntax do we choose? It depends on your audience and the application at hand. If there are dozens of classes and functions inside the products module that we want to use, we'd generally import the module name using the from ecommerce import products syntax, and then access the individual classes using products.Product. If we only need one or two classes from the products module, we can import them directly using the from ecommerce.products import Product syntax. It's important to write whatever makes the code easiest for others to read and extend.

Relative imports

When working with related modules inside a deeply nested package, it seems kind of redundant to specify the full path; we know what our parent module is named. This is where relative imports come in. Relative imports identify a class, function, or module as it is positioned relative to the current module. They only make sense inside module files, and, further, they only make sense where there's a complex package structure.

For example, if we are working in the products module and we want to import the Database class from the database module next to it, we could use a relative import:

from .database import Database 

The period in front of database says use the database module inside the current package. In this case, the current package is the package containing the products.py file we are currently editing, that is, the ecommerce package.

If we were editing the stripe module inside the ecommerce.payments package, we would want, for example, to use the database module inside the parent package instead. This is easily done with two periods, as shown here:

from ..database import Database 

We can use more periods to go further up the hierarchy, but at some point, we have to acknowledge that we have too many packages. Of course, we can also go down one side and back up the other. The following would be a valid import from the ecommerce.contact package containing an email module if we wanted to import the send_mail function into our payments.stripe module:

from ..contact.email import send_mail

This import uses two periods to indicate the parent of the payments package, and then uses the normal package.module syntax to go back down into the contact package to name the email module.
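The relative imports described here can be exercised with a throwaway copy of the package tree. This is a sketch: the file contents, including the body of send_mail(), are hypothetical placeholders written to a temporary folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Minimal file contents; only the import lines matter here.
FILES = {
    "ecommerce/__init__.py": "",
    "ecommerce/database.py": "class Database:\n    pass\n",
    "ecommerce/products.py": "from .database import Database\n",
    "ecommerce/contact/__init__.py": "",
    "ecommerce/contact/email.py": (
        "def send_mail(to: str) -> str:\n"
        "    return f'mail queued for {to}'\n"
    ),
    "ecommerce/payments/__init__.py": "",
    "ecommerce/payments/stripe.py": (
        "from ..database import Database\n"
        "from ..contact.email import send_mail\n"
    ),
}

src = Path(tempfile.mkdtemp())
for relative_path, content in FILES.items():
    target = src / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)

# Forget any earlier ecommerce package, then expose the new one.
for name in [n for n in sys.modules if n.split(".")[0] == "ecommerce"]:
    del sys.modules[name]
sys.path.insert(0, str(src))
importlib.invalidate_caches()

from ecommerce import database, products
from ecommerce.payments import stripe

# One dot and two dots both resolve to the same Database class.
same_class = products.Database is stripe.Database is database.Database
message = stripe.send_mail("customer@example.com")
print(same_class, message)
```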

Relative imports aren't as useful as they might seem. As mentioned earlier, the Zen of Python (you can read it when you run import this) suggests "flat is better than nested". Python's standard library is relatively flat, with few packages and even fewer nested packages. If you're familiar with Java, you'll know its packages are deeply nested, something the Python community likes to avoid. Relative imports solve a specific problem, where module names are reused among packages, so they can be helpful in a few cases. Needing more than two dots to locate a common parent-of-a-parent package suggests the design should be flattened out.

Packages as a whole

We can import code that appears to come directly from a package, as opposed to a module inside a package. As we'll see, there's a module involved, but it has a special name, so it's hidden. In this example, we have an ecommerce package containing two module files named database.py and products.py. The database module contains a db variable that is accessed from a lot of places. Wouldn't it be convenient if this could be imported as from ecommerce import db instead of from ecommerce.database import db?

Remember the __init__.py file that defines a directory as a package? This file can contain any variable or class declarations we like, and they will be available as part of the package. In our example, if the ecommerce/__init__.py file contained the following line:

from .database import db

We could then access the db attribute from main.py or any other file using the following import:

from ecommerce import db

It might help to think of the ecommerce/__init__.py file as if it were the ecommerce.py file. It lets us view the ecommerce package as having a module protocol as well as a package protocol. This can also be useful if you put all your code in a single module and later decide to break it up into a package of modules. The __init__.py file for the new package can still be the main point of contact for other modules using it, but the code can be internally organized into several different modules or subpackages.
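Here is a sketch of the re-export idea, again using a temporary folder. The one-line Database class is a hypothetical placeholder; the point is only that __init__.py makes db importable from the package itself.

```python
import importlib
import sys
import tempfile
from pathlib import Path

FILES = {
    # The package's __init__.py re-exports db from its submodule.
    "ecommerce/__init__.py": "from .database import db\n",
    "ecommerce/database.py": (
        "class Database:\n"
        "    pass\n"
        "\n"
        "db = Database()\n"
    ),
}

src = Path(tempfile.mkdtemp())
for relative_path, content in FILES.items():
    target = src / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)

# Forget any earlier ecommerce package, then expose the new one.
for name in [n for n in sys.modules if n.split(".")[0] == "ecommerce"]:
    del sys.modules[name]
sys.path.insert(0, str(src))
importlib.invalidate_caches()

from ecommerce import db
import ecommerce.database

# The short and the long import paths name the same object.
same_object = db is ecommerce.database.db
print(same_object, type(db).__name__)
```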

We recommend not putting much code in an __init__.py file, though. Programmers do not expect actual logic to happen in this file, and much like with from x import *, it can trip them up if they are looking for the declaration of a particular piece of code and can't find it until they check __init__.py.

After looking at modules in general, let's dive into what should be inside a module. The rules are flexible (unlike other languages). If you're familiar with Java, you'll see that Python gives you some freedom to bundle things in a way that's meaningful and informative.

Organizing our code in modules

The Python module is an important focus. Every application or web service has at least one module. Even a seemingly "simple" Python script is a module. Inside any one module, we can specify variables, classes, or functions. Module-level variables can be a handy way to store global state without namespace conflicts. For example, we have been importing the Database class into various modules and then instantiating it, but it might make more sense to have only one database object globally available from the database module. The database module might look like this:

from typing import Optional
class Database:
    """The Database Implementation"""
    def __init__(self, connection: Optional[str] = None) -> None:
        """Create a connection to a database."""
        pass
database = Database("path/to/data")

Then we can use any of the import methods we've discussed to access the database object, for example:

from ecommerce.database import database 

A problem with the preceding module is that the database object is created immediately when the module is first imported, which is usually when the program starts up. This isn't always ideal, since connecting to a database can take a while, slowing down startup, or the database connection information may not yet be available because we need to read a configuration file. We could delay creating the database until it is actually needed by calling an initialize_database() function to create a module-level variable:

db: Optional[Database] = None
def initialize_database(connection: Optional[str] = None) -> None:
    global db
    db = Database(connection)

The Optional[Database] type hint signals to the mypy tool that this may be None or it may have an instance of the Database class. The Optional hint is defined in the typing module. This hint can be handy elsewhere in our application to make sure we confirm that the value for the db variable is not None.

The global keyword tells Python that the db variable inside initialize_database() is the module-level variable, outside the function. If we had not specified the variable as global, Python would have created a new local variable that would be discarded when the function exits, leaving the module-level value unchanged.

We need to make one additional change. We need to import the database module as a whole. We can't import the db object from inside the module; it might not have been initialized. We need to be sure database.initialize_database() is called before db will have a meaningful value. If we wanted direct access to the database object, we'd use database.db.
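The reason for importing the module as a whole can be sketched with an in-memory module. The module name database_demo and the connection attribute are hypothetical, added only so that we have something to inspect.

```python
import sys
import types

DATABASE_SOURCE = '''
from typing import Optional

class Database:
    def __init__(self, connection: Optional[str] = None) -> None:
        self.connection = connection

db: Optional[Database] = None

def initialize_database(connection: Optional[str] = None) -> None:
    global db
    db = Database(connection)
'''

# Register a throwaway in-memory module under a hypothetical name.
database_demo = types.ModuleType("database_demo")
exec(DATABASE_SOURCE, database_demo.__dict__)
sys.modules["database_demo"] = database_demo

# Importing the name copies its current value, which is still None.
from database_demo import db as early_db

import database_demo
database_demo.initialize_database("path/to/data")

# The module attribute was rebound; our earlier copy was not.
print(early_db, database_demo.db.connection)
```

The name bound by from ... import is a snapshot taken at import time, so it never sees the later assignment inside the module. Reading database_demo.db looks the value up each time it is used.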

A common alternative is a function that returns the current database object. We could import this function everywhere we needed access to the database:

def get_database(connection: Optional[str] = None) -> Database:
    global db
    if db is None:
        db = Database(connection)
    return db
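A self-contained sketch of how such a function behaves when called more than once follows. The Database body here is a hypothetical stand-in that simply remembers its connection string.

```python
from typing import Optional


class Database:
    """Stand-in for the Database class; the body is hypothetical."""

    def __init__(self, connection: Optional[str] = None) -> None:
        self.connection = connection


db: Optional[Database] = None


def get_database(connection: Optional[str] = None) -> Database:
    global db
    if db is None:
        # First call: create the one shared object.
        db = Database(connection)
    return db


first = get_database("path/to/data")
second = get_database()  # The object already exists; no new one is made.
print(first is second, second.connection)
```

One consequence worth noticing is that only the first call's connection argument has any effect; later calls return the existing object whatever they pass in.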

As these examples illustrate, all module-level code is executed immediately at the time it is imported. The class and def statements create code objects to be executed later when the function is called. This can be a tricky thing for scripts that perform execution, such as the main script in our e-commerce example. Sometimes, we write a program that does something useful, and then later find that we want to import a function or class from that module into a different program. However, as soon as we import it, any code at the module level is immediately executed. If we are not careful, we can end up running the first program when we really only meant to access a couple of functions inside that module.

To solve this, we should always put our startup code in a function (conventionally, called main()) and only execute that function when we know we are running the module as a script, but not when our code is being imported from a different script. We can do this by guarding the call to main inside a conditional statement, demonstrated as follows:

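import math
class Point:
    """
    Represents a point in two-dimensional geometric coordinates.
    """
    # Abbreviated; the fully documented version appears earlier.
    def __init__(self, x: float = 0, y: float = 0) -> None:
        self.x = x
        self.y = y
    def calculate_distance(self, other: "Point") -> float:
        return math.hypot(self.x - other.x, self.y - other.y)
def main() -> None:
    """
    Does the useful work.
    >>> main()
    p1.calculate_distance(p2)=5.0
    """
    p1 = Point()
    p2 = Point(3, 4)
    print(f"{p1.calculate_distance(p2)=}")
if __name__ == "__main__":
    main()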

The Point class (and the main() function) can be reused without worry. We can import the contents of this module without any surprising processing happening. When we run it as a main program, however, it executes the main() function.

This works because every module has a __name__ special variable (remember, Python uses double underscores for special variables, such as a class's __init__ method) that specifies the name of the module when it is imported. When the module is executed directly with python module.py, it is never imported, so __name__ is set to the string "__main__" instead.

Make it a policy to wrap all your scripts in an if __name__ == "__main__": test, just in case you write a function that you may want to be imported by other code at some point in the future.
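We can watch the guard at work with the standard library's runpy module, which lets us choose the value a script sees in __name__. This is a sketch: the script contents and the file name guarded.py are hypothetical, and run_path() only approximates what the python command and the import statement do.

```python
import runpy
import tempfile
from pathlib import Path

SCRIPT = '''
events = []

def main() -> None:
    events.append("main() ran")

events.append(f"loaded as {__name__}")
if __name__ == "__main__":
    main()
'''

script_path = Path(tempfile.mkdtemp()) / "guarded.py"
script_path.write_text(SCRIPT)

# run_name controls the value the script sees in __name__.
as_script = runpy.run_path(str(script_path), run_name="__main__")
as_import = runpy.run_path(str(script_path), run_name="guarded")

print(as_script["events"])
print(as_import["events"])
```

In both runs the module-level code executes, but main() is called only when the name is "__main__".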

So, methods go in classes, which go in modules, which go in packages. Is that all there is to it?

Actually, no. This is the typical order of things in a Python program, but it's not the only possible layout. Classes can be defined anywhere. They are typically defined at the module level, but they can also be defined inside a function or method, like this:

from typing import Optional
class Formatter:
    def format(self, string: str) -> str:
        pass
def format_string(string: str, formatter: Optional[Formatter] = None) -> str:
    """
    Format a string using the formatter object, which
    is expected to have a format() method that accepts
    a string.
    """
    class DefaultFormatter(Formatter):
        """Format a string in title case."""
        def format(self, string: str) -> str:
            return str(string).title()
    if not formatter:
        formatter = DefaultFormatter()
    return formatter.format(string)

We've defined a Formatter class as an abstraction to explain what a formatter class needs to have. We haven't used the abstract base class (abc) definitions (we'll look at these in detail in Chapter 6, Abstract Base Classes and Operator Overloading). Instead, we've provided the method with no useful body. It has a full suite of type hints, to make sure mypy has a formal definition of our intent.

Within the format_string() function, we created an internal class that is an extension of the Formatter class. This formalizes the expectation that our class inside the function has a specific set of methods. This connection between the definition of the Formatter class, the formatter parameter, and the concrete definition of the DefaultFormatter class assures that we haven't accidentally forgotten something or added something.

We can execute this function like this:

>>> hello_string = "hello world, how are you today?"
>>> print(f" input: {hello_string}")
 input: hello world, how are you today?
>>> print(f"output: {format_string(hello_string)}")
output: Hello World, How Are You Today?

The format_string function accepts a string and optional Formatter object and then applies the formatter to that string. If no Formatter instance is supplied, it creates a formatter of its own as a local class and instantiates it. Since it is created inside the scope of the function, this class cannot be accessed from anywhere outside of that function. Similarly, functions can be defined inside other functions as well; in general, any Python statement can be executed at any time.
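To see both paths through format_string(), here is a sketch that also supplies a formatter of our own. The ShoutFormatter class is hypothetical, and we have the base class method raise NotImplementedError, which is a small departure from the empty body shown above.

```python
from typing import Optional


class Formatter:
    def format(self, string: str) -> str:
        raise NotImplementedError


def format_string(string: str, formatter: Optional[Formatter] = None) -> str:
    class DefaultFormatter(Formatter):
        """Format a string in title case."""

        def format(self, string: str) -> str:
            return str(string).title()

    if not formatter:
        formatter = DefaultFormatter()
    return formatter.format(string)


class ShoutFormatter(Formatter):
    """Hypothetical alternative formatter: upper-case everything."""

    def format(self, string: str) -> str:
        return string.upper()


default_result = format_string("hello world")
custom_result = format_string("hello world", ShoutFormatter())

# The class defined inside the function is not visible out here.
try:
    DefaultFormatter
    local_class_visible = True
except NameError:
    local_class_visible = False

print(default_result, custom_result, local_class_visible)
```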

These inner classes and functions are occasionally useful for one-off items that don't require or deserve their own scope at the module level, or only make sense inside a single method. However, it is not common to see Python code that frequently uses this technique.

We've seen how to create classes and how to create modules. With these core techniques, we can start thinking about writing useful, helpful software to solve problems. When the application or service gets big, though, we often have boundary issues. We need to be sure that objects respect each other's privacy and avoid confusing entanglements that make complex software into a spaghetti bowl of interrelationships. We'd prefer each class to be a nicely encapsulated ravioli. Let's look at another aspect of organizing our software to create a good design.

 

Who can access my data?

Most object-oriented programming languages have a concept of access control. This is related to abstraction. Some attributes and methods on an object are marked private, meaning only that object can access them. Others are marked protected, meaning only that class and any subclasses have access. The rest are public, meaning any other object is allowed to access them.

Python doesn't do this. Python doesn't really believe in enforcing laws that might someday get in your way. Instead, it provides unenforced guidelines and best practices. Technically, all methods and attributes on a class are publicly available. If we want to suggest that a method should not be used publicly, we can put a note in docstrings indicating that the method is meant for internal use only (preferably, with an explanation of how the public-facing API works!).

We often remind each other of this by saying "We're all adults here." There's no need to declare a variable as private when we can all see the source code.

By convention, we generally prefix an internal attribute or method with an underscore character, _. Python programmers will understand a leading underscore name to mean "this is an internal variable, think three times before accessing it directly." But there is nothing inside the interpreter to stop them from accessing it if they think it is in their best interest to do so. Because, if they think so, why should we stop them? We may not have any idea what future uses our classes might be put to. On the other hand, an internal name may be removed in a future release, so the leading underscore is a pretty clear warning sign to avoid using it.

There's another thing you can do to strongly suggest that outside objects don't access a property or method: prefix it with a double underscore, __. This will perform name mangling on the attribute in question. In essence, name mangling means that the method can still be called by outside objects if they really want to do so, but it requires extra work and is a strong indicator that you demand that your attribute remains private.

When we use a double underscore, the name is prefixed with _<classname>. When methods in the class internally access the variable, the same mangling is applied automatically, so they can keep using the short name. When external classes wish to access it, they have to do the name mangling themselves. So, name mangling does not guarantee privacy; it only strongly recommends it. This is very rarely used, and often a source of confusion when it is used.
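A short sketch makes the two conventions concrete. The Vault class and its attribute names are hypothetical and exist only to show the naming rules.

```python
class Vault:
    """Hypothetical class used only to illustrate naming conventions."""

    def __init__(self) -> None:
        self._hint = "internal by convention"
        self.__code = "mangled by the compiler"

    def reveal(self) -> str:
        # Inside the class, __code is rewritten to _Vault__code for us.
        return self.__code


vault = Vault()
print(vault._hint)               # Allowed, but discouraged.
print(vault.reveal())
print(vault._Vault__code)        # Outside code must mangle by hand.
print(hasattr(vault, "__code"))  # The unmangled name does not exist.
print(sorted(vars(vault)))
```

The last line shows the names actually stored on the object: _Vault__code and _hint. Nothing is hidden; the double underscore only changes the spelling.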

Don't create new double-underscore names in your own code; it will only cause grief and heartache. Consider this reserved for Python's internally defined special names.

What's important is that encapsulation – as a design principle – assures that the methods of a class encapsulate the state changes for the attributes. Whether or not attributes (or methods) are private doesn't change the essential good design that flows from encapsulation.

The encapsulation principle applies to individual classes as well as a module with a bunch of classes. It also applies to a package with a bunch of modules. As designers of object-oriented Python, we're isolating responsibilities and clearly encapsulating features.

And, of course, we're using Python to solve problems. It turns out there's a huge standard library available to help us create useful software. The vast standard library is why we describe Python as a "batteries included" language. Right out of the box, you have almost everything you need, no running to the store to buy batteries.

Outside the standard library, there's an even larger universe of third-party packages. In the next section, we'll look at how we extend our Python installation with even more ready-made goodness.

 

Third-party libraries

Python ships with a lovely standard library, which is a collection of packages and modules that are available on every machine that runs Python. However, you'll soon find that it doesn't contain everything you need. When this happens, you have two options:

  • Write a supporting package yourself
  • Use somebody else's code

We won't be covering the details about turning your packages into libraries, but if you have a problem you need to solve and you don't feel like coding it (the best programmers are extremely lazy and prefer to reuse existing, proven code, rather than write their own), you can probably find the library you want on the Python Package Index (PyPI) at http://pypi.python.org/. Once you've identified a package that you want to install, you can use a tool called pip to install it.

You can install packages using an operating system command such as the following:

% python -m pip install mypy

If you try this without making any preparation, you'll either be installing the third-party library directly into your system Python directory, or, more likely, will get an error that you don't have permission to update the system Python.

The common consensus in the Python community is that you don't touch any Python that's part of the OS. Older Mac OS X releases had Python 2.7 installed, but it was not really intended for end users. It's best to think of it as part of the OS, ignore it, and always install a fresh, new Python.

Python ships with a tool called venv, a utility that gives you a Python installation called a virtual environment in your working directory. When you activate this environment, commands related to Python will work with your virtual environment's Python instead of the system Python. So, when you run pip or python, it won't touch the system Python at all. Here's how to use it:

cd project_directory
python -m venv env
source env/bin/activate    # on Linux or macOS
env\Scripts\activate.bat   # on Windows

(For other OSes, see https://docs.python.org/3/library/venv.html, which has all the variations required to activate the environment.)

Once the virtual environment is activated, you are assured that python -m pip will install new packages into the virtual environment, leaving any OS Python alone. You can now use the python -m pip install mypy command to add the mypy tool to your current virtual environment.
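If we are ever unsure which Python is running, a few lines can tell us. This is a small sketch; the helper name in_virtual_environment() is our own, and the check relies on the documented difference between sys.prefix and sys.base_prefix for environments created by venv.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when this interpreter runs inside a venv."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print(f"interpreter: {sys.executable}")
print(f"virtual environment active: {in_virtual_environment()}")
```

Running this before and after activating the environment should show the interpreter path change from the system Python to the one inside the env directory.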

On a home computer – where you have access to the privileged files – you can sometimes get away with installing and working with a single, centralized system-wide Python. In an enterprise computing environment, where system-wide directories require special privileges, a virtual environment is required. Because the virtual environment approach always works, and the centralized system-level approach doesn't always work, it's generally a best practice to create and use virtual environments.

It's typical to create a different virtual environment for each Python project. You can store your virtual environments anywhere, but a good practice is to keep them in the same directory as the rest of the project files. When working with version control tools like Git, the .gitignore file can make sure your virtual environments are not checked into the Git repository.

When starting something new, we often create the directory, and then cd into that directory. Then, we'll run the python -m venv env utility to create a virtual environment, usually with a simple name like env, and sometimes with a more complex name like CaseStudy39.

Finally, we can use one of the last two lines in the preceding code (depending on the operating system, as indicated in the comments) to activate the environment.

Each time we do some work on a project, we can cd to the directory and execute the source (or activate.bat) line to use that particular virtual environment. When switching projects, a deactivate command unwinds the environment setup.

Virtual environments are essential for keeping your third-party dependencies separate from Python's standard library. It is common to have different projects that depend on different versions of a particular library (for example, an older website might run on Django 1.8, while newer versions run on Django 2.1). Keeping each project in separate virtual environments makes it easy to work in either version of Django. Furthermore, it prevents conflicts between system-installed packages and pip-installed packages if you try to install the same package using different tools. Finally, it bypasses any OS permission restrictions surrounding the OS Python.

There are several third-party tools for managing virtual environments effectively. Some of these include virtualenv, pyenv, virtualenvwrapper, and conda. If you're working in a data science environment, you'll probably need to use conda so you can install more complex packages. These tools emphasize different features, which is why there are so many approaches to managing the huge Python ecosystem of third-party packages.

 

Case study

This section expands on the object-oriented design of our realistic example. We'll start with the diagrams created using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.

We'll describe the various considerations that are part of the Python implementation of the class definitions. We'll start with a review of the diagrams that describe the classes to be defined.

Logical view

Here's the overview of the classes we need to build. This is (except for one new method) the previous chapter's model:

Figure 2.2: Logical view diagram

Four classes define our core data model, along with some uses of the generic list class, which we've shown using the type hint of List. Here are the four central classes:

  • The TrainingData class is a container with two lists of data samples, a list used for training our model and a list used for testing our model. Both lists are composed of KnownSample instances. We'll also have a list of alternative Hyperparameter values. In general, these are tuning values that change the behavior of the model. The idea is to test with different hyperparameters to locate the highest-quality model.

    We've also allocated a little bit of metadata to this class: the name of the data we're working with, the datetime of when we uploaded the data the first time, and the datetime of when we ran a test against the model.

  • Each instance of the Sample class is the core piece of working data. In our example, these are measurements of sepal lengths and widths and petal lengths and widths. Steady-handed botany graduate students carefully measured lots and lots of flowers to gather this data. We hope that they had time to stop and smell the roses while they were working.
  • A KnownSample object is an extended Sample. This part of the design foreshadows the focus of Chapter 3, When Objects Are Alike. A KnownSample is a Sample with one extra attribute, the assigned species. This information comes from skilled botanists who have classified some data we can use for training and testing.
  • The Hyperparameter class has the k used to define how many of the nearest neighbors to consider. It also has a summary of testing with this value of k. The quality tells us how many of the test samples were correctly classified. We expect to see that small values of k (like 1 or 3) don't classify well. We expect middle values of k to do better, and very large values of k to not do as well.

The KnownSample class on the diagram may not need to be a separate class definition. As we work through the details, we'll look at some alternative designs for each of these classes.

We'll start with the Sample (and KnownSample) classes. Python offers three essential paths for defining a new class:

  • class definition; we'll focus on this to start.
  • @dataclass definition. This provides a number of built-in features. While it's handy, it's not ideal for programmers who are new to Python, because it can obscure some implementation details. We'll set this aside for Chapter 7, Python Data Structures.
  • An extension to the typing.NamedTuple class. The most notable feature of this definition will be that the state of the object is immutable; the attribute values cannot be changed. Unchanging attributes can be a useful feature for making sure a bug in the application doesn't mess with the training data. We'll set this aside for Chapter 7, also.

Our first design decision is to use Python's class statement to write a class definition for Sample and its subclass KnownSample. This may be replaced in the future (i.e., Chapter 7) with alternatives that use data classes as well as NamedTuple.
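
To make the three paths concrete, here is a small side-by-side sketch. It uses a hypothetical two-attribute point rather than the case study's classes, so the names PointClass, PointData, and PointTuple are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import NamedTuple


class PointClass:
    """Path 1: a plain class statement with an explicit __init__()."""

    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y


@dataclass
class PointData:
    """Path 2: @dataclass writes __init__(), __repr__(), and __eq__() for us."""

    x: float
    y: float


class PointTuple(NamedTuple):
    """Path 3: a typing.NamedTuple subclass; instances are immutable."""

    x: float
    y: float


p1 = PointClass(1.0, 2.0)
p1.x = 3.0  # ordinary attributes can be reassigned

p2 = PointData(1.0, 2.0)
print(p2)  # the generated __repr__() shows the class name and the fields

p3 = PointTuple(1.0, 2.0)
try:
    p3.x = 3.0  # type: ignore[misc]
except AttributeError as error:
    print(f"NamedTuple attributes cannot be changed: {error}")
```

The plain class is the most explicit of the three, which is why it's a reasonable place to start.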

Samples and their states

The diagram in Figure 2.2 shows the Sample class and an extension, the KnownSample class. This doesn't seem to be a complete decomposition of the various kinds of samples. When we review the user stories and the process views, there seems to be a gap: specifically, the "make classification request" by a User requires an unknown sample. This has the same flower measurement attributes as a Sample, but doesn't have the assigned species attribute of a KnownSample. Further, there's no state change that adds an attribute value. The unknown sample will never be formally classified by a Botanist; it will be classified by our algorithm, but it's only an AI, not a Botanist.

We can make a case for two distinct subclasses of Sample:

  • UnknownSample: This class contains the initial four Sample attributes. A User provides these objects to get them classified. 
  • KnownSample: This class has the Sample attributes plus the classification result, a species name. We use these for training and testing the model.

Generally, we consider class definitions as a way to encapsulate state and behavior. An UnknownSample instance provided by a user starts out with no species. Then, after the classifier algorithm computes a species, the Sample changes state to have a species assigned by the algorithm.

A question we must always ask about class definitions is this:

Is there any change in behavior that goes with the change in state?

In this case, it doesn't seem like there's anything new or different that can happen. Perhaps this can be implemented as a single class with some optional attributes.

We have another possible state change concern. Currently, there's no class that owns the responsibility of partitioning Sample objects into the training or testing subsets. This, too, is a kind of state change.

This leads to a second important question: 

What class has responsibility for making this state change?

In this case, it seems like the TrainingData class should own the discrimination between testing and training data.

One way to help look closely at our class design is to enumerate all of the various states of individual samples. This technique helps uncover a need for attributes in the classes. It also helps to identify the methods to make state changes to objects of a class.

Sample state transitions

Let's look at the life cycles of Sample objects. An object's life cycle starts with object creation, then state changes, and (in some cases) the end of its processing life when there are no more references to it. We have three scenarios:

  1. Initial load: We'll need a load() method to populate a TrainingData object from some source of raw data. We'll preview some of the material in Chapter 9, Strings, Serialization, and File Paths, by saying that reading a CSV file often produces a sequence of dictionaries. We can imagine a load() method using a CSV reader to create Sample objects with a species value, making them KnownSample objects. The load() method splits the KnownSample objects into the training and testing lists, which is an important state change for a TrainingData object.
  2. Hyperparameter testing: We'll need a test() method in the Hyperparameter class. The body of the test() method works with the test samples in the associated TrainingData object. For each sample, it applies the classifier and counts the matches between Botanist-assigned species and the best guess of our AI algorithm. This points out the need for a classify() method for a single sample that's used by the test() method for a batch of samples. The test() method will update the state of the Hyperparameter object by setting the quality score.
  3. User-initiated classification: A RESTful web application is often decomposed into separate view functions to handle requests. When handling a request to classify an unknown sample, the view function will have a Hyperparameter object used for classification; this will be chosen by the Botanist to produce the best results. The user input will be an UnknownSample instance. The view function applies the Hyperparameter.classify() method to create a response to the user with the species the iris has been classed as. Does the state change that happens when the AI classifies an UnknownSample really matter? Here are two views:
    • Each UnknownSample can have a classified attribute. Setting this is a change in the state of the Sample. It's not clear that there's any behavior change associated with this state change. 
    • The classification result is not part of the Sample at all. It's a local variable in the view function. This state change in the function is used to respond to the user, but has no life within the Sample object.

There's a key concept underlying this detailed decomposition of these alternatives:

There's no "right" answer.

Some design decisions are based on non-functional and non-technical considerations. These might include the longevity of the application, future use cases, additional users who might be enticed, current schedules and budgets, pedagogical value, technical risk, the creation of intellectual property, and how cool the demo will look in a conference call.

In Chapter 1, Object-Oriented Design, we dropped a hint that this application is the precursor to a consumer product recommender. We noted: "The users eventually want to tackle complex consumer products, but recognize that solving a difficult problem is not a good way to learn how to build this kind of application. It's better to start with something of a manageable level of complexity and then refine and expand it until it does everything they need."

Because of that, we'll consider a change in state from UnknownSample to ClassifiedSample to be very important. The Sample objects will live in a database for additional marketing campaigns or possibly reclassification when new products are available and the training data changes.

We'll decide to keep the classification and the species data in the UnknownSample class.

This analysis suggests we can coalesce all the various Sample details into the following design:

Figure 2.3: The updated UML diagram

This view uses the open arrowhead to show a number of subclasses of Sample. We won't directly implement these as subclasses. We've included the arrows to show that we have some distinct use cases for these objects. Specifically, the box for KnownSample has a condition species is not None to summarize what's unique about these Sample objects. Similarly, the UnknownSample has a condition, species is None, to clarify our intent around Sample objects with the species attribute value of None.

In these UML diagrams, we have generally avoided showing Python's "special" methods. This helps to minimize visual clutter. In some cases, a special method may be absolutely essential, and worthy of showing in a diagram. An implementation almost always needs to have an __init__() method.

There's another special method that can really help: the  __repr__() method is used to create a representation of the object. This representation is a string that generally has the syntax of a Python expression to rebuild the object. For simple numbers, it's the number. For a simple string, it will include the quotes. For more complex objects, it will have all the necessary Python punctuation, including all the details of the class and state of the object. We'll often use an f-string with the class name and the attribute values.
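
As a small illustration of these rules, the following sketch shows repr() applied to a number, a string, and an instance of a hypothetical Measurement class (not part of the case study) whose __repr__() is built with an f-string.

```python
class Measurement:
    """A hypothetical class used only to illustrate __repr__()."""

    def __init__(self, name: str, value: float) -> None:
        self.name = name
        self.value = value

    def __repr__(self) -> str:
        # !r applies repr() to each attribute, so strings keep their quotes.
        return (
            f"{self.__class__.__name__}("
            f"name={self.name!r}, value={self.value!r})"
        )


print(repr(42))        # 42
print(repr("setosa"))  # 'setosa'
m = Measurement("petal_length", 1.4)
print(repr(m))         # Measurement(name='petal_length', value=1.4)

# Because the text is a valid Python expression, eval() can rebuild the object.
rebuilt = eval(repr(m))
print(rebuilt.name, rebuilt.value)
```

The !r conversion is what keeps the quotes around string attributes, so the output can be pasted back into Python.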

Here's the start of a class, Sample, which seems to capture all the features of a single sample: 

from typing import Optional

class Sample:
    def __init__(
        self,
        sepal_length: float,
        sepal_width: float,
        petal_length: float,
        petal_width: float,
        species: Optional[str] = None,
    ) -> None:
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width
        self.petal_length = petal_length
        self.petal_width = petal_width
        self.species = species
        self.classification: Optional[str] = None

    def __repr__(self) -> str:
        # The displayed name depends on whether a species was assigned.
        if self.species is None:
            known_unknown = "UnknownSample"
        else:
            known_unknown = "KnownSample"
        # The classification is shown only after it has been set.
        if self.classification is None:
            classification = ""
        else:
            classification = f", classification={self.classification!r}"
        return (
            f"{known_unknown}("
            f"sepal_length={self.sepal_length}, "
            f"sepal_width={self.sepal_width}, "
            f"petal_length={self.petal_length}, "
            f"petal_width={self.petal_width}, "
            f"species={self.species!r}"
            f"{classification}"
            f")"
        )

The __repr__() method reflects the fairly complex internal state of this Sample object. The states implied by the presence (or absence) of a species and the presence (or absence) of a classification lead to small behavior changes. So far, any changes in object behavior are limited to the __repr__() method used to display the current state of the object. 

What's important is that the state changes do lead to a (tiny) behavioral change.

We have two application-specific methods for the Sample class. These are shown in the next code snippet:

    def classify(self, classification: str) -> None:
        self.classification = classification
    def matches(self) -> bool:
        return self.species == self.classification

The classify() method defines the state change from unclassified to classified. The matches() method compares the results of classification with a Botanist-assigned species. This is used for testing.

Here's an example of how these state changes can look:

>>> from model import Sample
>>> s2 = Sample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')
>>> s2.classification = "wrong"
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa', classification='wrong')

We have a workable definition of the Sample class. The __repr__() method is quite complex, suggesting there may be some improvements possible.
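
The two application-specific methods can be exercised in a similar way. This sketch uses a condensed stand-in for Sample that omits the four measurements, purely to keep the example short; the classify() and matches() bodies are the ones shown earlier.

```python
from typing import Optional


class Sample:
    """A condensed stand-in for the Sample class; measurements are omitted."""

    def __init__(self, species: Optional[str] = None) -> None:
        self.species = species
        self.classification: Optional[str] = None

    def classify(self, classification: str) -> None:
        self.classification = classification

    def matches(self) -> bool:
        return self.species == self.classification


known = Sample(species="Iris-setosa")
print(known.matches())    # False: nothing has been classified yet

known.classify("Iris-virginica")
print(known.matches())    # False: the classifier disagreed with the Botanist

known.classify("Iris-setosa")
print(known.matches())    # True: the classification agrees with the species

unknown = Sample()
print(unknown.matches())  # True, because None == None
```

The last line shows that matches() is only meaningful for samples with a Botanist-assigned species. For an unclassified unknown sample, both attributes are None and the comparison is trivially true.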

It can help to define responsibilities for each class. This can be a focused summary of the attributes and methods with a little bit of additional rationale to tie them together.

Class responsibilities

Which class is responsible for actually performing a test? Does the TrainingData class invoke the classifier on each KnownSample in a testing set? Or, perhaps, does it provide the testing set to the Hyperparameter class, delegating the testing to the Hyperparameter class? Since the Hyperparameter class has responsibility for the k value, and the algorithm for locating the k-nearest neighbors, it seems sensible for the Hyperparameter class to run the test using its own k value and a list of KnownSample instances provided to it.

It also seems clear the TrainingData class is an acceptable place to record the various Hyperparameter trials. This means the TrainingData class can identify which of the Hyperparameter instances has a value of k that classifies irises with the highest accuracy.

There are multiple, related state changes here. In this case, both the Hyperparameter and TrainingData classes will do part of the work. The system – as a whole – will change state as individual elements change state. This is sometimes described as emergent behavior. Rather than writing a monster class that does many things, we've written smaller classes that collaborate to achieve the expected goals.

This test() method of TrainingData is something that we didn't show in the UML image. We included test() in the Hyperparameter class, but, at the time, it didn't seem necessary to add it to TrainingData.

Here's the start of the class definition:

import weakref

class Hyperparameter:
    """A hyperparameter value and the overall quality of the classification."""

    def __init__(self, k: int, training: "TrainingData") -> None:
        self.k = k
        # A weak reference avoids a strong circular reference to TrainingData.
        self.data: weakref.ReferenceType["TrainingData"] = weakref.ref(training)
        self.quality: float

Note how we write type hints for classes not yet defined. When a class is defined later in the file, any reference to the yet-to-be-defined class is a forward reference. The forward references to the not-yet-defined TrainingData class are provided as strings, not the simple class name. When mypy is analyzing the code, it resolves the strings into proper class names.
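
Here is a minimal sketch of a forward reference, using two hypothetical classes (Tuner and Owner) in place of the case study's classes. The typing.get_type_hints() function performs a similar string-to-class resolution at runtime, which makes it possible to see the effect without running mypy.

```python
from typing import List, get_type_hints


class Tuner:
    """A hypothetical class that refers to a class defined later."""

    def __init__(self, owner: "Owner") -> None:
        # "Owner" is a string because the Owner class does not exist yet.
        self.owner = owner


class Owner:
    def __init__(self) -> None:
        # Tuner is already defined at this point, so no quotes are needed.
        self.tuners: List[Tuner] = []


# The raw annotation is stored as the string that was written.
print(Tuner.__init__.__annotations__["owner"])

# Once both classes are defined, the string can be resolved to the class.
hints = get_type_hints(Tuner.__init__)
print(hints["owner"] is Owner)
```

Without the quotes, the def statement in Tuner would raise a NameError when the module is first imported, because Owner is not yet bound to anything.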

The testing is defined by the following method:

    def test(self) -> None:
        """Run the entire test suite."""
        training_data: Optional["TrainingData"] = self.data()
        if not training_data:
            raise RuntimeError("Broken Weak Reference")
        pass_count, fail_count = 0, 0
        for sample in training_data.testing:
            sample.classification = self.classify(sample)
            if sample.matches():
                pass_count += 1
            else:
                fail_count += 1
        self.quality = pass_count / (pass_count + fail_count)

We start by resolving the weak reference to the training data. This will raise an exception if there's a problem. For each testing sample, we classify the sample, setting the sample's classification attribute. The matches method tells us if the model's classification matches the known species. Finally, the overall quality is measured by the fraction of tests that passed. We can use the integer count, or a floating-point ratio of tests passed out of the total number of tests.
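
The behavior of a weak reference is easy to see in isolation. The following sketch uses a hypothetical Dataset class as a stand-in for the referenced object. Calling the reference returns the object while it's alive, and None once no strong references remain, which is the condition the test() method guards against.

```python
import gc
import weakref
from typing import Optional


class Dataset:
    """A hypothetical stand-in for the object being referred to."""

    def __init__(self, name: str) -> None:
        self.name = name


data = Dataset("iris")
ref = weakref.ref(data)

# Calling the reference returns the object while it is still alive.
alive: Optional[Dataset] = ref()
print(alive is data)

# Drop every strong reference; the weak reference does not keep it alive.
del alive
del data
gc.collect()  # not needed in CPython here, but harmless on other runtimes
print(ref() is None)
```

Both print() calls are expected to show True. In the case study, a broken reference would mean the TrainingData object was discarded while a Hyperparameter still pointed at it.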

We won't look at the classification method in this chapter; we'll save that for Chapter 10, The Iterator Pattern. Instead, we'll finish this model by looking at the TrainingData class, which combines the elements seen so far.

The TrainingData class

The TrainingData class has lists with two subclasses of Sample objects. The KnownSample and UnknownSample can be implemented as extensions to a common parent class, Sample.

We'll look at this from a number of perspectives in Chapter 7. The TrainingData class also has a list with Hyperparameter instances. This class can have simple, direct references to previously defined classes.

This class has the two methods that initiate the processing:

  • The load() method reads raw data and partitions it into training data and test data. Both of these are essentially KnownSample instances with different purposes. The training subset is for evaluating the k-NN algorithm; the testing subset is for determining how well the k hyperparameter is working.
  • The test() method uses a Hyperparameter object, performs the test, and saves the result.

Looking back at Chapter 1's context diagram, we see three stories: Provide Training Data, Set Parameters and Test Classifier, and Make Classification Request. It seems helpful to add a method to perform a classification using a given Hyperparameter instance. This would add a classify() method to the TrainingData class. Again, this was not clearly required at the beginning of our design work, but seems like a good idea now.

Here's the start of the class definition:

import datetime
from typing import List

class TrainingData:
    """A set of training data and testing data with methods to load and test the samples."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.uploaded: datetime.datetime
        self.tested: datetime.datetime
        self.training: List[Sample] = []
        self.testing: List[Sample] = []
        self.tuning: List[Hyperparameter] = []

We've defined a number of attributes to track the history of the changes to this class. The uploaded time and the tested time, for example, provide some history. The training, testing, and tuning attributes have Sample objects and Hyperparameter objects.

We won't write methods to set all of these. This is Python and direct access to attributes is a huge simplification to complex applications. The responsibilities are encapsulated in this class, but we don't generally write a lot of getter/setter methods.

In Chapter 5, When to Use Object-Oriented Programming, we'll look at some clever techniques, like Python's property definitions, as additional ways to handle these attributes.

The load() method is designed to process data given by another object. We could have designed the load() method to open and read a file, but then we'd bind the TrainingData to a specific file format and logical layout. It seems better to isolate the details of the file format from the details of managing training data. In Chapter 5, we'll look closely at reading and validating input. In Chapter 9, Strings, Serialization, and File Paths, we'll revisit the file format considerations.

For now, we'll use the following outline for acquiring the training data:

    def load(
        self,
        raw_data_source: Iterable[dict[str, str]],
    ) -> None:
        """Load and partition the raw data"""
        for n, row in enumerate(raw_data_source):
            # Filter and extract subsets (see Chapter 6).
            # Create the self.training and self.testing subsets.
            ...
        self.uploaded = datetime.datetime.now(tz=datetime.timezone.utc)

We'll depend on a source of data. We've described the properties of this source with a type hint, Iterable[dict[str, str]]. The Iterable hint states that the source can be consumed by a for statement or the list function. This is true of collections like lists and files. It's also true of generator functions, the subject of Chapter 10, The Iterator Pattern.

The results of this iterator need to be dictionaries that map strings to strings. This is a very general structure, and it allows us to require a dictionary that looks like this:

{
    "sepal_length": "5.1",
    "sepal_width": "3.5",
    "petal_length": "1.4",
    "petal_width": "0.2",
    "species": "Iris-setosa"
}

This required structure seems flexible enough that we can build some object that will produce it. We'll look at the details in Chapter 9.
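
As a rough sketch of how such a source might be consumed, the following standalone function partitions rows from a csv.DictReader, which yields dictionaries that map strings to strings. The partition() function, the test_every parameter, and the every-fifth-row rule are assumptions made only for this illustration; the chapter leaves the actual rule open. The CSV text is a handful of illustrative rows, not a real data file.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def partition(
    raw_data_source: Iterable[Row], test_every: int = 5
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (training, testing) lists.

    Assumption: every ``test_every``-th row is held back for testing.
    The chapter does not define the rule; this is only a placeholder.
    """
    training: List[Row] = []
    testing: List[Row] = []
    for n, row in enumerate(raw_data_source):
        if n % test_every == 0:
            testing.append(row)
        else:
            training.append(row)
    return training, testing


# An in-memory CSV document standing in for a real file.
CSV_TEXT = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
training, testing = partition(reader)
print(len(training), len(testing))
print(testing[0]["sepal_length"])  # a string, not a float
```

Because the function only requires an Iterable, a list of dictionaries or a generator function would work equally well as the source. The measurement values arrive as strings, so converting them to float is a separate step.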

The remaining methods delegate most of their work to the Hyperparameter class. Rather than do the work of classification, this class relies on another class to do the work:

    def test(
        self,
        parameter: Hyperparameter,
    ) -> None:
        """Test this Hyperparameter value."""
        parameter.test()
        self.tuning.append(parameter)
        self.tested = datetime.datetime.now(tz=datetime.timezone.utc)

    def classify(
        self,
        parameter: Hyperparameter,
        sample: Sample,
    ) -> Sample:
        """Classify this Sample."""
        classification = parameter.classify(sample)
        sample.classify(classification)
        return sample

In both cases, a specific Hyperparameter object is provided as a parameter. For testing, this makes sense because each test should have a distinct value. For classification, however, the "best" Hyperparameter object should be used for classification.
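
One possible way to locate the "best" value is to pick the tested entry with the highest quality. This is a sketch under stated assumptions: Trial is a hypothetical stand-in for a tested Hyperparameter, best_trial() is not a method from the case study, and the quality scores are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """A hypothetical stand-in for a tested Hyperparameter: k and its quality."""

    k: int
    quality: float


def best_trial(tuning: List[Trial]) -> Trial:
    """Return the trial with the highest quality score."""
    if not tuning:
        raise ValueError("no Hyperparameter values have been tested yet")
    return max(tuning, key=lambda trial: trial.quality)


# Made-up quality scores, for illustration only.
tuning = [
    Trial(k=1, quality=0.80),
    Trial(k=5, quality=0.95),
    Trial(k=21, quality=0.85),
]
print(best_trial(tuning).k)  # 5
```

A similar expression over the tuning list of a TrainingData object could supply the Hyperparameter used by classify().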

This part of the case study has built class definitions for Sample, KnownSample, TrainingData, and Hyperparameter. These classes capture parts of the overall application. This isn't complete, of course; we've omitted some important algorithms. It's good to start with things that are clear, identify behavior and state change, and define the responsibilities. The next pass of design can then fill in details around this existing framework.

 

Recall

Some key points in this chapter:

  • Python has optional type hints to help describe how data objects are related and what the parameters should be for methods and functions.
  • We create Python classes with the class statement. We should initialize the attributes in the special __init__() method.
  • Modules and packages are used as higher-level groupings of classes.
  • We need to plan out the organization of module content. While the general advice is "flat is better than nested," there are a few cases where it can be helpful to have nested packages.
  • Python has no notion of "private" data. We often say "we're all adults here"; we can see the source code, and private declarations aren't very helpful. This doesn't change our design; it simply removes the need for a handful of keywords.
  • We can install third-party packages using the pip tool. We can create a virtual environment, for example, with venv.
 

Exercises

Write some object-oriented code. The goal is to use the principles and syntax you learned in this chapter to ensure you understand the topics we've covered. If you've been working on a Python project, go back over it and see whether there are some objects you can create and add properties or methods to. If it's large, try dividing it into a few modules or even packages and play with the syntax. While a "simple" script may expand when refactored into classes, there's generally a gain in flexibility and extensibility.

If you don't have such a project, try starting a new one. It doesn't have to be something you intend to finish; just stub out some basic design parts. You don't need to fully implement everything; often, just a print("this method will do something") is all you need to get the overall design in place. This is called top-down design, in which you work out the different interactions and describe how they should work before actually implementing what they do. The converse, bottom-up design, implements details first and then ties them all together. Both patterns are useful at different times, but for understanding object-oriented principles, a top-down workflow is more suitable.

If you're having trouble coming up with ideas, try writing a to-do application. It can keep track of things you want to do each day. Items can have a state change from incomplete to completed. You might want to think about items that have an intermediate state of started, but not yet completed.

Now try designing a bigger project. A collection of classes to model playing cards can be an interesting challenge. Cards have a few features, but there are many variations on the rules. A class for a hand of cards has interesting state changes as cards are added. Locate a game you like and create classes to model cards, hands, and play. (Don't tackle creating a winning strategy; that can be hard.)

A game like Cribbage has an interesting state change where two cards from each player's hand are used to create a kind of third hand, called "the crib." Make sure you experiment with the package and module-importing syntax. Add some functions in various modules and try importing them from other modules and packages. Use relative and absolute imports. See the difference, and try to imagine scenarios where you would want to use each one.

 

Summary

In this chapter, we learned how simple it is to create classes and assign properties and methods in Python. Unlike many languages, Python differentiates between a constructor and an initializer. It has a relaxed attitude toward access control. There are many different levels of scope, including packages, modules, classes, and functions. We understood the difference between relative and absolute imports, and how to manage third-party packages that don't come with Python.

In the next chapter, we'll learn more about sharing implementation using inheritance.

About the Authors

  • Steven F. Lott

    Steven F. Lott has been programming since the 70s, when computers were large, expensive, and rare. As a contract software developer and architect, he has worked on hundreds of projects, from very small to very large. He's been using Python to solve business problems for almost 20 years.

  • Dusty Phillips

    Dusty Phillips is a Canadian software developer and an author currently living in New Brunswick. He has been active in the open-source community for 2 decades and has been programming in Python for nearly as long. He holds a master's degree in computer science and has worked for Facebook, the United Nations, and several startups.
