You're reading from  Learn Python by Building Data Science Applications

Product type: Book
Published in: Aug 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789535365
Edition: 1st Edition
Authors (2):
Philipp Kats

Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. With a bachelor's degree in architectural design and having followed the (at first) rocky path of a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.

David Katz

David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.

Packaging and Testing with Poetry and PyTest

Until now, all our code has lived in either notebooks or Python files. While that is totally fine, with the growth in volume and complexity of our code, it is increasingly becoming a good idea to form one or more go-to sources for the code we use most frequently, as well as sources for the complex code that we don't want to risk adding mistakes to.

In this chapter, we will learn how to build our own packages for use in multiple projects or to be easily shared with others, using the poetry package. A package can work as a deliverable—something you can pass to or share with your client! Building and testing packages is a vital skill that increases your productivity and allows you to save time and reduce stress by enabling you to reuse the same properly tested body of code again and again.

Building packages is also likely to...

Technical requirements

Building a package

So far in this book, we have been either using third-party packages, such as requests and pandas, or writing raw code as .py scripts or notebooks. While using Python files directly is absolutely fine for certain projects, it makes it hard for code to be reused and built upon; it is not sustainable for complex algorithms and tools that can be used over and over again. Such code is also hard to share as it has no overall structure, tends to decay over time, and doesn't have a robust dependency system; the code may not work on other systems with other packages (or other versions of packages) installed. Last but not least, this kind of practice affects the quality of our code, as we tend to write and use the code as a one-time solution. The best way to mitigate all those issues at once is to form your code into a package.

But what is a package? In Python, packages...

A few ways to build your package

The structure of a Python package is defined by a few specifications (https://packaging.python.org/specifications/) and PEPs (short for Python Enhancement Proposals), such as PEP 517 (https://www.python.org/dev/peps/pep-0517/), PEP 518 (https://www.python.org/dev/peps/pep-0518/), and PEP 427 (https://www.python.org/dev/peps/pep-0427/); the overall definition comes from the Python Packaging Authority (PyPA). In essence, a package is required to have, in addition to the actual code, a special file with metadata, including the package name, the description, the version, the tags, Python version support details, the authors, and the dependencies. This file could be a Python setup.py file, which was the standard solution for a long time, or a pyproject.toml file. The latter is a new, safer approach, but does not have as well-designed...

Testing the code so far

How would we know whether the code is good, anyway? The only good way is to rigorously test your code. While it may sound like a lot of somewhat unnecessary work, it is a practice that will repay you many times over in the future: once you're sure your code behaves as intended, it is much easier to add new features and be sure that they haven't broken any of the existing ones. Furthermore, you can upgrade dependencies or compare different implementations, all while being sure that your code still behaves as intended.

As with many other things, Python has a standard library for testing: unittest. In contrast to most of the standard libraries, however, unittest is fairly unpopular. Instead, another library, pytest, is considered the de facto industry standard for Python testing, as it provides a clean and reusable pattern of code and has support...
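As a rough illustration of the pytest style, the sketch below defines a small function and two tests for it. The function celsius_to_fahrenheit and the test names are hypothetical and are not part of the book's package; the only pytest conventions relied on are that test functions start with test_ and use plain assert statements.

```python
def celsius_to_fahrenheit(celsius):
    """Hypothetical function under test: convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


def test_freezing_point():
    # Water freezes at 0 degrees Celsius, which is 32 degrees Fahrenheit.
    assert celsius_to_fahrenheit(0) == 32


def test_boiling_point():
    # Water boils at 100 degrees Celsius, which is 212 degrees Fahrenheit.
    assert celsius_to_fahrenheit(100) == 212
```

Saved in a file whose name starts with test_ (for example, test_temperature.py), these functions would be collected and run by invoking pytest in the same folder. No test classes or special assertion methods are needed, which is a large part of why this style reads more cleanly than the unittest equivalent.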

Automating the process with CI services

Now, as you may recall, we are working on a tests branch of our repository. If you go to GitHub, it may offer to create a pull request—a procedure meant to merge your branch into the master branch or any other branch, as in the yellow section of the following screenshot. Even if the interface does not offer this (it won't if there was already a pull request a few minutes before), you can create a pull request yourself, via the New pull request button. See the following screenshot:

Using GitHub, you can request other people to review your changes, comment on them, and more; GitHub will also confirm whether merging is possible or whether you'll need to resolve conflicts first.

While, in our case, we did run our tests locally and we know it is safe to merge, there is no way for others to check that easily. In order to make...

Generating documentation with sphinx

Documentation is king when it comes to supporting consumers of your code and convincing newcomers that it actually makes sense to buy in and use your package. For most people, a documentation website is the first place they go to learn about the package. It is, by definition, assumed to be the single source of truth on the code in its current version.

The role of documentation is usually threefold:

  • Explain how to install your package and what the general requirements are (for example, which Python versions are supported)
  • Show how to use the package (preferably with a quick example showing its immediate value)
  • Express the general idea and philosophy of the package

A documentation website does benefit from having tutorials, example cases, and a roadmap. With that being said, the core of any documentation website is, obviously, documentation...

Installing a package in editable mode

As we have mentioned, you can install a package from GitHub and it will behave the same as any other installed package—it can be upgraded or uninstalled.

Often, however, you will want to use a package while developing it. It would be hard to do both in the normal installation routine; you'd have to either update or re-install the package every time you made any developmental changes, just to reflect those changes. To get around this, there is a great feature that keeps the advantages of both worlds—your code is treated as a package but can be easily modified in place. This feature is called editable mode. Essentially, it means the folder on your filesystem is registered as a package, and so the imported package will always reflect all the changes that you've made.
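One way to check whether a package is really being imported from your working folder is to ask Python where it would load the package from. The sketch below is only an illustration: the helper package_location is hypothetical, and it assumes the package was installed with pip's editable flag (pip install -e . run from the project folder), which may differ from the exact route the chapter takes.

```python
import importlib.util
from pathlib import Path


def package_location(name):
    """Return the folder a top-level package would be imported from.

    Returns None if the package cannot be found or has no file location
    (for example, modules compiled into the interpreter).
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location:
        return None
    return Path(spec.origin).resolve().parent


# Demonstrated on a standard library package; replace "json" with the
# name of your own package to see where it is loaded from.
print(package_location("json"))
```

For a regular installation, the reported folder typically sits inside site-packages; for an editable installation, it should point at the project folder on your filesystem, which is why your edits are picked up on the next import.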

In order to reap these benefits, you have to have a...

Summary

In this chapter, we went over the whole process of packaging code. In particular, we created a GitHub repository, generated a template via poetry, and added all the dependencies, meaning everyone can now install the package from GitHub using pip. We then went further, adding a few tests to make sure our package works as expected throughout future development. To simplify the process and make it transparent, we integrated a CI service, Azure Pipelines, to run tests on each pull request in order to prevent us from merging failing code into production.

In the next chapter, we will review another case, building a robust, secure, production-ready data pipeline using luigi.

Questions

  1. What are the benefits of packaging code?
  2. What is the main difference between conda and pip as package managers?
  3. What is dependency resolution, and why is it hard?
  4. What are the benefits of poetry over standard setuptools?
  5. Why do we need tests?
  6. What is the purpose of CI?

Further reading

