Integrating Continuous Integration into Your Workflow

As they grow, many data projects go from being a scattering of notebooks to a continuous integration (CI)-driven application. In this chapter, we will go through some of the tooling and concepts for stringing together your Python scripts and notebooks into a working data application. We will be using Jenkins for CI, GitHub for source control, Databricks workflows for orchestration, and Terraform for Infrastructure as Code (IaC). Each of these tools can be swapped out for your preferred alternative without much effort.

In this chapter, we’re going to cover the following main topics:

  • Python wheels and creating a Python package
  • CI with Jenkins
  • Working with source control using GitHub
  • Creating Databricks jobs and controlling several jobs using workflows
  • Creating IaC using Terraform

Technical requirements

The tooling used in this chapter is tied to the technology stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • AWS
  • Terraform Cloud
  • GitHub
  • Jenkins
  • Docker

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Databricks

As with many other chapters, this one assumes you have a working version of Python 3.6 and the preceding tooling installed in your development environment. We also assume that you have set up an AWS account and have set up Databricks with that AWS account.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything has been installed correctly. If this command prints the tool’s version, then everything is working correctly:

databricks -v

Now let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will prompt for the host of your Databricks instance and the token you just created:

databricks configure --token

We can quickly determine whether...
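
As a quick sanity check, you can also call the Databricks REST API directly. The following is a minimal sketch, assuming placeholder values for the workspace host and token that you would replace with your own:

import requests

# Placeholders – replace with your workspace URL and personal access token
host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

# The Clusters API returns a list of clusters if the host and token are valid
response = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
print(response.json())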

CI tooling

When making the transition to a more organized project, the tooling we will go through is organized around what is known as the software development life cycle (SDLC). This is generally understood as the preferred path when writing software. This life cycle isn’t always a good fit – for example, in research-style projects such as data science projects.

We have set up a large number of tools, but let’s first take a look at Git and GitHub.

Git and GitHub

Source control is a fundamental component in writing software and managing technology. When you write your code, you check it into source control. When you are ready to bring that code into the main branch, you create a pull request. A pull request is a process where other members of the team will review your code, discuss portions of that code, and work together with you. The output of a pull request is confidence in the new feature you are bringing into your project.

Let’s...

Python wheels and packages

A Python wheel file is a ZIP-format archive that holds a built Python package – in other words, a package that is ready to install and use. Often, when you install something from pip, it is a wheel. When you build a Python application and store it on a PyPI server, it’s a wheel.

Anatomy of a package

The typical entry point for your Python application is __main__.py.

The __init__.py file is found in every package folder you use. It can have special purposes, such as holding the version – for example, __version__ = "1.0.0".
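
As a minimal sketch, assuming a hypothetical package named my_app, these two files might look like this:

# my_app/__init__.py
__version__ = "1.0.0"

# my_app/__main__.py – executed when you run `python -m my_app`
from my_app import __version__

def main():
    print(f"running my_app version {__version__}")

if __name__ == "__main__":
    main()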

An alternative to setup.py is pyproject.toml; both are central places to put project-wide configurations. You can specify requirements, among other things. Here is an example:

PACKAGE_REQUIREMENTS = ["pyyaml"]
LOCAL_REQUIREMENTS = [
    "pyspark==3.2.1",
    "delta-spark==1.1.0",
    "scikit-learn",
    "pandas...

DBX

DBX is a central tool meant for CI workloads when working with Databricks. You can use it to create a project template and to deploy and launch your workflows. Since DBX uses the Databricks APIs, it is able to use Databricks workflows. A workflow is a grouping of Databricks notebooks or jobs meant to flow together.

These are some of the most important files:

  • .dbx/project.json: Organized by environments; used to manage configuration across your project.
  • project_folder: Used to store your Python code that isn’t included in notebooks or tests.
  • conf/deployment.yml: A YAML-based configuration file that allows you to define the details of Databricks workflows. At the moment, you can define tasks for Databricks notebooks and jobs.
  • notebooks: Used to hold Databricks notebooks.
  • tests: Should be used for integration and unit tests, with each in its own subfolder structure.

Important commands

To create your shell project (not required but useful), run the following command...

Testing code

It’s important to test your code, and there are many schools of thought about how to write and test it. Some people feel you need to write tests before or alongside your code. Some people feel you should have a test for every “group” of code you write. This is typically a decision for the development team to make, but understanding the reasoning behind it is often very useful. When we write tests before or with our code, we force ourselves to write testable code in smaller chunks. Code coverage is an emotional discussion, but I have always found that it’s an easy way to improve the quality of code.

Unit tests

A unit test is a test that doesn’t exceed the bounds of the system running your code and looks to validate an assumption about a group of code. Unit tests are generally for functions, methods, and classes that do not interact with the “outside” world. When you have a function that interacts with the outside world but...
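
Here is a minimal pytest sketch of that idea; the functions and the rate client are hypothetical and exist only for illustration – the pure function is tested directly, while the outside-world dependency is injected and replaced with a mock:

from unittest.mock import Mock

def add_tax(amount: float, rate: float) -> float:
    # Pure function: no outside-world access, ideal for a unit test
    return round(amount * (1 + rate), 2)

def price_with_live_rate(amount: float, rate_client) -> float:
    # The external dependency is injected, so a test can swap it out
    return add_tax(amount, rate_client.fetch_rate("US"))

def test_add_tax():
    assert add_tax(100.0, 0.07) == 107.0

def test_price_with_live_rate():
    fake_client = Mock()
    fake_client.fetch_rate.return_value = 0.07
    assert price_with_live_rate(100.0, fake_client) == 107.0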

Terraform – IaC

Terraform is a vendor-neutral tool and library for deploying and managing infrastructure. Terraform configurations are written in HCL (HashiCorp Configuration Language), a high-level declarative language. The one key component that makes Terraform work is the state file. The state file is essentially a transaction log for your resources; it is how Terraform knows what to build, modify, and delete. If you lose your state file, you will have to manage the infrastructure that was created manually.

It is also possible to create small reusable modules of Terraform code and break up your state files into smaller units. In cases where you break your state file up, you can define other state files to use in your configuration file.

IaC

Many of us have been there: walking through GUI interfaces and building out our servers, clusters, and workspaces. How do we validate that we built out the correct infrastructure? Repeating the same steps by hand, consistently, can be difficult...

Jenkins

Jenkins is one of the many CI tools used to automate building, testing, and deploying your code. Jenkins supports a declarative pipeline language, written in a file stored in the Git repository.

Jenkinsfile

Here is a basic example of a declarative pipeline that would live in a Jenkinsfile:

pipeline {
    agent any
    stages {
        stage('first stage') {
            steps {
                echo 'Step 1'
            }
        }
    }
}

In the preceding code, we first define which agent the work will be done on – for our use case, we will use any agent. In other cases, you might have work organized by teams, production, or other possible choices...

Practical lab

We’ll now use this lab to implement everything we have learned.

Problem 1

Create a repo and use Terraform to create a new cluster.

We now use the gh CLI to create the repository in GitHub:

gh repo create
? What would you like to do? Create a new repository on GitHub from scratch
? Repository name chapter_8_infra
? Description used for infrascture
? Visibility Public
? Would you like to add a README file? Yes
? Would you like to add a .gitignore? Yes
? Choose a .gitignore template Python
? Would you like to add a license? Yes
? Choose a license GNU Affero General Public License v3.0
? This will create "chapter_8_infra" as a public repository on GitHub. Continue? Yes
✓ Created repository bclipp/chapter_8_infra on GitHub
? Clone the new repository locally? Yes

Next, we create an organization, workspace, and project with Terraform Cloud:

Figure 8.11: Creating an organization

Figure 8.12: Creating a workspace

Figure...

Summary

We covered a vast number of topics in one chapter, yet we have only touched the surface of CI. CI can be complex, but it doesn’t have to be. Hopefully, you have some working knowledge of some tools and techniques for automating your workloads across your data platform. In the next chapter, we will explore various ways to orchestrate our data workflows.
