Integrating Continuous Integration into Your Workflow

As they grow, many data projects go from being a scattering of notebooks to a continuous integration (CI)-driven application. In this chapter, we will go through some of the tooling and concepts for stringing together your Python scripts and notebooks into a working data application. We will be using Jenkins for CI, GitHub for source control, Databricks workflows for orchestration, and Terraform for Infrastructure as Code (IaC). Each of these tools can be swapped out for your preferred alternative without much effort.

In this chapter, we’re going to cover the following main topics:

  • Python wheels and creating a Python package
  • CI with Jenkins
  • Working with source control using GitHub
  • Creating Databricks jobs and controlling several jobs using workflows
  • Creating IaC using Terraform

Technical requirements

The tooling used in this chapter is tied to the technology stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • AWS
  • Terraform Cloud
  • GitHub
  • Jenkins
  • Docker

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Databricks

As with many other chapters, this one assumes you have a working version of Python 3.6 and the preceding tooling installed in your development environment. We also assume that you have set up an AWS account and have set up Databricks with that AWS account.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything has been installed correctly. If this command prints the tool’s version, then everything is working correctly:

databricks -v

Now let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will prompt for the host of your Databricks instance and the token you just created:

databricks configure --token

We can quickly determine whether...
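
As a quick sanity check, you can also call the Databricks REST API directly. The following is a minimal sketch, assuming placeholder values for the workspace host and token that you would replace with your own:

import requests

# Placeholders – replace with your workspace URL and personal access token
host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

# The Clusters API returns a list of clusters if the host and token are valid
response = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
print(response.json())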

CI tooling

When making the transition to a more organized project, the tooling we will go through is organized around what is known as the software development life cycle (SDLC). This is generally understood as the preferred path when writing software. This life cycle isn’t always a good fit – for example, in research-style projects such as data science projects.

We have set up a large number of tools, but let’s first take a look at Git and GitHub.

Git and GitHub

Source control is a fundamental component in writing software and managing technology. When you write your code, you check it into source control. When you are ready to bring that code into the main branch, you create a pull request. A pull request is a process where other members of the team will review your code, discuss portions of that code, and work together with you. The output of a pull request is confidence in the new feature you are bringing into your project.

Let’s...

Python wheels and packages

A Python wheel file is a ZIP-format archive that holds a built Python package – in other words, a package that is ready to install and use. Often, when you install something from pip, it is a wheel. When you build a Python application and store it on a PyPI server, it’s a wheel.

Anatomy of a package

The typical entry point for your Python application is __main__.py.

The __init__.py file is found in every package folder you use. It can have special purposes, such as holding the version – for example, __version__ = "1.0.0".
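
As a minimal sketch, assuming a hypothetical package named my_app, these two files might look like this:

# my_app/__init__.py
__version__ = "1.0.0"

# my_app/__main__.py – executed when you run `python -m my_app`
from my_app import __version__

def main():
    print(f"running my_app version {__version__}")

if __name__ == "__main__":
    main()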

An alternative to setup.py is pyproject.toml; both are central places to put project-wide configurations. You can specify requirements, among other things. Here is an example:

PACKAGE_REQUIREMENTS = ["pyyaml"]
LOCAL_REQUIREMENTS = [
    "pyspark==3.2.1",
    "delta-spark==1.1.0",
    "scikit-learn",
    "pandas...

DBX

DBX is a central tool meant for CI workloads when working with Databricks. You can use it to create a project template and to deploy and launch your workflows. Since DBX uses the Databricks APIs, it is able to use Databricks workflows. A workflow is a grouping of Databricks notebooks or jobs meant to flow together.

These are some of the most important files:

  • .dbx/project.json: Organized by environments; used to manage configuration across your project.
  • project_folder: Used to store your Python code that isn’t included in notebooks or tests.
  • conf/deployment.yml: A YAML-based configuration file that allows you to define the details of Databricks workflows. At the moment, you can define tasks for Databricks notebooks and jobs.
  • notebooks: Used to hold Databricks notebooks.
  • tests: Should be used for integration and unit tests, with each in its own subfolder structure.

Important commands

To create your shell project (not required but useful), run the following command...

Testing code

It’s important to test your code, and there are many schools of thought about how to write and test it. Some people feel you need to write tests before or alongside your code. Some people feel you should have a test for every “group” of code you write. This is typically a decision for the development team to make, but understanding the reasoning behind it is often very useful. When we write tests before or with our code, we force ourselves to write testable code in smaller chunks. Code coverage is an emotional discussion, but I have always found that it’s an easy way to improve the quality of code.

Unit tests

A unit test is a test that doesn’t exceed the bounds of the system running your code and looks to validate an assumption about a group of code. Unit tests are generally for functions, methods, and classes that do not interact with the “outside” world. When you have a function that interacts with the outside world but...
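
Here is a minimal pytest sketch of that idea; the functions and the rate client are hypothetical and exist only for illustration – the pure function is tested directly, while the outside-world dependency is injected and replaced with a mock:

from unittest.mock import Mock

def add_tax(amount: float, rate: float) -> float:
    # Pure function: no outside-world access, ideal for a unit test
    return round(amount * (1 + rate), 2)

def price_with_live_rate(amount: float, rate_client) -> float:
    # The external dependency is injected, so a test can swap it out
    return add_tax(amount, rate_client.fetch_rate("US"))

def test_add_tax():
    assert add_tax(100.0, 0.07) == 107.0

def test_price_with_live_rate():
    fake_client = Mock()
    fake_client.fetch_rate.return_value = 0.07
    assert price_with_live_rate(100.0, fake_client) == 107.0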

Terraform – IaC

Terraform is a vendor-neutral tool and library for deploying and managing infrastructure. Terraform configurations are written in HCL (HashiCorp Configuration Language), a high-level declarative language. The one key component that makes Terraform work is the state file. The state file is essentially a transaction log for your resources; it is how Terraform knows what to build, modify, and delete. If you lose your state file, you will have to manage the infrastructure that was created manually.

It is also possible to create small reusable modules of Terraform code and break up your state files into smaller units. In cases where you break your state file up, you can define other state files to use in your configuration file.

IaC

Many of us have been there: walking through GUI interfaces and building out our servers, clusters, and workspaces. How do we validate that we built out the correct infrastructure? Repeating the same steps by hand, consistently, can be difficult...

Jenkins

Jenkins is one of the many CI tools used to automate building, testing, and deploying your code. Jenkins supports a declarative pipeline language, written in a file stored in the Git repository.

Jenkinsfile

Here is a basic example of a declarative pipeline that would live in a Jenkinsfile:

pipeline {
    agent any
    stages {
        stage('first stage') {
            steps {
                echo 'Step 1'
            }
        }
    }
}

In the preceding code, we first define which agent the work will be done on – for our use case, we will use any agent. In other cases, you might have work organized by teams, production, or other possible choices...

Practical lab

We’ll now use this lab to implement everything we have learned.

Problem 1

Create a repo and use Terraform to create a new cluster.

We now use the gh CLI to create the repository in GitHub:

gh repo create
? What would you like to do? Create a new repository on GitHub from scratch
? Repository name chapter_8_infra
? Description used for infrascture
? Visibility Public
? Would you like to add a README file? Yes
? Would you like to add a .gitignore? Yes
? Choose a .gitignore template Python
? Would you like to add a license? Yes
? Choose a license GNU Affero General Public License v3.0
? This will create "chapter_8_infra" as a public repository on GitHub. Continue? Yes
✓ Created repository bclipp/chapter_8_infra on GitHub
? Clone the new repository locally? Yes

Next, we create an organization, workspace, and project with Terraform Cloud:

Figure 8.11: Creating an organization

Figure 8.12: Creating a workspace

Figure...

Summary

We covered a vast number of topics in one chapter, yet we have only touched the surface of CI. CI can be complex, but it doesn’t have to be. Hopefully, you have some working knowledge of some tools and techniques for automating your workloads across your data platform. In the next chapter, we will explore various ways to orchestrate our data workflows.
