Completing Our Project

We are at the end of the project, and it is time to add meat to the scaffold we have built, because so far we don't have much else. We still need to write the code for each of our apps. Once that is done, we will publish the code to the public PyPI servers. This is critical because we will then pull our pipeline code from a package repository, which is the ideal scenario. We will also set up CI for our code to handle checking and scanning. Given the limited space, we will not cover deploying the pipeline code itself through CI, but that is the natural next step in the process. We will also cover schema management and some limited data governance. The goal is a working example of a data pipeline that is in line with something you would see in production.

This chapter covers the following topics:

  • Documentation
  • Faking data with Mockaroo
  • Managing our schemas with code
  • Building our data pipeline application
  • Creating our machine learning application
  • Displaying our data with dashboards

Technical requirements

The tooling used in this chapter is tied to the technology stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • GitHub
  • Terraform
  • PyPI

Documentation

When starting out on a project, it's good to get up to speed on what the project is about and how it will be interacted with. Here, we will lay out our schemas and high-level C4 System Context diagrams. For these diagrams, I used PlantUML, a simple text-based language for describing diagrams. PyCharm will render them and check your syntax, so it is very easy to work with.

Schema diagram

Schema diagrams are very useful for users who want a basic understanding of the data and how they might use it. Normally, in a schema diagram, you will find the field names, the types, and sometimes sample data. This type of diagram works well for structured data with few columns. If your data is semi-structured or has a significant number of columns, I would avoid this type of diagram and document the schema in JSON format instead.

Here we have three tables in our Bronze layer: sales, machine_raw, and sap_BSEG.

Figure 12.1: Bronze layer 1
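If your tables grow wider or pick up nested fields, the same information can live in code rather than in a diagram: a Spark schema can be declared once and serialized to JSON for documentation. The following is a minimal sketch; the sales field names and types are assumptions for illustration, not the book's actual Bronze columns.

# Illustrative sketch: field names and types are assumptions, not the book's Bronze schema.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

sales_bronze_schema = StructType([
    StructField("sale_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("sold_at", TimestampType(), nullable=True),
])

# StructType.json() returns a JSON string that can be checked into the repo
# and shared in place of (or alongside) a schema diagram.
print(sales_bronze_schema.json())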

Faking data with Mockaroo

Faking data is an important topic for anyone working with data pipelines. You are not always able to use real data, for reasons ranging from legal constraints to company policy; in fact, this has often been the case for me. Faking data has its own problem: does the fake data fully resemble your real data? Probably not. For our purposes, though, we only need to mimic real data, so it's perfectly fine.

Mockaroo is a free hosted data service that can also be used to create simple REST APIs. Here, I created our three raw data schemas and then clicked on the button that says CREATE API.

Here, we are creating a schema for our machine API. We are using two Number columns and one Binomial Distribution column. When done, you can click CREATE API.

Figure 12.9: Mockaroo machine schema
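Once the API is created, Mockaroo serves the generated rows over HTTP, so a pipeline can fetch fake data exactly the way it would fetch real data. Here is a minimal sketch; the endpoint name, API key placeholder, and row count are assumptions, not values from the book.

# Minimal sketch -- the endpoint name, key, and row count below are hypothetical placeholders.
import requests

MOCKAROO_URL = "https://my.api.mockaroo.com/machine.json"  # assumed endpoint name
API_KEY = "<your-mockaroo-api-key>"

response = requests.get(MOCKAROO_URL, params={"key": API_KEY, "count": 100}, timeout=30)
response.raise_for_status()
rows = response.json()  # list of dictionaries, one per generated record
print(f"Fetched {len(rows)} fake machine records")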

Next, we are creating a schema for our sales API. We are using several columns, as outlined in the following figure. When done...

Managing our schemas with code

Our schema app manages applying and updating schemas in Databricks. It's important to have a mechanism for managing schemas: data swamps quickly form when schemas are not managed correctly. In this project, the other apps do not reference the schema app as a central view of the schema. Doing so might be a good idea for your project, but it adds the overhead of dealing with package versioning.

In our configuration folder, we will keep data classes that define how we want our database and tables configured from a high level:

schema-jobs/schema_jobs/jobs/configuration/database_configuration.py
"""
fill in
"""
import abc
from dataclasses import dataclass


class DatabaseConfig(abc.ABC):
    """
    fill in
    """
    database_name = "dev"


@dataclass
class DatabaseConfiguration...
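The concrete configuration class is cut off above. Following the same base-class pattern, it would most likely inherit from DatabaseConfig; the sketch below continues the snippet, but the location and table names are assumptions added for illustration, not the book's exact values.

# Hypothetical continuation of the snippet above -- fields other than database_name are assumptions.
@dataclass
class DatabaseConfiguration(DatabaseConfig):
    """
    High-level settings for the database and its tables.
    """
    database_name: str = "dev"
    location: str = "/mnt/bronze"  # assumed storage path
    tables: tuple = ("sales", "machine_raw", "sap_BSEG")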

Building our data pipeline application

Next comes our ETL app. Its configuration is more complex than the schema app's, but it works in basically the same way.

Here, we define our REST APIs; each API has a name that matches the table name and a URL:

etl-jobs/etl_jobs/configuration/api_raw_data/apis.py
"""
fill in
"""

The import section is tiny, but let's start there first:

import abc
from dataclasses import dataclass

Here we follow the same base configuration pattern: first we create a base class, and then we inherit from it, adding information. This pattern gives every configuration class a uniform, predictable shape:

class restConfig(abc.ABC):
    """
    fill in
    """
    name = ""
    url = ""

We will now inherit from the base class and add information to our configuration:

...
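The concrete subclasses are elided in this excerpt. Purely as a sketch of the pattern, continuing the same apis.py module, one of them might look like this; the class name and URL are hypothetical, not taken from the book.

# Hypothetical example of the pattern -- the real class names and URLs are not shown in this excerpt.
@dataclass
class MachineApiConfig(restConfig):
    name: str = "machine_raw"                              # matches the Bronze table name
    url: str = "https://my.api.mockaroo.com/machine.json"  # assumed Mockaroo endpoint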

Creating our machine learning application

Here is the main ML function. It will call functions to load data, create modeling data, and train our model:

ml-jobs/ml_jobs/jobs/build_sales_model.py
from ml_jobs.utils.data_prep.get_train_test_split import get_train_test_split
from ml_jobs.utils.extract.get_table import get_table
from ml_jobs.utils.management.setup_experiment import setup_experiment
from ml_jobs.utils.model.train_sales import train_sales
from pyspark.sql import SparkSession


def build_sales_model():
    """
    fill in
    """
    spark = SparkSession \
        .builder \
        .appName("Schema App") \
        .getOrCreate()
    gold_sales = get_table("sales")
    model_data = get_train_test_split(gold_sales)
    ...
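The helper functions imported at the top live in ml_jobs.utils, and their bodies are not shown in this excerpt. As an illustration only, get_train_test_split could be as simple as a random split of the Gold DataFrame; the split ratio and seed here are assumptions.

# Hypothetical sketch of ml_jobs/utils/data_prep/get_train_test_split.py -- not the book's code.
from pyspark.sql import DataFrame


def get_train_test_split(gold_sales: DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Randomly split a Gold-layer DataFrame into train and test sets."""
    train_df, test_df = gold_sales.randomSplit([1.0 - test_fraction, test_fraction], seed=seed)
    return train_df, test_df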

Displaying our data with dashboards

There are many ways to create dashboards; more often than not, I find that simple notebooks with useful metrics are the most common form. Here, I created a SQL-only notebook that queries the Gold sales table and creates a scatter plot of the data. It can be scheduled to refresh at whatever cadence you need using the schedule option at the top left.

Figure 12.12: Databricks chart 1

Next, we will run a SQL command that counts failed versus passed engine records. Simple queries such as this are very common in dashboard notebooks.

Figure 12.13: Databricks chart 2
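Run from a Python cell in the same notebook (where Databricks provides the spark session), an equivalent count might look like the following; the table and column names are assumptions, not the book's exact schema.

# Illustrative only: the gold.engine_status table and failed column are hypothetical names.
counts = spark.sql("""
    SELECT failed, COUNT(*) AS engine_count
    FROM gold.engine_status
    GROUP BY failed
""")
counts.show()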

Summary

So, here we are at the end of our project and our data platform architecture journey. We covered topics including Spark, Delta Lake, Lambda architecture, Kafka, and MLOps; looked at how to build pipelines, package them, and deploy them centrally; and delved into governance and platform design. This chapter and Chapter 11 built the beginnings of production-ready data pipelines, which can serve as the foundation of your data platform.

It's my hope that you can take this project as a starting point for your next data platform and build on it. I have presented many ideas and kernels of best practices; it's up to you to continue the journey in your own projects. Whether you keep the same tools or replace them with others you prefer, the fundamental ideas stay the same.
