Completing Our Project

We are at the end of the project, and it is time to add meat to the scaffold we have built, because so far we don't have much else. We still need to write the code for each of our apps. Once that is done, we will publish the code to the public PyPI servers. This is critical because we will then pull our pipeline code from a package repository, which is the ideal scenario. We will also set up CI for our code to handle checking and scanning. Given the limited space, we will not cover deploying the pipeline code itself through CI, but that is the natural next step in the process. We will also cover schema management and some limited data governance. The goal is a working example of a data pipeline that is in line with something you would see in production.

This chapter covers the following topics:

  • Documentation
  • Faking data with Mockaroo
  • Managing our schemas with code
  • Building our data pipeline application
  • Creating our machine learning application
  • Displaying our data with dashboards

Technical requirements

The tooling used in this chapter is tied to the technology stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • GitHub
  • Terraform
  • PyPI

Documentation

When starting out on a project, it's good to get up to speed on what the project is about and how it will be interacted with. Here, we will lay out our schemas and high-level C4 System Context diagrams. For these diagrams, I used PlantUML, a simple text-based language for describing diagrams. PyCharm will render them and check your syntax, so it is very easy to work with.

Schema diagram

Schema diagrams are very useful for users who want a basic understanding of the data and how they might use it. Normally, in a schema diagram, you will find the field names, the types, and sometimes sample data. This type of diagram works well for structured data with few columns. If your data is semi-structured or has a significant number of columns, I would avoid this type of diagram and document the schema in JSON format instead.

Here we have three tables in our Bronze layer: sales, machine_raw, and sap_BSEG.

Figure 12.1: Bronze layer 1
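If your tables grow wider or pick up nested fields, the same information can live in code rather than in a diagram: a Spark schema can be declared once and serialized to JSON for documentation. The following is a minimal sketch; the sales field names and types are assumptions for illustration, not the book's actual Bronze columns.

# Illustrative sketch: field names and types are assumptions, not the book's Bronze schema.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

sales_bronze_schema = StructType([
    StructField("sale_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("sold_at", TimestampType(), nullable=True),
])

# StructType.json() returns a JSON string that can be checked into the repo
# and shared in place of (or alongside) a schema diagram.
print(sales_bronze_schema.json())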

Faking data with Mockaroo

Faking data is an important topic for anyone working with data pipelines. You are not always able to use real data, for reasons ranging from legal constraints to company policy; in fact, this has often been the case for me. Faking data has its own problem: does the fake data fully resemble your real data? Probably not. For our purposes, though, we only need to mimic real data, so it's perfectly fine.

Mockaroo is a free hosted data service that can also be used to create simple REST APIs. Here, I created our three raw data schemas and then clicked on the button that says CREATE API.

Here, we are creating a schema for our machine API. We are using two Number columns and one Binomial Distribution column. When done, you can click CREATE API.

Figure 12.9: Mockaroo machine schema
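Once the API is created, Mockaroo serves the generated rows over HTTP, so a pipeline can fetch fake data exactly the way it would fetch real data. Here is a minimal sketch; the endpoint name, API key placeholder, and row count are assumptions, not values from the book.

# Minimal sketch -- the endpoint name, key, and row count below are hypothetical placeholders.
import requests

MOCKAROO_URL = "https://my.api.mockaroo.com/machine.json"  # assumed endpoint name
API_KEY = "<your-mockaroo-api-key>"

response = requests.get(MOCKAROO_URL, params={"key": API_KEY, "count": 100}, timeout=30)
response.raise_for_status()
rows = response.json()  # list of dictionaries, one per generated record
print(f"Fetched {len(rows)} fake machine records")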

Next, we are creating a schema for our sales API. We are using several columns, as outlined in the following figure. When done...

Managing our schemas with code

Our schema app manages applying and updating schemas in Databricks. It's important to have a mechanism for managing schemas: data swamps quickly form when schemas are not managed correctly. In this project, the other apps do not reference the schema app as a central view of the schema. Doing so might be a good idea for your project, but it adds the overhead of dealing with package versioning.

In our configuration folder, we will keep data classes that define how we want our database and tables configured from a high level:

schema-jobs/schema_jobs/jobs/configuration/database_configuration.py
"""
fill in
"""
import abc
from dataclasses import dataclass


class DatabaseConfig(abc.ABC):
    """
    fill in
    """
    database_name = "dev"


@dataclass
class DatabaseConfiguration...
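The concrete configuration class is cut off above. Following the same base-class pattern, it would most likely inherit from DatabaseConfig; the sketch below continues the snippet, but the location and table names are assumptions added for illustration, not the book's exact values.

# Hypothetical continuation of the snippet above -- fields other than database_name are assumptions.
@dataclass
class DatabaseConfiguration(DatabaseConfig):
    """
    High-level settings for the database and its tables.
    """
    database_name: str = "dev"
    location: str = "/mnt/bronze"  # assumed storage path
    tables: tuple = ("sales", "machine_raw", "sap_BSEG")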

Building our data pipeline application

Next comes our ETL app. Its configuration is more complex than the schema app's, but it works in basically the same way.

Here, we define our REST APIs; each API has a name that matches the table name and a URL:

etl-jobs/etl_jobs/configuration/api_raw_data/apis.py
"""
fill in
"""

The import section is tiny, but let's start there first:

import abc
from dataclasses import dataclass

Here we follow the same base configuration pattern: first we create a base class, and then we inherit from it, adding information. This pattern gives every configuration class a uniform, predictable shape:

class restConfig(abc.ABC):
    """
    fill in
    """
    name = ""
    url = ""

We will now inherit from the base class and add information to our configuration:

...
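The concrete subclasses are elided in this excerpt. Purely as a sketch of the pattern, continuing the same apis.py module, one of them might look like this; the class name and URL are hypothetical, not taken from the book.

# Hypothetical example of the pattern -- the real class names and URLs are not shown in this excerpt.
@dataclass
class MachineApiConfig(restConfig):
    name: str = "machine_raw"                              # matches the Bronze table name
    url: str = "https://my.api.mockaroo.com/machine.json"  # assumed Mockaroo endpoint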

Creating our machine learning application

Here is the main ML function. It will call functions to load data, create modeling data, and train our model:

ml-jobs/ml_jobs/jobs/build_sales_model.py
from ml_jobs.utils.data_prep.get_train_test_split import get_train_test_split
from ml_jobs.utils.extract.get_table import get_table
from ml_jobs.utils.management.setup_experiment import setup_experiment
from ml_jobs.utils.model.train_sales import train_sales
from pyspark.sql import SparkSession


def build_sales_model():
    """
    fill in
    """
    spark = SparkSession \
        .builder \
        .appName("Schema App") \
        .getOrCreate()
    gold_sales = get_table("sales")
    model_data = get_train_test_split(gold_sales)
    ...
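The helper functions imported at the top live in ml_jobs.utils, and their bodies are not shown in this excerpt. As an illustration only, get_train_test_split could be as simple as a random split of the Gold DataFrame; the split ratio and seed here are assumptions.

# Hypothetical sketch of ml_jobs/utils/data_prep/get_train_test_split.py -- not the book's code.
from pyspark.sql import DataFrame


def get_train_test_split(gold_sales: DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Randomly split a Gold-layer DataFrame into train and test sets."""
    train_df, test_df = gold_sales.randomSplit([1.0 - test_fraction, test_fraction], seed=seed)
    return train_df, test_df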

Displaying our data with dashboards

There are many ways to create dashboards; more often than not, I find that simple notebooks with useful metrics are the most common form. Here, I created a SQL-only notebook that queries the Gold sales table and creates a scatter plot of the data. It can be scheduled to refresh at whatever cadence you need using the schedule option at the top left.

Figure 12.12: Databricks chart 1

Next, we will run a SQL command that counts failed versus passed engine records. Simple queries such as this are very common in dashboard notebooks.

Figure 12.13: Databricks chart 2
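Run from a Python cell in the same notebook (where Databricks provides the spark session), an equivalent count might look like the following; the table and column names are assumptions, not the book's exact schema.

# Illustrative only: the gold.engine_status table and failed column are hypothetical names.
counts = spark.sql("""
    SELECT failed, COUNT(*) AS engine_count
    FROM gold.engine_status
    GROUP BY failed
""")
counts.show()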

Summary

So, here we are at the end of our project and our data platform architecture journey. We covered topics including Spark, Delta Lake, Lambda architecture, Kafka, and MLOps; looked at how to build pipelines, package them, and deploy them centrally; and delved into governance and platform design. This chapter and Chapter 11 built the beginnings of production-ready data pipelines, which can serve as the foundation of your data platform.

It's my hope that you can take this project as a starting point for your next data platform and build on it. I have presented many ideas and kernels of best practices; it's up to you to continue the journey in your own projects. Whether you keep the same tools or replace them with others you prefer, the fundamental ideas stay the same.
