Modern Data Architectures with Python
Published in Sep 2023 | Publisher: Packt | ISBN-13: 9781801070492 | 1st Edition

Author: Brian Lipp

Brian Lipp is a technology polyglot, engineer, and solution architect with a wide skill set in many technology domains. His programming background has ranged from R, Python, and Scala to Go and Rust development. He has worked on big data systems, data lakes, data warehouses, and backend software engineering. Brian earned a Master of Science in CSIS from Pace University in 2009. He is currently a senior data engineer working with large tech firms to build data ecosystems.

Understanding Data Analytics

A new discipline called analytics engineering has emerged. An analytics engineer is primarily focused on taking data once it has been delivered and crafting it into consumable data products. An analytics engineer is expected to document, clean, and manipulate data into whatever form users need, whether those users are data scientists or business executives. The process of curating and shaping this data can abstractly be understood as data modeling.

In this chapter, we will go over several approaches to data modeling and documentation. We will, at the same time, start looking into PySpark APIs, as well as working with tools for code-based documentation.

By the end of the chapter, you will have built the fundamental skills to start any data analytics project.

In this chapter, we’re going to cover the following main topics:

  • Graphviz and diagrams
  • Critical PySpark APIs for data cleaning and preparation
  • Data modeling for SQL and NoSQL...

Technical requirements

I suggest the following:

  • PyCharm
  • A Databricks notebook
  • A Python environment with the following packages installed:
    • PySpark
    • diagrams
  • Graphviz

Setting up your environment

In this chapter, we will mostly work locally in an IDE; I'd like to walk you through setting up that environment.

Python

You have several options for installing Python. To check that your installation is working, type the following:

python --version

Ideally, you will see a version of Python 3.5 or above.

venv

We can install and set up our virtual environment using the following:

# virtualenv is optional here; the built-in venv module creates the environment
python3 -m pip install virtualenv
python3 -m venv chapter2
# activate the environment (Linux/macOS)
source chapter2/bin/activate
# on Windows, use: chapter2\Scripts\activate
pip install <package>
# leave the virtual environment when you are finished
deactivate

Graphviz

Graphviz is a very widely used graph-drawing tool. It's commonly used to create programmatically generated charts and diagrams.

Installations are available for every major OS here: https://graphviz.org/.

Workflow initialization

For this chapter, all Python code...

Cleaning and preparing your data

The history of data processing is long and has had several unique innovations. When you pull up data in Excel or any other data processing tool, you often see issues with the data that require fixes and changes. Data issues are extremely common, even with robust data practices. We will now go through several fundamental techniques using Apache Spark for cleansing and wrangling your data.

Duplicate values

Here, we set up our example DataFrame:

data_frame = spark.createDataFrame(data = [("Brian", "Engr", 1),
    ("Nechama", "Engr", 2),
    ("Naava", "Engr", 3),
    ("Miri", "Engr", 4),
    ("Brian", "Engr", 1),
    ("Miri", "Engr", 3),
  ], schema = ["name", "div", "ID"])

The first method...
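
One common approach, shown here as a minimal sketch, is PySpark's dropDuplicates, which removes rows that are exact duplicates or, with a subset argument, rows that repeat a chosen set of key columns:

# Drop rows that are duplicated across every column ("Brian", "Engr", 1 appears twice)
deduped_all = data_frame.dropDuplicates()

# Drop rows that repeat the chosen key columns, keeping a single row per key
deduped_by_key = data_frame.dropDuplicates(subset=["name", "div"])

deduped_all.show()
deduped_by_key.show()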

Data documentation

Data can get complex, and documentation is often non-existent. We will look into two areas that can be documented with code. Documenting with code rather than with GUI tools allows that documentation to be stored and reviewed effectively, like any other code.

diagrams

The diagrams library is very useful for creating solution diagrams. A solution diagram is often a broad picture of the architecture and key components. It can be organized in a way that explains key interactions.

Here, we are creating a small example document using the diagrams package:

from diagrams import Cluster, Diagram
from diagrams.aws.analytics import Quicksight, EMR
with Diagram("Data Platform", show=False):
    with Cluster("Dev"):
        dashboards = Quicksight("Tableau")
        spark_clusters = [EMR("Notebook_cluster"), EMR("Jobs_cluster")]
 ...
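
Building on the same node names, a minimal sketch of a finished diagram might connect the nodes with the >> operator, which draws directed edges (the specific connections here are illustrative assumptions, not the chapter's exact diagram):

from diagrams import Cluster, Diagram
from diagrams.aws.analytics import EMR, Quicksight

with Diagram("Data Platform", show=False):
    with Cluster("Dev"):
        dashboards = Quicksight("Tableau")
        spark_clusters = [EMR("Notebook_cluster"), EMR("Jobs_cluster")]
        # >> draws a directed edge from each EMR node to the dashboard node
        spark_clusters >> dashboards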

Data modeling patterns

Data modeling is simply defining the shapes and forms in which our data will be persisted and accessed. Modeling works for both structured and semi-structured data. It typically starts with understanding what the user wants and finding an appropriate shape or form to meet that need.

Relational

Relational modeling for transactional databases is a common set of techniques for making your data reliable, fast to write, and free of duplication.

First normal form

First normal form requires that each column holds only atomic values. In practice, that means flattening arrays, maps, and other semi-structured types into atomic columns.

An example of something not in first normal form is as follows:

+------------------+---+
|             fruit| id|
+------------------+---+
|    {apple -> red}|  1|
+------------------+---+

Here is the result of it being in first normal form:

+------+---+-----+
| fruit| id|color|
+------+---+-----+
| apple|  1|  red|
+------+---+-----+
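
As a minimal PySpark sketch (the DataFrame and column names are assumed from the example above, not taken from the chapter's code), a map column can be flattened to first normal form with explode:

from pyspark.sql import functions as F

# A small table that is not in first normal form: "fruit" is a map column
not_1nf = spark.createDataFrame(
    data=[({"apple": "red"}, 1)],
    schema=["fruit", "id"],
)

# explode turns each map entry into atomic key/value columns
first_nf = not_1nf.select(
    F.explode("fruit").alias("fruit", "color"),
    "id",
)
first_nf.show()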

Dimensional modeling

A dimensional model is traditionally seen in OLAP settings such as data warehouses and data lakes built with Apache Spark. The goal of dimensional modeling is to reduce duplication and create a central source of truth. One reason for reducing data duplication is to save on storage costs, although that is less of a factor with modern cloud storage. This data model consists of dimensions and facts. A dimension is a real-world entity we are trying to model, such as CUSTOMER, PRODUCT, DATE, or LOCATION, and a fact holds the numerical measures, such as REVENUE, PROFIT, SALES $ VALUE, and so on. The primary key of each dimension flows into the fact table as a foreign key, but more often than not, the relationship is not hardcoded into the database; rather, it is maintained by the process that loads and maintains the data, such as Extract, Transform, and Load (ETL). This data model is business-user-friendly and is used for analytical reporting and analysis...
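
As a minimal sketch of how a dimensional model is queried (the table names, columns, and values are illustrative assumptions, not the chapter's model), a fact table joins to its dimensions on the surrogate key and the numeric measures are aggregated:

# Illustrative dimension and fact tables
dim_customer = spark.createDataFrame(
    data=[(1, "Brian", "NY"), (2, "Miri", "LA")],
    schema=["customer_key", "customer_name", "region"],
)
fact_sales = spark.createDataFrame(
    data=[(1, "2023-01-01", 1400), (2, "2023-01-02", 1500)],
    schema=["customer_key", "sale_date", "revenue"],
)

# The fact table holds numeric measures; the dimension holds descriptive
# attributes. Reporting queries join them on the dimension's surrogate key.
report = (
    fact_sales.join(dim_customer, on="customer_key", how="left")
    .groupBy("region")
    .sum("revenue")
)
report.show()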

OBT

There has been some discussion in the community about storing information in one big table (OBT). Essentially, this is a denormalized table, with all the benefits and issues of denormalization mentioned before. Personally, I believe there can be huge benefits to the OBT approach, but as your needs grow, storing everything in one table can become very problematic. The more the columns in your table grow, the more inserts and updates you will have, and this can reach the point where the table is impossible to maintain. An alternative approach is to create data products once your OBT becomes too much to handle: each data product is a single table, built for a specific use case, that is pulled from a dimensional model. This gives you much of the best of both worlds; the remaining downsides are increased complexity and storage.
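
As a brief, illustrative sketch that reuses the hypothetical fact_sales and dim_customer tables from the previous example, an OBT-style data product is simply a denormalized join persisted for a specific use case:

# Denormalize the star schema into one wide table
one_big_table = fact_sales.join(dim_customer, on="customer_key", how="left")

# Persist it as a data product for a specific reporting use case
# (the table name is illustrative; on Databricks this defaults to a Delta table)
one_big_table.write.mode("overwrite").saveAsTable("sales_obt")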

Practical lab

We have a bronze table being loaded into our data lake using a third-party tool. There has been a request to clean up the data and resolve known issues. Your task is to write the needed Python code that will address each of the following issues.

The following are the issues present:

  • Wrong column name: The date column is spelled wrong
  • Nulls not correctly identified: The sales_id column has null values as NA strings
  • Data with missing values is unwanted: Any data with a null in sales_id should be dropped
  • Duplicate sales_id: Take the first value of any duplicate rows
  • Date column not DateType: The date column is not a DateType

Loading the problem data

The following code will create our bronze table:

bronze_sales = spark.createDataFrame(data = [
    ("1", "LA", "2000-01-01", 5, 1400),
    ("2", "LA", "1998-2-01", 4, 1500),
   ...
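
A possible solution sketch follows as one way to address the issues listed above. The misspelled date column's real name is not shown here, so "dte" is used as a placeholder, and the date format assumes values such as 2000-01-01 and 1998-2-01:

from pyspark.sql import functions as F

# "dte" is a placeholder for the misspelled date column name
silver_sales = (
    bronze_sales
    .withColumnRenamed("dte", "date")                    # fix the misspelled column name
    .replace("NA", None, subset=["sales_id"])            # treat "NA" strings as real nulls
    .dropna(subset=["sales_id"])                         # drop rows with a null sales_id
    .dropDuplicates(subset=["sales_id"])                 # keep a single row per sales_id
    .withColumn("date", F.to_date("date", "yyyy-M-d"))   # cast the string column to DateType
)
silver_sales.show()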

Summary

Wow, we have gone through so many concepts in this chapter! First, we reviewed the basics of data modeling in SQL and NoSQL. We then looked at the fundamental Spark APIs for data cleansing and various data documentation tooling. Lastly, we looked at dimensional modeling for data warehousing. We have set up the building blocks for the analytics engineer role with these techniques and tools. Once you understand the needs of your users, you can model and cleanse your data accordingly.

In the next chapter, we will explore Spark further, as well as some cloud computing techniques.
