Modern Data Architectures with Python
Published in Sep 2023 | Publisher: Packt | ISBN-13: 9781801070492 | 1st Edition

Author: Brian Lipp

Brian Lipp is a technology polyglot, engineer, and solution architect with a wide skill set in many technology domains. His programming background has ranged from R, Python, and Scala to Go and Rust development. He has worked on big data systems, data lakes, data warehouses, and backend software engineering. Brian earned a Master of Science in CSIS from Pace University in 2009. He is currently a senior data engineer working with large tech firms to build data ecosystems.

Understanding Data Analytics

A new discipline called analytics engineering has emerged. An analytics engineer is primarily focused on taking data once it has been delivered and crafting it into consumable data products. An analytics engineer is expected to document, clean, and manipulate data into whatever form users need, whether those users are data scientists or business executives. The process of curating and shaping this data can abstractly be understood as data modeling.

In this chapter, we will go over several approaches to data modeling and documentation. We will, at the same time, start looking into PySpark APIs, as well as working with tools for code-based documentation.

By the end of the chapter, you will have built the fundamental skills to start any data analytics project.

In this chapter, we’re going to cover the following main topics:

  • Graphviz and diagrams
  • Critical PySpark APIs for data cleaning and preparation
  • Data modeling for SQL and NoSQL...

Technical requirements

I suggest the following:

  • PyCharm
  • A Databricks notebook
  • A Python environment with the following packages installed:
    • PySpark
    • diagrams
  • Graphviz

Setting up your environment

In this chapter, we will mostly work locally in an IDE; I'd like to walk you through setting up that environment.

Python

You have several options for installing Python. To check that your installation is working, type the following:

python --version

Ideally, you will see a version of Python 3.5 or above.

venv

We can install and set up our virtual environment using the following:

# virtualenv is optional here; the built-in venv module creates the environment
python3 -m pip install virtualenv
python3 -m venv chapter2
# activate the environment (Linux/macOS)
source chapter2/bin/activate
# on Windows, use: chapter2\Scripts\activate
pip install <package>
# leave the virtual environment when you are finished
deactivate

Graphviz

Graphviz is a very widely used graph-drawing tool. It's commonly used to create programmatically generated charts and diagrams.

Installations are available for every major OS here: https://graphviz.org/.

Workflow initialization

For this chapter, all Python code...

Cleaning and preparing your data

The history of data processing is long and has had several unique innovations. When you pull up data in Excel or any other data processing tool, you often see issues with the data that require fixes and changes. Data issues are extremely common, even with robust data practices. We will now go through several fundamental techniques using Apache Spark for cleansing and wrangling your data.

Duplicate values

Here, we set up our example DataFrame:

data_frame = spark.createDataFrame(data = [("Brian", "Engr", 1),
    ("Nechama", "Engr", 2),
    ("Naava", "Engr", 3),
    ("Miri", "Engr", 4),
    ("Brian", "Engr", 1),
    ("Miri", "Engr", 3),
  ], schema = ["name", "div", "ID"])

The first method...
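
One common approach, shown here as a minimal sketch, is PySpark's dropDuplicates, which removes rows that are exact duplicates or, with a subset argument, rows that repeat a chosen set of key columns:

# Drop rows that are duplicated across every column ("Brian", "Engr", 1 appears twice)
deduped_all = data_frame.dropDuplicates()

# Drop rows that repeat the chosen key columns, keeping a single row per key
deduped_by_key = data_frame.dropDuplicates(subset=["name", "div"])

deduped_all.show()
deduped_by_key.show()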

Data documentation

Data can get complex, and documentation is often non-existent. We will look into two areas that can be documented with code. Documenting with code rather than with GUI tools allows that documentation to be stored and reviewed effectively, like any other code.

diagrams

The diagrams library is very useful for creating solution diagrams. A solution diagram is often a broad picture of the architecture and key components. It can be organized in a way that explains key interactions.

Here, we are creating a small example document using the diagrams package:

from diagrams import Cluster, Diagram
from diagrams.aws.analytics import Quicksight, EMR
with Diagram("Data Platform", show=False):
    with Cluster("Dev"):
        dashboards = Quicksight("Tableau")
        spark_clusters = [EMR("Notebook_cluster"), EMR("Jobs_cluster")]
 ...
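
Building on the same node names, a minimal sketch of a finished diagram might connect the nodes with the >> operator, which draws directed edges (the specific connections here are illustrative assumptions, not the chapter's exact diagram):

from diagrams import Cluster, Diagram
from diagrams.aws.analytics import EMR, Quicksight

with Diagram("Data Platform", show=False):
    with Cluster("Dev"):
        dashboards = Quicksight("Tableau")
        spark_clusters = [EMR("Notebook_cluster"), EMR("Jobs_cluster")]
        # >> draws a directed edge from each EMR node to the dashboard node
        spark_clusters >> dashboards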

Data modeling patterns

Data modeling is simply defining the shapes and forms in which our data will be persisted and accessed. Modeling works for both structured and semi-structured data. It typically starts with understanding what the user wants and finding an appropriate shape or form to meet that need.

Relational

Relational modeling for transactional databases is a common set of techniques for making your data reliable, fast to write, and free of duplication.

First normal form

First normal form requires that each column holds only atomic values. In practice, that means flattening arrays, maps, and other semi-structured types into atomic columns.

An example of something not in first normal form is as follows:

+------------------+---+
|             fruit| id|
+------------------+---+
|    {apple -> red}|  1|
+------------------+---+

Here is the result of it being in first normal form:

+------+---+-----+
| fruit| id|color|
+------+---+-----+
| apple|  1|  red|
+------+---+-----+
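
As a minimal PySpark sketch (the DataFrame and column names are assumed from the example above, not taken from the chapter's code), a map column can be flattened to first normal form with explode:

from pyspark.sql import functions as F

# A small table that is not in first normal form: "fruit" is a map column
not_1nf = spark.createDataFrame(
    data=[({"apple": "red"}, 1)],
    schema=["fruit", "id"],
)

# explode turns each map entry into atomic key/value columns
first_nf = not_1nf.select(
    F.explode("fruit").alias("fruit", "color"),
    "id",
)
first_nf.show()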

Dimensional modeling

A dimensional model is traditionally seen in OLAP settings such as data warehouses and data lakes built with Apache Spark. The goal of dimensional modeling is to reduce duplication and create a central source of truth. One reason for reducing data duplication is to save on storage costs, although that is less of a factor with modern cloud storage. This data model consists of dimensions and facts. A dimension is a real-world entity we are trying to model, such as CUSTOMER, PRODUCT, DATE, or LOCATION, and a fact holds the numerical measures, such as REVENUE, PROFIT, SALES $ VALUE, and so on. The primary key of each dimension flows into the fact table as a foreign key, but more often than not, the relationship is not hardcoded into the database; rather, it is maintained by the process that loads and maintains the data, such as Extract, Transform, and Load (ETL). This data model is business-user-friendly and is used for analytical reporting and analysis...
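
As a minimal sketch of how a dimensional model is queried (the table names, columns, and values are illustrative assumptions, not the chapter's model), a fact table joins to its dimensions on the surrogate key and the numeric measures are aggregated:

# Illustrative dimension and fact tables
dim_customer = spark.createDataFrame(
    data=[(1, "Brian", "NY"), (2, "Miri", "LA")],
    schema=["customer_key", "customer_name", "region"],
)
fact_sales = spark.createDataFrame(
    data=[(1, "2023-01-01", 1400), (2, "2023-01-02", 1500)],
    schema=["customer_key", "sale_date", "revenue"],
)

# The fact table holds numeric measures; the dimension holds descriptive
# attributes. Reporting queries join them on the dimension's surrogate key.
report = (
    fact_sales.join(dim_customer, on="customer_key", how="left")
    .groupBy("region")
    .sum("revenue")
)
report.show()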

OBT

There has been some discussion in the community about storing information in one big table (OBT). Essentially, this is a denormalized table, with all the benefits and issues of denormalization mentioned before. Personally, I believe there can be huge benefits to the OBT approach, but as your needs grow, storing everything in one table can become very problematic. The more the columns in your table grow, the more inserts and updates you will have, and this can reach the point where the table is impossible to maintain. An alternative approach is to create data products once your OBT becomes too much to handle: each data product is a single table, built for a specific use case, that is pulled from a dimensional model. This gives you much of the best of both worlds; the remaining downsides are increased complexity and storage.
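
As a brief, illustrative sketch that reuses the hypothetical fact_sales and dim_customer tables from the previous example, an OBT-style data product is simply a denormalized join persisted for a specific use case:

# Denormalize the star schema into one wide table
one_big_table = fact_sales.join(dim_customer, on="customer_key", how="left")

# Persist it as a data product for a specific reporting use case
# (the table name is illustrative; on Databricks this defaults to a Delta table)
one_big_table.write.mode("overwrite").saveAsTable("sales_obt")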

Practical lab

We have a bronze table being loaded into our data lake using a third-party tool. There has been a request to clean up the data and resolve known issues. Your task is to write the needed Python code that will address each of the following issues.

The following are the issues present:

  • Wrong column name: The date column is spelled wrong
  • Nulls not correctly identified: The sales_id column has null values as NA strings
  • Data with missing values is unwanted: Any data with a null in sales_id should be dropped
  • Duplicate sales_id: Take the first value of any duplicate rows
  • Date column not DateType: The date column is not a DateType

Loading the problem data

The following code will create our bronze table:

bronze_sales = spark.createDataFrame(data = [
    ("1", "LA", "2000-01-01", 5, 1400),
    ("2", "LA", "1998-2-01", 4, 1500),
   ...
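
A possible solution sketch follows as one way to address the issues listed above. The misspelled date column's real name is not shown here, so "dte" is used as a placeholder, and the date format assumes values such as 2000-01-01 and 1998-2-01:

from pyspark.sql import functions as F

# "dte" is a placeholder for the misspelled date column name
silver_sales = (
    bronze_sales
    .withColumnRenamed("dte", "date")                    # fix the misspelled column name
    .replace("NA", None, subset=["sales_id"])            # treat "NA" strings as real nulls
    .dropna(subset=["sales_id"])                         # drop rows with a null sales_id
    .dropDuplicates(subset=["sales_id"])                 # keep a single row per sales_id
    .withColumn("date", F.to_date("date", "yyyy-M-d"))   # cast the string column to DateType
)
silver_sales.show()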

Summary

Wow, we have gone through so many concepts in this chapter! First, we reviewed the basics of data modeling in SQL and NoSQL. We then looked at the fundamental Spark APIs for data cleansing and various data documentation tooling. Lastly, we looked at dimensional modeling for data warehousing. We have set up the building blocks for the analytics engineer role with these techniques and tools. Once you understand the needs of your users, you can model and cleanse your data accordingly.

In the next chapter, we will explore Spark further, as well as some cloud computing techniques.
