Data and Information Visualization

Much of an enterprise's data exists, ultimately, to be visualized. Data visualization is used for everything from data analysis and BI dashboards to charts in web apps, so a data platform must be flexible enough to handle all of these needs effectively. We have chosen tooling that supports various languages, including SQL, R, Java, Scala, and Python, as well as point-and-click data visualization. We will focus on Python, SQL, and point-and-click, but feel free to explore other tools; R, for example, has a wide range of excellent visualization libraries.

In this chapter, we will cover the following main topics:

  • Python-based charts using plotly.express
  • Point-and-click-based charts in Databricks notebooks
  • Tips and tricks for Databricks notebooks
  • Creating Databricks SQL analytics dashboards
  • Connecting other BI tooling to Databricks

Technical requirements

The tooling used in this chapter is tied to the tech stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • AWS
  • Tableau Desktop
  • Python
  • SQL
  • dbt

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Python, AWS, and Databricks

As with previous chapters, this chapter assumes you have a working installation of Python 3.6 or later in your development environment. We will also assume you have set up an AWS account and configured Databricks with that account.

Databricks CLI

The Databricks CLI is used to create our Databricks infrastructure; before we can create anything, we must first make sure it’s set up correctly.

Installation and setup

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let...
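Once installed, the CLI needs to be pointed at your workspace and authenticated. A minimal setup, assuming you already have a workspace URL and a personal access token, looks like this:

databricks configure --token
# When prompted, enter the workspace URL and the token:
#   Databricks Host: https://<your-workspace>.cloud.databricks.com
#   Token: <your personal access token>

You can then verify the connection with a simple call such as databricks workspace ls /.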

Principles of data visualization

Data is present everywhere. It’s in every aspect of our lives, and one of the most profound methods of understanding that data is through our visual senses. We can better summarize, explain, and predict our data using charts and dashboards. However, before we go through some possible ways to create data visualizations, we will delve into some background knowledge on data visualization.

Understanding your user

When creating data visualizations, it's essential to understand how they will be used. Knowing who is using your data visualization is one of the most fundamental requirements. What is their purpose for it? Is it meant for strategic decision-making, or for day-to-day operations? Are you making an analytical data visualization? Understanding whether a dashboard drives critical decisions or serves as a site reliability engineer's operational display will allow you to focus on accomplishing...

Data visualization using notebooks

Here, we will discuss the main types of visualization charts and give examples for each, using plotly.express and the Databricks notebook GUI.

Line charts

Line charts are used for data points that vary over a continuous domain. A perfect example of continuous data is time-series data. One thing to consider with line charts is that they are best suited to showing small changes over an extended period.
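As a minimal sketch with plotly.express, using its bundled stock price sample data rather than the book's datasets:

import plotly.express as px

# Daily closing prices - a classic time series over a continuous axis
df = px.data.stocks()
fig = px.line(df, x="date", y="GOOG", title="GOOG closing price over time")
fig.show()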

Bar charts

Bar charts help compare significant changes and show differences between groups of data. A key detail to remember is that bar charts are not used for continuous data; they typically represent categorical data.
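A quick sketch, again using plotly.express's bundled sample data (the restaurant tips dataset), comparing categorical groups:

import plotly.express as px

# Total bill summed per day of the week - categorical groups, not continuous data
df = px.data.tips()
fig = px.bar(df, x="day", y="total_bill", title="Total bill by day")
fig.show()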

Histograms

Histograms can be thought of as bar charts for continuous data. They are often used to show the frequency of values in, for example, sales data.
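For example, bucketing a continuous value into frequency bins with plotly.express (sample data stands in for real sales figures here):

import plotly.express as px

# Frequency of bill amounts, bucketed into 20 bins over a continuous range
df = px.data.tips()
fig = px.histogram(df, x="total_bill", nbins=20, title="Distribution of bill amounts")
fig.show()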

Scatter plots

Scatter plots are essential charts for showing the relationship between two variables and revealing any correlation in the data.
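A short sketch using plotly.express's bundled iris measurements; the shape of the point cloud hints at correlation:

import plotly.express as px

# Each point is one observation; color separates the categorical species
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()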

Pie charts

Pie charts...

Tips and tricks with Databricks notebooks

Since we will be working in notebooks throughout this chapter, it makes sense to mention a few tricks you can use in Databricks notebooks.

Magic

Databricks notebooks support magic commands, which mix some type of non-Python component into a cell using the % syntax.

Markdown

Markdown is a convenient way to format text, much like HTML, but it's much simpler to write and learn. To invoke a Markdown cell, simply type %md at the start of your cell.
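For instance, a cell like the following renders as formatted text rather than executing as code:

%md
# Factory errors analysis
This notebook charts **error frequency** by factory.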

Other languages

When working with notebooks, it can be handy to run a command in a language other than the notebook's default. Language magic works by typing %[language] at the start of a notebook cell; for example, you can invoke %sql, %r, %scala, and %python. Keep in mind that variables cannot be passed directly between languages, and the context of a language magic is limited to the cell itself.
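For example, in a Python notebook, a cell beginning with %sql executes as SQL and renders its result as a table:

%sql
-- Runs in the SQL context of this cell only
SELECT current_date() AS today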

Terminal

To gain terminal access to the driver node, use...
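As one example of driver access, the %sh magic runs shell commands directly on the driver node from a notebook cell:

%sh
# Executes on the driver, not on the workers
ls /tmp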

Databricks SQL analytics

Databricks SQL analytics is a dedicated area of Databricks tailored for SQL-only analysis and BI access. What makes SQL analytics unique is its tight integration with all the other tooling: when your Databricks pipelines publish tables, SQL analytics can access all of those artifacts.
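For example, once a pipeline has published a table, an analyst can query it directly from the SQL editor. The table and column names below are illustrative, borrowed from this chapter's lab files:

SELECT error_code, COUNT(*) AS occurrences
FROM factory_errors
GROUP BY error_code
ORDER BY occurrences DESC;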

Accessing SQL analytics

At the time of writing, Databricks SQL analytics is offered only with premium-tier accounts. Once you have switched to the premium tier, you will see SQL in the drop-down menu.

Once you have enabled the premium tier or higher, access SQL analytics using the drop-down menu at the top left of the screen. You will be given three choices – Data Science & Engineering, Machine Learning, and SQL. For now, we will use SQL analytics, but switch back to Data Science & Engineering whenever you need access to your notebooks.

Figure 7.10: The SQL menu option

SQL Warehouses

Databricks has minimized...

Connecting BI tools

Although the dashboards for SQL analytics are excellent, you will often find that other BI tooling is needed for some workflows. Here, I will show you how to connect Tableau Desktop to Databricks SQL analytics. Tableau is one of the most common BI dashboarding tools found on the market. However, the setup process is typically very similar if your situation requires a different tool:

  1. The first step is to click the Partner Connect button on the toolbar. The Partner Connect section lists automated and simplified connectors for common BI tooling.
Figure 7.17: Partner Connect

  2. You will be presented with the Tableau Connect screen, as shown here. On this screen, you will be able to choose your SQL warehouse.
Figure 7.18: Tableau

  3. Now, you will be given the connection file for Tableau Desktop to use.
Figure 7.19: The connection file

  4. Once you run that...

Practical lab

In this practical lab, we will explore adding the tool dbt and creating several dashboards.

Loading problem data

We will use the following three datasets for our labs – error_codes.json, factory.json, and factory_errors.json. For this lab, we will use the web GUI to load the data; in a production environment, we would have a pipeline to handle this process:

  1. First, click Create on the toolbar to load data using the web GUI.
Figure 7.22: Create

  2. Now, we will click the Create table button, and then we must select the cluster to use; any available cluster is acceptable.
Figure 7.23: Selecting a cluster

  3. We will use the GUI to load the table and not use a notebook this time. You will be presented with the following menu. Be sure to name your tables consistently, use the default database/schema, and select the JSON file type, Infer schema, and Multi-line (a notebook equivalent of this load is sketched after this list).
Figure 7.24: Create Table ...
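For reference, the same multi-line JSON load can be expressed in a notebook. The upload path below is a hypothetical default location, not the lab's exact one:

# Hypothetical upload path - adjust to wherever your file landed
df = (spark.read
      .option("multiLine", "true")   # matches the Multi-line option in the GUI
      .json("/FileStore/tables/factory_errors.json"))
df.write.saveAsTable("factory_errors")   # schema is inferred, as in the GUI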

Summary

We covered a lot of material. To summarize, we talked about the importance of data visualization and how to create visualizations using a variety of tooling. We went over tips and tricks for Databricks notebooks. We delved into SQL analytics, connected BI tools such as Tableau, and added dbt to our stack. With the knowledge you now possess, you should be able to design and implement complex data visualization systems.

In the upcoming chapter, we will see how to organize our data projects and build them into continuous integration projects using Jenkins and GitHub. As we write our code, we will look at techniques for automating checks of our code. We will then discuss how to deploy our code into production.
