Data and Information Visualization

Much of an enterprise's data exists, ultimately, to be visualized. Data visualization is used for everything from data analysis and BI dashboards to charts in web apps, so a data platform must be flexible enough to handle all of these needs effectively. We have chosen tooling that supports various languages, including SQL, R, Java, Scala, and Python, as well as point-and-click data visualization. We will focus on Python, SQL, and point-and-click, but feel free to explore other tools; R, for example, has a wide range of excellent visualization libraries.

In this chapter, we will cover the following main topics:

  • Python-based charts using plotly.express
  • Point-and-click-based charts in Databricks notebooks
  • Tips and tricks for Databricks notebooks
  • Creating Databricks SQL analytics dashboards
  • Connecting other BI tooling to Databricks

Technical requirements

The tooling used in this chapter is tied to the tech stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • AWS
  • Tableau Desktop
  • Python
  • SQL
  • dbt

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Python, AWS, and Databricks

As with previous chapters, this chapter assumes you have a working installation of Python 3.6 or later in your development environment. We will also assume you have set up an AWS account and configured Databricks with that account.

Databricks CLI

The Databricks CLI is used to create our Databricks infrastructure; before we can create anything, we must first make sure it’s set up correctly.

Installation and setup

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let...
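Once installed, the CLI needs to be pointed at your workspace and authenticated. A minimal setup, assuming you already have a workspace URL and a personal access token, looks like this:

databricks configure --token
# When prompted, enter the workspace URL and the token:
#   Databricks Host: https://<your-workspace>.cloud.databricks.com
#   Token: <your personal access token>

You can then verify the connection with a simple call such as databricks workspace ls /.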

Principles of data visualization

Data is present everywhere. It’s in every aspect of our lives, and one of the most profound methods of understanding that data is through our visual senses. We can better summarize, explain, and predict our data using charts and dashboards. However, before we go through some possible ways to create data visualizations, we will delve into some background knowledge on data visualization.

Understanding your user

When creating data visualizations, it's essential to understand how they will be used. Knowing who is using your data visualization is one of the most fundamental requirements. What is their purpose for it? Is it meant for strategic decision-making, or for day-to-day operations? Are you making an analytical data visualization? Understanding whether a dashboard drives critical decisions or serves as a site reliability engineer's operational display will allow you to focus on accomplishing...

Data visualization using notebooks

Here, we will discuss the main types of visualization charts and give examples for each, using plotly.express and the Databricks notebook GUI.

Line charts

Line charts are used for data points that vary over a continuous domain. A perfect example of continuous data is time-series data. One thing to consider with line charts is that they are best suited to showing small changes over an extended period.
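As a minimal sketch with plotly.express, using its bundled stock price sample data rather than the book's datasets:

import plotly.express as px

# Daily closing prices - a classic time series over a continuous axis
df = px.data.stocks()
fig = px.line(df, x="date", y="GOOG", title="GOOG closing price over time")
fig.show()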

Bar charts

Bar charts help compare significant changes and show differences between groups of data. A key detail to remember is that bar charts are not used for continuous data; they typically represent categorical data.
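A quick sketch, again using plotly.express's bundled sample data (the restaurant tips dataset), comparing categorical groups:

import plotly.express as px

# Total bill summed per day of the week - categorical groups, not continuous data
df = px.data.tips()
fig = px.bar(df, x="day", y="total_bill", title="Total bill by day")
fig.show()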

Histograms

Histograms can be thought of as bar charts for continuous data. They are often used to show the frequency of values in, for example, sales data.
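For example, bucketing a continuous value into frequency bins with plotly.express (sample data stands in for real sales figures here):

import plotly.express as px

# Frequency of bill amounts, bucketed into 20 bins over a continuous range
df = px.data.tips()
fig = px.histogram(df, x="total_bill", nbins=20, title="Distribution of bill amounts")
fig.show()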

Scatter plots

Scatter plots are essential charts for showing the relationship between two variables and revealing any correlation in the data.
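A short sketch using plotly.express's bundled iris measurements; the shape of the point cloud hints at correlation:

import plotly.express as px

# Each point is one observation; color separates the categorical species
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()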

Pie charts

Pie charts...

Tips and tricks with Databricks notebooks

Since we will be working in notebooks throughout this chapter, it makes sense to mention a few tricks you can use in Databricks notebooks.

Magic

Databricks notebooks support magic commands, which mix some type of non-Python component into a cell using the % syntax.

Markdown

Markdown is a convenient way to format text, much like HTML, but it's much simpler to write and learn. To invoke a Markdown cell, simply type %md at the start of your cell.
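For instance, a cell like the following renders as formatted text rather than executing as code:

%md
# Factory errors analysis
This notebook charts **error frequency** by factory.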

Other languages

When working with notebooks, it can be handy to run a command in a language other than the notebook's default. Language magic works by typing %[language] at the start of a notebook cell; for example, you can invoke %sql, %r, %scala, and %python. Keep in mind that variables cannot be passed directly between languages, and the context of a language magic is limited to the cell itself.
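For example, in a Python notebook, a cell beginning with %sql executes as SQL and renders its result as a table:

%sql
-- Runs in the SQL context of this cell only
SELECT current_date() AS today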

Terminal

To gain terminal access to the driver node, use...
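As one example of driver access, the %sh magic runs shell commands directly on the driver node from a notebook cell:

%sh
# Executes on the driver, not on the workers
ls /tmp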

Databricks SQL analytics

Databricks SQL analytics is a dedicated area of Databricks tailored for SQL-only analysis and BI access. What makes SQL analytics unique is its tight integration with all the other tooling: when your Databricks pipelines publish tables, SQL analytics can access all of those artifacts.
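For example, once a pipeline has published a table, an analyst can query it directly from the SQL editor. The table and column names below are illustrative, borrowed from this chapter's lab files:

SELECT error_code, COUNT(*) AS occurrences
FROM factory_errors
GROUP BY error_code
ORDER BY occurrences DESC;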

Accessing SQL analytics

At the time of writing, Databricks SQL analytics is offered only with premium-tier accounts. Once you have switched to the premium tier, you will see SQL in the drop-down menu.

Once you have enabled the premium tier or higher, access SQL analytics using the drop-down menu at the top left of the screen. You will be given three choices – Data Science & Engineering, Machine Learning, and SQL. For now, we will use SQL analytics, but switch back to Data Science & Engineering whenever you need access to your notebooks.

Figure 7.10: The SQL menu option

SQL Warehouses

Databricks has minimized...

Connecting BI tools

Although the dashboards for SQL analytics are excellent, you will often find that other BI tooling is needed for some workflows. Here, I will show you how to connect Tableau Desktop to Databricks SQL analytics. Tableau is one of the most common BI dashboarding tools found on the market. However, the setup process is typically very similar if your situation requires a different tool:

  1. The first step is to click the Partner Connect button on the toolbar. The Partner Connect section lists automated and simplified connectors for common BI tooling.
Figure 7.17: Partner Connect

  2. You will be presented with the Tableau Connect screen, as shown here. On this screen, you will be able to choose your SQL warehouse.
Figure 7.18: Tableau

  3. Now, you will be given the connection file for Tableau Desktop to use.
Figure 7.19: The connection file

  4. Once you run that...

Practical lab

In this practical lab, we will explore adding the tool dbt and creating several dashboards.

Loading problem data

We will use the following three datasets for our labs – error_codes.json, factory.json, and factory_errors.json. For this lab, we will use the web GUI to load the data; in a production environment, we would have a pipeline to handle this process:

  1. First, click Create on the toolbar to load data using the web GUI.
Figure 7.22: Create

  2. Now, we will click the Create table button, and then we must select the cluster to use; any available cluster is acceptable.
Figure 7.23: Selecting a cluster

  3. We will use the GUI to load the table and not use a notebook this time. You will be presented with the following menu. Be sure to name your tables consistently, use the default database/schema, and select the JSON file type, Infer schema, and Multi-line (a notebook equivalent of this load is sketched after this list).
Figure 7.24: Create Table ...
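For reference, the same multi-line JSON load can be expressed in a notebook. The upload path below is a hypothetical default location, not the lab's exact one:

# Hypothetical upload path - adjust to wherever your file landed
df = (spark.read
      .option("multiLine", "true")   # matches the Multi-line option in the GUI
      .json("/FileStore/tables/factory_errors.json"))
df.write.saveAsTable("factory_errors")   # schema is inferred, as in the GUI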

Summary

We covered a lot of material. To summarize, we talked about the importance of data visualization and how to create visualizations using a variety of tooling. We went over tips and tricks for Databricks notebooks. We delved into SQL analytics, connected BI tools such as Tableau, and added dbt to our stack. With the knowledge you now possess, you should be able to design and implement complex data visualization systems.

In the upcoming chapter, we will see how to organize our data projects and build them into continuous integration projects using Jenkins and GitHub. As we write our code, we will look at techniques for automating checks of our code. We will then discuss how to deploy our code into production.
