Business Intelligence with Databricks SQL

Introduction to Databricks

Databricks is one of the most recognizable names in the big data industry. They are the providers of the lakehouse platform for data analytics and artificial intelligence (AI). This book is about Databricks SQL, a product within the Databricks Lakehouse platform that powers data analytics and business intelligence.

Databricks SQL is a rapidly evolving product. It is not a traditional data warehouse, yet its users are the traditional data warehouse and business intelligence users. It claims to provide all the functionality of data warehouses on what is essentially a data lake. This concept can be a bit jarring. It can create resistance to adoption, as you might be wondering whether your skills are transferable or whether your work might be disrupted by a new learning curve.

Hence, I am writing this book.

The primary intent of this book is to help you learn the fundamental concepts of Databricks SQL in a fun, follow-along interactive manner. My aim is that by the time you complete this book, you will be confident in your adoption of Databricks SQL as the enabler of your business intelligence.

This book does not intend to be a definitive guide or a complete reference, nor does it intend to be a replacement for the official documentation. It is too early for either of those. This book is your initiation into business intelligence on the data lakehouse, the Databricks SQL way.

Let’s begin!

In this chapter, we’ll cover the following topics:

An overview of Databricks, the company
An overview of the Lakehouse architecture
An overview of the Databricks Lakehouse platform

An overview of Databricks, the company

Databricks was founded in 2013 by seven researchers at the University of California, Berkeley.

This was the time when the world was learning how the Meta, Amazon, Netflix, Google, and Apple (MANGA) companies had built their success by scaling up their use of AI techniques in all aspects of their operations. Of course, they could do this because they invested heavily in talent and infrastructure to build their data and AI systems. Databricks was founded with the mission to enable everyone else to do the same – use data and AI in service of their business, irrespective of their size, scale, or technological prowess.

The mission was to democratize AI. What started as a simple platform, leveraging the open source technologies that the co-founders of Databricks had created, has now evolved into the lakehouse platform, which unifies data, analytics, and AI in one place.

As an interesting side note, and my opinion: To this date, I meet people and organizations that equate Databricks with Apache Spark. This is not correct. The platform indeed debuted with a cloud service for running Apache Spark. However, it is important to understand that Apache Spark was the enabling technology for the big data processing platform. It was not the product. The product is a simple platform that enables the democratization of data and AI.

Databricks is a strong proponent of the open source community. A lot of popular open source projects trace their roots to Databricks, including MLflow, Koalas, and Delta Lake. The profile of these innovations demonstrates the company's commitment to its mission of democratizing data and AI. MLflow is an open source technology that enables machine learning (ML) operations, or MLOps. Delta Lake is the key innovation that brings reliability, governance, and simplification to data engineering and business intelligence operations on the data lake. It is the key to building the lakehouse on top of cloud storage systems such as Amazon Web Services' Simple Storage Service (S3), Microsoft Azure's Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS), as well as on-premises HDFS systems.

Within the Databricks platform, these open source technologies are firmed up for enterprise readiness. They are blended with platform innovations for various data personas such as data engineers, data scientists, and data analysts. This means that MLflow within the Databricks Lakehouse platform powers enterprise-grade MLOps. Delta Lake within the Databricks Lakehouse platform powers enterprise-grade data engineering and data governance. With the Databricks SQL product, the Databricks Lakehouse platform can power all the business intelligence needs for the enterprise as well!

Technologies and Trademarks

Throughout this book, we will refer to trademarked technologies and products. Some notable examples are Apache Spark™, Hive™, Delta Lake™, Power BI™, and Tableau™, along with others that are mentioned in passing.

All such trademarks are implied whenever we mention them in the book. For the sake of brevity and readability, I will omit the use of the ™ symbol in the rest of the book.

An overview of the Lakehouse architecture

If, at this point, you are a bit confused by so many terms such as Databricks, lakehouse, Databricks SQL, and more – worry not. We are just at the beginning of our learning journey. We will unpack all of these throughout this book.

First, what is Databricks?

Databricks is a platform that enables enterprises to quickly build their data lakehouse infrastructure and enable all data personas – data engineers, data scientists, and business intelligence personnel – in their organization to extract and deliver insights from the data. The platform provides a curated experience for each data persona, enabling them to execute their daily workflows. The foundational technologies that enable these experiences are open source – Apache Spark, Delta Lake, MLflow, and more.

So, what is the Lakehouse architecture and why do we need it?

The Lakehouse architecture was formally presented in a paper at the Conference on Innovative Data Systems Research (CIDR) in January 2021. You can download the paper from https://databricks.com/research/lakehouse-a-new-generation-of-open-platforms-that-unify-data-warehousing-and-advanced-analytics. It is an easily digestible paper that I encourage you to read for the full details. That said, I will now summarize its salient points.

Attribution, Where it is Due

In my summary of the said research paper, I am recreating the images that were originally provided. Therefore, they are the intellectual property of the authors of the research paper.

According to the paper, most of the present-day data analytics infrastructures look like a two-tier system, as shown in the following diagram:

Figure 1.1 – Two-tier data analytics infrastructures

In this two-tier system, first, data from source systems is brought onto a data lake. Examples of source systems could be your web or mobile application, transactional databases, ERP systems, social media data, and more. The data lake is typically an on-premises HDFS system or cloud object storage. Data lakes allow you to store data in big data-optimized file formats such as Apache Parquet, ORC, and Avro. The use of these open file formats enables flexibility in writing to the data lake (due to schema-on-read semantics). This flexibility enables faster ingestion of data, which, in turn, enables faster access to data for end users. It also enables more advanced analytics use cases in ML and AI.
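To make the schema-on-read idea concrete, here is a minimal Python sketch. It is an illustration only: it uses JSON Lines in an in-memory buffer as a stand-in for formats such as Parquet, ORC, or Avro, and the names raw_events and read_with_schema are hypothetical, not part of any Databricks API.

```python
import io
import json

# Hypothetical raw events: the second record carries a field the first lacks.
raw_events = [
    {"user_id": 1, "action": "login"},
    {"user_id": 2, "action": "purchase", "amount": 19.99},
]

# Write side: append records as-is (JSON Lines); no schema is enforced here.
lake_file = io.StringIO()
for event in raw_events:
    lake_file.write(json.dumps(event) + "\n")


def read_with_schema(file_obj, columns):
    """Apply a schema at read time; fields missing from a record become None."""
    file_obj.seek(0)
    return [
        {col: record.get(col) for col in columns}
        for record in map(json.loads, file_obj)
    ]


rows = read_with_schema(lake_file, ["user_id", "action", "amount"])
for row in rows:
    print(row)
```

The writer never had to agree on a schema up front, which is what makes ingestion fast. The cost is that every reader must decide how to interpret the data, which is one reason a warehouse is usually added on top.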

Of course, this architecture still needs to support the traditional BI workloads and decision support systems. Hence, a second process, typically in the form of Extract, Transform, and Load (ETL), is built to copy data from the data lake to a dedicated data warehouse.

Close inspection of the two-tier architecture reveals several systemic problems:

Duplication of data: This architecture requires the same data to be present in two different systems. This results in an increased cost of storage. Constant reconciliation between these two systems is of utmost importance. This results in increased ETL operations and their associated costs.
Security and governance: Data lakes and data warehouses have very different approaches to the security of data. This results in different security mechanisms for the same data that must always be in synchronization to avoid data security violations.
Latency in data availability: In the two-tier architecture, the data is only moved to the warehouse by a secondary process, which introduces latency. This means analysts do not get access to fresh data. This also makes it unsuitable for tactical decision support such as operations.
Total cost of ownership: Enterprises end up paying double for the same data. There are two storage systems, two ETL processes, two engineering debts, and more.
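The duplication and latency problems listed above can be sketched in a few lines of Python. This is a toy model under stated assumptions: a plain list stands in for the data lake, an in-memory SQLite database stands in for the data warehouse, and run_etl is a hypothetical batch job.

```python
import sqlite3

# Tier 1: the "data lake" is modelled as a plain list of raw records.
data_lake = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 40.0},
]

# Tier 2: the "data warehouse" is an in-memory SQLite database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)"
)


def run_etl():
    """Copy every lake record into the warehouse (a second copy of the data)."""
    warehouse.executemany(
        "INSERT OR REPLACE INTO orders VALUES (:order_id, :amount)", data_lake
    )
    warehouse.commit()


def warehouse_row_count():
    return warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


run_etl()

# A new record lands in the lake after the ETL job has finished.
data_lake.append({"order_id": 3, "amount": 15.0})

stale_count = warehouse_row_count()  # the warehouse has not seen order 3 yet
run_etl()
fresh_count = warehouse_row_count()  # visible only after the next ETL run

print(len(data_lake), stale_count, fresh_count)
```

Between the two ETL runs, an analyst querying the warehouse sees two orders while the lake already holds three. After the second run, every order is stored twice, once in each tier.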

As you can see, this is unintuitive and unsustainable.

Hence, the paper presents the Lakehouse architecture as the way forward.

Simply put, the data lakehouse architecture is a data management system that implements all the features of data warehouses on data lakes. This makes the data lakehouse a single unified platform for business intelligence and advanced analytics.

This means that the lakehouse platform will implement data management features such as security controls, ACID transaction guarantees, data versioning, and auditing. It will implement query performance features such as indexing, caching, and query optimizations. These features are table stakes for data warehouses. The Lakehouse architecture brings these features to you in the flexible, open format data storage of data lakes. A Lakehouse is a platform that provides data warehousing capabilities and advanced analytics capabilities on the same platform, with cloud data lake economics.
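One way to picture how atomic commits and data versioning could be layered on top of plain files is a small commit log kept next to the data. The following Python sketch is a toy illustration of that general idea only. It is not the Delta Lake protocol, and the ToyTable class, its file layout, and its method names are all hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path


class ToyTable:
    """A table stored as immutable data files plus an ordered commit log."""

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _commits(self):
        # Zero-padded names make lexical order equal to version order.
        return sorted(self.log_dir.glob("*.json"))

    def latest_version(self):
        return len(self._commits()) - 1  # -1 means no commits yet

    def append(self, rows):
        version = self.latest_version() + 1
        # Step 1: write the data file. Readers cannot see it yet.
        data_file = self.root / f"part-{uuid.uuid4().hex}.json"
        data_file.write_text(json.dumps(rows))
        # Step 2: publish it by creating the commit entry. Mode "x" fails
        # if the entry already exists, so two writers cannot both claim
        # the same version number.
        commit = self.log_dir / f"{version:05d}.json"
        with open(commit, "x") as f:
            json.dump({"add": data_file.name}, f)
        return version

    def read(self, version=None):
        # Reading an older version simply replays fewer commits.
        commits = self._commits()
        if version is not None:
            commits = commits[: version + 1]
        rows = []
        for commit in commits:
            entry = json.loads(commit.read_text())
            rows.extend(json.loads((self.root / entry["add"]).read_text()))
        return rows


table = ToyTable(tempfile.mkdtemp())
table.append([{"id": 1}])
table.append([{"id": 2}, {"id": 3}])
print(table.read())           # all rows as of the latest version
print(table.read(version=0))  # rows as of the first commit
```

Because data files are never modified and become visible only through a commit entry, a reader sees either all of a write or none of it, and older versions stay readable. A production system also needs deletes, schema enforcement, and concurrency control, which this sketch leaves out.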

What is the Formal Definition of the Lakehouse?

Section 3 in the CIDR paper officially defines the Lakehouse. Check it out.

The following is a visual depiction of the Lakehouse:

Figure 1.2 – Lakehouse architecture

The idea of the Lakehouse is deceptively simple – as all good things in life are! The Lakehouse architecture immediately solves the problems we highlighted about present-day two-tier architectures:

A single storage layer means no duplication of data and no extra effort to reconcile data. Reduced ETL requirements and ACID guarantees equate to the stability and reliability of the system.
A single storage layer means a single model of security and governance for all data assets. This reduces the risk of security breaches.
A single storage layer means the availability of the freshest data possible for the consumers of the data.
Cheap cloud storage with elastic, on-demand cloud compute reduces the total cost of ownership.
Open source technologies in the storage layer reduce the chances of vendor lock-in and make it easy to integrate with other tools.

Of course, any implementation of the Lakehouse will have to ensure the following:

Reliable data management: The Lakehouse proposes to eliminate (or reduce) data warehouses. Hence, the Lakehouse implementation must efficiently implement data management and governance – features that are table stakes in data warehouses.
SQL performance: The Lakehouse will have to provide state-of-the-art SQL performance on top of the open-access filesystems and file formats typical in data lakes.

This is where the Databricks Lakehouse platform, and within it, the Databricks SQL product, comes in.

AL Oct 25, 2022

Reading this book is like learning from a private tutor. The book covers the Databricks Lakehouse architecture, data security and governance. The author uses simple, easy to understand diagrams and examples to illustrate how various components such as Delta Lake, Unity Catalog, SQL Warehouse, Photon, etc. fit within the Lakehouse architecture. Whether you are a new or an experienced Databricks user, you will find something useful and refreshing. The lab exercises on Databricks SQL are fun to follow as they use real world data sets, and the benchmarking using the TPC-DS dataset is a bonus!

amazonVerifiedBuyer Dec 10, 2022

It is well written, with good explanations, and concrete examples, and it covers a large variety of topics. The author knows what he is talking about.

The writing and examples are all very clear, and there are plenty of figures and code snippets to help you follow along, as well as some diagrams to explain certain concepts. What is particularly good about this book, it's that it is written from experience on the fields. It's not academic book describing how things should work. It is actually coming from experience with real projects.

I had the opportunity to review the book Business Intelligence with Databricks SQL by Vihag Gupta, a master practitioner of the Databricks platform.

This book is great for both newcomers and experienced users of the Databricks platform. It starts with a nice overview that can get anyone up to speed in no time. However, as you progress, Vihag will start to dig deeper into traditional SQL and Databricks SQL (including Photon).

If you have been curious about Photon, this book will allow you to better understand the internals and know how to write efficient, optimal queries.

Kieran O'Driscoll Nov 13, 2022

This book walks through each component of the Databricks Lakehouse platform and covers the basics along with more technical aspects. The section on the Photon engine was extremely helpful in understanding how Databricks SQL is so powerful. A great read for anyone getting started on the Lakehouse journey!

vfortier Oct 27, 2022

Great book about Databricks SQL, how to set it up, how to use it and how to make the best of it. Highly recommended

Hales Sep 20, 2022

This book provides both high-level context and detailed examples and explanations, which are useful for novice to experienced BI practitioners and data engineers supporting BI workloads. As the book does not cover set-up, it will be most beneficial for those with access to an existing environment and some experience with SQL. The book is written in simple, easy-to-understand explanations and loaded with screenshots, code snippets, and diagrams to support the concepts covered.

The book's first section is helpful if you are new to cloud architecture and jump into the product with a tour. The data and security chapters cover storing and accessing data and the latest Unity Catalog components. While the book doesn’t go deep on user provisioning, it has enough detail to understand privileges to secure your tables, views, and dashboards. The SQL warehouse chapter dives into the architecture in the cloud and includes case studies around how to scale your data and manage performance. The BI section has code and screenshots using built-in datasets that you can follow along to build a dashboard or connect to an external tool. The second section goes deep into the nuts and bolts of Databricks SQL, optimizations and performance features, and the underlying engines. It ends with an example implementation of a cloud data warehouse using the lakehouse architecture and is geared towards people familiar with data modeling and warehouse design.

The third section has SQL commands with many code examples, including CDC commands, time travel, and dealing with nested data. The last section has exercises around a TPC-DS dataset, using IDEs, more case studies, and FAQs. Note that you will need the jar file from GitHub. The last two sections are dense with information but are still easy to follow.

One thing worth mentioning is that the features outlined in this book are not available in the free Databricks Community edition, so you will need a standard or enterprise account to follow along with the examples and create the dashboard, alerts, etc.
