Business Intelligence with Databricks SQL

By Vihag Gupta
About this book
In this new era of data platform system design, data lakes and data warehouses are giving way to the lakehouse – a new type of data platform system that aims to unify all data analytics into a single platform. Databricks, with its Databricks SQL product suite, is the hottest lakehouse platform out there, harnessing the power of Apache Spark™, Delta Lake, and other innovations to enable data warehousing capabilities on the lakehouse with data lake economics. This book is a comprehensive hands-on guide that helps you explore all the advanced features, use cases, and technology components of Databricks SQL. You’ll start with the lakehouse architecture fundamentals and understand how Databricks SQL fits into it. The book then shows you how to use the platform, from exploring data, executing queries, building reports, and using dashboards through to learning the administrative aspects of the lakehouse – data security, governance, and management of the computational power of the lakehouse. You’ll also delve into the core technology enablers of Databricks SQL – Delta Lake and Photon. Finally, you’ll get hands-on with advanced SQL commands for ingesting data and maintaining the lakehouse. By the end of this book, you’ll have mastered Databricks SQL and be able to deploy and deliver fast, scalable business intelligence on the lakehouse.
Publication date: September 2022
Publisher: Packt
Pages: 348
ISBN: 9781803235332

 

Introduction to Databricks

Databricks is one of the most recognizable names in the big data industry. The company provides the lakehouse platform for data analytics and artificial intelligence (AI). This book is about Databricks SQL, a product within the Databricks Lakehouse platform that powers data analytics and business intelligence.

Databricks SQL is a rapidly evolving product. It is not a traditional data warehouse, yet its users are the traditional data warehouse and business intelligence users. It claims to provide all the functionality of data warehouses on what is essentially a data lake. This concept can be a bit jarring. It can create resistance to adoption, as you might be wondering whether your skills are transferable or whether your work might be disrupted by a new learning curve.

Hence, I am writing this book.

The primary intent of this book is to help you learn the fundamental concepts of Databricks SQL in a fun, follow-along interactive manner. My aim is that by the time you complete this book, you will be confident in your adoption of Databricks SQL as the enabler of your business intelligence.

This book does not intend to be a definitive guide or a complete reference, nor does it intend to be a replacement for the official documentation. It is too early for either of those. This book is your initiation into business intelligence on the data lakehouse, the Databricks SQL way.

Let’s begin!

In this chapter, we’ll cover the following topics:

  • An overview of Databricks, the company
  • An overview of the Lakehouse architecture
  • An overview of the Databricks Lakehouse platform
 

Technical requirements

There are no technical requirements for this chapter. However, familiarity with the concept of databases, data warehouses, and data lakes will help.

 

An overview of Databricks, the company

Databricks was founded in 2013 by seven researchers at the University of California, Berkeley.

This was the time when the world was learning how the Meta, Amazon, Netflix, Google, and Apple (MANGA) companies had built their success by scaling up their use of AI techniques in all aspects of their operations. Of course, they could do this because they invested heavily in talent and infrastructure to build their data and AI systems. Databricks was founded with the mission to enable everyone else to do the same – use data and AI in service of their business, irrespective of their size, scale, or technological prowess.

The mission was to democratize AI. What started as a simple platform, leveraging the open source technologies that the co-founders of Databricks had created, has now evolved into the lakehouse platform, which unifies data, analytics, and AI in one place.

As an interesting side note, and my opinion: To this day, I meet people and organizations that equate Databricks with Apache Spark. This is not correct. The platform indeed debuted with a cloud service for running Apache Spark. However, it is important to understand that Apache Spark was the enabling technology for the big data processing platform. It was not the product. The product is a simple platform that enables the democratization of data and AI.

Databricks is a strong proponent of the open source community. A lot of popular open source projects trace their roots to Databricks, including MLflow, Koalas, and Delta Lake. The profile of these innovations demonstrates the commitment to Databricks’s mission statement of democratizing data and AI. MLflow is an open source technology that enables machine learning (ML) operations or MLOps. Delta Lake is the key innovation that brings reliability, governance, and simplification to data engineering and business intelligence operations on the data lake. It is the key to building the lakehouse on top of cloud storage systems such as Amazon Web Services’ Simple Storage Service (S3), Microsoft Azure’s Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS), as well as on-premises HDFS systems.

Within the Databricks platform, these open source technologies are firmed up for enterprise readiness. They are blended with platform innovations for various data personas such as data engineers, data scientists, and data analysts. This means that MLflow within the Databricks Lakehouse platform powers enterprise-grade MLOps. Delta Lake within the Databricks Lakehouse platform powers enterprise-grade data engineering and data governance. With the Databricks SQL product, the Databricks Lakehouse platform can power all the business intelligence needs for the enterprise as well!

Technologies and Trademarks

Throughout this book, we will refer to trademarked technologies and products. Some notable examples are Apache Spark™, Hive™, Delta Lake™, Power BI™, Tableau™, and others that are mentioned in passing.

All such trademarks are implied whenever we mention them in the book. For the sake of brevity and readability, I will omit the use of the ™ symbol in the rest of the book.

 

An overview of the Lakehouse architecture

If, at this point, you are a bit confused by so many terms such as Databricks, lakehouse, Databricks SQL, and more – worry not. We are just at the beginning of our learning journey. We will unpack all of these throughout this book.

First, what is Databricks?

Databricks is a platform that enables enterprises to quickly build their Data Lakehouse infrastructure and enable all data personas – data engineers, data scientists, and business intelligence personnel – in their organization to extract and deliver insights from the data. The platform provides a curated experience for each data persona, enabling them to execute their daily workflows. The foundational technologies that enable these experiences are open source – Apache Spark, Delta Lake, MLflow, and more.

So, what is the Lakehouse architecture and why do we need it?

The Lakehouse architecture was formally presented in a paper at the Conference on Innovative Data Systems Research (CIDR) in January 2021. You can download the paper from https://databricks.com/research/lakehouse-a-new-generation-of-open-platforms-that-unify-data-warehousing-and-advanced-analytics. It is an easily digestible paper that I encourage you to read for the full details. That said, I will now summarize its salient points.

Attribution, Where it is Due

In my summary of this research paper, I recreate the images that were originally provided in it. They remain the intellectual property of the paper's authors.

According to the paper, most of the present-day data analytics infrastructures look like a two-tier system, as shown in the following diagram:

Figure 1.1 – Two-tier data analytics infrastructures

In this two-tier system, first, data from source systems is brought onto a data lake. Examples of source systems could be your web or mobile application, transactional databases, ERP systems, social media data, and more. The data lake is typically an on-premises HDFS system or cloud object storage. Data lakes allow you to store data in big data-optimized file formats such as Apache Parquet, ORC, and Avro. The use of these open file formats enables flexibility in writing to the data lake (due to schema-on-read semantics). This flexibility enables faster ingestion of data, which, in turn, enables faster access to data for end users. It also enables more advanced analytics use cases in ML and AI.
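The schema-on-read idea mentioned above can be sketched in a few lines of Python. This is only an illustration: the records, the column names, and the `read_with_schema` helper are all hypothetical, and JSON lines stand in for the file formats named in the text so that the example needs nothing beyond the standard library.

```python
import json

# Hypothetical raw events, as they might land in a data lake over time.
# Nothing validates them on write, so their shapes are allowed to differ.
raw_lines = [
    json.dumps({"user_id": 1, "amount": "19.99"}),
    json.dumps({"user_id": 2, "amount": "5.00", "channel": "mobile"}),
    json.dumps({"user_id": 3}),
]

# The schema is chosen by the reader: column name -> (type converter, default).
read_schema = {
    "user_id": (int, None),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, (convert, default) in schema.items():
            # Missing columns fall back to the reader's default value.
            row[column] = convert(record[column]) if column in record else default
        rows.append(row)
    return rows


rows = read_with_schema(raw_lines, read_schema)
```

Writers stay fast and flexible because they never have to agree on a structure up front. The cost is that every reader must decide how to handle missing or oddly typed fields.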

Of course, this architecture still needs to support the traditional BI workloads and decision support systems. Hence, a second process, typically in the form of Extract, Transform, and Load (ETL), is built to copy data from the data lake to a dedicated data warehouse.

Close inspection of the two-tier architecture reveals several systemic problems:

  • Duplication of data: This architecture requires the same data to be present in two different systems. This results in an increased cost of storage. Constant reconciliation between these two systems is of utmost importance. This results in increased ETL operations and their associated costs.
  • Security and governance: Data lakes and data warehouses have very different approaches to the security of data. This results in different security mechanisms for the same data that must always be in synchronization to avoid data security violations.
  • Latency in data availability: In the two-tier architecture, the data is only moved to the warehouse by a secondary process, which introduces latency. This means analysts do not get access to fresh data. This also makes it unsuitable for tactical decision support such as operations.
  • Total cost of ownership: Enterprises end up paying double for the same data. There are two storage systems, two ETL processes, two engineering debts, and more.
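The duplication and latency problems listed above can be made concrete with a toy model. The sketch below is purely illustrative: two in-memory Python lists stand in for the data lake and the data warehouse, and the `ingest` and `run_etl` functions are hypothetical names, not part of any real product.

```python
# Toy model of a two-tier setup: the lake receives data first,
# and the warehouse only sees it after a separate ETL copy runs.
data_lake = []
data_warehouse = []


def ingest(record):
    """Source systems write to the data lake only."""
    data_lake.append(record)


def run_etl():
    """Copy records the warehouse has not seen yet; return how many were copied."""
    new_records = data_lake[len(data_warehouse):]
    data_warehouse.extend(new_records)
    return len(new_records)


ingest({"order_id": 1})
ingest({"order_id": 2})
copied_first_run = run_etl()

ingest({"order_id": 3})
# Until the next ETL run, analysts querying the warehouse miss order 3.
stale_gap = len(data_lake) - len(data_warehouse)

copied_second_run = run_etl()
# After the copy, every record is stored twice: once per tier.
total_stored_copies = len(data_lake) + len(data_warehouse)
```

Even in this tiny model, freshness in the warehouse depends entirely on how often the copy job runs, and storage grows at twice the rate of the data itself.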

As you can see, this is unintuitive and unsustainable.

Hence, the paper presents the Lakehouse architecture as the way forward.

Simply put, the data lakehouse architecture is a data management system that implements all the features of data warehouses on data lakes. This makes the data lakehouse a single unified platform for business intelligence and advanced analytics.

This means that the lakehouse platform will implement data management features such as security controls, ACID transaction guarantees, data versioning, and auditing. It will implement query performance features such as indexing, caching, and query optimizations. These features are table stakes for data warehouses. The Lakehouse architecture brings these features to you in the flexible, open format data storage of data lakes. A Lakehouse is a platform that provides data warehousing capabilities and advanced analytics capabilities on the same platform, with cloud data lake economics.
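To give a feel for two of these features, all-or-nothing writes and data versioning, here is a minimal in-memory sketch. It is a toy under stated assumptions: the `ToyVersionedTable` class and its methods are hypothetical, it covers only atomic commits and reading older snapshots, and it does not describe how Delta Lake or any real lakehouse implements them.

```python
import copy


class ToyVersionedTable:
    """Illustrative only: an in-memory table with an append-only version history."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def commit(self, new_rows, validate=None):
        """Apply all rows as one new version, or none of them if validation fails."""
        candidate = copy.deepcopy(self._versions[-1])
        for row in new_rows:
            if validate is not None and not validate(row):
                # Raising here leaves the version history untouched.
                raise ValueError(f"rejected row: {row!r}")
            candidate.append(row)
        self._versions.append(candidate)
        return len(self._versions) - 1

    def read(self, version=None):
        """Return the latest snapshot, or an earlier one by version number."""
        index = len(self._versions) - 1 if version is None else version
        return copy.deepcopy(self._versions[index])


table = ToyVersionedTable()
v1 = table.commit([{"id": 1}, {"id": 2}])

try:
    # One bad row causes the whole batch to be discarded, valid row included.
    table.commit([{"id": 3}, {"id": -1}], validate=lambda row: row["id"] > 0)
except ValueError:
    pass

latest = table.read()
first = table.read(version=0)
```

Readers always see a complete version, never a half-applied batch, and earlier versions remain available for auditing or rollback. A real system must also provide isolation between concurrent writers and durability on storage, which this sketch leaves out.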

What is the Formal Definition of the Lakehouse?

Section 3 in the CIDR paper officially defines the Lakehouse. Check it out.

The following is a visual depiction of the Lakehouse:

Figure 1.2 – Lakehouse architecture

The idea of the Lakehouse is deceptively simple – as all good things in life are! The Lakehouse architecture immediately solves the problems we highlighted about present-day two-tier architectures:

  • A single storage layer means no duplication of data and no extra effort to reconcile data. Reduced ETL requirements and ACID guarantees equate to the stability and reliability of the system.
  • A single storage layer means a single model of security and governance for all data assets. This reduces the risk of security breaches.
  • A single storage layer means the availability of the freshest data possible for the consumers of the data.
  • Cheap cloud storage with elastic, on-demand cloud compute reduces the total cost of ownership.
  • Open source technologies in the storage layer reduce the chances of vendor lock-in and make it easy to integrate with other tools.

Of course, any implementation of the Lakehouse will have to ensure the following:

  • Reliable data management: The Lakehouse proposes to eliminate (or reduce) data warehouses. Hence, the Lakehouse implementation must efficiently implement data management and governance – features that are table stakes in data warehouses.
  • SQL performance: The Lakehouse will have to provide state-of-the-art SQL performance on top of the open-access filesystems and file formats typical in data lakes.

This is where the Databricks Lakehouse platform, and within it, the Databricks SQL product, comes in.

 

An overview of the Databricks Lakehouse platform

The Databricks Lakehouse platform enables enterprises to build their Lakehouse by providing simplified data engineering and data management techniques. The Databricks Lakehouse platform also provides one of the best ML experiences for data scientists and ML engineers.

Finally, Databricks SQL brings in the last piece of the puzzle – a home for the business intelligence and data analyst personas with a first-class workbench that allows query editing, building visualizations, and publishing dashboards. It also allows plug-and-play with downstream business intelligence tools such as Power BI, Tableau, Looker, and more. All of this is backed by state-of-the-art SQL query performance.

The following diagram represents the Databricks Lakehouse platform:

Figure 1.3 – The Databricks Lakehouse platform

The lakehouse platform by Databricks is a simple, open, and collaborative platform that combines the reliability, performance, and governance capabilities of data warehouses with the openness, flexibility, and economics of cloud data lakes.

 

Summary

In this chapter, we learned about Databricks as a company and the Databricks Lakehouse platform as the product of this company, which enables the democratization of data and AI for all organizations. We are now ready to begin exploring Databricks SQL.

In the next chapter, Chapter 2, The Databricks Product Suite – A Visual Tour, we will start with a tour of the Databricks Lakehouse platform.

About the Author
  • Vihag Gupta

    Vihag Gupta is a solutions architect with a specialization in cloud data platform architecture and design. He has a background in data engineering and a professional interest in machine learning. He loves getting hands-on and solving real business problems with technology. He graduated with a degree in information technology from PES University, Bengaluru, in 2011 and earned a degree in information systems management from Carnegie Mellon University, Pittsburgh, in 2016. He has worked at companies including Deloitte Consulting, DataSpark, and Qubole. He currently works at Databricks, helping clients bring their lakehouse platforms for analytics to life. Originally from Jharkhand, India, Vihag currently lives in Singapore with his wife and dog.
