Chapter 6: Introducing Structured Streaming

Many organizations need to consume large amounts of data continuously in their everyday processes. To extract insights from that data, we must be able to process it as it arrives, which calls for continuous data ingestion pipelines. These continuous applications bring their own challenges: building a reliable process that guarantees the correctness of the data despite traffic spikes, late-arriving data, upstream failures, and so on, and transforming data that arrives in inconsistent file formats, with varying levels of structure, or that needs to be aggregated before it can be used.

The traditional way of dealing with these issues was to process data in batches run as periodic tasks, which took the raw streams and stored them in more efficient formats to allow...

Technical requirements

This chapter will require you to have an Azure Databricks subscription available to work on the examples, as well as a notebook attached to a running cluster.

Let's start by looking into Structured Streaming models in more detail to find out which alternatives are available to work with streams of data in Azure Databricks.

Structured Streaming model

The Structured Streaming model is based on a simple but powerful premise: any query executed on the data will yield the same result as a batch job run on the data available at that point in time. The model ensures consistency and reliability as data arrives in the data lake, as it is processed within the engine, and when working with external systems.

As we saw in previous chapters, to use Structured Streaming we simply work with Spark dataframes through the DataFrame API, specifying the input and output locations.

Structured Streaming works by treating every new piece of data that arrives as a row appended to an unbounded table. This lets us run queries over the stream as if all of the input were being retained, without actually having to retain it. We can then query the streaming data as a static table and write the result to a data sink.

Structured Streaming is able to do this thanks to a feature called incrementalization, which plans a streaming execution every time that we run a query...
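As a minimal sketch of this model (the path, schema, and column names here are hypothetical, not taken from the book's listings), the same aggregation can be expressed as a batch query and as a streaming query; only the read method changes:

# Hypothetical input path and schema; in an Azure Databricks notebook, `spark` is already defined.
schema = "action STRING, time TIMESTAMP"

# Batch query: processes only the data that exists at this moment.
batch_counts = (spark.read.schema(schema).json("/data/events/")
                .groupBy("action")
                .count())

# Streaming query: identical logic, but new files are treated as rows appended
# to an unbounded input table and the result is updated incrementally.
stream_counts = (spark.readStream.schema(schema).json("/data/events/")
                 .groupBy("action")
                 .count())

print(stream_counts.isStreaming)  # True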

Using the Structured Streaming API

Structured Streaming is integrated into the PySpark API and embedded in the Spark DataFrame API. It makes working with streaming data straightforward and, in most cases, migrating from a computation on static data to a streaming computation requires only small changes. It also provides features for windowed aggregations and for setting the parameters of the execution model.

As we have discussed in previous chapters, in Azure Databricks streams of data are represented as Spark dataframes. We can verify that a dataframe is a stream of data by checking that its isStreaming property is set to true. Working with Structured Streaming can be summarized in three steps, read, process, and write, as exemplified here:

  1. We can read streams of data that are being dumped into, for example, an S3 bucket. The following example code shows how we can use the readStream method, specifying that we are reading a comma-separated...
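As a hedged sketch of these read, process, and write steps (the directory, schema, and query name below are hypothetical, and the memory sink is used only for testing):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema of the incoming CSV files; streaming sources require
# the schema to be declared up front.
schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("action", StringType(), True),
    StructField("amount", IntegerType(), True),
])

# Read: treat CSV files arriving in the directory as a stream.
input_df = (spark.readStream
            .format("csv")
            .option("header", "true")
            .schema(schema)
            .load("/mnt/landing/csv/"))  # hypothetical mount point

# Process: a simple aggregation on the stream.
counts_df = input_df.groupBy("action").count()

# Write: send the running counts to a sink; the memory sink is handy for testing.
query = (counts_df.writeStream
         .format("memory")
         .queryName("action_counts")
         .outputMode("complete")
         .start())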

Using different sources with continuous streams

Streams of data can come from a variety of sources. Structured Streaming provides support for extracting data from sources such as Delta tables, publish/subscribe (pub/sub) systems such as Azure Event Hubs, and more. We will review some of these sources in the next sections to learn how we can connect these streams of data to our jobs running in Azure Databricks.

Using a Delta table as a stream source

As mentioned in the previous chapter, you can use Structured Streaming with Delta Lake using the readStream and writeStream Spark methods, with a particular focus on overcoming issues related to handling and processing small files, managing batch jobs, and detecting new files efficiently.

When a Delta table is used as a streaming source, a query on that table processes the data already present in the table as well as any new data that arrives after the stream starts.

In the next example, we will load both the path...
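As a minimal hedged sketch (the table paths below are hypothetical), a Delta table can be used as a streaming source and the results written to another Delta table:

# Read an existing Delta table as a streaming source; existing rows are processed
# first, and rows appended to the table afterwards are picked up by the stream.
events_stream = (spark.readStream
                 .format("delta")
                 .load("/mnt/delta/events"))  # hypothetical table path

# Write the stream to another Delta table, tracking progress with a checkpoint.
query = (events_stream.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/delta/events_copy/_checkpoints")
         .start("/mnt/delta/events_copy"))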

Triggering streaming query executions

Triggers define the events that cause an operation to be executed on a portion of data; in other words, they control the timing of streaming data processing. Triggers are defined by the interval of time at which the system checks whether new data has arrived. If this interval is too small, it leads to unnecessary use of resources, so it should always be tuned to your specific process.

The trigger parameters of a streaming query define whether it is executed as a micro-batch query with a fixed batch interval or as a continuous processing query.

Different kinds of triggers

There are different kinds of triggers available in Azure Databricks that we can use to define when our streaming queries will be executed. The available options are outlined here:

  • Unspecified trigger: This is the default option and means that unless specified otherwise, the query will...
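As a minimal sketch of how these trigger options are set in PySpark (the rate source, query names, and intervals below are illustrative assumptions, not the book's listings):

# The rate source generates test rows continuously, which is convenient for trying out triggers.
rate_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Unspecified (default) trigger: each micro-batch starts as soon as the previous one finishes.
q_default = (rate_df.writeStream
             .format("memory").queryName("q_default")
             .start())

# Fixed-interval micro-batches: check for new data every 30 seconds.
q_interval = (rate_df.writeStream
              .format("memory").queryName("q_interval")
              .trigger(processingTime="30 seconds")
              .start())

# One-time trigger: process whatever is available in a single micro-batch, then stop.
q_once = (rate_df.writeStream
          .format("memory").queryName("q_once")
          .trigger(once=True)
          .start())

# Continuous processing (experimental): low-latency mode with a checkpoint interval;
# it supports only a limited set of sources, sinks, and stateless operations.
q_continuous = (rate_df.writeStream
                .format("console")
                .trigger(continuous="1 second")
                .start())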

Visualizing data on streaming data frames

When working with Structured Streaming dataframes, we can visualize real-time data using the display function. This function differs from other visualization functions in that, because of the real-time nature of the data, it accepts options such as processingTime and checkpointLocation. These options control the exact point in time we are visualizing and should always be set in production so that we know exactly which state of the data we are seeing.

In the following code example, we first define a Structured Streaming dataframe, and then we use the display function to show the state of the data every 5 seconds of processing time, using a specific checkpoint location:

# The rate source generates rows continuously, which makes it useful for demos and testing
streaming_df = spark.readStream.format("rate").load()
# Show a running count, refreshed every 5 seconds, with progress tracked at the given checkpoint
display(streaming_df.groupBy().count(), processingTime="5 seconds", checkpointLocation="<checkpoint-path>")

Specifically...

Example on Structured Streaming

In this example, we will look at how we can leverage the knowledge about Structured Streaming we have acquired throughout the previous sections. We will simulate an incoming stream of data by using one of the example datasets, which consists of small JSON files that, in a real scenario, could be the incoming stream of data that we want to process. We will use these files to compute metrics such as counts and windowed counts on a stream of timestamped actions. Let's take a look at the contents of the structured-streaming example dataset, as follows:

%fs ls /databricks-datasets/structured-streaming/events/

You will find that there are about 50 JSON files in the directory. You can see some of these in the following screenshot:

Figure 6.3 – The structured-streaming dataset's JSON files

We can see what one of these JSON files contains by using the %fs head command, as follows:

%fs head /databricks-datasets...
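As a hedged sketch of the kind of computation this example builds up to (the column names action and time, the epoch-seconds encoding, and the 10-minute window are assumptions about this sample dataset, so verify them against the output of %fs head):

from pyspark.sql.functions import col, window
from pyspark.sql.types import StructType, StructField, StringType, LongType

input_path = "/databricks-datasets/structured-streaming/events/"

# Assumed layout of the sample JSON files: an action name and an epoch-seconds timestamp.
schema = StructType([
    StructField("action", StringType(), True),
    StructField("time", LongType(), True),
])

# maxFilesPerTrigger makes the static files arrive one per micro-batch,
# simulating a live stream of incoming data.
events = (spark.readStream
          .schema(schema)
          .option("maxFilesPerTrigger", 1)
          .json(input_path)
          .withColumn("event_time", col("time").cast("timestamp")))

# Windowed counts: number of events per action in 10-minute windows.
windowed_counts = (events
                   .groupBy(window(col("event_time"), "10 minutes"), col("action"))
                   .count())

display(windowed_counts)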

Summary

Throughout this chapter, we have reviewed different features of Structured Streaming and looked at how we can leverage them in Azure Databricks when dealing with streams of data from different sources.

These sources include Azure Event Hubs, Delta tables used as streaming sources, Auto Loader for efficient file detection, Apache Kafka, and Avro-format files, as well as the data sinks we write results to. We have also described how Structured Streaming provides fault tolerance while working with streams of data, and looked at how we can visualize these streams using the display function. Finally, we concluded with an example in which we simulated JSON files arriving in storage.

In the next chapter, we will dive more deeply into how we can use the PySpark API to manipulate data, how we can use popular Python libraries in Azure Databricks and the nuances of installing them on a distributed system, and how we can easily migrate from...
