Chapter 10: Designing and Developing a Stream Processing Solution

Welcome to the next chapter in the data transformation series. This chapter deals with stream processing solutions, also known as real-time processing systems. Similar to batch processing, stream processing is another important segment of data pipelines. This is also a very important chapter for your certification.

This chapter will focus on introducing the concepts and technologies involved in building a stream processing system. You will be learning about technologies such as Azure Stream Analytics (ASA), Azure Event Hubs, and Spark (from a streaming perspective). You will learn how to build end-to-end streaming solutions using these technologies. Additionally, you will learn about important streaming concepts such as checkpointing, windowed aggregates, replaying older stream data, handling drift, and stream management concepts such as distributing streams across partitions, scaling resources, handling errors, and...

Technical requirements

For this chapter, you will need the following:

  • An Azure account (this could be either free or paid)
  • The ability to read basic Python code (don't worry, it is very easy)

Let's get started!

Designing a stream processing solution

Stream processing systems, or real-time processing systems, perform data processing in near real time. Think of stock market updates, real-time traffic updates, real-time credit card fraud detection, and more. Incoming data is processed as and when it arrives, with minimal latency, usually in the range of milliseconds to seconds. In Chapter 2, Designing a Data Storage Structure, we learned about the Data Lake architecture, where we saw two branches of processing: one for streaming and one for batch processing. In the previous chapter, Chapter 9, Designing and Developing a Batch Processing Solution, we focused on the batch processing pipeline. In this chapter, we will focus on stream processing. The blue boxes in the following diagram show the streaming pipeline:

Figure 10.1 – The stream processing architecture

Stream processing systems consist of four major components:

  • An Event Ingestion...

Developing a stream processing solution using ASA, Azure Databricks, and Azure Event Hubs

In this section, we will look at two examples: one with ASA as the streaming engine and another with Spark as the streaming engine. We will use a dummy event generator to continuously generate trip events. We will configure both ASA and Azure Databricks Spark to perform real-time processing and publish the results. First, let's start with ASA.
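
As a point of reference (a hedged sketch only, not the book's actual generator), such a dummy trip-event generator could look like the following using the azure-eventhub Python SDK. The connection string, event hub name, and field values are placeholders, and the field names simply mirror the sample trip schema used later in this chapter:

import json
import random
import time
import uuid
from datetime import datetime, timezone

from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details; replace them with your own values.
CONNECTION_STR = "<EVENT_HUBS_CONNECTION_STRING>"
EVENTHUB_NAME = "<EVENT_HUB_NAME>"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

locations = ["Airport", "Downtown", "Harbor", "University"]

with producer:
    while True:
        trip = {
            "tripId": str(uuid.uuid4()),
            "createdAt": datetime.now(timezone.utc).isoformat(),
            "startLocation": random.choice(locations),
            "endLocation": random.choice(locations),
            "distance": random.randint(1, 50),
            "fare": random.randint(5, 100),
        }
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(trip)))
        producer.send_batch(batch)
        time.sleep(1)  # Emit roughly one trip event per second.

The generator sends one JSON-encoded trip event per second; in practice, you would point it at the Event Hubs instance created in the following steps.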

A streaming solution using Event Hubs and ASA

In this example, we will build a streaming pipeline by creating an Event Hubs instance and an ASA instance and linking them together. The pipeline will then read a sample stream of events, process the data, and display the result in Power BI:

  1. First, let's create an Event Hubs instance. From the Azure portal, search for Event Hubs and click on the Create button, as shown in the following screenshot:

Figure 10.3 – The Event Hubs creation screen

    ...

Processing data using Spark Structured Streaming

Structured Streaming is a feature in Apache Spark where the incoming stream is treated as an unbounded table. The incoming streaming data is continuously appended to the table. This feature makes it easy to write streaming queries, as we can now write streaming transformations in the same way we handle table-based transformations. Hence, the same Spark batch processing syntax can be applied here, too. Spark treats the Structured Streaming queries as incremental queries on an unbounded table and runs them at frequent intervals to continuously process the data.
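
As a minimal sketch of this idea (using Spark's built-in rate source rather than a real event stream, so the example stays self-contained), a streaming aggregation can be written with the same DataFrame syntax as a batch query:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()

# The built-in rate source emits rows continuously, so the DataFrame behaves
# like an unbounded table that keeps growing as new events arrive.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The same syntax used for batch tables: count rows per 10-second window.
counts = (
    events
    .groupBy(window(events.timestamp, "10 seconds"))
    .agg(count("*").alias("eventCount"))
)

# Write the continuously updated result table to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

The outputMode setting in the last step corresponds to the writing modes described next.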

Spark supports three writing modes for the output of Structured Streaming:

  • Complete mode: In this mode, the entire output (also known as the result table) is written to the sink. The sink could be a blob store, a data warehouse, or a BI tool.
  • Append mode: In this mode, only the new rows appended since the last trigger are written to the sink.
  • Update mode: In this mode, only...

Monitoring for performance and functional regressions

Let's explore the monitoring options available in Event Hubs, ASA, and Spark for streaming scenarios.

Monitoring in Event Hubs

The Event Hubs Metric tab provides metrics that can be used for monitoring. Here is a sample screenshot of the metric options that are available:

Figure 10.20 – The metrics screen of Event Hubs

We can get useful metrics such as the number of Incoming Messages, the number of Outgoing Messages, Server Errors, CPU and memory utilization, and more. You can use all of these metrics to plot graphs and dashboards as required.
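
If you want to pull these metrics programmatically rather than viewing them in the portal, one option is the Azure Monitor metrics query API. This is a hedged sketch, assuming the azure-monitor-query and azure-identity packages and a hypothetical Event Hubs namespace resource ID:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

# Hypothetical resource ID of the Event Hubs namespace being monitored.
RESOURCE_ID = (
    "/subscriptions/<SUB_ID>/resourceGroups/<RG>/providers/"
    "Microsoft.EventHub/namespaces/<NAMESPACE>"
)

client = MetricsQueryClient(DefaultAzureCredential())

response = client.query_resource(
    RESOURCE_ID,
    metric_names=["IncomingMessages", "OutgoingMessages"],
    timespan=timedelta(hours=1),
)

# Print each data point of each requested metric over the last hour.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)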

Next, let's look at the monitoring options in ASA.

Monitoring in ASA

The ASA Overview page provides high-level monitoring metrics, as shown in the following screenshot:

Figure 10.21 – The ASA Overview page with metrics

Similar to the Event Hubs metric page, ASA also provides a rich set of metrics that...

Processing time series data

Time series data is simply data recorded continuously over time. Examples of time series data include stock prices recorded over time, IoT sensor values that track the health of machinery, and more. Time series data is mostly used to analyze historical trends, identify abnormalities such as credit card fraud, drive real-time alerting, and support forecasting. Time series data is append-heavy, with very rare updates.

Time series data is a perfect candidate for real-time processing. The stream processing solutions that we discussed earlier in this chapter, in the Developing a stream processing solution using ASA, Azure Databricks, and Azure Event Hubs section, would work perfectly for time series data. Let's look at some of the important concepts of time series data.

Types of timestamps

The central aspect of any time series data is the time attribute. There are two types of time in time series data:

    ...

Designing and creating windowed aggregates

In this section, let's explore the different windowed aggregates that are available in ASA. ASA supports the following five types of windows:

  • Tumbling windows
  • Hopping windows
  • Sliding windows
  • Session windows
  • Snapshot windows

Let's look at each of them in detail. We will be using the following sample event schema in our examples.

from pyspark.sql.types import (
    StructType, StringType, TimestampType, IntegerType
)

eventSchema = (
    StructType()
    .add("tripId", StringType())
    .add("createdAt", TimestampType())
    .add("startLocation", StringType())
    .add("endLocation", StringType())
    .add("distance", IntegerType())
    .add("fare", IntegerType())
)

Let us start with Tumbling windows.

Tumbling windows

Tumbling windows are non-overlapping time windows. All the windows are of the same size. Here is a depiction of how they look:

Figure 10.26 ...
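
The ASA examples in this chapter express windows in SQL; as a hedged aside, an equivalent 10-second tumbling window can also be written in Spark Structured Streaming. The sketch below assumes a streaming DataFrame named tripsDf that has already been parsed with the eventSchema shown earlier:

from pyspark.sql.functions import window, count, sum as sum_

# Group the stream into non-overlapping 10-second windows on the event time.
windowedTrips = (
    tripsDf
    .groupBy(window(tripsDf.createdAt, "10 seconds"))
    .agg(count("tripId").alias("tripCount"), sum_("fare").alias("totalFare"))
)

windowedTrips.writeStream.outputMode("complete").format("console").start()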

Configuring checkpoints/watermarking during processing

Let's look at the checkpointing options available in ASA, Event Hubs, and Spark.

Checkpointing in ASA

ASA performs internal checkpointing periodically; users do not need to do explicit checkpointing. The checkpoints are used for job recovery during system upgrades, job retries, node failures, and more.

During node failures or OS upgrades, ASA automatically restores the failed node state on a new node and continues processing.

Note

During ASA service upgrades (not OS upgrades), the checkpoints are not maintained, and the stream corresponding to the downtime needs to be replayed.

Next, let's look at how to checkpoint in Event Hubs.

Checkpointing in Event Hubs

Checkpointing or watermarking in Event Hubs refers to the process of marking the offset within a stream or partition to indicate the point up to which processing is complete. Checkpointing in Event Hubs is the responsibility of the...
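
The rest of this section is not part of the excerpt, but as a hedged illustration of the general pattern, the Python SDK lets a consumer persist checkpoints to Azure Blob Storage through a checkpoint store. This sketch assumes the azure-eventhub and azure-eventhub-checkpointstoreblob packages and placeholder connection values:

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# Placeholder connection details.
EVENTHUB_CONN_STR = "<EVENT_HUBS_CONNECTION_STRING>"
EVENTHUB_NAME = "<EVENT_HUB_NAME>"
STORAGE_CONN_STR = "<BLOB_STORAGE_CONNECTION_STRING>"
CONTAINER_NAME = "<CHECKPOINT_CONTAINER>"

checkpoint_store = BlobCheckpointStore.from_connection_string(
    STORAGE_CONN_STR, CONTAINER_NAME
)

client = EventHubConsumerClient.from_connection_string(
    conn_str=EVENTHUB_CONN_STR,
    consumer_group="$Default",
    eventhub_name=EVENTHUB_NAME,
    checkpoint_store=checkpoint_store,
)

def on_event(partition_context, event):
    # Process the event, then record the offset so that a restart
    # resumes from this point rather than reprocessing the stream.
    partition_context.update_checkpoint(event)

with client:
    client.receive(on_event=on_event, starting_position="-1")

On restart, the client reads the stored offsets and resumes from the last checkpoint instead of reprocessing the whole stream.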

Replaying archived stream data

Event Hubs stores up to 7 days of data, which can be replayed using the EventHub consumer client libraries. Here is a simple Python example:

from azure.eventhub import EventHubConsumerClient

def on_event(partition_context, event):
    # Process each replayed event here.
    print(event.body_as_str())

# CONNECTION_STR and EVENTHUB_NAME hold your Event Hubs connection details.
consumer_client = EventHubConsumerClient.from_connection_string(
    conn_str=CONNECTION_STR,
    consumer_group='$Default',
    eventhub_name=EVENTHUB_NAME,
)
with consumer_client:
    consumer_client.receive(
        on_event=on_event,
        partition_id="0",
        starting_position="-1",  # "-1" is the start of the partition.
    )

You can specify offsets or timestamps for the starting_position value.
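
For instance, to replay only the events enqueued after a given point in time, a datetime value can be passed instead (a sketch reusing the consumer_client and on_event callback from the previous example; the timestamp is hypothetical):

import datetime

with consumer_client:
    consumer_client.receive(
        on_event=on_event,
        partition_id="0",
        # Replay only events enqueued after this (hypothetical) timestamp.
        starting_position=datetime.datetime(
            2022, 2, 1, tzinfo=datetime.timezone.utc
        ),
    )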

You can learn more about the Python EventHub APIs at https://azuresdkdocs.blob.core.windows.net/$web/python/azure-eventhub/latest/azure.eventhub.html.

Let's take a look at some of the common data transformations that are possible using streaming analytics.

Transformations using streaming analytics

One common theme that you might notice in streaming queries is that whenever a transformation involves an aggregation, a windowed aggregate has to be specified. Let's take the example of counting the number of distinct entries in a time frame.

The COUNT and DISTINCT transformations

This type of transformation can be used to count the number of distinct events that have occurred in a time window. Here is an example to count the number of unique trips in the last 10 seconds:

SELECT
    COUNT(DISTINCT tripId) AS TripCount,
    System.TIMESTAMP() AS Time
INTO [Output]
FROM [Input] TIMESTAMP BY createdAt
GROUP BY TumblingWindow(second, 10)

Next, let's look at an example where we cast the input into a different data type.

CAST transformations

The CAST transformation can be used to convert the data type on the fly. Here is an example to convert...

Handling schema drifts

Schema drift refers to changes in the schema over time due to changes happening in the event sources. This could be due to newer columns or fields getting added, older columns getting deleted, and more.

Handling schema drifts using Event Hubs

If an event publisher needs to share schema details with the consumer, they have to serialize the schema along with the data, using formats such as Apache Avro, and send it across Event Hubs. Here, the schema has to be sent with every event, which is not a very efficient approach.

If you are dealing with statically defined schemas on the consumer side, any schema changes on the producer side would spell trouble.

Event Hubs provides a feature called Azure Schema Registry to handle schema evolution and schema drift. It provides a central repository to share the schemas between event publishers and consumers. Let's examine how to create and use Azure Schema Registry.

Registering a schema with schema registry...
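
The remainder of this section is not included in the excerpt. As a hedged sketch of the registration step, the azure-schemaregistry Python package can register an Avro schema against a hypothetical schema group as follows; the namespace and group names are placeholders, and the Avro definition simply mirrors the trip events used in this chapter:

from azure.identity import DefaultAzureCredential
from azure.schemaregistry import SchemaRegistryClient

# Placeholder namespace and schema group names.
FULLY_QUALIFIED_NAMESPACE = "<NAMESPACE>.servicebus.windows.net"
GROUP_NAME = "<SCHEMA_GROUP>"

# An Avro record definition describing the trip events.
SCHEMA_DEFINITION = """
{
    "type": "record",
    "name": "TripEvent",
    "fields": [
        {"name": "tripId", "type": "string"},
        {"name": "createdAt", "type": "string"},
        {"name": "startLocation", "type": "string"},
        {"name": "endLocation", "type": "string"},
        {"name": "distance", "type": "int"},
        {"name": "fare", "type": "int"}
    ]
}
"""

client = SchemaRegistryClient(FULLY_QUALIFIED_NAMESPACE, DefaultAzureCredential())

properties = client.register_schema(
    GROUP_NAME, "TripEvent", SCHEMA_DEFINITION, "Avro"
)
print(properties.id)  # The schema ID that producers and consumers share.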

Processing across partitions

Before we look at how to process data across partitions, first, let's understand partitions.

What are partitions?

Event Hubs can distribute incoming events into multiple streams so that they can be accessed, in parallel, by the consumers. These parallel streams are called partitions. Each partition stores the actual event data along with event metadata, such as its offset in the partition, the server-side timestamp at which the event was accepted, its sequence number in the stream, and more. Partitioning helps in scaling real-time processing, as it increases parallelism by providing multiple input streams for downstream processing engines. Additionally, it improves availability by redirecting events to other healthy partitions if some of the partitions fail.

You can learn more about Event Hubs partitions at https://docs.microsoft.com/en-in/azure/event-hubs/event-hubs-scalability.

Now, let's look at how to send data across partitions...
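
The details are beyond this excerpt, but as a hedged sketch using the azure-eventhub Python SDK, a producer can either let Event Hubs distribute events automatically, route related events together with a partition_key, or pin a batch to an explicit partition_id (the connection values below are placeholders):

from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details.
CONNECTION_STR = "<EVENT_HUBS_CONNECTION_STRING>"
EVENTHUB_NAME = "<EVENT_HUB_NAME>"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

with producer:
    # Events with the same partition key are routed to the same partition.
    keyed_batch = producer.create_batch(partition_key="trip-42")
    keyed_batch.add(EventData('{"tripId": "trip-42", "fare": 12}'))
    producer.send_batch(keyed_batch)

    # Alternatively, pin the batch to an explicit partition.
    pinned_batch = producer.create_batch(partition_id="0")
    pinned_batch.add(EventData('{"tripId": "trip-43", "fare": 18}'))
    producer.send_batch(pinned_batch)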

Processing within one partition

Similar to the previous example, where we learned how to process across partitions, we can use the EventHubConsumerClient class to process data within a single partition, too. All we have to do is specify the partition ID in the client.receive call, as demonstrated in the following code snippet. The rest of the code remains the same as in the previous example:

with client:
    client.receive(
        on_event=on_event,
        partition_id='0',  # To read only partition 0
    )

This is how we can programmatically process the data from specific Event Hubs partitions.

Next, let's look at how to scale resources for stream processing.

Scaling resources

Let's look at how to scale resources in Event Hubs, ASA, and Azure Databricks Spark.

Scaling in Event Hubs

There are two ways in which Event Hubs supports scaling:

  • Partitioning: We have already learned how partitioning can help scale our Event Hubs instance by increasing the parallelism with which the event consumers can process data. Partitioning helps reduce contention if there are too many producers and consumers, which, in turn, makes it more efficient.
  • Auto-inflate: This is an automatic scale-up feature of Event Hubs. As the usage increases, EventHub adds more throughput units to your Event Hubs instance, thereby increasing its capacity. You can enable this feature if you have already saturated your quota using the partitioning technique that we explored earlier, in the Processing across partitions section.

Next, let's explore the concept of throughput units.

What are throughput units?

Throughput units are units of capacity...

Handling interruptions

Interruptions to stream processing might occur due to various reasons such as network connectivity issues, background service upgrades, intermittent bugs, and more. Event Hubs and ASA provide options to handle such interruptions natively using the concept of Azure availability zones. Availability zones are physically isolated locations within an Azure region that help applications become resilient to local failures and outages. Azure publishes the list of regions that support availability zones.

Services that support availability zones deploy their applications across all the zone locations to improve fault tolerance. Additionally, they ensure that service upgrades are rolled out to the zone locations one after the other, so at no point will all the locations suffer an outage due to a service upgrade bug. Both Event Hubs and ASA support availability zones. Let's look at how to enable this feature for...

Designing and configuring exception handling

Event Hubs exceptions provide very clear information regarding the reason for errors. All Event Hubs issues throw an EventHubsException exception object.

The EventHubsException exception object contains the following information:

  • IsTransient: This indicates whether the failed operation can be retried.
  • Reason: This indicates the actual reason for the exception. Some example reasons could include timeouts, exceeding quota limits, exceeding message sizes, client connection disconnects, and more.

Here is a simple example of how to catch exceptions in .NET:

try
{
    // Process Events
}
catch (EventHubsException ex) when
    (ex.Reason == EventHubsException.FailureReason.MessageSizeExceeded)
{
    // Take action for the oversized messages
}

You can learn more about exception handling at https://docs.microsoft.com/en-us/azure/event-hubs/exceptions-dotnet.

Next, let's look at...

Upserting data

Upserting refers to the INSERT or UPDATE activity in a database or any analytical data store that supports it. We have already seen UPSERT as part of the batch activity, in the Upserting data section of Chapter 9, Designing and Developing a Batch Processing Solution. ASA supports UPSERT with CosmosDB. CosmosDB is a fully managed, globally distributed NoSQL database. We will learn more about CosmosDB in Chapter 14, Optimizing and Troubleshooting Data Storage and Data Processing, in the Implementing HTAP using Synapse Link and CosmosDB section.

ASA has two different behaviors based on the compatibility level that is set. ASA supports three different compatibility levels. You can think of compatibility levels as API versions. As ASA evolved, the compatibility levels increased: 1.0 was the first compatibility version, and 1.2 is the latest. The main change in version 1.2 is support for the AMQP messaging protocol.

You can set the...

Designing and creating tests for data pipelines

This section has already been covered in Chapter 9, Designing and Developing a Batch Processing Solution, in the Designing and creating tests for data pipelines section. Please refer to that section for details.

Optimizing pipelines for analytical or transactional purposes

We will be covering this topic in Chapter 14, Optimizing and Troubleshooting Data Storage and Data Processing, under the Optimizing pipelines for analytical or transactional purposes section, as that entire chapter deals with optimizations.

Summary

That brings us to the end of this chapter. This is one of the important chapters both from a syllabus perspective and a data engineering perspective. Batch and streaming solutions are fundamental to building a good big data processing system.

So, let's recap what we learned in this chapter. We started with designs for streaming systems using Event Hubs, ASA, and Spark Streaming. We learned how to monitor such systems using the monitoring options available within each of those services. Then, we learned about time series data and important concepts such as windowed aggregates, checkpointing, replaying archived data, handling schema drifts, how to scale using partitions, and adding processing units. Additionally, we explored the upsert feature, and towards the end, we learned about error handling and interruption handling.

You should now be comfortable with creating streaming solutions in Azure. As always, please go through the follow-up links that have been provided...
