Stream data platforms have grown significantly in popularity in recent times, driven by the need for real-time access to information. Enterprises are transitioning parts of their data infrastructure to a streaming paradigm in response to changing business needs. The streaming model presents a significant shift: it moves from point queries against stationary data to a standing temporal query that consumes moving data. Fundamentally, it enables insight on the data before it is stored in the analytics repository. This introduces a new way of thinking. Before going deep into stream processing, we have to cover a couple of key basic concepts related to events and streams. In this chapter, we'll explore the basics of the following topics:
- Publish/Subscribe (Pub/Sub)
- Stream processing
- Real-Time Insights
The core theme of this book is the Azure Streaming Service. Before diving deeper into Azure Streaming Service, we should take a moment to consider why we need stream processing, or Real-Time Insights, and why it is a tool worth adding to your repertoire.
So what is stream processing and why is it important? In traditional data processing, data is typically processed in batch mode, on a regular schedule. One fundamental challenge with conventional data processing is that it is inherently reactive, because it focuses on ageing information. Stream processing, on the other hand, processes data as it flows through in real time.
The following are some of the highlights of why stream processing is critical:
- Response time is critical:
  - Reducing decision latency can unlock business value
  - Need to ask questions about data in motion
  - Can't wait for data to get to rest before running computation
- Actions by human actors:
  - See and seize insights
  - Live visualization
  - Alerts and alarms
  - Dynamic aggregation
- Machine-to-machine interactions:
  - Data movement with enrichment
  - Kick-off workflows for automation
Before going into stream analytics, it is essential to understand the core basics around events and the different models of publishing and consuming events. Getting familiar with queues, Pub/Sub, and events will help you understand the later chapters better.
In this section, we will review two key concepts, queues and Publish/Subscribe models, followed by event-based messaging models.
A queue implements one-way communication, where the sender places a message on the queue and a receiver collects the message asynchronously. Features such as dead letter queues, paired namespaces, active/passive replication, and auto-forwarding to a chained queue that is part of the same namespace provide a rich feature set for messages flowing between applications and help deliver a highly available solution.
A queue consists of three key elements:
- Sender: Sends the message to the receiver through a durable entity.
- Durable entity: Stores the received durable message and offers persistence. The messages are stored until they are collected by the receiver.
- Receiver: The final recipient of the message.
The key advantages of a queue are as follows:
- Queues operate on the principle of first in, first out (FIFO): Consider a simple queue where you put messages in at one end and receive them at the other end in the same order. For example, a Service Bus queue implements the FIFO pattern.
- Point-to-point: Queues are fundamentally point-to-point messaging; even though there may be multiple senders of messages, there is only one receiver of the messages.
- Asynchronous communication: Endpoints are addressed directly, and a static structure may exist where senders and receivers communicate through named channels. Asynchronous communication helps with building a decoupled architecture and gives higher resilience, because messages can still be added and processed when either the publisher or the consumer of messages has downtime.
- Security: Because senders and receivers know about each other, senders know where the data will land, and it is easier to enforce security policies.
The following figure illustrates the preceding concept:
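As a minimal sketch of the sender, durable entity, and receiver arrangement described above, the following Python uses the standard library's `queue.Queue` as a stand-in for the durable entity. It is in-memory only, so it illustrates FIFO ordering and asynchronous hand-off rather than persistence. The function names `send` and `receive_all` are hypothetical and do not come from any Azure SDK.

```python
import queue

# Stand-in for the durable entity. A real broker persists messages;
# this in-memory queue only illustrates ordering and decoupling.
durable_entity = queue.Queue()


def send(message):
    """Sender: place a message on the queue and return immediately."""
    durable_entity.put(message)


def receive_all():
    """Receiver: collect every waiting message, oldest first."""
    received = []
    while True:
        try:
            received.append(durable_entity.get_nowait())
        except queue.Empty:
            return received


# The sender finishes before the receiver starts (asynchronous hand-off).
for order_id in ("order-1", "order-2", "order-3"):
    send(order_id)

delivered = receive_all()
print(delivered)
```

The receiver sees the messages in the order they were sent, and the sender never waits for the receiver to be available.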
Publish/Subscribe is a communication paradigm for a large-scale system. It enables loose coupling between mutually anonymous components and supports many-to-many communication.
The core concept of the Publish/Subscribe model is very simple. A publisher publishes information on some topic, and anyone who is interested in that information can receive it at the same time, simply by subscribing to it. A well-known example of this pattern is news feeds: end users who are interested can subscribe to the types of news feed they want to follow. Let's review the key components in the Publish/Subscribe paradigm:
- Publisher (message sender):
  - Publishers connect to the Publish/Subscribe middleware to communicate
  - Publishers produce events without any dependence on subscribers
  - Publishers advertise the events they are prepared to publish
  - The publisher announces an event without having any knowledge of the potential subscribers
  - The topic is conceptually similar to the queue, but a copy of a given message on the topic can be forwarded to multiple subscriptions
  - Topics and subscriptions provide a one-to-many form of communication based on the Publish/Subscribe pattern
- Subscriber (receiver):
  - Subscribers register their interest in receiving events through a subscription that the middleware handles
  - The subscriber can subscribe to and unsubscribe from events
  - The subscriber has to express interest in one or more events and only receives events related to that interest, without any knowledge of which publishers can provide a given event
  - Once an event is received and a subscriber consumes it, the same event cannot be replayed, and new subscribers will not see the event; this eliminates duplicate processing of events
  - A subscription is similar to a virtual queue that receives copies of the messages that were sent to the topic; you can optionally include filter rules for a topic on a per-subscription basis, which allows you to filter messages as illustrated:
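The topic, subscription, and filter-rule relationship described above can be sketched in a few lines of Python. This is an illustrative in-memory model, not the Service Bus API: the `Topic` class, its method names, and the use of a plain predicate function as a filter rule are all assumptions made for the example.

```python
from collections import deque


class Topic:
    """Hypothetical in-memory topic; each subscription is a virtual queue."""

    def __init__(self):
        self._subscriptions = {}

    def subscribe(self, name, rule=None):
        # rule is an optional predicate; None means "receive everything".
        self._subscriptions[name] = (rule, deque())

    def unsubscribe(self, name):
        self._subscriptions.pop(name, None)

    def publish(self, message):
        # The publisher never learns who, if anyone, receives the message.
        for rule, inbox in self._subscriptions.values():
            if rule is None or rule(message):
                inbox.append(dict(message))  # each subscription gets a copy

    def receive(self, name):
        _, inbox = self._subscriptions[name]
        return inbox.popleft() if inbox else None


news = Topic()
news.subscribe("all-news")
news.subscribe("sports-only", rule=lambda m: m["category"] == "sports")

news.publish({"category": "sports", "headline": "Cup final tonight"})
news.publish({"category": "weather", "headline": "Rain expected"})

# A subscription created now does not see the two earlier messages.
news.subscribe("late-joiner")
```

Here `all-news` receives both messages, `sports-only` receives only the first, and `late-joiner` receives nothing because it subscribed after the messages were published.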
Key benefits of the Pub/Sub model are as follows:
- Decoupling (loose coupling) in space, time, and synchronization:
  - Space: The publisher and subscriber don't need to know each other, for instance by name or IP address.
  - Time: The publisher and subscriber don't need to run at the same time.
  - Synchronization: Operations can continue at both ends of the spectrum (publishing and receiving).
- Highly parallel:
  - The model is highly parallel in that subscribers can process events while, at the same time, the publisher keeps publishing events.
  - Due to its decoupled and parallel nature, the Pub/Sub model is highly scalable.
  - To achieve higher velocity, events can be cached and smarter routing to the subscribers can be configured to scale.
- The key challenge with the Pub/Sub model is scaling to millions of publishers and subscribers.
In addition to the preceding, there are two key challenges with the Pub/Sub model:
- No guarantee of message delivery because of the decoupled nature of the model
- For applications that totally depend on guaranteed message delivery, the Pub/Sub model will not fit
TIBCO is one of the pioneers that advocated the Publish/Subscribe model during the time of centralized batch processing. The TIBCO approach changed the paradigm on the stock trading floor.
RSS feeds use the Pub/Sub model. You subscribe to an RSS feed, subscribe to one or more forums on a discussion platform, or follow someone on Twitter; in each case, there is one publisher and multiple subscribers involved.
Companies like IBM have built protocols like Message Queue Telemetry Transport (MQTT). Some examples of products that are built on the Pub/Sub model:
- IBM MQ, one of the early adopters of the Pub/Sub model
- Wormhole Pub/Sub system from Facebook
- Google Cloud Pub/Sub
In the next section, we will look at how Azure implements queues and Pub/Sub models.
Azure supports two types of queuing mechanism:
- Storage queues
- Service bus queues
Storage queues are part of the Azure storage infrastructure and offer a simple REST-based GET/PUT/PEEK interface. They provide reliable, persistent messaging within and between services.
Service bus queues are part of the enterprise offering of the Azure messaging infrastructure. As a part of the offering, it includes queues, Pub/Sub, and advanced integration patterns.
Both of these queuing technologies exist in parallel to cater to different types of use cases. Storage queues were offered first, as a service on top of the Azure storage service. Service bus queues came after storage queues and support wider use cases and scenarios. For example, if your components span multiple communication protocols, data contracts, trust domains, and network environments, Azure Service Bus queues are the ideal solution.
There are a couple of key technical differences between Azure Storage queues and Azure Service Bus queues.
You should consider Azure Storage queues when:
- Your application must store over 80 GB of messages in a queue and the messages have a lifetime shorter than seven days.
- You are building on Azure worker roles and you want to preserve messages across worker role crashes.
- Server-side logs are required for all transactions executed against your queues.
You should consider Azure Service Bus queues when:
- You need guaranteed FIFO ordered delivery.
- Your solution requires duplicate detection.
- Your message time to live (TTL) can exceed seven days and your queue size will not grow larger than 80 GB.
If you would like to understand the detailed differences between Azure Storage queues versus Azure Service Bus queues, visit https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted.
Azure Service Bus messaging is an implementation of the message queuing concept, offered in Microsoft Azure as a Platform as a Service (PaaS). All Azure PaaS services are built with high resiliency and high availability.
In this section, we briefly reviewed how Azure implements queues and Pub/Sub models. Since the focus of this book is stream event processing, we will explore Azure events in more depth in the next section.
An event is made of two parts: the event header and the event body. The event header has a name, the timestamp of the event, and the type of event. The event body has the details of the event.
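The header and body split can be pictured with a small Python data model. This is only a sketch: the class names, the field names, and the JSON layout are assumptions chosen for illustration, not a schema defined by Azure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class EventHeader:
    name: str
    event_type: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass(frozen=True)
class Event:
    header: EventHeader
    body: dict

    def to_json(self):
        """Serialize the event with the header and body kept separate."""
        return json.dumps({
            "header": {
                "name": self.header.name,
                "type": self.header.event_type,
                "timestamp": self.header.timestamp.isoformat(),
            },
            "body": self.body,
        })


event = Event(
    header=EventHeader(
        name="OrderPlaced",
        event_type="business",
        timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    ),
    body={"order_id": 42, "total": 19.99},
)
serialized = event.to_json()
```

Keeping the header separate lets infrastructure code filter or route on the name, type, and timestamp without having to understand the body.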
Events can be triggered by the business process or by many different types of activities.
Events are written to a common log. One of the key characteristics of an event stream is that its events are strictly ordered (within a partition) and durable. One key difference between Azure Service Bus and event streaming is that, with event streaming, the clients don't subscribe to the stream. The client has the flexibility to read from any part of the stream, and this opens multiple possibilities.
One major advantage is that a client can read from any part of the stream and the client is solely responsible for advancing their position in the stream. This enables the client to join at any given time and to replay events.
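A short sketch makes the client-managed position concrete. The `EventLog` and `Reader` classes below are hypothetical and model a single partition in memory; the point is only that the log keeps no per-client state and each reader moves its own offset.

```python
class EventLog:
    """Hypothetical append-only log for a single partition."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, offset, max_events=10):
        return self._events[offset:offset + max_events]


class Reader:
    """Each client tracks its own position; the log keeps no per-client state."""

    def __init__(self, log, start_offset=0):
        self._log = log
        self.offset = start_offset

    def poll(self, max_events=10):
        batch = self._log.read(self.offset, max_events)
        self.offset += len(batch)  # the client advances its own position
        return batch


log = EventLog()
for reading in (20.5, 21.0, 21.7, 22.4):
    log.append({"temperature": reading})

early = Reader(log)
first_batch = early.poll(max_events=2)

# A client that joins later can still start from the beginning...
late = Reader(log, start_offset=0)
late_batch = late.poll()

# ...and any client can replay by rewinding its own offset.
early.offset = 0
replayed = early.poll()
```

Because reading does not remove anything from the log, the late reader and the replaying reader both see all four events.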
Event correlation is the process of trying to identify the cause of a situation or condition when massive amounts of data points (potentially related to the situation) exist.
If you need to receive and process millions of events per second, Azure Event Hubs is the ideal solution. Typical use cases include tracking and monitoring telemetry collected from industrial machines, mobile devices, and connected vehicles. Another example is capturing in-game events in console applications.
Event Hubs works with low latency and at a massive scale, and serves as the on-ramp for big data:
The following screenshot is a canonical implementation of event processing on Azure:
The Advanced Message Queuing Protocol 1.0 is a standardized framing and transfer protocol for asynchronously, securely, and reliably transferring messages between two parties. It is the primary protocol for Azure Service Bus Messaging and Azure Event Hubs. Both services also support HTTPS. The proprietary SBMP protocol that is also supported is being phased out in favor of AMQP.
AMQP 1.0 is the result of broad industry collaboration that brought together middleware vendors, such as Microsoft and Red Hat, with many messaging middleware users, such as JP Morgan Chase representing the financial services industry. The technical standardization forum for the AMQP protocol and extension specifications is OASIS, and AMQP 1.0 has achieved formal approval as an international standard as ISO/IEC 19464.
Event Hubs contain the following key elements:
- Event producers/publishers: The event can be published via AMQP or HTTPS.
- Capture: An Azure Storage blob is used as the data storage repository for the events.
- Partitions: If a consumer wants to read a specific subset or partition of the event stream, partitions will provide the required options for the consumer.
- SAS tokens: Identity and authentication for the event publisher are provided by SAS tokens.
- Event consumers (receiver): Event consumers connect using AMQP 1.0. Any entity can read event data from an Event Hub.
- Consumer groups: Consumer groups provide scale by giving each consuming application a separate view of the event stream, enabling those consumers to act independently.
- Throughput units: Throughput units provide scaling options. The customer can pre-purchase units of capacity. A single partition has a maximum scale of one throughput unit.
Azure Service Bus works on the competing consumer pattern. In the competing consumer pattern, multiple consumers process messages from the same queue, as the following figure illustrates.
This improves scalability and availability, and the pattern is useful for asynchronous message processing:
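The competing consumer pattern can be sketched with threads pulling from one shared in-memory queue. The worker count, the message values, and the `consumer` function are illustrative assumptions; a real deployment would use separate processes or machines receiving from a Service Bus queue.

```python
import queue
import threading

work = queue.Queue()
for n in range(20):
    work.put(n)

processed = []  # (worker name, message) pairs
processed_lock = threading.Lock()


def consumer(name):
    """Take messages until the queue is empty; each message goes to one worker."""
    while True:
        try:
            message = work.get_nowait()
        except queue.Empty:
            return  # nothing left to compete for
        with processed_lock:
            processed.append((name, message))
        work.task_done()


workers = [threading.Thread(target=consumer, args=(f"worker-{i}",)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

handled = sorted(message for _, message in processed)
```

Which worker handles which message varies from run to run, but every message is handled exactly once, which is the property the pattern relies on.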
Event Hubs, on the other hand, works on the concept of partitions. An Event Hub is composed of multiple partitions that receive messages from publishers. As the volume of messages increases, the number of partitions can be increased to handle the additional load.
Having more partitions increases the capacity to handle messages and provides higher throughput:
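One common way to spread events over partitions is to hash a key carried by each event, so that events with the same key always land in the same partition and keep their relative order. The sketch below assumes such a key-based scheme; the `partition_for` function, the SHA-256 hash, and the partition count of four are illustrative choices and are not the algorithm Event Hubs itself uses.

```python
import hashlib
from collections import defaultdict

PARTITION_COUNT = 4  # illustrative value


def partition_for(partition_key, partition_count=PARTITION_COUNT):
    """Map a key to a partition with a stable hash (an assumed scheme)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


partitions = defaultdict(list)

events = [
    ("device-a", 1), ("device-b", 1), ("device-a", 2),
    ("device-c", 1), ("device-b", 2), ("device-a", 3),
]
for device_id, sequence in events:
    partitions[partition_for(device_id)].append((device_id, sequence))

# Order is preserved per device because a device always maps to one partition.
device_a = [seq for part in partitions.values() for dev, seq in part if dev == "device-a"]
```

Each partition can then be read by a different consumer in parallel, which is where the additional throughput comes from.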
In summary, real-time streaming is all around us, be it a simple thermostat, your car telemetry, or household electric meter data. Data is constantly streamed without anyone realizing it. For instance, when you are driving a car, the onboard computer is constantly doing calculations on telemetry data on the fly.
The final decision maker when it comes to the car is the driver in the driving seat. The same may not be true in other scenarios, such as modern-day collision avoidance systems. If the onboard computer has enough data points to conclude that the car will collide with the car in front, it will decide to slow you down. That's where real-time decision making comes into play.
The key objective of this book is to get you started on a very strong basis with event processing using Azure; as a reader, you can go on to do bigger and better things using this technology.
Before we dive further into this broad topic, let's see some of the core basics you need to know to get started.
An event-driven architecture can involve building a Publish/Subscribe model, an event streaming model, and a processing system:
- Publish/Subscribe: The underlying message infrastructure keeps track of subscriptions. Each subscriber will receive an event when it gets published.
  - After the event is received, it cannot be replayed and new subscribers cannot see the event. In other words, you get only one opportunity to process the message. There is no way to go back to the message to re-process or retry it.
- Event streaming: In event streaming, clients are independent of the event producers, and they read from a common logging system.
  - The client can read from any part of the stream and is responsible for advancing its position in the stream.
  - It also gives clients the flexibility to join at any time and replay events as they want. One key feature of event streaming is that, in a given partition, events are sequentially ordered and durable.
- Processing system: Message or event data is simply data with a timestamp. This data needs to be processed by applying business logic or rules to derive or create an outcome. There are three well-known processing systems:
  - Simple event processing
  - Event stream processing
  - Complex event processing
An event immediately triggers an action in the consumer. For instance, you can use an Azure Function that executes when it receives a message on a Service Bus topic.
Simple event processing (SEP) is used whenever you need to handle events in a simple way. There are not many differences between the events; the system will just process all of them.
In simple event processing, multiple single events land in the processing engine. The events are filtered, transformed, split, and routed. A classic example is URL matching in a web server. Let's say you have a shopping portal where users can select different products, or click on different or the same products. In that instance, the web server filters each request based on the URL it receives from the interaction and routes it accordingly.
The key characteristic of simple event processing is that a single event is processed without looking at other events. Events are processed one at a time.
The following are the stages of the SEP:
- Filter: Filtering the event stream for a specific type of event
- Transform: Transforming events schema from one form to another
- Enrich: Augmenting the event payload with additional data
- Split: Splitting the events into multiple events and processing them
- Route: Moving the event from one channel or stream to another
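The five stages listed above can be sketched as plain Python functions applied to one event at a time. The event shape, the `catalog` lookup table, and the channel names are hypothetical, loosely based on the shopping portal example.

```python
def filter_stage(event):
    """Filter: keep only click events."""
    return event["type"] == "click"


def transform_stage(event):
    """Transform: rename fields into the downstream schema."""
    return {"kind": event["type"], "path": event["url"], "items": event["items"]}


def enrich_stage(event, catalog):
    """Enrich: attach the product category looked up from reference data."""
    section = event["path"].strip("/").split("/")[0]
    return {**event, "category": catalog.get(section, "unknown")}


def split_stage(event):
    """Split: emit one event per item in the original payload."""
    for item in event["items"]:
        yield {"kind": event["kind"], "category": event["category"], "item": item}


def route_stage(event, channels):
    """Route: append the event to the channel named after its category."""
    channels.setdefault(event["category"], []).append(event)


def process(event, catalog, channels):
    # Each event is handled on its own; no state is shared between events.
    if not filter_stage(event):
        return
    enriched = enrich_stage(transform_stage(event), catalog)
    for part in split_stage(enriched):
        route_stage(part, channels)


catalog = {"books": "media", "phones": "electronics"}
channels = {}
incoming = [
    {"type": "click", "url": "/books/123", "items": ["b-1", "b-2"]},
    {"type": "scroll", "url": "/books/123", "items": []},
    {"type": "click", "url": "/phones/9", "items": ["p-9"]},
]
for event in incoming:
    process(event, catalog, channels)
```

The scroll event is filtered out, the first click is split into two events on the `media` channel, and the second click ends up on the `electronics` channel.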
Continuous streams of data are processed in real time by applying a series of operations (stream processors) on each data point. The event stream processors (ESP) will act to process or transform the stream of data.
For example, one can use data streaming platforms, such as Azure IoT Hub or Apache Kafka, to act as a pipeline that ingests events and feeds them to stream processors, as showcased in the following illustration. Depending on the scale and complexity, there may be more than one stream processor working on various subsystems of a given application. This approach is a good fit for clickstream analytics, IoT and device telemetry, and credit fraud detection:
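A chain of stream processors can be sketched with Python generators, where each processor consumes one stream and produces another. The input format and the three processor functions are assumptions for illustration, and the in-memory iterator stands in for an ingestion platform.

```python
from collections import Counter


def parse(lines):
    """Stream processor 1: turn raw 'user,page' lines into dictionaries."""
    for line in lines:
        user, page = line.strip().split(",")
        yield {"user": user, "page": page}


def only_product_pages(events):
    """Stream processor 2: drop events that are not product views."""
    for event in events:
        if event["page"].startswith("/product/"):
            yield event


def running_view_counts(events):
    """Stream processor 3: emit the updated view count after each event."""
    counts = Counter()
    for event in events:
        counts[event["page"]] += 1
        yield event["page"], counts[event["page"]]


# Stand-in for an ingestion pipeline such as a hub or broker.
clickstream = iter([
    "alice,/product/1",
    "bob,/home",
    "carol,/product/1",
    "alice,/product/2",
])

results = list(running_view_counts(only_product_pages(parse(clickstream))))
```

Because generators are lazy, each event flows through all three processors before the next one is read, which mirrors how data points are handled as they arrive. Unlike simple event processing, the last processor keeps state across events.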
Forrester defines a CEP platform as a software infrastructure that can detect patterns of events (and expected events that didn’t occur) by filtering, correlating, contextualizing, and analyzing data captured from disparate live data sources to respond as defined using the platform’s development tools.
Complex event processing (CEP) is a subset of event stream processing. CEP enables you to gain insights from large volumes of data in near real-time by monitoring, analyzing, and acting on data while it is in motion. Data is typically generated by business or system events such as placing an order or adding a message to a queue. CEP is the continuous monitoring and processing of events from multiple sources on a near real-time basis. Since CEP enables the analysis of data in real-time, it lends itself to predictive scenarios to enable more proactive decisions.
Typical scenarios may include:
- Monitoring the effectiveness of key performance indicators (KPIs) by using data from event streams
- Monitoring the health and availability of servers, networks and service level threshold compliance
- Fraud detection
- Stock ticker analysis—taking action when certain events occur or price points are achieved
- Performance history—predicting spikes
- Buying patterns (what product/pricing combinations are most popular)
The concept behind CEP is the aggregation of information over a time window or looking for a pattern and generating a notification when the aggregation of data or pattern breaches a defined condition. The emphasis is placed on detection of the event.
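The idea of aggregating over a time window and raising a notification when a condition is breached can be sketched as follows. The 60-second tumbling window, the threshold of 80, and the `window_alerts` function are illustrative assumptions. For simplicity the function works on a finite list of readings, whereas a real CEP engine would emit results continuously as each window closes.

```python
from collections import defaultdict

WINDOW_SECONDS = 60   # illustrative tumbling-window length
THRESHOLD = 80.0      # illustrative alert condition


def window_alerts(events, window_seconds=WINDOW_SECONDS, threshold=THRESHOLD):
    """Average each tumbling window and report windows above the threshold.

    `events` is an iterable of (timestamp_in_seconds, value) pairs.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start].append(value)

    alerts = []
    for window_start in sorted(windows):
        values = windows[window_start]
        average = sum(values) / len(values)
        if average > threshold:
            alerts.append((window_start, average))
    return alerts


cpu_readings = [
    (5, 70.0), (30, 75.0), (55, 71.0),      # window starting at 0
    (65, 90.0), (90, 85.0), (110, 95.0),    # window starting at 60
    (130, 60.0),                            # window starting at 120
]
alerts = window_alerts(cpu_readings)
```

Only the window starting at 60 seconds has an average above the threshold, so it is the only one reported.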
CEP has its origins in the stock market and, because of this fact, it is tuned for low latency and often responds in a few milliseconds or sub-milliseconds. Some of the events can be ignored without impact.
Internet of things (IoT) applications are very good use cases for CEP, since they produce time series data that is auto-correlated. IoT use cases are usually complex and go beyond aggregation and calculation of data. These types of cases need complex operations such as time windows and temporal query patterns. The availability of temporal operators makes it easy to process time series data efficiently. The following figure showcases the CEP flow: