You're reading from Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Edition: 1st Edition
Author: Behram Irani

Behram Irani is currently a technology leader with Amazon Web Services (AWS) specializing in data, analytics and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. In the last 6 years working at AWS, Behram has been a thought leader in the data, analytics and AI/ML space; publishing multiple papers and leading the digital transformation efforts for many organizations across the globe. Behram has completed his Bachelor of Engineering in Computer Science from the University of Pune and has an MBA degree from the University of Florida.

Streaming Data Ingestion

In this chapter, we will look at the following key topics:

  • The need for streaming architectures and its challenges
  • Streaming data ingestion using Amazon Kinesis
  • Streaming data ingestion using Amazon MSK
  • Streaming services usage patterns

Chapter 3, Batch Data Ingestion, covered multiple ways of ingesting data in batches. Batch ingestion is still the bedrock of many data pipelines because it serves so many business use cases. For many of these use cases, analytics can be performed on data that is not fresh; that is, the data is not available for consumption in the analytics environment as soon as it is produced in the source system. For a very long time, deriving reactive insights from data was fine, as OLAP systems were meant to perform analytics on data that was typically a day old.

However, data in these modern times gets generated in large volumes and moves...

The need for streaming architectures and its challenges

The value of insights from data often diminishes over time. Figure 4.1 shows how the value of data to the decision-making process decreases as time passes:

Figure 4.1 – Time value of data toward decision making

For organizations to do real-time analytics, data needs to be ingested from the source, processed immediately, and stored in the destination as soon as the event occurs. This allows organizations to derive insights from the data in real time. Getting data in real time has many advantages:

  • Getting data in real time for analytics helps businesses make faster decisions and stay ahead of the competition
  • Analyzing real-time data allows early detection of security threats and anomalies in the data
  • IoT systems continuously send data in the form of events, and all this data needs to be captured and stored for analytics
  • Log data...

Streaming data ingestion using Amazon Kinesis

Amazon Kinesis was created specifically to alleviate all the pain points associated with setting up and operating a streaming platform. Organizations want to build real-time streaming data pipelines that make it easy to collect, process, and analyze data in real time. That’s what Kinesis brings to the table. It’s a serverless, fully managed, and scalable service for handling real-time streaming use cases. It seamlessly integrates with other AWS services and you only pay for what you use.

There are a lot of use cases that do different things with streaming data. Some use cases require the data to be processed and analyzed with the lowest latency possible; some use cases can withstand some delays in getting the data but expect the data to be compacted and aggregated for query efficiency; and some use cases require the data to be analyzed as it’s passing through the stream itself.
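To make the ingestion side concrete, here is a minimal sketch of a producer that sends one event to a Kinesis data stream. It is an illustration under stated assumptions rather than the book's own code: the helper names `build_record` and `send_event`, the `device_id` field, and the stream name in the usage note are all hypothetical. The client is passed in so the sketch can be exercised without AWS credentials; in practice it would be a boto3 Kinesis client, whose `put_record` call takes `StreamName`, `Data`, and `PartitionKey`.

```python
import json
from typing import Any, Dict


def build_record(event: Dict[str, Any], key_field: str) -> Dict[str, Any]:
    """Turn one event into the payload arguments for a put_record call."""
    return {
        # Kinesis carries the payload as bytes, so serialize the event to JSON.
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        "PartitionKey": str(event[key_field]),
    }


def send_event(
    client: Any,
    stream_name: str,
    event: Dict[str, Any],
    key_field: str = "device_id",  # hypothetical field name, for illustration
) -> Any:
    """Send a single event to the named stream using the supplied client."""
    record = build_record(event, key_field)
    return client.put_record(StreamName=stream_name, **record)
```

With boto3 installed and credentials configured, a call might look like `send_event(boto3.client("kinesis"), "clickstream-demo", {"device_id": "d1", "temp": 21})`, where `clickstream-demo` is a made-up stream name. Choosing a partition key with many distinct values helps spread records across shards.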

As you may recall, one of the tenets...

Streaming data ingestion using Amazon MSK

Apache Kafka is a very popular open source distributed event streaming platform. For years now, organizations of all kinds have been using Kafka to power their event-driven systems. Kafka provides sub-second latency and is highly scalable.

One downside of using this open source framework is that you have to set up, manage, and operationalize production-grade infrastructure. This means making sure the system is highly resilient and scalable, is always patched with software updates, has all the bells and whistles, such as logging, monitoring, and notification setup, and, of course, is performant and cost-effective. Doing all this is sometimes error-prone and complex to manage.

In the era of cloud computing, organizations want all the advantages of Kafka but don’t want to deal with managing all the infrastructure behind the scenes. This is where Amazon Managed Streaming for Apache Kafka (MSK) comes to the rescue. MSK...

Streaming services usage patterns

Any architecture pattern you come up with for your organization’s use case has many dimensions to it. Some of the factors that influence these decisions are overall costs, the specifics of functional and non-functional requirements, people skillsets, future use cases, preference for a specific service, and so forth. Let’s get into some other use cases that can be solved using a combination of the AWS streaming services we covered in this chapter.

Use case for streaming change data in S3 data lakes

The IT team likes using AWS DMS to capture change data from relational databases into the raw zone of the data lake. However, DMS creates tons of tiny files that then need to be consolidated into the conformed layer of the data lake in S3. For many data sources, this setup works well and the data pipeline is performant and cost-effective. However, for certain extremely large ERP systems, the volume of CDC data generates millions of tiny...
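The idea of consolidating many tiny files into fewer, larger ones can be sketched in a few lines. The example below only plans the grouping: it takes a listing of (object key, size in bytes) pairs and packs them, in order, into batches that stay near a target size. This is a simplified illustration, not how DMS or the pipeline in this chapter does it; the function name `plan_compaction`, the object keys, and the target size are all assumptions made for the example.

```python
from typing import List, Tuple


def plan_compaction(
    files: List[Tuple[str, int]], target_bytes: int
) -> List[List[str]]:
    """Group (key, size) pairs, in order, into batches near target_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for key, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        # A file larger than the target still gets a batch of its own.
        current.append(key)
        current_size += size

    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be read and rewritten as a single larger file in the conformed layer. Keeping the files in their original order matters for change data, since later changes to a row must be applied after earlier ones.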

Summary

In this chapter, we looked at how businesses benefit from using real-time data for analytics. We introduced Amazon Kinesis Data Streams, Amazon Data Firehose, and Kinesis Data Analytics as the streaming services we would use in modern data architecture and saw how customers use these services in their data platforms. We also looked at Apache Kafka as an open source framework for supporting streaming use cases and how Amazon MSK provides a scalable, secure, easy-to-manage, and cost-effective platform for using Kafka as the data streaming engine. Finally, we looked at some other use cases that combine purpose-built AWS streaming services into a seamless data pipeline. Streaming use cases always come up in everything we do, so look out for some more design patterns later in this book. Enjoy doing the hands-on workshops in the next chapter!
