You're reading from Modern Data Architecture on AWS

Product type: Book

Published in: Aug 2023

Publisher: Packt

ISBN-13: 9781801813396

Edition: 1st Edition

Concepts

Data Science

Author (1)

Behram Irani

Streaming Data Ingestion

In this chapter, we will look at the following key topics:

The need for streaming architectures and its challenges
Streaming data ingestion using Amazon Kinesis
Streaming data ingestion using Amazon MSK
Streaming services usage patterns

Chapter 3, Batch Data Ingestion, covered multiple ways of ingesting data in batches. Batch data ingestion is still the bedrock of many data pipelines because it serves so many business use cases. For many of these use cases, analytics can be performed on data that’s not fresh – that is, data that isn’t available for consumption in the analytics environment as soon as it’s produced in the source system. For a very long time, deriving reactive insights from data was fine, as OLAP systems were meant to perform analytics on data that was typically a day old.

However, data in these modern times gets generated in large volumes and moves...

The need for streaming architectures and its challenges

The value of insights from data often diminishes as time passes. Figure 4.1 represents the value of data to the decision-making process, which decreases over time:

Figure 4.1 – Time value of data toward decision making

For organizations to do real-time analytics, data needs to be ingested from the source, processed immediately, and stored in the destination as soon as the event occurs. This allows organizations to derive insights from the data in real time. Getting data in real time has many advantages:

Getting data in real time for analytics helps businesses make faster decisions and stay ahead of the competition
Analyzing real-time data allows early detection of security threats and anomalies in the data
IoT systems continuously send data in the form of events, and all this data needs to be captured and stored for analytics
Log data...

Streaming data ingestion using Amazon Kinesis

Amazon Kinesis was created specifically to alleviate all the pain points associated with setting up and operating a streaming platform. Organizations want to build real-time streaming data pipelines that make it easy to collect, process, and analyze data in real time. That’s what Kinesis brings to the table. It’s a serverless, fully managed, and scalable service for handling real-time streaming use cases. It seamlessly integrates with other AWS services and you only pay for what you use.
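
As a rough illustration of what collecting data into a stream looks like from the producer's side, the following sketch wraps a single event as a Kinesis record. It assumes a boto3 Kinesis client is passed in (its put_record call takes StreamName, Data, and PartitionKey). The helpers build_record and send_event, the device_id field, and the stream name used in the usage note are hypothetical names chosen for this example.

```python
import json


def build_record(event, key_field="device_id"):
    """Serialize an event dict into keyword arguments for put_record.

    key_field is a hypothetical attribute used as the partition key, so
    that events from the same device are routed to the same shard.
    """
    return {
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


def send_event(client, stream_name, event):
    """Send one event to a Kinesis data stream.

    client is assumed to be a boto3 Kinesis client, for example
    boto3.client("kinesis"). Any object with a compatible put_record
    method also works, which keeps the sketch testable without AWS access.
    """
    record = build_record(event)
    return client.put_record(StreamName=stream_name, **record)
```

With boto3 installed and credentials configured, a call might look like send_event(boto3.client("kinesis"), "clickstream-events", {"device_id": 42, "temp": 21.5}), where the stream name is again an assumption.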

There are a lot of use cases that do different things with streaming data. Some use cases require the data to be processed and analyzed with the lowest latency possible; some use cases can withstand some delays in getting the data but expect the data to be compacted and aggregated for query efficiency; and some use cases require the data to be analyzed as it’s passing through the stream itself.
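
To make the last kind of use case more concrete, here is a minimal sketch of analyzing data while it is still in flight, using a tumbling (fixed, non-overlapping) window count. It is plain Python rather than any Kinesis API. The event shape (a timestamp in seconds and a key) and the function name tumbling_window_counts are assumptions made for the example.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per key within fixed, non-overlapping time windows.

    events is an iterable of dicts with a numeric "timestamp" (seconds)
    and a "key"; both field names are assumptions for this sketch.
    Returns a dict mapping each window's start time to a Counter of keys.
    """
    windows = defaultdict(Counter)
    for event in events:
        # Align the timestamp down to the start of its window.
        window_start = int(event["timestamp"] // window_seconds) * window_seconds
        windows[window_start][event["key"]] += 1
    return dict(windows)
```

A real streaming engine would also have to deal with late and out-of-order events, which this sketch deliberately ignores.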

As you may recall, one of the tenets...

Streaming data ingestion using Amazon MSK

Apache Kafka is a very popular open source distributed event streaming platform. For years now, organizations of all kinds have been using Kafka to power their event-driven systems. Kafka provides sub-second latency and is a highly scalable framework.

One downside of using this open source framework is that you have to set up, manage, and operationalize production-grade infrastructure. This means making sure the system is highly resilient and scalable, is always patched with software updates, has all the bells and whistles, such as logging, monitoring, and notification setup, and, of course, is performant and cost-effective. Doing all this is sometimes error-prone and complex to manage.

In the era of cloud computing, organizations want all the advantages of Kafka but don’t want to deal with managing all the infrastructure behind the scenes. This is where Amazon Managed Streaming for Apache Kafka (MSK) comes to the rescue. MSK...

Streaming services usage patterns

Any architecture pattern you come up with for your organization’s use case has many dimensions to it. Some of the factors that influence these decisions are overall costs, the specifics of functional and non-functional requirements, people skillsets, future use cases, preference for a specific service, and so forth. Let’s get into some other use cases that can be solved using a combination of the AWS streaming services we covered in this chapter.

Use case for streaming change data in S3 data lakes

The IT team likes using AWS DMS to capture change data from relational databases into the raw zone of the data lake. However, DMS creates tons of tiny files that then need to be consolidated into the conformed layer of the data lake in S3. For many data sources, this setup works well and the data pipeline is performant and cost-effective. However, for certain extremely large ERP systems, the volume of CDC data generates millions of tiny...
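
The consolidation step mentioned above can be pictured as grouping many small objects into fewer, larger ones. The sketch below is only a planning helper under stated assumptions, not how DMS or any AWS service performs compaction: it takes hypothetical (name, size) pairs and groups them into batches of roughly a target size, which a separate job could then merge. The function name plan_compaction and the 128 MB default are illustrative choices.

```python
def plan_compaction(files, target_bytes=128 * 1024 * 1024):
    """Group small files into batches of roughly target_bytes each.

    files is a list of (name, size_in_bytes) tuples. Files are taken in
    the given order, and a batch is closed once adding the next file
    would exceed the target. A file larger than the target ends up in a
    batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Keeping the files in arrival order is a deliberate simplification here, since change records usually need to be applied in sequence.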

Summary

In this chapter, we looked at how businesses benefit by leveraging real-time data for analytics. We introduced Amazon Kinesis Data Streams, Amazon Data Firehose, and Kinesis Data Analytics as the streaming services we would use in modern data architecture and how customers leverage these services in their data platform. We also looked at Apache Kafka as an open source framework for supporting streaming use cases and how Amazon MSK provides a scalable, secure, easy-to-manage, and cost-effective platform for using Kafka as the data streaming engine. Finally, we looked at some other use cases that combine streaming services to create a seamless data pipeline using purpose-built AWS services. Streaming use cases always come up in everything we do, so look out for some more design patterns later in this book. Enjoy doing the hands-on workshops in the next chapter!

References

Amazon Kinesis workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/2300137e-f2ac-4eb9-a4ac-3d25026b235f/en-US
Amazon MSK workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/c2b72b6f-666b-4596-b8bc-bafa5dcca741/en-US


Author (1)

Behram Irani

Behram Irani is currently a technology leader with Amazon Web Services (AWS) specializing in data, analytics, and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. In the last 6 years working at AWS, Behram has been a thought leader in the data, analytics, and AI/ML space, publishing multiple papers and leading the digital transformation efforts for many organizations across the globe. Behram has completed his Bachelor of Engineering in Computer Science from the University of Pune and has an MBA degree from the University of Florida.