You're reading from Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Edition: 1st Edition
Author (1)
Behram Irani

Behram Irani is currently a technology leader with Amazon Web Services (AWS) specializing in data, analytics and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. In the last 6 years working at AWS, Behram has been a thought leader in the data, analytics and AI/ML space; publishing multiple papers and leading the digital transformation efforts for many organizations across the globe. Behram has completed his Bachelor of Engineering in Computer Science from the University of Pune and has an MBA degree from the University of Florida.

Data Processing

In this chapter, we will look at the following key topics:

  • Challenges with data processing platforms
  • Data processing using Amazon EMR
  • Data processing using AWS Glue
  • Data processing using AWS Glue DataBrew

Let’s quickly recap what we have covered so far in this book. We set the foundation by creating the layers of a data lake on Amazon S3. The layers represent distinct storage areas where all the data can exist in a centralized location. The next piece of the puzzle we solved was to get data from disparate sources into the raw layer of the data lake in S3. Then, we spent the whole of Chapter 3 looking at batch data ingestion mechanisms, followed by Chapter 4, where we discussed streaming data ingestion mechanisms.

So, up to this point, all the data is in the raw layer of S3; of course, it can also go directly to the conformed layer if you processed and optimized it on the fly during ingestion. If you recall...

Challenges with data processing platforms

Data processing, or data transformation, is an essential part of any data pipeline, and data engineers play a big role in making sure that the data reaches its final destination, where it’s ready for consumption. Over the past decade, the volume, velocity, and variety of data have made data processing challenging. Data turned into big data, and processing all of it sequentially on powerful monolithic systems turned out to be inefficient. Data processing techniques took a positive direction when Apache Hadoop introduced a horizontally scaling framework. Hadoop was able to process big data much more efficiently by spreading the work across many commodity hardware machines.

Even though Hadoop was promising, the MapReduce way of processing big data was not fast enough for many organizations. The creation of Apache Spark changed the way we process data, and even today, many modern data processing systems and platforms primarily use Spark...
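To make the MapReduce model mentioned above concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Spark code; the function names and the sample input are made up for illustration. It only shows the three phases (map, shuffle, reduce) that a framework would distribute across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of input, as a mapper would."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: each string stands in for one block of a large file.
chunks = ["big data needs big clusters", "data lakes store raw data"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, adding machines lets more chunks be processed in parallel, which is the horizontal scaling idea described above.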

Data processing using Amazon EMR

Amazon EMR is a platform that enables big data processing at scale. It’s a managed service that bundles over 20 open source frameworks, including popular data processing engines such as Hadoop, Spark, Hive, Presto, and Trino. It was created specifically to address the data processing platform challenges discussed in the previous section.

There is so much to EMR that a separate book exists describing every aspect of it in detail. The purpose of this book, however, is not to explain all of those aspects but to understand when EMR can be used and which use cases it helps to solve. But let’s first get an overview of EMR.

Amazon EMR overview

EMR provides all the necessary tools that are required to process data at scale. EMR manages the underlying software and hardware needed to provide a cost-effective, scalable, and easy-to-manage data platform. The best way to get an overview of a service is...
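As a rough illustration of what "EMR manages the underlying software and hardware" looks like from the caller's side, the sketch below builds the request for a short-lived cluster that runs one Spark step and then terminates. The parameter names follow the boto3 EMR `run_job_flow` call, but the bucket paths, cluster name, release label, instance types, and the use of the default EMR roles are all assumptions chosen for the example, not values from the book.

```python
def build_emr_request(script_uri, log_uri):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (names, release label, instance sizes) are
    illustrative assumptions; adjust them for a real account.
    """
    return {
        "Name": "example-spark-cluster",
        "ReleaseLabel": "emr-6.15.0",
        "LogUri": log_uri,
        # EMR installs and configures the requested frameworks.
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Shut the cluster down once the steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
            },
        }],
        # Default role names; assumed to already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request(
    "s3://example-bucket/scripts/transform.py",  # hypothetical script location
    "s3://example-bucket/emr-logs/",             # hypothetical log location
)
# With AWS credentials configured, the cluster could be launched with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as a plain dictionary makes it easy to review or unit test before any AWS resources are created.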

Data processing using AWS Glue

If you recall our conversations from the last few chapters, we kept bringing up AWS Glue for multiple use cases, including data catalogs, crawlers, classifiers, and batch ingestion using connectors. Now, we come to Glue ETL, which is Glue’s most distinctive feature. Since Glue is a fully managed, serverless service, it excels at data transformation tasks, which are usually undertaken by data engineers in an organization. You can create Glue ETL jobs using Spark, Python, or Ray. Spark is a common platform for creating distributed computing-based ETL jobs. Since both EMR and Glue provide Spark, in the following table, let’s try to simplify certain scenarios where you would prefer to use one over the other:

Data processing using AWS Glue DataBrew

In the quest to build an end-to-end data platform, IT teams in organizations spend a significant amount of time creating data processing ETL pipelines. Typically, data processing is the responsibility of data engineers, who have to understand the rules of data transformations and then implement them. This means that other personas in the organization, such as data scientists or data analysts, have to rely on data engineers to help them with the structure of data they are looking for in their day-to-day tasks. The change cycles involve ETL, normalizing, cleaning the data, and finally, orchestrating and deploying in automated data pipelines. The whole process takes weeks and sometimes months. This creates a bottleneck and delays the final business outcomes.

AWS Glue DataBrew solves this exact problem by providing a serverless, no-code data preparation service, specifically targeted at data scientists and data analysts. With DataBrew, end users...
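DataBrew itself is a no-code, visual service, so the following is not DataBrew code. It is a small plain-Python sketch, with made-up helper names and sample rows, of the two ideas the service is built around: profiling a dataset to find quality issues, and then applying an ordered list of cleaning rules to it.

```python
def profile(rows):
    """Count missing and distinct values per column."""
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def trim_and_upper(column):
    """Rule: strip whitespace and upper-case a text column."""
    def step(row):
        value = row.get(column)
        if isinstance(value, str):
            row = {**row, column: value.strip().upper()}
        return row
    return step


def fill_missing(column, default):
    """Rule: replace empty or missing values with a default."""
    def step(row):
        if row.get(column) in (None, ""):
            row = {**row, column: default}
        return row
    return step


def apply_steps(rows, steps):
    """Apply each rule, in order, to every row."""
    cleaned = []
    for row in rows:
        for step in steps:
            row = step(row)
        cleaned.append(row)
    return cleaned


# Hypothetical sample data with inconsistent casing and gaps.
rows = [
    {"country": " us ", "amount": 10},
    {"country": "US", "amount": None},
    {"country": "", "amount": 5},
]

before = profile(rows)
cleaned = apply_steps(rows, [
    trim_and_upper("country"),
    fill_missing("country", "UNKNOWN"),
    fill_missing("amount", 0),
])
after = profile(cleaned)
print(before)
print(after)
```

Comparing the profile before and after the rules run is the kind of feedback loop an analyst would otherwise need a data engineer's ETL job to get.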

Summary

In this chapter, we covered data processing, a major topic in the modern data architecture journey. We looked at how you can use Amazon EMR to solve many big data processing use cases. EMR provides a fully managed platform for many open source projects, including the most popular ones: Spark, Hive, and Presto. We also revisited AWS Glue and looked at how Glue Studio assists data engineers in creating complex ETL jobs for data processing. We also covered a Glue streaming use case and how it complements the other streaming services that AWS provides. Finally, we looked at AWS Glue DataBrew and how it helps data scientists and data analysts quickly profile data and apply data processing rules in an intuitive manner.

There are many more use cases that can be solved using some of these services, but at least what we covered in this chapter gives a basic understanding of solving typical use cases for these services. As always, the best way to learn is to be hands...


EMR typical usage

Glue ETL typical usage

Since EMR alleviates all the infrastructure and operational heavy...