Modern Data Architecture on AWS

Product type Book
Published in Aug 2023
Publisher Packt
ISBN-13 9781801813396
Pages 420 pages
Edition 1st Edition
Author (1): Behram Irani

Table of Contents (24) Chapters

Preface
1. Part 1: Foundational Data Lake
2. Prologue: The Data and Analytics Journey So Far
3. Chapter 1: Modern Data Architecture on AWS
4. Chapter 2: Scalable Data Lakes
5. Part 2: Purpose-Built Services And Unified Data Access
6. Chapter 3: Batch Data Ingestion
7. Chapter 4: Streaming Data Ingestion
8. Chapter 5: Data Processing
9. Chapter 6: Interactive Analytics
10. Chapter 7: Data Warehousing
11. Chapter 8: Data Sharing
12. Chapter 9: Data Federation
13. Chapter 10: Predictive Analytics
14. Chapter 11: Generative AI
15. Chapter 12: Operational Analytics
16. Chapter 13: Business Intelligence
17. Part 3: Govern, Scale, Optimize And Operationalize
18. Chapter 14: Data Governance
19. Chapter 15: Data Mesh
20. Chapter 16: Performant and Cost-Effective Data Platform
21. Chapter 17: Automate, Operationalize, and Monetize
22. Index
23. Other Books You May Enjoy

Data Processing

In this chapter, we will look at the following key topics:

  • Challenges with data processing platforms
  • Data processing using Amazon EMR
  • Data processing using AWS Glue
  • Data processing using AWS Glue DataBrew

Let’s quickly recap what we have covered so far in this book. We set the foundation by creating the layers of a data lake on Amazon S3. The layers represent distinct storage areas where all the data can exist in a centralized location. The next piece of the puzzle we solved was to get data from disparate sources into the raw layer of the data lake in S3. Then, we spent the whole of Chapter 3 looking at batch data ingestion mechanisms, followed by Chapter 4, where we discussed streaming data ingestion mechanisms.

So, up to this point, all the data is in the raw layer of S3; of course, it can also go directly to the conformed layer if you processed and optimized the data on the fly during ingestion. If you recall...

Challenges with data processing platforms

Data processing, or data transformation, is an essential part of any data pipeline, and data engineers play a big role in making sure that the data reaches its final destination, where it’s ready for consumption. Over the past decade, the volume, velocity, and variety of data have made data processing challenging. Data turned into big data, and processing all of it sequentially on powerful monolithic systems turned out to be inefficient. Data processing took a turn for the better when Apache Hadoop introduced a horizontally scaling framework. Hadoop was able to process big data much more efficiently by spreading the work across many commodity hardware machines.
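To make the horizontal-scaling idea concrete, here is a minimal sketch of the map, shuffle, and reduce steps popularized by Hadoop, written as a word count in plain Python. It is not Hadoop code: the function names are hypothetical, and threads stand in for the separate machines of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(lines):
    """Emit (word, 1) pairs for one partition of the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_partitions):
    """Group all emitted values by key across every partition."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, num_workers=3):
    # Split the input into one partition per (hypothetical) worker node.
    partitions = [lines[i::num_workers] for i in range(num_workers)]
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_phase, partitions))
    return reduce_phase(shuffle(mapped))


sample = ["big data big clusters", "data lakes on S3", "big lakes"]
counts = word_count(sample)
```

Because each partition is mapped independently, adding workers spreads the map phase across more machines, while the final counts stay the same regardless of how the input is split.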

Even though Hadoop was promising, the MapReduce way of processing big data was not fast enough for many organizations. The creation of Apache Spark changed the way we process data, and even today, many modern data processing systems and platforms primarily use Spark...

Data processing using Amazon EMR

Amazon EMR is a platform that enables big data processing at scale. It’s a managed service that contains over 20 open source frameworks, including popular data processing engines such as Hadoop, Spark, Hive, Presto, and Trino. It was created specifically to address the challenges with data processing platforms that we just went through.
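As a rough illustration of how those frameworks are selected, the sketch below assembles the kind of request that the boto3 EMR client's `run_job_flow` call accepts. It only builds a dictionary and does not call AWS. The helper name, cluster name, release label, instance types, and node counts are assumptions chosen for illustration, and the two role names are the EMR default roles, which may differ in your account.

```python
def build_emr_cluster_request(name, release_label, applications, core_nodes=2):
    """Build keyword arguments for boto3's EMR run_job_flow (illustrative values)."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Each open source framework to install is listed by name.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": core_nodes,
                },
            ],
            # Terminate the cluster once its steps finish (transient cluster).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names; adjust as needed
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_cluster_request("demo-cluster", "emr-6.15.0", ["Spark", "Hive"])
# With credentials configured, this could be submitted as:
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as plain data makes it easy to review or version-control the cluster definition before anything is launched.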

There is so much information about EMR that a separate book exists that describes every aspect of it in detail. The purpose of this book, however, is not to explain all of these aspects but to understand when EMR can be used and which use cases it helps solve. But let’s first get an overview of EMR.

Amazon EMR overview

EMR provides all the necessary tools that are required to process data at scale. EMR manages the underlying software and hardware needed to provide a cost-effective, scalable, and easy-to-manage data platform. The best way to get an overview of a service is...

Data processing using AWS Glue

If you recall our conversations from the last few chapters, we kept bringing up AWS Glue for multiple use cases, including data catalogs, crawlers, classifiers, and batch ingestion using connectors. Now, we come to Glue ETL, which is the most distinctive feature of Glue. Since Glue is a fully managed and serverless service, it excels at data transformation tasks, which are usually undertaken by data engineering personas in an organization. You can create Glue ETL jobs using Spark, Python, or Ray. Spark is a common platform for creating distributed computing-based ETL jobs. Since both EMR and Glue provide Spark, in the following table, let’s try to simplify certain scenarios where you would prefer to use one over the other:

Data processing using AWS Glue DataBrew

In the quest to build an end-to-end data platform, IT teams in organizations spend a significant amount of time creating data processing ETL pipelines. Typically, data processing is the responsibility of data engineers, who have to understand the rules of data transformations and then implement them. This means that other personas in the organization, such as data scientists or data analysts, have to rely on data engineers to help them with the structure of data they are looking for in their day-to-day tasks. The change cycles involve ETL, normalizing, cleaning the data, and finally, orchestrating and deploying in automated data pipelines. The whole process takes weeks and sometimes months. This creates a bottleneck and delays the final business outcomes.

AWS Glue DataBrew solves this exact problem by providing a serverless, no-code data preparation service, specifically targeted at data scientists and data analysts. With DataBrew, end users...

Summary

In this chapter, we covered a major topic around data processing in the modern data architecture journey. We looked at how you can use Amazon EMR to solve many big data processing use cases. EMR provides a fully managed platform for many open source projects, including the most popular ones: Spark, Hive, and Presto. We also revisited AWS Glue and looked at how Glue Studio assists data engineers in creating complex ETL jobs for data processing. We also covered a Glue streaming use case and how it complements the other streaming services that AWS provides. Finally, we looked at AWS Glue DataBrew and how it helps data scientists and data analysts quickly profile data and apply data processing rules in an intuitive manner.

There are many more use cases that can be solved using some of these services, but at least what we covered in this chapter gives a basic understanding of solving typical use cases for these services. As always, the best way to learn is to be hands...


EMR typical usage

Glue ETL typical usage

Since EMR alleviates all the infrastructure and operational heavy...