
You're reading from Simplify Big Data Analytics with Amazon EMR

Product type: Book
Published in: Mar 2022
Publisher: Packt
ISBN-13: 9781801071079
Edition: 1st
Author (1)
Sakti Mishra

Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technology and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud platforms such as AWS and Google Cloud Platform. Sakti has a bachelor's degree in engineering and a master's degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He has also authored multiple technology blogs, workshops, and white papers, and is a public speaker who represents AWS at various events.

Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi

In the previous two chapters, we learned how to implement a batch ETL pipeline with Amazon EMR and real-time streaming with Spark Streaming. In this chapter, we will learn how to implement UPSERT or merge on your Amazon S3 data lake using the Apache Hudi framework integrated with Apache Spark.

Amazon S3 is immutable by default, which means you cannot update the content of an object or file in S3 in place. Instead, you have to read the object, modify its content, and write a new object. As data lake and lake house architectures become popular, organizations increasingly look for update capabilities on Amazon S3 and other object stores. Frameworks such as Apache Hudi, Apache Iceberg, and AWS Lake Formation governed tables now offer ACID transactions and UPSERT capabilities on data lakes.
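Because S3 objects cannot be changed in place, an update amounts to read, merge by key, and rewrite. The merge step can be sketched in plain Python (the helper name and record shape here are illustrative, not part of Hudi or any framework):

```python
def upsert_records(existing, updates, key):
    """Merge updated records into an existing dataset by key.

    `existing` and `updates` are lists of dicts. Records in `updates`
    replace records in `existing` that share the same key value, and
    records with new keys are appended. This mimics the read-modify-write
    cycle an UPSERT framework performs against immutable object storage.
    """
    merged = {rec[key]: rec for rec in existing}
    merged.update({rec[key]: rec for rec in updates})
    return list(merged.values())

existing = [{"id": 1, "qty": 5}, {"id": 2, "qty": 3}]
updates = [{"id": 2, "qty": 7}, {"id": 3, "qty": 1}]  # one change, one new row
result = upsert_records(existing, updates, "id")
```

Frameworks such as Hudi do this at scale, rewriting only the affected files and tracking the changes in metadata rather than rewriting the whole dataset.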

Apache Hudi is a popular open source framework that is integrated into Amazon EMR and AWS Glue and is also very...

Technical requirements

In this chapter, we will showcase interactive development using an EMR notebook and the Apache Spark and Apache Hudi frameworks. So, before getting started, make sure you have the following:

Now, let's dive deep into the use case and hands-on implementation steps starting with the overview of Apache Hudi.

Check out the following video to see the Code in Action at https://bit.ly/3svY3i9

Apache Hudi overview

Apache Hudi is a popular open source framework that provides record-level transaction support on top of data lakes. The Hudi framework integrates with open file formats such as Parquet and stores additional metadata to support its operations.

Apache Hudi provides several capabilities; the following are the most popular:

  • UPSERT on top of data lakes
  • Support for transactions and rollbacks
  • Integration with popular distributed processing engines such as Spark, Hive, Presto, and Trino
  • Automatic file compaction in data lakes
  • The option to query recent update views or past transaction snapshots
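The last capability above, querying recent updates, maps to Hudi's incremental query mode in Spark. A minimal sketch follows; the option names come from Hudi's Spark datasource configuration, while the commit timestamp and table path are illustrative placeholders:

```python
# Incremental query: return only the records written after a given commit.
# Option names follow Hudi's Spark datasource configs; the begin instant
# time (yyyyMMddHHmmss format) and the table path are placeholders.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20220301000000",
}

def read_incremental(spark, table_path):
    # `spark` is an active SparkSession with the Hudi bundle on the classpath
    return (spark.read.format("hudi")
            .options(**incremental_options)
            .load(table_path))
```

Switching `hoodie.datasource.query.type` back to its default snapshot mode returns the latest view of every record instead of only the recent changes.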

Hudi supports both read-heavy and write-heavy workloads. When you write data to an Amazon S3 data lake using the Hudi APIs, you can specify either of the following storage types:

  • Copy on Write (CoW): This is the default storage type, which creates a new version of the file and stores the output in Parquet format...
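A CoW write can be sketched with Spark's datasource API as follows. The option names follow Hudi's documented Spark configs, but the table name, key fields, and S3 path are placeholders for illustration:

```python
# Hudi write options for an UPSERT into a Copy on Write table.
# Option names follow Hudi's Spark datasource configs; the table name,
# record key, precombine field, and path are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    # On key collisions, the record with the larger precombine value wins
    "hoodie.datasource.write.precombine.field": "updated_at",
}

def upsert_to_hudi(df, path="s3://my-bucket/hudi/orders"):
    # `df` is a Spark DataFrame holding the changed records
    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save(path))
```

With these options, repeated writes of records sharing the same `order_id` update the existing rows rather than appending duplicates.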

Creating an EMR cluster and an EMR notebook

Before getting started with our use case, we need to create an EMR cluster and then create an EMR notebook that points to it. Let's assume this is a long-running cluster that stays active to support your development workloads, since you plan to do interactive development with EMR notebooks.

Now let's learn how to create the EMR cluster and notebook.

Creating an EMR cluster

As explained in Chapter 5, Setting Up and Configuring EMR Clusters, to create an EMR cluster, follow these steps:

  1. Navigate to Amazon EMR's Create cluster screen at https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1#quick-create.
  2. Select Go to advanced options and, from the advanced options screen, select the latest stable release. We selected emr-6.4.0 because it was the latest stable release at the time of writing. From the Applications list, make sure you select...
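Equivalently, the console steps above can be scripted with the AWS CLI. This is a rough sketch only: the cluster name, instance type and count, key pair, and region are placeholders you would adjust, and your account needs the default EMR roles already created.

```shell
# Create a long-running EMR cluster with the applications this chapter
# relies on (Spark for processing; Livy and JupyterEnterpriseGateway for
# EMR notebook connectivity). Replace the placeholder names and region.
aws emr create-cluster \
  --name "hudi-dev-cluster" \
  --release-label emr-6.4.0 \
  --applications Name=Spark Name=Hive Name=Livy Name=JupyterEnterpriseGateway \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --region us-east-1
```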

Interactive development with Spark and Hudi

Our EMR cluster and notebook are now ready for use. Let's learn how to do interactive development using an EMR notebook.

For interactive development, we are considering a use case where we will integrate the Hudi framework with Spark to do UPSERT (update/merge) operations on top of an S3 data lake.

Let's navigate to our EMR notebook to get started.

Creating a PySpark notebook for development

To get started, in Jupyter Notebook, choose New and then PySpark, as shown in the following screenshot:

Figure 11.8 – The Jupyter Notebook landing page

This will create a new PySpark notebook. You can write scripts in each cell and execute the cells one at a time, which makes development and debugging easier.

Next, we will learn how to integrate Hudi libraries with the notebook.

Integrating Hudi with our PySpark notebook

By default, Hudi libraries are not available in our EMR notebook. To make them...
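One common approach on EMR releases of this era, per the AWS EMR documentation, is to copy the Hudi jars that ship with the cluster (such as /usr/lib/hudi/hudi-spark-bundle.jar) into an HDFS location like hdfs:///apps/hudi/lib, and then reference them from a %%configure cell at the top of the notebook. The cell below is a sketch following that pattern; verify the jar paths and settings against your EMR release before using them:

```
%%configure -f
{
  "conf": {
    "spark.jars": "hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.hive.convertMetastoreParquet": "false"
  }
}
```

The %%configure magic must run before any other cell, because it restarts the Spark session with the supplied configuration.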

Summary

Over the course of this chapter, we have dived deep into Apache Hudi, looked at its features and use cases, and seen how it integrates with AWS and Amazon EMR.

We have covered how to create an EMR notebook that points to a long-running EMR cluster and how to use the notebook for interactive development. To showcase interactive development, we explained a small use case using Spark and Hudi, which can enable you to do UPSERT transactions on top of a data lake.

That concludes this chapter! Hopefully, this has helped you get an idea of how to use EMR notebooks for interactive development. In the next chapter, we will explain how to build a workflow that orchestrates a data pipeline using Amazon EMR.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume your data science team uses EMR notebooks for interactive development, primarily writing Python 3 for machine learning model development. When they started executing their code, some scripts failed with an error stating that a Python module does not exist. How would you make the additional Python modules available in the EMR notebook so that the team can continue developing their machine learning models?
  2. Assume you have an S3 data lake on top of which you have created Hudi tables for ACID transactions and UPSERT. You are updating records as they change, which creates multiple versions of the records in the Hudi table. You have received a business requirement to find the value of a specific column at a specific time. How would you fulfill that requirement...

Further reading

Here are a few resources you can refer to for further reading:
