Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR

Throughout the previous chapters, we have explained what Amazon EMR is, what its features are, how it integrates with AWS services, and how you can build batch and streaming ETL pipelines using EMR. If you are about to start your big data analytics journey, you can get started with Amazon EMR and other AWS analytics services right away, but there are a lot of customers who are already using Hadoop and Spark in their on-premises environments and are planning to migrate to the AWS cloud.

If you have Hive, Spark, or Hadoop workloads running in an on-premises Hadoop cluster, then there are several factors you need to consider before migrating to AWS, such as support for the Hadoop services you are using, their versions, how security will work in AWS, and what your migration strategy should be.

In this chapter, we will walk through possible migration approaches, options for migrating...

Understanding migration approaches

Migrating from an on-premises environment to the AWS cloud provides several benefits, including decoupling your compute and storage so that each can scale independently, better security with the AWS infrastructure, the flexibility to design pipelines by integrating other AWS analytics services, and freeing up the resources you would otherwise spend managing infrastructure so that you can focus on application development.

When you plan to migrate your on-premises Hadoop cluster to EMR, you need to analyze how your cluster will work in AWS, compare this with your on-premises environment, and then plan the migration accordingly. The following are a few of the things you need to analyze:

  • Which Hadoop ecosystem services are you using, and are they all supported in AWS?
  • If the Hadoop services you use are supported, which EMR release version is the closest to your on-premises Hadoop version?
  • Does your on-premises cluster use HDFS as a persistent...

Migrating data and metadata catalogs

As we learned earlier, using Amazon S3 as the persistent data store is the recommended approach when migrating your workloads to AWS or Amazon EMR. If your on-premises environment does not use Amazon S3 as the persistent data store, or your existing cluster has Hive Metastore tables, then you need to plan for migrating both data and metadata.

Let's understand what options we have when planning to migrate on-premises cluster data and/or metadata catalogs.

Migrating data

To migrate your on-premises datasets to Amazon S3 or other storage solutions in AWS, you can consider the following tools and services AWS offers:

  • Offline data movement using AWS Snowball and Snowmobile, which help to migrate petabyte- and exabyte-scale datasets.
  • For faster online data movement, integrate AWS Direct Connect, which provides a dedicated network connection between your data center and AWS for data transfers.
  • Use Hadoop's distcp command to do a distributed copy from on...
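As an illustration of the distributed copy option, the following is a minimal sketch that submits an S3DistCp step to an existing EMR cluster using boto3. The cluster ID, HDFS path, and bucket name are hypothetical placeholders, so adjust --src and --dest to your own locations.

import boto3

# Hypothetical identifiers used only for illustration
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SOURCE_HDFS_PATH = "hdfs:///data/warehouse/sales"
TARGET_S3_PATH = "s3://my-target-bucket/data/warehouse/sales"

emr = boto3.client("emr", region_name="us-east-1")

# Submit an S3DistCp step; S3DistCp is EMR's S3-optimized variant of distcp
# and runs the copy as a distributed job across the cluster's nodes.
response = emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "Copy data to Amazon S3 with S3DistCp",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", SOURCE_HDFS_PATH,
                    "--dest", TARGET_S3_PATH,
                ],
            },
        }
    ],
)
print("Submitted step:", response["StepIds"][0])

Note that this sketch copies data that the EMR cluster can already reach; to copy data directly out of an on-premises cluster, you would typically run distcp on that cluster instead, provided it has network connectivity to Amazon S3, for example over Direct Connect.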

Migrating ETL jobs and Oozie workflows

If you are doing a lift and shift and your ETL scripts are configured to read from and write to HDFS, then your existing ETL scripts, such as Hive, MapReduce, and Spark jobs, will work just fine in EMR without substantial changes. But if, while migrating to AWS, you re-architect to use Amazon S3 as your persistent layer instead of HDFS, then you will have to change your scripts to interact with Amazon S3 (s3://) using EMRFS.

Important Note

Prior to the release of Amazon EMR 5.22.0, EMR supported the s3a:// and s3n:// prefixes for interacting with EMRFS. These prefixes haven't been deprecated and still work, but it is now recommended to use the s3:// prefix, which provides a higher level of security and easier integration with Amazon S3.
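To illustrate the kind of change involved, the following is a minimal PySpark sketch, assuming a hypothetical my-data-lake-bucket bucket and Parquet datasets. The only difference from the HDFS version is that the read and write paths now use the s3:// prefix, which EMRFS resolves when the job runs on EMR.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-s3-migration").getOrCreate()

# Before migration, the job read from and wrote to HDFS paths, for example:
#   df = spark.read.parquet("hdfs:///data/warehouse/orders")
# After re-architecting around Amazon S3, the same job points at s3:// paths.
df = spark.read.parquet("s3://my-data-lake-bucket/warehouse/orders")

# The transformation logic itself does not change.
aggregated = df.groupBy("order_date").count()

aggregated.write.mode("overwrite").parquet(
    "s3://my-data-lake-bucket/curated/orders_by_date"
)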

Apart from your Hive and Spark scripts, if you are using Apache Oozie for workflow orchestration of your ETL jobs, then you need to plan for its migration too. Let's understand what options you have for...

Testing and validation

In the previous sections, we learned how we can migrate data, metadata, ETL jobs, and workflows, but after the migration is complete, it is essential to validate it with a proper testing strategy.

Your options for data validation will vary based on the methodology you used for your migration. We previously explained the different phases of a migration where you migrate the data and metadata separately. So, let's now understand how you can validate the data quality for each of those phases.

Validating metadata quality

We discussed migrating the Hive and Oozie Metastores, and you can apply the same approach to migrate the Hue Metastore too. All of them use a relational database as their Metastore, which means we have the option of executing standard SQL statements to count records or validate data.
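As an example, the following is a minimal sketch, assuming a MySQL-backed Hive Metastore, that compares table counts between the source and target Metastore databases with standard SQL. The hostnames and credentials are hypothetical placeholders, and TBLS is the table in the standard Hive Metastore schema that holds one row per Hive table.

import pymysql

# Hypothetical connection details for the source (on-premises) and target
# (post-migration) Hive Metastore databases; adjust hosts and credentials.
SOURCE = {"host": "onprem-metastore.example.com", "user": "hive",
          "password": "<password>", "database": "hive"}
TARGET = {"host": "target-metastore.example.com", "user": "hive",
          "password": "<password>", "database": "hive"}

def count_rows(conn_params, query):
    # Run a single COUNT(*) query and return the numeric result
    conn = pymysql.connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchone()[0]
    finally:
        conn.close()

query = "SELECT COUNT(*) FROM TBLS"
source_tables = count_rows(SOURCE, query)
target_tables = count_rows(TARGET, query)

print(f"Source tables: {source_tables}, target tables: {target_tables}")
assert source_tables == target_tables, "Table counts do not match after migration"

The same pattern extends to other Metastore tables, such as those tracking databases and partitions, or to spot-checking individual table definitions on both sides.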

Let's look at a few of the options we can consider to validate our metadata migration:

  • Relational data migration...

Best practices for migration

The following are some of the best practices you should follow when onboarding your solutions into a cloud-native architecture:

  • Split batch and interactive or streaming workloads: Look for opportunities to build transient EMR workloads so that your persistent cluster resources are not sitting idle when you don't have any processes running. Of course, you might have other workloads where a persistent cluster is required, such as interactive development or real-time streaming workloads, so it's better to identify which workloads need the persistent cluster and then move the other workloads to transient, job-specific EMR clusters (see the sketch after this list).
  • DevOps automation: For launching clusters or other AWS resources, consider integrating AWS CloudFormation to automate the creation of the required infrastructure resources. This increases efficiency when you plan to launch the same set of resources and configurations in multiple environments, such as development...
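As a starting point for the transient-cluster pattern mentioned in the first bullet, the following is a minimal boto3 sketch that launches a cluster, runs a single Spark step, and terminates the cluster once the step completes. The bucket names, script path, release label, and instance types are hypothetical placeholders, and the example assumes the default EMR_DefaultRole and EMR_EC2_DefaultRole IAM roles exist in your account.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster: KeepJobFlowAliveWhenNoSteps=False makes the
# cluster terminate automatically once all submitted steps have finished.
response = emr.run_job_flow(
    Name="transient-nightly-etl",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-logs-bucket/emr/",
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "Nightly Spark ETL",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-artifacts-bucket/jobs/nightly_etl.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched transient cluster:", response["JobFlowId"])

In practice, you would codify this in your DevOps tooling, for example as a CloudFormation template or a CI/CD pipeline job, rather than calling the API ad hoc, which is the point of the second bullet.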

Summary

Over the course of this chapter, we have provided an overview of the different migration strategies you can follow while migrating your on-premises Hadoop workloads to AWS, along with how you can migrate data, metadata, and ETL jobs. We then covered a few testing and validation strategies you can follow to check the quality of your data, and also discussed some of the best practices you can follow during the migration process.

That concludes this chapter! Hopefully, this helped you get an idea of how you can plan your migration and some of the aspects you should be considering. In the next chapter, we will examine some of the best practices for EMR, along with how you can optimize costs while integrating your ETL flows.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume you have several on-premises Hadoop workloads, out of which a few are subject to sensitive customer SLAs, and your organization has decided to move all workloads to AWS. Which migration strategy do you think is ideal for your use case?
  2. Assume you have around 100 petabytes of data in your on-premises environment and you are planning to migrate the data to Amazon S3. Looking at the volume of data, which data migration strategy or tool do you think is best for your use case?
  3. Assume you have completed the migration of your on-premises environment, which included several Hadoop workloads and hundreds of terabytes of data. Now you are looking for ways to validate the data quality in Amazon S3. Which tool or utility will be helpful to check the quality of the data?
