Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR

Throughout the previous chapters, we have explained what Amazon EMR is, what its features are, how it integrates with AWS services, and how you can build batch and streaming ETL pipelines using EMR. If you are about to start your big data analytics journey, you can get started with Amazon EMR and other AWS analytics services right away, but there are a lot of customers who are already using Hadoop and Spark in their on-premises environments and are planning to migrate to the AWS cloud.

If you have Hive, Spark, or Hadoop workloads running in an on-premises Hadoop cluster, then there are several factors you need to consider before migrating to AWS, such as support for the Hadoop services you are using, their versions, how security will work in AWS, and what your migration strategy should be.

In this chapter, we will walk through possible migration approaches, options for migrating...

Understanding migration approaches

Migrating from an on-premises environment to the AWS cloud provides several benefits, including decoupling your compute and storage so that each can scale independently, better security with the AWS infrastructure, the flexibility to design pipelines by integrating other AWS analytics services, and freeing up the resources you would otherwise spend managing infrastructure so that you can focus on application development.

When you plan to migrate your on-premises Hadoop cluster to EMR, you need to analyze how your cluster will work in AWS, compare this with your on-premises environment, and then plan the migration accordingly. The following are a few of the things you need to analyze:

  • Which Hadoop ecosystem services are you using, and are they all supported in AWS?
  • If the Hadoop services you use are supported, which EMR release version is the closest to your on-premises Hadoop version?
  • Does your on-premises cluster use HDFS as a persistent...

Migrating data and metadata catalogs

As we learned earlier, using Amazon S3 as the persistent data store is the recommended approach when migrating your workloads to AWS or Amazon EMR. If your on-premises environment does not use Amazon S3 as the persistent data store, or your existing cluster has Hive Metastore tables, then you need to plan for migrating both data and metadata.

Let's understand what options we have when planning to migrate on-premises cluster data and/or metadata catalogs.

Migrating data

To migrate your on-premises datasets to Amazon S3 or other storage solutions in AWS, you can consider the following tools and services AWS offers:

  • Offline data movement using AWS Snowball and Snowmobile, which help to migrate petabyte- and exabyte-scale datasets.
  • For faster online data movement, integrate AWS Direct Connect, which provides a dedicated network connection between your data center and AWS for data transfers.
  • Use Hadoop's distcp command to do a distributed copy from on...
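As an illustration of the distributed copy option, the following is a minimal sketch that submits an S3DistCp step to an existing EMR cluster using boto3. The cluster ID, HDFS path, and bucket name are hypothetical placeholders, so adjust --src and --dest to your own locations.

import boto3

# Hypothetical identifiers used only for illustration
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SOURCE_HDFS_PATH = "hdfs:///data/warehouse/sales"
TARGET_S3_PATH = "s3://my-target-bucket/data/warehouse/sales"

emr = boto3.client("emr", region_name="us-east-1")

# Submit an S3DistCp step; S3DistCp is EMR's S3-optimized variant of distcp
# and runs the copy as a distributed job across the cluster's nodes.
response = emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "Copy data to Amazon S3 with S3DistCp",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", SOURCE_HDFS_PATH,
                    "--dest", TARGET_S3_PATH,
                ],
            },
        }
    ],
)
print("Submitted step:", response["StepIds"][0])

Note that this sketch copies data that the EMR cluster can already reach; to copy data directly out of an on-premises cluster, you would typically run distcp on that cluster instead, provided it has network connectivity to Amazon S3, for example over Direct Connect.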

Migrating ETL jobs and Oozie workflows

If you are doing a lift and shift and your ETL scripts are configured to read from and write to HDFS, then your existing ETL scripts, such as Hive, MapReduce, and Spark jobs, will work just fine in EMR without substantial changes. But if, while migrating to AWS, you re-architect to use Amazon S3 as your persistent layer instead of HDFS, then you will have to change your scripts to interact with Amazon S3 (s3://) using EMRFS.

Important Note

Prior to the release of Amazon EMR 5.22.0, EMR supported the s3a:// and s3n:// prefixes for interacting with EMRFS. These prefixes haven't been deprecated and still work, but it is now recommended to use the s3:// prefix, which provides a higher level of security and easier integration with Amazon S3.
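To illustrate the kind of change involved, the following is a minimal PySpark sketch, assuming a hypothetical my-data-lake-bucket bucket and Parquet datasets. The only difference from the HDFS version is that the read and write paths now use the s3:// prefix, which EMRFS resolves when the job runs on EMR.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-s3-migration").getOrCreate()

# Before migration, the job read from and wrote to HDFS paths, for example:
#   df = spark.read.parquet("hdfs:///data/warehouse/orders")
# After re-architecting around Amazon S3, the same job points at s3:// paths.
df = spark.read.parquet("s3://my-data-lake-bucket/warehouse/orders")

# The transformation logic itself does not change.
aggregated = df.groupBy("order_date").count()

aggregated.write.mode("overwrite").parquet(
    "s3://my-data-lake-bucket/curated/orders_by_date"
)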

Apart from your Hive and Spark scripts, if you are using Apache Oozie for workflow orchestration of your ETL jobs, then you need to plan for its migration too. Let's understand what options you have for...

Testing and validation

In the previous sections, we learned how we can migrate data, metadata, ETL jobs, and workflows, but after the migration is complete, it is essential to validate it with a proper testing strategy.

Your options for data validation will vary based on the methodology you used for your migration. We previously explained the different phases of a migration where you migrate the data and metadata separately. So, let's now understand how you can validate the data quality for each of those phases.

Validating metadata quality

We discussed migrating the Hive and Oozie Metastores, and you can apply the same approach to migrate the Hue Metastore too. All of them use a relational database as their Metastore, which means we have the option of executing standard SQL statements to count records or validate data.
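As an example, the following is a minimal sketch, assuming a MySQL-backed Hive Metastore, that compares table counts between the source and target Metastore databases with standard SQL. The hostnames and credentials are hypothetical placeholders, and TBLS is the table in the standard Hive Metastore schema that holds one row per Hive table.

import pymysql

# Hypothetical connection details for the source (on-premises) and target
# (post-migration) Hive Metastore databases; adjust hosts and credentials.
SOURCE = {"host": "onprem-metastore.example.com", "user": "hive",
          "password": "<password>", "database": "hive"}
TARGET = {"host": "target-metastore.example.com", "user": "hive",
          "password": "<password>", "database": "hive"}

def count_rows(conn_params, query):
    # Run a single COUNT(*) query and return the numeric result
    conn = pymysql.connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchone()[0]
    finally:
        conn.close()

query = "SELECT COUNT(*) FROM TBLS"
source_tables = count_rows(SOURCE, query)
target_tables = count_rows(TARGET, query)

print(f"Source tables: {source_tables}, target tables: {target_tables}")
assert source_tables == target_tables, "Table counts do not match after migration"

The same pattern extends to other Metastore tables, such as those tracking databases and partitions, or to spot-checking individual table definitions on both sides.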

Let's look at a few of the options we can consider to validate our metadata migration:

  • Relational data migration...

Best practices for migration

The following are some of the best practices you should follow when onboarding your solutions into a cloud-native architecture:

  • Split batch and interactive or streaming workloads: Look for opportunities to build transient EMR workloads so that your persistent cluster resources are not sitting idle when you don't have any processes running. Of course, you might have other workloads where a persistent cluster is required, such as interactive development or real-time streaming workloads, so it's better to identify which workloads need the persistent cluster and then move the other workloads to transient, job-specific EMR clusters (see the sketch after this list).
  • DevOps automation: For launching clusters or other AWS resources, consider integrating AWS CloudFormation to automate the creation of the required infrastructure resources. This increases efficiency when you plan to launch the same set of resources and configurations in multiple environments, such as development...
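As a starting point for the transient-cluster pattern mentioned in the first bullet, the following is a minimal boto3 sketch that launches a cluster, runs a single Spark step, and terminates the cluster once the step completes. The bucket names, script path, release label, and instance types are hypothetical placeholders, and the example assumes the default EMR_DefaultRole and EMR_EC2_DefaultRole IAM roles exist in your account.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster: KeepJobFlowAliveWhenNoSteps=False makes the
# cluster terminate automatically once all submitted steps have finished.
response = emr.run_job_flow(
    Name="transient-nightly-etl",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-logs-bucket/emr/",
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "Nightly Spark ETL",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-artifacts-bucket/jobs/nightly_etl.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched transient cluster:", response["JobFlowId"])

In practice, you would codify this in your DevOps tooling, for example as a CloudFormation template or a CI/CD pipeline job, rather than calling the API ad hoc, which is the point of the second bullet.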

Summary

Over the course of this chapter, we have provided an overview of the different migration strategies you can follow while migrating your on-premises Hadoop workloads to AWS, along with how you can migrate data, metadata, and ETL jobs. We then covered a few testing and validation strategies you can follow to check the quality of your data, and also discussed some of the best practices you can follow during the migration process.

That concludes this chapter! Hopefully, this helped you get an idea of how you can plan your migration and some of the aspects you should be considering. In the next chapter, we will examine some of the best practices for EMR, along with how you can optimize costs while integrating your ETL flows.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume you have several on-premises Hadoop workloads, out of which a few are subject to sensitive customer SLAs, and your organization has decided to move all workloads to AWS. Which migration strategy do you think is ideal for your use case?
  2. Assume you have around 100 petabytes of data in your on-premises environment and you are planning to migrate the data to Amazon S3. Looking at the volume of data, which data migration strategy or tool do you think is best for your use case?
  3. Assume you have completed the migration of your on-premises environment, which included several Hadoop workloads and hundreds of terabytes of data. Now you are looking for ways to validate the data quality in Amazon S3. Which tool or utility will be helpful to check the quality of the data?
