You're reading from  Simplify Big Data Analytics with Amazon EMR

Product type: Book
Published in: Mar 2022
Publisher: Packt
ISBN-13: 9781801071079
Edition: 1st Edition
Author: Sakti Mishra

Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technology and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud platforms such as AWS and Google Cloud Platform. Sakti has a bachelor's degree in engineering and a master's degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He is also the author of multiple technology blogs, workshops, and white papers, and a public speaker who represents AWS across various domains and events.

Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA

In the previous few chapters, we explained how you can leverage EMR clusters for on-demand ETL jobs or as long-running clusters that either execute a real-time streaming application or serve as a backend for interactive notebook development. But when we build a data pipeline to automate data ingestion, cleansing, or transformation, we look for orchestration tools with which we can build workflows that are kicked off either on a schedule or by an event.

Two orchestration tools are particularly popular for building data pipelines with Amazon EMR – AWS Step Functions and Apache Airflow. AWS also provides a managed offering of Airflow, called Amazon Managed Workflows for Apache Airflow (MWAA).

In this chapter, we will provide an overview of AWS Step Functions and MWAA services and then explain how you can leverage them to orchestrate a data pipeline that...

Technical requirements

In this chapter, we will showcase the features of AWS Step Functions and MWAA and demonstrate how you can integrate them to trigger EMR jobs. So, before getting started, make sure you have the following requirements to hand.

  • An AWS account with access to create Amazon S3, Amazon EMR, AWS Step Functions, and MWAA resources
  • An IAM user with access to create IAM roles, which will be used to trigger or execute jobs

Now let's get an overview of these orchestration tools and learn how we can integrate them.

Overview of AWS Step Functions

AWS Step Functions is a serverless workflow service that integrates natively with many AWS services, which means you can create a workflow that invokes actions of any of the supported AWS services.

AWS Step Functions provides both a visual interface and a JSON-based definition approach to designing workflows. With the visual interface, you can drag and drop different AWS service actions and modify their parameters to compose a workflow. In addition to the visual interface, Step Functions also lets you code your workflow with a JSON-based definition called a state machine, where each step is referred to as a state. Step Functions also provides a few frequently used sample projects, which you can use as a starting point and modify for your use case.
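To make the state machine/state terminology concrete, here is a minimal, hedged sketch: a state machine built as a Python dictionary and serialized to the JSON that Step Functions accepts. The state names and the injected result value are illustrative, not from the book.

```python
import json

# A minimal two-state machine: a Pass state that injects a static result,
# followed by a Succeed state that ends the execution.
state_machine = {
    "Comment": "Minimal state machine illustrating states and transitions",
    "StartAt": "PrepareInput",
    "States": {
        "PrepareInput": {
            "Type": "Pass",
            "Result": {"bucket": "my-example-bucket"},  # illustrative value
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}

# This JSON string is what you would paste into the Step Functions console
# or pass to the CreateStateMachine API as the definition.
definition = json.dumps(state_machine, indent=2)
print(definition)
```

Each state declares its type and, via `Next` or `End`, where control flows afterward; that is all a state machine definition fundamentally is.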

You can integrate AWS Step Functions to automate IT business processes, to build data or machine learning pipelines, or to design real-time, event...

Integrating AWS Step Functions to orchestrate EMR jobs

AWS Step Functions supports the createCluster, createCluster.sync, terminateCluster, terminateCluster.sync, addStep, cancelStep, setClusterTerminationProtection, modifyInstanceFleetByName, and modifyInstanceGroupByName EMR actions, which provide great flexibility to build workflows on top of EMR.

Let's assume that you would like to build a workflow that is triggered as soon as a file arrives in S3, and whose objective is to execute a Spark + Hudi job to process the input file. The workflow should create a transient EMR cluster, submit a Spark job that performs ETL transforms, and then, upon completion of the job, terminate the cluster. You can easily build this workflow using AWS Step Functions' createCluster, addStep, and terminateCluster actions.

The following JSON definition is an example of a Step Functions state that is of the Task type and invokes the EMR createCluster action with parameters...
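Since the book's definition is truncated here, the following is a hedged sketch (not the author's exact definition) of how such a transient-cluster workflow commonly looks, built as a Python dictionary and serialized to JSON. The `.sync` resource ARNs are the documented Step Functions service integrations for EMR; the cluster name, release label, instance types, IAM role names, and script path are all illustrative.

```python
import json

# Transient-cluster workflow: create cluster -> run Spark step -> terminate.
# The ".sync" variants make each Task state wait for the EMR action to finish.
workflow = {
    "StartAt": "CreateCluster",
    "States": {
        "CreateCluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
            "Parameters": {
                "Name": "transient-etl-cluster",        # illustrative
                "ReleaseLabel": "emr-6.5.0",            # illustrative
                "Applications": [{"Name": "Spark"}],
                "Instances": {
                    "InstanceCount": 3,
                    "MasterInstanceType": "m5.xlarge",  # illustrative
                    "SlaveInstanceType": "m5.xlarge",
                    "KeepJobFlowAliveWhenNoSteps": True,
                },
                "JobFlowRole": "EMR_EC2_DefaultRole",
                "ServiceRole": "EMR_DefaultRole",
            },
            "ResultPath": "$.Cluster",
            "Next": "RunSparkStep",
        },
        "RunSparkStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId.$": "$.Cluster.ClusterId",
                "Step": {
                    "Name": "spark-etl",
                    "ActionOnFailure": "TERMINATE_CLUSTER",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": [
                            "spark-submit",
                            "s3://my-bucket/scripts/etl_job.py",  # illustrative
                        ],
                    },
                },
            },
            "ResultPath": "$.Step",
            "Next": "TerminateCluster",
        },
        "TerminateCluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
            "Parameters": {"ClusterId.$": "$.Cluster.ClusterId"},
            "End": True,
        },
    },
}

print(json.dumps(workflow, indent=2))
```

The file-arrival trigger described earlier would typically come from an EventBridge rule on the S3 bucket that starts an execution of this state machine.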

Overview of Apache Airflow and MWAA

Apache Airflow is an open source workflow management framework that allows you to build workflows using the Python programming language. It has the following fundamental differences compared to AWS Step Functions:

  • Being an Apache open source project, Airflow not only supports AWS services, but also supports integration with other public cloud providers and open source projects such as Apache Sqoop, Apache Spark, and many more.
  • AWS Step Functions provides a low-code, JSON-based definition, whereas Airflow is more popular with programmers as you need to design a workflow by writing Python scripts.
  • AWS Step Functions is a serverless offering, whereas Airflow needs infrastructure provisioned to act as a cluster on top of which you can run multiple jobs (MWAA reduces this burden by managing the Airflow environment for you).

From a use case perspective, Airflow is a great fit when your workflow involves AWS and non-AWS services. For example, not all your applications are in AWS; a few are on...

Integrating Airflow to trigger EMR jobs

Airflow's Amazon provider package offers the following operators and sensors for interacting with an Amazon EMR cluster:

  • EmrCreateJobFlowOperator: This operator creates an EMR cluster (job flow).
  • EmrJobFlowSensor: This sensor waits on the status of the EMR cluster.
  • EmrAddStepsOperator: With this, you can add steps to the EMR cluster.
  • EmrStepSensor: This sensor waits on the status of an existing step in your EMR cluster.
  • EmrModifyClusterOperator: This is used to modify an existing cluster.
  • EmrTerminateJobFlowOperator: This terminates an existing cluster.

As explained, you can design a workflow in Airflow using the Python programming language, where you define each action and then the sequence of execution. The following is sample Python code that instantiates Airflow's EmrCreateJobFlowOperator, which triggers an EMR create cluster action:

cluster_create_action = EmrCreateJobFlowOperator(
    ...
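Since the snippet above is truncated, here is a hedged, standard-library-only sketch of the configuration dictionaries such a DAG typically builds and passes to these operators. Every name, release label, instance type, and S3 path below is illustrative, and the comments indicate which operator parameter each dictionary would feed.

```python
import json

# Cluster specification, typically passed as
# job_flow_overrides=JOB_FLOW_OVERRIDES to EmrCreateJobFlowOperator.
JOB_FLOW_OVERRIDES = {
    "Name": "airflow-transient-emr",          # illustrative
    "ReleaseLabel": "emr-6.5.0",              # illustrative
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",  # illustrative
                "InstanceCount": 1,
            },
            {
                "Name": "Core nodes",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Step definition, typically passed as steps=SPARK_STEPS
# to EmrAddStepsOperator.
SPARK_STEPS = [
    {
        "Name": "spark-etl",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/etl_job.py"],
        },
    }
]

print(json.dumps({"overrides": JOB_FLOW_OVERRIDES, "steps": SPARK_STEPS}, indent=2))
```

In the DAG itself, the create, add-step, step-sensor, and terminate tasks are then typically chained in order with Airflow's `>>` dependency syntax (for example, create_cluster >> add_steps >> watch_step >> terminate_cluster).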

Summary

Over the course of this chapter, we have provided an overview of AWS Step Functions, Apache Airflow, and MWAA. In addition, we have shared example code blocks to explain how you can define a Step Functions state machine or write Python code to design a workflow for Airflow.

That concludes this chapter! Hopefully, it helped you get an idea of how to integrate these services to design workflows and will provide a starting point for designing more complex data or machine learning pipelines. In the next chapter, we will explain how you can migrate your on-premises Hadoop workloads to Amazon EMR.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume you are designing a data pipeline that needs to process two input files as two parallel steps and then invoke a common ETL process to aggregate the output of these parallel steps. You have decided to leverage AWS Step Functions to orchestrate the pipeline. Which Task types will you be integrating and how?
  2. Assume you have a few Hadoop workloads running on-premises and a few Spark ETL jobs running in Amazon EMR. To simplify orchestration and monitoring, you are looking for an orchestration tool. While comparing different options, you found that AWS Step Functions and MWAA are the two best options. Which of them is better suited to your workload and why?

Further reading

The following are a few resources you can refer to for further reading:

