Simplify Big Data Analytics with Amazon EMR


Product type: Book
Published: Mar 2022
Publisher: Packt
ISBN-13: 9781801071079
Pages: 430
Edition: 1st Edition
Author: Sakti Mishra

Table of Contents (19 chapters)

Preface

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR

  • Chapter 1: An Overview of Amazon EMR
  • Chapter 2: Exploring the Architecture and Deployment Options
  • Chapter 3: Common Use Cases and Architecture Patterns
  • Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR

Section 2: Configuration, Scaling, Data Security, and Governance

  • Chapter 5: Setting Up and Configuring EMR Clusters
  • Chapter 6: Monitoring, Scaling, and High Availability
  • Chapter 7: Understanding Security in Amazon EMR
  • Chapter 8: Understanding Data Governance in Amazon EMR

Section 3: Implementing Common Use Cases and Best Practices

  • Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
  • Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
  • Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
  • Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
  • Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
  • Chapter 14: Best Practices and Cost-Optimization Techniques

Other Books You May Enjoy

Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR

The previous chapters gave you an overview of Amazon EMR (Elastic MapReduce), its architecture, and reference architectures for a few common use cases. This chapter looks more closely at some of the popular big data applications and distributed processing components of the Hadoop ecosystem that are available in EMR, such as Hive, Presto, Spark, HBase, Hue, and Ganglia. It also provides an overview of the machine learning frameworks available in EMR, such as TensorFlow and MXNet.

The chapter closes with the notebook options available in EMR for interactive development: EMR Notebooks, JupyterHub, EMR Studio, and Zeppelin notebooks.

The following topics will be covered in this chapter:

  • Understanding popular big data applications in EMR
  • Understanding machine learning frameworks available in EMR
  • Understanding notebook options...

Technical requirements

In this chapter, we will cover different big data applications available in EMR and how you can access or configure them. Please make sure you have access to the following resources before continuing:

  • An AWS account
  • An IAM user, which has permission to create EMR clusters, EC2 instances, and dependent IAM roles

Now let's dive deep into each of the big data applications and machine learning frameworks available in EMR.

Understanding popular big data applications in EMR

The Hadoop ecosystem and the wider open source community offer many big data applications, and EMR includes several of the most widely used ones. Which applications or components are available in your cluster depends on the EMR release you choose when launching the cluster. Each EMR release bundles specific versions of these applications and ensures that they are compatible with each other, so that the cluster and its jobs run smoothly.

Recent EMR releases include the most popular Hadoop interfaces, and EMR is continuously updated to add new ones as they gain popularity in the open source community. EMR also removes support for applications or components that no longer receive attention from the open source community or from customers. For example, until EMR 3.11.x, you had the option to select Impala...
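To make the link between the release and the application list concrete, here is a minimal sketch that assembles the arguments for a cluster launch request. It assumes you would pass the resulting dictionary to the boto3 EMR client's `run_job_flow` call; the cluster name, instance types, instance count, and the use of the default EMR roles are illustrative assumptions, not values taken from this chapter.

```python
from typing import Any, Dict, List


def build_cluster_request(release_label: str, applications: List[str]) -> Dict[str, Any]:
    """Assemble keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": "chapter4-demo-cluster",  # hypothetical cluster name
        "ReleaseLabel": release_label,  # the release fixes the application versions
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",  # assumed instance type
            "InstanceCount": 3,  # assumed cluster size
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    "emr-6.3.0", ["Hive", "Presto", "Spark", "HBase", "Hue", "Ganglia"]
)
print(request["ReleaseLabel"], [app["Name"] for app in request["Applications"]])

# To actually launch the cluster (needs AWS credentials and incurs cost):
# import boto3
# boto3.client("emr").run_job_flow(**request)
```

Changing only the `release_label` argument is enough to pick up a different set of application versions, which is why the release is usually the first decision when planning a cluster.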

Machine learning frameworks available in EMR

You can configure several machine learning libraries or frameworks in your EMR cluster. TensorFlow and MXNet are two popular ones, and both are available as applications that you can select while creating the cluster.

Although TensorFlow and MXNet are the pre-configured machine learning frameworks in EMR, you can also install alternatives such as PyTorch and Keras as custom libraries.
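One common way to install custom libraries on every node is a bootstrap action that runs a shell script at cluster startup. The sketch below only builds the bootstrap action entry in the shape accepted by the `BootstrapActions` parameter of boto3's `run_job_flow`; the S3 script path and the package arguments are hypothetical, and this is an assumed approach rather than the method the chapter itself prescribes.

```python
from typing import Any, Dict, List, Optional


def build_bootstrap_action(
    name: str, script_path: str, args: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Build one entry for the BootstrapActions list of a run_job_flow request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,  # S3 location of the install script
            "Args": list(args or []),  # arguments passed to the script
        },
    }


# Hypothetical script that would run `pip install` for each argument it receives.
action = build_bootstrap_action(
    "Install custom ML libraries",
    "s3://my-bucket/bootstrap/install_ml_libs.sh",  # hypothetical path
    ["torch", "keras"],
)
print(action)
```

The resulting dictionary would be placed in a list and passed as `BootstrapActions=[action]` alongside the other cluster launch arguments.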

Now let's get an overview of the TensorFlow and MXNet applications in EMR.

TensorFlow

TensorFlow is an open source platform for developing machine learning models. It provides tools, libraries, and community resources that help researchers and data scientists develop and deploy machine learning models easily.

TensorFlow has been available in EMR since the 5.17.0 release, and the more recent 6.3.0 release includes TensorFlow v2.4.1.
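As a small illustration of the release dependency, the following sketch checks whether a given release label is at or above 5.17.0, the first release stated above to include TensorFlow. The helper names are hypothetical, and the check covers only this lower bound; it assumes labels of the form `emr-X.Y.Z` and says nothing about which TensorFlow version a release ships.

```python
from typing import Tuple

# First EMR release that includes TensorFlow, as stated in the text.
TENSORFLOW_FIRST_RELEASE = (5, 17, 0)


def parse_release_label(label: str) -> Tuple[int, ...]:
    """Turn a label such as 'emr-6.3.0' into a comparable tuple (6, 3, 0)."""
    version = label[len("emr-"):] if label.startswith("emr-") else label
    return tuple(int(part) for part in version.split("."))


def supports_tensorflow(label: str) -> bool:
    """Return True if the release is 5.17.0 or later."""
    return parse_release_label(label) >= TENSORFLOW_FIRST_RELEASE


for label in ("emr-5.16.0", "emr-5.17.0", "emr-6.3.0"):
    print(label, supports_tensorflow(label))
```

Comparing tuples of integers avoids the pitfall of comparing version strings, where "5.9.0" would sort after "5.17.0".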

If you plan to configure TensorFlow...

Notebook options available in EMR

Web-based notebooks are now a very common tool for interactive development, and EMR provides several options for integrating Jupyter and Zeppelin notebooks.

Jupyter Notebook is a very popular open source web application that lets developers and analysts work interactively: writing live code, executing it step by step for debugging, building visualizations on top of data, and adding narrative text alongside the code. You can also share notebooks with others, who can import the code into their own notebooks.

Within an EMR cluster, you have the option to use EMR Notebooks and JupyterHub. Outside of your EMR cluster, you have EMR Studio, which you can attach to your EMR cluster.

Now let's dive deep into each of these options.

EMR Notebooks

EMR Notebooks is available in the EMR console. Notebooks are serverless and can be attached to any EMR cluster running Hadoop, Spark, and Livy. Using EMR Notebooks, you can open...
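The attachment requirement above (a cluster running Hadoop, Spark, and Livy) can be sketched as a simple pre-check. The function name is hypothetical, and the list of installed application names is supplied by hand here; in practice you might read it from the `Cluster.Applications` field of a boto3 `describe_cluster` response, which is an assumption about your workflow rather than something this chapter shows.

```python
from typing import Iterable, Set

# Applications the text says a cluster must run for EMR Notebooks to attach.
REQUIRED_FOR_NOTEBOOKS = {"Hadoop", "Spark", "Livy"}


def missing_notebook_applications(installed: Iterable[str]) -> Set[str]:
    """Return the required applications that are absent from the cluster."""
    installed_lower = {name.lower() for name in installed}
    return {
        app for app in REQUIRED_FOR_NOTEBOOKS if app.lower() not in installed_lower
    }


ready_cluster = ["Hadoop", "Spark", "Livy", "Hive"]
etl_only_cluster = ["Hadoop", "Hive"]

print(missing_notebook_applications(ready_cluster))
print(sorted(missing_notebook_applications(etl_only_cluster)))
```

An empty result means the cluster meets the stated requirement; a non-empty result names the applications to add before attaching a notebook.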

Summary

In this chapter, we looked closely at a few popular big data applications available in EMR, how they are set up in EMR, and what additional configuration options or features you get when you integrate them with Amazon S3. We then provided an overview of TensorFlow and MXNet, the machine learning and deep learning libraries available as applications in EMR. These applications are the primary building blocks when you implement a data analytics pipeline using EMR.

Finally, we covered the different notebook options you have and how you can configure and use them for your interactive development.

That concludes this chapter! Hopefully, you have got a good overview of these distributed applications and are ready to dive deep into EMR cluster creation and configuration in the next chapter.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. You have terabyte-scale data available in Amazon S3, and your data analysts are looking for a query engine that lets them query the data interactively using SQL. You already have a persistent EMR cluster that is used for multiple ETL workloads, and to save costs you are looking for an application within EMR that can serve as the interactive query engine. Which big data application in EMR best fits your need?
  2. Your team uses EMR with Spark for multiple ETL workloads, with Amazon S3 as the persistent data store. For one of the use cases, you receive data that does not have a fixed schema, and you are looking for a NoSQL solution that supports data updates and provides fast lookups. Which EMR big data application can support this technical requirement?
  3. Your data scientists are looking for a web-based notebook that...

Further reading

Here are a few resources you can refer to for further reading:
