You're reading from Simplify Big Data Analytics with Amazon EMR

Product type Book

Published in Mar 2022

Publisher Packt

ISBN-13 9781801071079

Pages 430 pages

Edition 1st Edition

Concepts

Big Data

Author (1):

Sakti Mishra

Optimization techniques for data processing and storage

We have recommended using Amazon S3 as the EMR cluster's persistent storage because it offers better reliability, supports transient clusters, and is cost-effective. There are, however, several best practices to follow when storing data in Amazon S3 or in the cluster's HDFS.

Let's look at some general best practices that can improve performance and reduce costs from both a storage and a processing perspective.

Best practices for cluster persistent storage

Some general best practices apply to both Amazon S3 and HDFS cluster storage. The following are a few of the most important ones.

Choosing the right file format

You might receive files in CSV, JSON, or TXT format, but after ETL processing, when you write to a data lake based on S3 or HDFS, you should choose the right file format to get the best performance...
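
The excerpt is cut off before it names a specific format, so the following is only a minimal sketch under an assumption: analytics queries often read a few columns out of many, which is the access pattern that column-oriented formats such as Apache Parquet and Apache ORC are designed for. It uses only the Python standard library and a toy in-memory layout, not a real file format; `CSV_TEXT`, `to_columnar`, and the other names are hypothetical.

```python
import csv
import io

# Hypothetical sample input, standing in for a raw CSV file landed by an upstream system.
CSV_TEXT = """order_id,country,amount
1,US,10.50
2,DE,7.25
3,US,3.00
"""


def sum_column_row_oriented(csv_text, column):
    """Sum one column from row-oriented text; every field of every row is parsed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields_parsed = 0
    total = 0.0
    for row in reader:
        fields_parsed += len(row)  # all fields in the row were parsed
        total += float(row[column])
    return total, fields_parsed


def to_columnar(csv_text):
    """One-time ETL step (toy version): pivot rows into one list per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_column_columnar(columns, column):
    """Sum one column from the columnar layout; only that column is touched."""
    values = columns[column]
    return sum(float(v) for v in values), len(values)


row_total, row_fields = sum_column_row_oriented(CSV_TEXT, "amount")
columns = to_columnar(CSV_TEXT)
col_total, col_fields = sum_column_columnar(columns, "amount")

print(row_total, row_fields)
print(col_total, col_fields)
```

Both reads return the same total, but the row-oriented read parses 9 fields to sum one column, while the columnar layout touches only the 3 values it needs. On a real data lake, the same idea reduces the amount of data a query has to scan.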