Serverless ETL and Analytics with AWS Glue

Product type: Book
Published: Aug 2022
Publisher: Packt
ISBN-13: 9781800564985
Pages: 434
Edition: 1st
Authors (6): Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Albert Quiroga, Ishan Gaur

Table of Contents (20)

Preface
Section 1 – Introduction, Concepts, and the Basics of AWS Glue
  Chapter 1: Data Management – Introduction and Concepts
  Chapter 2: Introduction to Important AWS Glue Features
  Chapter 3: Data Ingestion
Section 2 – Data Preparation, Management, and Security
  Chapter 4: Data Preparation
  Chapter 5: Data Layouts
  Chapter 6: Data Management
  Chapter 7: Metadata Management
  Chapter 8: Data Security
  Chapter 9: Data Sharing
  Chapter 10: Data Pipeline Management
Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
  Chapter 11: Monitoring
  Chapter 12: Tuning, Debugging, and Troubleshooting
  Chapter 13: Data Analysis
  Chapter 14: Machine Learning Integration
  Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases
Other Books You May Enjoy

Chapter 5: Data Layouts

Data analysis is a common practice for making data-driven decisions that accelerate business and grow your company, organization, teams, and more. In a typical analysis process, you run queries that process and aggregate the records in your datasets to understand business trends. These queries are commonly issued from Business Intelligence (BI) dashboard tools, web applications, automated tools, and more, and they return the results you need, such as user subscription counts, marketing reports, and sales trends.

For these analytic queries, performance matters because the results must be available in a timely fashion so that business decisions can be made quickly. To accelerate query performance and obtain the analysis results sooner, you need to pay attention to your dashboard tools, the computation engine that processes your large volumes of data, the layout design of your data and...

Technical requirements

For this chapter, if you wish to follow some of the walk-throughs, you will require the following:

  • Access to GitHub, S3, and the AWS console (specifically AWS Glue, AWS Lake Formation, and Amazon S3)
  • A computer with the Chrome, Firefox, Safari, or Microsoft Edge browser installed, as well as the AWS Command-Line Interface (AWS CLI)
  • An AWS account and an accompanying IAM user (or IAM role) with sufficient privileges to complete this chapter’s activities. We recommend using a minimally scoped IAM policy to avoid unnecessary usage and operational mistakes. You can get the IAM policy for this chapter from the relevant GitHub repository, which is shown at https://github...
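
Before starting the walk-throughs, it can be useful to confirm that your credentials and permissions resolve as expected. The following is a minimal sketch using boto3 rather than anything from this chapter; the bucket name is a placeholder you would replace with your own:

```python
import boto3

# Check which IAM identity your configured credentials resolve to.
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])

# List a few objects in the bucket you plan to use for the walk-throughs.
# "your-example-bucket" is a placeholder, not a bucket used by this book.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="your-example-bucket", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```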

Why do we need to pay attention to data layout?

As we discussed earlier, it’s important to maximize query performance for your analytic workloads because users need to understand their situation quickly and make decisions based on the query results. One of the most important phases to optimize in an analytics workload is the data extraction process, in which a computation engine retrieves your data from its location (a relational database, distributed storage, and so on) and reads the records. This is because most of the work in an analytic workload consists of reading data and processing it into the form that your queries require. These days, many computation engines are highly optimized, thanks to the communities and companies behind them. However, the data extraction process, especially retrieving and reading data from an external location, depends heavily on your data layout, such as the number of files, the file format and so on, the network speed, and...

Key techniques for optimally storing data

As mentioned earlier, the data extraction process is one of the most important phases to consider when optimizing your analytic workloads. In the usual data retrieval process, users such as data analysts, business intelligence engineers, and data engineers run queries against a distributed analytics engine such as Apache Spark or Trino. The engine then gets information about the data, such as the location of each file and its metadata. Usually, the data itself is stored in distributed storage such as Amazon S3, HDFS, and more. After gathering all the information about the data, the computing engine actually accesses and reads the data that you specify in the queries. Finally, it returns the query results to the users.

To make the data retrieval process faster for further analysis, it’s important to consider how you store data. In particular, you can optimize workloads for analysis by storing data in the most suitable condition...
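
As a concrete illustration of these ideas, the following is a minimal PySpark sketch (not taken from this chapter) that rewrites raw CSV data into a splittable, compressed, columnar format and partitions it by columns that queries typically filter on. The S3 paths and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimal-data-layout").getOrCreate()

# Read raw CSV data; the path is a placeholder for your own bucket.
df = spark.read.option("header", "true").csv(
    "s3://your-example-bucket/raw/access_logs/"
)

# Write the data as Snappy-compressed Parquet (a splittable, columnar format),
# partitioned by columns that queries commonly filter on. This assumes the
# DataFrame already has year/month/day columns.
(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .option("compression", "snappy")
   .parquet("s3://your-example-bucket/curated/access_logs/"))
```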

Optimizing the number of files and each file size

The number of files and the size of each file also affect the performance of your analytic workloads, and in particular the performance of the data retrieval phase performed by the analytic engine. To understand this relationship, we’ll look at how an engine generally retrieves data and returns a result.

The basic process of retrieving data and returning a result is to first get a list of files, read each file, process the contents based on your queries, and then return the result. In particular, when processing data in Amazon S3, the analytic engine lists the objects in your specified S3 bucket, gets the objects, reads their contents, and then processes them and returns the result. When you use an AWS Glue ETL Spark job...
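
To make this concrete, here is a hedged sketch of a Glue ETL Spark job that groups many small input files into larger read units and compacts them into fewer output files. The S3 paths, the group size, and the output file count are illustrative placeholders rather than recommendations from this chapter:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read many small JSON files, letting Glue group them into roughly 128 MB
# read units so that listing and per-file scheduling overhead is reduced.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://your-example-bucket/raw/small-files/"],
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # ~128 MB per grouped read
    },
    format="json",
)

# Compact the output into a small number of larger Parquet files.
compacted = dyf.toDF().coalesce(8)
compacted.write.mode("overwrite").parquet(
    "s3://your-example-bucket/curated/compacted/"
)

job.commit()
```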

Optimizing your storage with Amazon S3

So far, we’ve seen how we should store data optimally and how we can manage data to optimize data retrieval and accelerate analytic workloads. These techniques primarily work on the data itself, such as storing data in columnar formats, compacting data, and more. Beyond handling the data itself optimally, it’s also important to think about optimization on the storage side.

Our data, such as web access logs, device data, and so on, is generated continuously, and its size grows over time. As storage usage increases, so does the cost. To reduce storage costs, we usually archive data that is accessed infrequently or not at all. Generally, we can divide data into the following tiers based on how frequently it is accessed (a lifecycle configuration sketch follows this list):

  • Hot: This is data that you usually access.
  • Warm: This is data that you access less frequently or need less often than hot data.
  • Cold: This is data...
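
As a rough illustration of how such tiers can be automated on the storage side, here is a minimal boto3 sketch of an Amazon S3 Lifecycle configuration that moves aging data to colder storage classes and eventually expires it. The bucket name, prefix, and day thresholds are placeholders, not values from this chapter:

```python
import boto3

s3 = boto3.client("s3")

# Illustrative rule: keep recent ("hot") data in S3 Standard, move "warm" data
# to Standard-IA after 30 days, archive "cold" data to S3 Glacier after 90 days,
# and expire it after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="your-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-access-logs",
                "Filter": {"Prefix": "curated/access_logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```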

Summary

In this chapter, we learned how to design the data layout to accelerate our analytic workloads. In particular, we focused on three areas: how to store our data optimally, how to manage the number of files and the size of each file, and how to optimize our storage by working with Amazon S3.

In the first part, we learned techniques for storing our data optimally. These techniques include choosing file formats and compression types, understanding file splittability, and partitioning/bucketing. Then, we learned about data compaction to manage the number of files and the size of each file and to enhance analytic query performance. In the last part, we learned how to optimize our storage with Amazon S3 and Glue DynamicFrames. You can use your storage effectively by archiving, expiring, and deleting your data with Amazon S3 Lifecycle configurations and the Glue DynamicFrame methods.

Managing the data in your data lake with techniques introduced in this chapter...

Further reading

To learn more about what we’ve touched on in this chapter, please refer to the following resources:
