Serverless ETL and Analytics with AWS Glue

Product type: Book
Published: Aug 2022
Publisher: Packt
ISBN-13: 9781800564985
Pages: 434
Edition: 1st
Authors (6): Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Albert Quiroga, Ishan Gaur

Table of Contents (20)

Preface
Section 1 – Introduction, Concepts, and the Basics of AWS Glue
  Chapter 1: Data Management – Introduction and Concepts
  Chapter 2: Introduction to Important AWS Glue Features
  Chapter 3: Data Ingestion
Section 2 – Data Preparation, Management, and Security
  Chapter 4: Data Preparation
  Chapter 5: Data Layouts
  Chapter 6: Data Management
  Chapter 7: Metadata Management
  Chapter 8: Data Security
  Chapter 9: Data Sharing
  Chapter 10: Data Pipeline Management
Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
  Chapter 11: Monitoring
  Chapter 12: Tuning, Debugging, and Troubleshooting
  Chapter 13: Data Analysis
  Chapter 14: Machine Learning Integration
  Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases
Other Books You May Enjoy

Chapter 5: Data Layouts

Data analysis is a common practice for making data-driven decisions that accelerate business and grow your company, organization, teams, and more. In a typical analysis process, you run queries that process and aggregate the records in your datasets to understand business trends. These queries are commonly issued from Business Intelligence (BI) dashboard tools, web applications, automated tools, and more, and they return the results you need, such as user subscription counts, marketing reports, and sales trends.

For these analytic queries, performance matters because the results must be available in a timely fashion so that business decisions can be made quickly. To accelerate query performance and obtain the analysis results sooner, you need to pay attention to your dashboard tools, the computation engine that processes your large volumes of data, the layout design of your data and...

Technical requirements

For this chapter, if you wish to follow some of the walk-throughs, you will require the following:

  • Access to GitHub, S3, and the AWS console (specifically AWS Glue, AWS Lake Formation, and Amazon S3)
  • A computer with the Chrome, Firefox, Safari, or Microsoft Edge browser installed, as well as the AWS Command-Line Interface (AWS CLI)
  • An AWS account and an accompanying IAM user (or IAM role) with sufficient privileges to complete this chapter’s activities. We recommend using a minimally scoped IAM policy to avoid unnecessary usage and operational mistakes. You can get the IAM policy for this chapter from the relevant GitHub repository, which is shown at https://github...
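
Before starting the walk-throughs, it can be useful to confirm that your credentials and permissions resolve as expected. The following is a minimal sketch using boto3 rather than anything from this chapter; the bucket name is a placeholder you would replace with your own:

```python
import boto3

# Check which IAM identity your configured credentials resolve to.
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])

# List a few objects in the bucket you plan to use for the walk-throughs.
# "your-example-bucket" is a placeholder, not a bucket used by this book.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="your-example-bucket", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```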

Why do we need to pay attention to data layout?

As we discussed earlier, it’s important to maximize query performance for your analytic workloads because users need to understand their situation quickly and make decisions based on the query results. One of the most important phases to optimize in an analytics workload is the data extraction process, in which a computation engine retrieves your data from its location (a relational database, distributed storage, and so on) and reads the records. This is because most of the work in an analytic workload consists of reading data and processing it into the form that your queries require. These days, many computation engines are highly optimized, thanks to the communities and companies behind them. However, the data extraction process, especially retrieving and reading data from an external location, depends heavily on your data layout, such as the number of files, the file format and so on, the network speed, and...

Key techniques for optimally storing data

As mentioned earlier, the data extraction process is one of the most important phases to consider when optimizing your analytic workloads. In the usual data retrieval process, users such as data analysts, business intelligence engineers, and data engineers run queries against a distributed analytics engine such as Apache Spark or Trino. The engine then gets information about the data, such as the location of each file and its metadata. Usually, the data itself is stored in distributed storage such as Amazon S3, HDFS, and more. After gathering all the information about the data, the computing engine actually accesses and reads the data that you specify in the queries. Finally, it returns the query results to the users.

To make the data retrieval process faster for further analysis, it’s important to consider how you store data. In particular, you can optimize workloads for analysis by storing data in the most suitable condition...
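
As a concrete illustration of these ideas, the following is a minimal PySpark sketch (not taken from this chapter) that rewrites raw CSV data into a splittable, compressed, columnar format and partitions it by columns that queries typically filter on. The S3 paths and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimal-data-layout").getOrCreate()

# Read raw CSV data; the path is a placeholder for your own bucket.
df = spark.read.option("header", "true").csv(
    "s3://your-example-bucket/raw/access_logs/"
)

# Write the data as Snappy-compressed Parquet (a splittable, columnar format),
# partitioned by columns that queries commonly filter on. This assumes the
# DataFrame already has year/month/day columns.
(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .option("compression", "snappy")
   .parquet("s3://your-example-bucket/curated/access_logs/"))
```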

Optimizing the number of files and each file size

The number of files and the size of each file also affect the performance of your analytic workloads, and in particular the performance of the data retrieval phase performed by the analytic engine. To understand this relationship, we’ll look at how an engine generally retrieves data and returns a result.

The basic process of retrieving data and returning a result is to first get a list of files, read each file, process the contents based on your queries, and then return the result. In particular, when processing data in Amazon S3, the analytic engine lists the objects in your specified S3 bucket, gets the objects, reads their contents, and then processes them and returns the result. When you use an AWS Glue ETL Spark job...
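
To make this concrete, here is a hedged sketch of a Glue ETL Spark job that groups many small input files into larger read units and compacts them into fewer output files. The S3 paths, the group size, and the output file count are illustrative placeholders rather than recommendations from this chapter:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read many small JSON files, letting Glue group them into roughly 128 MB
# read units so that listing and per-file scheduling overhead is reduced.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://your-example-bucket/raw/small-files/"],
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # ~128 MB per grouped read
    },
    format="json",
)

# Compact the output into a small number of larger Parquet files.
compacted = dyf.toDF().coalesce(8)
compacted.write.mode("overwrite").parquet(
    "s3://your-example-bucket/curated/compacted/"
)

job.commit()
```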

Optimizing your storage with Amazon S3

So far, we’ve seen how we should store data optimally and how we can manage data to optimize data retrieval and accelerate analytic workloads. These techniques primarily work on the data itself, such as storing data in columnar formats, compacting data, and more. Beyond handling the data itself optimally, it’s also important to think about optimization on the storage side.

Our data, such as web access logs, device data, and so on, is generated continuously, and its size grows over time. As storage usage increases, so does the cost. To reduce storage costs, we usually archive data that is accessed infrequently or not at all. Generally, we can divide data into the following tiers based on how frequently it is accessed (a lifecycle configuration sketch follows this list):

  • Hot: This is data that you usually access.
  • Warm: This is data that you access less frequently or need less often than hot data.
  • Cold: This is data...
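
As a rough illustration of how such tiers can be automated on the storage side, here is a minimal boto3 sketch of an Amazon S3 Lifecycle configuration that moves aging data to colder storage classes and eventually expires it. The bucket name, prefix, and day thresholds are placeholders, not values from this chapter:

```python
import boto3

s3 = boto3.client("s3")

# Illustrative rule: keep recent ("hot") data in S3 Standard, move "warm" data
# to Standard-IA after 30 days, archive "cold" data to S3 Glacier after 90 days,
# and expire it after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="your-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-access-logs",
                "Filter": {"Prefix": "curated/access_logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```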

Summary

In this chapter, we learned how to design the data layout to accelerate our analytic workloads. In particular, we focused on three areas: how to store our data optimally, how to manage the number of files and the size of each file, and how to optimize our storage by working with Amazon S3.

In the first part, we learned techniques for storing our data optimally. These techniques include choosing file formats and compression types, understanding file splittability, and partitioning/bucketing. Then, we learned about data compaction to manage the number of files and the size of each file and to enhance analytic query performance. In the last part, we learned how to optimize our storage with Amazon S3 and Glue DynamicFrames. You can use your storage effectively by archiving, expiring, and deleting your data with Amazon S3 Lifecycle configurations and the Glue DynamicFrame methods.

Managing the data in your data lake with techniques introduced in this chapter...

Further reading

To learn more about what we’ve touched on in this chapter, please refer to the following resources:
