Reader small image

You're reading from  Data Engineering with AWS - Second Edition

Product typeBook
Published inOct 2023
PublisherPackt
ISBN-139781804614426
Edition2nd Edition
Right arrow
Author (1)
Gareth Eagar
Gareth Eagar
author image
Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data related topics.
Read more about Gareth Eagar

Right arrow

Ad Hoc Queries with Amazon Athena

In Chapter 8, Identifying and Enabling Varied Data Consumers, we explored a variety of data consumers. Now in this chapter, we will start examining the AWS services that some of these different data consumers may want to use, starting with those that need to use SQL to run ad hoc queries on data in the data lake.

SQL syntax is widely used for querying data in a variety of databases, and there is a large number of people that know SQL, making it a skill that is fairly easy to find. As a result, there is significant demand from various data consumers for the ability to query data that is in the data lake using SQL, without having to first move the data into a dedicated traditional database.

Amazon Athena is a serverless, fully managed service that lets you use SQL and Spark to directly query data in the data lake, as well as query various other database sources. It requires no setup, and there are options to either pay for the service based...

Technical requirements

In the hands-on sections of this chapter, you will perform administrative tasks related to Amazon Athena (such as creating a new Athena workgroup) and run Athena queries. As mentioned at the start of this book, we strongly recommend that, for the exercises in this book, you use a sandbox account where you have full administrative permissions.

For this chapter, at a minimum, you will need permissions to manage Athena workgroups, permissions to run Athena queries, access to the AWS Glue Data Catalog for databases and tables to be queried, and read access to the relevant underlying S3 storage.

A user that has the AmazonAthenaFullAccess and AmazonS3ReadOnlyAccess policies attached should have sufficient permissions for the exercises in this chapter. However, note that a user with these roles will have access to all S3 objects in the account, all Glue resources, all Athena permissions, as well as various other permissions, so this should only be granted to...

An introduction to Amazon Athena

Amazon Athena was originally launched as a service that simply provided a way to run SQL queries against data in an S3-based data lake. However, over the years, AWS had added a lot of additional functionality to Athena, enabling features like running queries against other databases (not just S3-based data), and supporting the use of Spark based Notebooks for querying data (in addition to SQL queries).

Structured Query Language (SQL) was invented at IBM in the 1970s but has remained an extremely popular language for querying data throughout the decades. Every day, millions of people across the world use SQL directly to explore data in a variety of databases, and many more use applications (whether business applications, mobile applications, or others) that, under the covers, use SQL to query a database.

Facebook, the social media network, has very large datasets and complex data analysis requirements and found that existing tools in the Hadoop...

Tips and tricks to optimize Amazon Athena queries

When raw data is ingested into the data lake, we can immediately create a table for that data in the AWS Glue Data Catalog (either using a Glue crawler or by running DDL statements with Athena to define the table). Once the table has been created, we can start exploring the table by using Amazon Athena to run SQL queries against the data.

However, raw data is often ingested in plaintext formats such as CSV or JSON. And while we can query the data in this format for ad hoc data exploration, if we need to run complex queries against large datasets, these raw formats are not efficient to query. There are also ways that we can optimize the SQL queries that we write to make the best use of the underlying Athena query engine, which we will review in this chapter.

By default, Amazon Athena’s cost is based on the amount of compressed data that is scanned to resolve your SQL query, so anything that can be done to reduce the...

Exploring advanced Athena functionality

As we’ve discussed several times in this book, Athena lets you query data that has been loaded into the data lake using standard SQL semantics. But since the launch of Athena, AWS has added additional functionality to enhance Athena to make it an even more powerful query engine.

One of those major enhancements, which became available in 2021 with Athena query engine v2, was the ability to run federated queries, which we will look at next.

Querying external data sources using Athena Federated Query

Query federation, also sometimes referred to as data virtualization, is the process of querying multiple external data sources, in different database engines or other systems, through a single SQL query statement.

Data lakes are designed to collect data from multiple systems in an organization and bring it into centralized storage, where the data can be combined in ways that unlock value for the business. However, it is not...

Managing groups of users with Amazon Athena workgroups

Athena workgroups are a powerful mechanism for separating different groups and types of user queries (such as SQL based queries, and Spark Notebook queries), for applying cost controls, assigning provisioned capacity, and for implementing strong governance on Athena usage.

All queries within Athena are run in a specific workgroup, and you can apply a number of configuration settings to each workgroup to control costs and governance for the users in that workgroup. The following list shows the configuration items that can be controlled at the workgroup level:

  • The analytic engine that users in this workgroup use (either Athena SQL or Apache Spark)
  • Whether the Athena engine version used for this workgroup is selected and upgraded automatically, or whether you manually specify the engine version to use
  • The S3 location where the results of queries are written to, and whether these files are encrypted or...

Hands-on – creating an Amazon Athena workgroup and configuring Athena settings

In this section, we’re going to create and configure a new Athena workgroup, and set a per query data limit:

  1. Log into AWS Management Console and access the Athena service using this link: https://console.aws.amazon.com/athena.
  2. Expand the left-hand menu and click on Workgroups to access the workgroup management page.

Figure 11.2: Athena Console showing Workgroups

  1. On the workgroup management page, click on Create workgroup and enter the following values for our new workgroup. For the items not listed here, leave the defaults as-is:
    • Workgroup name: Provide a descriptive name for the workgroup, such as datalake-user-sandbox.
    • Description: Optionally, provide a description for this workgroup, such as Sandbox workgroup for new datalake-users.
    • Query result location (in the Query result configuration section): In the hands-on exercises...

Hands-on – switching workgroups and running queries

By default, all users operate in the primary workgroup, but users can switch between any workgroup that they have access to. You can control workgroup access via IAM policies, as detailed in the AWS documentation titled IAM Policies for Accessing Workgroups : https://docs.aws.amazon.com/athena/latest/ug/workgroups-iam-policy.html

In the previous section, we created and configured a new workgroup, so we can now run some SQL queries and explore Athena’s functionality further:

  1. In the left-hand menu, click on Query editor. Once in the Query editor, use the Workgroup drop-down list selector to change to your newly created sandbox workgroup.

Figure 11.4: Switching Workgroups in the Athena Console

  1. A pop-up dialog may appear for you to acknowledge that all the queries that are run in this workgroup will use the settings we configured previously. This is because we chose to Overwrite client...

Summary

In this chapter, we learned more about the Amazon Athena service, an AWS-managed service that builds on the Apache Presto and Trino solutions to enable you to run SQL or Spark based queries against your data. We also looked at how to optimize our data and SQL queries to increase query performance and reduce costs.

Then, we explored advanced Athena functionality, including how Athena can be used as a SQL query engine not only for data in an Amazon S3 data lake, but also for external data sources such as other database systems, data warehouses, and even CloudWatch logs, using Athena Query Federation.

We wrapped up the theory part of this chapter by looking at Athena workgroups, which let us manage governance and costs, and they can be used to enforce specific settings for different teams or projects, and can also be used to limit the amount of data that is scanned by queries. In the last section of this chapter, we got hands-on with Athena, first creating a new workgroup...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Data Engineering with AWS - Second Edition
Published in: Oct 2023Publisher: PacktISBN-13: 9781804614426
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data related topics.
Read more about Gareth Eagar