You're reading from Data Engineering with AWS - Second Edition

Product typeBook

Published inOct 2023

PublisherPackt

ISBN-139781804614426

Edition2nd Edition

Concepts

Data Engineering

Author (1)

Gareth Eagar

Ad Hoc Queries with Amazon Athena

In Chapter 8, Identifying and Enabling Varied Data Consumers, we explored a variety of data consumers. Now in this chapter, we will start examining the AWS services that some of these different data consumers may want to use, starting with those that need to use SQL to run ad hoc queries on data in the data lake.

SQL syntax is widely used for querying data in a variety of databases, and there is a large number of people that know SQL, making it a skill that is fairly easy to find. As a result, there is significant demand from various data consumers for the ability to query data that is in the data lake using SQL, without having to first move the data into a dedicated traditional database.

Amazon Athena is a serverless, fully managed service that lets you use SQL and Spark to directly query data in the data lake, as well as query various other database sources. It requires no setup, and there are options to either pay for the service based...

Technical requirements

In the hands-on sections of this chapter, you will perform administrative tasks related to Amazon Athena (such as creating a new Athena workgroup) and run Athena queries. As mentioned at the start of this book, we strongly recommend that, for the exercises in this book, you use a sandbox account where you have full administrative permissions.

For this chapter, at a minimum, you will need permissions to manage Athena workgroups, permissions to run Athena queries, access to the AWS Glue Data Catalog for databases and tables to be queried, and read access to the relevant underlying S3 storage.

A user that has the AmazonAthenaFullAccess and AmazonS3ReadOnlyAccess policies attached should have sufficient permissions for the exercises in this chapter. However, note that a user with these roles will have access to all S3 objects in the account, all Glue resources, all Athena permissions, as well as various other permissions, so this should only be granted to...

An introduction to Amazon Athena

Amazon Athena was originally launched as a service that simply provided a way to run SQL queries against data in an S3-based data lake. However, over the years, AWS had added a lot of additional functionality to Athena, enabling features like running queries against other databases (not just S3-based data), and supporting the use of Spark based Notebooks for querying data (in addition to SQL queries).

Structured Query Language (SQL) was invented at IBM in the 1970s but has remained an extremely popular language for querying data throughout the decades. Every day, millions of people across the world use SQL directly to explore data in a variety of databases, and many more use applications (whether business applications, mobile applications, or others) that, under the covers, use SQL to query a database.

Facebook, the social media network, has very large datasets and complex data analysis requirements and found that existing tools in the Hadoop...

Tips and tricks to optimize Amazon Athena queries

When raw data is ingested into the data lake, we can immediately create a table for that data in the AWS Glue Data Catalog (either using a Glue crawler or by running DDL statements with Athena to define the table). Once the table has been created, we can start exploring the table by using Amazon Athena to run SQL queries against the data.

However, raw data is often ingested in plaintext formats such as CSV or JSON. And while we can query the data in this format for ad hoc data exploration, if we need to run complex queries against large datasets, these raw formats are not efficient to query. There are also ways that we can optimize the SQL queries that we write to make the best use of the underlying Athena query engine, which we will review in this chapter.

By default, Amazon Athena’s cost is based on the amount of compressed data that is scanned to resolve your SQL query, so anything that can be done to reduce the...

Exploring advanced Athena functionality

As we’ve discussed several times in this book, Athena lets you query data that has been loaded into the data lake using standard SQL semantics. But since the launch of Athena, AWS has added additional functionality to enhance Athena to make it an even more powerful query engine.

One of those major enhancements, which became available in 2021 with Athena query engine v2, was the ability to run federated queries, which we will look at next.

Querying external data sources using Athena Federated Query

Query federation, also sometimes referred to as data virtualization, is the process of querying multiple external data sources, in different database engines or other systems, through a single SQL query statement.

Data lakes are designed to collect data from multiple systems in an organization and bring it into centralized storage, where the data can be combined in ways that unlock value for the business. However, it is not...

Managing groups of users with Amazon Athena workgroups

Athena workgroups are a powerful mechanism for separating different groups and types of user queries (such as SQL based queries, and Spark Notebook queries), for applying cost controls, assigning provisioned capacity, and for implementing strong governance on Athena usage.

All queries within Athena are run in a specific workgroup, and you can apply a number of configuration settings to each workgroup to control costs and governance for the users in that workgroup. The following list shows the configuration items that can be controlled at the workgroup level:

The analytic engine that users in this workgroup use (either Athena SQL or Apache Spark)
Whether the Athena engine version used for this workgroup is selected and upgraded automatically, or whether you manually specify the engine version to use
The S3 location where the results of queries are written to, and whether these files are encrypted or...

Hands-on – creating an Amazon Athena workgroup and configuring Athena settings

In this section, we’re going to create and configure a new Athena workgroup, and set a per query data limit:

Log into AWS Management Console and access the Athena service using this link: https://console.aws.amazon.com/athena.
Expand the left-hand menu and click on Workgroups to access the workgroup management page.

Figure 11.2: Athena Console showing Workgroups

On the workgroup management page, click on Create workgroup and enter the following values for our new workgroup. For the items not listed here, leave the defaults as-is:
- Workgroup name: Provide a descriptive name for the workgroup, such as datalake-user-sandbox.
- Description: Optionally, provide a description for this workgroup, such as Sandbox workgroup for new datalake-users.
- Query result location (in the Query result configuration section): In the hands-on exercises...

Hands-on – switching workgroups and running queries

By default, all users operate in the primary workgroup, but users can switch between any workgroup that they have access to. You can control workgroup access via IAM policies, as detailed in the AWS documentation titled IAM Policies for Accessing Workgroups : https://docs.aws.amazon.com/athena/latest/ug/workgroups-iam-policy.html

In the previous section, we created and configured a new workgroup, so we can now run some SQL queries and explore Athena’s functionality further:

In the left-hand menu, click on Query editor. Once in the Query editor, use the Workgroup drop-down list selector to change to your newly created sandbox workgroup.

Figure 11.4: Switching Workgroups in the Athena Console

A pop-up dialog may appear for you to acknowledge that all the queries that are run in this workgroup will use the settings we configured previously. This is because we chose to Overwrite client...

Summary

In this chapter, we learned more about the Amazon Athena service, an AWS-managed service that builds on the Apache Presto and Trino solutions to enable you to run SQL or Spark based queries against your data. We also looked at how to optimize our data and SQL queries to increase query performance and reduce costs.

Then, we explored advanced Athena functionality, including how Athena can be used as a SQL query engine not only for data in an Amazon S3 data lake, but also for external data sources such as other database systems, data warehouses, and even CloudWatch logs, using Athena Query Federation.

We wrapped up the theory part of this chapter by looking at Athena workgroups, which let us manage governance and costs, and they can be used to enforce specific settings for different teams or projects, and can also be used to limit the amount of data that is scanned by queries. In the last section of this chapter, we got hands-on with Athena, first creating a new workgroup...

The rest of the chapter is locked

You have been reading a chapter from

Data Engineering with AWS - Second Edition

Published in: Oct 2023Publisher: PacktISBN-13: 9781804614426

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data related topics.
Read more about Gareth Eagar

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages