
Serverless Analytics with Amazon Athena: Query structured, unstructured, or semi-structured data in seconds without setting up any infrastructure

By Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick

Product Details

Publication date: Nov 19, 2021
Length: 438 pages
Edition: 1st Edition
Language: English
ISBN-13: 9781800562349

Serverless Analytics with Amazon Athena

Chapter 1: Your First Query

This chapter is all about introducing you to the serverless analytics experience offered by Amazon Athena. Data is one of the most valuable assets you and your company generate. In recent years, we have seen a revolution in data retention, where companies are capturing all manner of data that was once ignored. Everything from logs to clickstream data to support tickets is now routinely kept for years. Interestingly, the data itself is not what is valuable. Instead, the insights that are buried in that mountain of data are what we are after. Certainly, increased awareness and retention have made the information we need to power our businesses, applications, and decisions more available, but the explosion in data sizes has made the insights we seek less accessible. What could once fit nicely in a traditional RDBMS, such as Oracle, now requires a distributed filesystem such as HDFS and an accompanying Massively Parallel Processing (MPP) engine such as Spark to run even the most basic of queries in a timely fashion.

Enter Amazon Athena. Unlike traditional analytics engines, Amazon Athena is a fully managed offering. You will never have to set up any servers or tune cryptic settings to get your queries running. This allows you to focus on what is most important: using data to generate insights that drive your business. This ease of use is precisely why this first chapter is all about getting hands-on and running your first query. Whether you are a seasoned analytics veteran or a newcomer to the space, this chapter will give you the knowledge you need to be running your first Athena query in less than 30 minutes. For now, we will simplify things to demonstrate why so many people choose Amazon Athena for their workloads. This will help establish your mental model for the deeper discussions, features, and examples of later sections.

In this chapter, we will cover the following topics:

  • What is Amazon Athena?
  • Obtaining and preparing sample data
  • Running your first query

Technical requirements

Wherever possible, we will provide samples or instructions to guide you through the setup. However, to complete the activities in this chapter, you will need to ensure you have the following prerequisites available. Our command-line examples will be executed using Ubuntu, but most flavors of Linux should also work without modification.

You will need internet access to GitHub, S3, and the AWS Console.

You will also require a computer with the following installed:

  • Chrome, Safari, or Microsoft Edge
  • The AWS CLI
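
If you want to sanity-check your environment before continuing, the following commands are a quick, optional way (not part of the book's steps) to confirm the CLI is installed and that your credentials resolve to the account and IAM principal you intend to use:

    # Print the installed AWS CLI version
    aws --version
    # Show the account ID and IAM user/role behind your current credentials
    aws sts get-caller-identity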

In addition, this chapter requires you to have an AWS account and accompanying IAM user (or role) with sufficient privileges to complete the activities in this chapter. Throughout this book, we will provide detailed IAM policies that attempt to honor the age-old best practice of "least privilege." For simplicity, you can always run through these exercises with a user that has full access, but we recommend that you use scoped-down IAM policies to avoid making costly mistakes and to learn more about how to best use IAM to secure your applications and data. You can find the suggested IAM policy for this chapter in this book's accompanying GitHub repository, listed as chapter_1/iam_policy_chapter_1.json:

https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/tree/main/chapter_1

This policy includes the following (see the sketch after this list for one way to attach the full policy file from the command line):

  • Read and Write access to one S3 bucket using the following actions:
    • s3:PutObject: Used to upload data and also for Athena to write query results.
    • s3:GetObject: Used by Athena to read data.
    • s3:ListBucketMultipartUploads: Used by Athena to write query results.
    • s3:AbortMultipartUpload: Used by Athena to write query results.
    • s3:ListBucketVersions
    • s3:CreateBucket: Used by you if you don't already have a bucket you can use.
    • s3:ListBucket: Used by Athena to read data.
    • s3:DeleteObject: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
    • s3:ListMultipartUploadParts: Used by Athena to write a result.
    • s3:ListAllMyBuckets: Used by Athena to ensure you own the results bucket.
    • s3:ListJobs: Used by Athena to write results.
  • Read and Write access to one Glue Data Catalog database, using the following actions:
    • glue:DeleteDatabase: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
    • glue:GetPartitions: Used by Athena to query your data in S3.
    • glue:UpdateTable: Used when we import our sample data.
    • glue:DeleteTable: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
    • glue:CreatePartition: Used when we import our sample data.
    • glue:UpdatePartition: Used when we import our sample data.
    • glue:UpdateDatabase: Used when we import our sample data.
    • glue:CreateTable: Used when we import our sample data.
    • glue:GetTables: Used by Athena to query your data in S3.
    • glue:BatchGetPartition: Used by Athena to query your data in S3.
    • glue:GetDatabases: Used by Athena to query your data in S3.
    • glue:GetTable: Used by Athena to query your data in S3.
    • glue:GetDatabase: Used by Athena to query your data in S3.
    • glue:GetPartition: Used by Athena to query your data in S3.
    • glue:CreateDatabase: Used to create a database if you don't already have one you can use.
    • glue:DeletePartition: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
  • Access to run Athena queries.
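
You do not need the console to attach this policy. Assuming you have downloaded chapter_1/iam_policy_chapter_1.json from the repository and edited any placeholders it contains (such as the bucket name), a minimal sketch for attaching it to your IAM user as an inline policy with the AWS CLI looks like this; YOUR_IAM_USER is a placeholder you must substitute:

    # Attach the chapter's policy document to your IAM user as an inline policy
    aws iam put-user-policy \
      --user-name YOUR_IAM_USER \
      --policy-name packt-athena-chapter-1 \
      --policy-document file://iam_policy_chapter_1.json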

    Important Note

    We recommend against using Firefox with the Amazon Athena console as we have found, and reported, a bug associated with switching between certain elements in the UX.

What is Amazon Athena?

Amazon Athena is a query service that allows you to run standard SQL over data stored in a variety of sources and formats. As you will see later in this chapter, Athena is serverless, so there is no infrastructure to set up or manage. You simply pay $5 per TB scanned for the queries you run without needing to worry about idle resources or scaling.
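
As a quick back-of-the-envelope illustration of that rate, a query that scans 200 GB of data costs roughly 200/1,024 TB multiplied by $5, or about $0.98, while a full scan of the roughly 10 MB sample file used later in this chapter costs a small fraction of a cent.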

Note

AWS has a habit of reducing prices over time. For the latest Athena pricing, please consult the Amazon Athena product page at https://aws.amazon.com/athena/pricing/?nc=sn&loc=3.

Athena is based on Presto (https://prestodb.io/), a distributed SQL engine that was open sourced by Facebook. It supports ANSI SQL, as well as Presto SQL features ranging from geospatial functions to approximate query extensions, which allow you to run approximation queries, with statistically bounded errors, over large datasets in only a fraction of the time. Athena's commitment to open source also provides an interesting avenue to avoid lock-in concerns because you always have the option to download and manage your own Presto deployment from GitHub. Of course, you will lose many of Athena's enhancements and must manage the infrastructure yourself, but you can take comfort in knowing you are not beholden to potentially punitive licensing agreements as you might be with other vendors.

While Athena's roots are open source, the team at AWS has added several enterprise features to the service, including the following:

  • Federated Identity via SAML and Active Directory support
  • Table, column, and even row-level access control via Lake Formation
  • Workload classification and grouping for cost control via WorkGroups
  • Automated regression testing to take the pain out of upgrades

Later chapters will cover these topics in greater detail. If you feel compelled to do so, you can use the table of contents to skip directly to those chapters and learn more.

Let's look at some use cases for Athena.

Use cases

Amazon Athena supports a wide range of use cases and we have personally used it for several different patterns. Thanks to Athena's ease of use, it is extremely common to leverage Athena for ad hoc analysis and data exploration.

Later in this book, you will use Athena from within a Jupyter notebook for machine learning. Similarly, many analysts enjoy using Athena directly from BI tools such as Looker and Tableau, courtesy of Athena's JDBC driver. Athena's robust SQL dialect and asynchronous API model also enable application developers to build analytics right into their applications, enabling features that would not previously have been practical due to scale or operational burden. In many cases, you can replace RDBMS-driven features with Athena at a fraction of the cost and with a lower operational burden.
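
To make the asynchronous model concrete, here is a minimal sketch of the same submit-poll-fetch pattern using the AWS CLI rather than the JDBC driver. It assumes a query result location such as the results/ folder you will configure later in this chapter, and uses a trivial placeholder query:

    # Submit the query; Athena immediately returns an execution ID
    QUERY_ID=$(aws athena start-query-execution \
      --query-string "SELECT 1 AS sanity_check" \
      --result-configuration OutputLocation=s3://YOUR_BUCKET_NAME/results/ \
      --query 'QueryExecutionId' --output text)
    # Poll the status; applications typically loop until SUCCEEDED or FAILED
    aws athena get-query-execution --query-execution-id "$QUERY_ID" \
      --query 'QueryExecution.Status.State'
    # Fetch the result rows once the query has succeeded
    aws athena get-query-results --query-execution-id "$QUERY_ID"

Applications follow the same flow through the AWS SDKs, which is what makes it practical to embed Athena queries behind an API or UI without holding a connection open.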

Another emerging use case for Athena is in the ETL space. While Athena advertises itself as an engine that avoids the need for ETL by querying data in place, as it is, we have seen the benefits of replacing existing ETL pipelines, or building new ones, with Athena where cost and capacity management are key factors. Athena will not necessarily achieve the same scale or performance as Spark, for example, but if your ETL jobs do not require multi-TB joins, you might find Athena to be an interesting option.

Separation of storage and compute

If you are new to serverless analytics, you may be wondering where your data is stored. Amazon Athena builds on the concept of Separation of Storage and Compute to decouple the computational resources (for example, CPU, memory, network) that do the heavy lifting of executing your SQL queries from the responsibility of keeping your data safe and available. In short, this means Athena itself does not store your data. Instead, you are free to choose from several data stores, with customers increasingly pairing DynamoDB for rapidly changing data with S3 for their bulk data. With Athena, you can easily write a query that spans both data stores.

Amazon's Simple Storage Service, or S3 for short, is easily the most recommended data store to use with Athena. When Athena launched in 2016, S3 was the first data store it supported. Unsurprisingly, Athena has been optimized to take advantage of S3's unique ability to deliver exabyte scale and throughput while still providing eleven nines (99.999999999%) of durability. In addition to effortless scaling from a few gigabytes of data up to many petabytes, S3 offers some of the lowest prices for performance that you can find. Depending on your replication requirements, storing 1 GB of data for a month will cost you between $0.01 and $0.023. Even the most cost-efficient enterprise hard drives cost around $0.21 per GB before you add on redundancy, the power to run them, or a server and data center to house them. As with most AWS services, you should consult S3's pricing page (https://aws.amazon.com/s3/pricing/) for the latest details since AWS has cut their prices more than 70 times in the last decade.
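
At those storage rates, the roughly 10 MB sample file we upload later in this chapter works out to well under a tenth of a cent per month, which is why experimenting with Athena on small datasets is essentially free.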

Metastore

In addition to accessing the raw 1s and 0s that represent your data, Athena also requires metadata that helps its SQL engine understand how to interpret the data you have stored in S3 or elsewhere. This supplemental information helps Athena map collections of files, or objects in the case of S3, to SQL constructs such as tables, columns, and rows. The repository for this data about your data is often called a metastore. Athena works with Hive-compliant metastores, including AWS's Glue Data Catalog service. In later chapters, we will look at AWS Glue Data Catalog in more detail, as well as how you can attach Athena to your own metastore, even a homegrown one. For now, all you need to know is that Athena requires a metastore to discover key attributes of the data you wish to query. The most common pieces of information kept in the metastore include the following (the sketch after this list shows how to retrieve them for a table):

  • A list of tables that exist
  • The storage location of each table (for example, the S3 path or DynamoDB table name)
  • The format of the files or objects that comprise the table (for example, CSV, Parquet, JSON)
  • The column names and data types in each table (for example, the inventory column is an integer, while revenue is a decimal(10,2))
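
Later in this chapter, we will create exactly such an entry for our sample data. Once it exists, you can inspect these attributes yourself; a minimal sketch using the AWS CLI (assuming the packt_serverless_analytics database and nyc_taxi table created in the following sections, and the us-east-1 region used throughout the examples) looks like this:

    # Ask the Glue Data Catalog for the table's location, format, and columns
    aws glue get-table \
      --database-name packt_serverless_analytics \
      --name nyc_taxi \
      --region us-east-1

The response contains the S3 location, the serialization/format details, and the column names and types, which is exactly the metadata Athena reads before planning a query.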

Now that we have a good overview of Amazon Athena, let's look at how to use it in practice.

Obtaining and preparing sample data

Before we can start running our first query, we will need some data that we would like to analyze. Throughout this book, we will try to make use of open datasets that you can freely access but that also contain interesting information that may mirror your real-world datasets. In this chapter, we will be making use of the NYC Taxi & Limousine Commission's (TLC's) Trip Record Data for New York City's iconic yellow taxis. Yellow taxis have been recording and providing ride data to TLC since 2009. Yellow taxis are traditionally hailed by signaling to a driver who is on duty and seeking a passenger (also known as a street hail). In recent years, yellow taxis have also started to use their own ride-hailing apps such as Curb and Arro to keep pace with emerging ride-hailing technologies from Uber and Lyft. However, yellow taxis remain the only vehicles permitted to respond to street hails from passengers in NYC. For that reason, the dataset often has interesting patterns that can be correlated with other events in the city, such as a concert or inclement weather.

Our exercise will focus on just one of the many datasets offered by the TLC. The yellow taxi dataset includes the following fields:

  • VendorID: A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
  • tpep_pickup_datetime: The date and time when the meter was engaged.
  • tpep_dropoff_datetime: The date and time when the meter was disengaged.
  • Passenger_count: The number of passengers in the vehicle.
  • Trip_distance: The elapsed trip distance in miles reported by the taximeter.
  • RateCodeID: The final rate code in effect at the end of the trip. 1= Standard rate, 2= JFK, 3= Newark, 4= Nassau or Westchester, 5= Negotiated fare, 6= Group ride.
  • Store_and_fwd_flag: This flag indicates whether the trip record was held in the vehicle's memory before being sent to the vendor, also known as "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip, while N= not a store and forward trip.
  • pulocationid: Location where the meter was engaged.
  • dolocationid: Location where the meter was disengaged.
  • Payment_type: A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip.
  • Fare_amount: The time-and-distance fare calculated by the meter.
  • Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
  • MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
  • Improvement_surcharge: $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
  • Tip_amount: This field is automatically populated for credit card tips. Cash tips are not included.
  • Tolls_amount: Total amount of all tolls paid in a trip.
  • Total_amount: The total amount charged to passengers. Does not include cash tips.
  • congestion_surcharge: Amount of surcharges associated with time/traffic fees imposed by the city.

This dataset is easy to obtain and is relatively interesting to run analytics against. The inconsistency in the field naming, a mixture of camel case and underscore conventions, is difficult to overlook, but we will normalize the names later:

  1. Our first step is to download the Trip Record Data for June 2020. You can obtain this directly from the NYC TLC's website (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) or our GitHub repository using the following command:
    wget https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/raw/main/chapter_1/yellow_tripdata_2020-06.csv.gz

    If you choose to download it from the NYC TLC directly, please gzip the file before proceeding to the next step.
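
     For example, assuming the raw CSV is in your current directory and the filename matches the copy in our GitHub repository, compressing it is a single command:

     gzip yellow_tripdata_2020-06.csv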

  2. Now that we have some data, we can add it to our data lake by uploading it to Amazon S3. To do this, we must create an S3 bucket. If you already have an S3 bucket that you plan to use, you can skip creating a new bucket. However, we do encourage you to avoid completing these exercises in accounts that house production workloads. As a best practice, all experimentation and learning should be done in isolation.
  3. Once you have picked a bucket name and the region that you would like to use for these exercises, you can run the following command:
    aws s3api create-bucket \
    --bucket packt-serverless-analytics \
    --region us-east-1

    Important Note

    Be sure to substitute your bucket name and region. You can also create buckets directly from the AWS Console by logging in and navigating to S3 from the service list. Later in this chapter, we will use the AWS Console to edit and run our Athena queries. For simple operations, the AWS CLI can be faster and makes it easier to see what is happening, since the AWS Console can hide multi-step operations behind a single button.

  4. Now that our bucket is ready, we can upload the data we would like to query. In addition to the bucket, we will want to put our data into a subfolder to keep things organized as we proceed through later exercises. We have an entire chapter dedicated to organizing and optimizing the layout of your data in S3. For now, let's just upload the data to a subfolder called tables/nyc_taxi using the following AWS CLI command. Be sure to replace the bucket name, packt-serverless-analytics, in the following example command with the name of your bucket:
    aws s3 cp ./yellow_tripdata_2020-06.csv.gz \
    s3://packt-serverless-analytics/tables/nyc_taxi/yellow_tripdata_2020-06.csv.gz

    This command may take a few moments to complete since it needs to upload our roughly 10 MB file over the internet to Amazon S3. If you get a permission error or a message about access being denied, double-check that you used the right bucket name.

  5. If the command seems to have finished running without issue, you can use the following command to confirm the file is where we expect. Be sure to replace the example bucket with your actual bucket name:
    aws s3 ls s3://packt-serverless-analytics/tables/nyc_taxi/
  6. Now that we have confirmed our sample data is where we expect, we need to add this data to our Metastore, as described in the What is Amazon Athena? section. To do this, we will use AWS Glue Data Catalog as our Metastore by creating a database to house our table. Remember that Data Catalog will not store our data, just details about where engines such as Athena can find it (for example, S3) and what format was used to store the data (for example, CSV). Unlike Amazon S3 bucket names, database and table names do not need to be globally unique, so multiple accounts can have databases and tables with the same name and you can use the following commands as-is, without needing to rename anything. If you already have a database that you would like to use, you can skip creating a new database, but be sure to substitute your database name into subsequent commands; otherwise, they will fail:
    aws glue create-database \
    --database-input "{\"Name\":\"packt_serverless_analytics\"}" \
    --region us-east-1
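
     If you would like to confirm that the database was registered (an optional check, not part of the original steps), you can ask Glue to describe it:

     aws glue get-database \
     --name packt_serverless_analytics \
     --region us-east-1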

Now that both our data and Metastore are ready, we can define our table right from Athena itself by running our first query.

Running your first query

Athena supports both Data Definition Language (DDL) and Data Manipulation Language (DML) queries. Queries where you SELECT data from a table are a common example of DML queries. Our first meaningful Athena query will be a DDL query that creates, or defines, our NYC Taxi data table:

  1. Let's begin by ensuring our AWS account and IAM user/role are ready to use Athena. To do that, navigate to the Athena query editor in the AWS Console: https://console.aws.amazon.com/athena/home.

    Be sure to use the same region that you uploaded your data and created your database in.

  2. If this is your first time using Athena, you will likely be met by a screen like the following. Luckily, Athena is telling us that "Before you run your first query, you need to set up a query result location in Amazon S3…". Since Athena writes the results of all queries to S3, even DDL queries, we will need to configure this setting before we can proceed. To do so, click on the highlighted text in the AWS Console that's shown in the following screenshot:
    Figure 1.1 – The prompt for setting the query result's location upon your first visit to Athena

  3. After clicking on the modal's link, you will see the following prompt so that you can set your query result's location. You can use the same S3 bucket we used to upload our sample data, with results being used as the name of the folder that Athena will write query results to within that bucket. Be sure your location ends with a "/" to avoid errors:
Figure 1.2 – Athena's settings prompt for the query result's location
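
If you prefer to script this configuration, the same query result location can also be set on the default workgroup with the AWS CLI. The following is a sketch that assumes you are using the default primary workgroup; substitute your own bucket name:

    # Point the primary workgroup's query results at your bucket's results/ folder
    aws athena update-work-group \
      --work-group primary \
      --configuration-updates ResultConfigurationUpdates={OutputLocation=s3://YOUR_BUCKET_NAME/results/}

Either way, Athena writes a result file (plus a small metadata file) to that location for every query you run, including DDL queries.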

Next, let's learn how to create a table.

Creating your first table

It is now time to run our first Athena query. The following DDL query asks Athena to create a new table called nyc_taxi in the packt_serverless_analytics database, which is stored in the AWS Glue Data Catalog. The query also specifies the schema (columns), file format, and storage location of the table. For now, the other nuances of this create query are unimportant. You may find it easier to copy the CREATE TABLE statement from the create_nyc_taxi.sql (http://bit.ly/3mXj3K0) file in the chapter_1 folder of this book's GitHub repository. Paste it into Athena's query editor, change LOCATION so that it matches your bucket name, and click Run query. It should complete in a few seconds:

CREATE EXTERNAL TABLE `packt_serverless_analytics`.`nyc_taxi`(
  `vendorid` bigint, 
  `tpep_pickup_datetime` string, 
  `tpep_dropoff_datetime` string, 
  `passenger_count` bigint, 
  `trip_distance` double, 
  `ratecodeid` bigint, 
  `store_and_fwd_flag` string, 
  `pulocationid` bigint, 
  `dolocationid` bigint, 
  `payment_type` bigint, 
  `fare_amount` double, 
  `extra` double, 
  `mta_tax` double, 
  `tip_amount` double, 
  `tolls_amount` double, 
  `improvement_surcharge` double, 
  `total_amount` double, 
  `congestion_surcharge` double)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<YOUR_BUCKET_NAME>/tables/nyc_taxi/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'columnsOrdered'='true', 
  'compressionType'='gzip', 
  'delimiter'=',',
  'skip.header.line.count'='1', 
  'typeOfData'='file')

Once your table creation DDL query completes, the left navigation pane of the Athena console will refresh with the definition of your new table. If you have other databases and tables, you may need to choose your database from the dropdown before your new table will appear.

Figure 1.3 – Athena's Database navigator will show the schema of your newly created table

At this point, the significance of the query we just ran may not be entirely apparent, but rest assured we will go deeper into why serverless DDL queries are a powerful thing. Oh, and did we mention that Athena does not charge for DDL queries?

Running your first analytics queries

When working with a new or unfamiliar set of data, it can be helpful to view a sample of the rows before exploring the dataset in more meaningful ways. This allows you to get a feel for your dataset, including verifying that the schema (for example, the column names) matches the values and types. There are a few ways to do this, including the following limit query:

SELECT * from packt_serverless_analytics.nyc_taxi limit 100

This works fine in most cases, but we can do better. Many query engines, Athena included, will end up returning all 100 rows requested in the preceding query from the same S3 object. If your dataset contains many objects or files, you are getting an extremely narrow view of the table. For that reason, I prefer using the following query to view data from a broader portion of the dataset:

SELECT *
FROM packt_serverless_analytics.nyc_taxi TABLESAMPLE BERNOULLI (1) 
limit 100

This query is like the earlier limit query but uses Athena's TABLESAMPLE feature to obtain our 100 requested rows using BERNOULLI sampling. When a table is sampled using the Bernoulli method, all the objects of the table may be scanned, as opposed to likely stopping after the first object. This is because the probability of a row being included in the result is independent of any other row, reducing the significance of the object scan order. In the following screenshot, we can see some of the rows that were returned using TABLESAMPLE with the BERNOULLI method:

Figure 1.4 – Results of executing TABLESAMPLE against our nyc_taxi table

While that query allowed us to confirm that Athena can indeed access our data and that the schema appears to match the data itself, we have not extracted any real insights from the data. For this, we will run our first real analytics query by generating a histogram of ride durations and distances. Our goal here is to learn how much time people are typically spending in taxis, but we'll also be able to gain insights into the quality of our data. The following query uses Athena's numeric_histogram function to approximate the distribution with 10 buckets according to the difference between tpep_pickup_datetime and tpep_dropoff_datetime. Since the dataset stores datetimes as strings, we are using the date_parse function to convert the values into actual timestamps that we can then use with Athena's date_diff function to generate the ride durations in minutes. Lastly, the query uses a CROSS JOIN with UNNEST to turn the histogram into rows and columns. Normally, the numeric_histogram function returns a map containing the histogram, but this can be difficult to read. UNNEST helps us turn it into a more intuitive tabular format. Do not worry about remembering all these functions and SQL techniques right now. Athena frequently adds new capabilities, and you can always consult a reference.

You can copy the following code from GitHub at http://bit.ly/2Jm6o5v:

SELECT ride_minutes, number_rides
    FROM (SELECT numeric_histogram(10,
        date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         )
    )
FROM packt_serverless_analytics.nyc_taxi ) AS x (ride_histogram)
CROSS JOIN 
    UNNEST(ride_histogram) AS t (ride_minutes, number_rides);

Once you run the query, the results will look as follows. You can experiment with the number of buckets that are generated by adjusting the parameters of the numeric_histogram function. Generating 100 or even 1,000 buckets can uncover patterns that were hidden with fewer buckets. Even with just 10 buckets, we can already see a strong correlation between ride duration and the number of rides. I was surprised to see that such a large portion of the yellow cab rides lasted less than 7 minutes. From this query, we can also see some likely data quality issues in the dataset. Unless one of the June 2020 rides happened in a time-traveling DeLorean, we likely have an erroneous record. Less obvious is the fact that several hundred rides claim to have lasted longer than 24 hours:

Figure 1.5 – Ride duration histogram results

Let's try one more histogram query, but this time, we will target the trip distance of the rides that took less than 7 minutes. The following code block contains the modified histogram query you can run to understand that bucket of rides. You can download it from GitHub at http://bit.ly/3hkggJl:

SELECT trip_distance, number_rides
FROM 
    (SELECT numeric_histogram(5,trip_distance)
       FROM packt_serverless_analytics.nyc_taxi 
       WHERE date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         ) <= 6.328061
    ) AS x (ride_histogram)
CROSS JOIN UNNEST(ride_histogram) AS t (trip_distance , number_rides);

Considering that the average person can walk a mile in 15 minutes, New Yorkers must be in a serious hurry to opt for taxi rides instead of a 15-minute walk!

Figure 1.6 – Ride distance histogram results

With that, we've covered the basics of Amazon Athena. Let's conclude by providing a recap of what we've learned.

Summary

In this chapter, you saw just how easy it is to get started running queries with Athena. We obtained sample data from the NYC TLC, used it to create a table in our S3-based data lake, and ran some analytics queries to understand the insights contained in that data. Since Athena is serverless, we spent absolutely no time setting up any infrastructure or software. Incredibly, all the operations we ran in this chapter cost less than $0.00135. Without the serverless aspect of Athena, we would have found ourselves purchasing many thousands of dollars of hardware or hundreds of dollars in cloud resources to run these basic exercises.

While the main goals of this chapter were to orient you to the uniquely serverless experience of using Amazon Athena, there are a few concepts worth remembering as you continue reading. The first is the role of the Metastore. We saw that uploading our data to S3 was not enough for Athena to query the data. We also needed to register the location, schema, and file format as a table in AWS Glue Data Catalog. Once our table was defined, it became queryable from Athena. Chapter 3, Key Features, Query Types, and Functions, will cover this topic in greater depth.

The next important thing we saw was the feature-rich SQL dialect we used in our basic analytics queries. Since Athena utilizes a customized variant of Presto, you can refer to Presto's documentation (https://prestodb.io/docs/current/) as a supplement for Athena's documentation.

Chapter 2, Introduction to Amazon Athena, will go deeper into Athena's capabilities and open source roots so that you can understand when to use Athena, as well as how you can gain deeper insight into specific behaviors of the service.


Key benefits

  • Explore the promising capabilities of Amazon Athena and Athena’s Query Federation SDK
  • Use Athena to prepare data for common machine learning activities
  • Cover best practices for setting up connectivity between your application and Athena and security considerations

Description

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure. This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 Data Lake using Athena, including how to structure your tables using open-source file formats like Parquet. You’ll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you’ll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you’ll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you’ll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server. By the end of this book, you’ll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today’s ML modeling exercises.

What you will learn

  • Secure and manage the cost of querying your data
  • Use Athena ML and User Defined Functions (UDFs) to add advanced features to your reports
  • Write your own Athena Connector to integrate with a custom data source
  • Discover your datasets on S3 using AWS Glue Crawlers
  • Integrate Amazon Athena into your applications
  • Set up Identity and Access Management (IAM) policies to limit access to tables and databases in Glue Data Catalog
  • Add an Amazon SageMaker Notebook to your Athena queries
  • Get to grips with using Athena for ETL pipelines


Table of Contents

Preface
Section 1: Fundamentals Of Amazon Athena
Chapter 1: Your First Query
Chapter 2: Introduction to Amazon Athena
Chapter 3: Key Features, Query Types, and Functions
Section 2: Building and Connecting to Your Data Lake
Chapter 4: Metastores, Data Sources, and Data Lakes
Chapter 5: Securing Your Data
Chapter 6: AWS Glue and AWS Lake Formation
Section 3: Using Amazon Athena
Chapter 7: Ad Hoc Analytics
Chapter 8: Querying Unstructured and Semi-Structured Data
Chapter 9: Serverless ETL Pipelines
Chapter 10: Building Applications with Amazon Athena
Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting
Section 4: Advanced Topics
Chapter 12: Athena Query Federation
Chapter 13: Athena UDFs and ML
Chapter 14: Lake Formation – Advanced Topics
Other Books You May Enjoy

