
Serverless Analytics with Amazon Athena: Query structured, unstructured, or semi-structured data in seconds without setting up any infrastructure

By Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick

Product Details

Publication date: Nov 19, 2021
Length: 438 pages
Edition: 1st Edition
Language: English
ISBN-13: 9781800562349

Serverless Analytics with Amazon Athena

Chapter 1: Your First Query

This chapter is all about introducing you to the serverless analytics experience offered by Amazon Athena. Data is one of the most valuable assets you and your company generate. In recent years, we have seen a revolution in data retention, where companies are capturing all manner of data that was once ignored. Everything from logs to clickstream data to support tickets is now routinely kept for years. Interestingly, the data itself is not what is valuable. Instead, the insights that are buried in that mountain of data are what we are after. Certainly, increased awareness and retention have made the information we need to power our businesses, applications, and decisions more available, but the explosion in data sizes has made the insights we seek less accessible. What could once fit nicely in a traditional RDBMS, such as Oracle, now requires a distributed filesystem such as HDFS and an accompanying Massively Parallel Processing (MPP) engine such as Spark to run even the most basic of queries in a timely fashion.

Enter Amazon Athena. Unlike traditional analytics engines, Amazon Athena is a fully managed offering. You will never have to set up any servers or tune cryptic settings to get your queries running. This allows you to focus on what is most important: using data to generate insights that drive your business. This ease of use is precisely why this first chapter is all about getting hands-on and running your first query. Whether you are a seasoned analytics veteran or a newcomer to the space, this chapter will give you the knowledge you need to be running your first Athena query in less than 30 minutes. For now, we will simplify things to demonstrate why so many people choose Amazon Athena for their workloads. This will help establish your mental model for the deeper discussions, features, and examples of later sections.

In this chapter, we will cover the following topics:

  • What is Amazon Athena?
  • Obtaining and preparing sample data
  • Running your first query

Technical requirements

Wherever possible, we will provide samples or instructions to guide you through the setup. However, to complete the activities in this chapter, you will need to ensure you have the following prerequisites available. Our command-line examples will be executed using Ubuntu, but most flavors of Linux should also work without modification.

You will need internet access to GitHub, S3, and the AWS Console.

You will also require a computer with the following installed:

  • Chrome, Safari, or Microsoft Edge
  • The AWS CLI
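
If you want to sanity-check your environment before continuing, the following commands are a quick, optional way (not part of the book's steps) to confirm the CLI is installed and that your credentials resolve to the account and IAM principal you intend to use:

    # Print the installed AWS CLI version
    aws --version
    # Show the account ID and IAM user/role behind your current credentials
    aws sts get-caller-identity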

In addition, this chapter requires you to have an AWS account and accompanying IAM user (or role) with sufficient privileges to complete the activities in this chapter. Throughout this book, we will provide detailed IAM policies that attempt to honor the age-old best practice of "least privilege." For simplicity, you can always run through these exercises with a user that has full access, but we recommend that you use scoped-down IAM policies to avoid making costly mistakes and to learn more about how to best use IAM to secure your applications and data. You can find the suggested IAM policy for this chapter in this book's accompanying GitHub repository, listed as chapter_1/iam_policy_chapter_1.json:

https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/tree/main/chapter_1

This policy includes the following (see the sketch after this list for one way to attach the full policy file from the command line):

  • Read and Write access to one S3 bucket using the following actions:
    • s3:PutObject: Used to upload data and also for Athena to write query results.
    • s3:GetObject: Used by Athena to read data.
    • s3:ListBucketMultipartUploads: Used by Athena to write query results.
    • s3:AbortMultipartUpload: Used by Athena to write query results.
    • s3:ListBucketVersions
    • s3:CreateBucket: Used by you if you don't already have a bucket you can use.
    • s3:ListBucket: Used by Athena to read data.
    • s3:DeleteObject: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
    • s3:ListMultipartUploadParts: Used by Athena to write a result.
    • s3:ListAllMyBuckets: Used by Athena to ensure you own the results bucket.
    • s3:ListJobs: Used by Athena to write results.
  • Read and Write access to one Glue Data Catalog database, using the following actions:
    • glue:DeleteDatabase: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
    • glue:GetPartitions: Used by Athena to query your data in S3.
    • glue:UpdateTable: Used when we import our sample data.
    • glue:DeleteTable: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
    • glue:CreatePartition: Used when we import our sample data.
    • glue:UpdatePartition: Used when we import our sample data.
    • glue:UpdateDatabase: Used when we import our sample data.
    • glue:CreateTable: Used when we import our sample data.
    • glue:GetTables: Used by Athena to query your data in S3.
    • glue:BatchGetPartition: Used by Athena to query your data in S3.
    • glue:GetDatabases: Used by Athena to query your data in S3.
    • glue:GetTable: Used by Athena to query your data in S3.
    • glue:GetDatabase: Used by Athena to query your data in S3.
    • glue:GetPartition: Used by Athena to query your data in S3.
    • glue:CreateDatabase: Used to create a database if you don't already have one you can use.
    • glue:DeletePartition: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
  • Access to run Athena queries.
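
You do not need the console to attach this policy. Assuming you have downloaded chapter_1/iam_policy_chapter_1.json from the repository and edited any placeholders it contains (such as the bucket name), a minimal sketch for attaching it to your IAM user as an inline policy with the AWS CLI looks like this; YOUR_IAM_USER is a placeholder you must substitute:

    # Attach the chapter's policy document to your IAM user as an inline policy
    aws iam put-user-policy \
      --user-name YOUR_IAM_USER \
      --policy-name packt-athena-chapter-1 \
      --policy-document file://iam_policy_chapter_1.json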

    Important Note

    We recommend against using Firefox with the Amazon Athena console as we have found, and reported, a bug associated with switching between certain elements in the UX.

What is Amazon Athena?

Amazon Athena is a query service that allows you to run standard SQL over data stored in a variety of sources and formats. As you will see later in this chapter, Athena is serverless, so there is no infrastructure to set up or manage. You simply pay $5 per TB scanned for the queries you run without needing to worry about idle resources or scaling.
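
As a quick back-of-the-envelope illustration of that rate, a query that scans 200 GB of data costs roughly 200/1,024 TB multiplied by $5, or about $0.98, while a full scan of the roughly 10 MB sample file used later in this chapter costs a small fraction of a cent.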

Note

AWS has a habit of reducing prices over time. For the latest Athena pricing, please consult the Amazon Athena product page at https://aws.amazon.com/athena/pricing/?nc=sn&loc=3.

Athena is based on Presto (https://prestodb.io/), a distributed SQL engine that was open sourced by Facebook. It supports ANSI SQL, as well as Presto SQL features ranging from geospatial functions to approximate query extensions, which allow you to run approximation queries, with statistically bounded errors, over large datasets in only a fraction of the time. Athena's commitment to open source also provides an interesting avenue to avoid lock-in concerns because you always have the option to download and manage your own Presto deployment from GitHub. Of course, you will lose many of Athena's enhancements and must manage the infrastructure yourself, but you can take comfort in knowing you are not beholden to potentially punitive licensing agreements as you might be with other vendors.

While Athena's roots are open source, the team at AWS has added several enterprise features to the service, including the following:

  • Federated Identity via SAML and Active Directory support
  • Table, column, and even row-level access control via Lake Formation
  • Workload classification and grouping for cost control via WorkGroups
  • Automated regression testing to take the pain out of upgrades

Later chapters will cover these topics in greater detail. If you feel compelled to do so, you can use the table of contents to skip directly to those chapters and learn more.

Let's look at some use cases for Athena.

Use cases

Amazon Athena supports a wide range of use cases and we have personally used it for several different patterns. Thanks to Athena's ease of use, it is extremely common to leverage Athena for ad hoc analysis and data exploration.

Later in this book, you will use Athena from within a Jupyter notebook for machine learning. Similarly, many analysts enjoy using Athena directly from BI tools such as Looker and Tableau, courtesy of Athena's JDBC driver. Athena's robust SQL dialect and asynchronous API model also enable application developers to build analytics right into their applications, enabling features that would not previously have been practical due to scale or operational burden. In many cases, you can replace RDBMS-driven features with Athena at a fraction of the cost and with a lower operational burden.
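
To make the asynchronous model concrete, here is a minimal sketch of the same submit-poll-fetch pattern using the AWS CLI rather than the JDBC driver. It assumes a query result location such as the results/ folder you will configure later in this chapter, and uses a trivial placeholder query:

    # Submit the query; Athena immediately returns an execution ID
    QUERY_ID=$(aws athena start-query-execution \
      --query-string "SELECT 1 AS sanity_check" \
      --result-configuration OutputLocation=s3://YOUR_BUCKET_NAME/results/ \
      --query 'QueryExecutionId' --output text)
    # Poll the status; applications typically loop until SUCCEEDED or FAILED
    aws athena get-query-execution --query-execution-id "$QUERY_ID" \
      --query 'QueryExecution.Status.State'
    # Fetch the result rows once the query has succeeded
    aws athena get-query-results --query-execution-id "$QUERY_ID"

Applications follow the same flow through the AWS SDKs, which is what makes it practical to embed Athena queries behind an API or UI without holding a connection open.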

Another emerging use case for Athena is in the ETL space. While Athena advertises itself as an engine that avoids the need for ETL by querying data in place, as it is, we have seen the benefits of replacing existing ETL pipelines, or building new ones, with Athena where cost and capacity management are key factors. Athena will not necessarily achieve the same scale or performance as Spark, for example, but if your ETL jobs do not require multi-TB joins, you might find Athena to be an interesting option.

Separation of storage and compute

If you are new to serverless analytics, you may be wondering where your data is stored. Amazon Athena builds on the concept of Separation of Storage and Compute to decouple the computational resources (for example, CPU, memory, network) that do the heavy lifting of executing your SQL queries from the responsibility of keeping your data safe and available. In short, this means Athena itself does not store your data. Instead, you are free to choose from several data stores, with customers increasingly pairing DynamoDB for rapidly changing data with S3 for their bulk data. With Athena, you can easily write a query that spans both data stores.

Amazon's Simple Storage Service, or S3 for short, is easily the most recommended data store to use with Athena. When Athena launched in 2016, S3 was the first data store it supported. Unsurprisingly, Athena has been optimized to take advantage of S3's unique ability to deliver exabyte scale and throughput while still providing eleven nines (99.999999999%) of durability. In addition to effortless scaling from a few gigabytes of data up to many petabytes, S3 offers some of the lowest prices for performance that you can find. Depending on your replication requirements, storing 1 GB of data for a month will cost you between $0.01 and $0.023. Even the most cost-efficient enterprise hard drives cost around $0.21 per GB before you add on redundancy, the power to run them, or a server and data center to house them. As with most AWS services, you should consult S3's pricing page (https://aws.amazon.com/s3/pricing/) for the latest details since AWS has cut their prices more than 70 times in the last decade.
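
At those storage rates, the roughly 10 MB sample file we upload later in this chapter works out to well under a tenth of a cent per month, which is why experimenting with Athena on small datasets is essentially free.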

Metastore

In addition to accessing the raw 1s and 0s that represent your data, Athena also requires metadata that helps its SQL engine understand how to interpret the data you have stored in S3 or elsewhere. This supplemental information helps Athena map collections of files, or objects in the case of S3, to SQL constructs such as tables, columns, and rows. The repository for this data about your data is often called a metastore. Athena works with Hive-compliant metastores, including AWS's Glue Data Catalog service. In later chapters, we will look at AWS Glue Data Catalog in more detail, as well as how you can attach Athena to your own metastore, even a homegrown one. For now, all you need to know is that Athena requires a metastore to discover key attributes of the data you wish to query. The most common pieces of information kept in the metastore include the following (the sketch after this list shows how to retrieve them for a table):

  • A list of tables that exist
  • The storage location of each table (for example, the S3 path or DynamoDB table name)
  • The format of the files or objects that comprise the table (for example, CSV, Parquet, JSON)
  • The column names and data types in each table (for example, the inventory column is an integer, while revenue is a decimal(10,2))
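
Later in this chapter, we will create exactly such an entry for our sample data. Once it exists, you can inspect these attributes yourself; a minimal sketch using the AWS CLI (assuming the packt_serverless_analytics database and nyc_taxi table created in the following sections, and the us-east-1 region used throughout the examples) looks like this:

    # Ask the Glue Data Catalog for the table's location, format, and columns
    aws glue get-table \
      --database-name packt_serverless_analytics \
      --name nyc_taxi \
      --region us-east-1

The response contains the S3 location, the serialization/format details, and the column names and types, which is exactly the metadata Athena reads before planning a query.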

Now that we have a good overview of Amazon Athena, let's look at how to use it in practice.

Obtaining and preparing sample data

Before we can start running our first query, we will need some data that we would like to analyze. Throughout this book, we will try to make use of open datasets that you can freely access but that also contain interesting information that may mirror your real-world datasets. In this chapter, we will be making use of the NYC Taxi & Limousine Commission's (TLC's) Trip Record Data for New York City's iconic yellow taxis. Yellow taxis have been recording and providing ride data to TLC since 2009. Yellow taxis are traditionally hailed by signaling to a driver who is on duty and seeking a passenger (also known as a street hail). In recent years, yellow taxis have also started to use their own ride-hailing apps such as Curb and Arro to keep pace with emerging ride-hailing technologies from Uber and Lyft. However, yellow taxis remain the only vehicles permitted to respond to street hails from passengers in NYC. For that reason, the dataset often has interesting patterns that can be correlated with other events in the city, such as a concert or inclement weather.

Our exercise will focus on just one of the many datasets offered by the TLC. The yellow taxi dataset includes the following fields:

  • VendorID: A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
  • tpep_pickup_datetime: The date and time when the meter was engaged.
  • tpep_dropoff_datetime: The date and time when the meter was disengaged.
  • Passenger_count: The number of passengers in the vehicle.
  • Trip_distance: The elapsed trip distance in miles reported by the taximeter.
  • RateCodeID: The final rate code in effect at the end of the trip. 1= Standard rate, 2= JFK, 3= Newark, 4= Nassau or Westchester, 5= Negotiated fare, 6= Group ride.
  • Store_and_fwd_flag: This flag indicates whether the trip record was held in the vehicle's memory before being sent to the vendor, also known as "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip, while N= not a store and forward trip.
  • pulocationid: Location where the meter was engaged.
  • dolocationid: Location where the meter was disengaged.
  • Payment_type: A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip.
  • Fare_amount: The time-and-distance fare calculated by the meter.
  • Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
  • MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
  • Improvement_surcharge: $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
  • Tip_amount: This field is automatically populated for credit card tips. Cash tips are not included.
  • Tolls_amount: Total amount of all tolls paid in a trip.
  • Total_amount: The total amount charged to passengers. Does not include cash tips.
  • congestion_surcharge: Amount of surcharges associated with time/traffic fees imposed by the city.

This dataset is easy to obtain and is relatively interesting to run analytics against. The inconsistency in the field naming, a mixture of camel case and underscore conventions, is difficult to overlook, but we will normalize the names later:

  1. Our first step is to download the Trip Record Data for June 2020. You can obtain this directly from the NYC TLC's website (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) or our GitHub repository using the following command:
    wget https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/raw/main/chapter_1/yellow_tripdata_2020-06.csv.gz

    If you choose to download it from the NYC TLC directly, please gzip the file before proceeding to the next step.
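
     For example, assuming the raw CSV is in your current directory and the filename matches the copy in our GitHub repository, compressing it is a single command:

     gzip yellow_tripdata_2020-06.csv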

  2. Now that we have some data, we can add it to our data lake by uploading it to Amazon S3. To do this, we must create an S3 bucket. If you already have an S3 bucket that you plan to use, you can skip creating a new bucket. However, we do encourage you to avoid completing these exercises in accounts that house production workloads. As a best practice, all experimentation and learning should be done in isolation.
  3. Once you have picked a bucket name and the region that you would like to use for these exercises, you can run the following command:
    aws s3api create-bucket \
    --bucket packt-serverless-analytics \
    --region us-east-1

    Important Note

    Be sure to substitute your bucket name and region. You can also create buckets directly from the AWS Console by logging in and navigating to S3 from the service list. Later in this chapter, we will use the AWS Console to edit and run our Athena queries. For simple operations, the AWS CLI can be faster and makes it easier to see what is happening, since the AWS Console can hide multi-step operations behind a single button.

  4. Now that our bucket is ready, we can upload the data we would like to query. In addition to the bucket, we will want to put our data into a subfolder to keep things organized as we proceed through later exercises. We have an entire chapter dedicated to organizing and optimizing the layout of your data in S3. For now, let's just upload the data to a subfolder called tables/nyc_taxi using the following AWS CLI command. Be sure to replace the bucket name, packt-serverless-analytics, in the following example command with the name of your bucket:
    aws s3 cp ./yellow_tripdata_2020-06.csv.gz \
    s3://packt-serverless-analytics/tables/nyc_taxi/yellow_tripdata_2020-06.csv.gz

    This command may take a few moments to complete since it needs to upload our roughly 10 MB file over the internet to Amazon S3. If you get a permission error or a message about access being denied, double-check that you used the right bucket name.

  5. If the command seems to have finished running without issue, you can use the following command to confirm the file is where we expect. Be sure to replace the example bucket with your actual bucket name:
    aws s3 ls s3://packt-serverless-analytics/tables/nyc_taxi/
  6. Now that we have confirmed our sample data is where we expect, we need to add this data to our Metastore, as described in the What is Amazon Athena? section. To do this, we will use AWS Glue Data Catalog as our Metastore by creating a database to house our table. Remember that Data Catalog will not store our data, just details about where engines such as Athena can find it (for example, S3) and what format was used to store the data (for example, CSV). Unlike Amazon S3 bucket names, database and table names do not need to be globally unique, so multiple accounts can have databases and tables with the same name and you can use the following commands as-is, without needing to rename anything. If you already have a database that you would like to use, you can skip creating a new database, but be sure to substitute your database name into subsequent commands; otherwise, they will fail:
    aws glue create-database \
    --database-input "{\"Name\":\"packt_serverless_analytics\"}" \
    --region us-east-1
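
     If you would like to confirm that the database was registered (an optional check, not part of the original steps), you can ask Glue to describe it:

     aws glue get-database \
     --name packt_serverless_analytics \
     --region us-east-1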

Now that both our data and Metastore are ready, we can define our table right from Athena itself by running our first query.

Running your first query

Athena supports both Data Definition Language (DDL) and Data Manipulation Language (DML) queries. Queries where you SELECT data from a table are a common example of DML queries. Our first meaningful Athena query will be a DDL query that creates, or defines, our NYC Taxi data table:

  1. Let's begin by ensuring our AWS account and IAM user/role are ready to use Athena. To do that, navigate to the Athena query editor in the AWS Console: https://console.aws.amazon.com/athena/home.

    Be sure to use the same region that you uploaded your data and created your database in.

  2. If this is your first time using Athena, you will likely be met by a screen like the following. Luckily, Athena is telling us that "Before you run your first query, you need to set up a query result location in Amazon S3…". Since Athena writes the results of all queries to S3, even DDL queries, we will need to configure this setting before we can proceed. To do so, click on the highlighted text in the AWS Console that's shown in the following screenshot:
    Figure 1.1 – The prompt for setting the query result's location upon your first visit to Athena

  3. After clicking on the modal's link, you will see the following prompt so that you can set your query result's location. You can use the same S3 bucket we used to upload our sample data, with results being used as the name of the folder that Athena will write query results to within that bucket. Be sure your location ends with a "/" to avoid errors:
Figure 1.2 – Athena's settings prompt for the query result's location
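
If you prefer to script this configuration, the same query result location can also be set on the default workgroup with the AWS CLI. The following is a sketch that assumes you are using the default primary workgroup; substitute your own bucket name:

    # Point the primary workgroup's query results at your bucket's results/ folder
    aws athena update-work-group \
      --work-group primary \
      --configuration-updates ResultConfigurationUpdates={OutputLocation=s3://YOUR_BUCKET_NAME/results/}

Either way, Athena writes a result file (plus a small metadata file) to that location for every query you run, including DDL queries.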

Next, let's learn how to create a table.

Creating your first table

It is now time to run our first Athena query. The following DDL query asks Athena to create a new table called nyc_taxi in the packt_serverless_analytics database, which is stored in the AWS Glue Data Catalog. The query also specifies the schema (columns), file format, and storage location of the table. For now, the other nuances of this create query are unimportant. You may find it easier to copy the CREATE TABLE statement from the create_nyc_taxi.sql (http://bit.ly/3mXj3K0) file in the chapter_1 folder of this book's GitHub repository. Paste it into Athena's query editor, change LOCATION so that it matches your bucket name, and click Run query. It should complete in a few seconds:

CREATE EXTERNAL TABLE `packt_serverless_analytics`.`nyc_taxi`(
  `vendorid` bigint, 
  `tpep_pickup_datetime` string, 
  `tpep_dropoff_datetime` string, 
  `passenger_count` bigint, 
  `trip_distance` double, 
  `ratecodeid` bigint, 
  `store_and_fwd_flag` string, 
  `pulocationid` bigint, 
  `dolocationid` bigint, 
  `payment_type` bigint, 
  `fare_amount` double, 
  `extra` double, 
  `mta_tax` double, 
  `tip_amount` double, 
  `tolls_amount` double, 
  `improvement_surcharge` double, 
  `total_amount` double, 
  `congestion_surcharge` double)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<YOUR_BUCKET_NAME>/tables/nyc_taxi/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'columnsOrdered'='true', 
  'compressionType'='gzip', 
  'delimiter'=',',
  'skip.header.line.count'='1', 
  'typeOfData'='file')

Once your table creation DDL query completes, the left navigation pane of the Athena console will refresh with the definition of your new table. If you have other databases and tables, you may need to choose your database from the dropdown before your new table will appear.

Figure 1.3 – Athena's Database navigator will show the schema of your newly created table

At this point, the significance of the query we just ran may not be entirely apparent, but rest assured we will go deeper into why serverless DDL queries are a powerful thing. Oh, and did we mention that Athena does not charge for DDL queries?

Running your first analytics queries

When working with a new or unfamiliar set of data, it can be helpful to view a sample of the rows before exploring the dataset in more meaningful ways. This allows you to get a feel for your dataset, including verifying that the schema (for example, the column names) matches the values and types. There are a few ways to do this, including the following limit query:

SELECT * from packt_serverless_analytics.nyc_taxi limit 100

This works fine in most cases, but we can do better. Many query engines, Athena included, will end up returning all 100 rows requested in the preceding query from the same S3 object. If your dataset contains many objects or files, you are getting an extremely narrow view of the table. For that reason, I prefer using the following query to view data from a broader portion of the dataset:

SELECT *
FROM packt_serverless_analytics.nyc_taxi TABLESAMPLE BERNOULLI (1) 
limit 100

This query is like the earlier limit query but uses Athena's TABLESAMPLE feature to obtain our 100 requested rows using BERNOULLI sampling. When a table is sampled using the Bernoulli method, all the objects of the table may be scanned, as opposed to likely stopping after the first object. This is because the probability of a row being included in the result is independent of any other row, reducing the significance of the object scan order. In the following screenshot, we can see some of the rows that were returned using TABLESAMPLE with the BERNOULLI method:

Figure 1.4 – Results of executing TABLESAMPLE against our nyc_taxi table

While that query allowed us to confirm that Athena can indeed access our data and that the schema appears to match the data itself, we have not extracted any real insights from the data. For this, we will run our first real analytics query by generating a histogram of ride durations and distances. Our goal here is to learn how much time people are typically spending in taxis, but we'll also be able to gain insights into the quality of our data. The following query uses Athena's numeric_histogram function to approximate the distribution with 10 buckets according to the difference between tpep_pickup_datetime and tpep_dropoff_datetime. Since the dataset stores datetimes as strings, we are using the date_parse function to convert the values into actual timestamps that we can then use with Athena's date_diff function to generate the ride durations in minutes. Lastly, the query uses a CROSS JOIN with UNNEST to turn the histogram into rows and columns. Normally, the numeric_histogram function returns a map containing the histogram, but this can be difficult to read. UNNEST helps us turn it into a more intuitive tabular format. Do not worry about remembering all these functions and SQL techniques right now. Athena frequently adds new capabilities, and you can always consult a reference.

You can copy the following code from GitHub at http://bit.ly/2Jm6o5v:

SELECT ride_minutes, number_rides
    FROM (SELECT numeric_histogram(10,
        date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         )
    )
FROM packt_serverless_analytics.nyc_taxi ) AS x (ride_histogram)
CROSS JOIN 
    UNNEST(ride_histogram) AS t (ride_minutes, number_rides);

Once you run the query, the results will look as follows. You can experiment with the number of buckets that are generated by adjusting the parameters of the numeric_histogram function. Generating 100 or even 1,000 buckets can uncover patterns that were hidden with fewer buckets. Even with just 10 buckets, we can already see a strong correlation between ride duration and the number of rides. I was surprised to see that such a large portion of the yellow cab rides lasted less than 7 minutes. From this query, we can also see some likely data quality issues in the dataset. Unless one of the June 2020 rides happened in a time-traveling DeLorean, we likely have an erroneous record. Less obvious is the fact that several hundred rides claim to have lasted longer than 24 hours:

Figure 1.5 – Ride duration histogram results

Let's try one more histogram query, but this time, we will target the trip distance of the rides that took less than 7 minutes. The following code block contains the modified histogram query you can run to understand that bucket of rides. You can download it from GitHub at http://bit.ly/3hkggJl:

SELECT trip_distance, number_rides
FROM 
    (SELECT numeric_histogram(5,trip_distance)
       FROM packt_serverless_analytics.nyc_taxi 
       WHERE date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         ) <= 6.328061
    ) AS x (ride_histogram)
CROSS JOIN UNNEST(ride_histogram) AS t (trip_distance , number_rides);

Considering that the average person can walk a mile in 15 minutes, New Yorkers must be in a serious hurry to opt for taxi rides instead of a 15-minute walk!

Figure 1.6 – Ride distance histogram results

With that, we've covered the basics of Amazon Athena. Let's conclude by providing a recap of what we've learned.

Summary

In this chapter, you saw just how easy it is to get started running queries with Athena. We obtained sample data from the NYC TLC, used it to create a table in our S3-based data lake, and ran some analytics queries to understand the insights contained in that data. Since Athena is serverless, we spent absolutely no time setting up any infrastructure or software. Incredibly, all the operations we ran in this chapter cost less than $0.00135. Without the serverless aspect of Athena, we would have found ourselves purchasing many thousands of dollars of hardware or hundreds of dollars in cloud resources to run these basic exercises.

While the main goals of this chapter were to orient you to the uniquely serverless experience of using Amazon Athena, there are a few concepts worth remembering as you continue reading. The first is the role of the Metastore. We saw that uploading our data to S3 was not enough for Athena to query the data. We also needed to register the location, schema, and file format as a table in AWS Glue Data Catalog. Once our table was defined, it became queryable from Athena. Chapter 3, Key Features, Query Types, and Functions, will cover this topic in greater depth.

The next important thing we saw was the feature-rich SQL dialect we used in our basic analytics queries. Since Athena utilizes a customized variant of Presto, you can refer to Presto's documentation (https://prestodb.io/docs/current/) as a supplement for Athena's documentation.

Chapter 2, Introduction to Amazon Athena, will go deeper into Athena's capabilities and open source roots so that you can understand when to use Athena, as well as how you can gain deeper insight into specific behaviors of the service.


Key benefits

  • Explore the promising capabilities of Amazon Athena and Athena’s Query Federation SDK
  • Use Athena to prepare data for common machine learning activities
  • Cover best practices for setting up connectivity between your application and Athena and security considerations

Description

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure. This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 Data Lake using Athena, including how to structure your tables using open-source file formats like Parquet. You’ll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you’ll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you’ll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you’ll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server. By the end of this book, you’ll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today’s ML modeling exercises.

What you will learn

  • Secure and manage the cost of querying your data
  • Use Athena ML and User Defined Functions (UDFs) to add advanced features to your reports
  • Write your own Athena Connector to integrate with a custom data source
  • Discover your datasets on S3 using AWS Glue Crawlers
  • Integrate Amazon Athena into your applications
  • Set up Identity and Access Management (IAM) policies to limit access to tables and databases in Glue Data Catalog
  • Add an Amazon SageMaker Notebook to your Athena queries
  • Get to grips with using Athena for ETL pipelines


Table of Contents

Preface
Section 1: Fundamentals Of Amazon Athena
Chapter 1: Your First Query
Chapter 2: Introduction to Amazon Athena
Chapter 3: Key Features, Query Types, and Functions
Section 2: Building and Connecting to Your Data Lake
Chapter 4: Metastores, Data Sources, and Data Lakes
Chapter 5: Securing Your Data
Chapter 6: AWS Glue and AWS Lake Formation
Section 3: Using Amazon Athena
Chapter 7: Ad Hoc Analytics
Chapter 8: Querying Unstructured and Semi-Structured Data
Chapter 9: Serverless ETL Pipelines
Chapter 10: Building Applications with Amazon Athena
Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting
Section 4: Advanced Topics
Chapter 12: Athena Query Federation
Chapter 13: Athena UDFs and ML
Chapter 14: Lake Formation – Advanced Topics
Other Books You May Enjoy

