You're reading from Serverless Analytics with Amazon Athena

Product type: Book

Published in: Nov 2021

Reading Level: Beginner

Publisher: Packt

ISBN-13: 9781800562349

Edition: 1st Edition

Languages

Python

Tools

Amazon Athena

Concepts

Data Processing

Authors (3):

Anthony Virtuoso

Mert Turkay Hocanin

Aaron Wishnick


Chapter 5: Securing Your Data

Data within an organization can be one of its most valuable assets. Data can drive business decisions, such as whom to advertise to and how. It can reveal how users behave on a website and how they react to sales, and it can help businesses identify inefficient processes. An organization can also package and sell that data to customers or other organizations, earning direct revenue from the information it collects. Regardless, all organizations should protect the data they have from both internal and external entities.

We have all heard stories where a data breach has occurred in a large institution. It is a harrowing and traumatic event for the organization. Governments may impose monetary penalties for breaking laws. Still, for most companies, losing the trust of customers or the public can be much more damaging. This is why large companies invest large amounts of resources into having dedicated security teams that provide...

Technical requirements

For this chapter, you will require the following:

Internet access to GitHub, S3, and the AWS Console.
A computer with Chrome, Safari, or Microsoft Edge and the AWS CLI version 2 installed.
An AWS account and accompanying IAM user (or role) with sufficient privileges to complete this chapter's activities. For simplicity, you can always run through these exercises with a user that has full access. However, we recommend using scoped-down IAM policies to avoid making costly mistakes and learn how to best use IAM to secure your applications and data. You can find a minimally scoped IAM policy for this chapter in this book's accompanying GitHub repository, which is listed as chapter_5/iam_policy_chapter_5.json (https://bit.ly/3qAcNtU). This policy includes the following:
- Permissions to create and list IAM roles and policies. We will be creating a service role for an AWS Glue Crawler to assume.
- Permissions to read, list, and write access to an...

General best practices to protect your data on AWS

In this section, we will go over some general best practices. However, before we do, we should understand some security basics. Let's start with what I call the five general pillars of security. They are as follows:

Authentication: Can the user or principal prove who they are? Access to AWS resources depends on IAM authentication through AWS credentials, which are like logins and passwords. These credentials can be long-lived, such as IAM user credentials, or short-lived, such as the AWS credentials that are provided when an IAM role is assumed. Throughout this chapter, we will assume that AWS IAM is the only authentication mechanism that users can use. However, we will also look at other ways to authenticate in Chapter 7, Ad Hoc Analytics.
Authorization: Is the user or principal provided permission to access a resource? When an action is requested against an AWS resource, the IAM credentials that are used are checked...

Encrypting your data and metadata in Glue Data Catalog

There are many ways a malicious person may be able to get access to your data. They may be able to listen on a network for traffic between two applications. They may be able to pull a hard drive from a machine, server, or dumpster. They may be able to gain access to an account that has access to the data they need. Regardless of how the bad actor obtains your data, you do not want them to read the data, and data encryption is how that is done. Data encryption takes your data, encodes it using an encryption key, and makes it impossible to read without the decryption key.

Encryption algorithms in which the encryption key and the decryption key are the same are called symmetric. Algorithms in which the two keys are different are called asymmetric.
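To make the symmetric case concrete, here is a minimal sketch using only the Python standard library. It is a toy construction (a SHA-256 based keystream XORed with the data), chosen only to show that one and the same key both encrypts and decrypts. It is not AES, it is not what S3 or KMS use, and it should not be used to protect real data. The function names and sample data are hypothetical.

```python
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from a key and nonce (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:length])


def xor_with_key(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice with the same key undoes it."""
    stream = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


key = secrets.token_bytes(32)    # the single shared (symmetric) key
nonce = secrets.token_bytes(16)  # per-message value; does not need to be secret
plaintext = b"customer_id,email\n42,jane@example.com\n"

ciphertext = xor_with_key(key, nonce, plaintext)   # encrypt
recovered = xor_with_key(key, nonce, ciphertext)   # decrypt with the same key
wrong_key = xor_with_key(secrets.token_bytes(32), nonce, ciphertext)  # different key
```

With an asymmetric algorithm, the last two calls would instead take two different but mathematically related keys, typically a public key to encrypt and a private key to decrypt.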

Let's look at how we can encrypt data on S3.

Encrypting your data

When your data is persisted somewhere, it should be encrypted. All the data that Athena...

Enabling coarse-grained access controls with IAM resource policies for data on S3

Coarse-grained access control (CGAC) is a term that does not have an industry-standard definition. Generally, in this book, when we refer to CGAC in the context of data lakes, we are referring to permissions at the object level, such as on individual files on S3. If a user has access to an object, they can access all the data within that file. Fine-grained access control (FGAC) provides authorization on data within the files, such as columns and rows. We will discuss FGAC in more detail in the next section.

Within AWS, there is one popular way to achieve CGAC with data on S3. That is through bucket policies that limit access to IAM principals. We will look at how to enable this in this section.

CGAC through S3 bucket policies

By default, access to S3 buckets is denied unless there are policies that grant access to them. Regarding a new IAM principal, either an IAM user or role, permissions must be provided...
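As an illustration of what such a grant can look like, the following sketch builds a bucket policy document that lets a single IAM role list and read objects under one prefix. The bucket name, account ID, role name, and prefix are all hypothetical placeholders, and the policy is a minimal example rather than the one used in this chapter's exercises.

```python
import json

# Hypothetical names, used only for illustration.
BUCKET = "example-data-lake-bucket"
ANALYST_ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleAnalystRole"


def read_only_bucket_policy(bucket: str, principal_arn: str, prefix: str) -> dict:
    """Build a bucket policy granting one principal list/read access to a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is an action on the bucket itself, narrowed by prefix.
                "Sid": "AllowListingThePrefix",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is an action on the objects under the prefix.
                "Sid": "AllowReadingObjectsUnderThePrefix",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = read_only_bucket_policy(BUCKET, ANALYST_ROLE_ARN, prefix="sales/")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON could be saved to a file and attached to a bucket, for example with `aws s3api put-bucket-policy`. Because the grant applies to whole objects, anyone holding this role can read every column and row in the files under `sales/`, which is exactly the coarse-grained behavior described above.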

Enabling FGACs with Lake Formation for data on S3

FGAC differs from coarse-grained access control by providing access control at a finer level than a file or directory. For example, FGAC may provide column filtering (setting permissions on individual columns), data masking (running the value of a column through some function that obscures its value), and row filtering (allowing users to see only the rows in a dataset that pertain to them).
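The three mechanisms can be sketched on a small in-memory dataset. This is only a conceptual illustration in plain Python; it does not reflect how Lake Formation or any other product enforces these controls, and the dataset, function names, and masking function are assumptions made for the example.

```python
import hashlib

# Hypothetical dataset standing in for the contents of a file in the data lake.
rows = [
    {"customer_id": 1, "region": "EU", "email": "ana@example.com", "total": 120.0},
    {"customer_id": 2, "region": "US", "email": "bob@example.com", "total": 75.5},
    {"customer_id": 3, "region": "EU", "email": "cy@example.com", "total": 310.0},
]


def mask(value: str) -> str:
    """Data masking: replace a value with a short, non-reversible token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_fgac(rows, allowed_columns, masked_columns, row_predicate):
    """Return only the rows and columns a user may see, masking where required."""
    result = []
    for row in rows:
        if not row_predicate(row):          # row filtering
            continue
        visible = {}
        for column in allowed_columns:      # column filtering
            value = row[column]
            if column in masked_columns:    # data masking
                value = mask(str(value))
            visible[column] = value
        result.append(visible)
    return result


# A hypothetical EU analyst: no access to "total", masked emails, EU rows only.
eu_analyst_view = apply_fgac(
    rows,
    allowed_columns=["customer_id", "region", "email"],
    masked_columns={"email"},
    row_predicate=lambda row: row["region"] == "EU",
)
```

Under coarse-grained access control, the same analyst would either see all three rows with every column or nothing at all.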

There are many open source and third-party applications that provide this access control level within the big data world. Examples of open source software include Apache Ranger and Apache Sentry. An example of a third-party application is Privacera. First-party integration is also available through AWS Lake Formation.

One of AWS Lake Formation's major components is providing FGACs to data within the data lake. Administrators can determine which users have access to which objects within Glue Data Catalog, such as tables, columns...

Auditing with CloudTrail and S3 access logs

Auditing is an essential part of designing a secure system. Auditing validates that existing access policies are working and, when there is a security incident, helps determine the impact of the incident and, hopefully, identify the bad actors. AWS has two native auditing mechanisms for data access that we will look at in detail: AWS CloudTrail and Amazon S3 access logs.

Auditing with AWS CloudTrail

AWS CloudTrail is a service that provides auditing capabilities for API calls that are made to all AWS services that support CloudTrail. When an AWS account is created, CloudTrail logging is enabled by default for management events. These are API calls that perform actions on AWS resources, such as creating or describing EC2 instances, creating S3 buckets, or submitting Athena queries. The other class of events is data events. These are AWS APIs that are called on a resource itself. At the time of writing, S3 calls to list, get, put, or delete operations and Lambda invocations...
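To show how the two classes of events can be told apart when reviewing logs, the following sketch groups CloudTrail records by their `eventCategory` field. The sample records are hypothetical and heavily trimmed (real records carry many more fields), and the helper name `summarize_events` is an assumption made for this example.

```python
import json
from collections import defaultdict

# Hypothetical, heavily trimmed CloudTrail log file contents.
SAMPLE_LOG = json.dumps({
    "Records": [
        {
            "eventTime": "2021-11-01T12:00:00Z",
            "eventSource": "athena.amazonaws.com",
            "eventName": "StartQueryExecution",
            "eventCategory": "Management",
            "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example-analyst"},
        },
        {
            "eventTime": "2021-11-01T12:00:05Z",
            "eventSource": "s3.amazonaws.com",
            "eventName": "GetObject",
            "eventCategory": "Data",
            "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example-analyst"},
        },
        {
            "eventTime": "2021-11-01T12:05:00Z",
            "eventSource": "s3.amazonaws.com",
            "eventName": "CreateBucket",
            "eventCategory": "Management",
            "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example-admin"},
        },
    ]
})


def summarize_events(log_text: str) -> dict:
    """Group (principal, service, API call) tuples by CloudTrail event category."""
    summary = defaultdict(list)
    for record in json.loads(log_text).get("Records", []):
        principal = record.get("userIdentity", {}).get("arn", "unknown")
        category = record.get("eventCategory", "Unknown")
        summary[category].append(
            (principal, record["eventSource"], record["eventName"])
        )
    return dict(summary)


summary = summarize_events(SAMPLE_LOG)
for category, calls in sorted(summary.items()):
    print(category, len(calls))
```

In this sample, submitting the Athena query and creating the bucket are recorded as management events, while reading the object from S3 is recorded as a data event.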

Summary

In this chapter, we have gone through some ways that we can protect data from malicious users. We know that no system can ever be 100% secure, but we can take some simple steps to avoid headaches in the future.

We looked at how encrypting your data early in projects can help save time and resources and how to encrypt data at rest and in transit. We looked at the difference between CGAC and FGAC as ways to implement authorization. Authorization on S3 can be done through S3 bucket policies and/or IAM user and role policies, which provide CGAC. Lastly, we looked at how auditing can be enabled and compared these approaches based on their cost and the information they can deliver.

In the next chapter, we will dive into Lake Formation, an AWS service that makes creating and administering a data lake easier and faster.

Anthony Virtuoso

Anthony Virtuoso works as a Principal Engineer at Amazon and holds multiple patents in distributed systems, software defined networks, and security. In his eight years at Amazon, he has helped launch several Amazon Web Services, the most recent of which was Amazon Managed Blockchain. As one of the original authors of Athena Query Federation, you'll often find him lurking on the Athena Federation GitHub repository answering questions and shipping bug fixes. When not at work, Anthony obsesses over a different set of customers, namely his wife and two little boys, aged 2 and 5. His kids enjoy doing science experiments with dad, like 3D printing toys, building with Lego, or searching the local pond for tardigrades.

Mert Turkay Hocanin

Mert Turkay Hocanin is a Principal Big Data Architect at Amazon Web Services within the AWS Glue and AWS Lake Formation services and has previously worked for several other services, including Amazon Athena, Amazon EMR, and Amazon Managed Blockchain. During his time at AWS, he worked with several Fortune 500 companies on some of the largest data lakes in the world and was involved with the launch of three Amazon Web Services. Prior to being a Big Data Architect, he was a Senior Software Developer within Amazon's retail systems organization, building one of the earliest data lakes in the company in 2013. When he is not helping customers build data lakes, he enjoys spending time with his wife, Subrina, and son, Tristan, and exploring New York City.

Aaron Wishnick

Aaron Wishnick works as a Senior Software Engineer at Amazon, where he has been for 7 years. During that time, he has worked on Amazon's payment systems and financial intelligence systems, as well as on Athena and AWS Proton at AWS. When not at work, Aaron and his fiancée, Alyssa, are on a quest to determine just how much dog fur is too much, with their husky and malamute, Mina and Wally.
