Chapter 5: AWS Services for Data Storing

AWS provides a wide range of services to store your data safely and securely, with options including block storage, file storage, and object storage. Managing on-premises data storage is expensive due to the upfront investment in hardware, the administrative overhead, and the burden of system upgrades. With AWS storage services, you pay only for what you use and do not have to manage the hardware. We will also learn about the storage classes offered by Amazon S3, which tier data access intelligently to reduce costs; you can expect exam questions on storage classes. As we continue through this chapter, we will master single-AZ and multi-AZ instances, as well as the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) concepts of Amazon RDS.

In this chapter, we will learn about storing our data securely for further analytics purposes by means of the following sections:

  • Storing data on Amazon S3
  • Controlling access to buckets and objects on Amazon S3
  • Protecting data on Amazon S3
  • Securing S3 objects at rest and in transit
  • Using other types of data stores
  • Relational Database Service (RDS)
  • Managing failover in Amazon RDS
  • Taking automatic backup, RDS snapshots, and restore and read replicas
  • Writing to Amazon Aurora with multi-master capabilities
  • Storing columnar data on Amazon Redshift
  • Amazon DynamoDB for NoSQL database as a service

Technical requirements

All you will need for this chapter is an AWS account and the AWS CLI configured. The steps to configure the AWS CLI for your account are explained in detail by Amazon here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.
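For reference, here is a minimal sketch of the configuration flow (the values shown in the comments are placeholders you supply yourself):

```bash
# Configure credentials and a default region interactively.
aws configure
#   AWS Access Key ID [None]: <your access key>
#   AWS Secret Access Key [None]: <your secret key>
#   Default region name [None]: us-east-1
#   Default output format [None]: json

# Verify that the CLI can reach your account.
aws sts get-caller-identity
```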

You can download the code examples from GitHub, here: https://github.com/PacktPublishing/AWS-Certified-Machine-Learning-Specialty-MLS-C01-Certification-Guide/tree/master/Chapter-5/.

Storing data on Amazon S3

S3 is Amazon's cloud-based object storage service, accessible from anywhere via the internet, and it is an ideal storage option for large datasets. It is region-based: your data is stored in a particular region and never leaves that region unless you move it or configure cross-region replication. Within a region, data is replicated across that region's availability zones, which makes S3 regionally resilient: if one availability zone fails, the other availability zones continue to serve your requests. S3 can be accessed via the AWS Console UI, the AWS CLI, AWS API requests, or standard HTTP methods.
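As a quick, hedged sketch of CLI access (the bucket name my-example-bucket is a placeholder and must be globally unique):

```bash
# Create a bucket pinned to a specific region.
aws s3 mb s3://my-example-bucket --region us-east-1

# Upload a local file as an object and list the bucket's contents.
aws s3 cp ./train.csv s3://my-example-bucket/datasets/train.csv
aws s3 ls s3://my-example-bucket --recursive
```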

S3 has two main components: buckets and objects.

  • Buckets are created in a specific AWS region. Buckets can contain objects, but cannot contain other buckets.
  • Objects have two main attributes: a key and a value. The value is the content being stored, and the key is the object's name. The...

Controlling access to buckets and objects on Amazon S3

Once an object is stored in a bucket, the next major step is to manage access to it. S3 is private by default; access is granted to other users, groups, or resources through several mechanisms. This means access to objects can be managed via Access Control Lists (ACLs), public access settings, identity policies, and bucket policies.

Let's look at some of these in detail.

S3 bucket policy

An S3 bucket policy is a resource policy attached to a bucket; resource policies decide who can access that resource. It differs from an identity policy in that identity policies can only be attached to identities inside an account, whereas resource policies can control access for identities from the same account or from different accounts. Resource policies also cover anonymous principals, which means an object can be made public through a resource policy. The following sample policy allows everyone in the world to read the bucket...
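The policy listing is truncated in this excerpt; the snippet below is a minimal sketch of what such a public-read policy typically looks like and how it is attached with the CLI (my-example-bucket is a placeholder):

```bash
# A resource policy granting every principal ("*") read access to all objects.
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-example-bucket/*"
    }
  ]
}
EOF

# Attach the policy to the bucket.
aws s3api put-bucket-policy --bucket my-example-bucket --policy file://policy.json
```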

Protecting data on Amazon S3

In this section, we will learn how to record every version of an object. Along with durability, Amazon provides several techniques to secure data in S3, including versioning and object encryption.

Versioning lets you roll back to a previous version if anything goes wrong with the current object during put, update, or delete operations.

Through encryption, you can control access to an object: the appropriate key is required to read and write it. We will also learn about Multi-Factor Authentication (MFA) for delete operations. Amazon also offers Cross-Region Replication (CRR) to maintain a copy of an object in another region, which can be used as a backup during a disaster, for further redundancy, or to speed up data access in other regions.

Applying bucket versioning

Let's now understand how we can enable bucket versioning with the help of some hands-on examples...
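The hands-on walkthrough is truncated in this excerpt; as a minimal sketch, versioning can be enabled from the CLI as follows (my-example-bucket is a placeholder):

```bash
# Turn on versioning; every subsequent PUT of an existing key creates a new version.
aws s3api put-bucket-versioning \
    --bucket my-example-bucket \
    --versioning-configuration Status=Enabled

# Inspect all versions, including delete markers.
aws s3api list-object-versions --bucket my-example-bucket
```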

Securing S3 objects at rest and in transit

Bucket default encryption is completely different from object-level encryption: buckets themselves are not encrypted, whereas objects are. A question may arise here: what, then, is bucket default encryption? We will clarify these concepts in this section. Data in transit can be protected by using Secure Sockets Layer (SSL) or Transport Layer Security (TLS) for HTTPS requests. The next step is to protect data at rest, so that only an authorized person can encode and decode it.

It is possible to have different encryption settings on different objects in the same bucket. S3 supports Client-Side Encryption (CSE) and Server-Side Encryption (SSE) for objects at rest, as the sketch after this list illustrates for the server-side case:

  • Client-Side Encryption (CSE): The client uploads objects to S3 via the S3 endpoint, but the data is encrypted by the client before it is uploaded. Although the transit between the user and the S3 endpoint happens in an...
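As a hedged illustration of the server-side case, the CLI can request SSE-S3 (AES-256) per object or set it as the bucket default (names and file paths are placeholders):

```bash
# Upload one object with SSE-S3 server-side encryption.
aws s3 cp ./data.csv s3://my-example-bucket/data.csv --sse AES256

# Make SSE-S3 the bucket-wide default for newly written objects.
aws s3api put-bucket-encryption \
    --bucket my-example-bucket \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
```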

Using other types of data stores

Elastic Block Store (EBS) is used to create volumes in an availability zone. A volume can only be attached to an EC2 instance in the same availability zone. Amazon EBS provides both SSD (solid-state drive) and HDD (hard disk drive) volume types. For SSD-based volumes, the dominant performance attribute is IOPS (input/output operations per second); for HDD, it is throughput, generally measured in MiB/s. The table in Figure 5.3 provides an overview of the different volumes and their types:

Figure 5.3 – Different volumes and their use cases

EBS resilience is limited to a single availability zone (AZ): if, for some reason, that AZ fails, the volume cannot be accessed. To mitigate such scenarios, snapshots can be created from EBS volumes; snapshots are stored in S3. Once a snapshot arrives in S3, its data becomes region-resilient. The first snapshot is a full copy of the data on the volume and, from...
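A minimal sketch of the snapshot workflow (the volume and snapshot IDs are placeholders):

```bash
# Create a point-in-time snapshot of an EBS volume; it is stored in S3.
aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "pre-upgrade backup"

# Restore later by creating a new volume from the snapshot in any AZ of the region.
aws ec2 create-volume \
    --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1a
```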

Relational Database Service (RDS)

This is one of the most frequently featured topics in AWS exams, so you should know it well before exam day. In this section, we will learn about Amazon RDS.

AWS provides several relational databases as a service to its users. Users can also run their desired database on EC2 instances, but this approach has significant drawbacks. The instance is only available in one availability zone of a region. The EC2 instance has to be administered and monitored to avoid failures. Custom scripts are required to maintain data backups over time. Any major or minor database version upgrade results in downtime. And a database running on an EC2 instance cannot easily be scaled as load increases, because replication is not a simple task.

RDS provides managed database instances that can themselves hold one or more databases. Imagine a database server running on an EC2 instance that you do not have to manage or maintain...

Managing failover in Amazon RDS

RDS instances can be single-AZ or multi-AZ. In multi-AZ, multiple instances work together, similar to an active-passive failover design.

A single-AZ RDS instance has storage allocated for its own use; in a nutshell, it has one attached block store (EBS storage) in the same availability zone. This makes both the databases and the storage of the RDS instance vulnerable to availability zone failure. The allocated block storage can be SSD (gp2 or io1) or magnetic. To secure the RDS instance, it is advisable to use a security group and grant access only as required.

Multi-AZ is the best way to design an architecture that withstands failures and keeps applications highly available. With the multi-AZ feature, a standby replica is kept synchronized with the primary instance via synchronous replication. The standby instance has its own storage in its assigned availability zone. A standby replica cannot...
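As a hedged sketch, a multi-AZ instance is requested with a single flag at creation time (identifiers and the password are placeholders):

```bash
# Create a MySQL RDS instance with a synchronous standby in a second AZ.
aws rds create-db-instance \
    --db-instance-identifier my-example-db \
    --engine mysql \
    --db-instance-class db.t3.medium \
    --allocated-storage 20 \
    --master-username admin \
    --master-user-password 'REPLACE_ME' \
    --multi-az
```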

Taking automatic backup, RDS snapshots, and restore and read replicas

In this section, we will see how RDS automatic backups and manual snapshots work; both features come with Amazon RDS.

Let's consider a database that is scheduled to take a backup at 5 A.M. every day. If the application fails at 11 A.M., then it is possible to restart the application from the backup taken at 5 A.M., losing 6 hours' worth of data. This is called a 6-hour RPO (Recovery Point Objective). RPO is defined as the time between the most recent backup and the incident, and it determines the amount of data loss. If you want to reduce it, you have to schedule more frequent incremental backups, which increases the cost. If your business demands a lower RPO value, then it must spend more on the technical solutions that meet it.

Now, continuing our example, suppose an engineer is assigned the task of bringing the system back online as soon as the disaster occurs. The engineer...
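A minimal sketch of the manual snapshot and restore flow discussed here (identifiers are placeholders):

```bash
# Take a manual snapshot of the instance.
aws rds create-db-snapshot \
    --db-instance-identifier my-example-db \
    --db-snapshot-identifier my-example-db-snap-2021-03-01

# Restore the snapshot into a brand-new instance.
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier my-example-db-restored \
    --db-snapshot-identifier my-example-db-snap-2021-03-01
```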

Writing to Amazon Aurora with multi-master capabilities

Amazon Aurora is a highly reliable relational database engine developed by Amazon to deliver speed in a simple, cost-effective manner. An Aurora cluster consists of a single primary instance and zero or more replicas. Aurora replicas can give you the advantages of both read replicas and multi-AZ instances in RDS. For storage, Aurora uses a shared cluster volume that is available to all compute instances of the cluster (up to a maximum of 64 TiB). This allows an Aurora cluster to provision faster and improves availability and performance. Aurora uses SSD-based storage, which provides high IOPS and low latency. Unlike other RDS instances, Aurora does not ask you to allocate storage up front; you are billed for the storage you actually use.

Aurora clusters have multiple endpoints, including Cluster Endpoint and Reader Endpoint. If there are zero replicas, then the cluster endpoint is the same as the reader endpoint. If there are replicas available...
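As a hedged sketch, an Aurora cluster and its first (writer) instance are created separately (identifiers and the password are placeholders):

```bash
# Create the Aurora MySQL cluster (the shared cluster volume comes with it).
aws rds create-db-cluster \
    --db-cluster-identifier my-example-aurora \
    --engine aurora-mysql \
    --master-username admin \
    --master-user-password 'REPLACE_ME'

# Add the primary (writer) compute instance to the cluster.
aws rds create-db-instance \
    --db-instance-identifier my-example-aurora-writer \
    --db-cluster-identifier my-example-aurora \
    --engine aurora-mysql \
    --db-instance-class db.r5.large
```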

Storing columnar data on Amazon Redshift

Amazon Redshift is not used for real-time transactions; it is used for data warehousing. It is designed to support huge volumes of data, at petabyte scale. It is a column-based database used for analytics, long-term processing, trending, and aggregation. Redshift Spectrum can be used to query data on S3 without loading it into the Redshift cluster (a Redshift cluster is required, though). Redshift is an OLAP (online analytical processing) system, not an OLTP (online transaction processing) one. Amazon QuickSight can be integrated with Redshift for visualization, and Redshift's SQL-like interface allows you to connect via JDBC/ODBC connections to query the data.

Redshift uses a clustered architecture in a single AZ within a VPC, with fast network connectivity between the nodes. It is not highly available by design, as it is tightly coupled to its AZ. A Redshift cluster has a leader node, which is responsible for all the communication between the client and the compute nodes of the cluster...
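A minimal sketch of provisioning a small cluster (all values are placeholders; node types and sizing depend on your workload):

```bash
# Launch a two-node Redshift cluster; the leader node is created automatically.
aws redshift create-cluster \
    --cluster-identifier my-example-redshift \
    --node-type dc2.large \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'REPLACE_ME' \
    --db-name analytics
```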

Amazon DynamoDB for NoSQL database as a service

Amazon DynamoDB is a NoSQL database-as-a-service product within AWS: a fully managed key/value and document database. Accessing DynamoDB is easy via its endpoint, and read and write throughput can be scaled manually or automatically. It also supports data backup, point-in-time recovery, and data encryption. We will not cover DynamoDB's table or key structure in this chapter, as this is not required for the certification exam; however, it is good to have a basic knowledge of them. For more details, please refer to the AWS docs available here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SQLtoNoSQL.html.
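As a hedged sketch of the key/value model (the table name and key attribute are illustrative):

```bash
# Create a table keyed on a single string attribute, with on-demand capacity.
aws dynamodb create-table \
    --table-name ExampleUsers \
    --attribute-definitions AttributeName=UserId,AttributeType=S \
    --key-schema AttributeName=UserId,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
```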

Summary

In this chapter, we learned about Amazon's various data storage services, how to secure data through various policies, and how to use these services. If you are working on machine learning use cases, you will encounter scenarios where you have to choose the most effective data storage service for your requirements.

In the next chapter, we will learn about the processing of stored data.

Questions

  1. To set the region that an S3 bucket is stored in, you must first create the bucket and then set the region separately.

    A. True

    B. False

  2. Is it mandatory to have both the source bucket and destination bucket in the same region in order to copy the contents of S3 buckets?

    A. True

    B. False

  3. By default, objects are private in a bucket.

    A. True

    B. False

  4. In AWS, an S3 object is immutable, and you can only perform put and delete operations; a rename is a GET and a PUT of the same object with a different name.

    A. True

    B. False

  5. If a user has stored an unversioned object and a versioned object...