
AWS Services for Data Storage

AWS provides a wide range of services to store your data safely and securely, with various storage options such as block storage, file storage, and object storage. Managing on-premises data storage is expensive due to the high investment in hardware, administrative overhead, and system upgrades. With AWS storage services, you pay only for what you use, and you don't have to manage the hardware. In this chapter, you will also learn about the various storage classes offered by Amazon S3 for intelligent data access and cost reduction; you can expect exam questions on storage classes. As you continue through this chapter, you will master Single-AZ and Multi-AZ deployments of Amazon RDS, along with the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) concepts.

In this chapter, you will learn about storing your data securely for further analytical purposes throughout the following sections:

  • Storing data on Amazon...

Technical requirements

All you will need for this chapter is an AWS account and the AWS CLI configured. The steps to configure the AWS CLI for your account are explained in detail by Amazon here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.

You can download the code examples from GitHub, here: https://github.com/PacktPublishing/AWS-Certified-Machine-Learning-Specialty-MLS-C01-Certification-Guide-Second-Edition/tree/main/Chapter02.

Storing Data on Amazon S3

S3 is Amazon’s cloud-based object storage service, and it can be accessed from anywhere via the internet. It is an ideal storage option for large datasets. S3 is region-based: your data is stored in a particular region and never leaves that region unless you configure it to do so. Within a region, data is replicated across that region's Availability Zones, which makes S3 regionally resilient; if one Availability Zone fails, the other Availability Zones continue to serve your requests. S3 can be accessed via the AWS console UI, the AWS CLI, AWS API requests, or standard HTTP methods.
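For illustration, here is a minimal boto3 (Python) sketch of programmatic access that creates a bucket and uploads one object; the bucket name and file name are hypothetical placeholders, and bucket names must be globally unique:

import boto3

# Create an S3 client in a specific region (placeholder region).
s3 = boto3.client("s3", region_name="us-east-1")

# Create a bucket; this fails if the (globally unique) name is taken.
s3.create_bucket(Bucket="my-example-ml-bucket")

# Upload a local file; "datasets/train.csv" becomes the object's key
# and the file's content becomes its value.
s3.upload_file("train.csv", "my-example-ml-bucket", "datasets/train.csv")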

S3 has two main components: buckets and objects.

  • Buckets are created in a specific AWS region. Buckets can contain objects but cannot contain other buckets.
  • Objects have two main attributes: the key and the value. The value is the content being stored, and the...

Controlling access to buckets and objects on Amazon S3

Once an object is stored in a bucket, the next major step is to manage access to it. S3 is private by default, and access can be granted to other users, groups, or resources via several methods: Access Control Lists (ACLs), Public Access Settings, Identity Policies, and Bucket Policies.

Let’s look at some of these in detail.

S3 bucket policy

An S3 bucket policy is a resource policy that is attached to a bucket. Resource policies decide who can access a resource. They differ from identity policies in that identity policies can only be attached to identities inside an account, whereas resource policies can control access for identities from the same account or from different accounts. Resource policies can also control anonymous principals, which means an object can be made public through a resource policy. The following example policy allows everyone in the world to read the...
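For illustration, a public-read bucket policy of that kind can be expressed as a JSON document and attached with boto3; the following is a minimal sketch, and the bucket name is a hypothetical placeholder:

import json
import boto3

s3 = boto3.client("s3")

# A representative public-read policy: any principal ("*") may
# call s3:GetObject on objects inside the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-example-ml-bucket/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="my-example-ml-bucket", Policy=json.dumps(policy))

Note that the bucket's Block Public Access settings must permit public policies for this call to succeed; otherwise S3 rejects it.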

Protecting data on Amazon S3

Along with durability, Amazon provides several techniques to secure data in S3, such as enabling versioning and encrypting objects. In this section, you will learn how to record every version of an object.

Versioning helps you roll back to a previous version if any problem occurs with the current object during update, delete, or put operations.

Through encryption, you can control access to an object: you need the appropriate key to read and write it. You will also learn about Multi-Factor Authentication (MFA) for delete operations. Amazon also offers Cross-Region Replication (CRR) to maintain a copy of an object in another Region, which can be used for backup during a disaster, for further redundancy, or to speed up data access in other Regions.

Applying bucket versioning

Let’s now understand how you can enable bucket versioning with the help of some hands-on examples...
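As a minimal sketch of what such a hands-on example involves, versioning can be enabled with boto3 as follows (the bucket name is a hypothetical placeholder):

import boto3

s3 = boto3.client("s3")

# Enable versioning on the bucket. Once enabled, versioning can
# later be suspended, but it can never be fully disabled again.
s3.put_bucket_versioning(
    Bucket="my-example-ml-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Every subsequent PUT of the same key creates a new version.
versions = s3.list_object_versions(Bucket="my-example-ml-bucket")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])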

Securing S3 objects at rest and in transit

In the previous section, bucket default encryption was mentioned; it is completely different from object-level encryption, because buckets are not encrypted as a whole, whereas objects are. A question may arise here: what, then, is default bucket encryption? You will learn these concepts in this section. Data in transit can be protected by using Secure Sockets Layer (SSL) or Transport Layer Security (TLS) for HTTPS requests. The next step is to protect data at rest, so that only an authorized person can encode and decode it.

It is possible to have different encryption settings on different objects in the same bucket. S3 supports Client-Side Encryption (CSE) and Server-Side Encryption (SSE) for objects at rest:

  • CSE: The client encrypts the data before uploading it to S3 via the S3 endpoint. Although the transit between the user and the S3 endpoint happens in an encrypted channel...
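For the server side, here is a minimal boto3 sketch of both SSE options for a single object; the bucket, keys, object bodies, and the KMS key alias are hypothetical placeholders:

import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 encrypts the object at rest with S3-managed keys.
s3.put_object(
    Bucket="my-example-ml-bucket",
    Key="models/weights.bin",
    Body=b"model bytes",
    ServerSideEncryption="AES256",
)

# SSE-KMS: S3 encrypts the object with a KMS key you control, so
# reading it also requires kms:Decrypt permission on that key.
s3.put_object(
    Bucket="my-example-ml-bucket",
    Key="models/weights-kms.bin",
    Body=b"model bytes",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-example-key",
)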

Using other types of data stores

Amazon Elastic Block Store (EBS) is used to create volumes in an Availability Zone. A volume can only be attached to an EC2 instance in the same Availability Zone. Amazon EBS provides both Solid-State Drive (SSD) and Hard Disk Drive (HDD) volume types. For SSD-based volumes, the dominant performance attribute is I/O Operations Per Second (IOPS); for HDD-based volumes, it is throughput, generally measured in MiB/s. You can choose between different volume types, such as General Purpose SSD (gp2), Provisioned IOPS SSD (io1), or Throughput Optimized HDD (st1), depending on your requirements. Provisioned IOPS volumes are often used for high-performance workloads, such as deep learning training, where low latency and high throughput are critical. Table 2.1 provides an overview of the different volumes and types.
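For illustration, a Provisioned IOPS volume of the kind described above can be created with boto3 as follows; the Availability Zone, size, and IOPS figure are hypothetical values:

import boto3

ec2 = boto3.client("ec2")

# Create an io1 volume in a specific Availability Zone. The volume
# can only be attached to an EC2 instance in that same zone.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,          # GiB
    VolumeType="io1",
    Iops=5000,         # provisioned IOPS for the io1 volume
)
print(volume["VolumeId"])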

Relational Database Service (RDS)

This is one of the most commonly featured topics in AWS exams, so you should have sufficient knowledge of it prior to the exam. In this section, you will learn about Amazon RDS.

AWS provides several relational databases as a service to its users. Users can also run their desired database on EC2 instances, but this approach has significant drawbacks. The instance is only available in one Availability Zone of a Region, and it has to be administered and monitored to avoid failures. Custom scripts are required to maintain data backups over time, and any major or minor database version update results in downtime. Moreover, a database running on an EC2 instance cannot easily be scaled when its load increases, as replication is not an easy task.

RDS provides managed database instances that can themselves hold one or more databases. Imagine a database server running on an EC2 instance that you do not have to manage or maintain....

Managing failover in Amazon RDS

RDS instances can be Single-AZ or Multi-AZ. In Multi-AZ, multiple instances work together, similar to an active-passive failover design.

For a Single-AZ RDS instance, storage is allocated for that instance's exclusive use. In a nutshell, a Single-AZ RDS instance has one attached block store (EBS storage) in the same Availability Zone, which makes both the databases and the storage of the RDS instance vulnerable to Availability Zone failure. The allocated block storage can be SSD (gp2 or io1) or magnetic. To secure the RDS instance, it is advisable to use a security group and grant access based on requirements.

Multi-AZ is always the best way to design an architecture that prevents failures and keeps applications highly available. With the Multi-AZ feature, a standby replica is kept in sync with the primary instance through synchronous replication. The standby instance has its own storage in its assigned Availability Zone. A standby replica cannot be...
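For illustration, here is a minimal boto3 sketch that provisions a Multi-AZ instance; the identifiers, instance class, and credentials are hypothetical placeholders:

import boto3

rds = boto3.client("rds")

# With MultiAZ=True, RDS provisions a synchronous standby replica in
# another Availability Zone and fails over to it automatically if the
# primary's zone becomes unavailable.
rds.create_db_instance(
    DBInstanceIdentifier="ml-metadata-db",
    Engine="mysql",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=100,             # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    MultiAZ=True,
)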

Taking automatic backups, RDS snapshots, and restore and read replicas

In this section, you will see how RDS automatic backups and manual snapshots work; both features come with Amazon RDS.

Let’s consider a database that is scheduled to take a backup at 5 A.M. every day. If the application fails at 11 A.M., then it is possible to restart the application from the backup taken at 5 A.M., with the loss of 6 hours’ worth of data. This is called a 6-hour Recovery Point Objective (RPO). The RPO is defined as the time between the most recent backup and the incident, and it determines the amount of data loss. If you want to reduce it, you have to schedule more frequent incremental backups, which increases the cost. If your business demands a lower RPO value, then the business must spend more to provide the necessary technical solutions.

Now, according to our example, an engineer was assigned the task of bringing the system back online as soon as the...
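To make the snapshot mechanics concrete, here is a minimal boto3 sketch of taking a manual snapshot and restoring from it; all identifiers are hypothetical placeholders:

import boto3

rds = boto3.client("rds")

# Take a manual snapshot. Unlike automatic backups, manual snapshots
# are retained until you explicitly delete them.
rds.create_db_snapshot(
    DBInstanceIdentifier="ml-metadata-db",
    DBSnapshotIdentifier="ml-metadata-db-before-migration",
)

# Restoring always creates a brand-new instance; you cannot restore a
# snapshot in place over an existing instance.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="ml-metadata-db-restored",
    DBSnapshotIdentifier="ml-metadata-db-before-migration",
)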

Writing to Amazon Aurora with multi-master capabilities

Amazon Aurora is a reliable relational database engine developed by Amazon to deliver speed in a simple and cost-effective manner. An Aurora cluster consists of a single primary instance and zero or more replicas. Aurora's replicas give you the advantages of both read replicas and Multi-AZ instances in RDS. Aurora uses a shared cluster volume for storage, which is available to all compute instances of the cluster (up to a maximum of 64 TiB). This allows an Aurora cluster to provision faster and improves availability and performance. Aurora uses SSD-based storage, which provides high IOPS and low latency. Unlike other RDS instances, Aurora does not ask you to allocate storage in advance; you are billed based on the storage that you use.

Aurora clusters have multiple endpoints, including the cluster endpoint and reader endpoint. If there are zero replicas, then the cluster endpoint is the same as the reader endpoint. If there are replicas available...
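As an illustration of working with these endpoints, the following boto3 sketch looks them up for a cluster; the cluster identifier is a hypothetical placeholder:

import boto3

rds = boto3.client("rds")

# Fetch the cluster's endpoints: writes go to the cluster (writer)
# endpoint, while reads can be spread across replicas via the reader
# endpoint.
cluster = rds.describe_db_clusters(
    DBClusterIdentifier="ml-aurora-cluster"
)["DBClusters"][0]

print("Writer endpoint:", cluster["Endpoint"])
print("Reader endpoint:", cluster["ReaderEndpoint"])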

Storing columnar data on Amazon Redshift

Amazon Redshift is not used for real-time transactions; it is used for data warehousing. It is designed to support huge volumes of data at a petabyte scale. It is a column-based database used for analytics, long-term processing, trending, and aggregation. Redshift Spectrum can be used to query data on S3 without loading it into the Redshift cluster (a Redshift cluster is still required, though). Redshift is an OLAP (online analytical processing) database, not an OLTP (online transaction processing) one. It offers a SQL-like interface that allows you to query data over JDBC/ODBC connections, and Amazon QuickSight can be integrated with Redshift for visualization.

Redshift uses a clustered architecture in one AZ within a VPC, with fast network connectivity between the nodes. It is not highly available by design, as it is tightly coupled to a single AZ. A Redshift cluster has a leader node, which is responsible for all communication between the client and the compute nodes of the cluster, query planning...
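As one way to query Redshift programmatically (an alternative to the JDBC/ODBC connections mentioned above), here is a minimal sketch using the Redshift Data API via boto3; the cluster, database, user, and SQL are hypothetical placeholders:

import time
import boto3

client = boto3.client("redshift-data")

# Submit a SQL statement; the Data API is asynchronous, so the call
# returns immediately with a statement Id.
response = client.execute_statement(
    ClusterIdentifier="ml-analytics-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="SELECT product_id, SUM(amount) FROM sales GROUP BY product_id;",
)

# Poll until the statement reaches a terminal state, then fetch rows.
while client.describe_statement(Id=response["Id"])["Status"] not in (
    "FINISHED",
    "FAILED",
    "ABORTED",
):
    time.sleep(1)

result = client.get_statement_result(Id=response["Id"])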

Amazon DynamoDB for NoSQL Database-as-a-Service

Amazon DynamoDB is a NoSQL database-as-a-service product within AWS. It's a fully managed key/value and document database that is easily accessed via its endpoint. Read and write throughput can be provisioned manually or scaled automatically. DynamoDB also supports data backup, point-in-time recovery, and data encryption.

One example where Amazon DynamoDB can be used with Amazon SageMaker in a cost-efficient way is for real-time prediction applications. DynamoDB can serve as a storage backend for storing and retrieving input data for prediction models built using SageMaker. Instead of continuously running and scaling an inference endpoint, which can be costlier, you can leverage DynamoDB’s low-latency access and scalability to retrieve the required input data on demand.

In this setup, the input data for predictions can be stored in DynamoDB tables, where each item represents a unique data instance. When a prediction...
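A minimal boto3 sketch of this pattern might look as follows; the table name, key schema, and feature attributes are hypothetical placeholders:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("prediction-features")

# Store the input features for one entity; each item represents a
# unique data instance keyed by its ID.
table.put_item(
    Item={
        "customer_id": "c-1001",
        "features": {"age": 42, "tenure_months": 18, "avg_spend": 73},
    }
)

# At prediction time, fetch the features with a low-latency key lookup
# and pass them to a SageMaker endpoint or batch transform job.
item = table.get_item(Key={"customer_id": "c-1001"})["Item"]
features = item["features"]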

Summary

In this chapter, you learned about various data storage services from AWS, how to secure data through various policies, and how to use these services. If you are working on machine learning use cases, you may encounter scenarios where you have to choose an effective data storage service for your requirements.

In the next chapter, you will learn about the migration and processing of stored data.

Exam Readiness Drill – Chapter Review Questions

Apart from a solid understanding of key concepts, being able to think quickly under time pressure is a skill that will help you ace your certification exam. That is why working on these skills early on in your learning journey is key.

Chapter review questions are designed to progressively improve your test-taking skills with each chapter you complete, while also reviewing your understanding of the chapter's key concepts. You'll find these questions at the end of each chapter.

How To Access These Resources

To learn how to access these resources, head over to the chapter titled Chapter 11, Accessing the Online Practice Resources.

To open the Chapter Review Questions for this chapter, perform the following steps:

  1. Click the link – https://packt.link/MLSC01E2_CH02.

    Alternatively, you can scan the following QR code (Figure 2.3):

Figure 2.3 – QR code that opens Chapter Review Questions for logged-in users


Working On Timing

Target: Your aim is to keep your score the same while answering these questions as quickly as possible. Here's an example of what your next attempts should look like:


Attempt      Score    Time Taken
Attempt 5    77%      21 mins 30 seconds
Attempt 6    78%      18 mins 34 seconds
Attempt 7    76%      14 mins 44 seconds

Table 2.2 – Sample timing practice drills on the online platform

Note

The time limits shown in the above table are just examples. Set your own time limits with each attempt based on the time limit of the quiz on the website.

With each new attempt, your score should stay above 75% while your “time taken...
