You're reading from  AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781835082201
Edition: 2nd Edition
Authors (2):
Somanath Nanda

Somanath has 10 years of experience in the IT industry, spanning product development, DevOps, and end-to-end product design and architecture. He also worked at AWS as a Big Data Engineer for about two years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has influenced data strategy and led data teams in the urban logistics and blockchain industries.


AWS Services for Data Migration and Processing

In the previous chapter, you learned about several ways of storing data in AWS. In this chapter, you will explore techniques for using that data and gaining insight from it. In some use cases, you have to process your data or load it into a Hive data warehouse to query and analyze it. If you are on AWS and your data is in S3, you can create a Hive table on Amazon EMR to query it. To provide the same functionality as a managed service, AWS offers Athena, where you create a data catalog and query your data directly in S3. If you need to transform the data, AWS Glue is a good option for transforming it and writing the results back to S3. Imagine a use case where you need to stream data and create analytical reports on it. For this, you can opt for Amazon Kinesis Data Streams to stream the data and store it in S3. Using Glue, the same data can be copied to Redshift for further analytical...

Technical requirements

Creating ETL jobs on AWS Glue

A modern data pipeline has multiple stages, such as generating data, collecting data, storing data, performing ETL, analyzing, and visualizing. In this section, you will cover each of these at a high level and explore the extract, transform, load (ETL) process in depth:

  • Data can be generated from several devices, including mobile devices or IoT, weblogs, social media, transactional data, and online games.
  • This huge volume of generated data can be collected using polling services, through API Gateway integrated with AWS Lambda, or via streams such as Amazon Kinesis, AWS-managed Kafka (Amazon MSK), or Kinesis Data Firehose. If you have an on-premises database and want to bring that data to AWS, you would choose AWS DMS. You can sync your on-premises data to Amazon S3, Amazon EFS, or Amazon FSx via AWS DataSync. AWS Snowball is used to transfer large volumes of data into and out of AWS.
  • The next step involves storing...
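The ETL flow that these stages feed into can be illustrated with a minimal, pure-Python sketch (no Glue dependency; the field names and sample records are invented for illustration):

```python
import csv
import io
import json

def extract(raw_csv: str) -> list[dict]:
    """Extract: parse raw CSV records, e.g., collected web logs."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalize types and drop incomplete records."""
    out = []
    for row in rows:
        if not row.get("user_id"):
            continue  # discard records missing a key field
        out.append({"user_id": row["user_id"],
                    "amount": float(row["amount"])})
    return out

def load(rows: list[dict]) -> str:
    """Load: serialize to JSON Lines, the shape you might write back to S3."""
    return "\n".join(json.dumps(r) for r in rows)

raw = "user_id,amount\nu1,10.5\n,3.2\nu2,7.0"
print(load(transform(extract(raw))))
```

In a real Glue job, the extract and load steps would read from and write to S3 via the Glue Data Catalog, but the three-phase shape is the same.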

Querying S3 data using Athena

Athena is a serverless service designed for querying data stored in S3. It is serverless because you don't manage the servers that perform the computation:

  • Athena uses a schema to present the results of a query against data stored in S3. You define the structure you want your data to take in the form of a schema, and Athena reads the raw data from S3 and presents the results according to that schema.
  • The output can be used by other services for visualization, storage, or various analytics purposes. The source data in S3 can be in any of the following structured, semi-structured, or unstructured data formats: XML, JSON, CSV/TSV, AVRO, Parquet, or ORC (as well as others). CloudTrail, ELB logs, and VPC flow logs can also be stored in S3 and analyzed by Athena.
  • This follows the schema-on-read technique. Tables are defined in advance in a data catalog, and, unlike traditional schema-on-write techniques, the data’s structure...
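Schema-on-read can be illustrated in a few lines of Python: the raw data sits untouched, and the schema is projected onto it only at query time (the schema and records below are invented for illustration):

```python
import json

# Raw, untyped records as they might sit in S3 (JSON Lines).
raw_lines = [
    '{"ts": "2024-01-01", "bytes": "512", "path": "/home"}',
    '{"ts": "2024-01-02", "bytes": "2048", "path": "/login"}',
]

# The schema lives separately, like a table definition in a data catalog:
# column name -> Python type to project onto the raw values.
schema = {"ts": str, "bytes": int, "path": str}

def read_with_schema(lines, schema):
    """Apply the schema at read time; the stored data is never rewritten."""
    for line in lines:
        record = json.loads(line)
        yield {col: cast(record[col]) for col, cast in schema.items()}

rows = list(read_with_schema(raw_lines, schema))
# A simple "query" over the projected rows:
total_bytes = sum(r["bytes"] for r in rows)
print(total_bytes)  # 2560
```

Changing the schema changes how the same bytes are interpreted, which is exactly what makes schema-on-read flexible for data already in S3.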

Processing real-time data using Kinesis Data Streams

Kinesis is Amazon’s streaming service and scales based on your requirements. It provides a level of persistence, retaining data for 24 hours by default or, optionally, for up to 365 days. Kinesis Data Streams is used for large-scale data ingestion, analytics, and monitoring:

  • A Kinesis stream can be written to by multiple producers, and multiple consumers can read data from it. For example, suppose a producer ingests data into a Kinesis stream with the default retention period of 24 hours: data ingested at 05:00:00 A.M. today will be available in the stream until 04:59:59 A.M. tomorrow. The data won’t be available beyond that point, so ideally it should be consumed before it expires; otherwise, if it’s critical, it should be stored somewhere else. The retention period can be extended to a maximum of 365 days, at an extra cost.
  • Kinesis can...
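The retention window from the example above can be computed directly (a sketch; the timestamps are the ones used in the text):

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(hours=24)   # Kinesis default
MAX_RETENTION = timedelta(days=365)       # maximum, at extra cost

ingested_at = datetime(2024, 3, 1, 5, 0, 0)  # 05:00:00 A.M. "today"

# A record expires exactly one retention period after ingestion,
# so the last instant it is readable is one second before that.
expires_at = ingested_at + DEFAULT_RETENTION
last_readable = expires_at - timedelta(seconds=1)

print(last_readable)  # 2024-03-02 04:59:59
```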

Storing and transforming real-time data using Kinesis Data Firehose

Many use cases require data to be streamed and stored for future analytics. One solution is to write a Kinesis consumer that reads the stream and stores the data in S3. This approach needs an instance or machine to run the code, with the required access to read from the stream and write to S3. The other option is to run a Lambda function that is triggered when records are put into the stream (via the PutRecord or PutRecords APIs) and that reads the data from the stream and stores it in the S3 bucket:

  • To make this easier, Amazon provides a separate service called Kinesis Data Firehose. It can easily be plugged into a Kinesis data stream and requires the appropriate IAM roles to write data into S3. It is a fully managed service that removes the burden of managing servers and code. It also supports loading the streamed data into Amazon Redshift, Elasticsearch, and Splunk. Kinesis Data...
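A hand-rolled consumer of the kind Firehose replaces might look like the Lambda-style handler below. Kinesis delivers record payloads base64-encoded; note that the event here is a stub in the shape Lambda receives from a Kinesis trigger, and the actual S3 write is left as a comment since the bucket and key names would be your own:

```python
import base64

def handler(event, context=None):
    """Decode Kinesis records and assemble a JSON Lines payload for S3."""
    lines = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        lines.append(payload)
    body = "\n".join(lines)
    # In a real function you would now write `body` to S3, e.g.:
    # boto3.client("s3").put_object(Bucket="my-bucket",
    #                               Key="stream/batch.jsonl", Body=body)
    return body

# Stubbed event, base64-encoded as Kinesis would deliver it.
event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"id": 1}').decode()}},
    {"kinesis": {"data": base64.b64encode(b'{"id": 2}').decode()}},
]}
print(handler(event))
```

Firehose takes over exactly this plumbing: decoding, batching, and delivery to S3, with no handler code to maintain.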

Different ways of ingesting data from on-premises into AWS

With the increasing demand for data-driven use cases, managing data on on-premises servers is difficult. Taking backups is not easy when you deal with huge volumes of data. The data in data lakes is used to build deep neural networks, create data warehouses that extract meaningful information, run analytics, and generate reports.

Now, if you look at the available options for migrating data into AWS, they come with various challenges too. For example, if you want to send data to S3, you have to write a few lines of code, and then manage that code and the servers that run it. You have to ensure the data travels over HTTPS, and you need to verify that each transfer succeeded. This adds complexity, time, and effort to the process. To avoid such scenarios, AWS provides services to match or solve your use cases by designing...
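The "few lines of code" and the verification burden look roughly like the sketch below. The MD5/ETag comparison shows the kind of transfer check you would have to own yourself; for single-part S3 uploads, the returned ETag is the MD5 of the object. The boto3 call is indicative only (the bucket and key are invented), so it is kept inside a function that is not invoked here:

```python
import hashlib

def content_md5(data: bytes) -> str:
    """MD5 hex digest; for single-part S3 uploads this matches the ETag."""
    return hashlib.md5(data).hexdigest()

def upload_and_verify(data: bytes, bucket: str, key: str) -> bool:
    """Indicative only: upload over HTTPS, then check the returned ETag."""
    import boto3  # kept local so the sketch runs without AWS installed
    etag = boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=data)["ETag"].strip('"')
    return etag == content_md5(data)

# The verification logic itself, shown against a simulated ETag:
data = b"some on-premises data"
simulated_etag = content_md5(data)
print(simulated_etag == content_md5(data))  # True
```

Services such as DataSync and DMS exist precisely so you don't have to build and babysit this kind of transfer-and-verify loop.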

Processing stored data on AWS

There are several services for processing data stored in AWS. In this section, you will learn about AWS Batch and Amazon Elastic MapReduce (EMR). EMR is an AWS product that primarily runs MapReduce jobs and Spark applications in a managed way. AWS Batch is used for long-running, compute-heavy workloads.

AWS EMR

EMR is a managed implementation of Apache Hadoop provided as a service by AWS. It includes other components of the Hadoop ecosystem, such as Spark, HBase, Flink, Presto, Hive, and Pig. You will not need to learn about these in detail for the certification exam, but here’s some information about EMR:

  • EMR clusters can be launched from the AWS console or via the AWS CLI with a specific number of nodes. A cluster can be long-running or ad hoc. With a long-running traditional cluster, you have to configure the machines and manage them yourself. If you have jobs that need to be executed faster, then you need...
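The kind of job EMR runs can be sketched as an in-memory MapReduce word count (pure Python, no Hadoop; the map, shuffle, and reduce phases mirror what the cluster distributes across its nodes):

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: emit (word, 1) pairs for each word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data on EMR", "big jobs on big clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

On a real cluster, the map and reduce phases run in parallel on different nodes and the shuffle moves data between them over the network; that distribution is what EMR manages for you.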

Summary

In this chapter, you learned about different ways of processing data in AWS. You also learned the capabilities in terms of extending your data centers to AWS, migrating data to AWS, and the ingestion process. You learned about the various ways of using data to process it and make it ready for analysis. You understood the magic of using a data catalog, which helps you to query your data via AWS Glue and Athena.

In the next chapter, you will learn about various machine learning algorithms and their usage.

Exam Readiness Drill – Chapter Review Questions

Apart from a solid understanding of key concepts, being able to think quickly under time pressure is a skill that will help you ace your certification exam. That is why working on these skills early on in your learning journey is key.

Chapter review questions are designed to improve your test-taking skills progressively with each chapter you learn and review your understanding of key concepts in the chapter at the same time. You’ll find these at the end of each chapter.

How To Access These Resources

To learn how to access these resources, head over to the chapter titled Chapter 11, Accessing the Online Practice Resources.

To open the Chapter Review Questions for this chapter, perform the following steps:

  1. Click the link – https://packt.link/MLSC01E2_CH03.

    Alternatively, you can scan the following QR code (Figure 3.9):

Figure 3.9 – QR code that opens Chapter Review Questions for logged-in users


Working On Timing

Target: Your aim is to keep the score the same while answering these questions as quickly as possible. Here’s an example of how your next attempts should look:

Attempt     Score   Time Taken
Attempt 5   77%     21 mins 30 seconds
Attempt 6   78%     18 mins 34 seconds
Attempt 7   76%     14 mins 44 seconds

Table 3.1 – Sample timing practice drills on the online platform

Note

The time limits shown in the above table are just examples. Set your own time limits with each attempt based on the time limit of the quiz on the website.

With each new attempt, your score should stay above 75% while your “time taken...

