You're reading from  AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781835082201
Edition: 2nd Edition
Authors (2):
Somanath Nanda

Somanath has 10 years of experience in the IT industry, spanning product development, DevOps, and end-to-end product design and architecture. He also worked at AWS as a Big Data Engineer for about two years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has influenced data strategy and led data teams in the urban logistics and blockchain industries.


AWS Services for Data Migration and Processing

In the previous chapter, you learned about several ways of storing data in AWS. In this chapter, you will explore techniques for using that data and gaining insight from it. In some use cases, you have to process your data or load it into a Hive data warehouse to query and analyze it. If you are on AWS and your data is in S3, you can create a Hive table on Amazon EMR to query it. To provide the same functionality as a managed service, AWS offers Athena, where you create a data catalog and query your data directly in S3. If you need to transform the data, AWS Glue is a good option for transforming it and writing the results back to S3. Imagine a use case where you need to stream data and create analytical reports on it. For this, you can opt for Amazon Kinesis Data Streams to stream the data and store it in S3. Using Glue, the same data can be copied to Redshift for further analytical...

Technical requirements

Creating ETL jobs on AWS Glue

A modern data pipeline has multiple stages, such as generating data, collecting data, storing data, performing ETL, analyzing, and visualizing. In this section, you will cover each of these at a high level and explore the extract, transform, load (ETL) process in depth:

  • Data can be generated from several devices, including mobile devices or IoT, weblogs, social media, transactional data, and online games.
  • This huge volume of generated data can be collected using polling services, through API Gateway integrated with AWS Lambda, or via streams such as Amazon Kinesis, AWS-managed Kafka (Amazon MSK), or Kinesis Data Firehose. If you have an on-premises database and want to bring that data to AWS, you would choose AWS DMS. You can sync your on-premises data to Amazon S3, Amazon EFS, or Amazon FSx via AWS DataSync. AWS Snowball is used to transfer large volumes of data into and out of AWS.
  • The next step involves storing...
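The ETL flow that these stages feed into can be illustrated with a minimal, pure-Python sketch (no Glue dependency; the field names and sample records are invented for illustration):

```python
import csv
import io
import json

def extract(raw_csv: str) -> list[dict]:
    """Extract: parse raw CSV records, e.g., collected web logs."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalize types and drop incomplete records."""
    out = []
    for row in rows:
        if not row.get("user_id"):
            continue  # discard records missing a key field
        out.append({"user_id": row["user_id"],
                    "amount": float(row["amount"])})
    return out

def load(rows: list[dict]) -> str:
    """Load: serialize to JSON Lines, the shape you might write back to S3."""
    return "\n".join(json.dumps(r) for r in rows)

raw = "user_id,amount\nu1,10.5\n,3.2\nu2,7.0"
print(load(transform(extract(raw))))
```

In a real Glue job, the extract and load steps would read from and write to S3 via the Glue Data Catalog, but the three-phase shape is the same.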

Querying S3 data using Athena

Athena is a serverless service designed for querying data stored in S3. It is serverless because you don't manage the servers that perform the computation:

  • Athena uses a schema to present the results of a query against data stored in S3. You define the structure you want your data to take in the form of a schema, and Athena reads the raw data from S3 and presents the results according to that schema.
  • The output can be used by other services for visualization, storage, or various analytics purposes. The source data in S3 can be in any of the following structured, semi-structured, or unstructured data formats: XML, JSON, CSV/TSV, AVRO, Parquet, or ORC (as well as others). CloudTrail, ELB logs, and VPC flow logs can also be stored in S3 and analyzed by Athena.
  • This follows the schema-on-read technique. Tables are defined in advance in a data catalog, and, unlike traditional schema-on-write techniques, the data’s structure...
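Schema-on-read can be illustrated in a few lines of Python: the raw data sits untouched, and the schema is projected onto it only at query time (the schema and records below are invented for illustration):

```python
import json

# Raw, untyped records as they might sit in S3 (JSON Lines).
raw_lines = [
    '{"ts": "2024-01-01", "bytes": "512", "path": "/home"}',
    '{"ts": "2024-01-02", "bytes": "2048", "path": "/login"}',
]

# The schema lives separately, like a table definition in a data catalog:
# column name -> Python type to project onto the raw values.
schema = {"ts": str, "bytes": int, "path": str}

def read_with_schema(lines, schema):
    """Apply the schema at read time; the stored data is never rewritten."""
    for line in lines:
        record = json.loads(line)
        yield {col: cast(record[col]) for col, cast in schema.items()}

rows = list(read_with_schema(raw_lines, schema))
# A simple "query" over the projected rows:
total_bytes = sum(r["bytes"] for r in rows)
print(total_bytes)  # 2560
```

Changing the schema changes how the same bytes are interpreted, which is exactly what makes schema-on-read flexible for data already in S3.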

Processing real-time data using Kinesis Data Streams

Kinesis is Amazon’s streaming service and scales based on your requirements. It provides a level of persistence, retaining data for 24 hours by default or, optionally, for up to 365 days. Kinesis Data Streams is used for large-scale data ingestion, analytics, and monitoring:

  • A Kinesis stream can be written to by multiple producers, and multiple consumers can read data from it. For example, suppose a producer ingests data into a Kinesis stream with the default retention period of 24 hours: data ingested at 05:00:00 A.M. today will be available in the stream until 04:59:59 A.M. tomorrow. The data won’t be available beyond that point, so ideally it should be consumed before it expires; otherwise, if it’s critical, it should be stored somewhere else. The retention period can be extended to a maximum of 365 days, at an extra cost.
  • Kinesis can...
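The retention window from the example above can be computed directly (a sketch; the timestamps are the ones used in the text):

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(hours=24)   # Kinesis default
MAX_RETENTION = timedelta(days=365)       # maximum, at extra cost

ingested_at = datetime(2024, 3, 1, 5, 0, 0)  # 05:00:00 A.M. "today"

# A record expires exactly one retention period after ingestion,
# so the last instant it is readable is one second before that.
expires_at = ingested_at + DEFAULT_RETENTION
last_readable = expires_at - timedelta(seconds=1)

print(last_readable)  # 2024-03-02 04:59:59
```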

Storing and transforming real-time data using Kinesis Data Firehose

Many use cases require data to be streamed and stored for future analytics. One solution is to write a Kinesis consumer that reads the stream and stores the data in S3. This approach needs an instance or machine to run the code, with the required access to read from the stream and write to S3. The other option is to run a Lambda function that is triggered when records are put into the stream (via the PutRecord or PutRecords APIs) and that reads the data from the stream and stores it in the S3 bucket:

  • To make this easier, Amazon provides a separate service called Kinesis Data Firehose. It can easily be plugged into a Kinesis data stream and requires the appropriate IAM roles to write data into S3. It is a fully managed service that removes the burden of managing servers and code. It also supports loading the streamed data into Amazon Redshift, Elasticsearch, and Splunk. Kinesis Data...
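A hand-rolled consumer of the kind Firehose replaces might look like the Lambda-style handler below. Kinesis delivers record payloads base64-encoded; note that the event here is a stub in the shape Lambda receives from a Kinesis trigger, and the actual S3 write is left as a comment since the bucket and key names would be your own:

```python
import base64

def handler(event, context=None):
    """Decode Kinesis records and assemble a JSON Lines payload for S3."""
    lines = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        lines.append(payload)
    body = "\n".join(lines)
    # In a real function you would now write `body` to S3, e.g.:
    # boto3.client("s3").put_object(Bucket="my-bucket",
    #                               Key="stream/batch.jsonl", Body=body)
    return body

# Stubbed event, base64-encoded as Kinesis would deliver it.
event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"id": 1}').decode()}},
    {"kinesis": {"data": base64.b64encode(b'{"id": 2}').decode()}},
]}
print(handler(event))
```

Firehose takes over exactly this plumbing: decoding, batching, and delivery to S3, with no handler code to maintain.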

Different ways of ingesting data from on-premises into AWS

With the increasing demand for data-driven use cases, managing data on on-premises servers is difficult. Taking backups is not easy when you deal with huge volumes of data. The data in data lakes is used to build deep neural networks, create data warehouses that extract meaningful information, run analytics, and generate reports.

Now, if you look at the available options for migrating data into AWS, they come with various challenges too. For example, if you want to send data to S3, you have to write a few lines of code, and then manage that code and the servers that run it. You have to ensure the data travels over HTTPS, and you need to verify that each transfer succeeded. This adds complexity, time, and effort to the process. To avoid such scenarios, AWS provides services to match or solve your use cases by designing...
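The "few lines of code" and the verification burden look roughly like the sketch below. The MD5/ETag comparison shows the kind of transfer check you would have to own yourself; for single-part S3 uploads, the returned ETag is the MD5 of the object. The boto3 call is indicative only (the bucket and key are invented), so it is kept inside a function that is not invoked here:

```python
import hashlib

def content_md5(data: bytes) -> str:
    """MD5 hex digest; for single-part S3 uploads this matches the ETag."""
    return hashlib.md5(data).hexdigest()

def upload_and_verify(data: bytes, bucket: str, key: str) -> bool:
    """Indicative only: upload over HTTPS, then check the returned ETag."""
    import boto3  # kept local so the sketch runs without AWS installed
    etag = boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=data)["ETag"].strip('"')
    return etag == content_md5(data)

# The verification logic itself, shown against a simulated ETag:
data = b"some on-premises data"
simulated_etag = content_md5(data)
print(simulated_etag == content_md5(data))  # True
```

Services such as DataSync and DMS exist precisely so you don't have to build and babysit this kind of transfer-and-verify loop.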

Processing stored data on AWS

There are several services for processing data stored in AWS. In this section, you will learn about AWS Batch and Amazon Elastic MapReduce (EMR). EMR is an AWS product that primarily runs MapReduce jobs and Spark applications in a managed way. AWS Batch is used for long-running, compute-heavy workloads.

AWS EMR

EMR is a managed implementation of Apache Hadoop provided as a service by AWS. It includes other components of the Hadoop ecosystem, such as Spark, HBase, Flink, Presto, Hive, and Pig. You will not need to learn about these in detail for the certification exam, but here’s some information about EMR:

  • EMR clusters can be launched from the AWS console or via the AWS CLI with a specific number of nodes. A cluster can be long-running or ad hoc. With a long-running traditional cluster, you have to configure the machines and manage them yourself. If you have jobs that need to be executed faster, then you need...
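The kind of job EMR runs can be sketched as an in-memory MapReduce word count (pure Python, no Hadoop; the map, shuffle, and reduce phases mirror what the cluster distributes across its nodes):

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: emit (word, 1) pairs for each word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data on EMR", "big jobs on big clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

On a real cluster, the map and reduce phases run in parallel on different nodes and the shuffle moves data between them over the network; that distribution is what EMR manages for you.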

Summary

In this chapter, you learned about different ways of processing data in AWS. You also learned the capabilities in terms of extending your data centers to AWS, migrating data to AWS, and the ingestion process. You learned about the various ways of using data to process it and make it ready for analysis. You understood the magic of using a data catalog, which helps you to query your data via AWS Glue and Athena.

In the next chapter, you will learn about various machine learning algorithms and their usage.

Exam Readiness Drill – Chapter Review Questions

Apart from a solid understanding of key concepts, being able to think quickly under time pressure is a skill that will help you ace your certification exam. That is why working on these skills early on in your learning journey is key.

Chapter review questions are designed to improve your test-taking skills progressively with each chapter you learn and review your understanding of key concepts in the chapter at the same time. You’ll find these at the end of each chapter.

How To Access These Resources

To learn how to access these resources, head over to the chapter titled Chapter 11, Accessing the Online Practice Resources.

To open the Chapter Review Questions for this chapter, perform the following steps:

  1. Click the link – https://packt.link/MLSC01E2_CH03.

    Alternatively, you can scan the following QR code (Figure 3.9):

Figure 3.9 – QR code that opens Chapter Review Questions for logged-in users


Working On Timing

Target: Your aim is to keep the score the same while answering these questions as quickly as possible. Here’s an example of how your next attempts should look:

Attempt     Score   Time Taken
Attempt 5   77%     21 mins 30 seconds
Attempt 6   78%     18 mins 34 seconds
Attempt 7   76%     14 mins 44 seconds

Table 3.1 – Sample timing practice drills on the online platform

Note

The time limits shown in the above table are just examples. Set your own time limits with each attempt based on the time limit of the quiz on the website.

With each new attempt, your score should stay above 75% while your “time taken...

