You're reading from  AWS for Solutions Architects - Second Edition

Product type: Book
Published in: Apr 2023
Publisher: Packt
ISBN-13: 9781803238951
Edition: 2nd Edition
Authors (4):
Saurabh Shrivastava

Saurabh Shrivastava is a technology leader, author, inventor, and public speaker with over 18 years of experience in the IT industry. He currently works at Amazon Web Services (AWS) as a Global Solutions Architect Leader and enables global consulting partners and enterprise customers on their journey to the cloud. Saurabh led the AWS global technical partnerships, set his team's vision and execution model, and nurtured multiple new strategic initiatives. Saurabh has authored various blogs and whitepapers across a diverse range of technologies, such as big data, IoT, machine learning, and cloud computing. He is passionate about the latest innovations and their impact on our society and daily life. He holds a patent in the area of cloud platform automation. Before AWS, Saurabh worked as an enterprise solution architect, software architect, and software engineering manager in Fortune 50 enterprises, start-ups, and global product and consulting organizations.

Neelanjali Srivastav

Neelanjali Srivastav is a technology leader, product manager, agile coach, and cloud practitioner with over 16 years of experience in the software industry. She currently works at Amazon Web Services (AWS) as a Senior Product Manager and enables global customers on their data journey to the cloud. Neelanjali evangelizes and enables AWS customers and partners in AWS database, analytics, and machine learning services. She sets the product vision and cultivates new products in incubation. Before AWS, Neelanjali led teams of software engineers, solutions architects, and systems analysts to modernize IT systems and develop innovative software solutions for large enterprises. Neelanjali has held multiple roles in the IT services industry and R&D, focusing on enterprise application management, cloud service management, and orchestration.

Alberto Artasanchez

Alberto Artasanchez is a solutions architect with expertise in the cloud, data solutions, and machine learning, with a career spanning over 28 years in various industries. He is an AWS Ambassador and publishes frequently in a variety of cloud and data science publications. He is often tapped as a speaker on topics including data science, big data, and analytics. He has a strong and extensive track record of designing and building end-to-end machine learning platforms at scale. He also has a long track record of leading data engineering teams and mentoring, coaching, and motivating them. He has a great understanding of how technology drives business value and has a passion for creating elegant solutions to complicated problems.

Imtiaz Sayed

Imtiaz (Taz) Sayed leads the Worldwide Data Analytics Solutions Architecture community at AWS. He is a Principal Solutions Architect and works with diverse customers, engaging in thought leadership, strategic partnerships, and specialized guidance on building modern data platforms on AWS. He is a technologist with over 20 years of experience across several domains, including distributed architectures, data analytics, service mesh, databases, and DevOps.

Big Data and Streaming Data Processing in AWS

Traditionally, a business’s most important resources are its human and financial capital. However, in the last few decades, more and more businesses have realized that another resource may be just as, if not more, vital: its data capital.

Data has taken a special place at the center of some of today’s most successful enterprises. For this reason, business leaders have concluded that to survive in today’s business climate, they must collect, process, transform, distill, and safeguard their data like their other traditional business capital.

In this chapter, you will dive deep into AWS’s analytics services. First, you will learn about Amazon EMR, which is Hadoop in the cloud, and about the AWS data cataloging offering, AWS Glue. Finally, you will look at how to handle streaming data using AWS. You will cover the following topics:

  • Why use the cloud for big data analytics?
  • Amazon...

Why use the cloud for big data analytics?

The definition of big data has changed drastically in the last two decades. Now, there is a massive amount of data coming from various sources. This data can be structured, unstructured, or semi-structured. Large technology organizations, such as Amazon, Google, and Meta, flourish because they can derive insights from user data and use them for their customers’ benefit, thus growing their business multifold. IDC says, “The Global DataSphere is expected to more than double in size from 2022 to 2026” (Source – https://www.idc.com/getdoc.jsp?containerId=US49018922).

Now, it’s normal for organizations to have multiple terabytes or petabytes of data, and you want to gain new insights from this collected data. You must be able to easily access and analyze all data types, such as log files, clickstream data, voice, and video. But your team may require diverse skills and tools. You need to enable your team and applications...

Amazon Elastic MapReduce (EMR)

Back in 2009, AWS introduced EMR, a tool that can handle extremely large amounts of data (terabytes and petabytes) using the latest open-source big data tools like Spark, Hive, Presto, HBase, Flink, and Hudi in the cloud. Amazon EMR is a managed cluster platform that makes it easier to run big data tools, such as Apache Hadoop and Apache Spark, on the AWS cloud for processing and analyzing massive datasets. It is a wrapper around distributed open-source computing frameworks. This wrapper abstracts the effort required to set up infrastructure, security, network communication, disaster recovery, and scalability. Additionally, EMR offers 100% compliance with open-source APIs, so there is no need to change your application code when you move to EMR from an on-premises Hadoop system.

EMR runs directly against the data stored in your S3 data lake, so you don’t need to move or transform that data first. You can store data in the data lake...

Introduction to AWS Glue

Data-driven businesses can increase their profitability and efficiency, reduce costs, deliver new products and services, better serve their customers, comply with regulatory requirements, and ultimately thrive. Unfortunately, as many examples in recent years have shown, companies that don’t make this transition may not survive. An important part of a data-driven enterprise is the ability to ingest, process, transform, and analyze its data.

AWS Glue is a foundational service at the heart of the AWS offering.

With the introduction of Apache Spark, enterprises can process petabytes’ worth of data daily. Processing this amount of data opens the door to making data an enterprise’s most valuable asset, and working at this scale allows enterprises to create new industries and markets. Some examples of business activities that have significantly benefited from this massive data processing are as follows:

...

Choosing between AWS Glue and Amazon EMR

Having learned about Glue and EMR, you may be wondering whether these offerings do a similar job in data processing, and when to choose one over the other. Yes, AWS has similar offerings, and that can be confusing sometimes, but each has a specific purpose. Amazon always works backward from the customer, so all these offerings are available because customers have asked for them.

For your data cataloging needs, the choice is a no-brainer: you should always use AWS Glue, and these data catalogs can also be used when you are processing a job in EMR. However, Glue only supports the Spark framework, so if you are interested in using any other open-source software such as Hive, Pig, or Presto, then you need to choose EMR.
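
The rules of thumb in this paragraph can be written down as a small decision helper. This is only a sketch of the guidance stated above, not an AWS API: the `Service` enum, the `suggest_service` function, and the task names `"catalog"` and `"processing"` are hypothetical names introduced here for illustration.

```python
from enum import Enum


class Service(Enum):
    """Possible recommendations (hypothetical names, for illustration only)."""
    GLUE = "AWS Glue"
    EMR = "Amazon EMR"
    EITHER = "AWS Glue or Amazon EMR"


def suggest_service(task: str, framework: str = "spark") -> Service:
    """Encode the chapter's rules of thumb.

    task: "catalog" for data cataloging, "processing" for data processing.
    framework: the open-source framework the job uses (for example
    "spark", "hive", or "presto"). Only relevant for processing tasks.
    """
    if task == "catalog":
        # Data cataloging: the text recommends always using Glue.
        return Service.GLUE
    if task == "processing":
        if framework.lower() == "spark":
            # Spark jobs can run on both; further criteria are needed.
            return Service.EITHER
        # Any other open-source framework points to EMR.
        return Service.EMR
    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    print(suggest_service("catalog").value)
    print(suggest_service("processing", "hive").value)
    print(suggest_service("processing", "spark").value)
```

The `EITHER` result is deliberate: when the job runs on Spark, the framework alone does not settle the choice, and other considerations (such as where the job is being migrated from) have to be weighed.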

When running data transformation using the Spark platform, you must choose between EMR and Glue. Suppose you are migrating your ETL job from an on-premises Hadoop environment. In that case, you can go with...

Handling streaming data in AWS

In today’s world, businesses aim to gain a competitive edge by providing timely, tailored experiences to consumers. Whether they are applying for a loan, investing, shopping online, tracking health alerts, or monitoring home security systems, consumers expect personalized experiences that meet their specific needs and reject those that don’t. As a result, speed has become a critical characteristic that businesses strive to achieve. Insights from data are perishable and can lose value quickly. Streaming data processing allows analytical insights to be gathered and acted upon instantly to deliver the desired customer experience.

Batch processing data doesn’t allow for real-time risk mitigation or customer authentication, and the customer experience can be ruined – and is hard to recover – if action isn’t taken in real time. Acting on real-time data can help prevent fraud and increase customer loyalty. Untimely...

Choosing between Amazon Kinesis and Amazon MSK

AWS launched Kinesis in 2013, and it was the only streaming data offering until 2018, when AWS launched MSK in response to high demand from its customers for managed Apache Kafka clusters. Now, both are similar offerings, so you may be wondering when to choose one versus the other. If you already have an existing Kafka workload on-premises or are running Kafka in EC2, it’s better to migrate to MSK, as you don’t need to make any changes in the code. You can use the existing MirrorMaker tool to migrate. The following diagrams show key architectural differences between MSK and Kinesis:

Figure 10.4: Amazon MSK architecture

Figure 10.5: Amazon Kinesis architecture

As shown in the preceding diagrams, there are similarities between the MSK and Kinesis architectures. In the MSK cluster, you have brokers to store and ingest data, while in Kinesis, you have shards. In MSK, you need ZooKeeper...
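
To make the role of shards as the unit of capacity in Kinesis more concrete, here is a rough sizing sketch. It is not taken from the chapter. It assumes the commonly documented per-shard limits for Kinesis Data Streams in provisioned mode (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads), which you should confirm against the current AWS documentation. The function name `estimate_shards` and its parameters are hypothetical.

```python
import math

# Assumed per-shard limits (provisioned mode). Verify against the current
# AWS documentation before relying on these numbers.
WRITE_MB_PER_SEC_PER_SHARD = 1.0
WRITE_RECORDS_PER_SEC_PER_SHARD = 1000
READ_MB_PER_SEC_PER_SHARD = 2.0


def estimate_shards(write_mb_per_sec: float,
                    records_per_sec: float,
                    read_mb_per_sec: float) -> int:
    """Return a rough lower bound on the number of shards needed.

    The stream must satisfy every limit at once, so the estimate is the
    largest of the three individual requirements, and at least one shard.
    """
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC_PER_SHARD)
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


if __name__ == "__main__":
    # 5 MB/s in, 2,500 records/s in, 12 MB/s out across all consumers.
    print(estimate_shards(5, 2500, 12))
```

In this example the read side dominates: 12 MB/s of consumer throughput needs six shards, even though the write volume alone would fit in five. The estimate ignores uneven partition-key distribution and traffic spikes, so treat it as a starting point only.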

Summary

In this chapter, you started by understanding why to choose the cloud for big data analytics. You learned about the details of Amazon EMR, which is the AWS Hadoop offering in the cloud, and that in 2021, AWS also launched the serverless offering of EMR. You learned about EMR clusters, file systems, and security.

Later in this chapter, you were introduced to one of the most important services in the AWS stack – AWS Glue. You learned about the high-level components that comprise AWS Glue, such as the AWS Glue console, the AWS Glue Data Catalog, AWS Glue crawlers, and AWS Glue code generators. You then learned how everything is connected and how it can be used. Finally, you learned about the recommended best practices when architecting and implementing AWS Glue. You also learned when to choose Glue over EMR, and vice versa.

Real-time insights are becoming essential to the modern customer experience, and you learned about handling streaming data in the cloud. You learned...
