Reader small image

You're reading from  Simplify Big Data Analytics with Amazon EMR

Product typeBook
Published inMar 2022
PublisherPackt
ISBN-139781801071079
Edition1st Edition
Tools
Concepts
Right arrow
Author (1)
Sakti Mishra
Sakti Mishra
author image
Sakti Mishra

Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technologies and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud technologies such as AWS and Google Cloud Platform. Sakti has a bachelor’s degree in engineering and a master’s degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He is also an author of multiple technology blogs, workshops, white papers and is a public speaker who represents AWS in various domains and events.
Read more about Sakti Mishra

Right arrow

Use case and architecture overview

For this use case, let's assume you have a vendor who provides incremental sales data at the end of every day. The file arrives in S3 as CSV and it needs to be processed and made available to your data analysts for querying.

Your assignment is to build a data pipeline that automatically picks up the new sales file from the S3 input bucket, processes it with required transformations, and makes it available in the target S3 bucket, which will be used for querying. To implement this pipeline, you have planned to integrate a transient EMR cluster with Spark as the distributed processing engine. This EMR cluster is not active and gets created just before executing the job and gets terminated after completing the job.

Architecture overview

The following is the high-level architecture diagram of the data pipeline:

Figure 9.1 – Reference architecture diagram for a batch ETL pipeline

Here are the steps as shown...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Simplify Big Data Analytics with Amazon EMR
Published in: Mar 2022Publisher: PacktISBN-13: 9781801071079

Author (1)

author image
Sakti Mishra

Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technologies and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud technologies such as AWS and Google Cloud Platform. Sakti has a bachelor’s degree in engineering and a master’s degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He is also an author of multiple technology blogs, workshops, white papers and is a public speaker who represents AWS in various domains and events.
Read more about Sakti Mishra