You're reading from  Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Product type: Book
Published in: Oct 2021
Publisher: Packt
ISBN-13: 9781801077743
Edition: 1st Edition
Author (1)
Manoj Kukreja

Manoj Kukreja is a Principal Architect at Northbay Solutions who specializes in creating complex Data Lakes and Data Analytics Pipelines for large-scale organizations such as banks, insurance companies, universities, and US/Canadian government agencies. Previously, he worked for Pythian, a large managed service provider, where he led the MySQL and MongoDB DBA group and supported large-scale data infrastructure for enterprises across the globe. With over 25 years of IT experience, he has delivered Data Lake solutions using all major cloud providers, including AWS, Azure, GCP, and Alibaba Cloud. On weekends, he trains groups of aspiring Data Engineers and Data Scientists on Hadoop, Spark, Kafka, and Data Analytics on AWS and Azure Cloud.

Chapter 3: Data Engineering on Microsoft Azure

In the previous chapter, we discussed how cloud adoption offers greater flexibility and faster deployments for data engineering and analytical workloads. In this chapter, we'll discuss the major tools and services in Microsoft Azure that may help us implement such a solution.

In this chapter, we will cover the following topics:

  • Introduction to data engineering in Azure
  • Performing data engineering in Azure
  • How to open a free account with Azure

Introducing data engineering in Azure

In recent years, Microsoft Azure has added several powerful services to its arsenal that seamlessly collect, store, process, and publish data for both batch and streaming workloads. Gone are the days when choices for storage and compute were severely limited among cloud vendors. Back then, as a user, you simply had to conform to the supplied tools and services; now, your options are more extensive.

Today, the cloud ecosystem looks very different from what it did previously. The growth of cloud services allows users to choose from a variety of storage, compute, and deployment options. As an example, if I want to run a Spark program, I can choose from at least four different options in Microsoft Azure. The real question is, if all four options are running Apache Spark, then why are these options even required?

Important Note

The array of options available on the cloud is not limited to compute only: the same variety exists for data collection...

Performing data engineering in Microsoft Azure

Data engineering in Microsoft Azure can be performed using the following three options:

  • Self-managed data engineering services (IaaS)
  • Azure-managed data engineering services (PaaS)
  • Data engineering as a service (SaaS)

Figure 3.1 – Data engineering options in Microsoft Azure

Self-managed data engineering services (IaaS)

In the early phases of data engineering, the use of well-known distributed frameworks such as Hadoop, Spark, and Kafka rose sharply. As a result, many organizations deployed Hadoop/Spark/Kafka on on-premises infrastructure. Since Hadoop/Spark/Kafka are multi-node frameworks, these installations were performed on physical and virtual machines hosted in either the organization's own or co-located data centers.

Then came the period when the cloud started to become a reality and organizations started to move their Hadoop/Spark/Kafka clusters to...

Opening a free account with Microsoft Azure

In the upcoming chapters, we will use the services you have just been reading about to build a data lake using the lakehouse architecture. Therefore, it is time to open a free Azure account that gives you 12 months of free services, plus a one-time $260 credit. Please note that most, but not all, services are free. To open a free account and browse through the free services, please visit the following link: https://azure.microsoft.com/en-ca/free/.

Here is some valuable advice:

  • It is always a good idea to remove the compute resources once you've used them.
  • You get 5 GB of locally redundant storage (LRS) for free. There is no need to remove data that's stored during future exercises since we will not be exceeding this limit.
  • While using the free services in Azure, please keep a strict eye on your billing using the following link: https://portal.azure.com/#blade/Microsoft_Azure_Billing/BillingMenuBlade...
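
The 5 GB free storage allowance mentioned above lends itself to a quick sanity check before you upload exercise data. The following is only a minimal sketch of that arithmetic: the helper names are hypothetical (they are not part of any Azure SDK), the example dataset sizes are made up, and it assumes 1 GB is counted as 2**30 bytes. Azure's own metering, visible on the billing page, remains the source of truth.

```python
# Hypothetical helpers for illustration only; these names do not come from Azure.
FREE_LRS_LIMIT_GB = 5      # free allowance quoted in the text above
BYTES_PER_GB = 1024 ** 3   # assumption: 1 GB is counted as 2**30 bytes


def within_free_limit(file_sizes_bytes):
    """Return True while the total stored size is at or below the free allowance."""
    return sum(file_sizes_bytes) <= FREE_LRS_LIMIT_GB * BYTES_PER_GB


def remaining_free_storage_gb(file_sizes_bytes):
    """Return how many GB of the free allowance are left (never below zero)."""
    used_gb = sum(file_sizes_bytes) / BYTES_PER_GB
    return max(FREE_LRS_LIMIT_GB - used_gb, 0.0)


# Made-up example: three exercise datasets of 1 GB, 1.5 GB, and 512 MB.
datasets = [1 * BYTES_PER_GB, int(1.5 * BYTES_PER_GB), 512 * 1024 ** 2]

print(within_free_limit(datasets))          # True
print(remaining_free_storage_gb(datasets))  # 2.0
```

A check like this is useful only as a rough guard while planning exercises; it says nothing about compute charges, which is why removing compute resources after use still matters.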

Summary

In this chapter, we learned about the IaaS, PaaS, and SaaS services in Azure that can help a data engineer build a data lake. We also discussed how cloud vendors provide many different options to perform similar operations. It is up to the data engineer to choose the service that best benefits the customer, based on their usage patterns, in-house skills, and budget.

The modern-day data pipeline requires careful planning, design, development, and deployment. In the next chapter, we will learn about the life cycle of a data pipeline and effective strategies for each phase of data pipeline creation.
