You're reading from Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Product type: Book

Published in: Oct 2021

Publisher: Packt

ISBN-13: 9781801077743

Edition: 1st Edition

Tools

Apache Spark

Concepts

Data Processing

Author (1)

Manoj Kukreja

Chapter 3: Data Engineering on Microsoft Azure

In the previous chapter, we discussed how cloud adoption offers greater flexibility and faster deployments for data engineering and analytical workloads. In this chapter, we'll discuss the major tools and services in Microsoft Azure that may help us implement such a solution.

In this chapter, we will cover the following topics:

Introduction to data engineering in Azure
Performing data engineering in Azure
How to open a free account with Azure

Introducing data engineering in Azure

In recent years, Microsoft Azure has added several powerful services to its arsenal that seamlessly collect, store, process, and publish data for both batch and streaming workloads. Gone are the days when choices for storage and compute were severely limited among cloud vendors. Back then, as a user, you simply had to conform to the supplied tools and services; now, your options are more extensive.

Today, the cloud ecosystem looks very different from what it did previously. The growth of cloud services allows users to choose from a variety of storage, compute, and deployment options. As an example, if I want to run a Spark program, I can choose from at least four different options in Microsoft Azure. The real question is, if all four options are running Apache Spark, then why are these options even required?

Important Note

The array of options available on the cloud is not limited to compute only: the same variety exists for data collection...

Performing data engineering in Microsoft Azure

Data engineering in Microsoft Azure can be performed using the following three options:

Self-managed data engineering services (IaaS)
Azure-managed data engineering services (PaaS)
Data engineering as a service (SaaS)

Figure 3.1 – Data engineering options in Microsoft Azure

Self-managed data engineering services (IaaS)

In the early phases of data engineering, the use of well-known distributed frameworks such as Hadoop, Spark, and Kafka rose sharply. As a result, many organizations deployed Hadoop/Spark/Kafka on on-premises infrastructure. Since Hadoop/Spark/Kafka are multi-node frameworks, the installations were performed on physical and virtual machines hosted in either organization-owned or co-located data centers.

Then came the period when the cloud started to become a reality and organizations started to move their Hadoop/Spark/Kafka clusters to...

Opening a free account with Microsoft Azure

In the upcoming chapters, we will be using the services you have just been reading about to build a data lake using the lakehouse architecture. Therefore, it is time to open a free Azure account that gives you 12 months of free services, plus a one-time $260 credit. Please note that not all – but most – services are free. To open a free account and browse through the free services, please visit the following link: https://azure.microsoft.com/en-ca/free/.

Here is some valuable advice:

It is always a good idea to remove the compute resources once you've used them.
You get 5 GB of locally redundant storage (LRS) for free. There is no need to remove data that's stored during future exercises since we will not be exceeding this limit.
While using the free services in Azure, please keep a close eye on your billing using the following link: https://portal.azure.com/#blade/Microsoft_Azure_Billing/BillingMenuBlade...
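The storage advice above can be checked locally before any data is uploaded. The following is a minimal sketch, not part of the book's exercises: the helper names (directory_size_bytes, fits_within_free_tier) are hypothetical, and it assumes that "5 GB" means 5 x 1024^3 bytes, which the text does not specify. It only measures files on your own machine and does not call any Azure API.

```python
import os

# Assumption: the free allowance is treated as 5 * 1024**3 bytes.
# The source only says "5 GB"; adjust this if your account states otherwise.
FREE_LRS_LIMIT_BYTES = 5 * 1024**3


def directory_size_bytes(root: str) -> int:
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked files are not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_within_free_tier(used_bytes: int,
                          limit_bytes: int = FREE_LRS_LIMIT_BYTES) -> bool:
    """Return True if used_bytes does not exceed the assumed free limit."""
    return used_bytes <= limit_bytes
```

For example, calling fits_within_free_tier(directory_size_bytes("exercise_data")) on a hypothetical local folder named exercise_data tells you whether that folder alone would stay within the assumed allowance.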

Summary

In this chapter, we learned about the IaaS, PaaS, and SaaS services in Azure that can help a data engineer build a data lake. We also discussed that cloud vendors provide many different options to perform similar operations. It is up to the data engineer to choose the right service that provides the customer with benefits based on their usage patterns, in-house skills, and budget.

The modern-day data pipeline requires careful planning, design, development, and deployment. In the next chapter, we will learn about the life cycle of a data pipeline and effective strategies for each phase of data pipeline creation.

Author (1)

Manoj Kukreja

Manoj Kukreja is a Principal Architect at Northbay Solutions who specializes in creating complex Data Lakes and Data Analytics Pipelines for large-scale organizations such as banks, insurance companies, universities, and US/Canadian government agencies. Previously, he worked for Pythian, a large managed service provider where he was leading the MySQL and MongoDB DBA group and supporting large-scale data infrastructure for enterprises across the globe. With over 25 years of IT experience, he has delivered Data Lake solutions using all major cloud providers including AWS, Azure, GCP, and Alibaba Cloud. On weekends, he trains groups of aspiring Data Engineers and Data Scientists on Hadoop, Spark, Kafka and Data Analytics on AWS and Azure Cloud.