Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines

Our data journey is finally approaching its destination. As the new era of analytics takes over, the demand for data engineers will continue to grow, and so will the amount of code they produce. The ever-increasing need to develop, manage, and deploy large code sets is already testing the limits of modern data engineers.

Luckily, a modern trend is fast emerging that has the potential to take much of this burden off data engineers. In this chapter, we will learn about code delivery automation using CI/CD pipelines. In short, CI/CD is a collection of practices used to integrate and deliver code faster using small, atomic changes.

In this chapter, we will cover the following topics:

  • Understanding CI/CD
  • Designing CI/CD pipelines
  • Developing CI/CD pipelines

Understanding CI/CD

The process of data transformation is continuous. In every modern organization, the volume and variety of data are increasing at a very high pace. As a result, the need to create new data pipelines or modify existing ones is equally high. This sudden growth in data pipeline code is testing the limits of the traditional software delivery cycle.

As a result, organizations are eager to adopt viable methods that can accelerate product delivery using a combination of best practices and automation. After all, streamlining the software delivery cycle creates a clear path to success. Before we try to understand how CI/CD works, there is merit in understanding the traditional software delivery cycle.

Traditional software delivery cycle

Before we start talking about the modern approach to software delivery, let's understand how the traditional method has worked so far:

Figure 12.1 – Traditional software delivery cycle

...

Designing CI/CD pipelines

Before we dive deep into the actual development and implementation of CI/CD pipelines, we should design their layout. In typical data analytics projects, the focus of development revolves around two key areas:

  • Infrastructure Deployment: As discussed in the previous chapter, these days it is recommended to perform cloud deployments using the Infrastructure as Code (IaC) practice. Infrastructure code has traditionally been developed by DevOps engineers, although recently data engineers have been asked to share this responsibility; a minimal sketch of such a pipeline is shown after this list.
  • Data Pipelines: The development of data pipelines is likely to be handled entirely by data engineers. The code that's developed includes functionality to perform data collection, ingestion, curation, aggregation, governance, and distribution.
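
To make the first area more concrete, here is a minimal sketch of what a CI pipeline for the infrastructure code might look like when expressed in Azure Pipelines YAML. The file name, the infrastructure folder, and the use of Terraform are illustrative assumptions only; the pipelines built later in this chapter may be structured differently:

    # azure-pipelines-iac.yml (hypothetical) -- validates the IaC code on every commit
    trigger:
      branches:
        include:
          - main
      paths:
        include:
          - infrastructure          # run only when the infrastructure code changes

    pool:
      vmImage: 'ubuntu-latest'

    steps:
      - script: |
          terraform init -backend=false
          terraform validate
        workingDirectory: infrastructure
        displayName: 'Validate the infrastructure templates'

A second pipeline, structured along the same lines, would cover the data pipeline code itself.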

Following the continuous development, integration, and deployment principles, the recommended approach is to create two CI/CD pipelines that we will refer to as the Electroniz...

Developing CI/CD pipelines

In this section, we will learn how to create and deploy the two CI/CD pipelines mentioned previously. We will create these CI/CD pipelines using Azure DevOps. Azure DevOps is a collection of developer services for planning, collaborating on, developing, and deploying code. Although Azure DevOps supports a variety of developer services, for this exercise, we will primarily focus on Azure Repos and Azure Pipelines.
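
As a rough illustration of how these two services fit together, the sketch below shows a hypothetical azure-pipelines.yml file committed to the root of an Azure Repos repository; Azure Pipelines reads this file and runs its steps on every commit pushed to the main branch. The Python version, requirements.txt, and tests folder are assumed for illustration and are not the book's actual project layout:

    # azure-pipelines.yml (hypothetical) -- CI for the data pipeline code
    trigger:
      - main                        # run on every commit pushed to main

    pool:
      vmImage: 'ubuntu-latest'

    steps:
      - task: UsePythonVersion@0
        displayName: 'Select a Python version'
        inputs:
          versionSpec: '3.9'

      - script: |
          pip install -r requirements.txt
          pytest tests/
        displayName: 'Install dependencies and run the unit tests'

Keeping the pipeline definition in the same repository as the code means that every small, atomic change to the code, and to the pipeline itself, goes through the same review and build process.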

I know we are eager to proceed with creating the pipelines, but there is a fair bit of preparation required before we can get started. The process starts with creating an Azure DevOps organization, which can be done in a few simple steps. However, to use the free tier of Azure Pipelines, you need to submit a free parallelism request form for your newly created Azure DevOps organization. The approval process may take 2-3 days to complete.

Creating an Azure DevOps organization

Follow these steps to create an Azure DevOps organization:

...

Summary

In an era where organizations are aiming to do more with less, automation is quickly gaining a lot of attention. As CI/CD continues to grow and gain strength, it is set to become one of the most critical skills for modern data engineers. In most cases, the high cost of data engineers can only be justified if their skill set includes automation.

In many respects, adopting automation practices such as CI/CD is proving to be a lifesaver. Not only does automation take a lot of work off data engineers' shoulders, but it also lowers costs by performing repetitive iterations predictably. On top of that, the built-in approval and fail-fast mechanisms in CI/CD ensure team accountability and collaboration. If used wisely, automation can ensure the predictable and seamless delivery of code and infrastructure components.

This is the last chapter of this book. I must admit that in the last 12 chapters, we have covered a lot of ground. We undertook the journey of...
