Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines

Our data journey is finally approaching its destination. As the new era of analytics takes over, the demand for data engineers will continue to grow, and so will the amount of code they produce. The ever-increasing need to develop, manage, and deploy large code sets is already testing the limits of modern data engineers.

Luckily, a modern trend is fast emerging that has the potential to take much of this burden off data engineers. In this chapter, we will learn about code delivery automation using CI/CD pipelines. In short, CI/CD is a collection of practices used to integrate and deliver code faster using small, atomic changes.

In this chapter, we will cover the following topics:

  • Understanding CI/CD
  • Designing CI/CD pipelines
  • Developing CI/CD pipelines

Understanding CI/CD

The process of data transformation is continuous. In every modern organization, the volume and variety of data are increasing at a very high pace. As a result, the need to create new data pipelines or modify existing ones is equally high. This sudden growth in data pipeline code is testing the limits of the traditional software delivery cycle.

As a result, organizations are eager to adopt viable methods that can accelerate product delivery using a combination of best practices and automation. After all, streamlining the software delivery cycle creates a clear path to success. Before we try to understand how CI/CD works, there is merit in understanding the traditional software delivery cycle.

Traditional software delivery cycle

Before we start talking about the modern approach to software delivery, let's understand how the traditional method has worked so far:

Figure 12.1 – Traditional software delivery cycle

...

Designing CI/CD pipelines

Before we dive deep into the actual development and implementation of CI/CD pipelines, we should design their layout. In typical data analytics projects, the focus of development revolves around two key areas:

  • Infrastructure Deployment: As discussed in the previous chapter, these days it is recommended to perform cloud deployments using the Infrastructure as Code (IaC) practice. Infrastructure code has traditionally been developed by DevOps engineers, although recently data engineers have been asked to share this responsibility; a minimal sketch of such a pipeline is shown after this list.
  • Data Pipelines: The development of data pipelines is likely to be handled entirely by data engineers. The code that's developed includes functionality to perform data collection, ingestion, curation, aggregation, governance, and distribution.
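
To make the first area more concrete, here is a minimal sketch of what a CI pipeline for the infrastructure code might look like when expressed in Azure Pipelines YAML. The file name, the infrastructure folder, and the use of Terraform are illustrative assumptions only; the pipelines built later in this chapter may be structured differently:

    # azure-pipelines-iac.yml (hypothetical) -- validates the IaC code on every commit
    trigger:
      branches:
        include:
          - main
      paths:
        include:
          - infrastructure          # run only when the infrastructure code changes

    pool:
      vmImage: 'ubuntu-latest'

    steps:
      - script: |
          terraform init -backend=false
          terraform validate
        workingDirectory: infrastructure
        displayName: 'Validate the infrastructure templates'

A second pipeline, structured along the same lines, would cover the data pipeline code itself.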

Following the continuous development, integration, and deployment principles, the recommended approach is to create two CI/CD pipelines that we will refer to as the Electroniz...

Developing CI/CD pipelines

In this section, we will learn how to create and deploy the two CI/CD pipelines mentioned previously. We will create these CI/CD pipelines using Azure DevOps. Azure DevOps is a collection of developer services for planning, collaborating on, developing, and deploying code. Although Azure DevOps supports a variety of developer services, for this exercise, we will primarily focus on Azure Repos and Azure Pipelines.
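
As a rough illustration of how these two services fit together, the sketch below shows a hypothetical azure-pipelines.yml file committed to the root of an Azure Repos repository; Azure Pipelines reads this file and runs its steps on every commit pushed to the main branch. The Python version, requirements.txt, and tests folder are assumed for illustration and are not the book's actual project layout:

    # azure-pipelines.yml (hypothetical) -- CI for the data pipeline code
    trigger:
      - main                        # run on every commit pushed to main

    pool:
      vmImage: 'ubuntu-latest'

    steps:
      - task: UsePythonVersion@0
        displayName: 'Select a Python version'
        inputs:
          versionSpec: '3.9'

      - script: |
          pip install -r requirements.txt
          pytest tests/
        displayName: 'Install dependencies and run the unit tests'

Keeping the pipeline definition in the same repository as the code means that every small, atomic change to the code, and to the pipeline itself, goes through the same review and build process.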

I know we are eager to proceed with creating the pipelines, but there is a fair bit of preparation required before we can get started. The process starts with creating an Azure DevOps organization, which can be done in a few simple steps. However, to use the free tier of Azure Pipelines, you need to submit a free parallelism request form for your newly created Azure DevOps organization. The approval process may take 2-3 days to complete.

Creating an Azure DevOps organization

Follow these steps to create an Azure DevOps organization:

...

Summary

In an era where organizations are aiming to do more with less, automation is quickly gaining a lot of attention. As CI/CD continues to grow and gain strength, it is set to become one of the most critical skills for modern data engineers. In most cases, the high cost of data engineers can only be justified if their skill set includes automation.

In many respects, adopting automation practices such as CI/CD is proving to be a lifesaver. Not only does automation take a lot of work off data engineers' shoulders, but it also lowers costs by performing repetitive iterations predictably. On top of that, the built-in approval and fail-fast mechanisms in CI/CD ensure team accountability and collaboration. If used wisely, automation can ensure the predictable and seamless delivery of code and infrastructure components.

This is the last chapter of this book. I must admit that in the last 12 chapters, we have covered a lot of ground. We undertook the journey of...
