Chapter 11: Operationalizing Data and ML Pipelines

"We are what we repeatedly do. Excellence, then, is not an act, but a habit."       

–  Aristotle

In the previous chapters, we saw how Delta helps to democratize data products and services and facilitates data sharing within the organization and externally with vendors and partners. Creating a Proof of Concept (POC) happy path to prove what is feasible is a far cry from taking a workload to production. Stakeholders and consumers get upset when their reports are not available in a timely manner and, over time, lose confidence in the team's ability to deliver on their promises. This affects the profitability metrics of the business.

In this chapter, we will look into the aspects of DevOps that harden a pipeline so that it stands the test of time and people do not...

Technical requirements

To follow along with this chapter, make sure you have the code and instructions detailed at this GitHub location: https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter11.

https://delta.io/roadmap/ lists the features coming to open source Delta in the near future. This chapter refers to some of them, including Delta clone.

We discuss a Databricks-specific feature called Delta Live Tables (DLT) to give an example of what to aspire to in an intelligent pipeline.

Let's get started!

Why operationalize?

Consistently bringing data to the right stakeholders in a timely manner is what data/analytics operationalization is all about. It looks deceptively simple, yet only about 1% of AI/ML projects truly succeed. The main reasons are a lack of scale, a lack of trust, and a lack of governance, meaning that not all the compliance boxes are checked in time to deliver the project within its window of opportunity. The key areas that need attention include getting complete datasets (including unstructured data, which is the hardest to tame), accelerating the development process by improving collaboration between data personas, and having a well-defined governance and deployment framework.

By now, the medallion architecture should be a familiar blueprint. Note that in the real world, many producers, pipelines, and consumers criss-cross one another. Each pipeline transforms and wrangles data based on the requirements...

Understanding and monitoring SLAs

A Service Level Agreement (SLA) is part of an explicit or implicit contract attesting to certain service quality metrics and expectations. Violating some of these could result in penalties, fines, and loss of reputation. There is usually a tradeoff between cost and service quality. So, it is important to articulate the SLA requirements of each use case and describe how they will be measured and tracked, so that there is no ambiguity about whether an SLA was honored or violated. There should also be clear guidance on how SLA violations are reported, and on the service provider's obligations and the consequences it faces when remedying or compensating for a breach.
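
To make this concrete, the following minimal sketch (in PySpark) shows one way a data-freshness SLA could be tracked against a Delta table's transaction log; the table name, threshold, and alerting action are illustrative assumptions rather than a prescribed implementation:

    # A sketch of tracking a data-freshness SLA against a Delta table's history.
    # The table name (sales_gold) and the one-hour threshold are illustrative,
    # and spark is assumed to be an existing SparkSession (for example, in a notebook).
    from datetime import datetime, timedelta

    FRESHNESS_SLA = timedelta(hours=1)   # agreed maximum staleness of the gold table

    # The newest entry in the Delta transaction log tells us when the table
    # was last written to (timestamps are assumed to be in UTC here).
    last_commit = (
        spark.sql("DESCRIBE HISTORY sales_gold LIMIT 1")
             .select("timestamp")
             .first()["timestamp"]
    )

    staleness = datetime.utcnow() - last_commit
    if staleness > FRESHNESS_SLA:
        # In practice, raise an alert (email, Slack, PagerDuty, ...) instead of printing
        print(f"SLA breach: data is {staleness} old, limit is {FRESHNESS_SLA}")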

There are several types of SLA, and common ones include metrics for system availability, system response time, customer satisfaction as measured by Net Promoter Score (NPS), support tickets raised over a period of time, defect/error tickets and the response time to address them, and security incidents. It is...

Scaling and high availability

Scalability refers to the elasticity of compute resources, meaning adding more compute capacity as data volume increases to support a heavier workload. It is sometimes also necessary to scale down resources that aren't in use to save compute costs. Scaling can be of two types: vertical or horizontal. Vertical scaling refers to replacing existing nodes with bigger instance types; this is not sustainable beyond a point because there is an upper bound on the largest available instance. Horizontal scaling refers to adding more worker nodes of the same type and can, in practice, scale much further. Each serves different scenarios: if the largest partition can no longer be split, a bigger node type helps, whereas horizontal scaling has the advantage that some of the nodes can be turned off when data volume is low. This is an infrastructure and architecture capability and is not directly related to Delta.
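
As a rough illustration, horizontal autoscaling is often expressed declaratively in the cluster definition; the following Databricks-style specification is a sketch with illustrative values, not a sizing recommendation:

    # A sketch of horizontal autoscaling expressed as a Databricks-style cluster
    # definition (the instance type, bounds, and runtime version are illustrative).
    cluster_spec = {
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": "i3.xlarge",   # vertical scaling: choose a bigger instance type
        "autoscale": {                 # horizontal scaling: the platform adds or removes
            "min_workers": 2,          # workers between these bounds as load changes
            "max_workers": 8,
        },
    }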

High availability (HA) refers to the system uptime...

Planning for DR 

Planning for disaster recovery (DR) requires balancing cost against the time a business needs to recover from an outage. The shorter the expected recovery time, the more expensive the DR solution.

It is important to understand two key SLAs for the business use case:

  • Recovery Time Objective (RTO) refers to the duration within which a business must recover from an outage. For example, if the RTO is 1 hour and 30 minutes have passed since the outage, we have 30 more minutes to recover and bring operations back online without violating the RTO.
  • Recovery Point Objective (RPO) refers to the maximum period of data loss a business can tolerate after a disruption. For example, if the last backup was taken an hour ago and the defined RPO is 2 hours, we can afford to lose up to one more hour of data before the RPO is breached (see the sketch that follows).
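
The two objectives can be reasoned about as simple time budgets, as the following sketch shows; it mirrors the illustrative numbers in the examples above, and the Delta clone statement and table names are assumptions:

    # A sketch of reasoning about RTO/RPO as time budgets, mirroring the numbers above.
    from datetime import datetime, timedelta

    RTO = timedelta(hours=1)   # must be back online within 1 hour of the outage
    RPO = timedelta(hours=2)   # may lose at most 2 hours' worth of data

    outage_start = datetime(2022, 7, 1, 10, 0)
    last_backup  = datetime(2022, 7, 1,  9, 0)
    now          = datetime(2022, 7, 1, 10, 30)

    recovery_time_left = RTO - (now - outage_start)          # 30 minutes remaining
    data_loss_budget   = RPO - (outage_start - last_backup)  # backups could be 1 hour older
    print(recovery_time_left, data_loss_budget)

    # One way to keep the RPO small is to periodically copy critical tables to a
    # secondary site with Delta clone (available on Databricks; clone is also on
    # the open source Delta roadmap referenced earlier), for example:
    # spark.sql("CREATE OR REPLACE TABLE dr_site.sales_gold DEEP CLONE prod.sales_gold")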

In the next section, we will see how to use these values of RTO and...

Guaranteeing data quality

We've examined the medallion architecture blueprint, where raw data is brought as-is into bronze, refined in silver, and aggregated in gold. As data moves from left to right in the pipeline, it is transformed and refined in quality by the application of business rules (how to normalize data and impute missing values, among other things), and this curated data is more valuable than the original data. But what if the transformations, instead of refining and increasing the quality of the data, have bugs and occasionally cause damage? We need a way to monitor data quality, ensure it is maintained over time, and be notified if it degrades for any reason. Occasionally a fix requires updating individual records, a needle-in-a-haystack scenario that nevertheless needs to be accommodated easily.
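
One way to enforce such rules at the storage layer is with Delta table constraints; the following is a minimal sketch, where the table and column names are assumptions and spark refers to an existing SparkSession:

    # A sketch of enforcing a quality rule at the storage layer with a Delta
    # CHECK constraint (table and column names are illustrative).
    spark.sql("""
        ALTER TABLE silver.orders
        ADD CONSTRAINT non_negative_amount CHECK (amount >= 0)
    """)

    # From now on, any write that violates the constraint fails the whole
    # transaction, so bad records never silently reach downstream consumers.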

Delta's ACID transaction support guarantees that in the event of a failure, no partial writes are committed, ensuring that consumers...

Automation of CI/CD pipelines 

POC and pilot code that proves out an end-to-end path does not get sanctioned for production as is. Typically, it makes its way through dev, stage, and prod environments, where it gets tested and scrutinized. A data product may require different data teams and departments to come together and test it holistically. An ML cycle has a few additional steps around ML artifact testing to ensure that insights are not only generated but also valid and relevant. So, Continuous Training (CT) and Continuous Monitoring (CM) are additional steps in the pipeline. Last but not least, data has to be versioned because outcomes need to be compared with expected results, sometimes within an acceptable threshold.
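
As a small illustration of such testing, the following sketch shows a pytest-style unit test for a transformation; the add_revenue_column function and its module are hypothetical stand-ins for whatever logic the pipeline actually contains:

    # A sketch of a CI unit test for a transformation, assuming a hypothetical
    # add_revenue_column() function under test and a pytest + local Spark setup.
    import pytest
    from pyspark.sql import SparkSession

    from my_pipeline.transforms import add_revenue_column   # hypothetical module

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("ci-tests").getOrCreate()

    def test_revenue_is_price_times_quantity(spark):
        df = spark.createDataFrame([(10.0, 3), (2.5, 4)], ["price", "quantity"])
        result = add_revenue_column(df)
        revenues = [row["revenue"] for row in result.orderBy("price").collect()]
        # Compare against expected values within an acceptable threshold
        assert revenues == pytest.approx([10.0, 30.0], rel=1e-6)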

Automation takes a little time to build, but it saves a lot more time and grief in the long run, so testing frameworks and automation around CI/CD pipelines are well worth the investment. Continuous Integration...

Data as code – An intelligent pipeline

All the operationalizing aspects referred to in the previous sections would otherwise have to be explicitly coded by DevOps, MLOps, and DataOps personas. A managed platform such as Databricks abstracts this complexity as part of its DLT offering, and the culmination of these features out of the box gives rise to intelligent pipelines. There is a shift from a procedural to a declarative definition of a pipeline where, as an end user, you specify the "what" aspects of the data transformations and delegate the "how" aspects to the underlying platform. This is especially useful for simplifying ETL development and the go-to-production process when pipelines need to be democratized across multiple use cases with large, fast-moving data volumes such as IoT sensor data.
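
As an illustration of this declarative style, here is a minimal sketch of a DLT pipeline for an IoT-style scenario; the table names, landing path, and expectation are assumptions, and the dlt module is only available when the code runs inside a Databricks DLT pipeline:

    # A sketch of a declarative DLT pipeline (illustrative names and paths).
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw sensor readings ingested as-is (bronze)")
    def sensor_bronze():
        return (spark.readStream.format("cloudFiles")
                     .option("cloudFiles.format", "json")
                     .load("/data/iot/landing"))

    @dlt.table(comment="Cleansed readings (silver)")
    @dlt.expect_or_drop("valid_temperature", "temperature BETWEEN -50 AND 150")
    def sensor_silver():
        # Reading from sensor_bronze is what lets DLT infer the dependency DAG
        return dlt.read_stream("sensor_bronze").withColumn("processed_at", F.current_timestamp())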

These are the key differentiators:

  • The ability to understand the dependencies of the transformations to generate the underlying DAG...

Summary

Organizations rely on good data being delivered in a timely manner to make better business decisions. Every use case has SLAs and metrics that need to be honored. So, operationalizing a pipeline starts with an understanding of both the functional and non-functional business requirements, so that no one is surprised when the result either fails to meet expectations or is too expensive. With thousands of data pipelines spanning multiple lines of business, along with their interdependencies, it is a non-trivial task to ensure they all run successfully and that the data they produce is complete and reliable.

In this chapter, we examined the various aspects to consider when building reliable and robust pipelines and keeping them running in spite of environmental issues, so that business continuity is maintained. In addition, we explored the need for lineage tracking, observability, and appropriate alerting so that everyone is on the same page and can make decisions on when to consume them...
