Chapter 13: Managing Your Data Journey

“You possess all the attributes of a demagogue; a screeching, horrible voice, a perverse, cross-grained nature, and the language of the marketplace. In you, all is united which is needful for governing.”

– Aristophanes, The Knights

In the previous chapters, we looked at the roles and responsibilities of the primary data personas, namely data engineers, data scientists and ML practitioners, business analysts, and DevOps/MLOps personas. One persona that we have not talked about much is that of the administrator. Administrators are the gatekeepers who hold the keys to deploying infrastructure, enabling users and principals on a platform, setting ground rules on who can do what, handling version upgrades, patches, security, and the enablement of new features, and providing direction for business continuity and disaster recovery, and...

Provisioning a multi-tenant infrastructure

The administrator is tasked with setting up the infrastructure for the tenants of an environment. One question that often arises is what the optimum balance between collaboration and isolation should be. Creating a single deployment and putting everyone in it could lead to hitting rate limits and is not a sustainable strategy. Since we have the luxury of cloud elasticity, we can spin up as many environments as we wish to isolate data and users and provide better blast radius control in case of a security breach. Conversely, creating too many environments makes governance and maintenance harder, collaboration suffers, and the enablement cycle can become much longer.

Let’s examine the various scenarios:

  • Separate development, staging, and production environments.
  • Disaster recovery requires setting a parallel production environment in a different region.
  • Different lines of business units want a separate...

Data democratization via policies and processes

If everything is locked down, then there is no threat of exposure. However, that is not the intended agenda of data organizations. Getting the relevant data into the hands of the right, appropriately privileged audience helps a company innovate by allowing people to explore the data and discover new, meaningful ways to derive business value from it. IT should not be the bottleneck in the process of data democratization. If new datasets are brought in, IT should not be overwhelmed with tickets from every part of the organization requesting access to them. So, enabling self-service with appropriate security guardrails is an important responsibility of an administrator. This is where policies play an important role in policing an environment, either preventing an unintended situation from occurring or reporting on it by running scans to detect patterns, so that bad actors or novices can be corrected in time.
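
As a simple illustration, such guardrails can be expressed as SQL object privileges. The following is a minimal sketch in PySpark; the table and group names are hypothetical, and the GRANT/REVOKE syntax assumes a platform that enforces table access controls (Databricks SQL, for example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical names: "sales.curated_orders" and the "analysts"
    # group are placeholders; adapt them to your own catalog and
    # identity provider.

    # Grant read-only access on a curated table to a business group,
    # enabling self-service without exposing the underlying raw data.
    spark.sql("GRANT SELECT ON TABLE sales.curated_orders TO `analysts`")

    # Revoke the privilege when it is no longer needed.
    spark.sql("REVOKE SELECT ON TABLE sales.curated_orders FROM `analysts`")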

Policies can be of several types; some typical...

Capacity planning

Data volumes are constantly growing. Capacity planning is the art and science of arriving at the right infrastructure to cater to the current and future needs of a business. It has several inputs, including the incoming data volume, the volume of historical data that needs to be retained, the SLAs for end-to-end latency, and the kind of processing and transformations that are performed on the data. It is directly linked to your ability to sustain scalable growth at a manageable cost point. We may be tempted to think that leveraging the elasticity of cloud infrastructure absolves us from capacity planning, which is incorrect!

So, how do you go about forecasting demand? The simplest way is to use a sliver of data, establish a pilot workstream, take the memory, compute, and storage metrics, and project them out for the full workload, adding in some buffer for growth; you then repeat this for every known use case, while keeping a buffer for unplanned activity...
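
The arithmetic behind such a projection is simple enough to sketch in a few lines of Python. All of the numbers below are hypothetical, and the sketch assumes roughly linear scaling from the pilot to the full workload, which should be validated per use case:

    # Pilot measurements taken from a sliver of data (hypothetical).
    pilot_rows = 10_000_000        # rows processed by the pilot
    pilot_storage_gb = 50.0        # storage consumed by the pilot
    pilot_core_hours = 12.0        # compute used by the pilot run

    # Projected production volume and headroom (assumptions to tune).
    full_rows = 2_000_000_000      # expected full workload row count
    growth_buffer = 0.25           # 25% headroom for organic growth
    unplanned_buffer = 0.15        # 15% headroom for ad hoc activity

    scale = full_rows / pilot_rows
    headroom = (1 + growth_buffer) * (1 + unplanned_buffer)

    est_storage_gb = pilot_storage_gb * scale * headroom
    est_core_hours = pilot_core_hours * scale * headroom

    print(f"Estimated storage: {est_storage_gb:,.0f} GB")
    print(f"Estimated compute: {est_core_hours:,.0f} core-hours")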

Managing and monitoring

Every organization has policies around data access and data use that need to be honored. In addition, some regulated industries have compliance guidelines that require proof of compliance in the form of an audit trail of how users accessed and manipulated the data. Hence, there is a need to be able to put controls in place, detect whether something has been changed, and provide a transparent audit trail. This includes access to the raw data as well as access via tables, which are artifacts layered on top of the data.

The metrics collected from these logs need to be compared over a period of time to understand trend lines. Delta's versioning capability comes in handy to monitor not only the operations performed on a table but also the metrics logged alongside them. It is fair to say that these metrics need more permanence and should be recorded with a date/timestamp.
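
For example, the transaction log of a Delta table can be queried directly through the delta-spark Python API. A minimal sketch follows; the table name is hypothetical, and spark is assumed to be an active SparkSession:

    from delta.tables import DeltaTable

    # Each version in the history records who performed which
    # operation, when, and the metrics it produced.
    history = DeltaTable.forName(spark, "finance.transactions").history()

    history.select("version", "timestamp", "userName",
                   "operation", "operationMetrics").show(truncate=False)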

There are several types of logs in a system. The main ones include the following:

  1. Audit...

Data sharing

This may seem like a paradox given the collaboration and isolation concepts we reviewed in earlier sections. When groups or lines of business have a lot of data dependencies, they are usually housed together to facilitate better collaboration, and if they do not have any operational dependencies, they can be segregated in their own environments – for example, HR and marketing may live in their own domain meshes. However, what happens if there is a need for them to share some insights? There should be a way to promote this, as it leads to better stakeholder engagement, which improves enterprise value. Yet all the painful architecting that was done to prevent accidental exposure would now have to be reconsidered. That is a lot of unnecessary complexity and re-architecting. Also, replicating the data to a shared location will lead to the two copies getting out of sync. Thankfully, Delta Sharing comes to the rescue.

A simple, open, and secure way to share data can be achieved through...
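
For instance, a recipient can read a shared table with the open source delta-sharing Python client. In the sketch below, the profile file path and the share, schema, and table names are placeholders:

    import delta_sharing

    # The provider issues a profile file containing the sharing
    # endpoint and a bearer token; this path is a placeholder.
    profile = "/path/to/provider.share"

    # Shared tables are addressed as <share>.<schema>.<table>.
    table_url = profile + "#hr_share.benchmarks.headcount"

    # Load the shared table into a pandas DataFrame; no data is
    # replicated, so the provider's Delta table stays the single
    # source of truth.
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())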

Data migration

Technologies are constantly evolving. It is important to choose a platform and architecture that is future-proof and extensible and that supports a pluggable paradigm to play nicely with the other tools of an ecosystem. By gravitating towards open data formats, open source tooling, and cloud-based architectures with separation of compute and storage, you can dodge the main bullets. Still, there will come a time when the existing platform is no longer sustainable and needs a refreshing overhaul. Some examples of this that we've seen in recent years are migrations from Hadoop-based systems, which are complex and difficult to manage, to cloud-native data platforms. The same is true of expensive data warehousing solutions such as Netezza, Teradata, and Exadata. Migration projects are expensive, time-consuming, and critical to the overall value of a business's technology investments, and they need to be planned and executed very carefully.
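
As one concrete example of moving to an open format, an existing Parquet table can be converted to Delta in place using the delta-spark API. A minimal sketch, with placeholder paths and an assumed active SparkSession named spark:

    from delta.tables import DeltaTable

    # Convert a directory of Parquet files into a Delta table in
    # place; a transaction log is added, and the data files are
    # reused as-is.
    DeltaTable.convertToDelta(spark, "parquet.`/data/raw/events`")

    # Partitioned tables need their partition schema spelled out.
    DeltaTable.convertToDelta(
        spark, "parquet.`/data/raw/events_by_day`", "day DATE")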

How will you determine whether to patch an existing...

COE best practices

Establishing an internal steering committee/team as the Center of Excellence (COE) for advanced analytics is a complex process. Its primary purpose is to provide a blueprint to onboard data teams and enable them with technical and operational practices, support for handling issues and tickets, and executive alignment to ensure that technical investments align with business objectives and that value can be realized and quantified. The role is that of an enabler and a governance overseer, but never to the point of becoming a bottleneck. In some organizations, the COE team is responsible for managing all or part of the infrastructure and the shared data ingestion process, and for ratifying vendor tools and frameworks for internal consumption. They are either funded directly or compensated via a chargeback model by the individual lines of business that they service.

The foundational blocks include the following aspects:

  • Cloud strategy: Which cloud to use, whether a...

Summary

The previous chapters focused on the roles of data engineers, data scientists, business analysts, and DevOps/MLOps personas. This chapter focused on the admin persona, who plays a pivotal role in an organization's data journey by enabling the infrastructure, onboarding users, and providing the data governance and security constructs that allow self-service to be fully and safely democratized. We looked into various tasks, such as COE duties and data migration efforts, among others, that require admins to do a lot of the heavy lifting. Consolidating data into a common, open format such as Parquet, with a transactional protocol such as Delta, helps in use cases involving data sharing and migrations. It is important to keep in mind that technology and business users need to plan an enterprise's data initiatives together to make sure that the insights generated are relevant and useful to the enterprise.
