Chapter 13: Managing Your Data Journey

“You possess all the attributes of a demagogue; a screeching, horrible voice, a perverse, cross-grained nature, and the language of the marketplace. In you, all is united which is needful for governing.”

– Aristophanes, The Knights

In the previous chapters, we looked at the roles and responsibilities of the primary data personas, namely data engineers, data scientists and ML practitioners, business analysts, and DevOps/MLOps personas. One persona that we have not talked about much is that of the administrator. Administrators are the gatekeepers who hold the keys to deploying infrastructure, enabling users and principals on a platform, setting ground rules on who can do what, handling version upgrades, patches, security, and the enablement of new features, and providing direction for business continuity and disaster recovery, and...

Provisioning a multi-tenant infrastructure

The administrator is tasked with setting up the infrastructure for the tenants of an environment. One question that often arises is what the optimum balance between collaboration and isolation should be. Creating a single deployment and putting everyone in it could lead to hitting rate limits and is not a sustainable strategy. Since we have the luxury of cloud elasticity, we can spin up as many environments as we wish to isolate data and users and provide better blast radius control in case of a security breach. Conversely, creating too many environments makes governance and maintenance harder, collaboration suffers, and the enablement cycle can become much longer.

Let’s examine the various scenarios:

  • Separate development, staging, and production environments.
  • Disaster recovery requires setting a parallel production environment in a different region.
  • Different lines of business units want a separate...

Data democratization via policies and processes

If everything is locked down, then there is no threat of exposure. However, that is not the intended agenda of data organizations. Getting the relevant data into the hands of the right, appropriately privileged audience helps a company innovate by allowing people to explore the data and discover new, meaningful ways to derive business value from it. IT should not be the bottleneck in the process of data democratization. If new datasets are brought in, IT should not be overwhelmed with tickets from every part of the organization requesting access to them. So, enabling self-service with appropriate security guardrails is an important responsibility of an administrator. This is where policies play an important role in policing an environment, either preventing an unintended situation from occurring or reporting on it by running scans to detect patterns, so that bad actors or novices can be corrected in time.
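
As a simple illustration, such guardrails can be expressed as SQL object privileges. The following is a minimal sketch in PySpark; the table and group names are hypothetical, and the GRANT/REVOKE syntax assumes a platform that enforces table access controls (Databricks SQL, for example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical names: "sales.curated_orders" and the "analysts"
    # group are placeholders; adapt them to your own catalog and
    # identity provider.

    # Grant read-only access on a curated table to a business group,
    # enabling self-service without exposing the underlying raw data.
    spark.sql("GRANT SELECT ON TABLE sales.curated_orders TO `analysts`")

    # Revoke the privilege when it is no longer needed.
    spark.sql("REVOKE SELECT ON TABLE sales.curated_orders FROM `analysts`")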

Policies can be of several types; some typical...

Capacity planning

Data volumes are constantly growing. Capacity planning is the art and science of arriving at the right infrastructure to cater to the current and future needs of a business. It has several inputs, including the incoming data volume, the volume of historical data that needs to be retained, the SLAs for end-to-end latency, and the kind of processing and transformations that are performed on the data. It is directly linked to your ability to sustain scalable growth at a manageable cost point. We may be tempted to think that leveraging the elasticity of cloud infrastructure absolves us from capacity planning, which is incorrect!

So, how do you go about forecasting demand? The simplest way is to use a sliver of data, establish a pilot workstream, take the memory, compute, and storage metrics, and project them out for the full workload, adding in some buffer for growth; you then repeat this for every known use case, while keeping a buffer for unplanned activity...
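
The arithmetic behind such a projection is simple enough to sketch in a few lines of Python. All of the numbers below are hypothetical, and the sketch assumes roughly linear scaling from the pilot to the full workload, which should be validated per use case:

    # Pilot measurements taken from a sliver of data (hypothetical).
    pilot_rows = 10_000_000        # rows processed by the pilot
    pilot_storage_gb = 50.0        # storage consumed by the pilot
    pilot_core_hours = 12.0        # compute used by the pilot run

    # Projected production volume and headroom (assumptions to tune).
    full_rows = 2_000_000_000      # expected full workload row count
    growth_buffer = 0.25           # 25% headroom for organic growth
    unplanned_buffer = 0.15        # 15% headroom for ad hoc activity

    scale = full_rows / pilot_rows
    headroom = (1 + growth_buffer) * (1 + unplanned_buffer)

    est_storage_gb = pilot_storage_gb * scale * headroom
    est_core_hours = pilot_core_hours * scale * headroom

    print(f"Estimated storage: {est_storage_gb:,.0f} GB")
    print(f"Estimated compute: {est_core_hours:,.0f} core-hours")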

Managing and monitoring

Every organization has policies around data access and data use that need to be honored. In addition, some regulated industries have compliance guidelines that require proof of compliance in the form of an audit trail of how users accessed and manipulated the data. Hence, there is a need to be able to put controls in place, detect whether something has been changed, and provide a transparent audit trail. This includes access to the raw data as well as access via tables, which are artifacts layered on top of the data.

The metrics collected from these logs need to be compared over a period of time to understand trend lines. Delta's versioning capability comes in handy to monitor not only the operations performed on a table but also the metrics logged alongside them. It is fair to say that these metrics need more permanence and should be recorded with a date/timestamp.
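
For example, the transaction log of a Delta table can be queried directly through the delta-spark Python API. A minimal sketch follows; the table name is hypothetical, and spark is assumed to be an active SparkSession:

    from delta.tables import DeltaTable

    # Each version in the history records who performed which
    # operation, when, and the metrics it produced.
    history = DeltaTable.forName(spark, "finance.transactions").history()

    history.select("version", "timestamp", "userName",
                   "operation", "operationMetrics").show(truncate=False)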

There are several types of logs in a system. The main ones include the following:

  1. Audit...

Data sharing

This may seem like a paradox given the collaboration and isolation concepts we reviewed in earlier sections. When groups or lines of business have a lot of data dependencies, they are usually housed together to facilitate better collaboration, and if they do not have any operational dependencies, they can be segregated in their own environments – for example, HR and marketing may live in their own domain meshes. However, what happens if there is a need for them to share some insights? There should be a way to promote this, as it leads to better stakeholder engagement, which improves enterprise value. Yet all the painful architecting that was done to prevent accidental exposure would now have to be reconsidered. That is a lot of unnecessary complexity and re-architecting. Also, replicating the data to a shared location will lead to the two copies getting out of sync. Thankfully, Delta Sharing comes to the rescue.

A simple, open, and secure way to share data can be achieved through...
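
For instance, a recipient can read a shared table with the open source delta-sharing Python client. In the sketch below, the profile file path and the share, schema, and table names are placeholders:

    import delta_sharing

    # The provider issues a profile file containing the sharing
    # endpoint and a bearer token; this path is a placeholder.
    profile = "/path/to/provider.share"

    # Shared tables are addressed as <share>.<schema>.<table>.
    table_url = profile + "#hr_share.benchmarks.headcount"

    # Load the shared table into a pandas DataFrame; no data is
    # replicated, so the provider's Delta table stays the single
    # source of truth.
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())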

Data migration

Technologies are constantly evolving. It is important to choose a platform and architecture that is future-proof and extensible and that supports a pluggable paradigm to play nicely with the other tools of an ecosystem. By gravitating towards open data formats, open source tooling, and cloud-based architectures with separation of compute and storage, you can dodge the main bullets. Still, there will come a time when the existing platform is no longer sustainable and needs a refreshing overhaul. Some examples of this that we've seen in recent years are migrations from Hadoop-based systems, which are complex and difficult to manage, to cloud-native data platforms. The same is true of expensive data warehousing solutions such as Netezza, Teradata, and Exadata. Migration projects are expensive, time-consuming, and critical to the overall value of a business's technology investments, and they need to be planned and executed very carefully.
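
As one concrete example of moving to an open format, an existing Parquet table can be converted to Delta in place using the delta-spark API. A minimal sketch, with placeholder paths and an assumed active SparkSession named spark:

    from delta.tables import DeltaTable

    # Convert a directory of Parquet files into a Delta table in
    # place; a transaction log is added, and the data files are
    # reused as-is.
    DeltaTable.convertToDelta(spark, "parquet.`/data/raw/events`")

    # Partitioned tables need their partition schema spelled out.
    DeltaTable.convertToDelta(
        spark, "parquet.`/data/raw/events_by_day`", "day DATE")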

How will you determine whether to patch an existing...

COE best practices

Establishing an internal steering committee/team as the Center of Excellence (COE) for advanced analytics is a complex process. Its primary purpose is to provide a blueprint to onboard data teams and enable them with technical and operational practices, support for handling issues and tickets, and executive alignment to ensure that technical investments align with business objectives and that value can be realized and quantified. The role is that of an enabler and a governance overseer, but never to the point of becoming a bottleneck. In some organizations, the COE team is responsible for managing all or part of the infrastructure and the shared data ingestion process, and for ratifying vendor tools and frameworks for internal consumption. They are either funded directly or compensated via a chargeback model by the individual lines of business that they service.

The foundational blocks include the following aspects:

  • Cloud strategy: Which cloud to use, whether a...

Summary

The previous chapters focused on the roles of data engineers, data scientists, business analysts, and DevOps/MLOps personas. This chapter focused on the admin persona, who plays a pivotal role in an organization's data journey by enabling the infrastructure, onboarding users, and providing the data governance and security constructs that allow self-service to be fully and safely democratized. We looked into various tasks, such as COE duties and data migration efforts, among others, that require admins to do a lot of the heavy lifting. Consolidating data into a common, open format such as Parquet, with a transactional protocol such as Delta, helps in use cases involving data sharing and migrations. It is important to keep in mind that technology and business users need to plan an enterprise's data initiatives together to make sure that the insights generated are relevant and useful to the enterprise.
