
Driving Data Quality with Data Contracts

By Andrew Jones
Free Chapter
Chapter 1: A Brief History of Data Platforms
About this book
Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value. With Driving Data Quality with Data Contracts, you’ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You’ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability of the data to those who know it best—the data generators—and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products. By the end of this book, you’ll have gained a comprehensive understanding of how data contracts can revolutionize your organization’s data culture and provide a competitive advantage by unlocking the real value within your data.
Publication date: June 2023
Publisher: Packt
Pages: 206
ISBN: 9781837635009

 

A Brief History of Data Platforms

Before we can appreciate why we need to make a fundamental shift to a data contracts-backed data platform in order to improve the quality of our data, and ultimately the value we can get from that data, we need to understand the problems we are trying to solve. I’ve found the best way to do this is to look back at the recent generations of data architectures. By doing that, we’ll see that despite the vast improvements in the tooling available to us, we’ve been carrying through the same limitations in the architecture. That’s why we continue to struggle with the same old problems.

Despite these challenges, the importance of data continues to grow. As it is used in more and more business-critical applications, we can no longer accept data platforms that are unreliable, untrusted, and ineffective. We must find a better way.

By the end of this chapter, we’ll have explored the three most recent generations of data architectures at a high level, focusing on just the source and ingestion of upstream data, and the consumption of data downstream. We will gain an understanding of their limitations and bottlenecks and why we need to make a change. We’ll then be ready to learn about data contracts.

In this chapter, we’re going to cover the following main topics:

  • The enterprise data warehouse
  • The big data platform
  • The modern data stack
  • The state of today’s data platforms
  • The ever-increasing use of data in business-critical applications
 

The enterprise data warehouse

We’ll start by looking at the data architecture that was prevalent in the late 1990s and early 2000s, which was centered around an enterprise data warehouse (EDW). As we discuss the architecture and its limitations, you’ll start to notice how many of those limitations continue to affect us today, despite over 20 years of advancement in tools and capabilities.

EDW is the collective term for a reporting and analytics solution. You’d typically engage with one or two big vendors who would provide these capabilities for you. It was expensive, and only larger companies could justify the investment.

The architecture was built around a large database in the center. This was likely an Oracle or MS SQL Server database, hosted on-premises (this was before the advent of cloud services). The extract, transform, and load (ETL) process was performed on data from source systems, or more accurately, the underlying database of those systems. That data could then be used to drive reporting and analytics.

The following diagram shows the EDW architecture:

Figure 1.1 – The EDW architecture

Because this ETL ran against the database of the source system, reliability was a problem. It created a load on the database that could negatively impact the performance of the upstream service. That, and the limitations of the technology we were using at the time, meant we could do few transforms on the data.

We also had to update the ETL process as the database schema and the data evolved over time, relying on the data generators to let us know when that happened. Otherwise, the pipeline would fail.
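To make that coupling concrete, the following is a minimal sketch of the kind of nightly ETL job described here. It is illustrative only: the orders table, its columns, and the use of SQLite as a stand-in for the source and warehouse databases are assumptions for the example, not details from the book.

```python
# Minimal sketch of an EDW-era nightly ETL job (illustrative only).
# sqlite3 stands in for the source OLTP database and the warehouse; in
# practice these would have been Oracle or MS SQL Server connections.
import sqlite3


def run_nightly_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> None:
    # Extract: query the service's internal tables directly, adding load to
    # the production database and coupling the pipeline to its schema.
    rows = source.execute(
        "SELECT order_id, customer_id, amount_pence, created_at FROM orders"
    ).fetchall()

    # Transform: only light reshaping was feasible with the tooling of the time.
    transformed = [
        (order_id, customer_id, amount_pence / 100.0, created_at)
        for order_id, customer_id, amount_pence, created_at in rows
    ]

    # Load: append into the warehouse's reporting table.
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", transformed)
    warehouse.commit()


if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE orders (order_id INT, customer_id INT, amount_pence INT, created_at TEXT)"
    )
    source.execute("INSERT INTO orders VALUES (1, 42, 1999, '2001-06-01')")

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE fact_orders (order_id INT, customer_id INT, amount REAL, created_at TEXT)"
    )
    run_nightly_etl(source, warehouse)
    print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
```

If the service team renames or drops a column in the source table, the extract query breaks (or silently changes meaning) until someone tells the ETL developers, which is exactly the dependency described above.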

Those who owned the databases were somewhat aware of the ETL work and the business value it drove. There were few barriers between the data generators and consumers, and communication was good.

However, the major limitation of this architecture was the database used for the data warehouse. It was very expensive and, as it was deployed on-premises, was of a fixed size and hard to scale. That created a limit on how much data could be stored there and made available for analytics.

It became the responsibility of the ETL developers to decide what data should be available, depending on the business needs, and to build and maintain that ETL process by getting access to the source systems and their underlying databases.

And so, this is where the bottleneck was. The ETL developers had to control what data went in, and they were the only ones who could make data available in the warehouse. Data would only be made available if it met a strong business need, and that typically meant the only data in the warehouse was data that drove the company KPIs. If you wanted some data to do some analysis and it wasn’t already in there, you had to put a ticket in their backlog and hope for the best. If it did ever get prioritized, it was probably too late for what you wanted it for.

Note

Let’s illustrate how different roles worked together with this architecture with an example.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She’s aware that some of the data from that database is extracted by a data analyst, Bukayo, and that is used to drive top-level business KPIs.

Bukayo can’t do much transformation on the data, due to the limitations of the technology and the cost of infrastructure, so the reporting he produces is largely on the raw data.

There are no defined expectations between Vivianne and Bukayo, and Bukayo relies on Vivianne telling him in advance whether there are any changes to the data or the schema.

The extraction is not reliable. The ETL process can affect the performance of the database, and so it may be switched off when there is an incident. Schema and data changes are not always known in advance. The downstream database also has limited performance and cannot be easily scaled to deal with an increase in data or usage.

Both Vivianne and Bukayo lack autonomy. Vivianne can’t change her database schema without getting approval from Bukayo. Bukayo can only get a subset of data, with little say over the format. Furthermore, any potential users downstream of Bukayo can only access the data he has extracted, severely limiting the accessibility of the organization’s data.

This won’t be the last time we see a bottleneck that prevents access to, and the use of, quality data. Let’s look now at the next generation of data architecture and the introduction of big data, which was made possible by the release of Apache Hadoop in 2006.

 

The big data platform

As the internet took off in the 1990s and the size and importance of data grew with it, the big tech companies started developing a new generation of data tooling and architectures that aimed to reduce the cost of storing and transforming vast quantities of data. In 2003, Google wrote a paper describing their Google File System, and in 2004 followed that up with another paper, titled MapReduce: Simplified Data Processing on Large Clusters. These ideas were then implemented at Yahoo! and open sourced as Apache Hadoop in 2006.

Apache Hadoop contained two core modules. The Hadoop Distributed File System (HDFS) gave us the ability to store almost limitless amounts of data reliably and efficiently on commodity hardware. The MapReduce engine then gave us a model on which we could implement programs to process and transform this data, at scale, also on commodity hardware.

This led to the popularization of big data, which was the collective term for our reporting, ML, and analytics capabilities with HDFS and MapReduce as the foundation. These platforms used open source technology and could be on-premises or in the cloud. The reduced costs made this accessible to organizations of any size, who could either implement it themselves or use a packaged enterprise solution provided by the likes of Cloudera and MapR.

The following diagram shows the reference data platform architecture built upon Hadoop:

Figure 1.2 – The big data platform architecture

At the center of the architecture is the data lake, implemented on top of HDFS or a similar filesystem. Here, we could store an almost unlimited amount of semi-structured or unstructured data. This still needed to be put into an EDW in order to drive analytics, as data visualization tools such as Tableau needed a SQL-compatible database to connect to.

Because there were no expectations set on the structure of the data in the data lake, and no limits on the amount of data, it was very easy to write as much as you could and worry about how to use it later. This led to the concept of extract, load, and transform (ELT), as opposed to ETL, where the idea was to extract and load the data into the data lake first without any processing, then apply schemas and transforms later as part of loading to the data warehouse or reading the data in other downstream processes.
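The following is a rough sketch of that ELT pattern, assuming a hypothetical batch of JSON event records landing in the lake; the paths and field names are invented for the illustration. The load step does no interpretation at all, and a schema is only imposed later, when somebody eventually tries to read the data.

```python
# Illustrative ELT sketch: land the raw data in the lake untouched, and only
# impose a schema when it is eventually read ("schema on read").
import json
from pathlib import Path

LAKE = Path("datalake/raw/events")  # stand-in for an HDFS or object-store path


def load_raw(batch: list, batch_id: str) -> None:
    # "EL": dump whatever the source produced - no validation, no schema.
    LAKE.mkdir(parents=True, exist_ok=True)
    with open(LAKE / f"{batch_id}.jsonl", "w") as f:
        for record in batch:
            f.write(json.dumps(record) + "\n")


def transform_on_read() -> list:
    # "T": applied much later, by whoever finds the files. Every consumer has
    # to guess which fields exist and what they mean.
    rows = []
    for path in sorted(LAKE.glob("*.jsonl")):
        for line in path.read_text().splitlines():
            record = json.loads(line)
            rows.append((record.get("user_id"), record.get("event_type"),
                         record.get("occurred_at")))
    return rows


if __name__ == "__main__":
    load_raw([{"user_id": 1, "event_type": "signup",
               "occurred_at": "2012-01-01T09:00:00Z"}], "batch-001")
    print(transform_on_read())
```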

We then had much more data than ever before. With a low barrier to entry and cheap storage, data was easily added to the data lake, whether there was a consumer requirement in mind or not.

However, in practice, much of that data was never used. For a start, it was almost impossible to know what data was in there and how it was structured. It lacked any documentation, had no set expectations on its reliability and quality, and no governance over how it was managed. Then, once you did find some data you wanted to use, you needed to write MapReduce jobs using Hadoop or, later, Apache Spark. But this was very difficult to do – particularly at any scale – and only achievable by a large team of specialist data engineers. Even then, those jobs tended to be unreliable and have unpredictable performance.
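To give a flavour of that effort, here is a minimal Hadoop Streaming-style mapper and reducer in Python that only counts events per user. It is a simplified illustration rather than anything from the book; real jobs at scale added joins, custom partitioning, and retry logic on top of patterns like this.

```python
# Simplified Hadoop Streaming-style job: count events per user.
# In a real deployment the mapper and reducer run as separate processes
# across the cluster; here they are run locally to show the model.
import sys
from itertools import groupby


def mapper(lines):
    # Each raw input line is assumed to be "user_id<TAB>event_type<TAB>...".
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[0], 1  # key: user_id, value: 1


def reducer(pairs):
    # Hadoop delivers mapper output grouped and sorted by key, so sum each group.
    for user_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield user_id, sum(count for _, count in group)


if __name__ == "__main__":
    # Simulate the shuffle phase by sorting the mapper output by key.
    mapped = sorted(mapper(sys.stdin))
    for user_id, total in reducer(mapped):
        print(f"{user_id}\t{total}")
```

Even this toy example requires the mapper, the shuffle, and the reducer to be reasoned about separately, which is why a large team of specialist data engineers was usually needed.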

This is why we started hearing people refer to it as the data swamp. While much of the data was likely valuable, the inaccessibility of the data lake meant it was never used. Gartner introduced the term dark data to describe this, where data is collected and never used, and the costs of storing and managing that data outweigh any value gained from it (https://www.gartner.com/en/information-technology/glossary/dark-data). In 2015, IDC estimated 90% of unstructured data could be considered dark (https://www.kdnuggets.com/2015/11/importance-dark-data-big-data-world.html).

Another consequence of this architecture was that it moved the end data consumers further away from the data generators. Typically, a central data engineering team was introduced to focus solely on ingesting the data into the data lake, building the tools and the connections required to do that from as many source systems as possible. They were the ones interacting with the data generators, not the ultimate consumers of the data.

So, despite the advance in tools and technologies, in practice, we still had many of the same limitations as before. Only a limited amount of data could be made available for analysis and other uses, and we had that same bottleneck controlling what that data was.

Note

Let’s return to our example to illustrate how different roles worked together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that some of the data from that database is extracted in a raw form, and is unlikely to know exactly what the data is. Certainly, she doesn’t know why.

Ben is a data engineer who works on the ELT pipeline. He aims to extract as much of the data as possible into the data lake. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.

Leah is another data engineer, specializing in writing MapReduce jobs. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These MapReduce jobs have unpredictable performance and are difficult to debug. The jobs do not run reliably.

The BI analyst, Bukayo, takes this data and creates reports to support the business. They often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.

The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding of how the changes she makes affect key business processes.

While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs lack the autonomy and the expertise to access it, preventing the use of data for anything other than the most critical business requirements.

The next generation of data architectures began in 2012 with the launch of Amazon Redshift on AWS and the explosion of tools and investment into what became known as the modern data stack (MDS). In the next section, we’ll explore this architecture and see whether we can finally get rid of this bottleneck.

 

The modern data stack

Amazon Redshift was the first cloud-native data warehouse and provided a real step-change in capabilities. It had the ability to store almost limitless data at a low cost in a SQL-compatible database, and the massively parallel processing (MPP) capabilities meant you could process that data effectively and efficiently at scale.

This sounds like what we had with Hadoop, but the key differences were the SQL compatibility and the more strongly defined structure of the data. This made it much more accessible than the unstructured files on an HDFS cluster. It also presented an opportunity to build services on top of Redshift and later SQL-compatible warehouses such as Google BigQuery and Snowflake, which led to an explosion of tools that make up today’s modern data stack. This includes ELT tools such as Fivetran and Stitch, data transformation tools such as dbt, and reverse ETL tools such as Hightouch.

These data warehouses evolved further to become what we now call a data lakehouse, which brings together the benefits of a modern data warehouse (SQL compatibility and high performance with MPP) with the benefits of a data lake (low cost, limitless storage, and support for different data types).

Into this data lakehouse went all the source data we ingested from our systems and third-party services, becoming our operational data store (ODS). From here, we could join and transform the data and make it available to our EDW, from where it was made available for consumption. But the data warehouse was no longer a separate database – it was just a logically separate area of our data lakehouse, using the same technology. This reduced the effort and costs of the transforms and further increased the accessibility of the data.

The following diagram shows the reference architecture of the modern data stack, with the data lakehouse in the center:

Figure 1.3 – The modern data stack architecture

This architecture gives us more options to ingest the source data, and one of those is using change data capture (CDC) tooling, for which we have open source implementations such as Debezium and commercial offerings such as Striim and Google Cloud Datastream, as well as in-depth write-ups on closed source solutions at organizations including Airbnb (https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f) and Netflix (https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b). CDC tools connect to the transactional databases of your upstream services and capture all the changes that happen to each of the tables (i.e., the INSERT, UPDATE, and DELETE statements run against the database). These changes are sent to the data lakehouse, where you can replay them to recreate the source database with the same structure and the same data.
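The sketch below shows, in simplified form, what those captured changes might look like and how a consumer replays them to rebuild a copy of the table. The event shape is loosely modelled on what tools such as Debezium emit (the real envelope carries considerably more metadata), and the table and field names are invented for the example.

```python
# Simplified CDC illustration: replay captured row-level changes to rebuild
# a copy of the source table in the lakehouse. The event format is a loose
# simplification of what CDC tools emit, not any tool's actual schema.
changes = [
    {"op": "insert", "key": 1, "after": {"id": 1, "email": "viv@example.com", "plan": "pro"}},
    {"op": "update", "key": 1, "after": {"id": 1, "email": "viv@example.com", "plan": "enterprise"}},
    {"op": "insert", "key": 2, "after": {"id": 2, "email": "bukayo@example.com", "plan": "free"}},
    {"op": "delete", "key": 2, "after": None},
]


def replay(events):
    """Apply INSERT/UPDATE/DELETE events in order to reconstruct the current state."""
    table = {}
    for event in events:
        if event["op"] in ("insert", "update"):
            table[event["key"]] = event["after"]
        else:  # delete
            table.pop(event["key"], None)
    return table


if __name__ == "__main__":
    # The rebuilt table mirrors the service's internal model exactly -
    # including any breaking change the service team later makes to it.
    print(replay(changes))  # {1: {'id': 1, 'email': 'viv@example.com', 'plan': 'enterprise'}}
```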

However, this creates a tight coupling between the internal models of the upstream service and database and the data consumers. As that service naturally evolves over time, breaking changes will be made to those models. When these happen – often without any notice – they impact the CDC service and/or downstream data uses, leading to instability and unreliability. This makes it impossible to build on this data with any confidence.

The data is also not structured well for analytical queries and uses. It has been designed to meet the needs of the service and to be optimal for a transactional database, not a data lakehouse. It can take a lot of transformation and joining to take this data and produce something that meets the requirements of your downstream users, which is time-consuming and expensive.

There is often little or no documentation for this data, and so to make use of it you need to have in-depth knowledge of those source systems and the way they model the data, including the history of how that has evolved over time. This typically comes from asking teams who work on that service or relying on institutional knowledge from colleagues who have worked with that data before. This makes it difficult to discover new or useful datasets, or for a new consumer to get started.

The root cause of all these problems is that this data was not built for consumption.

Many of these same problems apply to data ingested from a third-party service through an ELT tool such as Fivetran or Stitch. This is particularly true if you’re ingesting from a complex service such as Salesforce, which is highly customizable with custom objects and fields. The data is in a raw form that mimics the API of the third-party service, lacks documentation, and requires in-depth knowledge of the service to use. Like with CDC, it can still change without notice and requires a lot of transformation to produce something that meets your requirements.

One purported benefit of the modern data stack is that we now have more data available to us than ever before. However, a 2022 report from Seagate (https://www.seagate.com/gb/en/our-story/rethink-data/) found that 68% of the data available to organizations goes unused. We still have our dark data problem from the big data era.

The introduction of dbt and similar tools that run on a data lakehouse has made it easier than ever to process this data using just SQL – one of the most well-known and popular languages around. This should increase the accessibility of the data in the data lakehouse.

However, due to the complexity of the transforms required to make use of this data and the domain knowledge you must build up, we still often end up with a central team of data engineers to build and maintain the hundreds, thousands, or even tens of thousands of models required to produce data that is ready for consumption by other data practitioners and users.

Note

We’ll return to our example for the final time to illustrate how different roles work together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that the data from that database is extracted in a raw form through a CDC service. Certainly, she doesn’t know why.

Ben is a data platform engineer who works on the CDC pipeline. He aims to extract as much of the data as possible into the data lakehouse. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.

Leah is an analytics engineer building dbt pipelines. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These dbt pipelines now number in the thousands and no one has all the context required to debug them all. The pipelines break regularly, and those breakages often have a wide impact.

The BI analyst, Bukayo, takes this data and creates reports to support the business. They often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.

The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding or visibility of how the changes she makes affect key business processes.

While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs have access to the data in a structured form, but lack the domain knowledge to use it effectively. They lack the autonomy to ask for the data they need to meet their requirements.

So, despite the improvements in the technology and architecture over three generations of data platform architectures, we still have that bottleneck of a central team with a long backlog of datasets to make available to the organization before we can start using it to drive business value.

The following diagram shows the three generations side by side, with the same bottleneck highlighted in each:

Figure 1.4 – Comparing the three generations of data platform architectures

It’s that bottleneck that has led us to the state of today’s data platforms and the trouble many of us face when trying to generate business value from our data. In the next section, we’re going to discuss the problems we have when we build data platforms on this architecture.

 

The state of today’s data platforms

The limitations of today’s data architectures, and the data culture they reinforce, result in several problems that are felt almost universally by organizations trying to get value from their data. Let’s explore the following problems in turn and the impact they have:

  • The lack of expectations
  • The lack of reliability
  • The lack of autonomy

The lack of expectations

Users working with source data that has been ingested through an ELT or CDC tool can have very few expectations about what the data is, how it should be used, and how reliable it will be. They also don’t know exactly where this data comes from, who generated it, and how it might change in the future.

In the absence of explicitly defined expectations, users tend to make assumptions that are more optimistic than reality, particularly when it comes to the reliability and availability of the data. This only increases the impact when there is a breaking change in the upstream data, or when that data proves to be unreliable.

It also leads to the data not being used correctly. For example, there could be different tables and columns that relate to the various dimensions around how a customer is billed for their use of the company’s products, and this will evolve over time. The data consumer will need to know that in detail if they are to use this data to produce revenue numbers for the organization. They therefore need to gain in-depth knowledge of the service and the logic it uses so they can reimplement that in their ETL.
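As a hedged illustration of that reimplementation, the snippet below shows the kind of billing knowledge a consumer ends up encoding in their own pipeline. The statuses, column names, and rules are entirely hypothetical; the point is that none of this is documented anywhere the consumer can rely on.

```python
# Hypothetical illustration: a consumer re-encoding the service team's billing
# logic in their own pipeline in order to report revenue. Every rule below is
# invented for the example and was "learned" by asking around, not taken from
# any agreed definition of the data.

def invoice_revenue(invoice: dict) -> float:
    # Rule 1: rows with status "void" or "test" must be excluded entirely.
    if invoice["status"] in ("void", "test"):
        return 0.0
    # Rule 2: amount_pence is gross; discount_pence has to be subtracted.
    # Rule 3: refunds arrive as separate negative-amount rows, so they sum in.
    return (invoice["amount_pence"] - invoice.get("discount_pence", 0)) / 100.0


invoices = [
    {"status": "paid", "amount_pence": 5000, "discount_pence": 500},
    {"status": "test", "amount_pence": 9999},
    {"status": "paid", "amount_pence": -5000},  # a refund of an earlier invoice
]
print(sum(invoice_revenue(i) for i in invoices))  # -5.0
```

If the service team changes any of these rules, nothing forces them to tell the consumer.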

Successfully building applications and services on top of the data in our lakehouse would require the active transfer of this knowledge from the upstream data generators to the downstream data consumers, including the following:

  • The domain models the dataset describes
  • The change history of the dataset
  • The schemas and metadata

However, due to the distance between these groups, there is no feasible way to establish this exchange.

This lack of expectations, and no requirement to fulfill them, is also a problem for the data generators. Often, they don’t even know they are data generators, as they are just writing data to their internal models in their service’s database or managing a third-party service as best they can to meet their direct users’ requirements. They are completely unaware of the ELT/CDC processes running to extract their data and its importance to the rest of the organization. This makes it difficult to hold them responsible for the changes they make and their downstream impact, as it is completely invisible to them and often completely unexpected. So, the responsibility falls entirely on the data teams attempting to make use of this data.

This lack of responsibility is shown in the following diagram, which is the same diagram we saw in The modern data stack section earlier, but annotated with responsibility.

Figure 1.5 – Responsibility in the modern data stack

This diagram also illustrates another of the big problems with today’s data platforms, which is the complete lack of collaboration between the data generators and the data consumers. The data generators are far removed from the consumption points and have little to no idea of who is consuming their data, why they need the data, and the important business processes and outcomes that are driven by that data. On the other side, the data consumers don’t even know who is generating the data they depend on so much and have no say in what that data should look like in order to meet their requirements. They simply get the data they are given.

The lack of reliability

Many organizations suffer from unreliable data pipelines, and have done so for years. This comes at a significant cost, with a Gartner survey (https://www.gartner.com/smarterwithgartner/how-to-stop-data-quality-undermining-your-business) suggesting that these issues cost companies millions of dollars a year.

There are many reasons for this unreliability. It could be the poor quality of the data when it is ingested, or the way that quality degrades over time as the data becomes stale. Or the data could arrive late or incomplete.

The root cause of so many of these reliability problems is that we are building on data that was not made for consumption.

As mentioned earlier, data being ingested through ELT and CDC tools can change at any time, without warning. These could be schema changes, which typically cause the downstream pipelines to fail loudly with no new data being ingested or populated until the issue has been resolved. It could also be a change to the data itself, or the logic required to use that data correctly. These are often silent failures and may not be automatically detected. The first time we might hear about the issue is when a user brings up some data, maybe as part of a presentation or a meeting, and notices it doesn’t look quite right or looks different to how it did yesterday.

Often, these changes can’t be fixed in the source system. They were made for a good reason and have already been deployed to production. That leaves the data pipeline authors to implement a fix within the pipeline, which in the best case is just pointing to another column but more likely ends up being yet another CASE statement with logic to handle the change, or another IFNULL statement, or IF DATE < x THEN do this ELSE do that. This builds and builds over time, creating ever more complex and brittle data pipelines, and further increasing their unreliability.
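Expressed as a sketch (the column names, dates, and rules here are hypothetical, not from the book), each unannounced upstream change leaves another layer of conditional logic behind:

```python
# Hypothetical illustration of how unannounced upstream changes accumulate as
# layered workarounds in a pipeline - the Python equivalent of stacking
# CASE / IFNULL / IF DATE < x fixes in SQL.
from datetime import date


def normalise_country(row: dict) -> str:
    shipped = date.fromisoformat(row["shipped_on"])
    # Workaround #1 (2019): the service renamed `country` to `country_code`.
    country = row.get("country_code") or row.get("country")
    # Workaround #2 (2020): for a few months the field was sometimes null.
    if country is None:
        country = "UNKNOWN"
    # Workaround #3 (2022): older rows used full names, newer rows ISO codes.
    if shipped < date(2022, 1, 1) and country == "United Kingdom":
        country = "GB"
    return country


print(normalise_country({"shipped_on": "2021-06-01", "country": None}))              # UNKNOWN
print(normalise_country({"shipped_on": "2021-06-01", "country": "United Kingdom"}))  # GB
print(normalise_country({"shipped_on": "2023-06-01", "country_code": "GB"}))         # GB
```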

All the while, we’re increasing the number of applications built on this data and adding more and more complexity to these pipelines, which again further increases the unreliability.

The cost of these reliability issues is that users lose trust in the data, and once that trust is lost it’s very hard to win back.

The lack of autonomy

For decades we’ve been creating our data platforms with a bottleneck in the middle. The team, typically a central data engineering or BI engineering team, are the only ones who have the ability and the time to attempt to make use of the raw source data, with everyone else consuming their data.

Anyone wanting to have data made available to them will be waiting for that central team to prioritize that ask, with their ticket sitting in a backlog. These central teams will never have the capacity to keep up with these requests and instead can only focus on those deemed the highest priority, which are typically those data sources that drive the company KPIs and other top-level metrics.

That’s not to say the rest of the data does not have value! As we’ll discuss in the following section, it does, and there will be plenty of ways that data could be used to drive decisions or improve data-driven products across the organization. But this data is simply not accessible enough to the people who could make use of this data and therefore sits unused.

To empower a truly data-driven organization, we need to move away from the dependence on a central and limited data engineering team to an architecture that promotes autonomy, opening that dark data up to uses that will never be important enough to prioritize, but that when added up provide a lot of business value to the organization and support new applications that could be critical for its success.

This isn’t a technical limitation. Modern data lakehouses can be queried by anyone who knows SQL, and any data available in the lakehouse can be made available to any reporting tool for use by less technical users. It’s a limitation of the way we have chosen to ingest data through ELT, the lack of quality of that data, and the data culture that embodies.

As we’ll discuss in the next section, organizations are looking to gain a competitive advantage with the ever-increasing use of data in more and more business-critical applications. These limitations in our data architecture are no longer acceptable.

 

The ever-increasing use of data in business-critical applications

Despite all these challenges, data produced on a data platform is being increasingly used in business-critical applications.

This is for good reason! It’s well accepted that organizations that make effective use of data can gain a real competitive advantage. Increasingly, these are not traditional tech companies but organizations across almost all industries, as technology and data become more important to their business. This has led to organizations investing heavily in areas such as data science, looking to gain similar competitive advantages (or at least, not get left behind!).

However, for these data projects to be successful, more of our data needs to be accessible to people across the organization. We can no longer just be using a small percentage of our data to provide top-level business metrics and nothing more.

This can be clearly seen in the consumer sector, where to be competitive you must provide a state-of-the-art customer experience, and that requires the effective use of data at every customer touchpoint. A report from McKinsey (https://www.mckinsey.com/industries/retail/our-insights/jumpstarting-value-creation-with-data-and-analytics-in-fashion-and-luxury) found that the 25 top-performing retailers were digital leaders: they were 83% more profitable and took over 90% of the sector’s gains in market capitalization.

Many organizations are, of course, aware of this. An industry report by Anmut in 2021 (https://www.anmut.co.uk/wp-content/uploads/2021/05/Amnut-DLR-May2021.pdf) illustrated both the perceived importance of data to organizations and the problems they have utilizing it when it stated this in its executive summary:

We found that 91% of business leaders say data’s critical to their business success, 76% are investing in business transformation around data, and two-thirds of boards say data is a material asset.

Yet, just 34% of businesses manage data assets with the same discipline as other assets, and these businesses are reaping the rewards. This 34% spend most of their data investment creating value, while the rest spend nearly half of their budget fixing data.

It’s this lack of discipline in managing their data assets that is really harming organizations. It manifests itself in the lack of expectations throughout the pipeline and then permeates throughout the entire data platform and into those datasets within the data warehouse, which themselves also have ill-defined expectations for its downstream users or data-driven products.

The following diagram shows a typical data pipeline and how at each stage the lack of defined expectations ultimately results in the consumers losing trust in business-critical data-driven products:

Figure 1.6 – The lack of expectations throughout the data platform

Again, in the absence of these expectations, users will optimistically assume the data is more reliable than it is. But now it’s not just internal KPIs and reporting that are affected by the inevitable downtime; it’s revenue-generating services used by external customers. Just like internal users, those customers will start losing trust, but this time they are losing trust in the product and the company, which can eventually cause real damage to the company’s brand and reputation.

As the importance of data continues to increase and it finds its way into more business-critical applications, it becomes imperative that we greatly increase the reliability of our data platforms to meet the expectations of our users.

 

Summary

There’s no doubt that the effective use of data is becoming ever more critical to organizations. No longer is it only expected to drive internal reporting and KPIs, but the use of data is driving key products both internally and externally to customers.

However, while the tools we have available are better than ever, the architecture of the data platforms that underpin all of this has not evolved alongside them. Our data platforms continue to be hampered by a bottleneck that restricts the accessibility of the data. They are unable to provide reliable, quality data to the teams that need it, when they need it.

We need to stop working around these problems within the data platform and address them at the source.

We need an architecture that sets expectations around what data is provided, how to use it, and how reliable it will be.

We need a data culture that treats data as a first-class citizen, where responsibility is assigned to those who generate the data.

And so, in the next chapter, we’ll introduce data contracts: a new architecture pattern designed to solve these problems and provide the foundations we need to empower truly data-driven organizations that realize the value of their data.

 


About the Author
  • Andrew Jones

    Andrew Jones is a principal engineer at GoCardless, one of Europe's leading fintechs. He has over 15 years' experience in the industry, spending the first half primarily as a software engineer before moving into the data infrastructure and data engineering space. Joining GoCardless as its first data engineer, he led his team to build its data platform from scratch. After initially following a typical data architecture and growing frustrated at facing the same old challenges he'd faced for years, he started thinking there must be a better way, which led to him coining and defining the ideas around data contracts. Andrew is a regular speaker and writer, and he is passionate about helping organizations get maximum value from data.
