You're reading from  Driving Data Quality with Data Contracts

Product type: Book
Published in: Jun 2023
Publisher: Packt
ISBN-13: 9781837635009
Edition: 1st Edition
Author (1)
Andrew Jones

Andrew Jones is a principal engineer at GoCardless, one of Europe's leading fintechs. He has over 15 years' experience in the industry, with the first half primarily as a software engineer, before he moved into the data infrastructure and data engineering space. Joining GoCardless as its first data engineer, he led his team to build their data platform from scratch. After initially following a typical data architecture and getting frustrated with facing the same old challenges he'd faced for years, he started thinking there must be a better way, which led to him coining and defining the ideas around data contracts. Andrew is a regular speaker and writer, and he is passionate about helping organizations get maximum value from data.

The big data platform

As the internet took off in the 1990s and the size and importance of data grew with it, the big tech companies started developing a new generation of data tooling and architectures that aimed to reduce the cost of storing and transforming vast quantities of data. In 2003, Google wrote a paper describing their Google File System, and in 2004 followed that up with another paper, titled MapReduce: Simplified Data Processing on Large Clusters. These ideas were then implemented at Yahoo! and open sourced as Apache Hadoop in 2006.

Apache Hadoop contained two core modules. The Hadoop Distributed File System (HDFS) gave us the ability to store almost limitless amounts of data reliably and efficiently on commodity hardware. The MapReduce engine then gave us a model with which we could implement programs to process and transform this data at scale, also on commodity hardware.
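To make the MapReduce model more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, and reduce) applied to a word count. It is an illustration only and does not use Hadoop's actual API; the function names and the sample records are assumptions made for this example. In a real cluster, the map and reduce functions would run in parallel across many machines, with the framework handling the shuffle between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three stages in sequence on a list of input records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: each string stands in for one line of a file on HDFS.
records = ["big data big clusters", "data lake"]
counts = run_job(records)
print(counts)
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, the work can be split across commodity machines, which is what made the model attractive at scale.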

This led to the popularization of big data, which was the collective term for our reporting, ML, and analytics capabilities with HDFS and MapReduce as the foundation. These platforms used open source technology and could be on-premises or in the cloud. The reduced costs made this accessible to organizations of any size, who could either implement it themselves or use a packaged enterprise solution provided by the likes of Cloudera and MapR.

The following diagram shows the reference data platform architecture built upon Hadoop:

Figure 1.2 – The big data platform architecture

At the center of the architecture is the data lake, implemented on top of HDFS or a similar filesystem. Here, we could store an almost unlimited amount of semi-structured or unstructured data. This still needed to be put into an EDW in order to drive analytics, as data visualization tools such as Tableau needed a SQL-compatible database to connect to.

Because there were no expectations set on the structure of the data in the data lake, and no limits on the amount of data, it was very easy to write as much as you could and worry about how to use it later. This led to the concept of extract, load, and transform (ELT), as opposed to ETL, where the idea was to extract and load the data into the data lake first without any processing, then apply schemas and transforms later as part of loading to the data warehouse or reading the data in other downstream processes.
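The difference between loading raw data and applying a schema later can be sketched in a few lines. The following is a simplified illustration, not a real pipeline: the sample records, field names, and functions are all hypothetical. The load step accepts everything as-is, and the schema is only applied when the data is read for a downstream use.

```python
import json

# Hypothetical raw events as they might land in a data lake.
# Nothing is validated on write, so malformed rows are stored too.
raw_lines = [
    '{"order_id": "1", "amount": "19.99", "currency": "GBP"}',
    '{"order_id": "2", "amount": "5.00"}',
    "not valid json",
]


def load(lines):
    """Load step of ELT: store every line untouched."""
    return list(lines)


def transform(lake):
    """Transform step: apply a schema on read and set aside rows that do not fit."""
    valid, rejected = [], []
    for line in lake:
        try:
            record = json.loads(line)
            valid.append(
                {
                    "order_id": int(record["order_id"]),
                    "amount": float(record["amount"]),
                    "currency": record["currency"],
                }
            )
        except (ValueError, KeyError, TypeError):
            # Unparseable JSON, a missing field, or a value of the wrong type.
            rejected.append(line)
    return valid, rejected


lake = load(raw_lines)
valid, rejected = transform(lake)
print(len(lake), len(valid), len(rejected))
```

In this sketch, all three lines are loaded but only one survives the transform. The problems with the other two only become visible when someone tries to use the data, long after it was written.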

We then had much more data than ever before. With a low barrier to entry and cheap storage, data was easily added to the data lake, whether there was a consumer requirement in mind or not.

However, in practice, much of that data was never used. For a start, it was almost impossible to know what data was in there and how it was structured. It lacked any documentation, had no set expectations on its reliability and quality, and had no governance over how it was managed. Then, once you did find some data you wanted to use, you needed to write MapReduce jobs using Hadoop or, later, Apache Spark. This was very difficult to do, particularly at any scale, and only achievable by a large team of specialist data engineers. Even then, those jobs tended to be unreliable and to have unpredictable performance.

This is why we started hearing people refer to it as the data swamp. While much of the data was likely valuable, the inaccessibility of the data lake meant it was never used. Gartner introduced the term dark data to describe this, where data is collected and never used, and the costs of storing and managing that data outweigh any value gained from it (https://www.gartner.com/en/information-technology/glossary/dark-data). In 2015, IDC estimated 90% of unstructured data could be considered dark (https://www.kdnuggets.com/2015/11/importance-dark-data-big-data-world.html).

Another consequence of this architecture was that it moved the end data consumers further away from the data generators. Typically, a central data engineering team was introduced to focus solely on ingesting the data into the data lake, building the tools and the connections required to do that from as many source systems as possible. They were the ones interacting with the data generators, not the ultimate consumers of the data.

So, despite the advance in tools and technologies, in practice, we still had many of the same limitations as before. Only a limited amount of data could be made available for analysis and other uses, and we had that same bottleneck controlling what that data was.

Note

Let’s return to our example to illustrate how different roles worked together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that some of the data from that database is extracted in a raw form, and is unlikely to know exactly what the data is. Certainly, she doesn’t know why.

Ben is a data engineer who works on the ELT pipeline. He aims to extract as much of the data as possible into the data lake. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.

Leah is another data engineer, specializing in writing MapReduce jobs. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These MapReduce jobs have unpredictable performance and are difficult to debug. The jobs do not run reliably.

The BI analyst, Bukayo, takes this data and creates reports to support the business. These reports often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s reports.

The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding of how the changes she makes affect key business processes.

While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs lack the autonomy and the expertise to access it, preventing the use of data for anything other than the most critical business requirements.

The next generation of data architectures began in 2012 with the launch of Amazon Redshift on AWS and the explosion of tools and investment into what became known as the modern data stack (MDS). In the next section, we’ll explore this architecture and see whether we can finally get rid of this bottleneck.
