You're reading from Driving Data Quality with Data Contracts

Product typeBook

Published inJun 2023

PublisherPackt

ISBN-139781837635009

Edition1st Edition

Concepts

Data Engineering

Author (1)

Andrew Jones

The big data platform

As the internet took off in the 1990s and the size and importance of data grew with it, the big tech companies started developing a new generation of data tooling and architectures that aimed to reduce the cost of storing and transforming vast quantities of data. In 2003, Google wrote a paper describing their Google File System, and in 2004 followed that up with another paper, titled MapReduce: Simplified Data Processing on Large Clusters. These ideas were then implemented at Yahoo! and open sourced as Apache Hadoop in 2006.

Apache Hadoop contained two core modules. The Hadoop Distributed File System (HDFS) gave us the ability to store almost limitless amounts of data reliably and efficiently on commodity hardware. Then the MapReduce engine gives us a model on which we could implement programs to process and transform this data, at scale, also on commodity hardware.

This led to the popularization of big data, which was the collective term for our reporting, ML, and analytics capabilities with HDFS and MapReduce as the foundation. These platforms used open source technology and could be on-premises or in the cloud. The reduced costs made this accessible to organizations of any size, who could either implement it themselves or use a packaged enterprise solution provided by the likes of Cloudera and MapR.

The following diagram shows the reference data platform architecture built upon Hadoop:

Figure 1.2 – The big data platform architecture

At the center of the architecture is the data lake, implemented on top of HDFS or a similar filesystem. Here, we could store an almost unlimited amount of semi-structured or unstructured data. This still needed to be put into an EDW in order to drive analytics, as data visualization tools such as Tableau needed a SQL-compatible database to connect to.

Because there were no expectations set on the structure of the data in the data lake, and no limits on the amount of data, it was very easy to write as much as you could and worry about how to use it later. This led to the concept of extract, load, and transform (ELT), as opposed to ETL, where the idea was to extract and load the data into the data lake first without any processing, then apply schemas and transforms later as part of loading to the data warehouse or reading the data in other downstream processes.

We then had much more data than ever before. With a low barrier to entry and cheap storage, data was easily added to the data lake, whether there was a consumer requirement in mind or not.

However, in practice, much of that data was never used. For a start, it was almost impossible to know what data was in there and how it was structured. It lacked any documentation, had no set expectations on its reliability and quality, and no governance over how it was managed. Then, once you did find some data you wanted to use, you needed to write MapReduce jobs using Hadoop or, later, Apache Spark. But this was very difficult to do – particularly at any scale – and only achievable by a large team of specialist data engineers. Even then, those jobs tended to be unreliable and have unpredictable performance.

This is why we started hearing people refer to it as the data swamp. While much of the data was likely valuable, the inaccessibility of the data lake meant it was never used. Gartner introduced the term dark data to describe this, where data is collected and never used, and the costs of storing and managing that data outweigh any value gained from it (https://www.gartner.com/en/information-technology/glossary/dark-data). In 2015, IDC estimated 90% of unstructured data could be considered dark (https://www.kdnuggets.com/2015/11/importance-dark-data-big-data-world.html).

Another consequence of this architecture was that it moved the end data consumers further away from the data generators. Typically, a central data engineering team was introduced to focus solely on ingesting the data into the data lake, building the tools and the connections required to do that from as many source systems as possible. They were the ones interacting with the data generators, not the ultimate consumers of the data.

So, despite the advance in tools and technologies, in practice, we still had many of the same limitations as before. Only a limited amount of data could be made available for analysis and other uses, and we had that same bottleneck controlling what that data was.

Note

Let’s return to our example to illustrate how different roles worked together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that some of the data from that database is extracted in a raw form, and is unlikely to know exactly what the data is. Certainly, she doesn’t know why.

Ben is a data engineer who works on the ELT pipeline. He aims to extract as much of the data as possible into the data lake. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.

Leah is another data engineer, specializing in writing MapReduce jobs. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These MapReduce jobs have unpredictable performance and are difficult to debug. The jobs do not run reliably.

The BI analyst, Bukayo, takes this data and creates reports to support the business. They often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.

The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding of how the changes she makes affect key business processes.

While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs lack the autonomy and the expertise to access it, preventing the use of data for anything other than the most critical business requirements.

The next generation of data architectures began in 2012 with the launch of Amazon Redshift on AWS and the explosion of tools and investment into what became known as the modern data stack (MDS). In the next section, we’ll explore this architecture and see whether we can finally get rid of this bottleneck.

You have been reading a chapter from

Driving Data Quality with Data Contracts

Published in: Jun 2023Publisher: PacktISBN-13: 9781837635009

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Andrew Jones

Andrew Jones is a principal engineer at GoCardless, one of Europe's leading Fintech's. He has over 15 years experience in the industry, with the first half primarily as a software engineer, before he moved into the data infrastructure and data engineering space. Joining GoCardless as its first data engineer, he led his team to build their data platform from scratch. After initially following a typical data architecture and getting frustrated with facing the same old challenges he'd faced for years, he started thinking there must be a better way, which led to him coining and defining the ideas around data contracts. Andrew is a regular speaker and writer, and he is passionate about helping organizations get maximum value from data.
Read more about Andrew Jones

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages