Home

Data

Modern Data Architectures with Python

By Brian Lipp

Book

eBook $39.99

Print $49.99

Subscription $15.99

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook $39.99

Print $49.99

Subscription $15.99

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Modern Data Architectures with Python will teach you how to seamlessly incorporate your machine learning and data science work streams into your open data platforms. You’ll learn how to take your data and create open lakehouses that work with any technology using tried-and-true techniques, including the medallion architecture and Delta Lake. Starting with the fundamentals, this book will help you build pipelines on Databricks, an open data platform, using SQL and Python. You’ll gain an understanding of notebooks and applications written in Python using standard software engineering tools such as git, pre-commit, Jenkins, and Github. Next, you’ll delve into streaming and batch-based data processing using Apache Spark and Confluent Kafka. As you advance, you’ll learn how to deploy your resources using infrastructure as code and how to automate your workflows and code development. Since any data platform's ability to handle and work with AI and ML is a vital component, you’ll also explore the basics of ML and how to work with modern MLOps tooling. Finally, you’ll get hands-on experience with Apache Spark, one of the key data technologies in today’s market. By the end of this book, you’ll have amassed a wealth of practical and theoretical knowledge to build, manage, orchestrate, and architect your data ecosystems.

Publication date:: September 2023
Publisher: Packt
Pages: 318
ISBN: 9781801070492

Modern Data Processing Architecture

Data architecture has become one of the most discussed topics. This chapter will introduce data architecture and the methodologies for designing a data ecosystem. Architecting a data solution is tricky and often riddled with traps. We will go through the theories for creating a data ecosystem and give some insight into how and why you would apply those theories.

To do so, we will cover the essential concepts, why they are helpful, and when to apply them.

By the end of this chapter, you will have built the foundation of your data solution, and once completed, you should be comfortable with architecture data solutions at a high level.

In this chapter, we’re going to cover the following main topics:

Databases, data warehouses, and data lakes
Data platform architecture at a high level
Lambda versus Kappa architecture
Lakehouse and Delta architectures
Data mesh theory and practice

Technical requirements

You can use many tools to create the diagrams and technical documentation for this chapter.

I suggest the following:

Lucid Chart
Draw.io
OmniGraffle

Databases, data warehouses, and data lakes

The history of data processing is long and has had several unique innovations. As a result, we can look around today and see several patterns in our offices – everything from data stored as a file on a network drive, local hard drive, or technology such as S3 to relational database management systems (RDBMSes) and data warehouses. We will now go through the major types of data storage systems and see some of the benefits and disadvantages of each.

OLTP

When people think of storing data, the first thing that always comes to mind is the traditional relational database. These databases are designed to process and store short data interactions reliably and consistently. On the other hand, online transaction processing (OLTP) systems are very good at handling small interactions, called create, read, update, and delete (CRUD). OLTP systems come in two primary flavors: relational and NoSQL. We will cover the details of each type later, but for now, we should simply understand that within these two types of data stores, we can use any data that has some amount of structure to it. A classic example of a use case for OLTP would be a web application for a grocery store. These types of data actions are small and quick. Therefore, we would typically not see a much-extended workload on an OLTP system in typical data processing usage. Some examples of OLTP systems are MongoDB, Microsoft SQL, PostgreSQL, and CockroachDB, among others. In other words, a data storage system that is used to run a business on a day-to-day basis is an OLTP system. These have frequent insert, update, and delete operations and we are more interested in the throughput of these systems compared to their response time from a performance perspective. Most of these OLTP systems will be ACID-compliant (that is, atomicity, consistency, isolation, and duration).

OLAP

On the other side of the spectrum, we have online analytical processing (OLAP) systems, which are better designed to handle intense data processing workloads. We will avoid being pedantic about what precisely an OLAP is and instead paint a broad picture – two examples are an OLAP data warehouse and a lakehouse. A data warehouse, at its core, is simply an OLAP system that stores and curates data using data warehousing techniques. Data warehouses are trusted sources of data for decision-making. More often than not, a data warehouse will only store structured data. On the other hand, a data lake is a form of storage that stores data in its most native format. Data lakes can also serve as entry points to data warehouses. We will cover lakehouses in more detail later, but they can be understood as hybrids of data warehouses and data lakes.

Data lakes

So, we spoke of data warehouses and databases, which use data that follows a basic structure or schema. Since the cost of disk storage has reduced over time, the desire to keep more and more data has become a reality. This data began to take on less structure, such as audio, video, and text documents. Our current OLTP and OLAP systems were not the ideal tools for enormous amounts of data or data with a wide variety of structures. The data lake emerged as a way to store all the data in systems such as HDFS and AWS S3. Data lakes typically have schema on read, whereas OLTP and OLAP systems generally are schema on read and write. This wide range of flexibility often leads to a data dumping ground or a data swamp. The conventional data warehouses were monolithic, on-premises systems with the compute and storage combined. With the advent of big data, we saw distributed computing, where data was spilt across multiple machines. However, each machine still had combined compute and storage. With the advent of the cloud, this paradigm of splitting both the compute and storage across machines enabled by distributed computing took effect. This is much more efficient. Moreover, on the cloud, the services we use and therefore this OPEX model worked better than the conventional CAPEX model, which was a dead-cost investment.

With the emergence of data lakes came large-scale data processing engines such as Apache Hadoop, Apache Spark, and Apache Flink, to name a few. The most crucial detail to understand about this type of technology is the separation of the storage layer and the compute layer. This pattern exists in all systems designed to handle big data. You may not even know that a system uses this pattern, as with Snowflake or Big Query. There are both significant benefits and negative considerations regarding this type of data processing.

There is one universal rule when understanding data processing – the costs. Moving data is very expensive. Moving data from processor to disk is expensive but moving it across the network is exponentially costly. This must be a design consideration when you’re picking your stack. There are situations where that cost is acceptable, and your application is okay with waiting a few extra seconds. This is one of the reasons you don’t typically see decoupled storage and compute patterns with CRUD web applications.

In the following diagram, the left shows a typical data warehouse or database that has everything built into one system – that is, the storage and compute engines are together. The right-hand side of the diagram shows that they are decoupled, meaning they’re separate systems:

Figure 1.1: Storage and compute

Event stores

Another trend in data storage is using systems such as Kafka, a distributed event store and streaming processing engine. Event stores can be considered data stores with several logs that can be watched or read from start to finish. Event stores are often associated with real-time processing. The term real-time is often used to describe data that is flowing in a relatively real-time-like process. Real-time data is used in many data platforms and can come with its own set of complexities and problems. We will provide a whole chapter on streaming data using both Spark and Kafka. For now, it’s enough to understand that real-time data attempts to store, process, and access data as soon as it’s recorded.

File formats

Whenever you’re working with data, you will inevitably want to save it in a file. The question is, which file should you choose? Some formats are ideal for short-term office work. Other formats, on the other hand, are designed for long-term storage. When you’re selecting what file format to use, first, consider the use case for the file and how long you will keep it.

CSV

Here is an example of a CSV file. In this case, I am using , for the column delimiter and \\n for the line delimiter:

id,first_name,last_name,email\\n
1,Alyce,Creeber,acreeber0@hibu.com\\n
2,Gladi,Fenney,gfenney1@reference.com\\n
3,Mendy,Papen,mpapen2@jalbum.net\\n
4,Gerri,Kernan,gkernan3@berkeley.edu\\n
5,Luca,Skeen,lskeen4@hostgator.com\\n

Comma-separated values (CSV) files are the most common data files that are found in an office (outside of an Excel file). CSV is a text format that uses a set of characters to identify column cells and end-of-line characters. CSVs are used for structured data. Sometimes, you can create semi-structured scenarios, but it’s not very effective. CSVs are text files, so the file’s structure can often have varying characteristics such as headers, banners, or free-form text. CSVs are an excellent short-term office file format for doing work. However, CSVs don’t provide data types, which, combined with the other issues mentioned previously, makes them a terrible choice for long-term storage.

JSON

Here, we have an example JavaScript Object Notation (JSON) object with three objects inside the parent object:

[{
  "id": 1,
  "first_name": "Ermanno",
  "last_name": "Marconi",
  "email": "emarconi0@bbb.org"
}, {
  "id": 2,
  "first_name": "Lory",
  "last_name": "Maskew",
  "email": "lmaskew1@walmart.com"
}, {
  "id": 3,
  "first_name": "Karee",
  "last_name": "Hubbucke",
  "email": "khubbucke2@pagesperso-orange.fr"
}]

JSON is a plain text-based format that uses strict syntax to define semi-structured or structured data. In the preceding example, we are mimicking structured data, but we could have nested more objects or used an array. JSON format overcomes the issue of the poorly defined syntax of CSVs, which makes it a more ideal choice for most use cases. However, JSON is not my first choice for long-term storage because it lacks data types and contains no compression.

Avro

Here is an example schema file for the Avro format. Notice how we have two “columns” of data called Make and ID in the awesome_startup namespace:

{
   "type" : "record",
   "namespace" : "awesome_startup",
   "name" : "cars",
   "fields" : [
      { "name" : "Make" , "type" : "string" },
      { "name" : "ID" , "type" : "int" }
   ]
}

Avro is a popular open standard protocol often found in data lakes. Avro serializes data, making it small and compact compared to formats such as JSON. Avro supports both structured and semi-structured data like JSON. Avro has a secondary file that is in JSON format that defines the data types and structure of your data. With this schema file, you can evolve your schema by making changes but keep backward compatibility. This is a huge advantage when it comes to long-term data storage. Avro is designed to be accessed row by row or row storage. Row storage is ideal for cases when you look up a row and read the whole row. Although Avro is a significant step up from JSON, it still lacks in several ways. What happens when the schema file is lost? Typically, the data is unusable, which is less than ideal. Row storage is perfect for CRUD-style workflows, but many data-intense workflows will read a whole column at a time; this can be costly in Avro.

Parquet

Parquet has emerged as the best-of-breed file format in the majority of cases. Parquet supports semi-structured data and structured data like Avro. Parquet is an open standard format that, like Avro, serializes its data for small footprints. On the other hand, Parquet stores the schema within the file, which overcomes several shortcomings of Avro. Parquet, unlike row-oriented formats like Arvo, is column-oriented. This translates into faster data access and writing for most workflows.

Data platform architecture at a high level

What is data architecture, and why do I care? Data architecture is the process of designing and building complex data platforms. This involves taking a comprehensive view, which includes not only moving and storing data but also all aspects of the data platform. Building a well-designed data ecosystem can be transformative to a business.

What goes into architecting a data platform? In picking your solution, you will evaluate the price, cloud vendors, and multi-cloud options, among many other choices. Using a hosted option for a service might make sense, or it may be inappropriate. You might want to stick with one vendor for everything, or you may decide to get the best-of-breed technologies. Will you need streaming data? How will you manage your data? What is the long-term vision for the project? Is vendor and product lock-in something to consider? All these questions and more are answered by the data architect and the platform that gets designed.

I will now introduce a data platform reference architecture that is used to organize and better understand what exists in your data ecosystem.

Here is a data architecture reference diagram. I have put data governance, processing, and storage across all areas since they interact with everything:

Figure 1.2: Data platform architecture

Let’s break this architecture down.

Storage layer

In this layer, we include technologies that persist data in long-term storage, such as OLTP, OLAP, lakehouses, event stores, and data lakes. We also have file types such as Parquet, Avro, and CSV.

Ingestion layer

The ingestion layer focuses on taking your data from whatever source system it may live in, such as a social media site, and moving that data into the storage layer. This layer may include commercial products such as Fivetran or Stich, but often, it will also involve writing code to accomplish this task.

Analytics layer

In the analytics layer, we will see a variety of work that ranges from data science, machine learning, and artificial intelligence to graph analytics and statical analysis. The output from the analysis will be represented in a “view” that the consumption layer can access. You will also see data modeling in this layer. Data modeling is the process of building tables to better understand and represent your data from various perspectives.

Consumption layer

The consumption layer stores the output views created by the analytics layer. The technology chosen for this layer can vary, but one example might be a machine learning model that’s recorded and tracked in MLflow and stored in S3.

Serving layer

The serving layer consists of business intelligence (BI), dashboards, data visualizations, search engines, and other systems that use data products in the consumption layer.

Data governance layer

The data governance layer contains master data management, data quality enforcement, data security, data auditing, and metadata management. We will cover the fundamental concepts of this layer in a separate chapter.

Processing layer

The processing layer is the workhorse that handles moving data between all the other layers. Technologies such as Apache Spark, Flink, Snowflake, DBT, Dataflow, and Azure Data Factory handle this task.

Semantic view

The semantic view, also known as the virtual layer, is an optional abstraction in your data platform that allows you to decouple user access from data storage. In simple terms, users will have a view of the data stored in its “original” source, but without the need to access the source directly. Data could be stored in anything ranging from relational databases and data lakes to REST APIs. This virtual layer is then used to create new modeled data products. Users might access these data products from various sources, such as BI tooling, notebooks, or even internally developed applications using whatever access method is needed. These new data products are curated from different sources and are tailored to the needs of the users and the use cases. Often, standard terms are normalized across data products. In an ideal world, there is no storage in the semantic layer. However, you may need to store copies of the data for faster access, for example, if you’re integrating with another corporate application. There are several benefits to this added complexity, including central data governance and future proofing for any data storage solution changes.

Comparing the Lambda and Kappa architectures

In the beginning, we started with batch processing or scheduled data processing jobs. When we run data workloads in batches, we are setting a specific chronological cadence for those workloads to be triggered. For most workloads, this is perfectly fine, but there will always be a more significant time delay in our data. As technology has progressed, the ability to utilize real-time processing has become possible.

At the time of writing, there are two different directions architects are taking in dealing with these two workloads.

Lambda architecture

The following is the Lambda architecture, which has a combined batch and real-time consumption and serving layer:

Figure 1.3: Combined Lambda architecture

The following diagram shows a Lambda architecture with separate consumption and serving layers, one for batch and the other for real-time processing:

Figure 1.4: Separate combined Lambda architecture

The Lambda architecture was the first attempt to deal with both streaming and batch data. It grew out of systems that started with just traditional batch data. As a result, the Lambda architecture uses two separate layers for processing data. The first layer is the batch layer, which performs data transformations that are harder to accomplish in real-time processing workstreams. The second layer is a real-time layer, which is meant for processing data as soon as its ingested. As data is transformed in each layer, the data should have a unique ID that allows the data to be correlated, no matter the workstream.

Once the data products have been created from each layer, there can be a separate or combined consumption layer. A combined consumption layer is easier to create but given the technology, it can be challenging to accomplish complex models that span both types of data. In the consumption layer, batch and real-time processing can be combined, which will require matching IDs. The consumption layer is a landing zone for data products made in the batch or real-time layer. The storage mechanism for this layer can range from a data lake or a data lakehouse to a relational database. The serving layer is used to take data products in the consumption layer and create views, run AI, and access data through tools such as dashboards, apps, and notebooks.

The Lambda architecture is relatively easy to adopt, given that the technology and patterns are typically known and fit into a typical engineer’s comfort zone. Most deployments already have a batch layer, so often, the real-time layer is a bolt-on addition to the platform. What tends to happen over time is that the complexity grows in several ways. Two very complex systems must be maintained and coordinated. Also, two distinct sets of software must be developed and maintained. In most cases, the two layers do not have similar technology, which will translate into a variety of techniques and languages for writing software in and keeping it updated.

Kappa architecture

The following diagram shows the Kappa architecture. The essential details are that there is only one set of layers and batch data is extracted via real-time processing:

Figure 1.5: Kappa architecture

The Kappa architecture was designed due to frustrations with the Lambda architecture. With the Lambda architecture, we have two layers, one for batch processing and the other for stream processing. The Kappa architecture has only a single real-time layer, which it uses for all data. Now, if we take a step back, there will always be some amount of oddness because batch data isn’t real-time data. There is still a consumption layer that’s used to store data products and a serving layer for accessing those data products. Again, the caveat is that many batch-based workloads will need to be customized so that they only use streaming data. Kappa is often found in large tech companies that have a wealth of tech talent and the need for fast, real-time data access.

Where Lambda was relatively easy to adopt, Kappa is highly complex in comparison. Often, the minimal use case for a typical company for real-time data does not warrant such a difficult change. As expected, there are considerable benefits to the Kappa architecture. For one, maintenance is reduced significantly, and the code base is also slimmed down. Real-time data can be complex to work with at times. Think of a giant hose that can’t ever be turned off. Issues with data in the Kappa architecture can often be very challenging, given the nature of the data storage. In the batch processing layer, it’s easy to deploy a change to the data, but in the real-time processing layer, reprocessing the data is no trivial matter. What often happens is secondary storage is adopted for data so that data can be accessed in both places. A straightforward example of why having a copy of data in a secondary system is, for example, when in Kafka, you need to constantly adjust the data. We will discuss Kafka later, but I will just mention that having a way to dump a Kafka topic and quickly repopulate it can be a lifesaver.

Lakehouse and Delta architectures

Introduced by Databricks, the lakehouse and Delta architectures are a significant improvement over previous data platform patterns. They are the next evolution by combining what works from previous modalities and improving them.

Lakehouses

What is a lakehouse? Why is it so important? It’s talked about often, but few people can explain the tenets of a lakehouse. The lakehouse is an evolution of the data warehouse and the data lake. A lakehouse takes the lessons learned from both and combines them to avoid the flaws of both. There are seven tenets of the lakehouse, and each is taken from one of the parent technologies.

The seven central tenets

Something that’s not always understood when engineers discuss lakehouses is that they're general sets of ideas. Here, we will walk through all seven essential tenets – openness, data diversity, workflow diversity, processing diversity, language-agnostic, decoupled storage and compute, and ACID transactions.

Openness

The openness principle is fundamental to everything in the lakehouse. It influences the preference for open standards over closed-source technology. This choice affects the long-term life of our data and the systems we choose to connect with. When you say a lakehouse is open, you are saying it uses nonproprietary technologies, but it also uses methodologies that allow for easier collaboration, such as decoupled storage and compute engines.

Data diversity

In a lakehouse, all data is welcome and accessible to users. Semi-structured data is given first-class citizenship alongside structured data, including schema enforcement.

Workflow diversity

With workflow diversity, users can access the data in many ways, including via notebooks, custom applications, and BI tools. How the user interacts with the data shouldn’t be limited.

Processing diversity

The lakehouse prioritizes both streaming and batch equally. Not only is streaming important but it also uses the Delta architecture to compress streaming and batch into one technology layer.

Language-agnostic

The goal of the lakehouse is to support all methods of accessing the data and all programming languages. However, this goal is not possible practically. When implemented, the list of methods and languages supported in Apache Spark is extensive.

Decoupling storage and compute

In a data warehouse, data storage is combined with the same technology. From a speed perspective, this is ideal, but it creates a lack of flexibility. When a new processing engine is desired, or a combination of many storage engines is required, the data warehouse’s model fails to perform. A unique characteristic taken from data lakes is decoupling the storage and compute layers, which creates several benefits. The first is flexibility; you can mix and match technologies. You can have data stored in graph databases, cloud data warehouses, object stores, or even systems such as Kafka. The second benefit is the significant cost reduction. This cost reduction comes when you incorporate technologies such as object stores. Cloud object stores such as AWS’s S3, Azure’s Blob, and GCP’s Object Storage represent cheap, effective, and massively scalable data storage. Lastly, when you follow this design pattern, you can scale at a more manageable rate.

ACID transactions

One significant issue with data lakes is the lack of transactional data processing. Transactions have substantial effects on the quality of your data. They affect anything from writing to the same table at the same time to a log of changes made to your table. With ACID transactions, lakehouses are significantly more reliable and effective compared to data lakes.

The medallion data pattern and the Delta architecture

The medallion data pattern is an approach to storing and serving data that is less based on building a data warehouse and more focused on building your raw data off a single source of truth.

Delta architecture

The following diagram shows the Delta architecture. It looks just like Kappa, but the key difference is that you are not forcing batch processing out of real-time data. Both exist within the same layers and systems:

Figure 1.6: Delta architecture

The Delta architecture is a lessons-learned approach to both the Kappa and Lambda architectures. Delta sees that the complexity of trying to use real-time data as the sole data source can be highly complex and, in the majority of companies, overkill. Most companies want workloads in both batch and real-time, not excluding one over the other. Yet, Delta architectures reduce the footprint to one layer. This processing layer can handle batch and real-time with almost identical code. This is a huge step forward from previous architectural patterns.

The medallion data pattern

The medallion data pattern is an organized naming convention that explains the nature of the data being processed. When referencing your tables, labeling them with tags allows for clear visibility into tables.

The following diagram shows the medallion data pattern, which describes the state of each dataset at rest:

Figure 1.7: Medallion architecture

As you can see, the architecture has different types of tables (data). Let’s take a look.

Bronze

Bronze data is raw data that represents source data without modification, other than metadata. Bronze tables can be used as a single source of truth since the data is in its purest form. Bronze tables are incrementally loaded and can be a combination of streaming and batch data. It might be helpful to add metadata such as source data and processed timestamps.

Silver

Once you have bronze data, you will start to clean, mold, transform, and model your data. Silver data represents the final stage before creating data products. Once your data is in a silver table, it is considered validated, enriched, and ready for consumption.

Gold

Gold data represents your final data products. Your data is curated, summarized, and molded to meet your user’s needs in terms of BI dashboarding, statistical analysis, or machine learning modeling. Often, golden data is stored in separate buckets to allow for data scaling.

Data mesh theory and practice

Zhamak Dehghani created data mesh to overcome many common data platform issues while working for ThoughtWorks in 2019. I often get asked by seasoned data professionals, why bother with data mesh? Isn’t it just data silos all over again? The fundamentals of a data mesh are that it’s a decentralized data domain and scaling-focused but with those ideas also comes rethinking how we organize not only our data but also our whole data practice. By learning and adopting data mesh concepts and techniques, we cannot only produce more valuable data but also better enable users to access our data. No longer do we see orphaned data with little interaction from its creators. Users have direct relationships with data producers, and with that comes higher-quality data.

The following diagram shows the typical spaghetti data pipeline complexity that many growing organizations fall into. Imagine trying to maintain this maze of data pipelines. This is a common scenario for spaghetti data pipelines, which are very brittle and hard to maintain and scale:

Figure 1.8: Classic data pipeline architecture

The following diagram shows our data platform once it’s been decentralized, which means it’s free of brittle pipelines and able to scale:

Figure 1.9: Data mesh architecture

Anyone who has worked on a more extensive organization’s data team can tell you how complex and challenging things get. You will find a maze of complex data pipelines everywhere. Pipelines are the blood of your data warehouse or data lake. As data is shipped and processed, it’s merged into a central warehouse. It’s common to have lots of data with very little visibility into who knows about the data and the quality of that data. The data sits there, and people who know about it use it in whatever state it lives in. So, let’s say you find that data. What exactly is the data? What is the path or lineage the data has taken to get to its current state? Who should you contact to correct issues with the data? So many questions, and often, the answers are less than ideal.

These were the types of problems Zhamak Dehghani was trying to tackle when she first came up with the idea of a data mesh. Dehghani noticed all the limitations of the current landscape in data organizations and developed a decentralized philosophy heavily influenced by Eric Evans’s domain-driven design. A data mesh is arguably a mix of organizational and technological changes. These changes, when adopted, allow your teams to have a better data experience. One thing I want to make very clear is that data mesh does not involve creating data silos. It involves creating an interconnected network of data that isn’t focused on the technical process but on the functionality concerning the data. Organizationally, a domain in data mesh will have a cross-functional team of data experts, software experts, data product owners, and infrastructure experts.

Defining terms

A data mesh has several terms that need to be explained for us to understand the philosophy fully. The first term that stands out is the data product owner. The data product owner is a member of the business team who takes on the role of the data steward or overseer of the data and is responsible for data governance. If there is an issue with data quality or privacy concerns, the data steward would be the person accountable for that data. Another term that’s often used in a data mesh is domain, which can be understood as an organizational group of commonly focused entities. Domains publish data for other domains to consume. Data products are the heart of the data mesh philosophy. The data products should be self-service data entities that are offered close to creators. What does it mean to have self-service data? Your data is self-service when other domains can search, find, and access your data without having to have any administrative steps. This should all live on a data platform, which isn’t one specific technology but a cohesive network of technologies.

The four principles of data mesh

Let’s look at these four principles – that is, data ownership, data as a product, data availability, and data governance.

Data ownership

Data ownership is a fundamental concept in a data mesh. Data ownership is a partnership between the cross-functional teams within a domain and other domains that are using the data products in downstream apps and data products. In the traditional model, data producers allow a central group of engineers to send their data to a single repository for consumption. This created a scenario where data warehouse engineers were the responsible parties for data. These engineers tried to be the single source of truth when it came to this data. The source is now intimately involved with the data consumption process. This reduces and improves the data quality and removes the need for a central engineering group to manage data. To accomplish this task, teams within a domain have a wide range of skill sets, including software engineers, product owners, DevOps engineers, data engineers, and analytics engineers. So, who ultimately owns the data? It’s very simple – if you produce the data, you own it and are responsible for it.

Data as a product

Data as a product is a fundamental concept that transforms the data that is offered to users. When each domain treats the data that others consume as an essential product, it reinvents our data. We market our data and have a vested interest in that data. Consumers want to know all about the data before they buy our product. So, when we advertise our data, we list the characteristics users need to make that educated decision. These characteristics are things such as lineage, access statistics, data contract, the data product owners, privacy level, and how to access the data. Each product is a uniquely curated creation and versioned to give complete visibility to consumers. Each version of our data product represents a new data contract with downstream users. What is a data contract? It’s an agreement between the producers and consumers of the data. Not only is the data expected to be clean and kept in high quality but the schema of the data is also guaranteed. When a new version comes out, any schema changes must be backward-compatible to avoid breaking changes. This is called schema evolution and is a cornerstone to developing a trusted data product.

Data is available

As a consumer of data in an organization, I should be able to find data easily in some type of registry. The data producers should have accurate metadata about the data products within this registry. When I want to access this data, there should be an automated process that is ideally role-based. In an ideal world, this level of self-service exists to create an ecosystem of data. When we build data systems with this level of availability, we see our data practice grow, and we evolve our data usage.

Data governance

Data governance is a very loaded term with various meanings, depending on the person and the context. In the context of a data mesh, data governance is applied within each domain. Each domain will guarantee quality data that fulfills the data contract, all data meets the privacy level advertised, and appropriate access is granted or revoked based on company policies. This concept is often called federated data governance. In a federated model, governance standards are centrally defined but executed within each domain. Like any other area of a data mesh, the infrastructure can be shared but in a distributed manner. This distributed approach allows for standards across the organization but only via domain-specific implementations.

Summary

In this chapter, we covered the data architecture at a very high level, which can be overwhelming. Designing a solution is complex, but we covered many fundamental topics such as data meshes and lakehouses that should help you build data platforms. In the following few chapters, we will walk you through building the components of a data platform, such as BI tooling, MLOps, and data processing.

Practical lab

This isn’t an ideal chapter for a lab because it’s a chapter on theory. So, let’s introduce a common thread in our chapters: Mega Awesome Toys. Mega Awesome Toys has hired us to consult on its cloud data transformation. The chief data officer has explained that Mega Awesome Toys has factories across the globe, with lots of IoT data coming in from machinery building toys. It also has substantial financial data from sales and toy production. It is also expanding into online sales and has a growing amount of web-based data coming in large amounts. The company has settled on AWS as its cloud provider. Its websites use MongoDB, and it has several other relational databases. Its data warehouse is on a small Microsoft SQL deployment. It has several data analysts, data scientists, and data engineers who all use Python and SQL. It also has a team of software engineers. Its long-term goal is to leverage machine learning and statistics from its data into all areas of its business. It is desperate for technology leadership as it migrates off on-premises solutions.

Solution

There are several key details to take note of:

AWS
MongoDB
Microsoft SQL and other relational databases
Python and SQL for data usage
Data scientists, analysts, and engineers
A team of software engineers

One possible solution is as follows:

MongoDB has a hosted cloud offering called Atlas that is 100% compatible with AWS. So, yes, there are AWS-native choices, but given there is no need to choose an AWS product here, I would suggest a best-of-breed solution.
Relational databases are a dime a dozen, and AWS RDS is perfect. Therefore, I would suggest choosing a flavor. I recommend PostgreSQL on RDS unless global scaling is an essential requirement; then, I would look at CockroachDB or AWS Aurora. Since there isn’t much magic in relational databases, using RDS is easy in most cases.
Given the skills, roles, and long-term goals that have been set, I would recommend the lakehouse architecture combined with a data mesh approach. Since streaming (real-time) was directly mentioned, I would shy away from having that over Kafka and instead use Databricks as a component for the data platform. For a long time, Databricks has set itself as the front-runner for machine learning, artificial intelligence, statistics, data lakes, data warehouses, and more. It is compatible with all major cloud vendors.

About the Author

Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.
Browse publications by this author