You're reading from Data Engineering with Google Cloud Platform

Product type: Book

Published in: Mar 2022

Publisher: Packt

ISBN-13: 9781800561328

Pages: 440

Edition: 1st Edition

Languages: Python

Concepts: Data Analysis

Author (1): Adi Wijaya

Table of Contents (17) Chapters

Preface

Section 1: Getting Started with Data Engineering with GCP

Chapter 1: Fundamentals of Data Engineering

Chapter 2: Big Data Capabilities on GCP

Section 2: Building Solutions with GCP Components

Chapter 3: Building a Data Warehouse in BigQuery

Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer

Chapter 5: Building a Data Lake Using Dataproc

Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow

Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio

Chapter 8: Building Machine Learning Solutions on Google Cloud Platform

Section 3: Key Strategies for Architecting Top-Notch Data Pipelines

Chapter 9: User and Project Management in GCP

Chapter 10: Cost Strategy in GCP

Chapter 11: CI/CD on Google Cloud Platform for Data Engineers

Chapter 12: Boosting Your Confidence as a Data Engineer

Other Books You May Enjoy

Understanding the data life cycle

The first principle to learn to become a data engineer is understanding the data life cycle. If you've worked with data, you know that data doesn't stay in one place; it moves from one storage system to another and from one database to others. Understanding the data life cycle means being able to answer questions such as the following before you display information to your end user:

Who will consume the data?
What data sources should I use?
Where should I store the data?
When should the data arrive?
Why does the data need to be stored in this place?
How should the data be processed?

To answer all those questions, we'll start by looking back a little bit at the history of data technologies.

Understanding the need for a data warehouse

The data warehouse is not a new concept; I believe you've at least heard of it. In fact, the terminology is no longer appealing. In my experience, no one gets excited when talking about data warehouses in the 2020s, especially when compared to terminologies such as big data, cloud computing, and artificial intelligence.

So, why do we need to know about data warehouses? Because almost every data engineering challenge, from the old times to these days, is conceptually the same. The challenge is always about moving data from the data source to other environments so the business can use it to get information. What changes over time is only the how and the newer technologies. If we understand why people needed data warehouses in historical times, we will have a better foundation to understand the data engineering space and, more specifically, the data life cycle.

Data warehouses were first developed in the 1980s to transform data from operational systems to decision-making support systems. The key principle of a data warehouse is combining data from many different sources to a single location and then transforming it into a format the data warehouse can process and store.

For example, in the financial industry, say a bank wants to know how many credit card customers also have mortgages. It is a simple enough question, yet it's not that easy to answer. Why?

Most traditional banks that I have worked with had a separate operational system for each of their products: one system for credit cards and others for mortgages, savings products, websites, customer service, and so on. So, in order to answer the question, data from multiple systems needs to be stored in one place first.

See the following diagram on how each department is independent:

Figure 1.1 – Data silos

Often, independence applies not only to the organization structure but also to the data. When data is kept in different, disconnected places, we call those places data silos. This is very common in large organizations where each department has different goals, responsibilities, and priorities.

In summary, what we need to understand from the data warehouse concept is the following:

Data silos have always occurred in large organizations, even back in the 1980s.
Data comes from many operational systems.
In order to process the data, we need to store the data in one place.
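The bank question above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, table names, and column names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical extracts from two independent operational systems (data silos).
credit_card_system = [
    (101, "Alice"),
    (102, "Budi"),
    (103, "Citra"),
]
mortgage_system = [
    (102, 250_000),
    (103, 410_000),
    (104, 180_000),
]

# A single in-memory SQLite database stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE credit_card_customers (customer_id INTEGER, name TEXT)")
warehouse.execute("CREATE TABLE mortgage_customers (customer_id INTEGER, loan_amount INTEGER)")

# Store the data from both silos in one place.
warehouse.executemany("INSERT INTO credit_card_customers VALUES (?, ?)", credit_card_system)
warehouse.executemany("INSERT INTO mortgage_customers VALUES (?, ?)", mortgage_system)

# Once the data sits together, the business question becomes a single query.
(customers_with_both,) = warehouse.execute(
    """
    SELECT COUNT(DISTINCT c.customer_id)
    FROM credit_card_customers AS c
    JOIN mortgage_customers AS m ON m.customer_id = c.customer_id
    """
).fetchone()

print(customers_with_both)
```

While the two lists live in separate systems, there is nothing to join; the query only becomes possible after both are loaded into the same place.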

What does a typical data warehouse stack look like?

This diagram represents the four logical building blocks in a data warehouse, which are Storage, Compute, Schema, and SQL Interface:

Figure 1.2 – Data warehouse main components

Most data warehouse products store and process data seamlessly, and the user accesses the data with SQL, in tables with a structured schema. This is basic knowledge, but an important point to be aware of is that the four logical building blocks in a data warehouse are designed as one monolithic piece of software. Over the following years, that monolithic design evolved, and this evolution was the start of the data lake.

Getting familiar with the differences between a data warehouse and a data lake

Fast forward to 2008, when an open source data technology named Hadoop, first released in 2006, became a top-level Apache project, and people started to use the data lake terminology. If you try to find the definition of data lake on the internet, it will mostly be described as a centralized repository that allows you to store all your structured and unstructured data.

So, what is the difference between a data lake and a data warehouse? Both share the idea of storing data in centralized storage. Is it simply that a data lake stores unstructured data and a data warehouse doesn't?

What if I say some data warehouse products can now store and process unstructured data? Does the data warehouse become a data lake? The answer is no.

One of the key differences from a technical perspective is that data lake technologies separate most of the building blocks, in particular, the storage and computation, but also the other blocks, such as schema, stream, SQL interface, and machine learning. This evolves the concept of a monolithic platform into a modern and modular platform consisting of separated components, as illustrated in the following diagram:

Figure 1.3 – Data warehouse versus data lake components

For example, in a data warehouse, you insert data by calling SQL statements and query the data through SQL tables, and there is nothing you can do as a user to change that pattern.

In a data lake, you can access the underlying storage directly, for example, by storing a text file, choosing your own computation engine, and choosing not to have a schema. There are many impacts of this concept, but I'll summarize it into three differences:

Figure 1.4 – Table comparing data lakes and data warehouses

Large organizations started to store any kind of data in data lake systems for two reasons: high scalability and cheap storage. In modern data architecture, data lakes and data warehouses complement each other rather than replace each other.

We will dive deeper and work through practical examples throughout the book, such as building samples in Chapter 3, Building a Data Warehouse in BigQuery, and Chapter 5, Building a Data Lake Using Dataproc.
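To make the access patterns concrete, here is a minimal sketch of the two styles side by side. It is an illustration only: SQLite stands in for a data warehouse, a temporary local folder stands in for data lake storage, and the file name and column names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Data warehouse pattern: the schema is declared first, and data goes in and
# out through SQL only.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
warehouse.execute("INSERT INTO transactions VALUES (?, ?)", (101, 25.5))
warehouse_total = warehouse.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]

# Data lake pattern: write a raw file straight to storage, with no schema declared.
lake_dir = Path(tempfile.mkdtemp())  # stand-in for data lake storage
raw_file = lake_dir / "transactions.csv"
raw_file.write_text("customer_id,amount\n101,25.5\n102,10.0\n")

# Any computation engine can read the file later; here plain Python reads it
# and applies types only at read time.
with raw_file.open(newline="") as f:
    lake_total = sum(float(row["amount"]) for row in csv.DictReader(f))

print(warehouse_total, lake_total)
```

In the first half, the table definition and the SQL interface come as a package. In the second half, the storage, the engine that reads it, and the decision about types are all chosen separately.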

The data life cycle

Based on our understanding of the history of the data warehouse, now we know that data does not stay in one place. As an analogy, data is very similar to water; it flows from upstream to downstream. Later in this book, both of these terms, upstream and downstream, will be used often since they are common terminologies in data engineering.

When you think about water flowing upstream and downstream, one example that you can think of is a waterfall; the water falls freely without any obstacles.

Another example, under different circumstances, is a water pipeline; upstream is the water reservoir and downstream is your kitchen sink. In this case, you can imagine the different pipes, filters, branches, and knobs in the middle of the process.

Data is very much like water. There are scenarios where you just need to copy data from one storage to another, and there are more complex scenarios where you need to filter, join, and split the data across multiple steps before it can be consumed by the end users downstream.

As illustrated in the following diagram, the data life cycle mostly starts from frontend applications and flows all the way to the data users, who consume it as information in dashboards or ad hoc queries:
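The two kinds of scenario can be sketched with plain Python lists. This is a toy illustration, not a real pipeline: the records, field names, and customer segments are assumptions made up for the example.

```python
# Hypothetical upstream records; the field names are assumptions for illustration.
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 120.0, "status": "paid"},
    {"order_id": 2, "customer_id": 102, "amount": 35.0, "status": "cancelled"},
    {"order_id": 3, "customer_id": 103, "amount": 560.0, "status": "paid"},
]
customers = {101: "retail", 102: "retail", 103: "corporate"}

# Simple scenario: copy the data from one storage to another, unchanged.
copied_orders = list(orders)

# Complex scenario, step 1: filter out records the consumers do not need.
paid_orders = [o for o in orders if o["status"] == "paid"]

# Step 2: join each order with its customer segment.
enriched = [{**o, "segment": customers[o["customer_id"]]} for o in paid_orders]

# Step 3: split the flow into one output per downstream consumer.
outputs = {}
for row in enriched:
    outputs.setdefault(row["segment"], []).append(row)

print(sorted(outputs))
```

Each step takes the output of the previous one, which is what the upstream and downstream terminology describes.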

Figure 1.5 – Data life cycle diagram

Let's now look at the elements of the data life cycle in detail:

Apps and databases: The application is the interface from the human to the machine. In most cases, the frontend application is the first upstream source of data. Data at this level is designed to serve application transactions as fast as possible.
Data lake: Data from multiple databases needs to be stored in one place. The data lake concept and technologies suit this need. The data lake stores data in file formats such as CSV, Avro, or Parquet.

The advantage of storing data in a file format is that it can accept any data sources; for example, MySQL Database can export the data to CSV files, image data can be stored as JPEG files, and IoT device data can be stored as JSON files. Another advantage of storing data in a data lake is it doesn't require a schema at this stage.

Data warehouse: Once you know that some of the data in a data lake is valuable, you need to start thinking about the following:
1. What is the schema?
2. How do you query the data?
3. What is the best data model for the data?

Data in a data warehouse is usually modeled based on business requirements. Because of this, one of the key requirements for building a data warehouse is knowing the relevance of the data to your business and the information that you expect to generate from the data.

Data mart: A data mart is an area for storing data that serves specific user groups. At this stage, you need to start thinking about the final downstream of data and who the end user is. Each data mart is usually under the control of a single department within an organization. For example, a data mart for a finance team will consist of finance-related tables, while a data mart for data scientists might consist of tables with machine learning features.
Data end consumer: In the last stage, data goes back to humans as information. End users can use the data in various ways, but at a very high level, these are the three most common usages:
1. Reporting and dashboard
2. Ad hoc query
3. Machine learning
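The stages above can be walked through end to end in a small sketch. Treat it as an illustration under assumptions rather than a reference design: SQLite stands in for both the application database and the data warehouse, a temporary local file stands in for the data lake, a SQL view stands in for the data mart, and all table, view, and column names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# 1. Apps and databases: an application database serving transactions.
app_db = sqlite3.connect(":memory:")
app_db.execute("CREATE TABLE payments (id INTEGER, team TEXT, amount REAL)")
app_db.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, "finance", 100.0), (2, "finance", 50.0), (3, "marketing", 75.0)],
)

# 2. Data lake: export the table to a CSV file; no schema is enforced here.
lake_file = Path(tempfile.mkdtemp()) / "payments.csv"
with lake_file.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "team", "amount"])
    writer.writerows(app_db.execute("SELECT id, team, amount FROM payments"))

# 3. Data warehouse: decide on a schema and load the file into a modeled table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_payments (id INTEGER, team TEXT, amount REAL)")
with lake_file.open(newline="") as f:
    rows = [(int(r["id"]), r["team"], float(r["amount"])) for r in csv.DictReader(f)]
warehouse.executemany("INSERT INTO fact_payments VALUES (?, ?, ?)", rows)

# 4. Data mart: expose only what one user group needs, here the finance team.
warehouse.execute(
    "CREATE VIEW finance_mart AS "
    "SELECT id, amount FROM fact_payments WHERE team = 'finance'"
)

# 5. Data end consumer: an ad hoc query against the data mart.
finance_total = warehouse.execute("SELECT SUM(amount) FROM finance_mart").fetchone()[0]
print(finance_total)
```

Note where the schema questions get answered: the CSV file in step 2 carries only text, and the types are decided in step 3 when the data enters the warehouse.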

Are all data life cycles like this? No. As with the analogy of water flowing upstream and downstream, different circumstances require different data life cycles, and that's where data engineers need to be able to design the data pipeline architecture. But the preceding data life cycle is a very common pattern. In the past 10 years as a data consultant, I have had the opportunity to work with more than 30 companies from many industries, including financial, government, telecommunication, and e-commerce. Most of the companies that I worked with followed this pattern or were at least going in that direction.

As an overall summary of this section, we've learned that data has mostly lived in silos since the early days, which drives the need for the data warehouse and the data lake. Data moves from one system to another because specific needs call for specific technologies, and we've looked at a very common pattern in data engineering. In the next section, let's try to understand the role of a data engineer, who is responsible for this.