Building a Data Warehouse in BigQuery

The power of a data warehouse is that organizations can combine multiple sources of information into a single place that becomes a single source of truth. The dream of data analytics is that every aspect of an organization's business relies on data. That condition is met when all business decision-makers know how to access data, trust it, and can make decisions based on it.

Unfortunately, most of the time, this dream is far removed from reality. There are many challenges along the way. Based on my experience, there are three main challenges: technology bottlenecks, data consistency, and the ability to serve multiple business purposes.

These challenges arise naturally when we build a data warehouse; they are not limited to certain technologies or organizations. In this chapter, we will learn about them through two hands-on scenarios. We will mainly use BigQuery as the GCP data warehouse cloud...

Technical requirements

In this chapter's exercises, we will use the following GCP services: BigQuery and GCS. If you have never opened either of these services in your GCP console, open them and enable their APIs.

Make sure you have your GCP console, Cloud Shell, and Cloud Editor ready.

Before starting, you must have a basic knowledge of Python programming, SQL, Linux commands, and Git.

All the example scripts are developed using Python 3. When running them from Cloud Shell, make sure you use the python3 command rather than python, which may still point to Python 2.

Download the example code and the dataset here: https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-3

Introduction to GCS and BigQuery

GCS is an object storage service. It is serverless and fully managed by GCP, which means we don't need to think about any of GCS's underlying infrastructure. For example, we don't need to think about pre-sizing the storage, network bandwidth, the number of nodes, or any other infrastructure concerns.

What is object storage? Object storage is a highly scalable data storage architecture that can store very large amounts of data in any format.

Because the technology can store data of almost any size and in any format, developers often use GCS to store large files, for example, images, videos, and large CSV data. But, from the data engineering perspective, we will often use GCS to store files such as database dumps, historical data exported from BigQuery, machine learning model files, and any other file-based artifacts.
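
For illustration, here is a minimal sketch of uploading a local file to GCS using the google-cloud-storage Python client. The project ID, bucket name, and file paths are placeholders, and the bucket is assumed to exist already:

from google.cloud import storage

# Placeholder names for illustration only
client = storage.Client(project="my-gcp-project")
bucket = client.bucket("my-example-bucket")  # an existing bucket

# Upload a local CSV file as an object in the bucket
blob = bucket.blob("chapter-3/dataset/stations.csv")
blob.upload_from_filename("stations.csv")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")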

BigQuery is a serverless data warehouse...

Introduction to the BigQuery console

The best way to understand BigQuery is by using it, so let’s start with a simple exercise. This first exercise will focus on understanding the BigQuery console. The BigQuery console is your main user interface for developing and using BigQuery. In this exercise, we will create a dataset and table and query the table using the BigQuery console.

Let’s start our exercise by opening the BigQuery console from our browser:

  1. Open your GCP console.
  2. Go to the navigation bar and choose BigQuery. This is the BigQuery main page, called the BigQuery console:
Figure 3.1 – BigQuery console

  3. As you can see, there are many buttons and panels in the BigQuery console. The page has three main sections. On the left-hand side, there is the BigQuery menu bar, where you can see SQL Workspace, Data Transfers, Scheduled Queries, and the other menus. The second section is the Explorer section...
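
The same three operations from this exercise, creating a dataset, creating a table, and querying it, can also be performed programmatically. Here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names are placeholders, not the ones used in the book's exercise:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Create a dataset (no-op if it already exists)
client.create_dataset("my_dataset", exists_ok=True)

# Create a table with a simple schema
schema = [
    bigquery.SchemaField("station_id", "STRING"),
    bigquery.SchemaField("name", "STRING"),
]
table = bigquery.Table("my-gcp-project.my_dataset.my_table", schema=schema)
client.create_table(table, exists_ok=True)

# Query the table and print the results
query = "SELECT station_id, name FROM `my-gcp-project.my_dataset.my_table`"
for row in client.query(query).result():
    print(row.station_id, row.name)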

Preparing the prerequisites before developing our data warehouse

Before starting our practice exercise, let’s carry out these small but important steps for authentication purposes. In this section, we will do the following:

  1. Access Cloud Shell.
  2. Check our credentials using gcloud info.
  3. Initialize our credentials using gcloud init.
  4. Download the example code and datasets from the Git repository.
  5. Upload data to GCS.

Let’s look at each of these steps in detail.

Step 1 – Accessing Cloud Shell

Revisit Chapter 2, Big Data Capabilities on GCP, specifically the Using GCP Cloud Shell section, if you haven’t accessed your Cloud Shell in the GCP console.

Step 2 – Checking the current setup using the command line

We want to check our current setup. To do that, type the following in Cloud Shell:

$ gcloud info

Click Authorize if prompted.

This command will give you information about installed...
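
The remaining steps (3 to 5) follow the same command-line pattern. As a rough sketch, using the repository URL from the Technical requirements section and a placeholder bucket name that you must replace with a globally unique one of your own:

$ gcloud init
$ git clone https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition.git
$ cd Data-Engineering-with-Google-Cloud-Platform-Second-Edition/chapter-3
$ gsutil mb gs://my-example-bucket
$ gsutil cp -r dataset gs://my-example-bucket/

Here, dataset stands for the data directory in the downloaded repository; the actual directory name may differ, so check the repository layout first.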

Practicing developing a data warehouse

Now we are set and ready to build our first data warehouse. We will proceed with the help of two scenarios, each with a different learning purpose.

In the first scenario, we are going to focus on how to use the tools. After understanding the tools, in the second scenario, we will focus on the practice. Practice here means learning how to build a solution, not only learning how to use the tools.

Even when we know how to use the tools, there are many possible ways to apply them. In practice, we usually learn from common best practices, patterns, and the theories of other practitioners.

We will use the San Francisco bike-sharing dataset, which comes from a bike-sharing company that records the trip data of its members. Each bike trip contains information about the stations involved, and each station is located in a certain region. This dataset is very simple compared to a real-world...
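
As a preview of the loading step we will practice, here is a minimal sketch of loading a CSV file from GCS into a BigQuery table with the Python client. The bucket, dataset, and table names are placeholders rather than the exact ones used in the scenarios:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Configure a CSV load job that infers the schema automatically
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Placeholder GCS URI and destination table
uri = "gs://my-example-bucket/stations.csv"
table_id = "my-gcp-project.raw_bikesharing.stations"

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")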

BigQuery’s useful features

Having completed the previous section's exercise, where we used BigQuery in a practical scenario to develop a data warehouse, let's use this last section of the chapter to step back and look at BigQuery as a tool again.

BigQuery is an integral tool in the GCP data engineering ecosystem. Even though BigQuery is, at its core, a database, it comes with many more features on top, such as machine learning, data transfer, and query scheduling.

I will use this section to guide you quickly through the important additional BigQuery features that you might use in the future. As usual, I will focus on helping you narrow down what is important to you as a data engineer.

As a reminder, the best source for detailed information about each feature is the GCP public documentation. This book will only give a high-level description of features so that you can understand when certain features are relevant to...
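
As one concrete example, scheduled queries are managed through the BigQuery Data Transfer Service. The following is a rough sketch using the google-cloud-bigquery-datatransfer Python client; the project ID, dataset, table name template, query, and schedule are all placeholder values:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-gcp-project")  # placeholder

# A placeholder query that runs every 24 hours and overwrites its table
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="example-scheduled-query",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT CURRENT_DATE() AS run_date",
        "destination_table_name_template": "daily_example",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print("Created scheduled query:", transfer_config.name)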

Summary

In this chapter, we've practiced using BigQuery to build a data warehouse. In general, we've covered three main aspects: how to use the tools, how to load data into BigQuery, and how to model data in a data warehouse.

After following all the steps in this chapter, you will have a better understanding of the data life cycle and how data moves from place to place. We also practiced the ELT process, extracting data from a MySQL database, loading it into BigQuery, and applying transformations to answer business questions. On top of that, we did it all on a fully managed service in the cloud, spending zero time worrying about infrastructure.

By way of a footnote for this chapter, I want to remind you that, even though we have covered the common practices of using BigQuery, we haven’t covered all of its features. There are a lot of other features in BigQuery that are worth checking out...

Exercise – Scenario 3

As a final activity in this chapter, you can do a self-guided exercise to solve an additional business question. Our operational user from scenario 2 has this additional question:

Show me the top three regions that have the most female riders as of the most recent date (2018-01-02).

Because the gender of members is not yet included in our fact and dimension tables, you need to create different fact and dimension tables.

Remember that the data model is subjective, especially in the Kimball method. There is no right or wrong answer to the question. As we’ve discussed in this chapter, people can use different data models to represent the real world.

Try to solve it yourself, then compare your answer to the solution in the Git code example.
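
If you want a starting point after attempting it yourself, the following is a minimal sketch of one possible query shape, not the book's actual solution. The table and column names (such as member_gender and region_name) are hypothetical and must be adapted to your own data model:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Hypothetical fact and dimension tables with assumed column names
query = """
SELECT r.region_name,
       COUNT(*) AS total_female_trips
FROM `my-gcp-project.dwh_bikesharing.fact_trips` f
JOIN `my-gcp-project.dwh_bikesharing.dim_stations` s
  ON f.start_station_id = s.station_id
JOIN `my-gcp-project.dwh_bikesharing.dim_regions` r
  ON s.region_id = r.region_id
WHERE f.trip_date = '2018-01-02'
  AND f.member_gender = 'Female'
GROUP BY r.region_name
ORDER BY total_female_trips DESC
LIMIT 3
"""

for row in client.query(query).result():
    print(row.region_name, row.total_female_trips)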
