Building a Data Warehouse in BigQuery

The power of a data warehouse is that organizations can combine multiple sources of information into a single place that becomes a single source of truth. The dream of data analytics is that every aspect of an organization's business relies on data. That condition is met when all business decision-makers know how to access data, trust it, and can make decisions based on it.

Unfortunately, most of the time, this dream is far removed from reality. There are many challenges along the way. Based on my experience, there are three main challenges: technology bottlenecks, data consistency, and the ability to serve multiple business purposes.

These challenges arise naturally when we build a data warehouse; they are not limited to certain technologies or organizations. In this chapter, we will learn about them through two hands-on scenarios. We will mainly use BigQuery as the GCP data warehouse cloud...

Technical requirements

In this chapter's exercises, we will use the following GCP services: BigQuery and GCS. If you have never opened either of these services in your GCP console, open them and enable their APIs.

Make sure you have your GCP console, Cloud Shell, and Cloud Editor ready.

Before starting, you must have a basic knowledge of Python programming, SQL, Linux commands, and Git.

All the example scripts are developed using Python 3. When running them from Cloud Shell, make sure you use the python3 command rather than python, which may still point to Python 2.

Download the example code and the dataset here: https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-3

Introduction to GCS and BigQuery

GCS is an object storage service. It is serverless and fully managed by GCP, which means we don't need to think about any of GCS's underlying infrastructure. For example, we don't need to think about pre-sizing the storage, network bandwidth, the number of nodes, or any other infrastructure concerns.

What is object storage? Object storage is a highly scalable data storage architecture that can store very large amounts of data in any format.

Because the technology can store data of almost any size and in any format, developers often use GCS to store large files, for example, images, videos, and large CSV data. But, from the data engineering perspective, we will often use GCS to store files such as database dumps, historical data exported from BigQuery, machine learning model files, and any other file-based artifacts.
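
For illustration, here is a minimal sketch of uploading a local file to GCS using the google-cloud-storage Python client. The project ID, bucket name, and file paths are placeholders, and the bucket is assumed to exist already:

from google.cloud import storage

# Placeholder names for illustration only
client = storage.Client(project="my-gcp-project")
bucket = client.bucket("my-example-bucket")  # an existing bucket

# Upload a local CSV file as an object in the bucket
blob = bucket.blob("chapter-3/dataset/stations.csv")
blob.upload_from_filename("stations.csv")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")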

BigQuery is a serverless data warehouse...

Introduction to the BigQuery console

The best way to understand BigQuery is by using it, so let’s start with a simple exercise. This first exercise will focus on understanding the BigQuery console. The BigQuery console is your main user interface for developing and using BigQuery. In this exercise, we will create a dataset and table and query the table using the BigQuery console.

Let’s start our exercise by opening the BigQuery console from our browser:

  1. Open your GCP console.
  2. Go to the navigation bar and choose BigQuery. This is the BigQuery main page, called the BigQuery console:
Figure 3.1 – BigQuery console

  3. As you can see, there are many buttons and panels in the BigQuery console. The page has three main sections. On the left-hand side, there is the BigQuery menu bar, where you can see SQL Workspace, Data Transfers, Scheduled Queries, and the other menus. The second section is the Explorer section...
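
The same three operations from this exercise, creating a dataset, creating a table, and querying it, can also be performed programmatically. Here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names are placeholders, not the ones used in the book's exercise:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Create a dataset (no-op if it already exists)
client.create_dataset("my_dataset", exists_ok=True)

# Create a table with a simple schema
schema = [
    bigquery.SchemaField("station_id", "STRING"),
    bigquery.SchemaField("name", "STRING"),
]
table = bigquery.Table("my-gcp-project.my_dataset.my_table", schema=schema)
client.create_table(table, exists_ok=True)

# Query the table and print the results
query = "SELECT station_id, name FROM `my-gcp-project.my_dataset.my_table`"
for row in client.query(query).result():
    print(row.station_id, row.name)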

Preparing the prerequisites before developing our data warehouse

Before starting our practice exercise, let’s carry out these small but important steps for authentication purposes. In this section, we will do the following:

  1. Access Cloud Shell.
  2. Check our credentials using gcloud info.
  3. Initialize our credentials using gcloud init.
  4. Download the example code and datasets from the Git repository.
  5. Upload data to GCS.

Let’s look at each of these steps in detail.

Step 1 – Accessing Cloud Shell

Revisit Chapter 2, Big Data Capabilities on GCP, specifically the Using GCP Cloud Shell section, if you haven’t accessed your Cloud Shell in the GCP console.

Step 2 – Checking the current setup using the command line

We want to check our current setup. To do that, type the following in Cloud Shell:

$ gcloud info

Click Authorize if prompted.

This command will give you information about installed...
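
The remaining steps (3 to 5) follow the same command-line pattern. As a rough sketch, using the repository URL from the Technical requirements section and a placeholder bucket name that you must replace with a globally unique one of your own:

$ gcloud init
$ git clone https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition.git
$ cd Data-Engineering-with-Google-Cloud-Platform-Second-Edition/chapter-3
$ gsutil mb gs://my-example-bucket
$ gsutil cp -r dataset gs://my-example-bucket/

Here, dataset stands for the data directory in the downloaded repository; the actual directory name may differ, so check the repository layout first.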

Practicing developing a data warehouse

Now we are set and ready to build our first data warehouse. We will proceed with the help of two scenarios, each with a different learning purpose.

In the first scenario, we are going to focus on how to use the tools. After understanding the tools, in the second scenario, we will focus on the practice. Practice here means learning how to build a solution, not only learning how to use the tools.

Even when we know how to use the tools, there are many possible ways to apply them. In practice, we usually learn from common best practices, patterns, and the theories of other practitioners.

We will use the San Francisco bike-sharing dataset, which comes from a bike-sharing company that records the trip data of its members. Each bike trip contains information about the stations involved, and each station is located in a certain region. This dataset is very simple compared to a real-world...
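
As a preview of the loading step we will practice, here is a minimal sketch of loading a CSV file from GCS into a BigQuery table with the Python client. The bucket, dataset, and table names are placeholders rather than the exact ones used in the scenarios:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Configure a CSV load job that infers the schema automatically
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Placeholder GCS URI and destination table
uri = "gs://my-example-bucket/stations.csv"
table_id = "my-gcp-project.raw_bikesharing.stations"

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")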

BigQuery’s useful features

Having completed the previous section's exercise, where we used BigQuery in a practical scenario to develop a data warehouse, let's use this last section of the chapter to step back and look at BigQuery as a tool again.

BigQuery is an integral tool in the GCP data engineering ecosystem. Even though BigQuery is, at its core, a database, it comes with many more features on top, such as machine learning, data transfer, and query scheduling.

I will use this section to guide you quickly through the important additional BigQuery features that you might use in the future. As usual, I will focus on helping you narrow down what is important to you as a data engineer.

As a reminder, the best source for detailed information about each feature is the GCP public documentation. This book will only give a high-level description of features so that you can understand when certain features are relevant to...
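
As one concrete example, scheduled queries are managed through the BigQuery Data Transfer Service. The following is a rough sketch using the google-cloud-bigquery-datatransfer Python client; the project ID, dataset, table name template, query, and schedule are all placeholder values:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-gcp-project")  # placeholder

# A placeholder query that runs every 24 hours and overwrites its table
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="example-scheduled-query",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT CURRENT_DATE() AS run_date",
        "destination_table_name_template": "daily_example",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print("Created scheduled query:", transfer_config.name)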

Summary

In this chapter, we've practiced using BigQuery to build a data warehouse. In general, we've covered three main aspects: how to use the tools, how to load data into BigQuery, and how to model data in a data warehouse.

After following all the steps in this chapter, you will have a better understanding of the data life cycle and how data moves from place to place. We also practiced the ELT process, extracting data from a MySQL database, loading it into BigQuery, and applying transformations to answer business questions. On top of that, we did it all on a fully managed service in the cloud, spending zero time worrying about infrastructure.

By way of a footnote for this chapter, I want to remind you that, even though we have covered the common practices of using BigQuery, we haven’t covered all of its features. There are a lot of other features in BigQuery that are worth checking out...

Exercise – Scenario 3

As a final activity in this chapter, you can do a self-guided exercise to solve an additional business question. Our operational user from scenario 2 has this additional question:

Show me the top three regions that have the most female riders as of the most recent date (2018-01-02).

Because the gender of members is not yet included in our fact and dimension tables, you need to create different fact and dimension tables.

Remember that the data model is subjective, especially in the Kimball method. There is no right or wrong answer to the question. As we’ve discussed in this chapter, people can use different data models to represent the real world.

Try to solve it yourself, then compare your answer to the solution in the Git code example.
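
If you want a starting point after attempting it yourself, the following is a minimal sketch of one possible query shape, not the book's actual solution. The table and column names (such as member_gender and region_name) are hypothetical and must be adapted to your own data model:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Hypothetical fact and dimension tables with assumed column names
query = """
SELECT r.region_name,
       COUNT(*) AS total_female_trips
FROM `my-gcp-project.dwh_bikesharing.fact_trips` f
JOIN `my-gcp-project.dwh_bikesharing.dim_stations` s
  ON f.start_station_id = s.station_id
JOIN `my-gcp-project.dwh_bikesharing.dim_regions` r
  ON s.region_id = r.region_id
WHERE f.trip_date = '2018-01-02'
  AND f.member_gender = 'Female'
GROUP BY r.region_name
ORDER BY total_female_trips DESC
LIMIT 3
"""

for row in client.query(query).result():
    print(row.region_name, row.total_female_trips)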
