Building a Data Lake Using Dataproc

A data lake shares similarities with a data warehouse, yet its fundamental distinction lies in the nature of stored content. Unlike a data warehouse, a data lake is designed to manage extensive raw data, agnostic to its eventual value or purpose. This pivotal divergence reshapes approaches to data storage and retrieval within a data lake, setting it apart from the principles that we learned in Chapter 3, Building a Data Warehouse in BigQuery.

This chapter helps you understand how to build a data lake using Dataproc, which is a managed Hadoop cluster in Google Cloud Platform (GCP). But, more importantly, it helps you understand the key benefit of using a data lake in the cloud, which is allowing the use of ephemeral clusters.

Here is a high-level outline of this chapter:

  • Introduction to Dataproc
  • Exercise – Building a data lake on a Dataproc cluster
  • Exercise – Creating and running jobs on a Dataproc cluster
  • Understanding...

Technical requirements

Before we begin this chapter, make sure you have the following prerequisites ready.

In this chapter’s exercises, the GCP services that we will use are Dataproc, Google Cloud Storage (GCS), BigQuery, and Cloud Composer. If you have never opened any of these services in your GCP console, open them and enable the APIs.

Make sure you have your GCP console, Cloud Shell, and Cloud Shell Editor ready.

Download the example code and the dataset from https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-5.

Be aware of the costs you might incur from the Dataproc and Cloud Composer clusters. Make sure you delete all the environments after the exercises to prevent any unexpected charges.

Introduction to Dataproc

Dataproc is a Google-managed service for Hadoop environments. It manages the underlying virtual machines (VMs), operating systems, and Hadoop software installations. Using Dataproc, Hadoop developers can focus on developing jobs and submitting them to Dataproc.

From a data engineering perspective, understanding Dataproc essentially means understanding Hadoop and the data lake concept. If you are not familiar with Hadoop, let’s learn about it in the next section.

A brief history of the data lake and Hadoop ecosystem

The popularity of the data lake rose in the 2010s. Companies started to talk about this concept a lot more, compared to the data warehouse, which is similar but different in principle. Storing data as files in a centralized system makes a lot of sense in the modern era, compared to the old days, when companies typically stored and processed data for regular reporting. In the modern era, people use data for exploration from...

Exercise – Building a data lake on a Dataproc cluster

In this exercise, we will use Dataproc to store and process log data. Log data is a good representation of unstructured data. Organizations often need to analyze log data to understand their users’ behavior.

In this exercise, we will learn how to use HDFS and PySpark through different methods. In the beginning, we will use Cloud Shell to get a basic understanding of the technologies. In the later sections, we will use Cloud Shell Editor and submit jobs to Dataproc. But first, let’s create our Dataproc cluster.

Creating a Dataproc cluster on GCP

To create a Dataproc cluster, access your navigation menu and find Dataproc. If this is the first time you’re accessing this page, click the Enable API button. After that, you will find the CREATE CLUSTER button. There are two options: Cluster on Compute Engine and Cluster on GKE. Choose Cluster on Compute Engine, which leads to this Create...
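If you prefer to script cluster creation rather than click through the console, the following is a minimal sketch using the google-cloud-dataproc Python client. The project ID, region, cluster name, and machine types are placeholder assumptions; adjust them to your own environment.

```python
# pip install google-cloud-dataproc
from google.cloud import dataproc_v1

# Placeholder values; replace with your own project, region, and cluster name.
project_id = "my-gcp-project"
region = "us-central1"
cluster_name = "my-dataproc-cluster"

# The client must point at the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A small cluster: one master node and two workers.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks until it completes.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")
```

Either way, the end result is the same: a running cluster that we can submit jobs to in the next exercise.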

Exercise – Creating and running jobs on a Dataproc cluster

In this exercise, we will try three different methods to submit a Dataproc job: running on a permanent Dataproc cluster, running on an ephemeral cluster, and running on Dataproc Serverless.

In the previous exercise, we used the Spark shell to run our Spark syntax, which is common for practice but not for real development. Usually, we would only use the Spark shell for initial checks or to test simple things. In this exercise, we will code Spark jobs in editors and submit them as jobs.
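Before diving into the scenarios, here is a minimal sketch of what the first method, submitting a job to a permanent cluster, can look like when done programmatically with the Dataproc Python client. The project, region, cluster name, and GCS path are placeholder assumptions; in the exercise itself, we will use the actual job files from the chapter’s repository.

```python
from google.cloud import dataproc_v1

project_id = "my-gcp-project"        # placeholder
region = "us-central1"               # placeholder
cluster_name = "my-dataproc-cluster" # the cluster created earlier

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A PySpark job definition: point Dataproc at the main script stored in GCS.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/pyspark/etl_job.py"},
}

# submit_job_as_operation lets us block until the job finishes.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job finished with state: {response.status.state.name}")
```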

Here are the scenarios that we want to try:

  • Preparing log data in GCS and HDFS
  • Developing a Spark ETL job from HDFS to HDFS
  • Developing a Spark ETL job from GCS to GCS
  • Developing a Spark ETL job from GCS to BigQuery

Let’s look at each of these scenarios in detail.

Preparing log data in GCS and HDFS

The log data is in our GitHub repository, located here: https://github.com/PacktPublishing...
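To give a flavor of what such a job looks like, here is a minimal sketch of the “GCS to GCS” scenario. The bucket paths and the parsing logic are assumptions, not the chapter’s actual dataset layout; adapt them to the log format you downloaded from the repository.

```python
# etl_gcs_to_gcs.py - a minimal sketch of a Spark ETL job from GCS to GCS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-etl-gcs-to-gcs").getOrCreate()

# Dataproc ships with the GCS connector, so gs:// paths work out of the box.
raw = spark.read.text("gs://my-bucket/chapter-5/logs/*.log")

# Example transformation: take a date-like prefix and count log lines per day.
parsed = raw.withColumn("log_date", F.substring("value", 1, 10))
daily_counts = parsed.groupBy("log_date").count()

# Write the result back to GCS as Parquet.
daily_counts.write.mode("overwrite").parquet(
    "gs://my-bucket/chapter-5/output/daily_counts"
)

spark.stop()
```

The HDFS variant of this job is essentially the same code with hdfs:// paths instead of gs:// paths, which is exactly what makes the comparison in the next section interesting.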

Understanding the concept of an ephemeral cluster

After running the previous exercises, you may notice that Spark is very useful for processing data, but it has little to no dependence on HDFS. It’s very convenient to use data as is from GCS or BigQuery compared to using HDFS.

What does this mean? It means that we may choose not to store any data in the Hadoop cluster (more specifically, in HDFS) and only use the cluster to run jobs. For cost efficiency, we can turn the cluster on only while a job is running and turn it off afterward.

Furthermore, we can destroy the entire Hadoop cluster when the job is finished and create a new one when we submit a new job. This concept is what’s called an ephemeral cluster.

An ephemeral cluster means the cluster is not permanent. A cluster will only exist when it’s running jobs. There are two main advantages to using this approach:

  • Highly efficient infrastructure cost: With this approach, you don’t need to have a...
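To make the earlier point concrete, that Spark on Dataproc has little dependence on HDFS, here is a small sketch that reads data directly from GCS and from BigQuery without copying anything into the cluster. The bucket path and table name are placeholders, and the BigQuery read assumes the spark-bigquery connector is available on the cluster (or passed with --jars).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("no-hdfs-needed").getOrCreate()

# Read files straight from GCS; no copy into HDFS is required.
logs = spark.read.text("gs://my-bucket/chapter-5/logs/*.log")

# Read a table straight from BigQuery via the spark-bigquery connector.
table = (
    spark.read.format("bigquery")
    .option("table", "my-gcp-project.my_dataset.my_table")
    .load()
)

print(logs.count(), table.count())
```

Because the data lives outside the cluster, deleting the cluster after the job finishes loses nothing, which is precisely what makes the ephemeral approach practical.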

Building an ephemeral cluster using Dataproc and Cloud Composer

Another option for managing ephemeral clusters is using Cloud Composer. In the previous chapter, we learned how to use Airflow to orchestrate BigQuery data loading. But as we’ve already learned, Airflow has many operators, and one of them is, of course, Dataproc.

You should use this approach instead of a workflow template if your jobs are complex, for example, a pipeline that contains many branches, backfill logic, and dependencies on other services, since workflow templates can’t handle these complexities.

For this exercise, if your Cloud Composer environment is no longer available, you don’t need to run it; simply read through the following example code. Once you’ve completed Chapter 4, Building Workflows for Batch Data Loading Using Cloud Composer, you’ll understand the complete concept.

In the following example exercise, we will use Airflow to create a Dataproc cluster...
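As a rough sketch of what such a DAG can look like, the following uses the Google provider’s Dataproc operators to create a cluster, submit a PySpark job, and delete the cluster even if the job fails. The project, region, bucket, and cluster names are placeholders, not the exercise’s actual values.

```python
# A minimal ephemeral-cluster DAG sketch; names and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-gcp-project"
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-spark-cluster"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/pyspark/etl_job.py"},
}

with DAG(
    dag_id="dataproc_ephemeral_cluster_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    submit_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    # Delete the cluster even if the job fails, so nothing keeps running (and billing).
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> submit_job >> delete_cluster
```

Note the trigger rule on the delete task: because it runs regardless of whether the job succeeded, the cluster only exists for the duration of the pipeline run.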

Summary

This chapter covered one component of GCP that allows you to build a data lake, called Dataproc. As we’ve learned in this chapter, learning about Dataproc means learning about Hadoop. We learned about and practiced the core and most popular Hadoop components, HDFS and Spark.

By combining the nature of Hadoop with all the benefits of using the cloud, we also learned about new concepts. A Hadoop ephemeral cluster is relatively new and is only possible because of cloud technology. In a traditional on-premises Hadoop cluster, this highly efficient approach was never an option.

In this chapter, we focused on the core concepts of Spark. We learned about using RDDs and Spark DataFrames. These two concepts are the first entry point before learning about other features such as Spark ML and Spark Streaming. As you become more experienced, you will need to start thinking about optimization, for example, how to manage parallelism, how to speed up joins, and how to...
