Chapter 5: Building a Data Lake Using Dataproc

A data lake is a concept similar to a data warehouse, but the key difference is what you store in it. A data lake's role is to store as much raw data as possible without first knowing what the value or end goal of the data is. Given this key difference, how you store and access data in a data lake differs from what we learned in Chapter 3, Building a Data Warehouse in BigQuery.

This chapter helps you understand how to build a data lake using Dataproc, a managed Hadoop cluster in Google Cloud Platform (GCP). More importantly, it helps you understand the key benefit of using a data lake in the cloud: the ability to use ephemeral clusters.

Here is the high-level outline of this chapter:

  • Introduction to Dataproc
  • Building a data lake on a Dataproc cluster
  • Creating and running jobs on a Dataproc cluster
  • Understanding the concept of the ephemeral cluster
  • Building an ephemeral...

Technical requirements

Before we begin the chapter, make sure you have the following prerequisites ready.

In this chapter's exercises, we will use these GCP services: Dataproc, GCS, BigQuery, and Cloud Composer. If you have never opened any of these services in your GCP console, open each one and enable its API.

Make sure you have your GCP console, Cloud Shell, and Cloud Shell Editor ready.

Download the example code and the dataset here: https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform/tree/main/chapter-5.

Be aware of the costs that the Dataproc and Cloud Composer clusters can incur. Make sure you delete all the environments after the exercises to prevent any unexpected costs.

Introduction to Dataproc

Dataproc is a Google-managed service for Hadoop environments. It manages the underlying virtual machines, operating systems, and Hadoop software installations. Using Dataproc, Hadoop developers can focus on developing jobs and submitting them to Dataproc. 

From a data engineering perspective, understanding Dataproc essentially means understanding Hadoop and the data lake concept. If you are not familiar with Hadoop, let's learn about it in the next section.

A brief history of the data lake and Hadoop ecosystem

The data lake rose in popularity in the 2010s. Companies started to talk about this concept a lot more, compared to the data warehouse, which is similar but different in principle. The concept of storing data as files in a centralized system makes a lot of sense in the modern era, compared to the old days, when companies stored and processed data mainly for regular reporting. In the modern era, people use data for exploration from...

Exercise – Building a data lake on a Dataproc cluster

In this exercise, we will use Dataproc to store and process log data. Log data is a good representation of unstructured data. Organizations often need to analyze log data to understand their users' behavior. 

In this exercise, we will learn how to use HDFS and PySpark using different methods. In the beginning, we will use Cloud Shell to get a basic understanding of the technologies. In the later sections, we will use the Cloud Shell Editor and submit the jobs to Dataproc. But as the first step, let's create our Dataproc cluster.

Creating a Dataproc cluster on GCP

To create a Dataproc cluster, access your navigation menu and find Dataproc. Click the CREATE CLUSTER button, which leads to the Create a cluster page:

Figure 5.2 – Create a cluster page

There are many configuration options in Dataproc. We don't need to set everything, as most of the settings are optional. For...
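
If you prefer to create the cluster programmatically instead of through the console form, the following is a minimal sketch using the google-cloud-dataproc Python client library. Note that the project ID, region, cluster name, and machine types here are placeholders for illustration, not values prescribed by this exercise:

from google.cloud import dataproc_v1

# Placeholder values - replace with your own project and region.
project_id = "your-project-id"
region = "us-central1"

# The client must point at the regional Dataproc endpoint.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks until the
# cluster is ready.
operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")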

Exercise – Creating and running jobs on a Dataproc cluster

In this exercise, we will try two different methods of submitting a Dataproc job. In the previous exercise, we used the Spark shell to run our Spark syntax, which is common when practicing but not common in real development. Usually, we would only use the Spark shell for initial checks or to test simple things. In this exercise, we will write Spark jobs in an editor and submit them to Dataproc as jobs.
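
As a rough illustration of what submitting a job means in practice, the following sketch uses the google-cloud-dataproc Python client to submit a PySpark script that has already been uploaded to GCS. The project, region, cluster name, and script path are placeholder assumptions, and the same submission can also be done from the Dataproc Jobs page in the console or from Cloud Shell:

from google.cloud import dataproc_v1

# Placeholder values - replace with your own project, region, and cluster.
project_id = "your-project-id"
region = "us-central1"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A PySpark job only needs the main script, assumed here to be in GCS.
job = {
    "placement": {"cluster_name": "example-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/code/logs_etl_example.py"},
}

# submit_job_as_operation returns an operation; result() waits until the job
# reaches a terminal state.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
result = operation.result()
print(f"Job finished with state: {result.status.state.name}")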

Here are the scenarios that we want to try:

  • Preparing log data in GCS and HDFS
  • Developing Spark ETL from HDFS to HDFS
  • Developing Spark ETL from GCS to GCS
  • Developing Spark ETL from GCS to BigQuery

Let's look at each of these scenarios in detail.
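
Before walking through them, here is a minimal sketch of the general shape such a PySpark ETL script takes, using the GCS-to-GCS scenario as an example. The bucket paths are placeholders and the parsing logic is deliberately simplified, so treat it as an outline rather than the chapter's actual job:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Create the Spark session; on Dataproc the GCS connector is preinstalled,
# so gs:// paths can be read and written directly.
spark = SparkSession.builder.appName("logs_etl_example").getOrCreate()

# Placeholder paths - replace with your own bucket.
source_path = "gs://your-bucket/logs_example/*"
target_path = "gs://your-bucket/output/logs_parquet"

# Extract: read the raw log lines as a single-column DataFrame.
logs = spark.read.text(source_path)

# Transform: split each line on spaces and keep a few illustrative fields.
parsed = logs.select(
    split(logs.value, " ").getItem(0).alias("ip"),
    split(logs.value, " ").getItem(3).alias("timestamp"),
    split(logs.value, " ").getItem(5).alias("method"),
)

# Load: write the result back to GCS in Parquet format.
parsed.write.mode("overwrite").parquet(target_path)

spark.stop()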

Preparing log data in GCS and HDFS

The log data is in our GitHub repository, located here:

https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform/tree/main/chapter-5/dataset/logs_example

If you haven't cloned the repository...

Understanding the concept of the ephemeral cluster

After running the previous exercises, you may have noticed that Spark is very useful for processing data but has little to no dependence on Hadoop storage (HDFS). It's very convenient to use data as is from GCS or BigQuery compared to using HDFS.

What does this mean? It means that we may choose not to store any data in the Hadoop cluster (more specifically, in HDFS) and only use the cluster to run jobs. For cost efficiency, we can turn the cluster on only when a job is running and turn it off afterward. Furthermore, we can destroy the entire Hadoop cluster when the job is finished and create a new one when we submit a new job. This concept is what's called an ephemeral cluster.

An ephemeral cluster means the cluster is not permanent. A cluster will only exist when it's running jobs. There are two main advantages to using this approach:

  • Highly efficient infrastructure cost: With this approach, you don't...

Building an ephemeral cluster using Dataproc and Cloud Composer

Another option for managing ephemeral clusters is to use Cloud Composer. In the previous chapter, we learned how to use Airflow to orchestrate BigQuery data loading. As we've already learned, Airflow has many operators, and one of them is, of course, Dataproc.

You should use this approach instead of a workflow template if your jobs are complex, for example, when developing a pipeline that contains many branches, backfilling logic, or dependencies on other services, since workflow templates can't handle these complexities.

In this section, we will use Airflow to create a Dataproc cluster, submit a PySpark job, and delete the cluster when it's finished.

Check the full code in the GitHub repository:

https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform/tree/main/chapter-5

To use the Dataproc operators in Airflow, we need to import the operators, like this:

from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
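
Building on that import, the following is a minimal sketch of a DAG that chains the three operators to create the cluster, submit a PySpark job, and delete the cluster at the end. The project, region, cluster configuration, and job definition are placeholders for illustration, not the chapter's actual values:

from datetime import datetime

from airflow import DAG

# Placeholder values - replace with your own project, region, and bucket.
PROJECT_ID = "your-project-id"
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-cluster"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/code/logs_etl_example.py"},
}

with DAG(
    dag_id="ephemeral_dataproc_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    submit_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    # trigger_rule="all_done" deletes the cluster even if the job fails.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",
    )

    create_cluster >> submit_job >> delete_cluster

Setting the delete task's trigger rule to all_done is a common pattern, so the ephemeral cluster is removed even when the job fails and does not linger and incur cost.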

Summary 

This chapter covered one component of GCP that allows you to build a data lake, called Dataproc. As we've learned in this chapter, learning about Dataproc means learning about Hadoop. We learned about and practiced the core and most popular Hadoop components, HDFS and Spark. 

By combining the nature of Hadoop with all the benefits of using the cloud, we also learned about new concepts. The Hadoop ephemeral cluster is relatively new and is only possible because of cloud technology. In a traditional on-premises Hadoop cluster, this highly efficient concept is never an option.  

From the perspective of a data engineer working on GCP, using Hadoop or Dataproc is optional. Similar functionality can be achieved using fully serverless components on GCP, for example, using GCS and BigQuery for storage rather than HDFS, and using Dataflow rather than Spark for processing unstructured data. But the popularity of Spark is one of the main reasons people use Dataproc...
