Data Engineering with Google Cloud Platform


Product type: Book
Published: Mar 2022
Publisher: Packt
ISBN-13: 9781800561328
Pages: 440
Edition: 1st
Author: Adi Wijaya

Table of Contents (17 chapters)

Preface
Section 1: Getting Started with Data Engineering with GCP
Chapter 1: Fundamentals of Data Engineering
Chapter 2: Big Data Capabilities on GCP
Section 2: Building Solutions with GCP Components
Chapter 3: Building a Data Warehouse in BigQuery
Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer
Chapter 5: Building a Data Lake Using Dataproc
Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio
Chapter 8: Building Machine Learning Solutions on Google Cloud Platform
Section 3: Key Strategies for Architecting Top-Notch Data Pipelines
Chapter 9: User and Project Management in GCP
Chapter 10: Cost Strategy in GCP
Chapter 11: CI/CD on Google Cloud Platform for Data Engineers
Chapter 12: Boosting Your Confidence as a Data Engineer
Other Books You May Enjoy

Exercise: Creating and running jobs on a Dataproc cluster

In this exercise, we will try two different methods of submitting a Dataproc job. In the previous exercise, we used the Spark shell to run Spark syntax interactively, which is common when practicing but rare in real development. The Spark shell is usually reserved for initial checks or quick experiments. In this exercise, we will write Spark jobs in an editor and submit them to the cluster as jobs.
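As a sketch of the job-submission route (the script name, cluster name, and region below are placeholders, not values from the book), a PySpark script can be submitted to a Dataproc cluster with the gcloud CLI:

```shell
# Submit a local PySpark script as a Dataproc job.
# spark_etl.py, my-dataproc-cluster, and us-central1 are
# placeholder values -- replace them with your own.
gcloud dataproc jobs submit pyspark spark_etl.py \
    --cluster=my-dataproc-cluster \
    --region=us-central1
```

Unlike the Spark shell, this runs the whole script as a tracked job, so its status and logs appear in the Dataproc Jobs page in the console.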

Here are the scenarios that we want to try:

  • Preparing log data in GCS and HDFS
  • Developing Spark ETL from HDFS to HDFS
  • Developing Spark ETL from GCS to GCS
  • Developing Spark ETL from GCS to BigQuery

Let's look at each of these scenarios in detail.
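For the last scenario, note that writing from Spark to BigQuery requires the spark-bigquery connector on the job's classpath. One way to attach it at submission time (the script name, cluster name, and region are placeholders) is the `--jars` flag with the publicly hosted connector JAR:

```shell
# Attach the Google-hosted spark-bigquery connector when submitting.
# gcs_to_bigquery.py, my-dataproc-cluster, and us-central1 are placeholders.
gcloud dataproc jobs submit pyspark gcs_to_bigquery.py \
    --cluster=my-dataproc-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```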

Preparing log data in GCS and HDFS

The log data is in our GitHub repository, located here:

https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform/tree/main/chapter-5/dataset/logs_example

If you haven't cloned the repository...
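One way to clone the repository and stage the example logs in both GCS and HDFS is sketched below. The bucket name and HDFS path are placeholders, and the `hdfs` commands must be run on the Dataproc cluster's master node (for example, over SSH):

```shell
# Clone the book's repository and move into the log dataset folder.
git clone https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform.git
cd Data-Engineering-with-Google-Cloud-Platform/chapter-5/dataset/logs_example

# Copy the logs to GCS (replace my-bucket with your bucket name).
gsutil cp * gs://my-bucket/chapter-5/dataset/logs_example/

# Copy the logs to HDFS (run these on the Dataproc master node).
hdfs dfs -mkdir -p /data/logs_example
hdfs dfs -put * /data/logs_example/
```

With the same files in both storage layers, the HDFS-to-HDFS and GCS-to-GCS ETL scenarios can read identical input.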
