Data Engineering with Google Cloud Platform - Second Edition


Product type Book
Published in Apr 2024
Publisher Packt
ISBN-13 9781835080115
Pages 476 pages
Edition 2nd Edition
Author Adi Wijaya

Table of Contents (19 chapters)

  • Preface
  • Part 1: Getting Started with Data Engineering with GCP
    • Chapter 1: Fundamentals of Data Engineering
    • Chapter 2: Big Data Capabilities on GCP
  • Part 2: Build Solutions with GCP Components
    • Chapter 3: Building a Data Warehouse in BigQuery
    • Chapter 4: Building Workflows for Batch Data Loading Using Cloud Composer
    • Chapter 5: Building a Data Lake Using Dataproc
    • Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
    • Chapter 7: Visualizing Data to Make Data-Driven Decisions with Looker Studio
    • Chapter 8: Building Machine Learning Solutions on GCP
  • Part 3: Key Strategies for Architecting Top-Notch Solutions
    • Chapter 9: User and Project Management in GCP
    • Chapter 10: Data Governance in GCP
    • Chapter 11: Cost Strategy in GCP
    • Chapter 12: CI/CD on GCP for Data Engineers
    • Chapter 13: Boosting Your Confidence as a Data Engineer
  • Index
  • Other Books You May Enjoy

User and Project Management in GCP

In this chapter, we will learn how to design and structure users and projects in Google Cloud Platform (GCP). By understanding user and project management in GCP, you will learn how to turn a development solution into a production-ready one.

In a production-ready solution, it’s very important to manage security by granting access only to the right users. To do that efficiently, we need to understand the underlying principles and strategy.

Managing production-ready solutions is almost impossible without understanding how a GCP project works. Understanding how to design GCP projects is another important aspect of an efficient solution.

This chapter also includes an example approach to provisioning GCP services automatically using Terraform, an infrastructure-building tool.

Specifically, in this chapter, we will cover the following topics:

  • Understanding Identity and Access Management (IAM) in GCP
  • Planning...

Technical requirements

In this chapter’s exercises, we will use the following GCP services:

  • IAM
  • BigQuery
  • Google Cloud Storage (GCS)

If you’ve never opened any of these services in your GCP console, open them and enable the application programming interface (API). We will also use an open source tool called Terraform to help us provision the GCP services using code. It can be downloaded from its public website at https://www.terraform.io/downloads.html. The step-by-step installation will be discussed in the Exercise – creating and running basic Terraform scripts section.

Make sure you have your GCP console, Cloud Shell, and Cloud Shell Editor ready.

Download the example code and the dataset from https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-9/code.

Understanding IAM in GCP

IAM is the central service that manages who can access what – in other words, authorization. IAM manages all authorization within GCP. The concept is simple – you grant roles to accounts so that the accounts have the permissions required to access specific GCP services. Here is a diagram for an account that needs to query a table in BigQuery:

Figure 9.1 – IAM roles, permissions, and GCP service correlation


In the example shown in the preceding diagram, to query a BigQuery table, an account needs, at a minimum, two roles: data viewer and job user. Each role bundles multiple permissions, and each permission allows a specific operation in BigQuery.
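The relationship between roles and permissions can be sketched in a few lines of Python. This is a minimal illustration, not how IAM is implemented: the permission sets below are an abbreviated subset of what the two predefined BigQuery roles contain, and the helper names (`effective_permissions`, `can_query_table`, `missing_permissions`) are hypothetical.

```python
# Abbreviated, illustrative subset of the permissions in two predefined
# BigQuery roles; the real roles contain more permissions than listed here.
ROLE_PERMISSIONS = {
    "roles/bigquery.dataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.getData",
    },
    "roles/bigquery.jobUser": {
        "bigquery.jobs.create",
    },
}

# Assumption for this sketch: querying a table needs one permission to
# run the job and one permission to read the table's data.
QUERY_TABLE_REQUIRES = {"bigquery.jobs.create", "bigquery.tables.getData"}


def effective_permissions(roles):
    """Return the union of the permissions carried by every granted role."""
    permissions = set()
    for role in roles:
        permissions |= ROLE_PERMISSIONS.get(role, set())
    return permissions


def can_query_table(roles):
    """Return True when the granted roles cover every required permission."""
    return QUERY_TABLE_REQUIRES <= effective_permissions(roles)


def missing_permissions(roles):
    """Return the required permissions that the granted roles do not cover."""
    return sorted(QUERY_TABLE_REQUIRES - effective_permissions(roles))


print(can_query_table(["roles/bigquery.dataViewer", "roles/bigquery.jobUser"]))
print(missing_permissions(["roles/bigquery.dataViewer"]))
```

With both roles granted, the check passes. With only the data viewer role, the sketch reports that `bigquery.jobs.create` is missing, which is why the account needs the job user role as well.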

Let’s go through each of the important terms that we use in the IAM space:

  • Account: An account in GCP is one of two types – a user account or a service account:
    • User account: This is the user email. It can be corporate email or personal email, depending...

Planning a GCP project structure

After working through the many exercises in the previous chapters, I believe you have become familiar with GCP. From those exercises, you’ve learned about GCP services, their positioning, and how to use them. In this section, we will take a step back and look at those GCP services from a higher-level point of view.

In all the previous exercises throughout this book, we used only one project. All the GCP services, including BigQuery, GCS buckets, Cloud Composer, and the other services that we used, are enabled and provisioned in one project. My project is called packt-gcp-data-eng. Likewise, you should have your own project, either the default project or the new one that we created in Chapter 2, Big Data Capabilities on GCP. That’s a good enough starting point for learning and development, but in reality, an organization usually has more than one project. There are many scenarios and variations on how...

Understanding the GCP organization, folder, and project hierarchy

A GCP project organizes all your Google Cloud resources. Resources in GCP can be services, billing, accounts, authentications, logs, and monitoring. Resources from one project can be used and accessed by other resources from other projects. So long as the permissions to resources are set correctly, there is no restriction on accessing them between projects.

For example, look at Figure 9.3. The Cloud SQL database from the core-apps-and-db project can be accessed by Cloud Composer in dwh-project. Let’s look at another example – a user account that was created in the core-apps-and-db project can access data from BigQuery in the data project. Note that accounts and authentications are also resources. The key point here is that resources in GCP projects are not isolated.
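The cross-project idea can be sketched as a toy model in Python. This is a simplified illustration under stated assumptions, not the IAM API: the `Project` class, the `service_account_member` helper, and the `composer-sa` service account name are hypothetical, and `roles/cloudsql.client` is used only as a plausible example role for connecting to Cloud SQL. The point it shows is that a role binding is stored on the project that owns the resource, while the account receiving the role can belong to a different project.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Toy model of a project and the IAM bindings attached to it."""
    project_id: str
    # role name -> set of member strings granted that role on this project
    bindings: dict = field(default_factory=dict)

    def grant(self, role, member):
        """Attach a role binding for the member to this project."""
        self.bindings.setdefault(role, set()).add(member)

    def roles_of(self, member):
        """List the roles the member holds on this project."""
        return sorted(role for role, members in self.bindings.items()
                      if member in members)


def service_account_member(name, home_project_id):
    """Build the IAM member string for a service account in home_project_id."""
    return f"serviceAccount:{name}@{home_project_id}.iam.gserviceaccount.com"


core = Project("core-apps-and-db")
dwh = Project("dwh-project")

# Hypothetical service account that lives in dwh-project.
composer_sa = service_account_member("composer-sa", dwh.project_id)

# The binding is stored on the project that owns the Cloud SQL database,
# even though the account was created in another project.
core.grant("roles/cloudsql.client", composer_sa)

print(composer_sa)
print(core.roles_of(composer_sa))  # roles held on core-apps-and-db
print(dwh.roles_of(composer_sa))   # no binding was added on dwh-project
```

In this model, nothing about the account's home project limits where it can be granted access; only the bindings on the target project decide that.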

Now, let’s talk about the GCP folder. One GCP folder can contain one to many GCP projects. GCP folders can also contain one to...

Controlling user access to our data warehouse

Now that we’ve learned about user access at the organization, folder, and project levels, we will look specifically at access control lists (ACLs) in BigQuery. An ACL is the same concept as IAM, but the ACL terminology is more commonly used when talking about the data space. Planning an ACL in BigQuery means planning who can access what in BigQuery.

At a very high level, there are two main types of GCP permissions in BigQuery, as follows:

  • Job permissions: BigQuery has job-level permissions. For example, for a user to be able to run a query inside the project, they need the bigquery.jobs.create permission.

    Note that being able to run a query job doesn’t mean having access to the data. Access to the data is managed by the other permissions, which will be explained next.

  • Access permissions: This one is a little more complicated than job permissions. If we talk about data access, we need to understand that the main goal...

Practicing the concept of IaC using Terraform

Infrastructure as code (IaC) is the practice of provisioning and managing resources using code. In our GCP case, the resources can be the GCP project, BigQuery datasets, GCS buckets, IAM, and all other resources that we’ve learned about throughout this book.

So far, we’ve created our resources using the GCP console’s user interface (UI) or the gcloud command. Imagine that you need to do that manually one by one using the UI for hundreds to thousands of objects throughout a large organization. That can be very painful – not only from a provisioning point of view but also in terms of managing it.

Without an IaC approach, common issues include inconsistency (for example, in naming conventions), forgotten configuration parameters (such as location), and losing track of resources that have been created.
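To make these issues concrete, here is a minimal sketch in plain Python (not Terraform syntax) of what declaring resources as code makes possible: once the resources are written down in one place, naming and missing parameters can be checked automatically before anything is provisioned. The `DatasetSpec` class, the `validate` function, the dataset names, and the naming convention itself are all hypothetical choices made for this illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical naming convention: an environment prefix (dev or prod)
# followed by lowercase words joined by underscores, such as "dev_sales".
NAME_PATTERN = re.compile(r"^(dev|prod)_[a-z0-9]+(_[a-z0-9]+)*$")


@dataclass(frozen=True)
class DatasetSpec:
    """Declared (not yet provisioned) dataset."""
    name: str
    location: Optional[str] = None


def validate(specs):
    """Return a list of problems found in the declared datasets."""
    errors = []
    seen = set()
    for spec in specs:
        if not NAME_PATTERN.match(spec.name):
            errors.append(f"{spec.name}: name breaks the naming convention")
        if not spec.location:
            errors.append(f"{spec.name}: location is not set")
        if spec.name in seen:
            errors.append(f"{spec.name}: declared more than once")
        seen.add(spec.name)
    return errors


# The declared list doubles as an inventory of what should exist.
DECLARED = [
    DatasetSpec("dev_sales", "US"),
    DatasetSpec("ProdFinance", "US"),   # breaks the naming convention
    DatasetSpec("dev_marketing"),       # location was forgotten
]

for problem in validate(DECLARED):
    print(problem)
```

The same declared list also addresses the third issue: because every resource is recorded in code, it serves as a reviewable inventory rather than relying on memory of what was created through the UI.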

With an IaC approach, we can use code to provision our resources. The advantage of using code is that you can implement software...

Summary

In this chapter, we covered three important topics in GCP – namely, IAM, project structure, and BigQuery ACLs. Additionally, we learned about IaC.

Understanding these four topics takes your knowledge from that of a data engineer toward that of a cloud data architect. People with these skills can think not only about the data pipeline but also about the higher-level architecture, which is a very important role in any organization.

Always remember the principle of least privilege, which underpins the design of IAM, project structure, and BigQuery ACLs. Always make sure you give each user only the access they need.

In the next chapter, you’ll discover how data governance using GCP services can unlock the full potential of your data, ensuring usability, security, and accountability.
