Fundamentals of Data Engineering

Years ago, when I first entered the world of data analytics, I assumed data was clean: ready to use and neatly organized. I was excited to experiment with machine learning models, find unusual patterns, and play around with the data. But after years of working with data, I realized that data analytics in big organizations isn't straightforward.

Most of the effort goes into collecting, cleaning, and transforming the data. If you have any experience working with data, I am sure you've noticed something similar. The good news is that these processes can be automated with proper planning, design, and engineering skills. That was the point when I realized that data engineering would be the most critical role in the future of the data science world.

To develop a successful data ecosystem in any organization, the most crucial part is how they design the...

Understanding the data life cycle

Understanding the data life cycle is the first step to becoming a data engineer. If you've worked with data, you know that data doesn't stay in one place; it moves from one storage system to another, from one database to another. Understanding the data life cycle means being able to answer these sorts of questions whenever you want to deliver information to your end user (a small sketch follows the list):

  • Who will consume the data?
  • What data sources should I use?
  • Where should I store the data?
  • When should the data arrive?
  • Why does the data need to be stored in this place?
  • How should the data be processed?
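As a small illustration (my own sketch, not the book's), the answers to these questions can be written down as a simple specification per pipeline. The `PipelineSpec` class and every example value below are assumptions made for the sake of the example.

```python
# A hypothetical sketch: recording the data life cycle answers for one pipeline.
from dataclasses import dataclass


@dataclass
class PipelineSpec:
    consumers: list          # Who will consume the data?
    sources: list            # What data sources should I use?
    storage: str             # Where should I store the data?
    arrival: str             # When should the data arrive?
    storage_reason: str      # Why does the data need to be stored in this place?
    processing: str          # How should the data be processed?


# Example answers for a hypothetical daily sales report.
daily_sales = PipelineSpec(
    consumers=["business analysts"],
    sources=["sales application database"],
    storage="data warehouse",
    arrival="daily, before 07:00",
    storage_reason="analysts query it with SQL",
    processing="batch aggregation of the previous day's transactions",
)
print(daily_sales)
```

Writing the answers down like this forces every downstream design decision (storage, schedule, processing style) to trace back to a consumer need.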

To answer all those questions, we’ll start by looking back a little bit at the history of data technologies.

Understanding the need for a data warehouse

The data warehouse is not a new concept; I believe you've at least heard of it. In fact, the term itself no longer sounds appealing. In my experience, no...

Start with knowing the roles of a data engineer

In later chapters, we will spend much of our time on practical exercises to understand data engineering concepts. But before that, let's take a quick look at the data engineer role.

The role is becoming more and more popular, but the job title itself is relatively new compared to well-established roles such as accountant, lawyer, and doctor. As a result, there is still some debate about what a data engineer should and shouldn't do.

For example, if you go to a hospital and see a doctor, you know for sure that the doctor will do the following:

  1. Examine your condition.
  2. Make a diagnosis of your health issues.
  3. Prescribe medicine.

The doctor wouldn’t do the following:

  1. Clean the hospital.
  2. Make the medicine.
  3. Manage hospital administration.

These boundaries are clear, and the same is true of most well-established job roles. But what about data engineers...

Going through the foundational concepts for data engineering

Even though we will learn many data engineering concepts throughout the book using Google Cloud Platform (GCP), there are some basics that you need to know as a data engineer. In my experience interviewing at data companies, these foundational concepts are often asked about to assess how much you know about data engineering. Take the following examples:

  • What is ETL?
  • What’s the difference between ETL and Extract, Load, and Transform (ELT)?
  • What is big data?
  • How do you handle large volumes of data?

These questions are quite common, yet it is particularly important to understand the underlying concepts deeply, since they affect the decisions we make when architecting our data life cycles.
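Before we dive into ETL in the next section, here is a minimal Python sketch (my own illustration, not code from the book) of the ETL versus ELT distinction: ETL transforms the data in the pipeline before loading it, while ELT loads the raw data first and pushes the transformation into the warehouse as SQL. The sample rows and table names are assumptions, and an in-memory SQLite database stands in for a real warehouse such as BigQuery.

```python
import sqlite3

# Pretend these rows were extracted from a source system (the "E" in both patterns).
raw_rows = [
    {"book_id": "b1", "amount": "12.50"},
    {"book_id": "b2", "amount": ""},       # a dirty row with a missing amount
    {"book_id": "b3", "amount": "7.00"},
]

conn = sqlite3.connect(":memory:")

# ETL: transform in the pipeline, then load only the cleaned result.
transformed = [
    {"book_id": r["book_id"], "amount": float(r["amount"])}
    for r in raw_rows
    if r["amount"]  # drop dirty rows before they reach the warehouse
]
conn.execute("CREATE TABLE sales (book_id TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:book_id, :amount)", transformed)

# ELT: load the raw rows as-is, then transform inside the warehouse with SQL.
conn.execute("CREATE TABLE raw_sales (book_id TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (:book_id, :amount)", raw_rows)
conn.execute(
    "CREATE TABLE clean_sales AS "
    "SELECT book_id, CAST(amount AS REAL) AS amount "
    "FROM raw_sales WHERE amount <> ''"
)

print(conn.execute("SELECT * FROM clean_sales").fetchall())
```

The practical trade-off is where the transformation work happens: in ETL the pipeline pays that cost before loading, while in ELT the warehouse does the heavy lifting after the raw data has landed.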

ETL concept in data engineering

ETL is the key foundation of data engineering. Everything in the data life cycle is ETL; every step from upstream to downstream is a form of ETL. Let's...

Summary

To summarize the first chapter, we've learned the fundamental knowledge we need as data engineers. Here are some key takeaways. First, data doesn't stay in one place; it moves from one place to another, and this movement is called the data life cycle. We also learned that data in a big organization mostly sits in silos, and that we can break down these silos using the concepts of a data warehouse and a data lake.

As someone who has started looking into data engineering roles, you may feel a little lost, since the role of a data engineer varies widely. The key takeaway is not to be confused by the broad expectations in the market: focus on the core first, and expand from it as you gain experience. In this chapter, we've learned what that core is. At the end of the chapter, we covered some key concepts; there are three that every data engineer needs to be familiar with: ETL, big data...

Exercise

You are a data engineer at a book publishing company, and your product manager has asked you to build a single dashboard showing total revenue and a customer satisfaction index.

Your company doesn't have any data infrastructure yet, but you know that it has three applications that contain terabytes of data:

  • The company website
  • A book sales application using MongoDB to store sales transactions, including transactions, book IDs, and author IDs
  • An author portal application using a MySQL database to store authors' personal information, including age

Do the following:

  1. List important follow-up questions for your manager
  2. List your technical thinking process for how to do it at a high level
  3. Draw a data pipeline architecture

There is no right or wrong answer to this exercise. The important thing is that you can imagine how the data flows from upstream to downstream and how it should be processed...
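For comparison only, here is one possible high-level shape of such a pipeline, written as a small Python sketch of my own; every stage and name in it is an assumption, not the book's answer.

```python
# One possible (assumed) high-level flow: land raw data from each source in a
# data lake, transform it into a warehouse, and serve a single dashboard on top.
pipeline = [
    ("extract",   "company website logs",        "data lake (raw zone)"),
    ("extract",   "MongoDB sales transactions",  "data lake (raw zone)"),
    ("extract",   "MySQL author portal data",    "data lake (raw zone)"),
    ("transform", "data lake (raw zone)",        "data warehouse tables"),
    ("aggregate", "data warehouse tables",       "revenue and satisfaction summary tables"),
    ("serve",     "summary tables",              "single dashboard"),
]

for step, source, target in pipeline:
    print(f"{step:<9} {source} -> {target}")
```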

Further reading

You can visit the following links to explore more about the topics discussed in this chapter:
