Fundamentals of Data Engineering

Years ago, when I first entered the world of data analytics, I assumed data was clean: ready to use and neatly organized. I was excited to experiment with machine learning models, find unusual patterns, and play around with the data. But after years of working with data, I realized that data analytics in big organizations isn't straightforward.

Most of the effort goes into collecting, cleaning, and transforming the data. If you have any experience working with data, I am sure you've noticed something similar. The good news is that these processes can be automated with proper planning, design, and engineering skills. That was the point when I realized that data engineering would be the most critical role in the future of the data science world.

To develop a successful data ecosystem in any organization, the most crucial part is how they design the...

Understanding the data life cycle

Understanding the data life cycle is the first step to becoming a data engineer. If you've worked with data, you know that data doesn't stay in one place; it moves from one storage system to another, from one database to another. Understanding the data life cycle means being able to answer these sorts of questions whenever you want to deliver information to your end user (a small sketch follows the list):

  • Who will consume the data?
  • What data sources should I use?
  • Where should I store the data?
  • When should the data arrive?
  • Why does the data need to be stored in this place?
  • How should the data be processed?
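As a small illustration (my own sketch, not the book's), the answers to these questions can be written down as a simple specification per pipeline. The `PipelineSpec` class and every example value below are assumptions made for the sake of the example.

```python
# A hypothetical sketch: recording the data life cycle answers for one pipeline.
from dataclasses import dataclass


@dataclass
class PipelineSpec:
    consumers: list          # Who will consume the data?
    sources: list            # What data sources should I use?
    storage: str             # Where should I store the data?
    arrival: str             # When should the data arrive?
    storage_reason: str      # Why does the data need to be stored in this place?
    processing: str          # How should the data be processed?


# Example answers for a hypothetical daily sales report.
daily_sales = PipelineSpec(
    consumers=["business analysts"],
    sources=["sales application database"],
    storage="data warehouse",
    arrival="daily, before 07:00",
    storage_reason="analysts query it with SQL",
    processing="batch aggregation of the previous day's transactions",
)
print(daily_sales)
```

Writing the answers down like this forces every downstream design decision (storage, schedule, processing style) to trace back to a consumer need.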

To answer all those questions, we’ll start by looking back a little bit at the history of data technologies.

Understanding the need for a data warehouse

The data warehouse is not a new concept; I believe you've at least heard of it. In fact, the term itself no longer sounds appealing. In my experience, no...

Start with knowing the roles of a data engineer

In later chapters, we will spend much of our time on practical exercises to understand data engineering concepts. But before that, let's take a quick look at the data engineer role.

The role is becoming more and more popular, but the job title itself is relatively new compared to well-established roles such as accountant, lawyer, and doctor. As a result, there is still some debate about what a data engineer should and shouldn't do.

For example, if you go to a hospital and see a doctor, you know for sure that the doctor will do the following:

  1. Examine your condition.
  2. Make a diagnosis of your health issues.
  3. Prescribe medicine.

The doctor wouldn’t do the following:

  1. Clean the hospital.
  2. Make the medicine.
  3. Manage hospital administration.

These boundaries are clear, and the same is true of most well-established job roles. But what about data engineers...

Going through the foundational concepts for data engineering

Even though we will learn many data engineering concepts throughout the book using Google Cloud Platform (GCP), there are some basics that you need to know as a data engineer. In my experience interviewing at data companies, these foundational concepts are often asked about to assess how much you know about data engineering. Take the following examples:

  • What is ETL?
  • What’s the difference between ETL and Extract, Load, and Transform (ELT)?
  • What is big data?
  • How do you handle large volumes of data?

These questions are quite common, yet it is particularly important to understand the underlying concepts deeply, since they affect the decisions we make when architecting our data life cycles.
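Before we dive into ETL in the next section, here is a minimal Python sketch (my own illustration, not code from the book) of the ETL versus ELT distinction: ETL transforms the data in the pipeline before loading it, while ELT loads the raw data first and pushes the transformation into the warehouse as SQL. The sample rows and table names are assumptions, and an in-memory SQLite database stands in for a real warehouse such as BigQuery.

```python
import sqlite3

# Pretend these rows were extracted from a source system (the "E" in both patterns).
raw_rows = [
    {"book_id": "b1", "amount": "12.50"},
    {"book_id": "b2", "amount": ""},       # a dirty row with a missing amount
    {"book_id": "b3", "amount": "7.00"},
]

conn = sqlite3.connect(":memory:")

# ETL: transform in the pipeline, then load only the cleaned result.
transformed = [
    {"book_id": r["book_id"], "amount": float(r["amount"])}
    for r in raw_rows
    if r["amount"]  # drop dirty rows before they reach the warehouse
]
conn.execute("CREATE TABLE sales (book_id TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:book_id, :amount)", transformed)

# ELT: load the raw rows as-is, then transform inside the warehouse with SQL.
conn.execute("CREATE TABLE raw_sales (book_id TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (:book_id, :amount)", raw_rows)
conn.execute(
    "CREATE TABLE clean_sales AS "
    "SELECT book_id, CAST(amount AS REAL) AS amount "
    "FROM raw_sales WHERE amount <> ''"
)

print(conn.execute("SELECT * FROM clean_sales").fetchall())
```

The practical trade-off is where the transformation work happens: in ETL the pipeline pays that cost before loading, while in ELT the warehouse does the heavy lifting after the raw data has landed.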

ETL concept in data engineering

ETL is the key foundation of data engineering. Everything in the data life cycle is ETL; every step from upstream to downstream is a form of ETL. Let's...

Summary

To summarize the first chapter, we've learned the fundamental knowledge we need as data engineers. Here are some key takeaways. First, data doesn't stay in one place; it moves from one place to another, and this movement is called the data life cycle. We also learned that data in a big organization mostly sits in silos, and that we can break down these silos using the concepts of a data warehouse and a data lake.

As someone who has started looking into data engineering roles, you may feel a little lost, since the role of a data engineer varies widely. The key takeaway is not to be confused by the broad expectations in the market: focus on the core first, and expand from it as you gain experience. In this chapter, we've learned what that core is. At the end of the chapter, we covered some key concepts; there are three that every data engineer needs to be familiar with: ETL, big data...

Exercise

You are a data engineer at a book publishing company, and your product manager has asked you to build a single dashboard showing total revenue and a customer satisfaction index.

Your company doesn't have any data infrastructure yet, but you know that it has three applications that contain terabytes of data:

  • The company website
  • A book sales application using MongoDB to store sales transactions, including transactions, book IDs, and author IDs
  • An author portal application using a MySQL database to store authors' personal information, including age

Do the following:

  1. List important follow-up questions for your manager
  2. List your technical thinking process for how to do it at a high level
  3. Draw a data pipeline architecture

There is no right or wrong answer to this exercise. The important thing is that you can imagine how the data flows from upstream to downstream and how it should be processed...
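For comparison only, here is one possible high-level shape of such a pipeline, written as a small Python sketch of my own; every stage and name in it is an assumption, not the book's answer.

```python
# One possible (assumed) high-level flow: land raw data from each source in a
# data lake, transform it into a warehouse, and serve a single dashboard on top.
pipeline = [
    ("extract",   "company website logs",        "data lake (raw zone)"),
    ("extract",   "MongoDB sales transactions",  "data lake (raw zone)"),
    ("extract",   "MySQL author portal data",    "data lake (raw zone)"),
    ("transform", "data lake (raw zone)",        "data warehouse tables"),
    ("aggregate", "data warehouse tables",       "revenue and satisfaction summary tables"),
    ("serve",     "summary tables",              "single dashboard"),
]

for step, source, target in pipeline:
    print(f"{step:<9} {source} -> {target}")
```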

Further reading

You can visit the following links to explore more about the topics discussed in this chapter:
