Years ago, when I first entered the data science world, I used to think data was clean: ready to use, available in one place, and waiting for fun data science work. I was so excited to experiment with machine learning models, find unusual patterns in data, and play around with clean data. But after years of experience working with data, I realized that data science in big organizations isn't straightforward.
Eighty percent of the effort goes into collecting, cleaning, and transforming the data. If you have had any experience in working with data, I am sure you've noticed something similar. But the good news is that almost all of these processes can be automated with proper planning, design, and engineering skills. That was the point where I realized that data engineering would be the most critical role in the data science world from that day on.
To develop a successful data ecosystem in any organization, the most crucial part is how they design the data architecture. If the organization fails to make the best decision on the data architecture, the future process will be painful. Here are some common examples: the system is not scalable, querying data is slow, business users don't trust your data, the infrastructure cost is very high, and data is leaked. There is so much more that can go wrong without proper data engineering practice.
In this chapter, we are going to learn the fundamental knowledge behind data engineering. The goal is to introduce you to common terms that are often used in this field and will be mentioned often in the later chapters.
In particular, we will be covering the following topics:
- Understanding the data life cycle
- Knowing the roles of a data engineer before starting
- Foundational concepts for data engineering
Understanding the data life cycle
The first principle to learn to become a data engineer is understanding the data life cycle. If you've worked with data, you must know that data doesn't stay in one place; it moves from one storage to another, from one database to other databases. Understanding the data life cycle means you need to be able to answer these sorts of questions if you want to display information to your end user:
- Who will consume the data?
- What data sources should I use?
- Where should I store the data?
- When should the data arrive?
- Why does the data need to be stored in this place?
- How should the data be processed?
To answer all those questions, we'll start by looking back a little bit at the history of data technologies.
Understanding the need for a data warehouse
The data warehouse is not a new concept; I believe you've at least heard of it. In fact, the terminology is no longer appealing. In my experience, no one gets excited when talking about data warehouses in the 2020s, especially when compared to terminologies such as big data, cloud computing, and artificial intelligence.
So, why do we need to know about data warehouses? Because almost every data engineering challenge, from the old times to these days, is conceptually the same. The challenges are always about moving data from the data source to other environments so the business can use it to get information. What changes over time is only the how and the newer technologies. If we understand why people needed data warehouses in historical times, we will have a better foundation to understand the data engineering space and, more specifically, the data life cycle.
Data warehouses were first developed in the 1980s to transform data from operational systems to decision-making support systems. The key principle of a data warehouse is combining data from many different sources to a single location and then transforming it into a format the data warehouse can process and store.
For example, in the financial industry, say a bank wants to know how many credit card customers also have mortgages. It is a simple enough question, yet it's not that easy to answer. Why?
Most traditional banks that I have worked with had a different operational system for each of their products: one for credit cards, and others for mortgages, savings products, websites, customer service, and so on. So, in order to answer the question, data from multiple systems needs to be stored in one place first.
See the following diagram on how each department is independent:
Often, independence not only applies to the organization structure but also to the data. When data is located in different places, it's called data silos. This is very common in large organizations where each department has different goals, responsibilities, and priorities.
In summary, what we need to understand from the data warehouse concept is the following:
- Data silos have always occurred in large organizations, even back in the 1980s.
- Data comes from many operational systems.
- In order to process the data, we need to store the data in one place.
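To make the bank question concrete, here is a minimal sketch. It assumes that each product system can export its customers and that both exports share a `customer_id` field; the records and field names are hypothetical. In a real bank, matching customers across systems is often the hard part.

```python
# Hypothetical exports from two separate (siloed) product systems.
credit_card_customers = [
    {"customer_id": "C001", "name": "Ana"},
    {"customer_id": "C002", "name": "Budi"},
    {"customer_id": "C003", "name": "Chen"},
]
mortgage_customers = [
    {"customer_id": "C002", "name": "Budi"},
    {"customer_id": "C003", "name": "Chen"},
    {"customer_id": "C004", "name": "Dewi"},
]


def customers_with_both(cards, mortgages):
    """Return the IDs of customers that appear in both exports."""
    card_ids = {row["customer_id"] for row in cards}
    mortgage_ids = {row["customer_id"] for row in mortgages}
    # The intersection is only possible once both datasets are in one place.
    return card_ids & mortgage_ids


both = customers_with_both(credit_card_customers, mortgage_customers)
print(len(both), sorted(both))
```

The logic itself is trivial. The work is in getting both datasets into the same place, which is exactly what the data warehouse concept addresses.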
What does a typical data warehouse stack look like?
This diagram represents the four logical building blocks in a data warehouse, which are Storage, Compute, Schema, and SQL Interface:
Data warehouse products can mostly store and process data seamlessly, and the user can use SQL to access the data in tables with a structured schema. This is basic knowledge, but an important point to be aware of is that the four logical building blocks in the data warehouse are designed as one monolithic piece of software. That design evolved over the following years, and that evolution was the start of the data lake.
Getting familiar with the differences between a data warehouse and a data lake
Fast forward to 2008, when an open source data technology named Hadoop was first published, and people started to use the data lake terminology. If you try to find the definition of data lake on the internet, it will mostly be described as a centralized repository that allows you to store all your structured and unstructured data.
So, what is the difference between a data lake and a data warehouse? Both have the same idea to store data in centralized storage. Is it simply that a data lake stores unstructured data and a data warehouse doesn't?
What if I say some data warehouse products can now store and process unstructured data? Does the data warehouse become a data lake? The answer is no.
One of the key differences from a technical perspective is that data lake technologies separate most of the building blocks, in particular, the storage and computation, but also the other blocks, such as schema, stream, SQL interface, and machine learning. This evolves the concept of a monolithic platform into a modern and modular platform consisting of separated components, as illustrated in the following diagram:
For example, in a data warehouse, you insert data by calling SQL statements and query the data through SQL tables, and there is nothing you can do as a user to change that pattern.
In a data lake, you can access the underlying storage directly, for example, by storing a text file, choosing your own computation engine, and choosing not to have a schema. There are many impacts of this concept, but I'll summarize it into three differences:
Large organizations start to store any data in the data lake system for two reasons: high scalability and cheap storage. In modern data architecture, data lakes and data warehouses complement each other rather than replace each other.
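The two access patterns can be contrasted in a small sketch. This is an illustration only: `sqlite3` stands in for a data warehouse and a temporary directory stands in for data lake storage, and the table, file, and field names are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Warehouse-style access: the schema is declared first, and every read and
# write goes through SQL. sqlite3 is only a small stand-in here.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (book_id TEXT, amount REAL)")
warehouse.execute("INSERT INTO sales VALUES (?, ?)", ("B1", 12.5))
warehouse_total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Lake-style access: write a raw file straight to storage with no schema,
# then read it later with whatever engine you choose (plain Python here).
lake_dir = Path(tempfile.mkdtemp())
raw_file = lake_dir / "sales.jsonl"
raw_file.write_text(
    '{"book_id": "B1", "amount": 12.5}\n'
    '{"book_id": "B2", "amount": 8.0, "coupon": "SPRING"}\n'
)
records = [json.loads(line) for line in raw_file.read_text().splitlines()]
lake_total = sum(record["amount"] for record in records)

print(warehouse_total, lake_total)
```

Note that the second raw record carries an extra `coupon` field. The file accepts it without complaint, while the SQL table would need a schema change first.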
We will dive deeper and carry out some practical examples throughout the book, such as trying to build a sample in Chapter 3, Building a Data Warehouse in BigQuery, and Chapter 5, Building a Data Lake Using Dataproc.
The data life cycle
Based on our understanding of the history of the data warehouse, now we know that data does not stay in one place. As an analogy, data is very similar to water; it flows from upstream to downstream. Later in this book, both of these terms, upstream and downstream, will be used often since they are common terminologies in data engineering.
When you think about water flowing upstream and downstream, one example that you can think of is a waterfall; the water falls freely without any obstacles.
Another example in different water life cycle circumstances is a water pipeline; upstream is the water reservoir and downstream is your kitchen sink. In this case, you can imagine the different pipes, filters, branches, and knobs in the middle of the process.
Data is very much like water. There are scenarios where you just need to copy data from one storage to another storage, or in more complex scenarios, you may need to filter, join, and split multiple steps downstream before the data can be consumed by the end users.
Let's now look at the elements of the data life cycle in detail:
- Apps and databases: The application is the interface from the human to the machine. The frontend application in most cases acts as the first data upstream. Data at this level is designed to serve application transactions as fast as possible.
- Data lake: Data from multiple databases needs to be stored in one place. The data lake concept and technologies suit the needs. The data lake stores data in a file format such as a CSV file, Avro, or Parquet.
The advantage of storing data in a file format is that it can accept any data sources; for example, MySQL Database can export the data to CSV files, image data can be stored as JPEG files, and IoT device data can be stored as JSON files. Another advantage of storing data in a data lake is it doesn't require a schema at this stage.
- Data warehouse: When you know any data in a data lake is valuable, you need to start thinking about the following:
- What is the schema?
- How do you query the data?
- What is the best data model for the data?
Data in a data warehouse is usually modeled based on business requirements. With this, one of the key requirements to build the data warehouse is that you need to know the relevance of the data to your business and the expected information that you want to generate from the data.
- Data mart: A data mart is an area for storing data that serves specific user groups. At this stage, you need to start thinking about the final downstream of data and who the end user is. Each data mart is usually under the control of each department within an organization. For example, a data mart for a finance team will consist of finance-related tables, while a data mart for data scientists might consist of tables with machine learning features.
- Data end consumer: The last stage of data will be back to humans as information. The end user of data can have various ways to use the data but at a very high level, these are the three most common usages:
- Reporting and dashboard
- Ad hoc query
- Machine learning
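The stages above can be sketched end to end on a single machine. This is a toy illustration under stated assumptions: `sqlite3` stands in for both the application database and the data warehouse, a temporary CSV file stands in for the data lake, and all table and column names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# 1. Application database (upstream): transactional rows.
app_db = sqlite3.connect(":memory:")
app_db.execute("CREATE TABLE orders (order_id INTEGER, department TEXT, amount REAL)")
app_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "finance", 100.0), (2, "finance", 50.0), (3, "marketing", 30.0)],
)

# 2. Data lake: export the table as a CSV file; no schema is enforced here.
lake_file = Path(tempfile.mkdtemp()) / "orders.csv"
with lake_file.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "department", "amount"])
    writer.writerows(app_db.execute("SELECT order_id, department, amount FROM orders"))

# 3. Data warehouse: load the file into a table with an explicit schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, department TEXT, amount REAL)")
with lake_file.open(newline="") as f:
    rows = [
        (int(r["order_id"]), r["department"], float(r["amount"]))
        for r in csv.DictReader(f)
    ]
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

# 4. Data mart: a view that serves one specific user group.
warehouse.execute(
    "CREATE VIEW finance_mart AS "
    "SELECT order_id, amount FROM fact_orders WHERE department = 'finance'"
)

# 5. End consumer: an ad hoc query against the data mart.
finance_total = warehouse.execute("SELECT SUM(amount) FROM finance_mart").fetchone()[0]
print(finance_total)
```

Each numbered comment maps to one element of the life cycle. In practice each stage would usually be a separate system.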
Are all data life cycles like this? No. Similar to the analogy of water flowing upstream and downstream, in different circumstances, it will require different data life cycles, and that's where data engineers need to be able to design the data pipeline architecture. But the preceding data life cycle is a very common pattern. In the past 10 years as a data consultant, I have had the opportunity to work with more than 30 companies from many industries, including financial, government, telecommunication, and e-commerce. Most of the companies that I worked with followed this pattern or were at least going in that direction.
As an overall summary of this section, we've learned that data has mostly lived in silos since the early days, and that this drives the need for the data warehouse and the data lake. Data moves from one system to another because specific needs call for specific technologies, and in this section we've learned about a very common pattern for that movement in data engineering. In the next section, let's try to understand the role of the data engineer, who is responsible for this.
Knowing the roles of a data engineer before starting
The data engineer role is getting more and more popular now, but the terminology itself is relatively new compared to other job roles, such as accountant, lawyer, doctor, and many other well-established job roles. As a result, there is still some debate about what a data engineer should and shouldn't do.
For example, if you came to a hospital and met a doctor, you know for sure that the doctor would do the following:
- Examine your condition.
- Make a diagnosis of your health issues.
- Prescribe medicine.
The doctor wouldn't do the following:
- Clean the hospital.
- Make the medicine.
- Manage hospital administration.
It's clear, and it applies to most well-established job roles. But how about data engineers?
- Handle all big data infrastructures and software installation.
- Handle application databases.
- Design the data warehouse data model.
- Analyze big data to transform raw data into meaningful information.
- Create a data pipeline for machine learning.
This lack of clarity is unavoidable since it's a new role, and I believe the role will become more established as data science matures. In this section, let's try to understand what a data engineer is and, despite the many combinations of responsibilities, what you should focus on as a data engineer.
Data engineer versus data scientist
A data engineer is someone who designs and builds data pipelines.
The definition is that simple, but I found out that the difference between a data engineer and a data scientist is still one of the most frequently asked questions when someone wants to start their data career. The hype around data scientists on the internet is one of the drivers; for example, up until today people still like to quote the following:
The data scientist role was originally invented back in 2008 to refer to groups of people who are highly curious and able to utilize big data technologies for business purposes. But as the technologies matured and became more complex, people started to realize that it's too much. It's very rare for a company to hire someone who can do all of the following:
- Handle big data infrastructure
- Properly design and build ETL pipelines
- Train machine learning models
- Deeply understand the company's business
Not that it's impossible, some people do have this knowledge, but from a company's point of view, it's not practical.
These days, for better focus and scalability, the data scientist role can be split into many different roles, for example, data analyst, machine learning engineer, and business analyst. But one of the most popular of these roles, and one now recognized as very important, is the data engineer.
The focus of data engineers
In the diagram, I added two underlying components:
- Job Orchestrator: Design and build a job dependency and scheduler that runs data movement from upstream to downstream.
- Infrastructure: Provision the required data infrastructure to run the data pipelines.
And on each step, I added numbers from 1 to 3. The numbers will help you to identify which components are the data engineer's main responsibility. This diagram works together with Figure 1.7, a data engineer-focused diagram to map the numbering. First, let's check this data life cycle diagram that we discussed before with the numbering on it:
After seeing the numbering on the data life cycle, check this diagram that illustrates the focus points of a data engineer:
The diagram shows the distribution of the knowledge area from the end-to-end data life cycle. At the center of the diagram (number 3) are the jobs that are the key focus of data engineers, and I will call it the core.
Those numbered 2 are the good to have area. For example, it's still common in small organizations that data engineers need to build a data mart for business users.
Designing and building a data mart is not as simple as creating tables in a database. Someone who builds a data mart needs to be able to talk to business people and gather requirements to serve tables from a business perspective, which is one of the reasons it's not part of the core.
While how to collect data to a data lake is part of the data engineer's responsibility, exporting data from operational application databases is often done by the application development team, for example, dumping MySQL tables as CSV in staging storage.
Those numbered 1 are the good to know area. For example, it's rare that a data engineer needs to be responsible for building application databases, developing machine learning models, maintaining infrastructure, and creating dashboards. It is possible, but less likely. The discipline needs knowledge that is a little bit too far from the core.
After learning about the three focus areas, let's now reflect on our understanding and vision of data engineers. Study the diagram carefully and answer these questions:
- What are your current focus areas as an individual?
- What are your current job's role focus areas (or if you are a student, your study areas)?
- What is your future goal in the data science world?
Depending on your individual answers, check them against the diagram: do you have all the necessary skills at the core? Does your current job give you experience in the core? Would you be excited to master all the subjects at the core in the near future?
From my experience, what is important to data engineers is the core. Even though there are a variety of data engineers' expectations, responsibilities, and job descriptions in the market, if you are new to the role, then the most important thing is to understand what the core of a data engineer is.
The diagram gives you guidance on what type of data engineers you are or will be. The closer you are to the core, the more of a data engineer you are. You are on the right track and in the right environment to be a good data engineer.
In scenarios where you are at the core, plus other areas beside it, then you are closer to a full-stack data expert; as long as you have a strong core, if you are able to expand your expertise to the good to have and good to know areas, you will have a good advantage in your data engineering career. But if you focus on other non-core areas, I suggest you find a way to master the core first.
In this section, we learned about the role of a data engineer. If you are not familiar with the core, the next section will be your guide to the fundamental concepts in data engineering.
Foundational concepts for data engineering
Even though there are many data engineering concepts that we will learn throughout the book by using Google Cloud Platform (GCP), there are some concepts that are basic and you need to know as data engineers. In my experience interviewing in data companies, I found out that these foundational concepts are often asked to test how much you know about data engineering. Take the following examples:
- What is Extract-Transform-Load (ETL)?
- What's the difference between ETL and Extract-Load-Transform (ELT)?
- What is big data?
- How do you handle large volumes of data?
These questions are very common, and it is important to understand the concepts behind them deeply, since they may affect our decisions when architecting our data life cycles.
ETL concept in data engineering
ETL is the key foundation of data engineering. All things in the data life cycle are ETL; any part that happens from upstream to downstream is ETL. Let's take a look at an upstream to downstream flow that has an ETL process in between:
- What is extract? This is the step to get the data from the upstream system. For example, if the upstream system is an RDBMS, then the extract step will be dumping or exporting data from the RDBMS.
- What is transform? This is the step to apply any transformation to the extracted data. For example, if the file from the RDBMS needs to be joined with a static CSV file, then the transform step will process the extracted data, load the CSV file, and finally, join both sets of information together in an intermediary system.
- What is load? This is the step to put the transformed data into the downstream system. For example, if the downstream system is BigQuery, then the load step will call a BigQuery load job to store the data in a BigQuery table.
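The three steps can be sketched as three small functions. This is a minimal illustration, not a production pipeline: `sqlite3` stands in for both the upstream RDBMS and the downstream system (no BigQuery call is made), and the table names, column names, and CSV content are hypothetical.

```python
import csv
import io
import sqlite3

# Upstream RDBMS (sqlite3 as a stand-in) with some sales rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (book_id TEXT, quantity INTEGER)")
source.executemany("INSERT INTO sales VALUES (?, ?)", [("B1", 2), ("B2", 1), ("B1", 3)])

# A static CSV file with reference data, kept inline for the example.
STATIC_CSV = "book_id,title\nB1,Data Basics\nB2,Cloud Notes\n"


def extract(conn):
    """Extract: export the rows from the upstream system."""
    return conn.execute("SELECT book_id, quantity FROM sales").fetchall()


def transform(rows, static_csv):
    """Transform: join the extracted rows with the static CSV data."""
    titles = {r["book_id"]: r["title"] for r in csv.DictReader(io.StringIO(static_csv))}
    return [(book_id, titles[book_id], quantity) for book_id, quantity in rows]


def load(conn, rows):
    """Load: store the transformed rows in the downstream system."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_enriched "
        "(book_id TEXT, title TEXT, quantity INTEGER)"
    )
    conn.executemany("INSERT INTO sales_enriched VALUES (?, ?, ?)", rows)
    conn.commit()


target = sqlite3.connect(":memory:")
load(target, transform(extract(source), STATIC_CSV))
loaded = target.execute("SELECT * FROM sales_enriched ORDER BY rowid").fetchall()
print(loaded)
```

Here the Python process itself plays the role of the intermediary system: the join happens in memory before anything reaches the target.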
Back in Figure 1.5, Data life cycle diagrams, each of the individual steps may have a different ETL process. For example, at the application database to data lake step, the application database is the upstream and the data lake is the downstream. But at the data lake to data warehouse step, the data lake becomes the upstream and the data warehouse becomes its downstream. So, you need to think about how you want to do the ETL process in every data life cycle step.
The difference between ETL and ELT
ETL is extract, transform, load and ELT is extract, load, transform. From the acronym itself, the difference between ETL and ELT is only the ordering of the letters T and L. Should you transform first and then load the data to the downstream or load the data to the downstream first and then transform the data inside the downstream system?
Easy! What's the big deal?
Even though it's a very simple difference in the acronym, deciding on the method can really affect your choice of technology products, system performance, scalability, and cost. For example, not all downstream systems are powerful enough to transform large volumes of data; in this case, ETL is preferred since using the ELT pattern will introduce issues in your downstream system.
In other cases, the downstream system is a lot more powerful than any intermediary system, so you want to choose the ELT pattern. This has mostly been the case since the data lake era, where the downstream systems are products such as Hadoop, BigQuery, or other scalable data processing products. But this is not an absolute answer; depending on your available choice of technology, you may change your ETL versus ELT strategy.
You will understand this better after running through the content of this book with a lot of ETL and ELT examples, but at this point, the important thing to keep in mind is, as a data engineer, you have two options of where to transform your data: in an intermediary system or in the target system.
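As a small illustration of the second option, here is an ELT-style sketch in which the raw data is loaded first and the transformation runs inside the target system as SQL. `sqlite3` is only a stand-in for a scalable target, and the table and column names are hypothetical.

```python
import sqlite3

# Raw rows as they arrive from the upstream system: (book_id, quantity, unit_price).
raw_rows = [("B1", 2, 10.0), ("B2", 1, 25.0), ("B1", 3, 10.0)]

target = sqlite3.connect(":memory:")

# Load: the raw data lands in the target system unchanged.
target.execute("CREATE TABLE raw_sales (book_id TEXT, quantity INTEGER, unit_price REAL)")
target.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: the target system does the work, expressed as SQL.
target.execute(
    "CREATE TABLE revenue_per_book AS "
    "SELECT book_id, SUM(quantity * unit_price) AS revenue "
    "FROM raw_sales GROUP BY book_id"
)

revenue = dict(target.execute("SELECT book_id, revenue FROM revenue_per_book"))
print(revenue)
```

In an ETL version of the same job, the aggregation would run in an intermediary process and only the aggregated rows would be loaded. Which one fits depends on where the available compute power sits.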
What is NOT big data?
After ETL and ELT, the other most common terminology is big data. Since big data is still one of the concepts most closely associated with data engineering, how you interpret the term as a data engineer matters. Note that the term big data itself refers to two different subjects:
- The data itself is big.
- The big data technology.
With so much hype in the media about the term, both in the sense of data getting bigger and in the sense of big data technology, I don't think I need to tell you the definition of big data. Instead, I will focus on eliminating the definitions of big data that are not relevant to data engineers. Here are some definitions from the media or from people that I have met personally:
- All people already use social media, the data in social media is huge, and we can use the social media data for our organization. That's big data.
- My company doesn't have data. Big data is a way to use data from the public internet to start my data journey. That's big data.
- The five Vs of data: volume, variety, velocity, veracity, and value. That's big data.
All the preceding definitions are correct but not really helpful to us as data engineers. So instead of seeing big data as general use cases, we need to focus on the how questions; think about what actually works under the hood. Take the following examples:
- How do you store 1 PB of data in storage, while the size of common hard drives is in TBs?
- How do you average a list of numbers, when the data is stored in multiple computers?
- How can you continuously extract data from the upstream system and do aggregation as a streaming process?
These are the kinds of questions that matter to data engineers. Data engineers need to know when a condition (the data itself is big) should be handled using big data or non-big data technology.
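Take the averaging question as an example. One common approach, sketched here with plain lists standing in for machines (the numbers are made up), is to have each machine return a partial sum and count, and then combine the partial results. Averaging the per-machine averages would be wrong whenever the machines hold different numbers of records.

```python
def partial_sum_count(numbers):
    """Run on each machine: return (sum, count) for the local numbers."""
    return sum(numbers), len(numbers)


def combine(partials):
    """Run on one coordinator: merge the partial results into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Each inner list stands in for the numbers stored on one machine.
machines = [[1, 2, 3], [10, 20], [4]]

partials = [partial_sum_count(chunk) for chunk in machines]
distributed_mean = combine(partials)

# For comparison: the naive average of per-machine averages.
naive_mean = sum(sum(chunk) / len(chunk) for chunk in machines) / len(machines)

print(distributed_mean, naive_mean)
```

Only two small numbers per machine travel over the network, no matter how many records each machine holds.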
A quick look at how big data technologies store data
Knowing that answering the how question is what is important to understanding big data, the first question we need to answer is how does it actually store the data? What makes it different from non-big data storage?
The word big in big data is relative. For example, say you analyze Twitter data and then download the data as JSON files with a size of 5 GB, and your laptop storage is 1 TB with 16 GB memory.
I don't think that's big data. But if the Twitter data is 5 PB, then it becomes big data because you need a special way to store it and a special way to process it. So, the key is not about whether it is social media data or not, or unstructured or not, which sometimes many people still get confused by. It's more about the size of the data relative to your system.
Big data technology needs to be able to distribute the data in multiple servers. The common terminology for multiple servers working together is a cluster. I'll give an illustration to show you how a very large file can be distributed into multiple chunks of file parts on multiple machines:
In a distributed filesystem, a large file will be split into multiple small parts. In the preceding example, it is split into nine parts, and each part is a small 128 MB file. Then, the file parts are distributed across three machines randomly. On top of the file parts, there will be metadata to store information about how the file parts form the original file, for example, a large file is a combination of file part 1 located in machine 1, file part 2 located in machine 2, and so on.
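The idea can be sketched in a few lines. This is a toy model, not how any particular distributed filesystem is implemented: dictionaries stand in for machines, the part size is 10 bytes instead of 128 MB, and parts are placed round-robin instead of randomly to keep the example deterministic. All function and machine names are hypothetical.

```python
def split_into_parts(data, part_size):
    """Split a byte string into fixed-size parts (the last one may be shorter)."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def distribute(parts, machine_names):
    """Place parts on machines and record where each part went."""
    storage = {name: {} for name in machine_names}
    metadata = []  # ordered list of (part_id, machine) pairs
    for part_id, part in enumerate(parts):
        machine = machine_names[part_id % len(machine_names)]  # round-robin placement
        storage[machine][part_id] = part
        metadata.append((part_id, machine))
    return storage, metadata


def reassemble(storage, metadata):
    """Use the metadata to rebuild the original file from its parts."""
    return b"".join(storage[machine][part_id] for part_id, machine in metadata)


original = b"0123456789" * 9  # a 90-byte "large file"
parts = split_into_parts(original, part_size=10)
storage, metadata = distribute(parts, ["machine-1", "machine-2", "machine-3"])

print(len(parts), {name: sorted(held) for name, held in storage.items()})
print(reassemble(storage, metadata) == original)
```

Without the metadata, the parts are just unrelated chunks; the metadata is what makes them a file again.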
The distributed parts can be stored in any format that isn't necessarily a file format; for example, it can be in the form of data blocks, byte arrays in memory, or some other data format. But for simplicity, what you need to be aware of is that in a big data system, data can be stored in multiple machines and in order to optimize performance, sometimes you need to think about how you want to distribute the parts.
Once data is split into parts on different machines, further questions follow:
- How do I process the files?
- What if I want to aggregate some numbers from the files?
- How does each part know the record values in other parts when they are stored on different machines?
There are many approaches to answer these three questions. But one of the most famous concepts is MapReduce.
A quick look at how to process multiple files using MapReduce
Historically speaking, MapReduce is a framework that was published as a white paper by Google and is widely used in the Hadoop ecosystem. There is an actual open source project called MapReduce mainly written in Java that still has a large user base, but slowly people have started to change to other distributed processing engine alternatives, such as Spark, Tez, and Dataflow. But MapReduce as a concept itself is still relevant regardless of the technology.
In a short summary, the word MapReduce can refer to two definitions:
- MapReduce as a technology
- MapReduce as a concept
What is important for us to understand is MapReduce as a concept. MapReduce is a combination of two words: map and reduce.
Let's take a look at an example. Say you have a file that's divided into three file parts:
Each of the parts contains one or more words, which in this example are fruits. The file parts are stored on different machines, which together hold these three file parts:
- File Part 1 contains two words: Banana and Apple.
- File Part 2 contains three words: Melon, Apple, and Banana.
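- File Part 3 contains one word: Apple.
The goal is to count how many times each word appears across all the file parts. The expected result is:
- Apple = 3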
- Banana = 2
- Melon = 1
Since the file parts are separated in different machines, we cannot just count the words directly. We need MapReduce. Let's take a look at the following diagram, where file parts are mapped, shuffled, and lastly reduced to get the final result:
There are four main steps in the diagram:
- Map: Add a static value of 1 to each individual record. This transforms each word into a key-value pair where the value is always 1.
- Shuffle: At this point, we need to move the fruit words between machines. We want to group each word and store it in the same machine for each group.
- Reduce: Because each fruit group is already in the same machine, we can count them together. The Reduce step will sum up the static value 1 to produce the count results.
- Result: Store the final results back on a single machine.
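The steps can be sketched in plain Python, using the three file parts from the example. This runs on one machine and only imitates the data movement: each list or dictionary stands in for a machine, and the partitioning rule in the shuffle step (sum of the word's bytes modulo the number of reducers) is an assumption chosen to keep the example deterministic.

```python
from collections import defaultdict

# Each inner list stands in for a file part stored on a different machine.
file_parts = [
    ["Banana", "Apple"],
    ["Melon", "Apple", "Banana"],
    ["Apple"],
]


def map_step(words):
    """Map: turn each word into a (word, 1) key-value pair."""
    return [(word, 1) for word in words]


def shuffle_step(mapped_parts, num_reducers):
    """Shuffle: send all pairs with the same word to the same reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_parts:
        for word, one in pairs:
            target = sum(word.encode()) % num_reducers  # simple deterministic partition
            reducers[target][word].append(one)
    return reducers


def reduce_step(grouped):
    """Reduce: sum the 1s for each word held by this reducer."""
    return {word: sum(ones) for word, ones in grouped.items()}


mapped = [map_step(part) for part in file_parts]      # runs per file part
shuffled = shuffle_step(mapped, num_reducers=3)       # moves data between machines
reduced = [reduce_step(group) for group in shuffled]  # runs per reducer

# Result: collect the per-reducer counts on a single machine.
result = {}
for partial in reduced:
    result.update(partial)

print(result)
```

Because the shuffle guarantees that each word lives on exactly one reducer, the final merge never has to add counts together.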
The key idea here is to do as much of the processing as possible in a distributed manner. Looking back at the diagram, you can imagine each box on each step is a different machine.
Each step, Map, Shuffle, and Reduce, always maintains three parallel boxes. What does this mean? It means that the processes happen in parallel on three machines. This paradigm is different from running all the processing on a single machine. For example, we can simply download all the file parts into a pandas DataFrame in Python and do a count using the pandas DataFrame. In this case, the process will happen on one machine.
MapReduce is a complex concept. The concept is explained in a 13-page-long document by Google. You can find the document easily on the public internet. In this book, I haven't added much deeper explanation about MapReduce. In most cases, you don't need to really think about it; for example, if in a later chapter you use BigQuery to process 1 PB of data, you will only need to run a SQL query and BigQuery will process it in a distributed manner in the background.
As a matter of fact, all technologies in GCP that we will use in this book are highly scalable and without question able to handle big data out of the box. But understanding the underlying concepts helps you as a data engineer in many ways, for example, choosing the right technologies, designing data pipeline architecture, troubleshooting, and improving performance.
As a summary of the first chapter, we've learned the fundamental knowledge we need as data engineers. Here are some key takeaways from this chapter. First, data doesn't stay in one place. Data moves from one place to another, called the data life cycle. We also understand that data in a big organization is mostly in silos, and we can solve these data silos using the concepts of a data warehouse and data lake.
As someone who has started to look into data engineer roles, you may be a little bit lost. The role of data engineers may vary. The key takeaway is not to be confused by the broad expectations in the market. First, you should focus on the core and then expand as you get more and more experience from the core. In this chapter, we've learned what the core for a data engineer is. At the end of the chapter, we learned some of the key concepts. There are three key concepts that you need to be familiar with as a data engineer: ETL, big data, and distributed systems.
In the next chapter, we will visit GCP, a cloud platform provided by Google that has a lot of services to help us as data engineers. We want to understand its positioning and which of its services are relevant to big data, and lastly, we will start using the GCP console.
Now let's put the knowledge from this chapter into practice.
You are a data engineer at a book publishing company and your product manager has asked you to build a dashboard to show the total revenue and customer satisfaction index in a single dashboard.
Your company doesn't have any data infrastructure yet, but you know that your company has these three applications that contain TBs of data:
- The company website
- A book sales application using MongoDB to store sales transactions, including transactions, book ID, and author ID
- An author portal application using MySQL Database to store authors' personal information, including age
Do the following:
- List important follow-up questions for your manager.
- List your technical thinking process of how to do it at a high level.
- Draw a data pipeline architecture.
There is no right or wrong answer to this practice. The important thing is that you can imagine how the data flows from upstream to downstream, how it should be processed at each step, and finally, how you want to serve the information to end users.
Further reading
- Learn more about Hadoop and its distributed filesystem: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf.
- Learn more about how MapReduce works: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.
- Key facts about data engineers and why the role is getting more popularity than data scientists in 2021: https://www.kdnuggets.com/2021/02/dont-need-data-scientists-need-data-engineers.html.