You're reading from Data Lakehouse in Action

Product type: Book
Published in: Mar 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781801815932
Edition: 1st Edition
Author: Pradeep Menon

Pradeep Menon is a seasoned data analytics professional with more than 18 years of experience in data and AI. Pradeep can balance business and technical aspects of any engagement and cross-pollinate complex concepts across many industries and scenarios. Currently, Pradeep works as a data and AI strategist at Microsoft. In this role, he is responsible for driving big data and AI adoption for Microsoft’s strategic customers across Asia. Pradeep is also a distinguished speaker and blogger and has given numerous keynotes on cloud technologies, data, and AI.

Chapter 3: Ingesting and Processing Data in a Data Lakehouse

In the previous chapter, we provided an overview of the architectural components of a data lakehouse. That chapter gave a bird's-eye view of the seven layers and then described each layer in detail. This chapter covers the architectural patterns for the first two layers of a data lakehouse:

  • The data ingestion layer
  • The data processing layer

These two layers are covered together because they are interlinked: data is relayed from the ingestion layer to the processing layer, and many of the same tools and technologies are used in both.

This chapter is divided into five sections. We will start by exploring the differences between the extract, transform, load (ETL) and extract, load, transform (ELT) data transformation patterns. Then, we will dive deeper into the methods for ingesting and processing batch data. After that, we will do the same for streaming data...

Ingesting and processing batch data

Let's start by looking at the logical architecture of a data lakehouse:

Figure 3.1 – Data lakehouse logical architecture

The preceding diagram depicts the seven logical layers. Data from the data providers needs to be ingested and transformed. Traditionally, there are two types of batch data ingestion and transformation patterns:

  • ETL
  • ELT

Understanding these patterns is vital if you wish to understand how they can be combined for batch ingestion and processing in a data lakehouse.

Let's discuss these patterns in detail.

Differences between the ETL and ELT patterns

On the surface, these patterns may seem similar. However, they differ in their philosophy and in the services that are employed to transform data.
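The difference in ordering that the two acronyms imply can be made concrete with a small sketch. The following Python example is an illustration only, not the book's implementation: the `source_rows` data, the `clean` function, and the dictionaries standing in for a target store are all hypothetical. It assumes the usual reading of the names, where ETL transforms data before loading it into the target and ELT loads the raw data first and transforms it inside the target.

```python
# Hypothetical source records; the field names are illustrative only.
source_rows = [
    {"id": 1, "amount": " 10.5 "},
    {"id": 2, "amount": "7"},
]


def clean(row):
    """Transform step: parse the amount string into a float."""
    return {"id": row["id"], "amount": float(row["amount"].strip())}


def run_etl(rows):
    """ETL: transform in a separate step, then load only the result."""
    target = {"curated": []}
    transformed = [clean(r) for r in rows]   # transform before loading
    target["curated"].extend(transformed)    # load
    return target


def run_elt(rows):
    """ELT: load raw data as-is, then transform inside the target."""
    target = {"raw": [], "curated": []}
    target["raw"].extend(rows)                             # load first
    target["curated"] = [clean(r) for r in target["raw"]]  # transform afterwards
    return target


etl_target = run_etl(source_rows)
elt_target = run_elt(source_rows)
```

In this sketch, both functions produce the same curated rows. Only the ELT target also holds the untransformed copy under `raw`.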

ETL

The first pattern is ETL. The following diagram depicts a typical ETL pattern:

...

Ingesting and processing streaming data

The following diagram depicts the components required for stream data ingestion and processing:

Figure 3.7 – The streaming data ingestion and processing pattern

Now, let's discuss stream data processing through the lens of the ELTL process.

Streaming data sources

A streaming data source is one that emits data continuously. Social media feeds, IoT devices, and event-driven processes such as swiping a credit card are examples of streaming data sources. Because the data is continually produced, the goal of stream processing is to tap into that stream and gain insights as quickly as possible. Stream data ingestion and processing facilitate real-time analytics. This means that analytics is performed on the data in motion, without the data first being persisted on disk.
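To illustrate processing data as it arrives, without writing it to disk first, here is a minimal Python sketch. The `card_swipes` generator and its event fields are hypothetical stand-ins for a real source such as a card-payment feed, and the running total is just one example of an insight computed in flight.

```python
from typing import Dict, Iterable, Iterator


def card_swipes() -> Iterator[Dict[str, float]]:
    """Hypothetical event source: yields one swipe event at a time."""
    for amount in (12.0, 250.0, 8.5, 990.0):
        yield {"amount": amount}


def running_totals(events: Iterable[Dict[str, float]]) -> Iterator[float]:
    """Update an aggregate per event; only the aggregate is kept in memory."""
    total = 0.0
    for event in events:
        total += event["amount"]
        yield total


totals = list(running_totals(card_swipes()))
```

Each event is consumed once and then discarded. The only state carried between events is the aggregate itself.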

Extraction-load

Stream data is extracted using an event publishing-subscribing service. An event publishing-subscribing service enables creating...
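As a rough illustration of the publish-subscribe idea, the following Python sketch implements a tiny in-memory broker. It is not the API of any real messaging service: the `Broker` class, its methods, and the `swipes` topic name are all invented for this example. Publishers send events to a named topic, and every handler subscribed to that topic receives each event.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class Broker:
    """Hypothetical in-memory publish-subscribe broker."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every event on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Any) -> None:
        """Deliver the event to all handlers subscribed to the topic."""
        for handler in self._subscribers[topic]:
            handler(event)


broker = Broker()
received_a: List[Any] = []
received_b: List[Any] = []

# Two independent subscribers on the same topic.
broker.subscribe("swipes", received_a.append)
broker.subscribe("swipes", received_b.append)

broker.publish("swipes", {"card": "1234", "amount": 42.0})
broker.publish("other", {"ignored": True})  # no subscribers on this topic
```

The publisher never refers to the subscribers directly. It only names the topic, which is what decouples producers from consumers in this pattern.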

Bringing it all together

So far, we have covered the essential elements of batch and stream ingestion and processing. Now, let's bring these two types of processing together to define the Lambda architecture pattern.

Figure 3.12 – Lambda architecture pattern

The preceding diagram depicts a Lambda architecture pattern. A Lambda architecture pattern has three layers: the batch layer, the speed layer, and the serving layer.
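The way the three layers fit together can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the book's design: the event data, the function names, and the choice of a per-key count as the computed view are all hypothetical. It assumes the common description of the Lambda architecture, in which the batch layer computes a view over historical data, the speed layer computes a view over recent events that the batch has not yet seen, and the serving layer answers queries by combining the two.

```python
from collections import Counter
from typing import Iterable

# Hypothetical events: each value is the key (for example, a device ID).
historical_events = ["a", "b", "a", "c"]   # already landed in the data lake
recent_events = ["a", "c", "c"]            # arrived after the last batch run


def batch_layer(events: Iterable[str]) -> Counter:
    """Recompute a complete view from all historical data."""
    return Counter(events)


def speed_layer(events: Iterable[str]) -> Counter:
    """Build an incremental view, one recent event at a time."""
    view: Counter = Counter()
    for key in events:
        view[key] += 1
    return view


def serving_layer(batch_view: Counter, speed_view: Counter, key: str) -> int:
    """Answer a query by merging the batch and speed views."""
    return batch_view[key] + speed_view[key]


batch_view = batch_layer(historical_events)
speed_view = speed_layer(recent_events)
count_a = serving_layer(batch_view, speed_view, "a")
count_c = serving_layer(batch_view, speed_view, "c")
```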

The batch layer

The following diagram illustrates batch layer processing in a Lambda architecture:

Figure 3.13 – The batch layer in a Lambda architecture

Batch layer processing consists of ingesting the data into the raw data store of the data lake using pull or push methodologies through a batch data ingestion service. Once the data has been ingested in the raw data store, a batch processing service is initiated. The batch processing service employs a distributed computing engine for faster...
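The two steps described here, landing data in a raw store and then processing it in parallel, can be sketched as follows. This Python example is only an analogy: a local thread pool stands in for a distributed computing engine, a temporary directory stands in for the raw data store, and the file names and record layout are hypothetical.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def ingest(raw_store: Path, batches: List[List[dict]]) -> List[Path]:
    """Ingestion: write each incoming batch to the raw store unchanged."""
    paths = []
    for i, batch in enumerate(batches):
        path = raw_store / f"batch_{i}.json"
        path.write_text(json.dumps(batch))
        paths.append(path)
    return paths


def process_partition(path: Path) -> int:
    """Processing: compute a partial result for one raw file."""
    records = json.loads(path.read_text())
    return sum(r["amount"] for r in records)


def run_batch_job(paths: List[Path]) -> int:
    """Fan the partitions out to workers, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(process_partition, paths))
    return sum(partials)


incoming = [
    [{"amount": 10}, {"amount": 5}],
    [{"amount": 20}],
    [{"amount": 1}, {"amount": 2}, {"amount": 3}],
]

with tempfile.TemporaryDirectory() as tmp:
    raw_paths = ingest(Path(tmp), incoming)
    total = run_batch_job(raw_paths)
```

The sketch shows the split, process, and combine shape of the job: each partition is processed independently, so the work can be spread across workers and the partial results merged at the end.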

Summary

This chapter covered data ingestion and processing. We started by exploring the different patterns for batch data ingestion: ETL and ELT.

Then, we delved into the different components of the ELTL pattern, which is used to ingest and process batch data in a data lakehouse. We discussed how to push or pull data into a raw data store, as well as the pivotal role that the raw data store layer plays in data ingestion and processing.

Next, we delved into distributed computing and how it is used for processing batch data at scale.

After discussing batch data ingestion and processing, we turned to patterns for ingesting and processing stream data. We discussed how to ingest stream data by publishing it to a topic and subscribing to it for processing, and we learned how to micro-batch the streams and act on a micro-batch or a specific event of interest.
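Micro-batching can be illustrated with a short Python sketch. The example groups a stream into fixed-size batches by count, which is a simplification, since a stream processor may instead batch by a time interval. Treat the `micro_batches` helper, the batch size, and the alert threshold as hypothetical choices made for illustration. The sketch shows one action on each micro-batch as a whole and one on specific events of interest.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield consecutive lists of up to `size` events from the stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


ALERT_THRESHOLD = 900.0  # hypothetical cutoff for an "event of interest"

stream = [12.0, 250.0, 8.5, 990.0, 40.0]
batch_totals = []
alerts = []

for batch in micro_batches(stream, size=2):
    # Action on the whole micro-batch: aggregate it.
    batch_totals.append(sum(batch))
    # Action on specific events of interest within the batch.
    alerts.extend(x for x in batch if x > ALERT_THRESHOLD)
```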

Finally, we brought all the concepts we'd discussed together and weaved...

Further reading

For more information regarding the topics that were covered in this chapter, take a look at the following resources:
