Data Lakehouse in Action

Chapter 1: Introducing the Evolution of Data Analytics Patterns

Data analytics is an ever-changing field. A little history will help you appreciate the strides in this field and how data architectural patterns have evolved to fulfill the ever-changing need for analytics.

First, let's start with some definitions:

What is analytics? Analytics is defined as any action that converts data into insights.
What is data architecture? Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.

Analytics, and the data architecture that enables it, go back a long way. Let's now explore some of the patterns that have evolved over the last few decades.

This chapter explores the genesis of data growth and explains the need for a new paradigm in data architecture. It starts by examining the predominant paradigm of the 1990s and 2000s, the enterprise data warehouse. It explores the challenges associated with this paradigm and then covers the drivers that caused an explosion in data. It further examines the rise of a new paradigm, the data lake, and its challenges. Finally, the chapter advocates the need for a new paradigm, the data lakehouse, and clarifies the key benefits delivered by a well-architected data lakehouse.

We'll cover all of this in the following topics:

Discovering the enterprise data warehouse era
Exploring the five factors of change
Investigating the data lake era
Introducing the data lakehouse paradigm

Discovering the enterprise data warehouse era

The Enterprise Data Warehouse (EDW) pattern, popularized by Ralph Kimball and Bill Inmon, was predominant in the 1990s and 2000s. The needs of this era were relatively straightforward (at least compared to the current context). The focus was predominantly on optimizing database structures to satisfy reporting requirements. Analytics was synonymous with reporting. Machine learning was a specialized field and was not ubiquitous in enterprises.

A typical EDW pattern is depicted in the following figure:

Figure 1.1 – A typical EDW pattern

As shown in Figure 1.1, the pattern entailed source systems composed of databases or flat-file structures. The data sources are predominantly structured, that is, rows and columns. A process called Extract-Transform-Load (ETL) first extracts the data from the source systems. Then, the process transforms the data into a shape and form that is conducive for analysis. Once the data is transformed, it is loaded into an EDW. From there, the subsets of data are then populated to downstream data marts. Data marts can be conceived of as mini data warehouses that cater to the business requirements of a specific department.
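The flow described above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not a reference implementation: the order data, the table names (edw_orders, mart_sales), and the use of an in-memory SQLite database as a stand-in for the warehouse are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source: structured rows and columns.
SOURCE_CSV = """order_id,department,amount,order_date
1,sales,100.50,2005-01-03
2,sales,200.00,2005-01-04
3,finance,50.25,2005-01-04
"""

COLUMNS = "(order_id INTEGER, department TEXT, amount REAL, order_date TEXT)"


def extract(raw_text):
    """Extract: read rows from the flat-file source."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: cast types and standardize values for analysis."""
    return [
        (int(r["order_id"]), r["department"].upper(), float(r["amount"]), r["order_date"])
        for r in rows
    ]


def load(conn, records):
    """Load: write the transformed records into the warehouse table."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS edw_orders {COLUMNS}")
    conn.executemany("INSERT INTO edw_orders VALUES (?, ?, ?, ?)", records)
    conn.commit()


def build_data_mart(conn, department):
    """Populate a departmental data mart with a subset of the warehouse."""
    if not department.isalpha():
        raise ValueError("department must be alphabetic")
    table = f"mart_{department.lower()}"
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} {COLUMNS}")
    conn.execute(
        f"INSERT INTO {table} SELECT * FROM edw_orders WHERE department = ?",
        (department,),
    )
    conn.commit()
    return table


conn = sqlite3.connect(":memory:")  # stand-in for the EDW
load(conn, transform(extract(SOURCE_CSV)))
mart = build_data_mart(conn, "SALES")
sales_total = conn.execute(f"SELECT SUM(amount) FROM {mart}").fetchone()[0]
print(mart, sales_total)
```

Note that every step depends on the schema being fixed in advance, which is part of why changes to reporting requirements rippled through the whole pipeline.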

As you can imagine, this pattern was primarily focused on the following:

Creating a data structure that is optimized for storage and modeled for reporting
Focusing on the reporting requirements of the business
Harnessing the structured data into actionable insights

Every coin has two sides. The EDW pattern is not an exception. It has its pros and it has its cons. This pattern has survived the test of time. It was widespread and well adopted because of the following key advantages:

Since most of the analytical requirements were related to reporting, this pattern effectively addressed many organizations' reporting requirements.
Large enterprise data models were able to structure an organization's data into logical and physical models. This pattern gave a structure to manage the organization's data in a modular and efficient manner.
Since this pattern catered only to structured data, the technology required to harness it was mature and readily available. Relational Database Management Systems (RDBMSes) had evolved and were well positioned to serve reporting workloads.

However, it also had its own set of challenges that surfaced as the data volumes grew and new data formats started emerging. A few challenges associated with the EDW pattern are as follows:

This pattern was not as agile as the changing business requirements wanted it to be. Any change in the reporting requirement had to go through a long-winded process of data model changes, ETL code changes, and respective changes to the reporting system. Often, ETL development was a specialized skill and became a bottleneck that slowed the data-to-insight turnaround time. The nature of analytics is unique. The more you see the output, the more you demand. Many EDW projects were deemed a failure. The failure was not from a technical perspective, but from a business perspective. Operationally, the design changes required to cater to these fast-evolving requirements were too difficult to handle.
As the data volumes grew, this pattern proved cost-prohibitive. Massively parallel processing (MPP) database technologies that specialized in data warehouse workloads started evolving. The cost of maintaining these databases was prohibitive as well. It involved expensive software prices, frequent hardware refreshes, and a substantial staffing cost. The return on investment was no longer justifiable.
As the format of data started evolving, the challenges associated with the EDW became more evident. Database technologies were developed to cater to semi-structured data (JSON). However, the fundamental concept was still RDBMS-based. The underlying technology was not able to effectively cater to these new types of data. There was more value in analyzing data that was not structured. The sheer variety of data was too complex for EDWs to handle.
The EDW was focused predominantly on Business Intelligence (BI). It facilitated the creation of scheduled reports, ad hoc data analysis, and self-service BI. Although it catered to most of the personas who performed analysis, it was not conducive to AI/ML use cases. The data in the EDW was already cleansed and structured with a razor-sharp focus on reporting. This left little room for data scientists (statistical modelers at that time) to explore data and form new hypotheses.

While the EDW pattern was becoming mainstream, a perfect storm was flourishing that changed the landscape. The following section will focus on five different factors that came together to change the data architecture pattern for good.

Exploring the five factors of change

The year 2007 changed the world as we know it; the day Steve Jobs took the stage and announced the iPhone launch was a turning point in the age of data. That day brewed the perfect "data" storm.

A perfect storm is a meteorological event that occurs as a result of a rare combination of factors. In the world of data evolution, such a perfect storm occurred in the last decade, one that has catapulted data into the position of a strategic enterprise asset. Five ingredients caused the perfect "data" storm.

Figure 1.2 – Ingredients of the perfect "data" storm

As depicted in Figure 1.2, there were five factors to the perfect storm. An exponential growth of data and an increase in computing power were the first two factors. These two factors coincided with a decrease in storage cost. The rise of AI and the advancement of cloud computing coalesced at the same time to form the perfect storm.

These factors developed independently and converged together, changing and shaping industries. Let's look into each of these factors briefly.

The exponential growth of data

The exponential growth of data is the first ingredient of the perfect storm.

Figure 1.3 – Estimated data growth between 2010 and 2020

According to the International Data Corporation (IDC), by 2025, the total data volume generated will reach around 163 zettabytes (ZB), where one zettabyte is a trillion gigabytes. In 2010, that number was approximately 0.5 ZB. This exponential growth of data is attributed to a vast improvement in internet technologies that has fueled the growth of many industries. The telecommunications industry was the major industry that was transformed. This, in turn, transformed many other industries. Data became ubiquitous and every business craved more data bandwidth. Social media platforms also took off. The likes of Facebook, Twitter, and Instagram flooded the internet with more data. Streaming services and e-commerce also generated tons of data. This generated data was used to shape and influence consumer behaviors. Last, but not least, the technological leaps in the Internet of Things (IoT) space generated loads of data.
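To put the two figures quoted above in perspective, the following sketch computes the compound annual growth rate they imply. It uses only the numbers in the text (0.5 ZB in 2010, 163 ZB projected for 2025); the function name is illustrative.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


START_ZB = 0.5        # approximate volume in 2010, as quoted in the text
END_ZB = 163.0        # IDC projection for 2025, as quoted in the text
GB_PER_ZB = 10 ** 12  # one zettabyte is a trillion gigabytes

years = 2025 - 2010
growth_factor = END_ZB / START_ZB
cagr = compound_annual_growth_rate(START_ZB, END_ZB, years)

print(f"Growth factor over {years} years: {growth_factor:.0f}x")
print(f"Implied compound annual growth rate: {cagr:.1%}")
print(f"Projected 2025 volume in gigabytes: {END_ZB * GB_PER_ZB:.3e}")
```

Under these figures, data volume grows by roughly 47 percent every year, which is the kind of sustained compounding that the word "exponential" refers to.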

The traditional EDW pattern was not able to cope with this growth in data. It was designed for structured data. Big data had changed the definition of usable data. The data now was big (volume); some of it was continuously flowing (velocity), generated in different shapes and forms (variety), and from a plethora of sources with noise (veracity).

The increase in compute

The exponential increase in computing power is the second ingredient of the perfect storm.

Figure 1.4 – Estimated growth in transistors per microprocessors between 2010 and 2020

Moore's law is the prediction made by American engineer Gordon Moore in 1965 that the number of transistors per silicon chip doubles every year; in 1975, he revised the pace to a doubling roughly every two years. The revised forecast has held up reasonably well. In 2010, the number of transistors in a microprocessor was around 2 billion. In 2020, that number stood at 54 billion, which works out to a doubling about every two years. This exponential increase in computing power dovetails with the rise of cloud computing technologies that provide virtually limitless compute at an affordable price point.
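As a quick check on the transistor counts quoted above, the following sketch derives the doubling period they imply. It uses only the two data points from the text; the function name is illustrative.

```python
import math


def doubling_period_years(start_count, end_count, elapsed_years):
    """Return how many years each doubling took, assuming steady exponential growth."""
    doublings = math.log2(end_count / start_count)
    return elapsed_years / doublings


TRANSISTORS_2010 = 2e9   # around 2 billion, as quoted in the text
TRANSISTORS_2020 = 54e9  # around 54 billion, as quoted in the text

period = doubling_period_years(TRANSISTORS_2010, TRANSISTORS_2020, 2020 - 2010)
print(f"Implied doubling period: {period:.1f} years")
```

A 27-fold increase over ten years corresponds to a little under five doublings, so the counts imply a doubling roughly every two years rather than every year.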

The increase in computing power at a reasonable price point provided a much-needed impetus for big data. Organizations can now procure more and more compute at a much lower price point. The compute available in cloud computing can now be used to process and analyze data on demand.

The decrease in storage cost

The rapid decrease in storage cost is the third ingredient of the perfect storm.

Figure 1.5 – The estimated decrease in storage cost between 2010 and 2020

The cost of storage has also decreased exponentially. In 2010, the average cost of storing a GB of data on a Hard Disk Drive (HDD) was around $0.10. That number fell to approximately $0.01 over the following 10 years. In the traditional EDW pattern, organizations had to be picky about which data they stored for analysis and which data they discarded. Holding data was an expensive proposition. However, the exponential decrease in storage cost meant that all data could now be stored at a fraction of the previous cost. There was now no need to pick and choose what should be stored and what should be discarded. Data in whatever shape or form could now be kept at a fraction of the price. The mantra of store first and analyze later could now be implemented.
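The effect of that price drop on a retention decision can be sketched with simple arithmetic. The per-GB prices come from the text; the 100 TB volume is a hypothetical example chosen only for illustration.

```python
def storage_cost_usd(volume_gb, price_per_gb):
    """Return the raw disk cost of holding volume_gb at the given price per GB."""
    return volume_gb * price_per_gb


PRICE_2010 = 0.10  # USD per GB, as quoted in the text
PRICE_2020 = 0.01  # USD per GB, as quoted in the text

volume_gb = 100 * 1000  # hypothetical 100 TB data set, using 1 TB = 1,000 GB

cost_2010 = storage_cost_usd(volume_gb, PRICE_2010)
cost_2020 = storage_cost_usd(volume_gb, PRICE_2020)

# Constant yearly price decline that takes the 2010 price to the 2020 price.
annual_decline = 1 - (PRICE_2020 / PRICE_2010) ** (1 / 10)

print(f"2010 cost: ${cost_2010:,.0f}; 2020 cost: ${cost_2020:,.0f}")
print(f"Implied yearly price decline: {annual_decline:.1%}")
```

The same data set costs a tenth as much to keep, which is what makes storing first and deciding later economically viable.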

The rise of artificial intelligence

The rise of Artificial Intelligence (AI) is the fourth ingredient of the perfect storm. AI systems are not new to the world. In fact, their genesis goes back to the 1950s, when statistical models were used to estimate values of data points based on past data. This field was out of focus for an extended period, as the computing power and large corpus of data required to run these models were not available.

Figure 1.6 – Timeline of the evolution of AI

However, after a long hibernation, AI technologies saw a resurgence in the early 2010s. This resurgence was partly due to the abundance of powerful computing resources and the equally abundant availability of data. AI models could now be trained faster, and the results were stunningly accurate.

Reduced storage costs and more readily available computing resources were a boon for AI. More and more complex models could now be trained.

Figure 1.7 – Accuracy of AI systems in matching humans for image recognition

This was especially true for deep learning algorithms. For instance, a deep learning technique called Convolutional Neural Networks (CNNs) has become very popular for image recognition. Over time, deeper and deeper neural networks were created. On some image-recognition benchmarks, AI systems have now surpassed human beings in detecting objects.

As AI systems became more accurate, they gained in popularity. This fueled a self-reinforcing cycle, and more and more businesses began employing AI in their digital transformation agendas.

The advancement of cloud computing

The fifth ingredient for the perfect "data" storm is the rise of cloud computing. Cloud computing is the on-demand availability of computing and storage resources. The typical public cloud service providers include big technology companies such as Amazon (AWS), Microsoft (Azure), and Google (GCP). Cloud computing eliminates the need to host large servers for computing and storage in the organization's own data center. Depending on the service subscribed to in the cloud, organizations can also reduce their dependencies on software and hardware maintenance. The cloud provides a plethora of on-demand services at a very economical price point. Cloud adoption has risen steadily since 2010. Worldwide spending on public clouds started at around $77 billion in 2010 and reached around $441 billion in 2020. Cloud computing also enabled the rise of the Digitally Native Business (DNB). It propelled the rise of organizations such as Uber, Deliveroo, TikTok, and Instagram, to name a few.

Cloud computing has been a boon for data. With the rise of cloud computing, data can now be stored at a fraction of the cost. The comparatively limitless compute power that the cloud provides translates into the ability to rapidly transform data. Cloud computing also provides innovative data platforms that can be utilized at a click of a button.

These five ingredients crossed paths at an opportune moment to challenge the existing data architecture patterns. The perfect "data" storm facilitated the rise of a new data architecture paradigm focused on big data, the data lake.

Investigating the data lake era

The genesis of the data lake goes back to 2004, when Google researchers Jeffrey Dean and Sanjay Ghemawat published a paper titled MapReduce: Simplified Data Processing on Large Clusters. This paper laid the foundation of a new technology that evolved into Hadoop, whose original authors are Doug Cutting and Mike Cafarella.

Hadoop was later incorporated into the Apache Software Foundation, a decentralized open source community of developers. Hadoop has been one of the top open source projects within the Apache ecosystem.

Hadoop was based on a simple concept – divide and conquer. The idea entailed three steps:

Split the data into multiple files and distribute them across the various nodes in a cluster.
Use compute nodes to process the data locally on each node of the cluster.
Use an orchestrator that communicates with each node and aggregates data for the final output.
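The three steps above can be sketched in a few lines. This is a toy illustration of the divide-and-conquer idea, not Hadoop's API: the "nodes" are simulated with threads in a single process, the word-count task is a conventional example, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distribute(records, node_count):
    """Step 1: split the records into one partition per node (round-robin)."""
    return [records[i::node_count] for i in range(node_count)]


def process_locally(partition):
    """Step 2: a node counts the words in its own partition only."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def orchestrate(records, node_count=3):
    """Step 3: send work to every node and aggregate the partial results."""
    partitions = distribute(records, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_counts = list(pool.map(process_locally, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = [
    "data lake stores raw data",
    "data warehouse stores modeled data",
    "lakehouse combines lake and warehouse",
]
word_counts = orchestrate(lines)
print(word_counts["data"], word_counts["lake"])
```

The key property is that each partition is processed independently, so adding nodes increases throughput without changing the logic of the local step or the aggregation step.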

Over the years, this concept gained traction, and a new kind of paradigm emerged for analytics. This architecture paradigm is the data lake paradigm. A typical data lake pattern can be depicted in the following figure:

Figure 1.8 – A typical data lake pattern

This pattern addressed the challenges prevalent in the EDW pattern. The advantages that the data lake architecture pattern can offer are evident. The key advantages are as follows:

The data lake caters to both structured and unstructured data. The Hadoop ecosystem was primarily developed to store and process data formats such as JSON, text, and images. The EDW pattern was not designed to store or analyze these data types.
The data lake pattern can process large volumes of data at a relatively low cost. The volumes of data that data lakes can store and process are in the order of hundreds of terabytes (TB) or petabytes (PB). The EDW pattern found these large volumes of data challenging to store and process efficiently.
Data lakes can better address fast-changing business requirements. The evolving AI technologies can leverage data lakes better.
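One way to picture the first advantage is the approach commonly called schema-on-read: records land in the lake in their raw form, and a structure is applied only when a consumer reads them. The following is a minimal sketch under assumptions; the sample records, the field names, and the read_orders function are all hypothetical.

```python
import json

# Hypothetical raw zone: records are stored as they arrive, in mixed formats.
raw_zone = [
    '{"type": "order", "order_id": 1, "amount": 99.5}',
    '{"type": "clickstream", "user": "u42", "page": "/home"}',
    "free-text support ticket: customer cannot log in",
    '{"type": "order", "order_id": 2, "amount": "20"}',
]


def read_orders(raw_records):
    """Apply an order schema at read time and skip records that do not fit it."""
    orders = []
    for record in raw_records:
        try:
            doc = json.loads(record)
        except json.JSONDecodeError:
            continue  # unstructured text stays in the lake untouched
        if isinstance(doc, dict) and doc.get("type") == "order":
            orders.append(
                {"order_id": int(doc["order_id"]), "amount": float(doc["amount"])}
            )
    return orders


orders = read_orders(raw_zone)
total_amount = sum(order["amount"] for order in orders)
print(len(orders), total_amount)
```

Nothing is rejected at ingestion time, which gives flexibility to new consumers but also shifts the burden of data quality to whoever reads the data.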

This pattern is widely adopted as it is the need of the hour. However, it has its own challenges. A few challenges associated with this pattern are as follows:

It is easy for a data lake to become a data swamp. Data lakes take in data, any form of data, and store it in its raw form. The philosophy is to ingest data first and then figure out what to do with it. This causes easy slippage of governance, and it becomes challenging to govern the data lake. With no proper data governance, data starts to mushroom all over the place in a data lake, and soon it becomes a data swamp.
Data lakes also have challenges with the rapid evolution of technology. The data lake paradigm mainly relies on open source software. Open source software evolves rapidly into behemoths that can become too difficult to manage. The software is predominantly community-driven, and it doesn't have proper enterprise support. This causes a lot of maintenance overhead and implementation complexities. Many features that are demanded by enterprises are missing from open source software, for example, a robust security framework.
Data lakes focus a lot more on AI enablement than BI. It was natural that the open source software evolution focused more on enabling AI. AI was having its own journey and was riding the wave, cresting together with Hadoop. BI was seen as retro, as it was already mature in its life cycle.

Soon, it became evident that the data lake pattern alone wouldn't be sustainable in the long run. There was a need for a new paradigm that fuses these two patterns.

Introducing the data lakehouse paradigm

In 2006, Clive Humby, a British mathematician, coined the now-famous phrase, "Data is the new oil." It was akin to peering through a crystal ball and peeking into the future. Data is the lifeblood of organizations. The competitive advantage is defined by how an organization uses data. Data management is paramount in this age of digital transformation. More and more organizations are embracing digital transformation programs, and data is at the core of these transformations.

As discussed earlier, the paradigms of the EDW and data lakes were opportune for their times. They had their benefits and their challenges. A new paradigm needed to emerge that was disciplined at its core and flexible at its edges.

Figure 1.9 – Data lakehouse paradigm

The new data architectural paradigm is called the data lakehouse. It strives to combine the advantages of both the data lake and the EDW paradigms while minimizing their challenges.

An adequately architected data lakehouse delivers four key benefits.

Figure 1.10 – Benefits of the data lakehouse

It derives insights from both structured and unstructured data: The data lakehouse architecture should be able to store, transform, and integrate structured and unstructured data. It should be able to fuse them together and enable the extraction of valuable insights from the data.
It caters to different personas of the organization: Data is a dish with different tastes for different personas. The data lakehouse should cater to a range of organizational personas and fulfill their requirements for insights. A data scientist should get their playground for testing their hypotheses. An analyst should be able to analyze data using their tools of choice, and business users should be able to get their reports accurately and on time. It democratizes data for analytics.
It facilitates the adoption of a robust governance framework: The primary challenge with the data lake architecture pattern was the lack of a strong governance framework. It was easy for a data lake to become a data swamp. In contrast, an EDW architecture was stymied by too much governance for too little content. The data lakehouse architecture strives to strike the right governance balance. It seeks to apply the proper governance to each type of data and give access to the right stakeholders.
It leverages cloud computing: Data lakehouse architecture needs to be agile and innovative. The pattern needs to adapt to the changing organizational requirements and reduce the data-to-insight turnaround time. To achieve this agility, it is imperative to adopt cloud computing technology. Cloud computing platforms offer the innovativeness required. They provide the appropriate technology stack with scalability and flexibility, and fulfill the demands of a modern data analytics platform.

The data lakehouse paradigm addresses the challenges faced by the EDW and the data lake paradigms. Yet, it does have its own set of challenges that need to be managed. A few of those challenges are as follows:

Architectural complexity: Given that the data lakehouse pattern amalgamates the EDW and the data lake patterns, it is inevitable that it will have its fair share of architectural complexity. The complexity manifests in the form of the multiple components required to bring the pattern to fruition. Architectural patterns always involve trade-offs; it is vital to carefully weigh architectural complexity against the potential business benefit. The data lakehouse architecture needs to tread that path carefully.
Required holistic data governance: The challenges pertinent to the data lake paradigm do not magically go away with the data lakehouse paradigm. The biggest challenge of a data lake was that it was prone to becoming a data swamp. As the data lakehouse grows in its scope and complexity, the lack of a holistic governance framework is a sure-shot way of creating a swamp out of a data lakehouse.
Balancing flexibility with discipline: The data lakehouse paradigm strives to be flexible and to adapt to changing business requirements with agility. The ethos under which it operates is to have discipline at the core and flexibility at the edges. Achieving this objective is a careful balancing act that clearly defines the limits of flexibility and the strictness of discipline. The data lakehouse stewards play an essential role in ensuring this balance.

Let's recap what we've discussed in this chapter.