Data Lake Development with Big Data

Chapter 1. The Need for Data Lake

In this chapter, we will understand the rationale behind building a Data Lake in an organization that has huge data assets. The following topics will be covered in this chapter:

Explore the emerging need for Data Lake by understanding the limitations of the traditional architectures
Decipher how a Data Lake addresses the inadequacies of traditional architectures and provides significant benefits in terms of time and cost
Understand what a Data Lake is and also its architecture
Practical guidance on the key points to consider before deciding to build a Data Lake
Understand the key components that could be a part of a Data Lake and comprehend how crucial each of these components is to building a successful Data Lake

Before the Data Lake

In this section, let us quickly look at how the Data Lake has evolved from a historical perspective.

From the time data-intensive applications were used to solve business problems, we have seen many evolutionary steps in the way data has been stored, managed, analyzed, and visualized.

The earlier systems were designed to answer questions about the past. Questions such as "What were my total sales in the last year?" were answered by machines built around monolithic processors that ran COBOL and accessed data from tapes and disks. With the arrival of faster processors and better storage, businesses were able to slice and dice data to find fine-grained answers from subsets of data; these questions resembled "What was the sales performance of x unit in y geography in z timeframe?"

If we extract one common pattern, all the earlier systems were developed for business users, in order to help them make decisions for their businesses. The current breed of data systems empowers people like you and me to make decisions and improve the way we live. This is a fundamental paradigm shift brought about by advances in myriad technologies.

For many of us, the technologies that run in the background are transparent, while we consult applications that help us make decisions that alter our immediate future profoundly. We use applications to help us navigate to an address (mapping), decide on our holidays (weather and holiday planning sites), get a summary of product reviews (review sites), get similar products (recommendation engines), connect and grow professionally (professional social networks), and the list goes on.

All these applications use enabling technologies that understand natural languages, process humungous amounts of data, store and effortlessly process our personal data such as images and audio, and even extract intelligence from them by tagging our faces and finding relationships. Each of us, in a way, contributes to the flooding of these application servers with our personal data in the form of our preferences, likes, affiliations, networks, hobbies, friends, images, and videos.

If we can attribute one fundamental cause for today's explosion of data, it should be the proliferation of ubiquitous internet connectivity and the Smartphone; with it comes the exponential number of applications that transmit and store a variety of data.

Juxtaposing the growth of Smartphones and the internet with the rapid decline of storage costs and the rise of cloud computing, which also brings down processing costs, we can immediately see that the traditional data architectures do not scale to handle this volume and variety of data and thus cannot answer the questions that you and I want to ask. They work well, extremely well, for business users, but not directly for us.

In order to democratize the value hidden in data and thus empower common customers to use data for day-to-day decision making, organizations should first store and extract value from the different types of data being collected in such huge quantities. For all this to happen, the following two key developments have had a revolutionary impact:

The development of distributed computing architectures that can scale linearly and perform computations at an unbelievable pace
The development of new-age algorithms that can analyze natural languages, comprehend the semantics of the spoken words and special types, run Neural Nets, perform deep learning, graph social network interactions, perform constraint-based stochastic optimization, and so on

Earlier systems were simply not architected to scale linearly and to store and analyze this many types of data. They are good for the purpose they were initially built for. They excelled as a historical data store that can offload structured data from Online Transaction Processing (OLTP) systems, perform transformations, cleanse it, slice-dice and summarize it, and then feed it to Online Analytical Processing (OLAP) systems. Business Intelligence tools consume the exhaust of the OLAP systems and spew good-looking reports religiously at regular intervals so that business users can make decisions.

We can immediately grasp the glaring differences between the earlier systems and the new age systems by looking at these major aspects:

The storage and processing differs in the way it scales (distributed versus monolithic)
In earlier systems, data is managed in relational systems versus NoSQL, MPP, and CEP systems in the new age Big Data systems
Traditional systems cannot handle high-velocity data that is efficiently ingested and processed by Big Data applications
Structured data is predominantly used in earlier systems versus unstructured data being used in Big Data systems along with structured data
Traditional systems have limitations around the scale of data that they can handle; Big Data systems are scalable and can handle humongous amounts of data
Earlier systems rely on traditional analytic algorithms such as linear and logistic regression, whereas Big Data systems regularly use cutting-edge algorithms such as random forests and other ensemble methods, stochastic optimization, deep learning, and NLP
Reports and drilldowns are the mainstay, versus the advanced visualizations such as Tag cloud and Heat map, which are some of the choicest reporting advances in the Big Data era

Data Lake is one such architecture that has evolved to address the need of the organizations to adapt to the new business reality. Organizations today listen to the customer's voice more than ever; they are sensitive to customer feedback and negative remarks—it hurts their bottom line if they don't. Organizations understand their customers more intimately than ever before—they know your every move, literally, through behavioral profiling. Finally, organizations use all the data at their disposal to help customers leverage it for their personal benefit. In order to catch up with the changing business landscape, there is immense potential for building a Data Lake to store, process, and analyze huge amounts of structured and unstructured data.

The following figure elucidates the vital differences between traditional and Big Data systems:

Traditional versus Big Data systems

Need for Data Lake

Now that we have glimpsed the past and understood how various systems evolved over time, let us explore a few important reasons why Data Lakes have evolved and what problems they try to address. Let's start with a contextual overview.

One of the key driving forces behind the onslaught of Big Data is the rapid spread of unstructured data (which, by commonly quoted estimates, constitutes about 90 percent of the data). The increase in mobile phones, wider internet coverage, faster data networks, cheaper cloud storage, and falling compute/storage prices all contribute to the spurt of Big Data in recent years. A few studies suggest that we now produce as much data every 15 minutes as was created from the beginning of time to the year 2003. This roughly coincides with the proliferation of mobility and cloud usage.

Big Data is not only about massive data capture and storage at a cheaper price point; the real value of storing Big Data comes from intelligently combining the historical data that already exists inside an organization with the unstructured data. This helps in gaining new and better insights that improve business performance.

For example, in retail, it could imply better and rapid services to customers; in R&D, it could imply performing iterative tests over much larger samples in a faster way; in healthcare, it could imply quicker and more precise diagnoses of illnesses.

For an organization to really succeed in reaping the latent benefits of Big Data, it needs two basic capabilities:

Technology should be in place to enable organizations to acquire, store, combine, and enrich huge volumes of unstructured and structured data in raw format
Ability to perform analytics, real-time and near-real-time analysis at scale, on these huge volumes in an iterative way

To address the preceding two business needs, the Data Lake has become one of the empowering data capture and processing capabilities for Big Data analytics.

The Data Lake makes it possible to store all the data, ask complex and radically bigger business questions, and find out hidden patterns and relationships from the data.

Using a traditional system, an enterprise may have no way to find out whether there is any hidden value in the data that it is not storing right now or is letting go as waste. We don't really know what hidden value this data contains at the time of data acquisition. We might know a minuscule percentage of the questions to ask at data acquisition time, but we can never know what questions could materialize at a later point in time. Essentially, a Data Lake tries to address this core business problem.

While reasons abound to explain the need for a Data Lake, one of the core reasons is the dramatic decrease in storage costs, which enables organizations to store humongous amounts of data.

Let us look at a few reasons for the emergence of the Data Lake with reference to the aspects in which the traditional approaches fail:

The traditional data warehouse (DW) systems are not designed to integrate, scale, and handle this exponential growth of multi-structured data. With the emergence of Big Data, there is a need to bring together data from disparate sources and to derive meaning from it; new types of data such as social text, audio, video, sensor, and clickstream data have to be integrated to find out complex relationships in the data.
The traditional systems lack the ability to integrate data from disparate sources. This leads to a proliferation of data silos, due to which business users view data from differing perspectives, which eventually prevents them from making precise and appropriate decisions.
The schema-on-write approach followed by traditional systems mandates that the data model and analytical framework be designed before any data is loaded. Upfront data modeling fails in a Big Data scenario as we are unaware of the nature of incoming data and of the exploratory analysis that has to be performed in order to gain hidden insights. Analytical frameworks are designed to answer only specific questions identified at design time. This approach does not allow for data discovery.
With traditional approaches, optimization for analytics is time consuming and incurs huge costs. Such optimization enables known analytics, but fails when there are new requirements.
In traditional systems, it is difficult to identify what data is available and to integrate the data to answer a question. Metadata management and lineage tracking of data are either not available or difficult to implement; manual recreation of data lineage is error-prone and consumes a lot of time.
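
The contrast between schema-on-write and the schema-on-read alternative can be illustrated with a minimal sketch. The event records, field names, and the read_with_schema helper are hypothetical, invented only for illustration; the point is that raw records are stored untouched and each consumer projects its own schema onto them at read time.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
# (Hypothetical sample data, invented for illustration.)
RAW_EVENTS = [
    '{"user": "u1", "action": "view", "product": "p9"}',
    '{"user": "u2", "action": "buy", "product": "p9", "price": 19.5}',
    '{"user": "u1", "action": "review", "text": "works well"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time by projecting each record onto `fields`.

    Fields missing from a record become None instead of failing the load.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two consumers read the same raw data through two different schemas.
purchases = [r for r in read_with_schema(RAW_EVENTS, ["user", "price"])
             if r["price"] is not None]
reviews = [r for r in read_with_schema(RAW_EVENTS, ["user", "text"])
           if r["text"] is not None]
```

Neither consumer had to be known when the events were written, which is the property that upfront data modeling cannot offer.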

Defining Data Lake

In the preceding sections, we had a quick overview of how the traditional systems evolved over time and understood their shortcomings with respect to the newer forms of data. In this section, let us discover what a Data Lake is and how it addresses the gaps masquerading as opportunities.

A Data Lake has flexible definitions. At its core, it is a data storage and processing repository in which all of the data in an organization can be placed, so that data from every internal and external system, partner, and collaborator flows into it and insights spring out.

The following list summarizes what a Data Lake is:

Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze.
Data Lake is not Hadoop. It uses different tools. Hadoop only implements a subset of functionalities.
Data Lake is not a database in the traditional sense of the word. A typical implementation of Data Lake uses various NoSQL and In-Memory databases that could co-exist with its relational counterparts.
A Data Lake cannot be implemented in isolation. It has to be implemented alongside a data warehouse as it complements various functionalities of a DW.
It stores large volumes of both unstructured and structured data. It also stores fast-moving streamed data from machine sensors and logs.
It advocates a Store-All approach to huge volumes of data.
It is optimized for data crunching with a high-latency batch mode and it is not geared for transaction processing.
It helps in creating data models that are flexible and could be revised without database redesign.
It can quickly perform data enrichment that helps in achieving data enhancement, augmentation, classification, and standardization of the data.
All of the data stored in the Data Lake can be utilized to get an all-inclusive view. This enables near-real-time, more precise predictive models that go beyond sampling and aid in generating multi-dimensional models too.
It is a data scientist's favorite hunting ground. Data scientists get access to the data in its raw form at its most granular level, so that they can run ad hoc queries and build advanced models iteratively at any time. The classic data warehouse approach does not support this ability to condense the time between data intake and insight generation.
It makes it possible to model the data in more than the traditional relational way; the real value from the data can emanate from modeling it in the following ways:
- As a graph to find the interactions between elements; for example, Neo4j
- As a document store to cluster similar text; for example, MongoDB
- As a columnar store for fast updates and search; for example, HBase
- As a key-value store for lightning-fast search; for example, Riak
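
As a rough illustration of the four modeling styles listed above, the following sketch shapes the same small set of records in each way using plain Python structures. It does not use Neo4j, MongoDB, HBase, or Riak; the records and field names are hypothetical and only show how one raw dataset can be viewed through different models.

```python
from collections import defaultdict

# Hypothetical raw interaction records, invented for illustration.
records = [
    {"id": "r1", "user": "alice", "follows": "bob", "text": "great phone"},
    {"id": "r2", "user": "bob", "follows": "carol", "text": "battery drains fast"},
]

# Graph view: an adjacency list of who follows whom.
graph = defaultdict(set)
for rec in records:
    graph[rec["user"]].add(rec["follows"])

# Document view: each record is kept whole and addressed by its id.
documents = {rec["id"]: rec for rec in records}

# Columnar view: one list per attribute, convenient for scanning one column.
columns = defaultdict(list)
for rec in records:
    for key, value in rec.items():
        columns[key].append(value)

# Key-value view: an opaque value looked up by a single key.
key_value = {f"text:{rec['id']}": rec["text"] for rec in records}
```

Each view answers a different kind of question: the graph answers "who is connected to whom", the columns answer "what values does this attribute take", and the key-value map answers a single direct lookup.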

A key attribute of a Data Lake is that data is not classified when it is stored. As a result, the data preparation, cleansing, and transformation tasks are eliminated at ingestion time; these tasks generally take the lion's share of time in a Data Warehouse. Storing data in its rawest form enables us to find answers from the data to questions that we do not know yet, whereas a traditional data warehouse is optimized for answering questions that we already know, so preparation of the data is a mandatory step there.

This reliance on raw data makes it easy for the business to consume just what it wants from the lake and refine it for the purpose. Crucially, in the Data Lake, the raw data makes multiple perspectives possible on the same source so that everyone can get their own viewpoints on the data, in a manner that enables their local business's success.

This flexibility of storing all data in a single Big Data repository and creating multiple viewpoints requires that the Data Lake implement controls for corporate data consistency. To achieve this, targeted information governance policies are enforced. Using Master Data Management (MDM), Research Data Management (RDM), and other security controls, corporate collaboration and access controls are implemented.

Key benefits of Data Lake

Having understood the need for the Data Lake and the business/technology context of its evolution, let us now summarize the important benefits in the following list:

Scale as much as you can: Theoretically, the HDFS-based storage of Hadoop gives you the flexibility to support arbitrarily large clusters while maintaining a constant price per performance curve even as it scales. This means that your data storage can be scaled horizontally to cater to any need at a judicious cost. To gain more space, all you have to do is add new nodes to the cluster, and Hadoop scales seamlessly. Hadoop brings you the incredible facility to run the code close to storage, allowing quicker processing of massive data sets. The usage of Hadoop for underlying storage makes the Data Lake more scalable at a better price point than Data Warehouses by an order of magnitude. This allows for the retention of huge amounts of data.
Plug in disparate data sources: Unlike a data warehouse that can only ingest structured data, a Hadoop-powered Data Lake has an inherent ability to ingest multi-structured and massive datasets from disparate sources. This means that the Data Lake can store literally any type of data such as multimedia, binary, XML, logs, sensor data, social chatter, and so on. This is one huge benefit that removes data silos and enables quick integration of datasets.
Acquire high-velocity data: In order to efficiently stream high-speed data in huge volumes, the Data Lake makes use of tools that can acquire and queue it. The Data Lake utilizes tools such as Kafka, Flume, Scribe, and Chukwa to acquire high-velocity data. This data could be the incessant social chatter in the form of Twitter feeds, WhatsApp messages, or it could be sensor data from the machine exhaust. This ability to acquire high-velocity data and integrate with large volumes of historical data gives Data Lake the edge over Data Warehousing systems, which could not do any of these as effectively.
Add a structure: To make sense of vast amounts of data stored in the Data Lake, we should create some structure around the data and pipe it into analysis applications. A structure can be applied to unstructured data either while it is being ingested or after it has been stored in the Data Lake. A structure such as the metadata of a file, word counts, parts of speech tagging, and so on, can be created out of the unstructured data. The Data Lake gives you a unique platform on which to apply a structure to varied datasets in the same repository in richer detail, enabling you to process the combined data in advanced analytic scenarios.
Store in native format: In a Data Warehouse, the data is premodeled as cubes that are the best storage structures for predetermined analysis routines at the time of ingestion. The Data Lake eliminates the need for data to be premodeled; it provides iterative and immediate access to the raw data. This enhances the delivery of analytical insights and offers unmatched flexibility to ask business questions and seek deeper answers.
Don't worry about schema: Traditional data warehouses do not support the schemaless storage of data. The Data Lake leverages Hadoop's simplicity in storing data based on schemaless write and schema-based read modes. This is very helpful for data consumers to perform exploratory analysis and thus, develop new patterns from the data without worrying about its initial structure and ask far-reaching, complex questions to gain actionable intelligence.
Unleash your favorite SQL: Once the data is ingested, cleansed, and stored in a structured SQL storage of the Data Lake, you can reuse the existing PL/SQL scripts. Tools such as HAWQ and Impala give you the flexibility to run massively parallel SQL queries while simultaneously integrating with advanced algorithm libraries such as MADlib and applications such as SAS. Performing the SQL processing inside the Data Lake decreases the time to results and also consumes far fewer resources than performing SQL processing outside of it.
Advanced algorithms: Unlike a data warehouse, the Data Lake excels at utilizing the availability of large quantities of coherent data along with deep learning algorithms to recognize items of interest that will power real-time decision analytics.
Administrative resources: The Data Lake scores better than a data warehouse in reducing the administrative resources required for pulling, transforming, aggregating, and analyzing data.
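
The "Add a structure" benefit in the preceding list can be illustrated with a minimal sketch that derives simple metadata and word counts from a piece of free text. The describe_text function, the file name, and the sample review are hypothetical; a real pipeline would use far richer techniques such as parts of speech tagging.

```python
import re
from collections import Counter


def describe_text(name, text):
    """Derive simple structure (metadata and word counts) from free text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "name": name,
        "size_bytes": len(text.encode("utf-8")),
        "word_count": len(words),
        "top_words": Counter(words).most_common(2),
    }


# Hypothetical unstructured input, invented for illustration.
summary = describe_text("review-001.txt", "Good phone. Good camera, poor battery.")
```

The resulting dictionary is a structured record that can sit next to the raw text in the same repository and be joined with other datasets.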

When to go for a Data Lake implementation

In the preceding section, the top benefits of the Data Lake were brought to light and we looked at how their application takes on strategic importance in an organization.

In this section, we will try to enumerate a few key quick reference scenarios where Data Lake can be recommended as a go-to solution. Here are a few scenarios:

Your organization is planning to extract insights from huge volumes of data or from high-velocity data that a traditional data warehouse is incapable of handling.
The business landscape is forcing your organization to adapt to market challenges by making you handle the demand for new products at a moment's notice and you have to get insights really fast.
Your organization needs to build data products that use new data that is not yet prepared and structured. As new data becomes available, you may need to incorporate it straightaway; it probably cannot wait for a schema change, the building of an extension, and the delays these bring, because the insight is needed right now.
Your organization needs a dynamic approach in extracting insights from data where business units can tap or purify the required information when they need it.
Your organization is looking for ways to reduce the total ownership cost of a data warehouse implementation by leveraging a Data Lake that significantly lowers storage, operational, network, and computing costs and produces better insights.
Your organization needs to improve its topline and wants to augment internal data (such as customer data) with external data (social media and nontraditional data) from a variety of sources. This can get a broader customer view and better behavioral profile of the customer, resulting in quicker customer acquisition.
The organization's data science/advanced analytics teams want to preserve the integrity and fidelity of the original data and need lineage tracking of data transformations to capture the origin of a specific datum and to track the lifecycle of the data as it moves through the data pipeline.
There is a pressing need for the structuring and standardization of Big Data for new and broader data enrichment.
There is a need for near real-time analytics for faster/better decisions and point-of-service use.
Your organization needs an integrated data repository for plug-and-play implementation of new analytics tools and data products.
Your data science/advanced analytics teams regularly need quick provisioning of data without having to be in an endless queue; Data Lake's capability called Data as a Service (DaaS) could be a solution.

Data Lake architecture

The previous sections made an effort to introduce you to the high-level concepts of the whys and whats of a Data Lake. We have now come to the last section of this chapter where you will be exposed to the internals of a Data Lake. We will take a deep dive into the architecture of Data Lake and understand the key components.

Architectural considerations

In our experience, it is practically difficult to come up with a one-size-fits-all architecture for a Data Lake. In every assignment that we have worked on earlier, we had to deal with specific tailored requirements that made us adapt the architecture to the use case.

The reason why there are multiple interpretations of the Data Lake architecture is that it totally depends on the following factors that are specific to an organization and also the business questions that the Data Lake ought to solve. To realize any of the combinations of these factors in the Data Lake, we tweaked the architecture. Here is a quick list:

Type of data ingest (real-time ingest, micro-batch ingest, and batch ingest)
Storage tier (raw versus structured)
Depth of metadata collection
Breadth of data governance
Ability to search for data
Structured data storage (SQL versus a variety of NoSQL databases such as graph, document, and key-value stores)
Provisioning of data access (external versus internal)
Speed to insights (optimized for real-time versus batch)
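
One way to keep these factors visible during design is to record them explicitly. The following is only a sketch under assumptions: the LakeProfile class, its field names, and the single consistency check are hypothetical and cover just a few of the factors listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Ingest(Enum):
    REAL_TIME = "real-time"
    MICRO_BATCH = "micro-batch"
    BATCH = "batch"


@dataclass(frozen=True)
class LakeProfile:
    """Hypothetical checklist of some architectural factors for one lake."""
    ingest: Ingest
    raw_storage: bool
    structured_stores: tuple  # for example ("sql", "graph", "document")
    external_access: bool
    optimized_for_real_time: bool

    def tensions(self):
        """Flag combinations that pull the design in opposite directions."""
        issues = []
        if self.optimized_for_real_time and self.ingest is Ingest.BATCH:
            issues.append("real-time insight is expected, but ingest is batch-only")
        return issues


profile = LakeProfile(
    ingest=Ingest.BATCH,
    raw_storage=True,
    structured_stores=("sql", "document"),
    external_access=False,
    optimized_for_real_time=True,
)
conflicts = profile.tensions()
```

Writing the choices down in one place makes competing requirements surface early, before they are baked into the architecture.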

As you can decipher from the preceding points, there are many competing and contradictory requirements that go into building Data Lake capability. Architecting a full-blooded, production-ready Data Lake in reality takes these combinations of requirements into consideration and puts the best foot forward.

For the purpose of this book, we prefer taking a median approach for architecting a Data Lake. We believe that this approach would appeal to most of the readers who want to grasp the overarching potential of a fully blown Data Lake in the way it should be in its end state, rather than being tied down to narrow interpretations of specific business requirements and overfitting the architecture.

Architectural composition

For ease of understanding, we can abstract away much of the detail and think of the Data Lake as composed of three layers and three tiers.

Layers are the common functionality that cuts across all the tiers. These layers are listed as follows:

Data Governance and Security Layer
Metadata Layer
Information Lifecycle Management Layer

Tiers are abstractions for a similar functionality grouped together for the ease of understanding. Data flows sequentially through each tier. While the data moves from tier to tier, the layers do their bit of processing on the moving data. The following are the three tiers:

Intake Tier
Management Tier
Consumption Tier
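
The relationship between tiers and layers can be sketched as a small pipeline in which a record moves through the three tiers in order while each of the three layers observes it at every step. The function names and the record shape are hypothetical; real tiers and layers do far more than append to a list.

```python
def make_tier(name):
    """Return a tier function that appends its own name to the record's path."""
    def tier(record):
        return {**record, "path": record.get("path", []) + [name]}
    tier.tier_name = name
    return tier


def make_layer(name, log):
    """Return a cross-cutting layer that observes every record at every tier."""
    def layer(tier_name, record):
        log.append((name, tier_name, record["id"]))
    return layer


layer_log = []
TIERS = [make_tier("intake"), make_tier("management"), make_tier("consumption")]
LAYERS = [
    make_layer("governance_security", layer_log),
    make_layer("metadata", layer_log),
    make_layer("lifecycle", layer_log),
]


def run_through_lake(record):
    """Move a record through the tiers in order; each layer acts after each tier."""
    for tier in TIERS:
        record = tier(record)
        for layer in LAYERS:
            layer(tier.tier_name, record)
    return record


result = run_through_lake({"id": "evt-1"})
```

After the run, the record's path lists the three tiers in sequence and the log holds nine entries, one per layer per tier, which mirrors the idea that layers cut across all tiers.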

The following figure simplifies the representation of each tier in relation to the layers:

Data Lake end state architecture

Architectural details

In this section, let us go deeper and understand each of the layers and tiers of the Data Lake.

Understanding Data Lake layers

In this section, we will gain a high-level understanding of the relevance of the three horizontal layers.

The Data Governance and Security Layer

The Data Governance and Security layer assigns responsibility for governing access to data and the rights to define and modify it. This layer makes sure that there is a well-documented process for the change and access control of all the data artifacts. The governance mechanism oversees methods for creation, usage, and tracking of the data lineage across various tiers of the Data Lake so that it can be combined with the security rules.

As the Data Lake stores a lot of data from various sources, the Security layer ensures that appropriate access control and authentication provide access to data assets on a need-to-know basis. In a practical scenario, if the data consists of both transaction and historical data, along with customer, product, and finance data, sourced both internally and from third parties, the security layer ensures that each subject area of the data has the applicable level of security.

This layer ensures appropriate provisioning of data with relevant security measures put in place. Hadoop's security is taken care of by its built-in integration with Kerberos, which makes it possible to ensure that users are authenticated before they access the data or compute resources.
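
The need-to-know principle per subject area can be sketched as a deny-by-default lookup. This sketch assumes the user has already been authenticated (for example, through Kerberos) and only illustrates the authorization step; the policy table, role names, and can_read function are hypothetical.

```python
# Hypothetical policy: which roles may read each subject area.
SUBJECT_AREA_POLICY = {
    "finance": {"finance_analyst", "auditor"},
    "customer": {"marketing_analyst", "data_scientist"},
    "product": {"marketing_analyst", "data_scientist", "finance_analyst"},
}


def can_read(user_roles, subject_area):
    """Need-to-know check: allow only when a role is explicitly granted.

    Unknown subject areas have no granted roles, so access is denied.
    """
    allowed = SUBJECT_AREA_POLICY.get(subject_area, set())
    return bool(allowed & set(user_roles))
```

For instance, can_read(["auditor"], "finance") is allowed, while a marketing analyst asking for finance data is refused.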

The following figure shows the capabilities of the Data Governance and Security layer:

The Data Governance and Security layer

The Information Lifecycle Management layer

As the Data Lake advocates a Store-All approach to huge volumes of Big Data, it is tempting to store everything in it. The Information Lifecycle Management (ILM) layer ensures that there are rules governing what we can or cannot store in the Data Lake. This is because over longer periods of time, the value of data tends to decrease and the risks associated with storage increase. It does not make practical sense to fill the lake continuously, without some plan to down tier the data that has passed its use-by date; this is exactly what the ILM layer strives to achieve.

This layer primarily defines the strategy and policies for classifying which data is valuable and how long we should store a particular dataset in the Data Lake. These policies are implemented by tools that automatically purge, archive, or down tier data based on the policy that applies to its classification.
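
A retention policy of this kind can be sketched as an age-based rule per data class. The data classes, the thresholds in days, and the lifecycle_action function are hypothetical values chosen for illustration, not recommendations.

```python
from datetime import date

# Hypothetical retention thresholds, in days, per data class.
POLICY = {
    "clickstream": {"down-tier": 30, "archive": 180, "purge": 365},
    "transactions": {"down-tier": 365, "archive": 1825, "purge": 3650},
}


def lifecycle_action(data_class, created, today):
    """Return the action for a dataset of the given class and creation date.

    The most drastic action whose threshold has been reached wins.
    """
    limits = POLICY[data_class]
    age_days = (today - created).days
    for action in ("purge", "archive", "down-tier"):
        if age_days >= limits[action]:
            return action
    return "keep"
```

A scheduler could call such a function for every dataset registered in the lake and hand the result to the tool that performs the actual move or deletion.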

The following figure depicts the high-level functionalities of the Information Lifecycle Management layer:

Information Lifecycle Management layer

The Metadata Layer

The Data Lake stores large quantities of structured and unstructured data and there should be a mechanism to find out the linkages between what is stored and what can be used by whom. The Metadata Layer is the heart of the Data Lake. The following list elucidates the essence of this layer:

The Metadata layer captures vital information about the data as it enters the Data Lake and indexes this information so that users can search metadata before they access the data itself. Metadata capture is fundamental to make data more accessible and to extract value from the Data Lake.
This layer provides vital information to the users of the Data Lake about the background and significance of the data stored in the Data Lake. For instance, data consumers could use the metadata to find out whether a million tweets are more valuable than a thousand customer records. This is accomplished by intelligently tagging every bit of data as it is ingested.
A well-built metadata layer will allow organizations to harness the potential of the Data Lake and deliver the following mechanisms to the end users to access data and perform analytics:
- Self-Service BI (SSBI)
- Data as a Service (DaaS)
- Machine Learning as a Service (MLaaS)
- Data Provisioning (DP)
- Analytics Sandbox Provisioning (ASP)
The Metadata layer defines the structure for files in a Raw Zone and describes the entities inside the files. Using this base-level description, the schema evolution of the file/record is tracked by a versioning scheme. This will eventually allow you to create associations among several entities and, thereby, facilitate browsing and searching.
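
The capture, search, and schema versioning ideas in the preceding list can be sketched with a small in-memory catalog. The MetadataCatalog class, its method names, and the sample paths are hypothetical; a production metadata layer would persist and index this information.

```python
from datetime import datetime, timezone


class MetadataCatalog:
    """Hypothetical in-memory catalog: one list of versions per file path."""

    def __init__(self):
        self._entries = {}

    def register(self, path, source, tags, schema):
        """Record metadata at ingestion; a re-registered path gets a new version."""
        versions = self._entries.setdefault(path, [])
        versions.append({
            "version": len(versions) + 1,
            "source": source,
            "tags": set(tags),
            "schema": list(schema),
            "registered_at": datetime.now(timezone.utc),
        })
        return versions[-1]["version"]

    def search(self, tag):
        """Find paths by tag using metadata only, without touching the data."""
        return sorted(path for path, versions in self._entries.items()
                      if tag in versions[-1]["tags"])

    def schema_history(self, path):
        """Return (version, schema) pairs so schema evolution can be tracked."""
        return [(v["version"], v["schema"]) for v in self._entries[path]]


catalog = MetadataCatalog()
catalog.register("raw/crm/customers.csv", "crm", ["customer", "pii"], ["id", "name"])
catalog.register("raw/crm/customers.csv", "crm", ["customer", "pii"],
                 ["id", "name", "email"])
catalog.register("raw/twitter/tweets.json", "twitter", ["social"], ["id", "text"])
```

Searching for the "pii" tag returns only the customer file, and its schema history shows two versions, the second of which adds the email field.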

The following figure illustrates the various capabilities of the Metadata Layer:

The Metadata Layer

Understanding Data Lake tiers

In the following section, let us take a short tour of each of the three tiers in the Data Lake.

The Data Intake tier

The Data Intake tier includes all the processing services that connect to external sources and the storage area for acquiring variant source data in consumable increments.

The Intake tier has three zones, and the data flows sequentially through these zones. The zones in the Intake tier are as follows:

Source System Zone
Transient Landing Zone
Raw Zone

Let us examine each zone of this tier in detail:

The Source System Zone

The processing services that are needed to connect to external systems are encapsulated in the Source System Zone. This zone primarily deals with the connectivity and acquires data from the external source systems.

In the Source System Zone, the timeliness of data acquisition from the external sources is determined by specific application requirements. In certain classes of applications, it is required to pull log/sensor data in near-real-time and flag anomalies in real-time. In other classes of applications, it is fine to live with batch data trickling in at intervals as long as a day; this class uses all the historical data to perform analysis. The Data Intake tier, therefore, should be architected with consideration for the wide latitude in storage requirements of the aforementioned application needs.

The following figure depicts the three broad types of data that would be ingested and categorized by their timeliness:

The timeliness of Data

The Data Intake tier also contains the required processing that can "PULL" data from external sources and also consume the "PUSHED" data from external sources.

The data sources from which data can be "PULLED" by the Intake tier include the following:

Operational Data Stores (ODS)
Data Warehouses
Online Transaction Processing Systems (OLTP)
NoSQL systems
Mainframes
Audio
Video

Data sources that can "PUSH" data to the Intake tier include the following:

Clickstream and machine logs such as Apache common logs
Social media data from Twitter and so on
Sensor data such as temperature, body sensors (Fitbit), and so on

The Transient Zone

A Transient Landing Zone is a predefined, secured intermediate location where the data from various source systems is stored before it is moved into the Raw Zone. Generally, the Transient Landing Zone is file-based storage in which the data is organized by source system. Record counts and file-size checks are carried out on the data in this zone before it is moved into the Raw Zone.

In the absence of a Transient Zone, the data would have to go directly from the external sources to the Raw Zone, which could severely hamper the quality of data in the Raw Zone. The Transient Zone also offers a platform for carrying out minimal data validation checks. Let us explore the following capabilities of the Transient Zone:

A Transient Zone consolidates data from multiple sources, waits until a batch of data has "really" arrived, creates a basic aggregate by grouping together all the data from a single source, tags the data with metadata to indicate the source of origin, and generates timestamps and other relevant information.
It performs a basic validity check on the data that has just arrived and signals the retry mechanism to kick in if the integrity of the data is in question. MD5 checks and record counts can be employed to facilitate this step.
It can even perform a high-level cleansing of data by removing or updating invalid data acquired from source systems (purely an optional step). It is a prime location for validating the quality of data from an external source, which supports eventual auditing and tracking down of data issues.
It can support data archiving. There are situations in which the freshly acquired data is deemed not-so-important and thus can be relegated to an archive directly from the Transient Zone.
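
The validity check and retry behavior described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each non-empty line is assumed to be one record, and the expected MD5 digest and record count are assumed to be supplied by the source system alongside the file.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of a payload."""
    return hashlib.md5(data).hexdigest()


def validate_batch(payload: bytes, expected_md5: str, expected_records: int) -> bool:
    """Check integrity (MD5) and completeness (record count) of a landed file.

    Assumes one record per non-empty line, which is an illustrative convention.
    """
    records = [line for line in payload.splitlines() if line.strip()]
    return md5_of(payload) == expected_md5 and len(records) == expected_records


def land_with_retry(fetch, expected_md5: str, expected_records: int, max_attempts: int = 3):
    """Call fetch() until the batch validates or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        payload = fetch()
        if validate_batch(payload, expected_md5, expected_records):
            return payload, attempt
    raise ValueError("batch failed validation after %d attempts" % max_attempts)


# Simulate a source whose first transfer is truncated and whose second is complete.
good = b"1,alice\n2,bob\n"
transfers = iter([b"1,alice\n", good])
payload, attempts_used = land_with_retry(lambda: next(transfers), md5_of(good), 2)
```

Only a payload that passes both checks is returned for promotion to the Raw Zone; a truncated transfer triggers another fetch instead of polluting downstream data.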

The following figure depicts the high-level functionality of the Transient Zone:

Transient Zone capabilities

The Raw Zone

The Raw Zone is the place where data lands from the Transient Zone. It is typically implemented as file-based storage (for example, Hadoop HDFS). It includes a "Raw Data Storage" area to retain source data for active use and for archival. This is the zone where we have to consider storage options based on the timeliness of the data.

Batch Raw Storage

Batch intake data is commonly pull-based; we can leverage the power of HDFS to store massive amounts of data at a lower cost. The primary reason for the lower cost is that the data is stored on low-cost commodity disks. One of the key advantages of Hadoop is its inherent ability to store data without the need for it to comply with any structure at the time of ingestion; the data can be refined and structured as and when needed. This schema-on-read ability avoids the need for upfront data modeling and costly extract, transform, and load (ETL) processing of data before it is stored in the Raw Zone. Parallel processing is leveraged to rapidly place this data into the Raw Zone. The following are the key functionalities of this zone:

This is the zone where data is deeply validated and watermarked for tracking and lineage lookup purposes.
Metadata about the source data is also captured at this stage. Any relevant security attributes that have a say in the access control of the data are also captured as metadata. This process will ensure that history is rapidly accessible, enabling the tracking of metadata to allow users to easily understand where the data was acquired from and what types of enrichments are applied as information moves through the Data Lake.
The Data Usage rights related to data governance are also captured and applied in this zone.

This zone enables reduced integration timeframes. In the traditional data warehouse model, information is consumed after it has been enriched, aggregated, and formatted to meet specific application needs. You can only consume the canned and aggregated data exhaust of a Data Warehouse. The Data Lake is architected differently to be modular, consisting of several distinct zones. These zones provide multiple consumption opportunities resulting in flexibility for the consumer. Applications needing minimal enrichment can access data from a zone (such as the Raw Zone) found early in the process workflow; bypassing "downstream" zones (such as the Data Hub Zone) reduces the cycle time to delivery. This is time-saving and can be significant to customers and consumers, such as data scientists with the need for fast-paced delivery and minimal enrichment.

The real-time Raw Storage

In many applications, it is mandatory to consume data and react to stimuli in real time. For these applications, the latency of writing the data to disk in a file-based system such as HDFS introduces an unacceptable delay. Examples of these classes of applications, as discussed earlier, include GPS-aware mobile applications and applications that have to respond to events from sensors. An in-memory solution such as GemFire can be used for real-time storage and to respond to events; it responds with low latency and stores the data at rest in HDFS.
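
The general pattern of serving reads from memory while persisting data to slower storage in the background can be sketched as follows. This is not GemFire's API; WriteBehindStore is a hypothetical class, and a plain Python list stands in for durable storage such as HDFS.

```python
class WriteBehindStore:
    """Hypothetical in-memory store that answers reads from memory and
    flushes events to a slower durable sink in batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink              # stand-in for durable storage (a list here)
        self.batch_size = batch_size
        self.latest = {}              # key -> most recent event, served from memory
        self.pending = []             # events not yet persisted

    def put(self, key, event):
        """Record the event in memory and flush once a full batch has accumulated."""
        self.latest[key] = event
        self.pending.append((key, event))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def get(self, key):
        """Return the most recent event for the key without touching the sink."""
        return self.latest.get(key)

    def flush(self):
        """Persist all pending events to the durable sink."""
        self.sink.extend(self.pending)
        self.pending = []


durable = []
store = WriteBehindStore(durable, batch_size=2)
store.put("sensor-1", 21.5)
reading_before_flush = store.get("sensor-1")   # available immediately
persisted_before_flush = len(durable)          # nothing written to the sink yet
store.put("sensor-2", 19.0)                    # second event triggers a flush
```

The trade-off in this sketch is that events still pending in memory would be lost on a crash; production in-memory data grids address this with replication and other durability mechanisms.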

The following figure illustrates the choices we make in the Raw Zone based on the type of data:

Raw Zone capabilities

The Data Management tier

In the preceding section, we discussed the ability of the Data Lake to intake and persist raw data as a precursor to preparing that data for migration to other zones of the Lake. In this section, we will see how that data moves from the Raw Zone to the Data Management tier in preparation for consumption and more sophisticated analytics.

The Management tier has three zones. The data flows sequentially from the Raw Zone into the Integration Zone and then through the Enrichment Zone; finally, after all the processes are complete, the data is stored in a ready-to-use format in the Data Hub, which is a combination of relational and NoSQL databases. The zones in the Management tier are as follows:

The Integration Zone
The Enrichment Zone
The Data Hub Zone

As the data moves into the Management tier, metadata is added and attached to each file. Metadata is a kind of watermark that tracks all changes made to each individual record. This tracking information, together with activity logging and quality monitoring, is stored in metadata that is persisted as the data moves through each of the zones. This information is essential for reporting on the progress of the data through the Lake, and it is used to expedite the investigation of anomalies and the corrections needed to ensure that quality information is delivered to the consuming applications. This metadata also helps in data discovery.
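
One way to picture this watermark is as a metadata record that travels with the file and accumulates one event per processing step. The following Python sketch assumes a hypothetical FileMetadata class, and the source name, path, and action strings are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FileMetadata:
    """Hypothetical metadata record that travels with a file through the zones."""
    source: str
    path: str
    lineage: list = field(default_factory=list)

    def record(self, zone: str, action: str, status: str = "ok") -> None:
        """Append one lineage event describing what happened to the file."""
        self.lineage.append({
            "zone": zone,
            "action": action,
            "status": status,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def zones_visited(self) -> list:
        """Return the zones the file has passed through, in order."""
        return [event["zone"] for event in self.lineage]


meta = FileMetadata(source="crm", path="raw/crm/customers.csv")
meta.record("Integration", "validated")
meta.record("Enrichment", "derived customer_segment")
meta.record("Data Hub", "loaded")
```

Because every event carries a zone, an action, a status, and a timestamp, the same record can serve progress reporting, anomaly investigation, and data discovery.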

The Integration Zone

The Integration Zone's main function is to integrate data from various sources and apply common transformations that turn the raw data into a standardized, cleansed structure that is optimized for data consumers. This zone eventually paves the way for storing the data in the Data Hub Zone. The key functionalities of the Integration Zone are as follows:

Processes for automated data validation
Processes for data quality checks
Processes for integrity checks
Audit logging and reporting for the associated operational management
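
The functionalities listed above can be sketched together in a few lines of Python. The field names, the upper-case country rule, and the duplicate-id integrity rule are assumptions chosen for illustration; a real Integration Zone would apply rules specific to its data.

```python
def validate(record):
    """Automated validation: required fields must be present and non-empty."""
    return all(record.get(f) not in (None, "") for f in ("id", "country"))


def quality_check(record):
    """Data quality: standardize the country code to upper case (illustrative rule)."""
    return {**record, "country": record["country"].strip().upper()}


def integrate(records):
    """Return (clean_records, audit_log) after validation, cleansing, and
    an integrity check that rejects duplicate ids."""
    clean, audit, seen = [], [], set()
    for rec in records:
        if not validate(rec):
            audit.append(("rejected", rec.get("id"), "missing field"))
            continue
        if rec["id"] in seen:
            audit.append(("rejected", rec["id"], "duplicate id"))
            continue
        seen.add(rec["id"])
        clean.append(quality_check(rec))
        audit.append(("accepted", rec["id"], ""))
    return clean, audit


raw = [
    {"id": 1, "country": " us "},
    {"id": 1, "country": "US"},      # duplicate id
    {"id": 2, "country": ""},        # missing country
    {"id": 3, "country": "in"},
]
clean, audit = integrate(raw)
```

Every record, accepted or rejected, leaves an entry in the audit log, which is what makes operational reporting on the zone possible.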

Here is a visual representation of the key functionalities of the Integration Zone:

Integration Zone capabilities

The Enrichment Zone

The Enrichment Zone provides processes for data enhancement, augmentation, classification, and standardization. It includes processes for automated business rules' processing and processes to derive or append new attributes to the existing records from internal and external sources.
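
As a small illustration of deriving and appending attributes, the following Python sketch adds a region looked up from reference data and a tier computed by a business rule. The lookup table, the spend thresholds, and the field names are all hypothetical.

```python
# Hypothetical reference data from an external source.
REGION_BY_COUNTRY = {"US": "Americas", "IN": "APAC", "DE": "EMEA"}


def classify_spend(amount):
    """Illustrative business rule: bucket customers by annual spend."""
    if amount >= 10_000:
        return "gold"
    if amount >= 1_000:
        return "silver"
    return "bronze"


def enrich(record):
    """Return a copy of the record with derived and appended attributes."""
    enriched = dict(record)
    enriched["region"] = REGION_BY_COUNTRY.get(record["country"], "UNKNOWN")
    enriched["tier"] = classify_spend(record["annual_spend"])
    return enriched


customer = {"id": 7, "country": "IN", "annual_spend": 2500}
result = enrich(customer)
```

The function returns a new record instead of modifying its input, so the original data remains available for comparison or reprocessing.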

Integration and enrichment are performed on file-based HDFS storage rather than on a traditional relational data structure because file-based computing is advantageous here: as the usage patterns of the data have not been determined yet, a file system gives us extreme flexibility. HDFS stores files without imposing a schema on them. The absence of a schema and indexes means that you do not need to preprocess the data before you can use it. As a result, the data loads faster and the structure is extensible, allowing it to flex as business needs change.
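
The flexibility of applying structure only when the data is read can be illustrated with a short Python sketch. Here the same raw text is read by two consumers, each with its own schema; the file contents, schema dictionaries, and function name are assumptions made for the example.

```python
import csv
import io

# Raw data is stored exactly as it arrived; no schema was enforced on write.
raw_file = "id,amount,country\n1,19.99,US\n2,5.00,IN\n"

# Each consumer declares the structure it needs at read time.
billing_schema = {"id": int, "amount": float}
geo_schema = {"id": int, "country": str}


def read_with_schema(text, schema):
    """Parse raw CSV text and project/cast only the fields named in the schema."""
    rows = csv.DictReader(io.StringIO(text))
    return [{name: cast(row[name]) for name, cast in schema.items()} for row in rows]


billing_rows = read_with_schema(raw_file, billing_schema)
geo_rows = read_with_schema(raw_file, geo_schema)
```

Neither consumer required the file to be remodeled or reloaded; a third consumer with different needs could be added later by declaring another schema.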

The following figure depicts the key functionalities of the Enrichment Zone:

The Enrichment Zone's capabilities

The Data Hub Zone

The Data Hub Zone is the final storage location for cleaned and processed data. After the data is transformed and enriched in the preceding zones, it is finally pushed into the Data Hub for consumption.

The Data Hub is governed by a discovery process that is internally implemented as search, locate, and retrieve functionality through tools such as Elasticsearch or Solr/Lucene. Discovery is made possible by the extensive metadata that has been collected in the previous zones.

The Data Hub stores relational data in common relational databases such as Oracle and Microsoft SQL Server. It stores non-relational data in NoSQL technologies (for example, HBase, Cassandra, MongoDB, Neo4j, and so on).

The following figure depicts the capabilities of the Data Hub Zone:

Data Hub Zone capabilities

The Data Consumption tier

In the preceding section, we discussed the capability of the zones in the Data Lake to move data from the Raw Zone to the Data Hub Zone. In this section, we will discuss the ways in which data is packaged and provisioned for consumption and for more sophisticated analytics.

The Consumption tier is where the data is accessed either in raw format from the Raw Zone or in structured format from the Data Hub. The data is provisioned through this tier for external access by analytics, visualization, or other applications through web services. The data is discovered through the data catalog published in the Consumption tier, and the actual data access is governed by security controls to limit unwarranted access.

The Data Discovery Zone

The Data Discovery Zone is the primary gateway for external users into the Data Lake. The key to implementing a functional Consumption tier is the amount and quality of the metadata collected in the preceding zones and the intelligent way in which we expose this metadata for search and data retrieval. Too much governance on the metadata might cause relevant search results to be missed, and too little governance could jeopardize the security and integrity of the data.

Data discovery also uses the data event logs, which are a part of the metadata, in order to query the data. All services that act on data in all the zones are logged along with their statuses, so that the consumers of data can understand the complete lineage of how the data was impacted over time. Data event logging combined with metadata enables extensive data discovery and allows users to explore and analyze data. In summary, this zone provides a facility for data consumers to browse, search, and discover the data.

Data discovery provides an interface to search data using the metadata or the data content. This interface provides flexible, self-driven data discovery capabilities that enable the users to efficiently find and analyze relevant information.
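
To make the idea of metadata-driven search concrete, here is a toy Python sketch of an inverted index built over catalog metadata. It is only an illustration of the principle: the catalog entries and function names are hypothetical, and a real deployment would rely on a search engine such as the ones mentioned earlier.

```python
from collections import defaultdict


def build_index(catalog):
    """Map each lowercase token found in a data set's metadata to data set names."""
    index = defaultdict(set)
    for name, meta in catalog.items():
        text = " ".join([name, meta["description"], *meta["tags"]])
        for token in text.lower().split():
            index[token].add(name)
    return index


def search(index, query):
    """Return data sets whose metadata contains every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)


catalog = {
    "sales_2015": {"description": "Daily sales by store", "tags": ["sales", "retail"]},
    "web_clicks": {"description": "Clickstream logs from the retail site", "tags": ["logs"]},
}
index = build_index(catalog)
retail_sets = search(index, "retail")
retail_log_sets = search(index, "retail logs")
```

The quality of the results depends entirely on the descriptions and tags captured in earlier zones, which is why metadata collection matters so much for discovery.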

The Data Provisioning Zone

Data Provisioning allows data consumers to source and consume the data that is available in the Data Lake. This zone is designed to let you use metadata that specifies the "publications" that need to be created, the "subscription"-specific customization requirements, and the end delivery of the requested data to the "data consumer." Data Provisioning can draw on all the data residing in the Data Lake; the data that is provisioned can come from either the Raw Zone or the Data Hub Zone.
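
The relationship between publications, subscriptions, and delivery can be sketched as follows. The dictionary layout, the consumer name, and the provision function are hypothetical and serve only to show how subscription metadata might customize what a consumer receives.

```python
# Hypothetical publication metadata: which zone and rows each publication exposes.
publications = {
    "customers_clean": {"zone": "Data Hub", "rows": [
        {"id": 1, "name": "Alice", "country": "US"},
        {"id": 2, "name": "Bob", "country": "IN"},
    ]},
}

# Hypothetical subscriptions: per-consumer customization of a publication.
subscriptions = {
    "marketing": {
        "publication": "customers_clean",
        "columns": ["id", "country"],
        "filter": lambda row: row["country"] == "IN",
    },
}


def provision(consumer):
    """Deliver the rows a consumer subscribed to, projected to its columns."""
    sub = subscriptions[consumer]
    rows = publications[sub["publication"]]["rows"]
    return [{c: row[c] for c in sub["columns"]} for row in rows if sub["filter"](row)]


delivered = provision("marketing")
```

Keeping the customization in subscription metadata means that a new consumer can be served by adding an entry, without changing the publication itself.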

The following figure depicts the important features of the Consumption tier: