Data Lake for Enterprises


Product type Book
Published in May 2017
Publisher Packt
ISBN-13 9781787281349
Pages 596 pages
Edition 1st Edition
Authors (3):
Vivek Mishra
Tomcy John
Pankaj Misra

Table of Contents (23) Chapters

Title Page
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
Part 1 - Overview
Part 2 - Technical Building blocks of Data Lake
Part 3 - Bringing It All Together
Introduction to Data
Comprehensive Concepts of a Data Lake
Lambda Architecture as a Pattern for Data Lake
Applied Lambda for Data Lake
Data Acquisition of Batch Data using Apache Sqoop
Data Acquisition of Stream Data using Apache Flume
Messaging Layer using Apache Kafka
Data Processing using Apache Flink
Data Store Using Apache Hadoop
Indexed Data Store using Elasticsearch
Data Lake Components Working Together
Data Lake Use Case Suggestions

Chapter 3. Lambda Architecture as a Pattern for Data Lake

In the previous chapter, while going through the concepts of Data Lakes, you were briefly introduced to Lambda Architecture. In this chapter, we will go into more detail on Lambda Architecture and explain the significance of this important architecture pattern in this book's Data Lake implementation.

Although this chapter covers this architecture paradigm in detail, it doesn't give any technology implementation. This is intentional, to make sure that you understand the concepts of the pattern first; once that is achieved, the following chapters will detail this pattern with technology backing.

After going through this chapter, you will understand the Lambda Architecture pattern in detail and see how it forms an integral part of our Data Lake construction.

What is Lambda Architecture?


Lambda Architecture is technology agnostic: it defines a set of practical, well-established principles for handling big data. It is a very generic pattern that tries to cater to common requirements raised by most big data applications. The pattern allows us to deal with both historical data and real-time data alongside each other. Traditionally, we had two different applications, one catering to transactional data (OnLine Transaction Processing, OLTP) and one to analytical data (OnLine Analytical Processing, OLAP), and we couldn't mix them; they lived separately and didn't talk to each other.

These bullet points describe what a Lambda Architecture is:

  • Set of patterns and guidelines: It defines a set of patterns and guidelines for big data applications. More importantly, it allows queries to consider both historical and newly generated data alike and gives analysts the desired view.
  • Deals with both...

History of Lambda Architecture


Nathan Marz coined the term Lambda Architecture (LA) to describe a generic pattern for data processing that is scalable and fault-tolerant. He gathered this expertise working extensively with big-data-related technologies at BackType and Twitter. The pattern is designed to process huge amounts of data by using two of its important components, namely the batch and speed layers. Nathan generalized his findings and experience in the form of this pattern, which should cater to some important architecture principles, such as these:

  • Linear scalable: It should scale out and not up and should cater to different kinds of use cases
  • Fault-tolerant: It should handle a wide range of workloads and shield the system from hardware and software failures and inherent human mistakes
  • Low latency: Reads and updates
  • Extensibility: Manageable, easy to extend, and easy to keep adding new features and data elements

There is a wealth of details documented at http://lambda...

Principles of Lambda Architecture


Nathan Marz, in his Big Data book, gives full details of the Lambda Architecture pattern. The following are the three main principles on which the pattern has been developed. Some of these were briefly covered in the previous section.

  • Fault-tolerant
    • Hardware
    • Software
    • Human
  • Immutable Data
  • Re-computation

Let's detail each of these principles in the next sections.

Fault-tolerant principle

Hardware, software, and human fault tolerance should all be part of this pattern. Because the pattern caters to big data, any of these faults can be very hard to recover from. Given the sheer volume of data, data loss and data corruption have no place in this pattern; when they do occur, they are in most cases irrecoverable, which is why this principle is such a strong requirement.

One of the important parts of this is human fault tolerance. Some of the very common mistakes are typical operational mistakes made in day-to-day operations; the next most...

Components of a Lambda Architecture


We have already talked about the various components of Lambda Architecture in multiple sections of this book, so you should have some idea of them by now. This section and the following sections detail every component of the Lambda Architecture. We will again avoid any dependency on technologies, because we need to understand what lies beneath the technology layer; once we do, we can use any technology available in the market to build this pattern without much trouble. Understanding each layer, its significance, and the main function it has to take care of is essential, as this is the foundation you will rely on in future chapters. In the context of Data Lake, the components of Lambda Architecture form just one of the layers, which is termed the Lambda Layer. We will now go through the various layers in this Lambda Layer in detail. The main layers constituting the Lambda Layer are as follows...

Complete working of a Lambda Architecture


The following figure pictorially shows the complete working of Lambda Architecture:

Figure 10: Complete working of Lambda Architecture

As briefly explained earlier, the master data set is maintained and managed in the batch layer. When new data arrives, it is dispatched to both the batch layer and the speed layer. In the batch layer, batch views are recomputed from scratch at each regular batch interval. Similarly, the speed layer generates the speed view from new data whenever it arrives in the layer. When queried, the serving layer merges the speed and batch views to generate the appropriate query results.

Once the batch view is generated, the speed view is discarded, and until new data arrives only the batch view needs to be queried, as all the data is available in the batch layer itself.
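
The flow just described can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the implementation used later in the book: the class name LambdaSketch, its methods, and the page-count use case are all assumptions made for the example, and it ignores data that arrives while a batch run is in progress.

```python
from collections import Counter


class LambdaSketch:
    """Toy in-memory model of the batch, speed, and serving layers (hypothetical)."""

    def __init__(self):
        self.master_dataset = []     # batch layer: append-only master data set
        self.batch_view = Counter()  # recomputed from scratch on every batch run
        self.speed_view = Counter()  # updated incrementally as data arrives

    def ingest(self, page):
        # New data is dispatched to both the batch layer and the speed layer.
        self.master_dataset.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Recompute the batch view from the whole master data set, then
        # discard the speed view because the batch view now covers it.
        self.batch_view = Counter(self.master_dataset)
        self.speed_view = Counter()

    def query(self, page):
        # Serving layer: merge the batch view and the speed view.
        return self.batch_view[page] + self.speed_view[page]


lam = LambdaSketch()
lam.ingest("home")
lam.ingest("home")
lam.ingest("cart")
before = lam.query("home")  # answered from the speed view only
lam.run_batch()
lam.ingest("home")
after = lam.query("home")   # batch view (2) merged with speed view (1)
print(before, after)        # prints: 2 3
```

Right after run_batch() the speed view is empty, so a query is answered from the batch view alone, which matches the behaviour described above.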

Advantages of Lambda Architecture


There are various advantages because of which we chose Lambda Architecture for constructing a Data Lake for the enterprise. Some of these advantages are as follows:

  • Data is stored in raw format. Because of this, at any time, new algorithms, analytics, or new business use cases can be applied to the Data Lake by simply creating new batch and speed views. This is one of the biggest advantages over traditional data warehouses, in which data is cleansed before it is stored. There, new use cases would require changes to the data schema, which usually costs time and effort.
  • One of its own important principles, namely recomputation, helps correct faults without much trouble. As more and more data comes into the lake, data loss and corruption cannot be afforded. Thanks to recomputation, at any moment we can recompute, roll back, or flush data to correct these errors.
  • Lambda Architecture separates different responsibilities...
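
The first two advantages can be illustrated with a small sketch. It assumes hypothetical sample events and view functions (RAW_EVENTS, buggy_total_per_user, and so on are invented for this example): because the raw events are never modified, a faulty view is corrected simply by recomputing it with fixed logic, and a new use case is just another function over the same raw data.

```python
from decimal import Decimal

# Raw, immutable events exactly as they arrived (hypothetical sample data).
RAW_EVENTS = (
    {"user": "a", "amount": "10.50"},
    {"user": "b", "amount": "4.00"},
    {"user": "a", "amount": "2.25"},
)


def buggy_total_per_user(events):
    # Faulty view logic: silently truncates every amount to a whole number.
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0) + int(float(e["amount"]))
    return totals


def fixed_total_per_user(events):
    # Corrected view logic: keeps the exact decimal amounts.
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], Decimal("0")) + Decimal(e["amount"])
    return totals


def events_per_user(events):
    # A new use case added later: no schema change, just a new view.
    counts = {}
    for e in events:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts


bad_view = buggy_total_per_user(RAW_EVENTS)    # the view computed with the bug
good_view = fixed_total_per_user(RAW_EVENTS)   # recomputed from the same raw data
count_view = events_per_user(RAW_EVENTS)       # new view over the same raw data
print(bad_view, good_view, count_view)
```

Had only the cleansed (truncated) totals been stored, the original amounts would be lost and the error could not be repaired this way.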

Disadvantages of Lambda Architectures


Choosing a Lambda Architecture to develop a Data Lake for your enterprise does incur some inherent disadvantages if some of its aspects are not fully thought through. Some of these are as follows:

  • Due to its different layers, it is generally considered to be complex. Keeping the batch and speed layers in sync incurs cost and effort, and this has to be thought through and handled.
  • Because of these two distinct and fully distributed layers (batch and speed), maintenance and support activities are quite hard.
  • There are a good number of technologies that have to be mastered to construct a Lambda-Architecture-driven Data Lake. Getting people who have expertise in these technologies can be troublesome for your recruitment division.
  • Implementing a Lambda Architecture with open source technologies and then deploying it in the cloud can be troublesome. To avoid this, you could very well use cloud technologies to implement Lambda Architecture, but by doing so, the enterprise...

Technology overview for Lambda Architecture


As explained briefly earlier, Lambda Architecture is a pattern with well-defined guidelines and is technology agnostic. Looking at its various components/layers, any technology can be brought in to do the required job.

With the emergence of various cloud providers, you could even get ready-made components in the cloud (many are cloud dependent) that implement the Lambda Architecture. In this book, we go on to create a Data Lake in which the lambda pattern covers just one layer, called the Lambda Layer.

Since there are so many technology choices, the future chapters are a bit opinionated. When we choose each technology, we will give the rationale for our choice while keeping it as open as possible. We will also list alternative technologies, so that the reader can swap them in if required. Having said that, we will actually implement the Data Lake using the selected technology. The...

Applied lambda


An enterprise-level Data Lake is one application of the Lambda Architecture pattern, and this book covers it in detail. However, there are other use cases where this pattern can be applied, and this section tries to cover these.

Enterprise-level log analysis

One of the very common use cases for this pattern is log ingestion and the analytics that surround it. The ELK (Elasticsearch, Logstash, Kibana) stack is a leading one in this space, but this pattern could very well be used. The logs can vary from conventional application logs to different types of logs produced by various software and hardware components. If we need an enterprise-level log management and analytics capability, this pattern is indeed a good choice. These logs are produced in large quantities and at very high velocity. They are also immutable in nature and do need to be kept in order for the analyst (may be a developer of an application or a security data scientists...

Working examples of Lambda Architecture


Here are some working examples where Lambda Architecture has been used to handle certain use cases:

  • Multiple use cases on Twitter: One use case where a modified lambda is used is sentiment analysis of tweets.
  • Multiple use cases in Groupon.
  • Answers by Crashlytics: Deals with mobile analytics and uses the batch and speed layers of Lambda Architecture effectively to produce meaningful analytics.
  • Stack Overflow: A well-known question-answer forum with a huge user community and plenty of activity. For a logged-in user, recommended questions appear in a separate section, where the Lambda Architecture is used. There are other analytics too, such as voting, which use batch views.
  • Flickr Magic View: Revised Lambda Architecture to create a magic view by combining bulk and real-time compute (courtesy: code.flickr.com).

Kappa architecture


This book is about building Data Lakes using Lambda Architecture as one of the main layers (Lambda Layer). However, we feel that readers should also learn about a minimalist variant of Lambda Architecture that is under active discussion, namely Kappa Architecture. It is more or less similar to lambda, but for the sake of simplicity, the batch layer is removed and only the speed layer is kept. The main idea is to avoid having to compute a batch layer from scratch all the time and to do almost all of the processing in real time, in the speed layer. One of the disadvantages of the Lambda Architecture, as detailed above, is having to code and execute the same logic twice, and this is avoided in the Kappa Architecture.
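
As a rough sketch of the idea, the following Python keeps a single piece of processing logic and no batch layer. The names KappaSketch, count_words, and count_words_lower are hypothetical, and the redeploy step (replaying a retained event log through the new logic) reflects how Kappa is commonly described elsewhere rather than anything stated in this chapter.

```python
class KappaSketch:
    """Toy model with one processing path and no batch layer (hypothetical)."""

    def __init__(self, process):
        self.log = []           # append-only event log, retained for replay
        self.process = process  # the single piece of processing logic
        self.view = {}

    def ingest(self, event):
        # Every event goes through the same streaming logic; nothing is coded twice.
        self.log.append(event)
        self.process(self.view, event)

    def redeploy(self, new_process):
        # Assumption: when the logic changes, the retained log is replayed
        # through the new logic to rebuild the view.
        self.process = new_process
        new_view = {}
        for event in self.log:
            new_process(new_view, event)
        self.view = new_view


def count_words(view, event):
    # Case-sensitive word count.
    for word in event.split():
        view[word] = view.get(word, 0) + 1


def count_words_lower(view, event):
    # Revised logic: case-insensitive word count.
    for word in event.lower().split():
        view[word] = view.get(word, 0) + 1


kappa = KappaSketch(count_words)
kappa.ingest("Data lake")
kappa.ingest("data Lake")
first_view = dict(kappa.view)    # built by the original logic
kappa.redeploy(count_words_lower)
second_view = dict(kappa.view)   # rebuilt by replaying the log
print(first_view, second_view)
```

The trade-off is that all recomputation now depends on the stream processor and on how much of the event log is retained.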

An image speaks more than a thousand words, and the next diagram compares the Kappa and Lambda Architectures side by side. In it, you can clearly see that the only part missing in Kappa is the all-important batch layer:

Figure 11: Kappa (left) and Lambda (right...

Summary


In this chapter, you learned about the Lambda Architecture in detail. In our Data Lake implementation, the Lambda Layer (an implementation of the Lambda Architecture) forms an integral part.

We have taken care to introduce only the theoretical aspects of Lambda Architecture in this chapter and have stayed away from all technologies; we want you to understand that this is a pattern and that any technology can be used to implement the various parts of this architecture without much trouble. In the next chapter, however, we will start introducing technologies and show where each of them can be used.

The first few sections of this chapter introduced you to this pattern and later sections detailed a bit more of each of the components/layers forming a Lambda Architecture. We stated the advantages and disadvantages associated with this pattern and gave a full picture of this pattern in detail before wrapping up.

We hope you now have enough of a background on Data, Data Lake, and Lambda Architecture...


Batch Layer

  • Stored immutable data
  • Constantly growing in size
  • Recomputes views all the time

Speed Layer

  • Constant stream of...