You're reading from Data Lake for Enterprises

Product type Book

Published in May 2017

Publisher Packt

ISBN-13 9781787281349

Pages 596 pages

Edition 1st Edition

Languages

Java

Concepts

Data Processing

Authors (3):

Vivek Mishra

Tomcy John

Pankaj Misra

Table of Contents (23) Chapters

Title Page

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Part 1 - Overview

Part 2 - Technical Building blocks of Data Lake

Part 3 - Bringing It All Together

Introduction to Data

Comprehensive Concepts of a Data Lake

Lambda Architecture as a Pattern for Data Lake

Applied Lambda for Data Lake

Data Acquisition of Batch Data using Apache Sqoop

Data Acquisition of Stream Data using Apache Flume

Messaging Layer using Apache Kafka

Data Processing using Apache Flink

Data Store Using Apache Hadoop

Indexed Data Store using Elasticsearch

Data Lake Components Working Together

Data Lake Use Case Suggestions

Chapter 3. Lambda Architecture as a Pattern for Data Lake

In the previous chapter, while going through the concepts of Data Lakes, you were briefly introduced to Lambda Architecture. In this chapter, we go into more detail on Lambda Architecture and explain the significance of this architecture pattern in this book's Data Lake implementation.

This chapter covers the architecture paradigm in detail but deliberately gives no technology implementation. The aim is for you to understand the concepts of the pattern first; the following chapters then back the pattern with technology.

After going through this chapter, you will understand the Lambda Architecture pattern in detail and see how it forms an integral part of our Data Lake construction.

What is Lambda Architecture?

Lambda Architecture is technology agnostic: it defines practical, well-established principles for handling big data. It is a very generic pattern that tries to cater to the common requirements raised by most big data applications. The pattern allows us to deal with both historical data and real-time data alongside each other. Traditionally, two different applications catered to transactional data (OnLine Transaction Processing, OLTP) and analytical data (OnLine Analytical Processing, OLAP); the two could not be mixed, lived separately, and did not talk to each other.

These bullet points describe what a Lambda Architecture is:

Set of patterns and guidelines: It defines a set of patterns and guidelines for big data applications. More importantly, it allows queries to consider both historical and newly generated data and gives analysts the desired view.
Deals with both...

History of Lambda Architecture

Nathan Marz coined the term Lambda Architecture (LA) to describe a generic pattern for data processing that is scalable and fault-tolerant. He gathered this expertise working extensively with big-data-related technologies at BackType and Twitter. The pattern is designed to process huge amounts of data using two important components, namely the batch layer and the speed layer. Nathan generalized his findings and experience in the form of this pattern, which should cater to some important architecture principles, such as these:

Linearly scalable: It should scale out rather than up and should cater to different kinds of use cases
Fault-tolerant: It should handle a wide range of workloads and shield the system from hardware and software failures and from inherent human mistakes
Low latency: Reads and updates
Extensible: Manageable and easy to extend with new features and data elements

There is a wealth of details documented at http://lambda...

Principles of Lambda Architecture

Nathan Marz, in his Big Data book, gives full details of the Lambda Architecture pattern. The following are the three main principles on which the pattern is built. Some of these were briefly covered in the previous section.

Fault-tolerant
- Hardware
- Software
- Human
Immutable Data
Re-computation

Let's detail each of these principles in the next sections.

Fault-tolerant principle

Hardware, software, and human fault tolerance should all be part of this pattern. The pattern caters to big data, and at that scale any of these faults can be very hard to recover from. Data loss and data corruption therefore have no place in this pattern: given the sheer volume of data, they are in most cases irrecoverable once they occur, which makes this principle a strong requirement.

One of the important parts of this is human fault tolerance. Some of the very common mistakes are typical operational mistakes made in day-to-day operations; the next most...

Components of a Lambda Architecture

We have already talked about the various components of Lambda Architecture in several sections of this book, so you should have some idea of them by now. This section and the following one detail every component of the Lambda Architecture. We deliberately avoid any dependency on technologies here, because we first need to understand what lies beneath them; once we have done that, we can pick any technology available in the market and build this pattern without much trouble. Understanding each layer, its significance, and the main function it has to take care of is essential, as it is the foundation for the future chapters. In the context of Data Lake, the components of Lambda Architecture form just one of the layers, which is termed the Lambda Layer. We will now go through the various layers in this Lambda Layer in detail. The main layers constituting the Lambda Layer are as follows...

Complete working of a Lambda Architecture

The following figure pictorially shows the complete working of Lambda Architecture:

Figure 10: Complete working of Lambda Architecture

As briefly explained earlier, the master data set is maintained and managed in the batch layer. When new data arrives, it is dispatched to both the batch layer and the speed layer. In the batch layer, batch views are regenerated at regular batch intervals and recomputed from scratch each time. Similarly, the speed layer generates the speed view from new data whenever it arrives in the layer. When queried, the serving layer merges the speed and batch views to generate the query results.

Once the batch view is generated, the speed view is discarded; until new data arrives, only the batch view needs to be queried, as all the data is available in the batch layer itself.
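The flow described above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the implementation used later in the book: the class `LambdaSketch`, its method names, and the choice of a per-key event count as the view are all hypothetical, and the batch run is simplified to happen synchronously, so the whole speed view can be discarded as soon as it finishes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, in-memory Lambda flow that counts events per key. All names are hypothetical. */
final class LambdaSketch {
    // Batch layer: append-only master data set holding the raw events.
    private final List<String> masterDataSet = new ArrayList<>();
    // Batch view: replaced wholesale on every batch run.
    private Map<String, Integer> batchView = new HashMap<>();
    // Speed layer: incremental view over events not yet covered by the batch view.
    private final Map<String, Integer> speedView = new HashMap<>();

    /** New data is dispatched to both the batch layer and the speed layer. */
    void ingest(String key) {
        masterDataSet.add(key);
        speedView.merge(key, 1, Integer::sum);
    }

    /** Batch run: recompute the batch view from scratch, then discard the speed view. */
    void runBatch() {
        Map<String, Integer> fresh = new HashMap<>();
        for (String key : masterDataSet) {
            fresh.merge(key, 1, Integer::sum);
        }
        batchView = fresh;
        speedView.clear();
    }

    /** Serving layer: merge the batch and speed views to answer a query. */
    int query(String key) {
        return batchView.getOrDefault(key, 0) + speedView.getOrDefault(key, 0);
    }

    int speedViewSize() {
        return speedView.size();
    }

    public static void main(String[] args) {
        LambdaSketch lambda = new LambdaSketch();
        lambda.ingest("login");
        lambda.ingest("login");
        lambda.ingest("logout");
        System.out.println(lambda.query("login")); // answered from the speed view only
        lambda.runBatch();                         // batch view now covers everything
        lambda.ingest("login");
        System.out.println(lambda.query("login")); // batch view merged with speed view
    }
}
```

In a distributed setting the batch run takes time, so only the part of the speed view that the finished batch view already covers would be dropped; the sketch leaves that bookkeeping out.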

Advantages of Lambda Architecture

There are various advantages because of which we chose Lambda Architecture for constructing a Data Lake for the enterprise. Some of these advantages are as follows:

Data is stored in raw format. Because of this, new algorithms, analytics, or business use cases can be applied to the Data Lake at any time by simply creating new batch and speed views. This is one of the biggest advantages over traditional data warehouses, in which data is cleansed before it is stored. There, new use cases usually require changes to the data schema, which costs time and effort.
One of its core principles, recomputation, makes it easy to correct faults. As more and more data comes into the lake, data loss and corruption cannot be afforded. With recomputation, we can at any moment recompute, roll back, or flush data to correct these errors.
Lambda Architecture separates different responsibilities...
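The first two advantages can be illustrated with a small Java sketch. It assumes a hypothetical raw record type `Purchase` and hypothetical view functions; none of these names come from the book. The point is only that a view is a function of the untouched raw data, so a faulty view can be thrown away and recomputed, and a new use case is just another function over the same data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Views as recomputable functions of immutable raw data. All names are hypothetical. */
final class RecomputeSketch {
    /** Immutable raw fact, stored as it arrived (illustrative schema). */
    static final class Purchase {
        final String user;
        final long cents;

        Purchase(String user, long cents) {
            this.user = user;
            this.cents = cents;
        }
    }

    /** Faulty first version of the view: counts purchases instead of summing amounts. */
    static Map<String, Long> spendPerUserBuggy(List<Purchase> raw) {
        Map<String, Long> view = new HashMap<>();
        for (Purchase p : raw) {
            view.merge(p.user, 1L, Long::sum);
        }
        return view;
    }

    /** Corrected logic: rerunning it over the same raw data replaces the bad view. */
    static Map<String, Long> spendPerUser(List<Purchase> raw) {
        Map<String, Long> view = new HashMap<>();
        for (Purchase p : raw) {
            view.merge(p.user, p.cents, Long::sum);
        }
        return view;
    }

    /** A use case added later: a new view over the same raw data, no schema change. */
    static long largestPurchase(List<Purchase> raw) {
        long max = 0;
        for (Purchase p : raw) {
            max = Math.max(max, p.cents);
        }
        return max;
    }

    static List<Purchase> sampleData() {
        List<Purchase> raw = new ArrayList<>();
        raw.add(new Purchase("ann", 1200));
        raw.add(new Purchase("bob", 500));
        raw.add(new Purchase("ann", 800));
        return Collections.unmodifiableList(raw);
    }

    public static void main(String[] args) {
        List<Purchase> raw = sampleData();
        System.out.println(spendPerUserBuggy(raw)); // wrong view produced by the bug
        System.out.println(spendPerUser(raw));      // recomputed from unchanged raw data
        System.out.println(largestPurchase(raw));   // new view added afterwards
    }
}
```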

Disadvantages of Lambda Architectures

Choosing a Lambda Architecture to develop a Data Lake for your enterprise does incur some inherent disadvantages if some of its aspects are not fully thought through. Some of these are as follows:

Due to its different layers, it is generally considered to be complex. Keeping the batch and speed layers in sync incurs cost and effort, and this has to be thought through and handled.
Because of these two distinct and fully distributed layers (batch and speed), maintenance and support activities are quite hard.
There are a good number of technologies that have to be mastered to construct a Lambda-Architecture-driven Data Lake. Getting people who have expertise in these technologies can be troublesome for your recruitment division.
Implementing a Lambda Architecture with open source technologies and then deploying in the cloud can be troublesome. To avoid this, you could very well use cloud technologies to implement Lambda Architecture, but by doing so, the enterprise...

Technology overview for Lambda Architecture

As explained briefly earlier, Lambda Architecture is a pattern with well-defined guidelines and is technology agnostic. For each of its components and layers, any suitable technology can be brought in to do the required job.

With the emergence of various cloud providers, you can even get ready-made components in the cloud (many are cloud dependent) that implement the Lambda Architecture. In this book, we go on to create a Data Lake in which the lambda pattern covers just one layer, called the Lambda Layer.

Since there are so many technology choices, the future chapters are a bit opinionated. When we choose each technology, we give the rationale for our choice while keeping it as open as possible. We also list alternative technologies so that readers can swap them in if required. Having said that, we will implement the Data Lake using the selected technology. The...

Applied lambda

An enterprise-level Data Lake is one application of the Lambda Architecture pattern, and this book covers it in detail. However, there are other use cases where the pattern can be applied, and this section covers them.

Enterprise-level log analysis

One of the most common use cases for this pattern is log ingestion and the analytics that surround it. The ELK (Elasticsearch, Logstash, Kibana) stack is a leading choice in this space, but this pattern can serve just as well. The logs can vary from conventional application logs to different types of logs produced by various software and hardware components. If we need enterprise-level log management and analytical capability, this pattern is indeed a good choice. These logs are produced in large quantities and at very high velocity. They are also immutable in nature and do need to have an order in place for the analyst (maybe a developer of an application or a security data scientists...

Working examples of Lambda Architecture

Here are some working examples where Lambda Architecture has been used to handle certain use cases:

Multiple use cases on Twitter: One use case where a modified lambda is used is the sentiment analysis of tweets.
Multiple use cases in Groupon.
Answers by Crashlytics: Deals with mobile analytics and uses the Lambda Architecture layers of batch and speed effectively to produce meaningful analytics.
Stack Overflow: A well-known question-answer forum with a huge user community and plenty of activity. For a logged-in user, recommended questions make a new section, where the Lambda Architecture is used. There are other analytics too, such as voting, which uses batch views.
Flickr Magic View: Revised Lambda Architecture to create a magic view by combining bulk and real-time compute (courtesy: code.flickr.com).

Kappa architecture

This book is about building Data Lakes using Lambda Architecture as one of the main layers (the Lambda Layer). However, we feel that readers also need to learn about a minimalist variant of Lambda Architecture that is under active discussion, namely Kappa Architecture. It is more or less similar to lambda, but for the sake of simplicity, the batch layer is removed and only the speed layer is kept. The main idea is to avoid having to compute a batch layer from scratch all the time and to do almost all of the processing in real time, in the speed layer. One of the disadvantages of the Lambda Architecture, as detailed above, is having to code and execute the same logic twice, and this is avoided in the Kappa Architecture.

A picture is worth a thousand words, and the next diagram compares the Kappa and Lambda Architectures side by side. In it, you can clearly see that the only part missing in Kappa is the all-important batch layer:

Figure 11: Kappa (left) and Lambda (right...
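To make the contrast with lambda concrete, here is a minimal Java sketch of the Kappa idea. It rests on an assumption the text above does not spell out: that incoming events are kept in an append-only log that can be replayed from the beginning. Under that assumption there is a single processing code path, and reprocessing means starting a new job at offset 0 instead of running a separate batch layer. The names `KappaSketch`, `StreamJob`, `catchUp`, and `replayWith` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Toy Kappa flow: one replayable log, one processing code path. Names are hypothetical. */
final class KappaSketch {
    // Append-only log standing in for a replayable event stream (an assumption).
    private final List<String> log = new ArrayList<>();

    /** A single streaming job that folds log entries into a count-per-key view. */
    static final class StreamJob {
        private final Function<String, String> keyOf;
        private final Map<String, Integer> view = new HashMap<>();
        private int offset = 0;

        StreamJob(Function<String, String> keyOf) {
            this.keyOf = keyOf;
        }

        /** Process every log entry this job has not seen yet. */
        void catchUp(List<String> log) {
            while (offset < log.size()) {
                view.merge(keyOf.apply(log.get(offset)), 1, Integer::sum);
                offset++;
            }
        }

        int count(String key) {
            return view.getOrDefault(key, 0);
        }
    }

    void append(String event) {
        log.add(event);
    }

    List<String> log() {
        return log;
    }

    /** Reprocessing in Kappa: start a new job at offset 0 and replay the whole log. */
    StreamJob replayWith(Function<String, String> keyOf) {
        StreamJob job = new StreamJob(keyOf);
        job.catchUp(log);
        return job;
    }

    public static void main(String[] args) {
        KappaSketch kappa = new KappaSketch();
        kappa.append("GET /home");
        kappa.append("GET /cart");
        kappa.append("POST /cart");

        // Version 1 of the logic counts identical events.
        StreamJob v1 = kappa.replayWith(event -> event);
        System.out.println(v1.count("GET /home"));

        // Version 2 groups by HTTP method; the same log is replayed through the new logic.
        StreamJob v2 = kappa.replayWith(event -> event.substring(0, event.indexOf(' ')));
        System.out.println(v2.count("GET"));
    }
}
```

The trade-off is that every change to the logic requires replaying the retained log, so this approach depends on how much history the log keeps.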

Summary

In this chapter, you learned about the Lambda Architecture in detail. In our Data Lake implementation, the Lambda Layer (an implementation of the Lambda Architecture) forms an integral part.

We have taken care to introduce only the theoretical aspects of Lambda Architecture in this chapter and stayed away from all technologies; we want you to understand that this is a pattern and that any technology can be used to implement the various parts of this architecture without much trouble. In the next chapter, however, we will start introducing technologies and show where they can be used.

The first few sections of this chapter introduced you to this pattern and later sections detailed a bit more of each of the components/layers forming a Lambda Architecture. We stated the advantages and disadvantages associated with this pattern and gave a full picture of this pattern in detail before wrapping up.

We hope you now have enough of a background on Data, Data Lake, and Lambda Architecture...


Authors (3)

Vivek Mishra

Vivek Mishra is an IT professional with more than nine years of experience in various technologies such as Java, J2EE, Hibernate, SCA4J, Mule, Spring, Cassandra, HBase, MongoDB, Redis, Hive, and Hadoop. He has contributed to open source projects such as Apache Cassandra and is the lead committer for Kundera (a JPA 2.0 compliant Object-Datastore Mapping Library for NoSQL datastores such as Cassandra, HBase, MongoDB, and Redis). In his previous roles, he enjoyed long-lasting partnerships with some of the most recognizable names in the SCM, banking, and finance industries, employing industry-standard full software life cycle methodologies such as Agile and Scrum. He is currently employed with Impetus Infotech Pvt. Ltd. He has undertaken speaking engagements at Cloud Camp and the Nasscom Big Data seminar, and he is an active blogger who can be followed at mevivs.wordpress.com.

Tomcy John

Tomcy John lives in Dubai (United Arab Emirates), hails from Kerala (India), and is an enterprise Java specialist with a degree in Engineering (B Tech) and over 14 years of experience in several industries. He currently works as principal architect at Emirates Group IT, in their core architecture team. Prior to this, he worked with Oracle Corporation and Ernst & Young. His main specialization is building enterprise-grade applications, and he acts as chief mentor and evangelist to facilitate incorporating new technologies as corporate standards in the organization. Outside of his work, Tomcy works very closely with young developers and engineers as a mentor and speaks at various forums as a technical evangelist on topics ranging from web and middleware all the way to various persistence stores.

Pankaj Misra

Pankaj Misra has been a technology evangelist, holding a bachelor's degree in engineering, with over 16 years of experience across multiple business domains and technologies. He has been working with Emirates Group IT since 2015, and has worked with various other organizations in the past. He specializes in architecting and building multi-stack solutions and implementations. He has also been a speaker at technology forums in India and has built products with scale-out architecture that support high-volume, near-real-time data processing and near-real-time analytics.
