

Chapter 11. Data Lake Components Working Together

Pat yourself on the back for getting this far. Fabulous!

By this time, if you have followed along chapter by chapter and done the coding side by side, you will have implemented almost the complete Data Lake without quite realizing it.

In this chapter, we tie up some of the loose ends in the Data Lake implemented so far, and also make some recommendations and raise some considerations that you can keep in mind while implementing a Data Lake for your organization.

We will start off this chapter with the SCV use case, see where we have reached, and then try to close the gaps. We will then cover some aspects of the Data Lake implementation that we didn't discuss in detail in the previous chapters.

We will also offer some advice that you can draw on when going through your own Data Lake implementation.

The approach of this book has been that, while going through the previous part, you would have almost finished the implementation of the Data Lake but not really gotten...

Where we stand with Data Lake


This figure shows where we have reached with our Data Lake after covering Part 2 of this book:

Figure 01: Data Lake implemented so far in this book

  • HDFS: Distributed file storage
  • MapReduce: Batch processing engine
  • YARN: Resource negotiator
  • HBase: Columnar and key-value NoSQL database that runs on HDFS
  • Hive: Query engine that provides SQL-like access to data on HDFS
  • Impala: Fast query engine for analytical queries on HDFS
  • Sqoop: Data acquisition and ingestion for batch data
  • Flume: Data acquisition and ingestion via streamed Flume events
  • Kafka: Highly scalable distributed messaging engine
  • Flink: All-purpose real-time data processing and ingestion, with batch support
  • Spark: All-purpose fast batch processing and ingestion, with support for real-time processing via micro-batches
  • Elasticsearch: Fast distributed indexing engine built on Lucene, also used as a document-based NoSQL data store
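To make this picture a little more concrete, here is a minimal sketch (not the book's full SCV pipeline) of how a single raw event from a source system could enter the messaging layer, from where Flink would pick it up for processing and indexing into Elasticsearch. The broker address, topic name, and payload are hypothetical placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CustomerEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A raw customer event, exactly as it might arrive from a source system.
            producer.send(new ProducerRecord<>(
                    "customer-events",             // hypothetical topic name
                    "customer-42",                 // key, for example a customer ID
                    "{\"id\":42,\"event\":\"address_change\"}"));
        }
    }
}
```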

By this time, data would have flowed from various source systems into your Data Lake, through its various components, and been persisted. You...

Core architecture principles of Data Lake


We did cover some of the core principles while we were actually implementing the Data Lake, but we haven't stated them explicitly, because bringing these points in upfront can be daunting and would not have meant much when you were just stepping into a Data Lake implementation. Now that you have a base Data Lake working, it is a good time to bring these core principles together; we feel they should always be kept in mind when bringing new capabilities and technologies into your Data Lake ecosystem. This is again in no way authoritative; rather, these are just some guiding principles that we have found quite useful.

  • Accept any data in raw format (immutable data) into the Data Lake. All data in an enterprise has value attached to it. Don't try to extract that value in the first go; rather, just ingest the data and derive its value going forward (a minimal sketch follows this list).
  • At the time of data ingestion, don't look for value in the data being ingested.
  • Be ready to...
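As a small illustration of the first principle, here is a minimal sketch of landing a source file untouched in a raw zone on HDFS using the standard Hadoop FileSystem API. The paths and the date-based layout are hypothetical, not a prescription:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawZoneIngest {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Land the file exactly as received; derive value later, never rewrite the original.
        Path source = new Path("file:///data/incoming/customers.csv");               // hypothetical
        Path rawZone = new Path("/datalake/raw/customers/2017-05-01/customers.csv"); // hypothetical
        fs.copyFromLocalFile(source, rawZone);
        fs.close();
    }
}
```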

Challenges faced by enterprise Data Lake


It's good to be aware of the challenges that you could face while building and managing an enterprise Data Lake. Here we are only discussing the various technical challenges; adoption, business buy-in for a Data Lake, support for the initiative from higher management, and so on are not discussed here. Some of these challenges, along with our suggested mitigations, are given here:

Challenge #1: If you are using freely available open source technologies to build your Data Lake, keeping up with the pace at which these technologies evolve can be quite a challenging and daunting task. Mitigation #1: Going with commercial distributions such as Cloudera, Hortonworks, and so on can be an option if the Data Lake is received positively by the business.

Challenge #2: If your Data Lake incorporates a good number of technologies to achieve the desired results, keeping up with the pace of each technology and its dependencies on other technologies in the Data Lake landscape can...

Expectations from Data Lake


A Data Lake does cost money to build and manage, so the expectations that various parties have of it are quite demanding and varied in nature. Let's divide these expectations into two groups based on the parties involved.

Expectations from business users:

  • Analysis always runs on the right data, with good quality attributes.
  • The capability to easily manage data governance.
  • Security measures whereby data visibility can be controlled in a fine-grained fashion, including easy data masking, when needed, by employing appropriate transformations controlled by authorization mechanisms (a minimal sketch follows this list).
  • Self-service capability, requiring minimal technical knowledge, for a broad spectrum of people.
  • Easy representation of data lineage and traceability.
  • Support for metadata management.
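To illustrate the masking expectation above, here is a minimal, purely hypothetical sketch of how a serving component could mask a sensitive field for callers lacking the right authorization. In a real Data Lake, such rules would typically be enforced by the security layer rather than hand-rolled code:

```java
// A minimal, hypothetical sketch of authorization-driven masking at serving time.
public class FieldMasker {

    // Returns the raw value for authorized callers, a masked value otherwise.
    public String serveSsn(String ssn, boolean callerMayViewPii) {
        if (callerMayViewPii) {
            return ssn;
        }
        // Mask all but the last four characters.
        int visible = Math.min(4, ssn.length());
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < ssn.length() - visible; i++) {
            masked.append('*');
        }
        return masked.append(ssn.substring(ssn.length() - visible)).toString();
    }
}
```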

Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility...

Data Lake for other activities


With a Data Lake and its huge and expensive infrastructure (in a production deployment we ideally use high-end machines, not commodity hardware), there are other potential uses to which it could be put. The main challenge with high-end infrastructure is its effective utilization. While high-end infrastructure may be required to solve a problem, it may not be effectively utilized at all times. This is where we need to think of mechanisms that can help us extract the required utilization from the infrastructure.

One of the most practical ways to do this is via multi-tenancy of the Hadoop infrastructure. If we look at Hadoop, or any storage system, there are two fundamental actions performed at the storage layer: one is to read and the other is to write data, for the purposes of data storage and processing.

This can be achieved at a basic level by leveraging the security mechanisms supported by the various components in the entire infrastructure, such that security realms...

Knowing more about data storage


Storage is one of the most critical parts of a Data Lake, and Apache Hadoop (HDFS) is the core of data storage for ours. The following figure sums this aspect up quite clearly, showing the batch and stream data storage components in our Data Lake, along with other technologies within the Hadoop ecosystem that deal with various aspects of storage:

Figure 02: Apache Hadoop (HDFS) as data storage

We will now go through some important concepts in the data storage area. We will also concentrate explicitly on batch and speed data, how these get stored in the Data Lake, and some specific details regarding these data types.

Zones in Data Storage

Even though the data in storage need not follow a certain pattern, for an organization going with a Data Lake it's good to have some clear directions and principles on how data should be put into the Data Lake. These are just some of our recommendations that could be considered or kept...

Knowing more about data processing


Data processing is one of the important capabilities of a Data Lake implementation. Our Data Lake is no exception and participates in data processing in both the batch and speed layers. In this section we will cover some important topics that need to be looked at with respect to data processing in a Data Lake. With Hadoop 1.x, MapReduce was the main processing framework in Hadoop. With Hadoop 2.x, and with more data ingestion methodologies, more options in the real-time/streaming area have also come in; these two aspects, along with some important considerations, are detailed here.

Data validation and cleansing

Validating data before it gets into the persistence layer of the Data Lake is a very important step. Validation in the context of a Data Lake covers two aspects, as follows:

  • Origin of data: Making sure the right data from the right source is ingested into the Data Lake (see the sketch after this list). The source from which the data originates should be known, and the data coming in also should...
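As one small, hypothetical illustration of the origin check, a pre-ingestion validator could refuse records that do not come from a registered source system; the source names here are placeholders:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-ingestion check for the "origin of data" aspect:
// accept a record only if it arrives from a registered source system.
public class OriginValidator {

    private static final Set<String> KNOWN_SOURCES = new HashSet<>(
            Arrays.asList("crm", "billing", "web-clickstream")); // hypothetical registry

    public boolean isValid(String sourceSystem, String payload) {
        return KNOWN_SOURCES.contains(sourceSystem)
                && payload != null
                && !payload.trim().isEmpty();
    }
}
```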

Thoughts on data security


One of the most important capabilities required of a Data Lake implementation in an enterprise is security. In a Data Lake we bring data from around the enterprise into one place. You will have convinced all the departments that agreed to ingest data into the Data Lake that the data in the lake is secure, and that only authenticated and authorized users have access to it. So, this aspect needs some serious thought, so that the data is secure and these departments remain happy with the access rules around their all-important data. In addition to the security setup, proper governance through adequate processes should also be put in place, to make security sturdy yet easy enough for users with access to do their deep analysis work.

By data security, we refer to in-flight transaction data (stream), data at rest (batch), and both authentication and authorization (attributes).

A Data Lake does pose a different risk, as it is entrusted to bring data from various silos into...

Thoughts on data encryption


The data in a Data Lake is highly critical for the organization, and it has to be secured at all times. In addition, to meet various regulatory requirements and security policy standards within an organization, encryption of data is a must, along with authentication and authorization. Encryption should be applied to:

  • Data at rest
  • Data in transit

The following figure shows both data at rest and data in transit, and how encryption helps secure the data:

Figure 15: Data Encryption

Before we enable authentication and authorization, it's important to secure the channel through which the credentials will pass. For this, the channel should be secured, paving the way for data in transit to be transferred in an encrypted fashion. The various technologies in the Hadoop ecosystem communicate with one another using a variety of protocols, such as RPC, TCP/IP, and HTTP(S). The channel-securing methodology differs according to the protocol and has to be dealt with accordingly.
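As one concrete example of securing such a channel, the sketch below shows client-side properties that encrypt Kafka traffic in transit with TLS. The host name, truststore path, and password are hypothetical placeholders and would be externalized in practice:

```java
import java.util.Properties;

// Minimal client-side settings for encrypting Kafka traffic in transit with TLS.
public class SecureChannelConfig {

    public static Properties tlsProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");   // hypothetical host
        props.put("security.protocol", "SSL");                        // encrypt data in transit
        props.put("ssl.truststore.location",
                "/etc/security/kafka.truststore.jks");                 // hypothetical path
        props.put("ssl.truststore.password", "changeit");              // externalize in practice
        return props;
    }
}
```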

Hadoop...

Metadata management and governance


Metadata management and governance are two areas in which many technologies in the big data space need to innovate and evolve a lot. Some technologies do provide limited functionality in these areas, but it isn't sufficient to be called a solution suited to enterprises. Recently, however, some serious work has been undertaken by various players to address these two areas. We will discuss a bit of this in this section. Before going further, let's first understand these two terms in detail, along with some others that matter in this area.

Metadata

Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information.

- National Information Standards Organization (http://www.niso.org/publications/press/UnderstandingMetadata.pdf)

As detailed in Wikipedia (https...

Thoughts on data auditing


From the perspective of a Data Lake, auditing is quite an important feature. The data comes from various sources and various departments, and carries various asset classifications (secret, public, and so on); because of these variations, some data requires special security requirements and handling. Certain data in the lake needs tracking of the changes it undergoes, as well as of who accesses it, for various legal and contractual reasons.

In the source system, data is kept for the time it is really needed to carry out day-to-day activity (the production period). After that, the data is usually categorized as non-production in nature and archived or taken offline. For a Data Lake, there isn't really a concept of archived data, and because of this the data needs access control and auditing (of the changes it undergoes, such as various transformations and so on) at all times. Not all data in the lake might require this, but some data does, and it has to be dealt with.
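To make the idea concrete, here is a minimal, hypothetical sketch of the kind of immutable audit record such tracking implies; in practice, ecosystem tooling would capture and store these events rather than hand-written classes:

```java
import java.time.Instant;

// Hypothetical append-only audit record: one immutable entry per access or
// transformation, capturing who touched which dataset, when, and how.
public final class AuditEvent {

    private final Instant timestamp = Instant.now();
    private final String user;
    private final String dataset;
    private final String action; // for example "READ" or "TRANSFORM"

    public AuditEvent(String user, String dataset, String action) {
        this.user = user;
        this.dataset = dataset;
        this.action = action;
    }

    @Override
    public String toString() {
        return timestamp + " " + user + " " + action + " " + dataset;
    }
}
```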

Doing this is a big ask...

Thoughts on data traceability


Traceability is the ability to verify the history, location, or application of an item by means of documented recorded identification.

- Wikipedia

Data traceability means the path followed by data in moving from one location (the origin) to another (the destination), and the various processes and transformations it undergoes along the way before reaching its intended destination. We have already seen what data lineage is, so what is the difference between lineage and traceability?

Data lineage is often associated with metadata management and governance, and differs from what data traceability means.

Data lineage is more technical in nature and shows each and every important step the data goes through on its way from origin to destination. This is a very important capability/resource for a technical team, but doesn't mean much to non-technical business and other users in the enterprise.

Data traceability adds a non-technical layer on top of this to bring enough details...

Knowing more about Serving Layer


The layer in our Data Lake that interacts with the outside world is the serving layer: the layer where data in the lake is served to a varied set of people according to their requirements. We will briefly discuss some of the important aspects that need to be considered with regard to this layer. This layer employs a number of technologies to help serve data to end users. Most of these technologies fall into the category of persistent stores apt for the data they serve; they can be relational databases, NoSQL databases, document stores, key-value stores, column databases, and so on.

Principles of Serving Layer

We delved a bit deeper into the serving layer in Part 1 of this book. This is just a recap, as these principles drive the choice of the various technologies in this layer.

  • Fast access/high performance: The capability of serving data to end users at a high pace
  • Low latency reads and updates: Reading and updating data with the lowest latency possible, enabling faster results...

Summary


In this chapter, we brought together all the technologies and capabilities that we discussed throughout Part 2 of this book. We tried to explain some important aspects with the whole Data Lake in mind. We introduced you to some more capabilities, such as metadata management, governance, auditing, traceability, and so on, which are very important ones for a typical implementation within an enterprise. We gave our technology opinions for each of these capabilities, but deliberately stopped short of delving deep into them. We intentionally did not go deep into some of the technologies discussed in this chapter, to keep the book concise and to the point on the main technologies/capabilities in a Data Lake.

After reading this chapter, you should now have a full picture of an operational Data Lake. You should also have a brief idea of some of the other capabilities needed for an enterprise Data Lake, which are usually omitted when a Data Lake is first implemented in an enterprise.

These additional capabilities...
