Data Lake for Enterprises

A practical guide to implementing your enterprise data lake using Lambda Architecture as the base

Tomcy John, Pankaj Misra

Book Details

ISBN 13: 9781787281349
Paperback: 596 pages

Book Description

The term "data lake" has recently become prominent in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can then use to redefine or transform the way they operate. Lambda Architecture is also emerging as one of the most prominent patterns in the big data landscape, because it not only helps derive useful information from historical data but also correlates real-time data, enabling businesses to make critical decisions. This book brings these two important concepts, the data lake and Lambda Architecture, together.

This book is divided into three main sections. The first introduces you to the concept of data lakes, explains their importance in enterprises, and gets you up to speed with the Lambda Architecture. The second section delves into the principal components of building a data lake using the Lambda Architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use cases. It also shows you how other peripheral components can be added to the lake to make it more efficient.

By the end of this book, you will be able to choose the right big data technologies and use Lambda Architecture patterns to build your enterprise data lake.

Table of Contents

Chapter 1: Introduction to Data
Exploring data
What is Enterprise Data?
Enterprise Data Management
Big data concepts
Relevance of data
Quality of data
Where does this data live in an enterprise?
Enterprise’s current state
Enterprise digital transformation
Data lake use case enlightenment
Summary
Chapter 2: Comprehensive Concepts of a Data Lake
What is a Data Lake?
How does a Data Lake help enterprises?
How does a Data Lake work?
Differences between Data Lake and Data Warehouse
Approaches to building a Data Lake
Lambda Architecture-driven Data Lake
Summary
Chapter 3: Lambda Architecture as a Pattern for Data Lake
What is Lambda Architecture?
History of Lambda Architecture
Principles of Lambda Architecture
Components of a Lambda Architecture
Complete working of a Lambda Architecture
Advantages of Lambda Architecture
Disadvantages of Lambda Architectures
Technology overview for Lambda Architecture
Applied lambda
Working examples of Lambda Architecture
Kappa architecture
Summary
Chapter 4: Applied Lambda for Data Lake
Knowing Hadoop distributions
Selection factors for a big data stack for enterprises
Batch layer for data processing
Serving layer
Summary
Chapter 5: Data Acquisition of Batch Data using Apache Sqoop
Context in data lake - data acquisition
Why Apache Sqoop
Workings of Sqoop
Sqoop connectors
Sqoop support for HDFS
Sqoop working example
When to use Sqoop
When not to use Sqoop
Real-time Sqooping: a possibility?
Other options
Summary
Chapter 6: Data Acquisition of Stream Data using Apache Flume
Context in Data Lake: data acquisition
Why Flume?
Flume architecture principles
The Flume Architecture
Flume event - Stream Data
Flume agent
Flume source
Flume Channel
Flume sink
Flume configuration
Flume transaction management
Other flume components
Context Routing
Flume working example
When to use Flume
When not to use Flume
Other options
Summary
Chapter 7: Messaging Layer using Apache Kafka
Context in Data Lake - messaging layer
Why Apache Kafka
Kafka architecture
Other Kafka components
Kafka programming interface
Producer and consumer reliability
Kafka security
Kafka as message-oriented middleware
Scale-out architecture with Kafka
Kafka connect
Kafka working example
When to use Kafka
When not to use Kafka
Other options
Summary
Chapter 8: Data Processing using Apache Flink
Context in a Data Lake - Data Ingestion Layer
Why Apache Flink?
Working of Flink
Flink APIs
Flink working example
When to use Flink
When not to use Flink
Other options
Summary
Chapter 9: Data Store Using Apache Hadoop
Context for Data Lake - Data Storage and lambda Batch layer
Why Hadoop?
Working of Hadoop
Hadoop ecosystem
Hadoop distributions
HDFS and formats
Hadoop for near real-time applications
Hadoop deployment modes
Hadoop working examples
When not to use Hadoop
Other Hadoop Processing Options
Summary
Chapter 10: Indexed Data Store using Elasticsearch
Context in Data Lake: data storage and lambda speed layer
What is Elasticsearch?
Why Elasticsearch
Working of Elasticsearch
Elastic Stack
Elastic Cloud
Elasticsearch DSL (Query DSL)
Nodes in Elasticsearch
Elasticsearch and relational database
Elasticsearch ecosystem
Elasticsearch deployment options
Clients for Elasticsearch
Elasticsearch for fast streaming layer
Elasticsearch as a data source
Elasticsearch for content indexing
Elasticsearch and Hadoop
Elasticsearch working example
Indexing Documents
Getting Indexed Document
Searching Documents
Updating Documents
Deleting a document
Elasticsearch in purview of SCV use case
Chapter 11: Data Lake Components Working Together
Where we stand with Data Lake
Core architecture principles of Data Lake
Challenges faced by enterprise Data Lake
Expectations from Data Lake
Data Lake for other activities
Knowing more about data storage
Knowing more about Data processing
Thoughts on data security
Thoughts on data encryption
Metadata management and governance
Thoughts on Data Auditing
Thoughts on data traceability
Knowing more about Serving Layer
Summary
Chapter 12: Data Lake Use Case Suggestions
Establishing cybersecurity practices in an enterprise
Know the customers dealing with your enterprise
Bring efficiency in warehouse management 
Developing a brand and marketing of the enterprise
Achieve a higher degree of personalization with customers
Bringing IoT data analysis at your fingertips
More practical and useful data archival
Complement the existing data warehouse infrastructure
Achieving telecom security and regulatory compliance
Summary

What You Will Learn

  • Build an enterprise-level data lake using the relevant big data technologies
  • Understand the core of the Lambda architecture and how to apply it in an enterprise
  • Learn the technical details around Sqoop and its functionalities
  • Integrate Kafka with Hadoop components to acquire enterprise data
  • Use Flume with streaming technologies for stream-based processing
  • Understand stream-based processing with reference to Apache Spark Streaming
  • Incorporate Hadoop components and know the advantages they provide for enterprise data lakes
  • Build fast, streaming, and high-performance applications using Elasticsearch
  • Make your data ingestion process consistent across various data formats with configurability
  • Process your data to derive intelligence using machine learning algorithms

