You're reading from Data Lake Development with Big Data

Product type: Book
Published: Nov 2015
ISBN-13: 9781785888083
Pages: 164
Edition: 1st Edition
Languages: Java
Concepts: Big Data

Key benefits of Data Lake

Having understood the need for the Data Lake and the business/technology context of its evolution, let us now summarize the important benefits in the following list:

Scale as much as you can: In theory, Hadoop's HDFS-based storage lets you grow a cluster to an arbitrarily large size while keeping the price/performance ratio roughly constant as it scales. This means your data storage can be scaled horizontally to meet any need at a reasonable cost. To gain more space, you add nodes to the cluster and Hadoop makes the new capacity available. Hadoop also lets you run code close to where the data is stored, which speeds up the processing of massive datasets. Using Hadoop for the underlying storage makes the Data Lake more scalable, at a better price point, than Data Warehouses by an order of magnitude, which in turn allows huge amounts of data to be retained.
Plug in disparate data sources: Unlike a data warehouse, which is built to ingest structured data, a Hadoop-powered Data Lake has an inherent ability to ingest massive, multi-structured datasets from disparate sources. This means that the Data Lake can store practically any type of data, such as multimedia, binary, XML, logs, sensor data, social chatter, and so on. This is a major benefit: it removes data silos and enables quick integration of datasets.
Acquire high-velocity data: To stream high-speed data in huge volumes efficiently, the Data Lake uses tools that can acquire and queue it, such as Kafka, Flume, Scribe, and Chukwa. This data could be incessant social chatter, such as Twitter feeds and WhatsApp messages, or sensor data from machine exhaust. The ability to acquire high-velocity data and integrate it with large volumes of historical data gives the Data Lake an edge over Data Warehousing systems, which cannot do either as effectively.
Add a structure: To make sense of the vast amounts of data stored in the Data Lake, we need to create some structure around the data and pipe it into analysis applications. Structure can be applied to unstructured data either during ingestion or after it has been stored in the Data Lake. Examples of structure that can be derived from unstructured data include file metadata, word counts, and part-of-speech tags. The Data Lake is a unique platform in which structure can be applied to varied datasets in the same repository and in richer detail, enabling you to process the combined data in advanced analytic scenarios.
Store in native format: In a Data Warehouse, data is premodeled at ingestion time into cubes, which are storage structures optimized for predetermined analysis routines. The Data Lake eliminates the need for data to be premodeled; it provides iterative and immediate access to the raw data. This speeds up the delivery of analytical insights and offers far more flexibility to ask business questions and seek deeper answers.
Don't worry about schema: Traditional data warehouses do not support the schemaless storage of data. The Data Lake leverages Hadoop's simplicity in storing data without a schema on write and applying a schema only when the data is read (schema-on-read). This helps data consumers perform exploratory analysis, discover new patterns in the data without worrying about its initial structure, and ask far-reaching, complex questions to gain actionable intelligence.
Unleash your favorite SQL: Once the data is ingested, cleansed, and stored in a structured SQL store within the Data Lake, you can reuse existing SQL and PL/SQL scripts, possibly with some adaptation to the target engine. Tools such as HAWQ and Impala give you the flexibility to run massively parallel SQL queries while integrating with advanced algorithm libraries such as MADlib and applications such as SAS. Performing the SQL processing inside the Data Lake shortens the time to results and consumes far fewer resources than performing it outside.
Advanced algorithms: Unlike a data warehouse, the Data Lake excels at combining large quantities of coherent data with deep learning algorithms to recognize items of interest, which in turn power real-time decision analytics.
Administrative resources: The Data Lake scores better than a data warehouse in reducing the administrative resources required for pulling, transforming, aggregating, and analyzing data.
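
The "Add a structure" benefit can be illustrated with a minimal sketch, assuming a small text document that has already landed in the lake as raw bytes. The function name `derive_structure`, the sample file name, and the choice of derived fields are hypothetical; a real pipeline would more likely run this kind of logic as a distributed job over many files.

```python
import re
from collections import Counter


def derive_structure(name: str, raw: bytes) -> dict:
    """Derive simple structured attributes from an unstructured text document."""
    text = raw.decode("utf-8", errors="replace")
    # Lowercase word tokens; apostrophes are kept so "don't" stays one token.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        "name": name,                       # file-level metadata
        "size_bytes": len(raw),             # file-level metadata
        "line_count": len(text.splitlines()),
        "word_count": len(words),
        "top_words": Counter(words).most_common(3),
    }


# Hypothetical sample document standing in for a raw file in the lake.
record = derive_structure("chatter.txt", b"The lake stores raw data.\nThe lake scales.")
```

The raw bytes are left untouched; the derived record sits alongside them and can be piped into analysis applications.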
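
The schemaless-write, schema-based-read idea from the list above can likewise be sketched in a few lines. This is an illustration under stated assumptions, not how Hadoop itself implements it: the sample events, the `PURCHASE_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.90", "channel": "web"}',
    '{"user": "u2", "amount": 5, "device": {"os": "android"}}',
    '{"user": "u3"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
PURCHASE_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project each record onto the schema at read time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA))
total = sum(row["amount"] for row in rows)
```

A different consumer could read the same raw events with a different schema, for example one that keeps `device` and ignores `amount`, without the stored data having to change.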
