Data Lake Development with Big Data

Product type: Book
Published: Nov 2015
ISBN-13: 9781785888083
Pages: 164
Edition: 1st Edition

Chapter 5. Data Governance

The preceding chapter covered the various aspects of Data Consumption in detail, such as Data Discovery and Data Provisioning. It also gave architectural guidance on choosing the Big Data tools and technologies that can be used for Data Discovery and Data Provisioning.

This chapter covers Data Governance in detail. The following topics will be covered:

  • Learn how to deal with the management, usability, security, integrity, and availability of the data in the Data Lake

  • Dive deep into the various Data Governance disciplines, such as metadata management, lineage tracking, and data lifecycle management, that are commonly applied to the data as it flows through each tier of the Data Lake

  • Explore how the current Data Lake could evolve in a futuristic setting

The following figure represents the end-state architecture of the Data Lake as discussed in Chapter 1, The Need for Data Lake. As shown in the following figure, we will discuss the highlighted...

Understanding Data Governance


Let us now look at the definition of Data Governance, why it is critical in an enterprise handling Big Data, and how it differs from traditional approaches.

Introduction to Data Governance

Data Governance is a set of formal processes that ensure that data within the enterprise meets the following expectations:

  • It is acquired from reliable sources

  • It meets predefined quality standards

  • It is fit for use for further processing

  • It conforms to well-defined business rules

  • It is defined and modified by the right person

  • It follows a well-documented change control process

  • It is aligned to the organizational strategy

  • Its trustworthiness remains intact while data flows through various transformation cycles

As the preceding definition shows, Data Governance can be thought of as a discipline that an organization enforces on data as it flows from ingest to exhaust, making sure that it is not tampered with in any way that introduces risk.
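Some of the expectations listed above can be expressed as automated checks that run on each record. The following is a minimal sketch, not a prescribed implementation: the `Record` fields, the trusted-source and authorized-editor sets, and the `amount` business rule are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Record:
    source: str        # system the record was acquired from
    payload: dict      # the actual data fields
    modified_by: str   # role of whoever last changed the record


# Hypothetical policy values; a real enterprise would load these from its
# governance configuration rather than hard-coding them.
TRUSTED_SOURCES = {"crm", "billing"}
AUTHORIZED_EDITORS = {"data_steward"}


def from_reliable_source(record: Record) -> bool:
    return record.source in TRUSTED_SOURCES


def meets_quality_standard(record: Record) -> bool:
    # Illustrative quality rule: the payload is non-empty and has no nulls.
    return bool(record.payload) and all(
        value is not None for value in record.payload.values()
    )


def conforms_to_business_rules(record: Record) -> bool:
    # Illustrative business rule: "amount" must be a non-negative number.
    amount = record.payload.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0


def modified_by_right_person(record: Record) -> bool:
    return record.modified_by in AUTHORIZED_EDITORS


CHECKS: Dict[str, Callable[[Record], bool]] = {
    "reliable_source": from_reliable_source,
    "quality_standard": meets_quality_standard,
    "business_rules": conforms_to_business_rules,
    "right_person": modified_by_right_person,
}


def governance_violations(record: Record) -> List[str]:
    """Return the names of every check the record fails."""
    return [name for name, check in CHECKS.items() if not check(record)]
```

A record that passes every check yields an empty list. Any non-empty result names the expectations that were not met, which could then be logged or used to quarantine the record.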

Data Governance is a key driver that...

Data Governance components


Data Governance comprises three components: metadata management and lineage tracking, data security and privacy, and Information Lifecycle Management. These are common components that cut across the Data Intake, management, and consumption tiers of the Data Lake. The following sections explore these components in detail.

Metadata management and lineage tracking

Big Data often relies on extracting value from huge volumes of unstructured data. The first thing we do after this data enters the Data Lake is classify it and "understand" it by extracting its metadata. Metadata is the fundamental building block on which the success of any Data Governance endeavor depends.

Metadata captures vital information about the data as it enters the Data Lake and indexes this information while it is stored so that users can search for metadata before they access the data and perform any manipulation on it. Metadata capture is fundamental to make data more accessible and extract...
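The capture-then-index flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the metadata fields, the `extract_metadata` helper, and the in-memory `MetadataCatalog` class are hypothetical and stand in for whatever metadata store a real Data Lake would use.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Set


def extract_metadata(name: str, content: bytes, source: str,
                     tags: Iterable[str]) -> dict:
    """Capture basic metadata about a dataset as it enters the lake."""
    return {
        "name": name,
        "source": source,
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),  # content fingerprint
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "tags": sorted({tag.lower() for tag in tags}),
    }


class MetadataCatalog:
    """Hypothetical in-memory catalog with a tag-based inverted index."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}
        self._by_tag: Dict[str, Set[str]] = {}

    def register(self, metadata: dict) -> None:
        name = metadata["name"]
        # Drop index entries from an earlier registration of the same name.
        previous = self._entries.get(name)
        if previous is not None:
            for tag in previous["tags"]:
                self._by_tag[tag].discard(name)
        self._entries[name] = metadata
        for tag in metadata["tags"]:
            self._by_tag.setdefault(tag, set()).add(name)

    def search(self, tag: str) -> List[dict]:
        """Return metadata for every dataset carrying the tag, by name."""
        names = self._by_tag.get(tag.lower(), set())
        return [self._entries[n] for n in sorted(names)]
```

With this sketch, a user searches the catalog by tag and inspects the returned metadata (source, size, checksum, ingestion time) before deciding whether to read the underlying data.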

Architectural guidance


As the previous sections show, there is a plethora of options available for Data Governance; choosing the right tool depends primarily on the use case and the level of governance you are attempting to implement. The market is also flooded with tools, which makes decision making very difficult. The following figure depicts the key aspects that are to be considered while choosing the right tools and technologies for Data Governance:

The key considerations for choosing Data Governance tools

Big Data tools and technologies

This section takes you through an indicative list of Big Data tools and technologies that can be used for your specific use case.

Apache Falcon

Falcon is a framework for data management; it simplifies the creation, deployment, and monitoring of data pipelines. Falcon automates data ingestion and metadata tagging, and provides a foundation for ILM and governance capabilities.

Understanding how Falcon works

Falcon framework relies...

The current and future trends


In this section, let us explore the current state of the Data Lake and how the evolving enterprise landscape could potentially use it to enhance competitiveness.

While writing this book, I observed that the Data Lake was being adopted quickly. Know-how about new features, use cases, and new systems that are integrated with the Data Lake is being pushed into the public domain by a variety of industries and researchers at regular intervals. These developments will have a tremendous impact on the way the architecture of the Data Lake evolves over time.

In the current scheme of things, Data Lake implementations across enterprises are dominated by Hadoop, which is predominantly used as the technology of choice for storing huge volumes of data and running algorithms in batch mode using the MapReduce paradigm. Hadoop has become a go-to tool for integrating and extracting better insights by combining...

Summary


This chapter explained Data Governance in detail, along with the ways to manage data with a focus on its availability, usability, integrity, retention, and security. We started with what Data Governance is and why it is needed, and then saw how Data Governance on the Data Lake is far more efficient than traditional governance. We also took a look at a few practical scenarios to comprehend the real-life use cases of Data Governance.

We took a deep dive into Data Governance and its components, such as data security and privacy, metadata management and lineage tracking, and Information Lifecycle Management, and how they cut across all three tiers of the Data Lake, namely Data Intake, management, and consumption. In the subsequent sections, we took a look at the various Big Data tools and technologies that can be used to perform Data Governance to help you in decision making and to arrive at the set of technologies that can be used for specific use cases by giving an overview...
