
Principles of Data Fabric

By Sonia Mezzetta
About this book
Data can be found everywhere, from cloud environments and relational and non-relational databases to data lakes, data warehouses, and data lakehouses. Data management practices can be standardized across the cloud, on-premises, and edge devices with Data Fabric, a powerful architecture that creates a unified view of data. This book will enable you to design a Data Fabric solution by addressing all the key aspects that need to be considered. The book begins by introducing you to Data Fabric architecture, why you need it, and how it relates to other strategic data management frameworks. You’ll then quickly progress to grasping the principles of DataOps, an operational model for Data Fabric architecture. The next set of chapters will show you how to combine Data Fabric with DataOps and Data Mesh and how to make the most of them working together. After that, you’ll discover how to design Data Integration, Data Governance, and Self-Service analytics architecture. The book ends with technical architecture to implement distributed data management and regulatory compliance, followed by industry best practices and principles. By the end of this book, you will have a clear understanding of what Data Fabric is and what the architecture looks like, along with the level of effort that goes into designing a Data Fabric solution.
Publication date:
April 2023
Publisher
Packt
Pages
188
ISBN
9781804615225

 

Introducing Data Fabric

Data Fabric is a distributed data architecture that connects scattered data across tools and systems with the objective of providing governed access to fit-for-purpose data at speed. Data Fabric focuses on Data Governance, Data Integration, and Self-Service data sharing. It leverages a sophisticated active metadata layer that captures knowledge derived from data and its operations, data relationships, and business context. Data Fabric continuously analyzes data management activities to recommend value-driven improvements. Data Fabric works with both centralized and decentralized data systems and supports diverse operational models. This book focuses on Data Fabric and describes its data management approach, differentiating design, and emphasis on automated Data Governance.

In this chapter, we’ll focus on understanding the definition of Data Fabric and why it’s important, as well as introducing its building blocks. By the end of this chapter, you’ll have an understanding of what a Data Fabric design is and why it’s essential.

In this chapter, we’ll cover the following topics:

  • What is Data Fabric?
  • Why is Data Fabric important?
  • Data Fabric building blocks
  • Operational Data Governance models

Note

The views expressed in the book belong to the author and do not necessarily represent the opinions or views of their employer, IBM.

 

What is Data Fabric?

Data Fabric is a distributed and composable architecture that is metadata- and event-driven. It’s use-case agnostic and excels in managing and governing distributed data. It integrates dispersed data with automation, strong Data Governance, protection, and security. Data Fabric focuses on the Self-Service delivery of governed data.

Data Fabric does not require the migration of data into a centralized data storage layer, nor to a specific data format or database type. It can support a diverse set of data management styles and use cases across industries, such as a 360-degree view of a customer, regulatory compliance, cloud migration, data democratization, and data analytics.

In the next section, we’ll touch on the characteristics of Data Fabric.

What Data Fabric is

Data Fabric is a composable architecture made up of different tools, technologies, and systems. It has an active metadata and event-driven design that automates Data Integration while achieving interoperability. Data Governance, Data Privacy, Data Protection, and Data Security are paramount to its design and to enabling Self-Service data sharing. The following figure summarizes the different characteristics that constitute a Data Fabric design.

Figure 1.1 – Data Fabric characteristics


Data Fabric takes a proactive and intelligent approach to data management. It monitors and evaluates data operations to learn and suggest future improvements, leading to higher productivity and better decision-making. It approaches data management with flexibility, scalability, automation, and governance in mind and supports multiple data management styles. What distinguishes Data Fabric architecture from others is that it embeds Data Governance into the data life cycle as part of its design, leveraging metadata as the foundation. Data Fabric focuses on business controls with an emphasis on robust and efficient data interoperability.

In the next section, we will clarify what is not representative of a Data Fabric design.

What Data Fabric is not

Let’s understand what Data Fabric is not:

  • It is not a single technology, such as data virtualization. While data virtualization is a key Data Integration technology in Data Fabric, the architecture supports several more technologies, such as data replication, ETL/ELT, and streaming.
  • It is not a single tool like a data catalog and it doesn’t have to be a single data storage system like a data warehouse. It represents a diverse set of tools, technologies, and storage systems that work together in a connected ecosystem via a distributed data architecture, with active metadata as the glue.
  • It doesn’t just support centralized data management but also federated and decentralized data management. It excels in connecting distributed data.
  • Data Fabric is not the same as Data Mesh. They are different data architectures that tackle the complexities of distributed data management using different but complementary approaches. We will cover this topic in more depth in Chapter 3, Choosing between Data Fabric and Data Mesh.

The following diagram summarizes what Data Fabric architecture does not constitute:

Figure 1.2 – What Data Fabric is not


We have discussed in detail what defines Data Fabric and what does not. In the next section, we will discuss why Data Fabric is important.

 

Why is Data Fabric important?

Data Fabric enables businesses to leverage the power of connected, trusted, protected, and secure data no matter where it’s geographically located or stored (cloud, multi-cloud, hybrid cloud, on-premises, or the edge). Data Fabric handles the diversity of data, use cases, and technologies to create a holistic end-to-end picture of data with actionable insights. It addresses the shortcomings of previous data management solutions while considering lessons learned and building on industry best practices. Data Fabric’s approach is based on a common denominator, metadata. Metadata is the secret sauce of Data Fabric architecture, along with automation enabled by machine learning and artificial intelligence (AI), deep Data Governance, and knowledge management. All these aspects lead to the efficient and effective management of data to achieve business outcomes, therefore cutting down on operational costs and increasing profit margins through strategic decision-making.

Some of the key benefits of Data Fabric are as follows:

  • It addresses data silos with actionable insights from a connected view of disparate data across environments (cloud, multi-cloud, hybrid cloud, on-premises, or the edge) and geographies
  • Data democratization leads to a shorter time to business value with frictionless Self-Service data access
  • It establishes trusted, secure, and reliable data via automated Data Governance and knowledge management
  • It enables a business user with intuitive discovery, understanding, and access to data while addressing a technical user’s needs by supporting various data processing techniques to manage data. These techniques can run in batch or real time and include ETL/ELT, data virtualization, change data capture, and streaming

Now that we have a view of why Data Fabric is important and how it takes a modern approach to data management, let’s review some of the drawbacks of earlier data management approaches.

Drawbacks of centralized data management

Data is spread everywhere: on-premises, across cloud environments, and on different types of databases, such as SQL, NoSQL, data lakes, data warehouses, and data lakehouses. Many of the challenges associated with this in the past decade, such as data silos, still exist today. The traditional data management approach to analytics is to move data into a centralized data storage system. Moving data into one central system facilitates control and decreases the necessary checkpoints across the large number of different environments and data systems. Thinking about this logically, it makes total sense. If you think about everyday life, we are successful at controlling and containing things if they are in one central place.

As an example, consider the shipment of goods from a warehouse to a store that requires inspection during delivery. Inspecting the shipment of goods in one store will require a smaller number of people and resources as opposed to accomplishing this for 100 stores located across different locations. Seamless management and quality control become a lot harder to achieve across the board. The same applies to data management, and this is what led to the solution of centralized data management.

While centralized data management was the de facto approach for decades and is still used today, it has several shortcomings. Data movement and integration come at a high cost, especially when dealing with on-premises data storage solutions. Centralized data management relies heavily on data duplication to satisfy a diverse set of use cases requiring different contexts. Complex and performance-intensive data pipelines built to enable data movement require intricate maintenance and significant infrastructure investments, especially if automation or governance is nowhere in the picture. In a traditional operating model, IT departments centrally manage technical platforms for business domains. In the past and still today, this model creates bottlenecks in the delivery of and access to data, lengthening the time to value.

Enterprise data warehouses

Enterprise data warehouses are complex systems that require consensus across business domains on common definitions of data. An enterprise data model is tightly coupled to data assets. Any changes to the physical data model without proper dependency management break downstream consumption. There are also challenges in Data Quality, such as data duplication and the lack of business skills to manage data within the technical platform team.

Data lakes

Data lakes came after data warehouses to offer a flexible way of loading data quickly without the restrictions of upfront data modeling. Data lakes can load raw data as is and later worry about its transformation and proper data modeling. Data lakes are typically managed in NoSQL databases or file-based distributed storage such as Hadoop. Data lakes support semi-structured and unstructured data in addition to structured data. Challenges with data lakes come from the very fact that they bypass the need to model data upfront, therefore creating unusable data without any proper business context. Such data lakes have been referred to as data swamps, where the data stored has no business value.

Data lakehouses

Data lakehouses are a newer technology that combines data warehouse and data lake designs. Data lakehouses support structured, semi-structured, and unstructured data and are capable of addressing data science and business intelligence use cases.

Decentralized data management

While there are several great capabilities in centralized data systems, such as data warehouses, data lakes, and data lakehouses, the reality is, we are at a time where all these systems have a role and create the need for decentralized data management. A single centralized data management system is not equipped to handle all possible use cases in an organization and at the same time excel in proper data management. I’m not saying there is no need for a centralized data system, but rather, it can represent a progression. For example, a small company might start with one centralized system that fits their business needs, and as they grow, they evolve into more decentralized data management.

Another example is a business domain within a large company that might own and manage a data lake, or a data lakehouse that needs to co-exist with several other data systems owned by other business domains. This again represents decentralized data management. Cloud technologies have further provoked the proliferation of data. There is a multitude of cloud providers with their own set of capabilities and cost incentives, leading to organizations having multi-cloud and hybrid cloud environments.

We have evolved from a world of centralized data management as the best practice to a world in which decentralized data management is necessary. There is a seat at the table for all types of centralized systems. What’s important is for these systems to have a data architecture that connects data in an intelligent and cohesive manner. This means a data architecture with the right level of control and rigor while balancing quick access to trusted data, which is where Data Fabric architecture plays a major role.

In the next section, let’s briefly discuss considerations in building Data Fabric architecture.

Building Data Fabric architecture

Building Data Fabric architecture is not an easy undertaking. It’s not a matter of building a simple 1-2-3 application or applying specific technologies. It requires collaboration, business alignment, and strategic thinking about the design of the data architecture; the careful evaluation and selection of different tools, data storage systems, and technologies; and thought into when to buy or build. Metadata is the common thread that ties data together in a Data Fabric design. Metadata must be embedded into every aspect of the life cycle of data from start to finish. Data Fabric actively manages metadata, which enables scalability and automation and creates a design that can handle the growing demands of businesses. It offers a future-proof design that can grow to add subsequent tools and technologies.

Now, with this in mind, let’s introduce a bird’s-eye view of a Data Fabric design by discussing its building blocks.

 

Data Fabric building blocks

Data Fabric’s building blocks represent groupings of different components and characteristics. They are high-level blocks that describe a package of capabilities that address specific business needs. The building blocks are Data Governance and its knowledge layer, Data Integration, and Self-Service. Figure 1.3 illustrates the key architecture building blocks in a Data Fabric design.

Figure 1.3 – Data Fabric building blocks


Data Fabric’s building blocks have foundational principles that must be enforced in its design. Let’s introduce what they are in the following subsection.

Data Fabric principles

Data Fabric’s foundational principles ensure that the data architecture is on the right path to deliver high-value, high-quality data management in which data is secure and protected. The following list introduces the principles that need to be incorporated as part of a Data Fabric design. In Chapter 6, Designing a Data Fabric Architecture, we’ll discuss each principle in more depth:

  • Data are assets that can evolve into Data Products (TOGAF and Data Mesh): Represents a transition in which assets have active product management across their life cycle from creation to end of life, have a specific value proposition, and are enabled for high-scale data sharing
  • Data is shared (TOGAF): Empower high-quality data sharing
  • Data is accessible (TOGAF): Ease of access to data
  • Data product owner (TOGAF and Data Mesh): The Data Product owner manages the life cycle of a Data Product, and is accountable for the quality, business value, and success of data
  • Common vocabulary and data definitions (TOGAF): Business language and definitions associated with data
  • Data security (TOGAF): Data needs to have the right level of Data Privacy, Data Protection and Data Security.
  • Interoperable (TOGAF): Defined data standards that achieve data interoperability
  • Always be architecting (Google): Continuously evolve a data architecture to keep up with business and technology changes
  • Design for automation (Google): Automate repeatable tasks to accelerate time to value

These principles have either been referenced directly from, or have inspired the creation of new principles based on, several noteworthy sources: TOGAF (https://pubs.opengroup.org/togaf-standard/adm-techniques/chap02.html#tag_02_06_02), Google (https://cloud.google.com/blog/products/application-development/5-principles-for-cloud-native-architecture-what-it-is-and-how-to-master-it), and Data Mesh, created by Zhamak Dehghani. They capture the essence of what is necessary for a modern Data Fabric architecture. I have slightly modified a couple of the principles to better align with today’s data trends and needs.

Let’s briefly discuss the four Vs in big data management, which are important dimensions that need to be considered in the design of Data Fabric.

The four Vs

In data management, the four Vs – Volume, Variety, Velocity, and Veracity (https://www.forbes.com/sites/forbestechcouncil/2022/08/23/understanding-the-4-vs-of-big-data/?sh=2187093b5f0a) – represent dimensions of data that need to be addressed as part of Data Fabric architecture. Different levels of focus are needed across each building block. Let’s briefly introduce each dimension:

  • Volume: The size of data impacts Data Integration and Self-Service approaches. It requires a special focus on performance and capacity. Social media and IoT data have led to the creation of enormous volumes of data in today’s data era, and the volume keeps growing with no practical upper limit. Classifying data to enable its prioritization is necessary. Not all data requires the same level of Data Governance rigor and focus. For example, operational customer data requires high rigor when compared to an individual’s social media status.
  • Variety: Data has distinct data formats, such as structured, semi-structured, and unstructured data. Data variety dictates technical approaches that can be taken in its integration, governance, and sharing. Typically, structured data is a lot easier to manage compared to unstructured data.
  • Velocity: The speed at which data is collected and processed, such as batch or real time, is a factor in how Data Governance can be applied and in which Data Integration technologies are suitable. For example, real-time data will require a streaming tool. Because real-time data is incomplete at any given moment, Data Governance aspects such as Data Quality and Data Privacy require a different approach than they do in batch processing.
  • Veracity: Data Governance centered around Data Quality and data provenance plays a major role in supporting this dimension. Veracity measures the degree to which data can be trusted and relied upon to make business decisions.
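
As a rough illustration of how these dimensions might feed design decisions, the following Python sketch profiles a dataset along the four Vs and derives a few hints from the statements above. The DatasetProfile fields, the design_hints function, and the size threshold are hypothetical choices made for illustration only; they are not part of any Data Fabric product or standard.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical description of a dataset along the four Vs."""
    name: str
    size_gb: float           # Volume
    data_format: str         # Variety: "structured", "semi-structured", "unstructured"
    real_time: bool          # Velocity: True for real time, False for batch
    business_critical: bool  # Drives how much governance rigor is applied (Veracity)


# Arbitrary threshold chosen for this sketch, not a recommended value.
LARGE_DATASET_GB = 1000.0


def design_hints(profile: DatasetProfile) -> dict:
    """Return illustrative design hints; the rules are assumptions, not a standard."""
    return {
        # Real-time data calls for a streaming tool; otherwise batch is assumed.
        "integration": "streaming" if profile.real_time else "batch",
        # Not all data needs the same level of Data Governance rigor.
        "governance_rigor": "high" if profile.business_critical else "standard",
        # Structured data is typically easier to manage than other formats.
        "needs_extra_parsing": profile.data_format != "structured",
        # Large volumes need special focus on performance and capacity.
        "capacity_review": profile.size_gb >= LARGE_DATASET_GB,
    }


customers = DatasetProfile("operational_customers", 120.0, "structured", False, True)
statuses = DatasetProfile("social_media_statuses", 9000.0, "unstructured", True, False)

customer_hints = design_hints(customers)
status_hints = design_hints(statuses)
print(customer_hints)
print(status_hints)
```

In this sketch, the operational customer data ends up with high governance rigor and batch integration, while the social media data is routed to streaming with standard rigor, mirroring the examples given for Volume and Velocity.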

Now that we have a summary of the four Vs, let’s review the building blocks (Data Governance, Data Integration, and Self-Service) of a Data Fabric design.

Important note

I have intentionally not focused on the tools or dived into the implementation details of Data Fabric architecture. My intention is to first introduce the concepts, groups of capabilities, and objectives of Data Fabric architecture at a bird’s-eye-view level. In later chapters, we’ll dive into the specific capabilities of each building block and present a Data Fabric reference architecture example.

Data Governance

Data Governance aims to define standards and policies that achieve data trust via protected, secure, and high-quality or fit-for-purpose data. Data Fabric enables efficient Data Integration by leveraging automation while making data interoperable. To support a mature Data Integration approach, the right level of Data Governance is required. It’s of no use to have a design approach that beautifully integrates dispersed data only to find out that the cohesive data is of poor quality and massively violates data compliance stipulations. Costs saved in superior Data Integration would be short-lived if organizations are then slapped with millions of dollars in Data Privacy violation fines.

Data Governance isn’t new; it has been around for decades. However, in the past, it was viewed as a burden by technologists and as a major obstacle in the deployment of successful software solutions. The impression has been that a governance authority sets rules and boundaries restricting progress with unnecessary rigor. What is drastically different today is the shift in perception. Data Governance has evolved to make it easy to access high-quality, trusted data via automation and more mature technologies despite the need to enforce security and regulation requirements. Another factor is the recognition of the impact of not having the right level of governance in a world where data managed in the right way can lead to major data monetization.

Data Governance today is recognized as a critical success factor in data initiatives. Data is predicted to grow at an exponentially fast rate. According to IDC, from 2021 to 2026, it will grow at a CAGR of 21.2%, potentially reaching 221,178 exabytes by 2026 (https://www.idc.com/getdoc.jsp?containerId=US49018922). This has pushed Data Governance to be front and center in achieving successful data management. To keep up with all the challenges introduced by big data, Data Governance has gone through a modernization evolution. In the following sections, we will dive into the pillars that represent Data Governance in Data Fabric architecture.
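
To make the growth figure concrete, the short Python sketch below applies the standard compound annual growth rate relationship, end = start × (1 + rate)^years, to the numbers quoted above. At 21.2% per year, volume grows by a factor of roughly 2.6 over five years. The starting volume the script prints is only what that arithmetic implies; it is not a figure taken from the IDC report.

```python
# Figures quoted in the text above.
cagr = 0.212             # 21.2% compound annual growth rate
end_volume_eb = 221_178  # exabytes projected for 2026
years = 2026 - 2021      # five compounding periods

# Compound growth: end = start * (1 + cagr) ** years
growth_factor = (1 + cagr) ** years

# Rearranged to estimate the starting volume implied by the quoted numbers.
implied_start_eb = end_volume_eb / growth_factor

print(f"Growth factor over {years} years: {growth_factor:.2f}x")
print(f"Implied 2021 volume: {implied_start_eb:,.0f} exabytes")
```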

The Data Governance pillars established in the past are very relevant today with a modern twist. All Data Governance pillars are critical. However, Data Privacy, Protection and Security as well as Data Quality are prioritized due to data sovereignty requirements that need to address state-, country-, and union-based laws and regulations (such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA)), as well as the driving need for high-quality data to achieve data monetization. Data Lineage, Master Data Management, and Metadata Management are fundamental pillars that have dependencies on one another. The priority of these pillars will vary by use case and organizational objectives. All pillars serve a vital purpose and provide the necessary guardrails to manage, govern, and protect data.

Let’s have a look at each Data Governance pillar.

Data Privacy, Protection, and Security

Whether data is used knowingly or unknowingly for unethical purposes, it needs to be protected by defining the necessary data policies and enforcement to automatically shield it from unauthorized access. Business controls need to be applied in an automated manner during Data Integration activities such as data ingestion, data movement, and data access. Manual or ad hoc enforcement is not feasible, especially when it comes to a large data volume. Intelligent data policies that mask or encrypt data throughout its life cycle need to be defined. Several considerations need to be taken into account regarding data access, such as the duration of access, data retention policies, security, and data privacy.
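
A minimal Python sketch of what automated policy enforcement could look like follows. It assumes a hypothetical setup in which each column carries a classification label and a policy table maps labels to masking actions; the names POLICY, redact, pseudonymize, and apply_policy are invented for this illustration. The hashing shown is deliberately simple, and a real deployment would more likely rely on a vetted encryption or tokenization service.

```python
import hashlib
from typing import Callable, Dict


def redact(value: str) -> str:
    """Replace every character with an asterisk."""
    return "*" * len(value)


def pseudonymize(value: str) -> str:
    """One-way SHA-256 digest, truncated for readability (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical policy: maps a column classification to the action applied on access.
POLICY: Dict[str, Callable[[str], str]] = {
    "email": pseudonymize,
    "national_id": redact,
}


def apply_policy(record: Dict[str, str], classifications: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with the policy action applied per column."""
    protected = {}
    for column, value in record.items():
        action = POLICY.get(classifications.get(column, ""))
        # Columns without a classified policy pass through unchanged.
        protected[column] = action(value) if action else value
    return protected


record = {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}
classifications = {"email": "email", "ssn": "national_id"}

masked = apply_policy(record, classifications)
print(masked)
```

Because the policy is keyed on classifications rather than on column names, the same rule can be applied wherever data with that classification is ingested, moved, or accessed, which is the kind of automation the paragraph above calls for.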

Data Quality

Data Quality was important 20 years ago and is even more critical today. This pillar establishes trust in using data to make business decisions. Today’s world is measured by the speed of data monetization. How quickly can you access data – whether that’s making predictions that generate sales via a machine learning model or identifying customer purchasing patterns via business intelligence reports? Data trust is a fundamental part of achieving data monetization – data needs to be accurate, timely, complete, unique, and consistent. A focus on Data Quality avoids data becoming stale and untrustworthy. Previous approaches only focused on Data Quality at rest, such as within a particular data repository that is more passive in nature. While this is still required today, it needs to be combined with applying Data Quality to in-flight data. Data Quality needs to keep up with the fluidity of data and take a proactive approach. This means Data Quality needs to be embedded into operational processes where it can be continuously monitored. This is called data observability.
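
As a small illustration of checking quality on in-flight data, the Python sketch below scores one batch of records on two of the dimensions named above, completeness and uniqueness, and flags any score that falls below a threshold. The function names, the sample batch, and the 0.9 threshold are assumptions made for this example rather than an established standard.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows with a non-empty value in the column."""
    if not rows:
        return 1.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows: List[Row], column: str) -> float:
    """Fraction of non-empty values in the column that are distinct."""
    values = [row[column] for row in rows if row.get(column) not in (None, "")]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


# Hypothetical batch observed while the data is moving through a pipeline.
batch: List[Row] = [
    {"customer_id": "c1", "email": "a@example.com"},
    {"customer_id": "c2", "email": None},
    {"customer_id": "c2", "email": "b@example.com"},
    {"customer_id": "c3", "email": "c@example.com"},
]

scores = {
    "email_completeness": completeness(batch, "email"),
    "customer_id_uniqueness": uniqueness(batch, "customer_id"),
}

THRESHOLD = 0.9  # Arbitrary cut-off for this sketch.
alerts = [name for name, score in scores.items() if score < THRESHOLD]
print(scores, alerts)
```

Running checks like these on every batch, rather than once against a repository at rest, is one simple way to picture the continuous monitoring described above.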

Data Lineage

Data Lineage establishes confidence in the use of data. It accomplishes this by providing an audit trail that covers the data source, the target destinations, and the changes made to data along the way, such as transformations. Data Lineage creates an understanding of the evolution of data throughout its life cycle. Data provenance is especially important for demonstrating compliance with regulatory policies. Data Lineage must keep track of the life cycle of data by capturing its history from start to finish across all data sources, processes, and systems with details of each activity at a granular attribute level.
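
The audit trail can be pictured as a list of attribute-level lineage events that is walked backwards from a target to its sources. The Python sketch below assumes a hypothetical LineageEvent record and invented system and attribute names; it is an illustration of the idea, not the data model of any particular lineage tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineageEvent:
    """One hop in the audit trail, recorded at attribute level."""
    source: str
    target: str
    transformation: str


def trace_upstream(target: str, events: List[LineageEvent]) -> List[LineageEvent]:
    """Walk the audit trail backwards from a target attribute to its sources."""
    trail: List[LineageEvent] = []
    frontier = [target]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue  # Guard against cycles in the recorded events.
        seen.add(current)
        for event in events:
            if event.target == current:
                trail.append(event)
                frontier.append(event.source)
    return trail


# Hypothetical events captured during Data Integration activities.
events = [
    LineageEvent("crm.customers.email", "staging.customers.email", "lowercase"),
    LineageEvent("staging.customers.email", "warehouse.dim_customer.email", "deduplicate"),
    LineageEvent("erp.orders.total", "warehouse.fact_orders.total", "currency conversion"),
]

trail = trace_upstream("warehouse.dim_customer.email", events)
for event in trail:
    print(f"{event.target} <- {event.source} ({event.transformation})")
```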

Master Data Management

As far back as I remember, it has always been a challenge to reconcile, integrate, and manage master data. Master Data Management creates a 360-degree trusted view of scattered master data. Master data, such as that of customers, products, and employees, is consolidated with the intention of generating insights leading to business growth. Master Data Management requires a mature level of data modeling that evaluates several data identifiers, business rules, and data designs across different systems. Metaphorically, you can envision a keyring with an enormous number of keys. Each key represents a unique master data identity, such as a customer. The keyring represents the necessary data harmonization that is realized via Master Data Management across all systems that manage that customer. This is a huge undertaking that requires a mature level of Data Governance and includes Data Quality analysis and data cleansing, such as data deduplication, data standards, and business rules validation.
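The keyring metaphor maps naturally onto a lookup from one master identity to the identifiers each system uses for it. Here is a minimal sketch, assuming records are matched on a single normalized attribute. Real Master Data Management relies on much richer matching and survivorship rules, and the names build_keyring, system, and local_id are hypothetical.

```python
from collections import defaultdict

def build_keyring(records, match_key):
    """Group system-specific identifiers under one master identity."""
    keyring = defaultdict(dict)
    for record in records:
        # Normalize the matching attribute so trivial variations still match.
        master_id = record[match_key].strip().lower()
        keyring[master_id][record["system"]] = record["local_id"]
    return dict(keyring)
```

Each entry in the result is one key on the keyring: a single customer together with the identifier every system uses for that customer.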

Metadata Management

The high-quality collection, integration, storage, and standardization of metadata comprise Metadata Management. Metadata, like data, needs to be accurate, complete, and reliable. Data must be mapped with business semantics to make it easy to discover, understand, and use. Metadata needs to be collected by following established interoperability standards across all tools, processes, and system touchpoints to build knowledge surrounding the data. There are four types of metadata that must be actively collected:

  • Technical metadata: Data structure based on where data resides. This includes details such as source, schema, file, table, and attributes.
  • Business metadata: Provides business context to the data. This includes the capturing of business terms, synonyms, descriptions, data domains, and rules surrounding the business language and data use in an organization.
  • Operational metadata: Metadata produced in a transactional setting, such as during runtime execution. This includes data processing-based activities, such as pipeline execution, with details such as start and end time, data scope, owner, job status, logs, and error messages.
  • Social metadata: Social-based activities surrounding data. This metadata type has become more prominent as it focuses on putting yourself in a customer’s shoes and understanding what they care about. What are their interests and social patterns? An example includes statistics that track the number of times data was accessed, by how many users, and for what reason.

Metadata Management is the underpinning of Data Fabric’s knowledge layer. Let’s discuss this layer in the next section.

Knowledge layer

The knowledge layer manages semantics, knowledge, relationships, data, and different types of metadata. It is one of the differentiating qualities of Data Fabric design when compared to other data architecture approaches. Metadata is managed and collected across the entire data ecosystem. A multitude of data relationships is managed across a variety of data and metadata types (technical, business, operational, and social) with the objective of deriving knowledge with an accurate, complete metadata view. The underlying technology is typically a knowledge graph.
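To give a feel for how a knowledge graph relates the four metadata types, the sketch below stores a handful of subject, predicate, object triples in a plain Python list. It is a toy stand-in for a graph database, and every table, job, and team name in it is invented for illustration.

```python
# Triples (subject, predicate, object) as a tiny stand-in for a knowledge graph.
TRIPLES = [
    ("sales.orders", "has_column", "sales.orders.customer_id"),    # technical
    ("sales.orders.customer_id", "means", "Customer Identifier"),  # business
    ("sales.orders", "loaded_by", "nightly_orders_job"),           # operational
    ("sales.orders", "accessed_by", "analyst_team"),               # social
]

def neighbors(node, triples=TRIPLES):
    """Return the (predicate, object) pairs for every edge leaving a node."""
    return [(p, o) for s, p, o in triples if s == node]

def find_subjects(predicate, obj, triples=TRIPLES):
    """Return every subject linked to an object by the given predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]
```

Because all four metadata types share one structure, a single traversal can move from a business term to the column that carries it, then to the job that loads it and the users who read it.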

The right balance of automation and business domain input is necessary in the knowledge layer. At the end of the day, technology can automate repetitive tasks, classify sensitive data, map business terms, and infer relationships, but it cannot generate tribal knowledge. Business context needs to be captured and modeled effectively in the form of ontologies, taxonomies, and other business data models that are then reflected as part of the knowledge layer. Intelligent agents proactively monitor and analyze technical, business, operational, and social metadata to derive insights and take action with the goal of driving operational improvements. This is where active metadata plays a major role. According to Gartner, “Active metadata is the continuous analysis of all available users, data management, systems/infrastructure and data governance experience reports to determine the alignment and exception cases between data as designed versus actual experience” (https://www.gartner.com/document/4004082?ref=XMLDELV).

Let’s take a look at what active metadata is.

Moving from passive to active metadata

To understand active metadata, you must first understand what defines passive metadata and why it’s insufficient on its own to handle today’s big data demands.

Passive metadata

Passive metadata is the basic entry point of Metadata Management. Data catalogs are the de facto tools used to capture metadata about data. Basic metadata collection includes technical and business metadata, such as business descriptions. Passive metadata primarily relies on a human in the loop, such as a data steward, to manually manage and enter metadata via a tool such as a data catalog. While a data catalog is a great tool, it requires the right processes, automation, and surrounding ecosystem to scale and address business needs. Primarily relying on a human in the loop for Metadata Management creates a major dependency on the availability of data stewards in executing metadata curation activities. These efforts are labor intensive and can take months or years to complete. Even if initially successful, metadata will quickly get out of sync and become stale. This will diminish the value and trust in the metadata.

Another point is that metadata is continuously generated by different tools and typically sits in silos without any context or connection to the rest of the data ecosystem, which is another example of passive metadata. Tribal knowledge that comes from a human in the loop will always be critical. However, it needs to be combined with automation along with other advanced technologies, such as machine learning, AI, and graph databases.

Active metadata

Now that you understand passive metadata, let’s dive into active metadata. It’s all about the word active. It’s focused on acting on the findings from passive metadata. Active metadata contains insights about passive metadata, such as Data Quality score, inferred relationships, and usage information. Active metadata processes and analyzes passive metadata (technical, business, operational, and social) to identify trends and patterns. It uses technologies such as machine learning and AI as part of a recommendation engine to suggest improvements and more efficient ways of executing data management. It can also help with discovering new relationships leveraging graph databases. Examples of active metadata in action include automatically disabling downstream consumption when a failed data pipeline leads to incomplete data, and suggesting a high-quality dataset in place of the current one when its Data Quality is low or when the data producer does not meet its Service Level Agreements. Alerts that generate action by both data producer and consumer represent active metadata.
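The examples above can be sketched as a small rule function that turns passive metadata into actions. This is a simplified illustration in which fixed rules stand in for a recommendation engine. The field names (last_pipeline_status, quality_score), the action labels, and the 0.8 threshold are all assumptions.

```python
def recommend_actions(dataset, alternatives, min_quality=0.8):
    """Derive actions from a dataset's operational and quality metadata."""
    actions = []
    if dataset["last_pipeline_status"] == "failed":
        # Incomplete data should not reach downstream consumers.
        actions.append(("disable_consumption", dataset["name"]))
    if dataset["quality_score"] < min_quality:
        # Suggest the best alternative that meets the quality threshold.
        better = [a for a in alternatives if a["quality_score"] >= min_quality]
        if better:
            best = max(better, key=lambda a: a["quality_score"])
            actions.append(("suggest_alternative", best["name"]))
    return actions
```

In practice, the returned actions would be routed as alerts to both the data producer and the data consumer.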

Event-based model

The missing link in the active Metadata Management story is the ability to manage metadata live and have the latest and greatest metadata. To drive automation, an event-based model that manages metadata and triggers instantaneous updates for metadata is necessary. This facilitates metadata collection, integration, and storage being completed in a near real-time manner and distributed and processed by subscribed tools and services. This is required in order to achieve active metadata in a Data Fabric design.
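A minimal way to picture the event-based model is a publish/subscribe bus: tools publish metadata events, and subscribed services update themselves as soon as an event arrives. The sketch below is an in-process, synchronous toy. A real design would typically sit on a message broker, and the MetadataBus class and event names are hypothetical.

```python
from collections import defaultdict

class MetadataBus:
    """Toy publish/subscribe bus for metadata events."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # Register a callable to run whenever event_type is published.
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the event to every subscriber and report how many were notified.
        handlers = self._subscribers[event_type]
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

For example, a catalog service could subscribe to a "schema_changed" event and refresh its entry for a table the moment a pipeline publishes the new column list.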

Let’s review the next Data Fabric building block, Data Integration.

Data Integration

Data interoperability permits data to be shared and used across different data systems. As data is structured in different ways and formats across different systems, it’s necessary to establish Data Integration standards to enable the ease of data exchange and data sharing. A semantic understanding needs to be coupled with Data Integration in order to enable data discovery, understanding, and use. This helps achieve data interoperability. Another aspect of Data Integration in Data Fabric design is leveraging DataOps best practices and principles as part of the development cycle. This will be discussed in more depth in Chapter 4, Introducing DataOps.

Data ingestion

Data ingestion is the process of consuming data from a source and storing it in a target system. Diverse data ingestion approaches and source interfaces need to be supported in a Data Fabric design that aligns with the business strategy. A traditional approach is batch processing, where data is grouped in large volumes and then processed later based on a schedule. Batch processing usually takes place during offline hours and does not impact system performance, although it can be done ad hoc. It also offers the opportunity to correct data integrity issues before they escalate. Another approach is real-time processing, such as in streaming technology. As transactions or events occur, data is immediately processed in response to an event. Real-time data is processed as smaller data chunks containing the latest data. However, real-time processing requires intricate management to correlate, group, and map data with the right level of context, state, and completeness. It also requires applying a different Data Governance approach when compared to batch processing. These are all factors that a Data Fabric design considers.
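The difference between the two approaches can be illustrated with a small Python sketch. The batch function sees the whole set of rows at once, while the streaming function handles one event at a time and has to carry state between events. The function names and the record layout are assumptions for this example.

```python
from collections import defaultdict

def process_batch(rows):
    """Batch: all rows for the window are available, so aggregate in one pass."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

def process_stream(events):
    """Streaming: handle each event as it arrives and keep running state."""
    state = defaultdict(float)
    for event in events:
        state[event["customer"]] += event["amount"]
        # Emit the latest view after every event instead of once per schedule.
        yield event["customer"], state[event["customer"]]
```

Even in this toy version, the streaming path has to manage state across events, which hints at why real-time processing needs more careful handling of context and completeness.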

Data transformation

Data transformation takes a source data model that is different from a target data model and reformats the source data model to fit the target model. Typically, extract, transform, and load (ETL) is the traditional approach to achieve this. This is largely applied in relational database management systems (RDBMSs) that require data to be loaded based on the specifications of a target data model, as opposed to NoSQL systems, where the extract, load, and transform (ELT) approach is executed. In ELT, as in a data lake, there isn’t a prerequisite of modeling data to fit a target model. Due to the inherent nature of NoSQL databases, the structure is more flexible. Another data transformation approach is programmatic, where concepts such as managing data as code apply. When it comes to data transformation, a variety of data formats, such as structured, semi-structured, and unstructured, need to be processed and handled. Data cleansing is an example of when data transformation will be necessary.
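The order of the steps is the essential difference between ETL and ELT, and a short sketch can show it. The example below uses Python's built-in sqlite3 module as a stand-in target for both approaches, purely for convenience. The text associates ELT with NoSQL systems and data lakes, so treat the database choice, table names, and sample rows as assumptions.

```python
import sqlite3

# Sample source rows: (name, signup_date). Values are invented for illustration.
RAW = [("  Ann ", "2024-01-05"), ("BOB", "2024-02-10")]

def etl(conn, rows):
    """ETL: reshape rows in the pipeline, then load them into the modeled table."""
    conn.execute("CREATE TABLE customer_etl (name TEXT, signup_year INTEGER)")
    cleaned = [(name.strip(), int(date[:4])) for name, date in rows]
    conn.executemany("INSERT INTO customer_etl VALUES (?, ?)", cleaned)

def elt(conn, rows):
    """ELT: load rows as-is, then transform inside the target using SQL."""
    conn.execute("CREATE TABLE customer_raw (name TEXT, signup_date TEXT)")
    conn.executemany("INSERT INTO customer_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE VIEW customer_elt AS "
        "SELECT TRIM(name) AS name, "
        "CAST(SUBSTR(signup_date, 1, 4) AS INTEGER) AS signup_year "
        "FROM customer_raw"
    )
```

Both paths end with the same cleaned result. In ETL, the raw form never reaches the target, while in ELT the raw rows remain available for other transformations later.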

What is data as code?

Data as code represents a set of best practices and principles as part of a DataOps Framework. It focuses on managing data similarly to how code is managed, applying concepts such as version control, continuous integration, continuous delivery, and continuous deployment. Policy as code is another variation of Data as code.

Data Integration can be achieved through physical Data Integration, where one or more datasets are moved from a source to a target system using ETL/ELT or programmatic approaches. It can also take place virtually via data virtualization technology that creates a distributed query that unifies data, building a logical view across diverse source systems. This approach does not create a data copy. In a Data Fabric design, all these integration and processing approaches should be supported and integrated to work cohesively with the Data Governance building block, the knowledge layer, and the Self-Service building block.
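To illustrate virtual integration, the sketch below builds a logical view by querying two sources on demand and merging the results, without writing a copy anywhere. It is a simplification of what data virtualization technology does, and the lookup functions and field names are hypothetical.

```python
def virtual_view(crm_lookup, billing_lookup, customer_ids):
    """Yield one unified row per customer by querying each source on demand."""
    for customer_id in customer_ids:
        # Each lookup hits its own source system; nothing is persisted here.
        crm_part = crm_lookup(customer_id) or {}
        billing_part = billing_lookup(customer_id) or {}
        yield {"id": customer_id, **crm_part, **billing_part}
```

Because the view is a generator, rows are assembled only when a consumer asks for them, which mirrors the idea of a distributed query rather than a physical data movement.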

Self-Service

Self-Service data sharing is a key objective in an organization’s digital transformation journey. It’s the reason all the management, rigor, governance, security, and other controls are in place: to share and deliver trusted data that can be used to create profit. Having quick access to the right high-quality data when needed is golden. Data Products are business-ready, reusable data that has been operationalized to achieve high-scale data sharing. The goal of a Data Product is to derive business value from data. In Chapter 3, Choosing between Data Fabric and Data Mesh, and Chapter 7, Designing Data Governance, we touch more on Data Products. Self-service data moves away from a model that relies on a central IT team to deliver data, to one that democratizes data for a diverse set of use cases with high quality and governance. Making data easily accessible to technical users such as data engineers, data scientists, and software developers, or business roles such as business analysts, is the end goal.

A Data Fabric design enables technical and business users to quickly search, find, and access the data they need in a Self-Service manner. On the other hand, it also enables data producers to manage data with the right level of protection and security before they open the doors to the data kingdom. The Self-Service building block works proactively and symmetrically with the Data Integration and Data Governance building blocks. This is where the true power of a Data Fabric design is exhibited.

 

Operational Data Governance models

Data Fabric architecture supports diverse Data Governance models. It has the necessary architecture robustness, components, and capabilities to support three types of models:

  • Centralized Data Governance: A central Data Governance organization that is accountable for decision-making, data policy management, and enforcement across all business units within an organization. This represents a traditional model.
  • Decentralized Data Governance: An organization where a business domain is accountable for and manages the Data Governance program. The business domain is accountable only for their domain’s decision-making and data policy management and enforcement. They operate independently in terms of data management and Data Governance from the rest of the organization.
  • Federated Data Governance: In a federated Data Governance model, a federated group of business domain representatives across an organization and governance roles is accountable for global-level concerns and decision-making on Data Governance matters. The federated governance team is not responsible for enforcement or local policy management. Business domains manage their data, local data policies, and enforcement.
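One way to compare the three models is to ask which policies a single business domain ends up enforcing under each. The sketch below is a simplified reading of the definitions above. The function name, model labels, and policy names are hypothetical.

```python
def applicable_policies(model, global_policies, domain_policies):
    """Return the policies one business domain must apply under each model."""
    if model == "centralized":
        # A central organization defines and enforces every policy.
        return sorted(global_policies)
    if model == "decentralized":
        # The domain runs its own program independently of the organization.
        return sorted(domain_policies)
    if model == "federated":
        # Global decisions come from the federated group; local ones stay local.
        return sorted(set(global_policies) | set(domain_policies))
    raise ValueError(f"unknown governance model: {model}")
```

The sketch only covers which policies apply. Who enforces them still differs by model, as described in the list above.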

Let’s summarize what we have covered in this chapter in the next section.

 

Summary

In this chapter, we defined Data Fabric and its characteristics: highly automated; use case agnostic; Self-Service; strong Data Governance, Privacy, and Protection; support for decentralized, federated, and centralized Data Governance; event and active metadata driven; interoperable; and a composable architecture. We discussed why this architecture is important, including its ability to effectively address data silos and data democratization and to establish trusted, fit-for-purpose data while addressing business and technical needs. We introduced the core building blocks of a Data Fabric design (Data Governance, Data Integration, and Self-Service) and its principles. We reviewed the role and value of Data Fabric’s knowledge layer, which uses active metadata as the glue that intelligently connects data across an ecosystem. We also defined the various Data Governance models that Data Fabric supports.

In the next chapter, we will dive into the business value that Data Fabric architecture offers. We will discuss how the components that make up the backbone of Data Fabric derive business value that leads to profit and cost savings.

About the Author
  • Sonia Mezzetta

    Sonia Mezzetta is a senior certified IBM architect working as a Data Fabric Program Director. She has an eye for detail and enjoys problem solving data pain points. She started her data management career in IBM as a data architect specializing in enterprise architectures. She is an expert in Data Fabric, DataOps, Data Governance, and Data Analytics. With over 20 years of experience, she has designed and architected several Enterprise data solutions. She has authored numerous data management white papers and has a master's and bachelor's degree in Computer Science. Sonia is originally from New York City, and currently resides in the area of Westchester County, New York.
