Through this book, we are embarking on the huge task of implementing a technology masterpiece for your enterprise. On this journey, you will not only have to learn many new tools and technologies but also become familiar with a good amount of jargon and theory. This will help you reach the ultimate goal of creating that masterpiece, namely a Data lake.
This part of the book aims to prepare you for the tough road ahead, so that you are clear about what you want to achieve. The concept of a Data lake has evolved over time in enterprises. It started with the data warehouse, which retained data for the long term and stored it in a form suited to reporting and historical needs. Then came the data mart, which exposed small sets of data with enterprise-relevant attributes. The Data lake evolved from these concepts as a central data repository for an enterprise that can capture data as is, produce processed data, and serve the most relevant enterprise information.
The Data lake, as a topic or technology, is not new, but very few enterprises have implemented a fully functional one. Through this book, we want enterprises to start thinking seriously about investing in a Data lake. Also, with the help of you, the engineers, we want to give the top management in your organization a glimpse of what can be achieved by creating a Data lake, which can then be used to implement a use case more relevant to your own enterprise.
So, fasten your seatbelt, hold on tight, and let's start the journey!
Rest assured that after completing this book, you will be able to help your enterprise (small or big) think about and model its business with a data-centric approach, using a Data lake as its technical nucleus.
The intent of this chapter is to give the reader insight into data, big data, and some of the important details connected with data. The chapter gives some important textbook definitions, which need to be understood in depth so that the reader is convinced of how relevant data is to an enterprise. By the end, the reader should also have grasped the key difference between data and big data. The chapter then delves into the types of data in depth and where we can find them in an enterprise.
The latter part of the chapter describes the current state of enterprises with regard to data management and gives a high-level glimpse of what enterprises are looking to transform themselves into, with data at the core. The whole book is based on a real-life example, and the last section is dedicated to explaining this example in more detail. The example is detailed in such a manner that the reader sees a good number of the concepts implemented through it.
Data refers to a set of values of qualitative or quantitative variables.
Data is measured, collected and reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools. Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.
- Wikipedia
Data can be broadly categorized into three types:
- Structured data
- Unstructured data
- Semi-structured data
Structured data is the data that business applications conventionally capture and store in a relational database (managed by a relational database management system (RDBMS)) or a non-relational database (NoSQL, originally referring to non SQL).
Structured data can again be broadly categorized into two kinds, namely raw and cleansed data. Data that is taken in as it is, without much cleansing or filtering, is called raw data. Data that is taken in with a lot of cleansing and filtering, catering to a particular analysis by business users, is called cleansed data.
All other data, which does not fall into the structured category, can be called unstructured data. Videos, images, and so on are examples of unstructured data.
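The difference between raw and cleansed data can be illustrated with a minimal Python sketch. The records, field names, and cleansing rules here are hypothetical and chosen only for illustration; real cleansing rules depend on the analysis the business users have in mind.

```python
# Hypothetical raw records, kept exactly as they arrived (field names are illustrative).
raw_records = [
    {"customer_id": "C001", "name": "  Alice ", "amount": "120.50"},
    {"customer_id": "C002", "name": "Bob", "amount": ""},             # missing amount
    {"customer_id": "C001", "name": "  Alice ", "amount": "120.50"},  # exact duplicate
    {"customer_id": "C003", "name": "carol", "amount": "75"},
]


def cleanse(records):
    """Return a cleansed copy: trimmed names, numeric amounts, no blanks or duplicates."""
    seen = set()
    cleansed = []
    for record in records:
        if not record["amount"].strip():
            continue  # filter out rows that the analysis cannot use
        row = (
            record["customer_id"],
            record["name"].strip().title(),  # normalize whitespace and capitalization
            float(record["amount"]),         # convert text to a number
        )
        if row in seen:
            continue  # drop exact duplicates
        seen.add(row)
        cleansed.append({"customer_id": row[0], "name": row[1], "amount": row[2]})
    return cleansed


# The raw list is left untouched; the cleansed list is a separate, derived copy.
cleansed_records = cleanse(raw_records)
```

Note that the raw records are not modified, so both forms of the data remain available.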
There is a third category called semi-structured data, which has come into existence because of the Internet and is becoming more and more predominant with the evolution of social sites. The Wikipedia definition of semi-structured data is as follows:
Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.
Well-known examples of semi-structured data formats are JavaScript Object Notation (JSON) and Extensible Markup Language (XML).
The following figure (Figure 01) summarizes the different types of data discussed so far. Don't be confused by the spreadsheets and text files in the structured section: the data they hold in this figure is in the form of records, which qualifies it as structured data:
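To make the self-describing nature of these formats concrete, here is a small Python sketch that reads the same hypothetical order from JSON and from XML using only the standard library. The order document, its field names, and its values are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical order expressed in two semi-structured formats.
json_text = (
    '{"order": {"id": 42, "items": '
    '[{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}}'
)
xml_text = """
<order id="42">
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="1"/>
</order>
""".strip()

# JSON: keys act as the markers that separate and name the fields.
order_from_json = json.loads(json_text)["order"]
json_skus = [item["sku"] for item in order_from_json["items"]]

# XML: tags and attributes play the same role and also express the hierarchy.
order_from_xml = ET.fromstring(xml_text)
xml_skus = [item.get("sku") for item in order_from_xml.findall("item")]
```

In both cases the structure travels with the data itself, so no table schema has to be defined up front. Note that JSON keeps the order `id` as a number, whereas XML attribute values are always read as text.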

Figure 01: Types of Data
Enterprise data refers to data shared by employees and their partners in an organization, across various departments and different locations, spread across different continents. This is data that is valuable to the enterprise, such as financial data, business data, employee personal data, and so on, and the enterprise spends considerable time and money to keep this data secure and clean in all aspects.
Over time, this so-called enterprise data passes out of its current state and becomes stale, or rather dead, and ends up in some form of storage from which it is hard to retrieve and analyze. The significance of this data, and the need for a single place to analyze it in order to discover future business opportunities, is what leads to the implementation of a Data lake.
Enterprise data falls into three major high-level categories, as detailed next:
- Master data refers to the data that details the main entities within an enterprise. Looking at the master data, one can, in fact, work out the business that the enterprise is involved in. This data is usually managed and owned by different departments. The other categories of data, described next, need the master data to be meaningful.
- Transaction data refers to the data that various applications (internal and external) produce while carrying out business processes within an enterprise. This also includes people-related data, which, in a way, is not categorized as business data but is significant. When analyzed, this data can reveal many optimizations for the business to employ. This data also depends on, and often refers to, the master data.
- Analytic data refers to data that is derived from the preceding two kinds of enterprise data. This data gives insight into various entities (master data) in the enterprise and can also be combined with transaction data to make recommendations, which the enterprise can implement after performing the necessary due diligence.
These types of enterprise data are so significant that most enterprises have a process for managing them, commonly known as enterprise data management. This aspect is explained in more detail in the following section.
The following diagram shows the various enterprise data types available and how they interact with each other:

Figure 02: Different types of Enterprise Data
The preceding figure shows that master data is used by both transaction and analytic data. Analytic data also depends on transaction data to derive the meaningful insights needed by the users who work with this data for various clients.
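The dependency between the three categories can be sketched in a few lines of Python. The product catalog, the orders, and the function name `revenue_by_product` are all hypothetical and serve only to illustrate how analytic data is derived from master and transaction data.

```python
from collections import defaultdict

# Master data (hypothetical): the main entities, here a small product catalog.
master_products = {
    "P1": {"name": "Laptop", "unit_price": 1000.0},
    "P2": {"name": "Mouse", "unit_price": 20.0},
}

# Transaction data (hypothetical): produced by business processes,
# and referring to the master data through product_id.
transactions = [
    {"order_id": 1, "product_id": "P1", "quantity": 1},
    {"order_id": 2, "product_id": "P2", "quantity": 3},
    {"order_id": 3, "product_id": "P1", "quantity": 2},
]


def revenue_by_product(master, txns):
    """Derive analytic data: total revenue per product name."""
    totals = defaultdict(float)
    for txn in txns:
        product = master[txn["product_id"]]  # a transaction only has meaning via master data
        totals[product["name"]] += product["unit_price"] * txn["quantity"]
    return dict(totals)


# Analytic data: derived from the two preceding kinds of data.
analytic_revenue = revenue_by_product(master_products, transactions)
```

Without the master data, a transaction would only carry an opaque `product_id`; the lookup is what turns it into something a business user can interpret.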