Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Data Lake for Enterprises
Data Lake for Enterprises

Data Lake for Enterprises: Lambda Architecture for building enterprise data systems

By Vivek Mishra , Tomcy John , Pankaj Misra
$36.99
Book May 2017 596 pages 1st Edition
eBook
$29.99 $20.98
Print
$36.99
Subscription
$15.99 Monthly
eBook
$29.99 $20.98
Print
$36.99
Subscription
$15.99 Monthly

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Black & white paperback book shipped to your address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : May 31, 2017
Length 596 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781787281349
Vendor :
Apache
Category :
Table of content icon View table of contents Preview book icon Preview Book

Data Lake for Enterprises

Chapter 1. Introduction to Data

Through this book, we are embarking on a huge task of implementing a technology masterpiece for your enterprise. In this journey, you will not only have to learn many new tools and technologies but also have to know a good amount of jargon and theoretical stuff. This will surely help you in your journey to reach the ultimate goal of creating the masterpiece, namely Data lake.

This part of the book aims at preparing you for a tough road ahead so that you are quite clear in the head as to what you want to achieve. The concept of a Data lake has evolved over time in enterprises, starting with concepts of data warehouse which contained data for long term retention and stored differently for reporting and historic needs. Then the concept of data mart came into existence which would expose small sets of data with enterprise relevant attributes. Data lake evolved with these concepts as a central data repository for an enterprise that could capture data as is, produce processed data, and serve the most relevant enterprise information.

The topic or technology of Data lake is not new, but very few enterprises have implemented a fully functional Data lake in their organization. Through this book, we want enterprises to start thinking seriously on investing in a Data lake. Also, with the help of you engineers, we want to give the top management in your organization a glimpse of what can be achieved by creating a Data lake which can then be used to implement a use case more relevant to your own enterprise.

So, fasten your seatbelt, hold on tight, and let's start the journey!

Rest assured that after completing this book, you will help your enterprise (small or big) to think and model their business in a data-centric approach, using Data lake as its technical nucleus.

The intent of this chapter is to give the reader insight into data, big data, and some of the important details in connection with data. The chapter gives some important textbook-based definitions, which need to be understood in depth so that the reader is convinced about how data is relevant to an enterprise. The reader would also have grasped the main crux of the difference between data and big data. The chapter soon delves into the types of data in depth and where we can find in an enterprise.

The latter part of the chapter tries to enlighten the user with the current state of enterprises with regard to data management and also tries to give a high-level glimpse on what enterprises are looking to transform themselves into, with data at the core. The whole book is based on a real-life example, and the last section is dedicated to explaining this example in more detail. The example is detailed in such a manner that the reader would get a good amount of concepts implemented in the form of this example.

Exploring data


Data refers to a set of values of qualitative or quantitative variables.

Data is measured, collected and reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools. Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.

- Wikipedia

Data can be broadly categorized into three types:

  • Structured data
  • Unstructured data
  • Semi-structured data

Structured data is data that we conventionally capture in a business application in the form of data residing in a relational database (relational database management system (RDBMS)) or non-relational database (NoSQL - originally referred to as non SQL).

Structured data can again be broadly categorized into two, namely raw and cleansed data. Data that is taken in as it is, without much cleansing or filtering, is called raw data. Data that is taken in with a lot of cleansing and filtering, catering to a particular analysis by business users, is called cleansed data.

All the other data, which doesn’t fall in the category of structured, can be called unstructured data. Data collected in the form of videos, images, and so on are examples of unstructured data.

There is a third category called semi-structured data, which has come into existence because of the Internet and is becoming more and more predominant with the evolution of social sites. The Wikipedia definition of semi-structured data is as follows:

Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.

Some of the examples of semi-structured data are the well-known data formats, namely JavaScript Object Notation (JSON) and Extensible Markup Language (XML).

The following figure (Figure 01) covers whatever we discussed on different types of data, in a pictorial fashion. Please don't get confused by seeing spreadsheets and text files in the structured section. This is because the data presented in the following figure is in the form of a record, which, indeed, qualifies it to be structured data:

Figure 01: Types of Data

What is Enterprise Data?


Enterprise data refers to data shared by employees and their partners in an organization, across various departments and different locations, spread across different continents. This is data that is valuable to the enterprise, such as financial data, business data, employee personal data, and so on, and the enterprise spends considerable time and money to keep this data secure and clean in all aspects.

During all this, this so-called enterprise data passes the current state and becomes stale, or rather dead, and lives in some form of storage, which is hard to analyze and retrieve. This is where the significance of this data and having a single place to analyze it in order to discover various future business opportunities leads to the implementation of a Data lake.

Enterprise data falls into three major high-level categories, as detailed next:

  • Master data refers to the data that details the main entities within an enterprise. Looking at the master data, one can, in fact, find the business that the enterprise is involved in. This data is usually managed and owned by different departments. The other categories of data, as follows, need the master data to make meaningful values of them.
  • Transaction data refers to the data that various applications (internal and external) produce while transacting various business processes within an enterprise. This also includes people-related data, which, in a way, doesn’t categorize itself as business data but is significant. This data, when analyzed, can give businesses many optimization techniques to be employed. This data also depends and often refers to the master data.
  • Analytic data refers to data that is actually derived from the preceding two kinds of enterprise data. This data gives enough insight into various entities (master data) in the enterprise and can also combine with transaction data to make positive recommendations, which can be implemented by the enterprise, after performing the necessary due diligence.

The previously explained different types of enterprise data are very significant to the enterprise, because of which most enterprises have a process for the management of these types of data, commonly known as enterprise data management. This aspect is explained in more detail in the following section.

The following diagram shows the various enterprise data types available and how they interact with each other:

Figure 02: Different types of Enterprise Data

The preceding figure shows that master data is being utilized by both transaction and analytic data. Analytic data also depends on transaction data for deriving meaningful insights as needed by users who use these data for various clients.

Enterprise Data Management


Ability of an organization to precisely define, easily integrate and effectively retrieve data for both internal applications and external communication

- Wikipedia

EDM emphasizes data precision, granularity and meaning and is concerned with how the content is integrated into business applications as well as how it is passed along from one business process to another.

- Wikipedia

As the preceding wikipedia definition clearly states, EDM is the process or strategy of determining how this enterprise data needs to be stored, where it has to be stored, and what technologies it has to use to store and retrieve this data in an enterprise. Being very valuable, this data has to be secured using the right controls and needs to be managed and owned in a defined fashion. It also defines how the data can be taken out to communicate with both internal and external applications alike. Furthermore, the policies and processes around the data exchange have to be well defined.

Looking at the previous paragraph, it seems that it is very easy to have EDM in place for an enterprise, but in reality, it is very difficult. In an enterprise, there are multiple departments, and each department churns out data; based on the significance of these departments, the data churned would also be very relevant to the organization as a whole. Because of the distinction and data relevance, the owner of each data in EDM has different interests, causing conflicts and thus creating problems in the enterprise. This calls for various policies and procedures along with ownership of each data in EDM.

In the context of this book, learning about enterprise data, enterprise data management, and issues around maintaining an EDM are quite significant. This is the reason why it's good to know these aspects at the start of the book itself. In the following sections we will discuss big data concepts and ways in which big data can be incorporated into enterprise data management and extend its capabilities with opportunities that could not be imagined without big data technologies.

Big data concepts


Let me start this section by giving the Wikipedia definition for Big Data:

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

- Wikipedia

Let's try explaining, the two sentences that are given in the preceding Wikipedia definition. Earlier, big data referred to any data that is large and complex in nature. There isn't any specified size of data for it to be called big data. This data was considered so big that conventional data processing applications found it difficult to use it in a meaningful fashion. In the last decade or so, many technologies have evolved in this space in order to analyze such big data in the enterprise. Nowadays, the term big data is used to refer to any sort of analysis method that can comprehend and extract this complex data and make valuable use of it in the enterprise.

Big data and 4Vs

Whenever you encountered the term big data being overly used, you must have come across an important aspect with regard to it, called 4Vs (until recently, it was 3Vs, and then the fourth, very significant, V got introduced). The 4Vs, namely variety, velocity, volume, and veracity (in no particular order) determine whether the data we call Big Data really qualifies to be called big:

  • Variety: In the context of big data, variety has a very important place. Variety refers to vivid types of data and the myriad sources from which these are arrived at. With the proliferation of technologies and the ever-growing number of applications (enterprise and different personal ones), there is high emphasis on data variety. This is not going to come down any time soon; rather, this is set to increase over a period of time, for sure. Broadly, data types can be categorized into structured and unstructured. Applications during this time deal mainly with structured data stored mostly in a relational database management system (RDBMS). This is very common, but nowadays, there has been the need to look at more unstructured data, and some of the examples can be video content, image content, file content in the form of binaries, and so on.
  • Velocity: In the context of big data, velocity is referred to in two aspects. First is the rate at which the data is generated, and second is the capability by which the enormous amount of data can be analyzed in real time to derive some meaningful conclusions. As the proverb goes, Time is money, this V is a very important aspect, which makes it easy to take quick decisions in real time. This aspect is one of the strongholds of some of the businesses, especially retail. Giving the customer a personalized and timely offer can be the deciding factor of the customer buying a product from you or ditching you to select a more favorable one.
  • Volume: In the context of big data, volume refers to the amount/scale of data that needs to be analyzed for a meaningful result to be derived. There isn't a quantitative figure that categorizes a data to be falling into big data. But usually, this volume is definitely more than what a conventional application is handling as of now. So, in general, this is quite big and does pose a problem for a traditional application to deal with in a day-to-day fashion (OLTP - OnLine Transaction Processing). For many businesses, analyzing and making use of social data has become a necessity. These social apps (Facebook, Google+, LinkedIn, and so on) have billions of registered users producing billions of data (structured and unstructured) in a day-to-day fashion. In addition to this, there are applications that themselves produce a huge amount of data in the form of conventional transactions and other analytics (behavioral, location-based, and so on). Also, with the growing number of wearables and sensors that emit data every millisecond, the volume aspect is going to be very important, and this is not going to come down any time soon.

As detailed in the previous section, until recently, there used to be 3Vs. But quite recently, the fourth V was introduced by IBM, namely veracity. For data growing at an exponential rate and as deduced from different reliable and unreliable sources, the significance of this V is huge.

You must have already heard/read of fake news/material being circulated in various social media when there is something important happening in the world. This V brings this a very important aspect of accuracy in big data. With proliferation of data, especially in social channels, this V is going to be very important, and rather than 3Vs, it is leaning highly towards 4Vs of Big Data.

  • Veracity: In the context of big data, veracity refers to accuracy of data being analyzed to get to a meaningful result. With a variety of sources, especially the not-so-reliable user-entered unstructured data, the data coming from some of these channels has to be consumed in a judicial manner. If an enterprise wants to use this data to generate business, its authenticity has to be verified to an even greater extent.

Big Data and its significant 4V's are shown in a pictorial representation, as follows:

Figure 03: 4V's of Big Data

Figure 03 clearly shows what the 4V's are and what each of these V's means, with adequate bullet points for easy understanding.

Relevance of data


To any enterprise, data is very important. Enterprises have been collecting a good amount of past data and keeping it in a data warehouse for analysis. This proves the importance of data for enterprises for past data analysis and using this for future enterprise growth. In the last decade or so, with the proliferation of social media and myriads of applications (internal to the enterprise and external cloud offerings), the data collected has grown exponentially. This data is increasing in amount as the day goes by, but enterprises are finding it really difficult to make use of these high volumes of diverse data in an effective manner. Data relevance is at the highest for enterprises nowadays as they are now trying to make use of this collected data to transform or energize their existing business.

A business user when fed with these huge amounts of data and right tools can derive real good value. For example, if customer-related data from various applications flows into a place where this data can be analyzed, this data could give a good amount of valuable insights, such as who is the customer who engages with various website pages of the enterprise and how. These derivations can be used as a way in which they can look at either changing their existing business model or tweaking certain business processes to derive maximum profit for the enterprise. For example, looking at various insights from centralized customer data, a new business model can be thought through, say in the form of starting to look at giving loyalty points to such customers. This data can also be made use of, giving more personalized offers closer to customer recommendations. For example, looking at the customer behavior, rather than giving a general offer, more personalized offers suiting the customer's needs could be offered. However, these are fully dependent on the business, and there isn't one approach fitting all the scenarios. These data can, however, be transformed and cleansed to make them more usable for a business user through different data visualizations techniques available as of now in the form of different types of graphs and charts.

Data is relevant, but where exactly this data lives in an enterprise is detailed in the following section.

Vit Soupal (Head of Big Data, Deutsche Telekom AG) in one of his blogs defines these 4V’s of big data as technical parameters and defines another three V’s bringing in business into context. We thought that we would not cover these additional V’s in our book, but these are definitely required for Data lake (Big Data) to be successful in an enterprise.

These additional 3 Vs (business parameters) are as follows:

  • Vision: Every enterprise embarking on Big Data (Data lake) should have a well-defined vision and should also be ready to transform processes to make full use of it. Also, management in the enterprise should fully understand this and should be in a position to make decisions considering its merits.
  • Visualization: Data lake is expected to have a huge amount of data. Some will make a lot of sense and some won't at certain points in time. Data scientists work on these data and derive meaningful deductions, and these need to be communicated in an effective manner to the management. For Big Data to be successful, visualization of meaningful data in various formats is required and mandated.
  • Value: Big Data should be of value to the enterprise. These values could bring about changes in business processes or bring in new innovative solutions (say IoT) and entirely transform the business model.

Vit also gives a very good way of representing these 7 V’s as shown in the following figure:

Figure 04: 7 V's of big data

Figure 04 shows that Big Data becomes successful in an enterprise only if both business and technical attributes are met. 

The preceding figure (Figure 04) conveys that Data lake needs to have a well-defined vision and then a different variety of data flows with different velocity and volume into the lake. The data coming into the lake has different quality attributes (veracity). The data in Data lake requires various kinds of visualization to be really useful to various departments and higher management. These useful visualizations will derive various value to the organization and would also help in making various decisions helpful to the enterprise. Technical attributes in a Data lake are needed for sure (variety, velocity, volume, and veracity), but business attributes/parameters are very much required (vision, visualization, and value), and these make Data lake a success in the enterprise.

Quality of data


There is no doubt that high-quality data (cleansed data) is an irresistible asset to an organization. But in the same way, bad quality or mediocre quality data, if used to make decisions for an enterprise, cannot only be bad for your enterprise but can also tarnish the brand value of your enterprise, which is very hard to get back. The data, in general, becomes not so usable if it is inconsistent, duplicate, ambiguous, and incomplete. Business users wouldn't consider using these data if they do not have a pleasant experience while using these data for various analyzes. That's when we realize the importance of the fourth V, namely veracity.

Quality of data is an assessment of data to ascertain its fit for the purpose in a given context, where it is going to be used. There are various characteristics based on which data quality can be ascertained. Some of which, not in any particular order, are as follows:

  • Correctness/accuracy: This measures the degree to which the collected data describes the real-world entity that's being captured.
  • Completeness: This is measured by counting the attributes captured during the data-capturing process to the expected/defined attributes.
  • Consistency: This is measured by comparing the data captured in multiple systems, converging them, and showing a single picture (single source of truth).
  • Timeliness: This is measured by the ability to provide high-quality data to the right people in the right context at a specified/defined time.
  • Metadata: This is measured by the amount of additional data about captured data. As the term suggests, it is data about data, which is useful for defining or getting more value about the data itself.
  • Data lineage: Keeping track of data across a data life cycle can have immense benefits to the organization. Such traceability of data can provide very interesting business insights to an organization.

There are characteristics/dimensions other than what have been described in the preceding section, which can also determine the quality of data. But this is just detailed in the right amount here so that at least you have this concept clear in the head; these will become clearer as you go through the next chapters in this book.

Where does this data live in an enterprise?


The data in an enterprise lives in different formats in the form of raw data, binaries (images and videos), and so on, and in different application's persistent storage internally within an organization or externally in a private or public cloud. Let's first classify these different types of data. One way to categorize where the data lives is as follows:

  • Intranet (within enterprise)
  • Internet (external to enterprise)

Another way in which data living in an enterprise can be categorized is in the form of different formats in which they exist, as follows:

  • Data stores or persistent stores (RDBMS or NoSQL)
  • Traditional data warehouses (making use of RDBMS, NoSQL etc.)
  • File stores

Now let's get into a bit more detail about these different data categories.

Intranet (within enterprise)

In simple terms, enterprise data that only exists and lives within its own private network falls in the category of intranet.

Various applications within an enterprise exist within the enterprise's own network, and access is denied to others apart from designated employees. Due to this reason, the data captured using these applications lives within an enterprise in a secure and private fashion.

The applications churning out this data can be data of the employees or various transactional data captured while using enterprises in day-to-day applications.

Technologies used to establish intranet for an enterprise include Local Area Network (LAN) and Wide Area Network (WAN). Also, there are multiple application platforms that can be used within an enterprise, enabling intranet culture within the enterprise and its employees. The data could be stored in a structured format in different stores, such as traditional RDBMS and NoSQL databases. In addition to these stores, there lies unstructured data in the form of different file types. Also, most enterprises have traditional data warehouses, where data is cleansed and made ready to be analyzed.

Internet (external to enterprise)

A decade or so ago, most enterprises had their own data centers, and almost all the data would reside in that. But with the evolution of cloud, enterprises are looking to put some data outside their own data center into cloud, with security and controls in place, so that the data is never accessed by unauthorized people. Going the cloud way also takes a good amount of operational costs away from the enterprise, and that is one of the biggest advantages. Let's get into the subcategories in this space in more detail.

Business applications hosted in cloud

With the various options provided by cloud providers, such as SaaS, PaaS, IaaS, and so on, there are ways in which business applications can be hosted in cloud, taking care of all the essential enterprise policies and governance. Because of this, many enterprises have chosen this as a way to host internally developed applications in these cloud providers. Employees use these applications from the cloud and go about doing their day-to-day operations very similar to how they would have for a business application hosted within an enterprise's own data center.

Third–party cloud solutions

With so many companies now providing their applications/services hosted in cloud, enterprises needing them could use these as is and not worry about maintaining and managing on-premises infrastructure. These products, just by being on the cloud, provide enterprises with huge incentives with regard to how they charge for these services.

Due to this benefit, enterprises favorably choose these cloud products, and due to its mere nature, the enterprises now save their data (very specific to their business)in the cloud on someone else infrastructure, with the cloud provider having full control on how these data live in there.

Google BigQuery is one such piece of software, which, as a service product, allows us to export the enterprise data to their cloud, running this software for various kinds of analysis work. The good thing about these products is that after the analysis, we can decide on whether to keep this data for future use or just discard it. Due to the elastic (ability to expand and contract at will, with regard to hardware in this case) nature of cloud, you can very well ask for a big machine if your analysis is complex, and after use, you can just discard or reduce these servers back to their old configuration.

Due to this nature, Google BigQuery calls itself anEnterprise Cloud Data Warehouse, and it does stay true to its promise. It gives speed and scale to enterprises along with the important security, reliability, and availability. It also gives integration with other similar software products again in cloud for various other needs.

Google BigQuery is just one example; there are other similar software available in cloud with varying degrees of features. Enterprises nowadays need to do many things quickly, and they don't want to spend time doing research on this and hosting these in their own infrastructure due to various overheads; these solutions give all they want without much trouble and at a very handy price tag.

The list of such solutions at this stage is ever growing, and I don't think that naming these is required. So we picked BigQuery as an example to explain this very nature.

Similar to software as a service available in the cloud, there are many business applications available in cloud as services. One such example is Salesforce. Basically, Salesforce is a Customer Relationship Management (CRM) solution, but it does have many packaged features in it. It's not a sales pitch, but I just want to give some very important features such business applications in cloud bring to the enterprise. Salesforce brings all the customer information together and allows enterprises to build a customer-centric business model from sales, business analysis, and customer service.

Being in cloud, it also brings many of the features that software as a service in cloud brings.

Because of the ever-increasing impact of cloud on enterprises, a good amount of enterprise data now lives on the Internet (in cloud), obviously taking care of privacy and other common features an enterprise data should comply with to safeguard enterprise’s business objectives.

Social data (structured and unstructured)

Social connection of an enterprise nowadays is quite crucial, and even though enterprise data doesn’t live in social sites, it does have rich information fed by the real customer on enterprise business and its services.

Comments and suggestions on these special sites can indeed be used to reinvent the way enterprises do business and interact with the customers.

Comments in these social sites can damage the reputation and brand of an enterprise if no due diligence in taken on these comments from customers. The enterprise takes these social sites really seriously nowadays, because of which even though it doesn't have enterprise data, it does have customer reviews and comments, which, in a way, show how customer perceive the brand.

Because of this nature, I would like to classify this data also as enterprise data fed in by non-enterprise users. Its very important to take care of the fourth V, namely veracity in big data while analyzing this data as there are people out there who want to use these just as channels to get some undue advantages while dealing with the enterprise in the process of the business.Another way of categorizing enterprise data is by the way the data is finally getting stored. Let's see this categorization in more detail in the following section.

Data stores or persistent stores (RDBMS or NoSQL)

This data, whether on premises (enterprise infrastructure) or in cloud, is stored as structured data in the so-called traditional RDBMS or new generation NoSQL persistent stores. This data comes into these stores through business applications, and most of the data is scattered in nature, and enterprises can easily find a sense of each and every data captured without much trouble. The main issue when data is stored in a traditional RDBMS kind of store is when the amount of data grows beyond an acceptable state. In that situation, the amount of analysis that we can make of the data takes a good amount of effort and time. Because of this, enterprises force themselves to segregate this data into production (data that can be queried and made use of by the business application) and non-production (data that is old and not in the production system, rather moved to a different storage).

Because of this segregation, analysis usually spans a few years and doesn't give enterprises a large span of how the business was dealing with certain business parameters. Say for example, if the production has five years of sales data, and 15 years of sales data is in the non-production storage, the users, when dealing with sales data analysis, just have a view of the last five years of data. There might be trends that are changing every five years, and this can only be known when we do an analysis of 20 years of sales data. Most of the time, because of RDBMS, storing and analyzing huge data is not possible. Even if this is possible, it is time consuming and doesn't give a great deal of flexibility, which an analyst looks for. This renders to the analyst a certain restricted analysis, which can be a big problem if the enterprise is looking into this data for business process tweaks.

The so-called new generation NoSQL (different databases in this space have different capabilities) gives more flexibility on analysis and the amount of data storage. It also gives the kind of performance and other aspects that analysts look for, but it still lacks certain aspects.

Even though the data is stored in an individual business application, it doesn't have a single view from various business application data, and that is what implementing a proper Data lake would bring into the enterprise.

Traditional data warehouse

As explained in the previous section, due to the amount of data captured in production business applications, almost all the time, the data in production is segregated from non-production. The non-production data usually lives in different forms/areas of the enterprise and flows into a different data store (usually RDBMS or NoSQL) called the data warehouse. Usually, the data is cleansed and cut out as required by the data analyst. Cutting out the data again puts a boundary on the type of analysis an analyst can do on the data. In most cases, there should be hidden gems of data that haven’t flown into the data warehouse, which would result in more analysis, using which the enterprises can tweak certain processes; however, since they are cleansed and cut out, this innovative analysis never happens. This aspect is also something that needs correction. The Data lake approach explained in this book allows the analyst to bring in any data captured in the production business application to do any analysis as the case may be.

The way these data warehouses are created today is by employing an ETL (Extract, Transform, Load) from the production database to the data warehouse database. ETL is entrusted with cleaning the data as needed by the analyst who works with these data warehouses for various analyses.

File stores

Business applications are ever changing, and new applications allow the end users to capture data in different formats apart from keying in data (using a keyboard), which are structured in nature.

Another way in which the end users now feed in data is in the form of documents in different formats. Some of the well-known formats are as follows:

  • Different document formats (PDF, DOC, XLS, and so on)
  • Binary formats
    • Image-based formats (JPG, PNG, and so on)
    • Audio formats (MP3, RAM, AC3)
    • Video formats (MP4, MPEG, MKV)

As you saw in the previous sections, dealing with structured data itself is in question, and now we are bringing in the analysis of unstructured data. But analysis of this data is also as important nowadays as structured ones. By implementing Data lake, we could bring in new technologies surrounding this lake, which will allow us to make some good value out of this unstructured data as well, using the latest and greatest technologies in this space.

Apart from various file formats and data living in it, we have many applications that allow end users to capture a huge amount of data in the form of sentences, which also need analysis. To deal with these comments from end users manually is a Herculean task, and in this modern age, we need to decipher the sentences/comments in an automatic fashion and get a view of their sentiment. Again, there are many such technologies available that can make sense of this data (free flowing text) and help enterprises deal with it in the right fashion.

For example, if we do have a suggestion capturing system in place for an enterprise and (let's say) we have close to 1000 suggestions that we get in a day, because of the nature of the business, it's very hard to get into the filtering of these suggestions. Here, we could use technologies aiding in the sentiment analysis of these comments, and according to the rating these analysis tools provide, perform an initial level of filtering and then hand it over to the human who can understand and make use of it.

Enterprise’s current state


As explained briefly in the previous sections, the current state of enterprise data in an organization can be summarized in bullets points as follows:

  • Conventional DW (Data Warehouse) /BI (Business Intelligence):
    • Refined/ cleansed data transferred from production business application using ETL.
    • Data earlier than a certain period would have already been transferred to a storage, which is hard to retrieve, such as magnetic tape storage.
    • Some of its notable deficiencies are as follows:
      • A subset of production data in a cleansed format exists in DW; for any new element in DW, effort has to be made
      • A subset of the data is again in DW, and the rest gets transferred to permanent storage
      • Usually, analysis is really slow, and it is optimized again to perform queries, which are, to an extent, defined
  • Siloed Big Data:
    • Some departments would have taken the right step in building big data. But departments generally don’t collaborate with each other, and this big data becomes siloed and doesn't give the value of a true big data for the enterprise.
    • Some of its deficiencies are as follows:
      • Because of its siloed nature, the analyst is again constrained and not able to mix and match data between departments.
      • A good amount of money would have been spent to build and maintain/manage this and usually over a period of time is not sustainable.
  • Myriad of non-connected applications:
    • There is a good amount of applications on premises and on cloud.
    • Applications apart from churning structured data also produce unstructured data.
    • Some of the deficiencies are as follows:
      • Don't talk to each other
      • Even if it talks, data scientists are not able to use it in an effective way to transform the enterprise in a meaningful way
      • Replication of technology usage for handling many aspects in each business application

We wouldn't say that creating or investing in Data lake is a silver bullet to solve all the aforementioned deficiencies. But it is definitely a step in the right direction, and every enterprise should at least spend some time discussing whether this is indeed required, and if it is a yes, don't deliberate over it too much and take the next step in the path of implementation.

Data lake is an enterprise initiative, and when built, it has to be with the consent of all the stakeholders, and it should have buy-ins from the top executives. It can definitely find ways to improve processes by which enterprises do business. It can help the higher management know more about their business and can increase the success rate of the decision-making process.

Enterprise digital transformation


Digital transformation is the application of digital technologies to fundamentally impact all aspects of business and society.

- infoworld.com

Digital transformation (DX) is an industry buzzword and a very strong initiative that every enterprise is taking without much deliberation. As the word suggests, it refers to transforming enterprises with information technology as one of its core pillars. Investing in technologies would definitely happen as part of this initiative, but data is one of the key aspects in achieving the so-called transformation.

Enterprises has known the importance of data and its analysis more and more in recent times, and that has definitely made every enterprise think out-of-the-box; this initiative is a way to establish data at the center.

As part of this business transformation, enterprises should definitely have Data lake as one of the core investments, with every department agreeing to share their data to flow into this Data lake, without much prejudice or pride.

Enterprises embarking on this journey

A Forrester Consulting research study commissioned by Accenture Interactive found that the key drivers of digital transformation are profitability, customer satisfaction, and increased speed-to-market.

Many enterprises are, in fact, already on the path of digital transformation. It is no more a buzzword, and enterprises are indeed making every effort to transform themselves using technology as one of the drivers and, you guessed it right, the other one being data.

Enterprises taking this path have clearly defined objectives. Obviously, this changes according to the enterprise and the business they are in. But some of the common ones, in no particular order, are as follows:

  • Radically improve customer experience
  • Reduce cost
  • Increase in revenue
  • Bring in differentiation from competitors
  • Tweak business processes and, in turn, change the business model

Some examples

There are a number of clear examples about what enterprises want to achieve in this space, some of which are as follows:

  • Ability to segment customers and give them personalized products. Targeting campaigns to the right person at the right time.
  • Bringing in more technologies and reducing manual work, basically digitizing many aspects in the enterprise.
  • Using social information clubbed together with enterprise data to make some important decisions.
  • Predicting the future in a more quantitative fashion and taking necessary steps and also preparing accordingly, well in advance, obviously.
  • Taking business global using technology as an important vehicle.

The next section details one of the use cases that enterprises want to achieve as part of digital transformation, with data as the main contributor. Understanding the use case is important as this is the use case we will try to implement in this book throughout.

Data lake use case enlightenment


We saw the importance of data in an enterprise. What enterprises face today is how to mine this data for information that can be used in favor of the business.

Even if we are able to bring this data into one place somehow, it's quite difficult to deal with this huge quantity of data and that too in a reasonable time. This is when the significance of Data lake comes into the picture. The next chapter details, in a holistic fashion, what Data lake is. Before getting there, let's detail the use case that we are trying to achieve throughout this book, with Data lake taking the center stage.

Data lake implementation using modern technologies would bring in many benefits, some of which are given as follows:

  • Ability for business users, using various analyzes, to find various important aspects in the business with regard to people, processes, and also a good insight into various customers
  • Allowing the business to do these analytics in a modest time frame rather than waiting for weeks or months
  • Performance and quickness of data analysis in the hands of business users to quickly tweak business processes

The use case that we will be covering throughout this book is called Single Customer View. Single Customer View (SCV) is a well-known term in the industry, and so it has quite a few definitions, one of which is as follows:

A Single Customer View is an aggregated, consistent and holistic representation of the data known by an organisation about its customers.

- Wikipedia

Enterprises keeps customer data in varying degrees siloed in different business applications. The use case aims at collating these varying degrees of data from these business applications into one and helping the analysts looking at this data create a single customer view with all the distinct data collected. This single view brings in the capability of segmenting customers and helping the business to target the right customers with the right content.

The significance of this use case for the enterprise can be narrowed down to points as listed next:

  • Customer segmentation
  • Collating information
  • Improving customer relations and, in turn, bringing is retention
  • Deeper analytics/insight, and so on

Conceptually, the following figure (Figure 05) summarizes the use case that we plan to implement throughout this book. Structured, semi-structured, and unstructured data is fed into the Data lake. From the Data lake, the Single Customer View (SCV) is derived in a holistic fashion. The various data examples are also depicted in each category, which we will implement in this book. Doing so gives a full use of a Data lake in an enterprise and is more realistic:

Figure 05: Conceptual view of Data lake use case for SCV

Figure 05 shows that our Data lake acquires data from various sources (variety), has different velocities and volumes. This is more a conceptual high-level view of what we will be achieving after going through the whole book.

We are really excited, and we hope you are, too!

Summary


In this chapter, we delved deep into some of the common terminologies in the industry. We started the chapter by understanding data, enterprise data, and all important big data. We then dealt with the relevance of data, including various data quality attributes. After that, we went into the details of the different types of data classification and where the data lives in an enterprise.

In the sections that followed, we dealt with digital transformation and finished the chapter with the use case that we will be implementing throughout this book in more detail.

After completing this chapter, you will have clearly grasped many terminologies and will also have a good understanding of the significance of data and Data lake to an enterprise. Along with that, you will have a very good idea about the use case and its relevance.

In the next chapter, we will delve deep into Data lake and also the pattern that we will use in implementing Data lake in more detail.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Build a full-fledged data lake for your organization with popular big data technologies using the Lambda architecture as the base
  • Delve into the big data technologies required to meet modern day business strategies
  • A highly practical guide to implementing enterprise data lakes with lots of examples and real-world use-cases

Description

The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects — data lake and lambda architecture—together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use-cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake.

What you will learn

Build an enterprise-level data lake using the relevant big data technologies Understand the core of the Lambda architecture and how to apply it in an enterprise Learn the technical details around Sqoop and its functionalities Integrate Kafka with Hadoop components to acquire enterprise data Use flume with streaming technologies for stream-based processing Understand stream- based processing with reference to Apache Spark Streaming Incorporate Hadoop components and know the advantages they provide for enterprise data lakes Build fast, streaming, and high-performance applications using ElasticSearch Make your data ingestion process consistent across various data formats with configurability Process your data to derive intelligence using machine learning algorithms

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Black & white paperback book shipped to your address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : May 31, 2017
Length 596 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781787281349
Vendor :
Apache
Category :

Table of Contents

23 Chapters
Title Page Chevron down icon Chevron up icon
Credits Chevron down icon Chevron up icon
Foreword Chevron down icon Chevron up icon
About the Authors Chevron down icon Chevron up icon
About the Reviewers Chevron down icon Chevron up icon
www.PacktPub.com Chevron down icon Chevron up icon
Customer Feedback Chevron down icon Chevron up icon
Preface Chevron down icon Chevron up icon
Part 1 - Overview Chevron down icon Chevron up icon
Part 2 - Technical Building blocks of Data Lake Chevron down icon Chevron up icon
Part 3 - Bringing It All Together Chevron down icon Chevron up icon
Introduction to Data Chevron down icon Chevron up icon
Comprehensive Concepts of a Data Lake Chevron down icon Chevron up icon
Lambda Architecture as a Pattern for Data Lake Chevron down icon Chevron up icon
Applied Lambda for Data Lake Chevron down icon Chevron up icon
Data Acquisition of Batch Data using Apache Sqoop Chevron down icon Chevron up icon
Data Acquisition of Stream Data using Apache Flume Chevron down icon Chevron up icon
Messaging Layer using Apache Kafka Chevron down icon Chevron up icon
Data Processing using Apache Flink Chevron down icon Chevron up icon
Data Store Using Apache Hadoop Chevron down icon Chevron up icon
Indexed Data Store using Elasticsearch Chevron down icon Chevron up icon
Data Lake Components Working Together Chevron down icon Chevron up icon
Data Lake Use Case Suggestions Chevron down icon Chevron up icon

Customer reviews

Filter icon Filter
Top Reviews
Rating distribution
Empty star icon Empty star icon Empty star icon Empty star icon Empty star icon 0
(0 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Filter reviews by


No reviews found
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela